* remove ExecItem and merge it with ScheduleItem
* less diff
* fix issues
* min diff
* don't change bufs in _lower
* min diff
* update
* revert
* fixes
* diff
* nak works
* TestOps::test_add works
* testop has no crashes
* fix bool casts
* fix typo
* add disassemble
* RANGE and locals/regs
* simplify NAKCompiler
* disass cleanup
* cleanup nir codegen
* almost all tests passing
* cleanup notes in extra/
* old notes
* only import nak if NIR=1
* fix new SPECIAL syntax
* fix local/shared memory
* more tests passing
* add DEFINE_VAR support
* llvmpipe kinda works
* diskcache
* some mypy stuff
* lvp passing test_ops.py
* fix imports
* actually fix imports
* remove 'stdout'
* fix llvm import
* fix mypy issues
* nicer errors
* simpler test_dtype skips
* test lvp in CI
* fix github action syntax
* fix more actions typos
* switch to mesa 25.1.0
* diskcache_put
* better generation for lvp nir_options
* b64encode shader blobs
* Revert diskcache changes
This reverts commits 930fa3de8a and 8428c694b3.
* general cleanup
* better error messages
* fix llvm import
* fix windows tests
* link with libm and libgcc_s
* fix some errors
* don't check for 'float4'
* NIR uses pointer arithmetic
* use tinymesa
* bump tinymesa
* bump tinymesa again
* update lvp nir_options
* print nir shader with DEBUG
* simplify LVPCompiler
* more tests
* "gated" STORE
* NAK is cacheable
* more tests
* all tests pass locally for NAK
* test autogen in CI
* autogen deps
* more deps
* fix uop_gc
* fix macos
* mypy
* save 2 lines
* save two more lines
* save 1 line
* save 4 lines
* save more lines
* Revert "save more lines"
This reverts commit dd3a720c5a.
* save more lines
* fix LVP on windows
* refactor
* reorganize some code
* refactor lib_gpu
* move LVP check
* out of order loads
* remove support.mesa
* bump tinymesa version
* simplify LVP jit
* macos
* macos ci
* shell: bash
* testing
* more testing
* compute brew prefix
* stupid typo
* actually fix
* lib
* stdout on macos
* inline gallivm_compile_module
* Revert "inline gallivm_compile_module"
This reverts commit b65983b151.
* elf macos
* semicolon
* inherit from CPULLVMCompiler
* ruff
* disas test
* fix libm linking
* default is fine actually
* arm works
* add elf loader link test
* fix NAK beam
* pylint is too smart by half
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* test case for a long rand chain
currently failing with RANGEIFY because device propagates too deep
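A minimal sketch of the kind of test this describes, assuming nothing beyond the public Tensor API; the chain length and shape below are illustrative, not the values from the actual test:

```python
from tinygrad import Tensor, Device

# illustrative repro sketch: a long chain of dependent rand ops, so resolving
# .device has to consider many nodes; it should stay cheap, not O(n^2)
x = Tensor.rand(16)
for _ in range(200):
  x = x + Tensor.rand(16)
assert x.device == Device.DEFAULT
x.realize()
```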
* skip
* ops: n^2 .device property fix
* unskip
---------
Co-authored-by: Chen-Yu Yang <chenyu@fastmail.com>
* remove del spam from CI
* more
* preconstruct default buffer spec
* ignore those errors
* check exception
* more exception check
* skip stuff
* smaller tests mean faster tests
* a few more
* failed test case for threefry
not sure if it's always like this, but incrementing before _threefry_random_bits is incorrect: the counts should start at the number of random values generated so far.
use jax to generate 20 + 20 + 10 random numbers: the first 20 + 20 match and the last 10 are different. just moving the increment after _threefry_random_bits matches the numbers, but then the jit test fails
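A minimal sketch of the counter bookkeeping described above, with a hypothetical stand-in for _threefry_random_bits (the hash-based generator below is not tinygrad's; it only illustrates the ordering):

```python
counter = 0  # number of random values generated so far

def random_bits(start_count: int, n: int) -> list[int]:
  # stand-in for _threefry_random_bits: value i depends only on (seed, start_count + i)
  return [hash((42, start_count + i)) & 0xffffffff for i in range(n)]

def rand(n: int) -> list[int]:
  global counter
  out = random_bits(counter, n)  # counts start at the values generated so far...
  counter += n                   # ...and the counter advances only afterwards
  return out

# batching must not matter: 20 + 20 + 10 values equal one batch of 50
a = rand(20) + rand(20) + rand(10)
counter = 0
b = rand(50)
assert a == b
```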
* workaround
* why is this different?
* revert those
* and that
* WebGPU f16 support
* Don't enable f16 yet
* dtype tests passing after bitcast fix
* Maybe all WebGPU green?
* Require shader-f16 in examples
* Minor wgsl touchup
* 1 line shorter
* Simpler
* Add transcendental support
* log2 nan location mismatch on Vulkan
* Nan skips
* init mockcuda
* run gpu ocelot
* fix
* fixes
* disable broken tests
* linter
* these fails as well
* pylint
* mypy
* this fails on real platforms as well
* mypy please
* add tiny test for randomness
* Tensor._device_seeds is a Tuple
* no tuple, just a 2 element tensor
* no more longs
* fix tests, and maybe ocelot works now
* NV still doesn't work. cleanup rules
* test + two more rules
* always use f32 for source of randn
fixed bfloat16 randn to not have inf.
don't really care about float64. threefry is float32 based too
* HSA is broken
* test case Tensor.randn should be finite
there's a hack to fix float16, need a generic solution that works with bf16 and threefry
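A minimal sketch of what such a finiteness test might look like; the dtype list and size are illustrative, bfloat16 may not be available on every backend, and the float() cast is only there so numpy can read the result:

```python
import numpy as np
from tinygrad import Tensor, dtypes

# randn should stay finite for low-precision dtypes because the uniform
# source is drawn in float32 and only cast to the target dtype afterwards
for dt in (dtypes.float32, dtypes.float16, dtypes.bfloat16):
  x = Tensor.randn(4096, dtype=dt)
  assert np.isfinite(x.float().numpy()).all(), dt
```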
* skip not supported
* bfloat16 local is wrong
* skip RHIP
* feat: initial xor
* feat: initial threefry
* feat: remove custom random
* fix: really need to install precommit
* feat: lmao forgot that this is a rotate, not a shift
* clean: put that there
* feat: numpy xor
* feat: quick test for xor
* feat: llvm xor
* feat: slightly working xor in torch
* feat: rand works in jit
* clean: save a line
* feat: match jax
* feat: maybe test against jax
* feat: requires_grad
* fix: fix test_symbolic_ops
* feat: lower alpha
* feat: just pad
* fix: maybe fix training tests?
* fix: fix some llvm stuff
* feat: cursed realize on the way out
* feat: testing jax
* fix: why is the jax install process not simple
* fix: maybe passing test
* fix: symbolic workarounds
* clean: still need that precommit
* fix: aaaa
* fix: more test fixes
* fix: quick fix for wgsl
* feat: need to set requires_grad on the final tensor
* feat: one more tensor
* feat: don't take forever
* feat: seeing why ci is broken
* feat: can't allocate 64GiB lmao
* fix: fix this
* feat: hope this doesn't break smth before i go to bed
* feat: don't destroy ram
* feat: int
* feat: remove jax
* feat: properish workaround?
* feat: skip slow webgpu tests
* feat: no longer fails
* feat: use dtypes
* feat: real number
* fix: torch
* fix: don't test against reference for torch
* feat: to device
* feat: fix advanced indexing
* feat: correct casting
* feat: even rng_counter
* feat: match master
* feat: this was actually bad
* fix: maybe?
* feat: store
* feat: remove realizes
* feat: somehow this is important
* feat: somehow this is also important
* feat: save a line
* fix: don't need that anymore
* feat: restore this
* fix: linter
* feat: remove realizes
* fix: realized is in base now
* fix: add back cast
* fix: bump deadline
* fix: bump deadline
* fix: bump deadline
* fix: bump deadline
* fix: bump deadline
* fix: :(
* fix: :(
* fix: not being dumb
* feat: try changing less tests
* feat: shouldn't have to change that
* feat: contiguous bumps it by one
* fix: hmm
* fix: numpy memory moment
* fix: cl_khr_fp16
* fix: torch has different tensor count
* fix: missing contiguous
* hmm: hmm
* fix: some fixes
* fix: typing
* feat: don't do that
* feat: typing fixes
* feat: why is this realize required?
* feat: ngl kinda odd typing
* feat: oh
* feat: remove realizes
* feat: why is this realize required?
* fix: hacky patch for cudacpu
* fix: without this realize pytest crashes?????
* fix: shorter line
* fix: cudacpu fixes
* fix: cudacpu fixes
* feat: real buffer
* feat: don't search when searching lmao
* fix: can't use contiguous things
* fix: no more 100GB arrays
* fix: revert
* fix: skip 7 and 10
* feat: working ish beam
* feat: minimize changes
* feat: seed 0 stable diffusion example changed
* fix: different on ci
* fix: no beam
* feat: make threefry optional
* fix: check value
* fix: unused import
* feat: threefry default
* fix: 5d
* feat: allow non upcast div
* fix: 5d better
* fix: 5d better
* fix: save all dtype
* feat: proper error
* feat: lazyop key
* fix: check float
* feat: try removing this realize now
* feat: disable threefry for uops hip tensor cores
* feat: don't need that
* feat: only check upcast
* fix: disable threefry for some metal tests
* feat: disable for metal tensor uops as well
* feat: disable for most uops
* fix: disable threefry for new uops tests
* feat: multitensor
* fix: typing
* feat: threefry default off
* feat: skip threefry half rand
* feat: restore old
* fix: bad git
* clean: ruff
* feat: bfloat16 fix
* fix: :|
* feat: restore old
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* fix: make Tensor.rand produce correct values for float16
Due to precision loss when casting to float16, the data distribution created by custom_random isn't correctly in the interval ]0, 1[ but instead in ]0, 1], which causes Tensor.randn to incorrectly generate infinite values.
The solution uses a scaling value to make sure the values stay under 1 when using half precision.
Closes #3611
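A small numeric illustration of the interval problem, using numpy only; the Box-Muller-style expression at the end is a generic example of how a uniform sample of exactly 1.0 turns into inf, not necessarily the exact randn kernel:

```python
import numpy as np

print(np.float16(1 - 2**-11))  # 0.9995: the closest float16 below 1
u = np.float16(1 - 2**-20)     # a sample from ]0, 1[ ...
print(u == 1.0)                # ...True: it rounds up to 1.0, so the distribution becomes ]0, 1]
# with u == 1.0, log(1 - u) is log(0), so a Box-Muller-style randn yields inf
print(np.sqrt(-2 * np.log(1 - np.float32(u))))  # inf (plus a divide-by-zero warning)
```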
* update implementation to truncate to closest f16 value to 1
* chore: fix whitespace
* test larger distribution
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* ww/Fixed Tensor.randint() to accept shape tuples ()
* ww/Wrote a test to cover this typo
* ww/Updated Tensor random objects to optionally take (,) or *() to be more consistent (usage sketch at the end of this list)
* ww/no lint no worries
* ww/Made peace with linter
* ww/Added new line, can't reduce line size without reducing readability
* ww/reverted to using .mul
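Hedged usage sketch for the shape-tuple change referenced above; it assumes the current Tensor.randint signature with low/high keyword arguments, and the shapes/bounds are arbitrary:

```python
from tinygrad import Tensor

a = Tensor.randint(2, 3, low=0, high=10)    # shape passed as *args
b = Tensor.randint((2, 3), low=0, high=10)  # shape passed as a tuple
assert a.shape == b.shape == (2, 3)
```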