* Fix sm89 PTX=1 compilation
The minimum PTX version that supports sm89 is 7.8 (the same version also
supports sm90); without it, ptxas fails when running tinygrad with
PTX=1 on an RTX 4090.
* Use int(arch[3:]) for forward compat with SM10.0 if that happens
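A minimal sketch of the version selection described above; the helper name and the fallback version are illustrative assumptions, not the actual change:

```python
# illustrative helper: pick the PTX ISA version to emit for a given SM arch string
def ptx_version_for(arch: str) -> str:
  cc = int(arch[3:])  # "sm_89" -> 89; int() keeps this working for a future "sm_100"
  # PTX ISA 7.8 is the first version that knows about sm_89 (and sm_90);
  # the 7.5 fallback for older archs is an assumption for this sketch
  return "7.8" if cc >= 89 else "7.5"

assert ptx_version_for("sm_89") == "7.8"  # RTX 4090
assert ptx_version_for("sm_86") == "7.5"
```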
* env var to change default float to fp16 or bf16
Looking for standard names for these; we already have FLOAT16, which does something to IMAGE, and HALF, which converts the weights.
Also working on a bf16 default (see the env-var sketch below); it currently fails to compile:
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
__bf16 cast0 = (nv_bfloat16)(val0);
```
remove that in cifar
* DEFAULT_FLOAT
* default of default
* unit test
* don't check default
* tests work on linux
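A rough sketch of how a DEFAULT_FLOAT env var could pick the default dtype; it assumes tinygrad exposes dtypes.default_float and the listed float dtypes, and the exact lookup and validation in the real change may differ:

```python
import os
from tinygrad import dtypes  # assumes dtypes exposes float32/half/bfloat16 and default_float

# map the env var value to a dtype; anything outside this table is rejected with a KeyError
_FLOATS = {"FLOAT32": dtypes.float32, "HALF": dtypes.half, "BFLOAT16": dtypes.bfloat16}
dtypes.default_float = _FLOATS[os.getenv("DEFAULT_FLOAT", "FLOAT32")]
```

With something like this, running training with e.g. DEFAULT_FLOAT=HALF would make newly created float tensors half by default.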
* infra
* track mutations
* assign levels
* add seen back
* add test
* infra 2.0
* add assign targets
* dont need levels
* delete
* Update test_assign.py
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* Adjust adds between WHERE and PHI
* Not much better
* undo recursive change
* hm
* iterate over where, not factored op
* oo
* consts only for loop
* Undo var name change
* update
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
* training cifar with BF16 on CUDA
memory usage is between float and half because the numpy calls in dataset preprocessing convert the data to float.
* simpler bf16 functions
* bf16 cifar works for HSA too, just very slow
* simpler bf16 functions, we love cuda
Previously it was incorrectly aliasing 16 into the size-8 upcast on the
store alias; now it splits it properly into 8 and puts the remaining 2
into the correct local stride.
* Fix permutation of result indices in einsum (quick check below).
* Delete stray line used for breaking tests
* Fix linter error by renaming twice-used variable
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
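A quick check of the behavior the einsum fix above is about, assuming Tensor.einsum is available: with output subscripts "ki" the result axes must come back permuted, i.e. equal to (a @ b).T.

```python
import numpy as np
from tinygrad import Tensor

a = Tensor(np.random.rand(2, 3).astype(np.float32))
b = Tensor(np.random.rand(3, 4).astype(np.float32))
# "->ki" asks for the transposed result, so this must match (a @ b).T
out = Tensor.einsum("ij,jk->ki", a, b).numpy()
np.testing.assert_allclose(out, (a.numpy() @ b.numpy()).T, rtol=1e-5)
```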
* initialize Tensor grad with the same dtype as self (minimal check below)
* also test different default float
* check dtype + try/finally
* don't test_gradient_dtype if f16 is not supported
* fix bad merge
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
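A minimal check of the behavior described above (the gradient is created with the tensor's own dtype), assuming a backend with float16 support:

```python
from tinygrad import Tensor, dtypes

x = Tensor([1.0, 2.0, 3.0], dtype=dtypes.half, requires_grad=True)
x.sum().backward()
assert x.grad.dtype == dtypes.half  # grad has the same dtype as x, not the default float
```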
* debug: add optional detailed BEAM_LOG logging
show uop count, compile and run times for each candidate in search
also add --timing to verify_kernel.py to make it easier to explore
hand-crafted applied opts
* fix linter
copy scale to all devices for now (sketch below); naive sharding does not work because scale needs an expand to really save memory.
70B does not work due to HSA_STATUS_ERROR_OUT_OF_RESOURCES.
`python3 examples/llama.py --gen 2 --size 13B --shard 6 --prompt "Hello." --count 10 --temperature 0 --timing --quantize`
13B on 6 GPUs uses 47 GB vs. 34 GB quantized
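A hedged sketch of the scale handling described above, assuming Tensor.shard where axis=None copies the tensor to every device; the shapes, dtypes, and device list are illustrative:

```python
from tinygrad import Tensor, dtypes

GPUS = tuple(f"HSA:{i}" for i in range(6))                        # illustrative device list
w_int8 = Tensor.ones(4096, 4096, dtype=dtypes.int8).contiguous()  # stand-in for a quantized weight
scale = Tensor.ones(4096)                                         # per-channel dequant scale

w_int8 = w_int8.shard(GPUS, axis=0)   # the int8 weight is actually split across devices
scale = scale.shard(GPUS, axis=None)  # axis=None just copies scale to all devices for now
```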
* init multidevice cuda graph
* cuda just works!
* clean
* linter happier
* linters happy
* update transfer inputs
* do not change free
* useless check for cuda
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* diverse test values in test_dtype DATA based on dtype (sketch below)
* eh fix typo
* that too?
* PTX does not support i8 and s8
* skip that
* unused line
* put the hack back
* remove that
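A hedged sketch of what picking test DATA per dtype could look like; the helper and the exact values are illustrative, not the test's actual contents:

```python
from tinygrad import dtypes

def data_for(dtype):
  # keep values inside the dtype's representable range so casts stay exact
  if dtype == dtypes.bool: return [True, False]
  if dtypes.is_unsigned(dtype): return [0, 1, 2, 127]
  if dtypes.is_int(dtype): return [-128, -1, 0, 1, 127]
  return [-3.5, -1.0, 0.0, 0.5, 3.5]  # floats, exactly representable in half/bfloat16
```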
* ptx float4 implementation
* remove from cache when trimming uops
* Gate for float4
* Linting fix
* disable test reasonable time for ptx
* import getenv
* Update uops.py
* linter
* Add div test for half
* upcast if op does not support operation
* fix offset
* Run only if dtype supported
* zero out registers when accessing by pred + cleanup
* Remove trailing whitespace
* revert
* spacing fix
* move cache clearing outside loop
* did this suddenly start working?
* unused import removed
* Remove cast
* Use pattern matching
* linting
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>