* fp16 resnet
* cast running mean and var back to default float
* extra cast
* check symbolic no overflow
* add linearizer failure
* loss scaler after grad contig
* oops
* i think this works
* don't loss scale fp32
* remove overflow test case
* remove symbolic bounds check
* loss scaler should be float
* temporarily disable padto cuz bug
shruggie
* make running stats in batchnorm float32?
* calculate lars stuff in fp32?
* oops
* remove most changes
* move loss scaler out of optimizer
* no more FP16 var
* oops
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* wmma: refactor to remove wmma_func and create TC funcs as needed
* test_linearizer: disable bf16 CUDA during emulation testing
* cstyle: clean up creation of CUDA vec dtypes
* extra/gemm: add option to accumulate to bfloat16
* cleanups
* benchmark: add CUDA bfloat16 matmul
* more cleanups
* fuzz_linearizer: reduce debug verbosity and make easier for CI usage
* rename FUZZ_BEAM to FUZZ_ALL_ACTIONS (not choosing a subset)
* skip simple ASTs (easier to use with LOGOPS output)
* don't fuzz a previously seen AST
* add options to allow non-zero --expected-failures
* clean up naming and use set
* Track pointer provenance in load/store through ALU
Previously load/store could be incorrectly rendered into
ld.global/st.global when the input was an ALU op that performed an
address computation with DEFINE_LOCAL on one of the arguments.
* Simplify the load provenance workaround
The issue is that we can render the same code twice, and on the second
run the opstream is already modified so that vin[0] isn't a DEFINE_*,
which overwrites initially correct .shared wth .global.
* Add a couple tests for basic local use
* Skip local tests on LLVM since it doesn't implement DEFINE_LOCAL
* Fix sm89 PTX=1 compilation
The minimum PTX version that supports sm89 is 7.8 (same version also
supports sm90); without this ptxas fails when running tinygrad with
PTX=1 on RTX 4090.
* Use int(arch[3:]) for forward compat with SM10.0 if that happens
* env var to change default float to fp16 or bf16
looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights.
working on default bf16 too.
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
__bf16 cast0 = (nv_bfloat16)(val0);
```
remove that in cifar
* DEFAULT_FLOAT
* default of default
* unit test
* don't check default
* tests work on linux
* infra
* track mutations
* assign levels
* add seen back
* add test
* infra 2.0
* add assign targets
* dont need levels
* delete
* Update test_assign.py
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>