For reasons I don't understand, without this change, trying to run "Llama 3.2 1B" produces the error below. FWIW, I don't know what performance impact this change has. I still can't get exo running, but this change lets me get further before hitting a separate issue that looks like a VRAM allocation problem (a story for another day, I suppose).
Error:
```
Failed to fetch completions: Error processing prompt (see logs with DEBUG>=2): Nvrtc Error 6, NVRTC_ERROR_COMPILATION
<null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
*((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
^
<null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
*((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
^
<null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
*((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
^
<null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
*((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
^
4 errors detected in the compilation of "<null>".
```
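The compiler reports an ambiguity because it finds several `__half` constructors that could each accept an `nv_bfloat16` through an implicit conversion. One common way around this kind of error is to cast through `float` explicitly. The sketch below illustrates that idea only: `render_cast` and its dtype-name strings are hypothetical, made up for this example, and are not tinygrad's renderer API or necessarily what this change does.

```python
def render_cast(expr: str, src: str, dst: str) -> str:
    """Return a C-style cast expression (as text) for generated CUDA source."""
    # Hypothetical rule: route bfloat16 -> half through float so that only
    # the __half(float) constructor is a candidate.
    if src == "nv_bfloat16" and dst == "half":
        return f"({dst})((float)({expr}))"
    # Every other pair keeps the plain single cast.
    return f"({dst})({expr})"


print(render_cast("val0", "nv_bfloat16", "half"))
print(render_cast("val0", "float", "half"))
```

With a rule like this, the failing store would be emitted with `(half)((float)(val0))` instead of `(half)(val0)`.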
* validate that FC exists before loading pretrained weights
* add test case for ResNet pretrained model without FC layer
* remove extra newline
* rename test case
* reraise exception if not handled by check
* switch symbolic from old to uops, final PR
* two wrong answers
* not needed resolves
* symbolic ops passes
* symbolic ops passes
* progress
* tests pass (almost)
* fix last test
* fix some tests
* global binding and unbinding
* Revert "global binding and unbinding"
This reverts commit 9456725630.
* that test works now
* vars on uop doesn't recurse
* fix fuzzer
* update
* fix type
* fix gpt, it's UOp now
* ssimplify symbolics
removed DISABLE_DROPOUT=1.
updated BS to 54, which works on tinyboxes with dropout enabled.
used bert's sparse_categorical_crossentropy, which takes a Tensor ignore_index, in the accuracy method
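As a rough illustration of what an `ignore_index` does in a sparse categorical cross-entropy, here is a plain-Python sketch on lists. It is not the tinygrad/bert implementation; the function signature, the default of `-1`, and averaging over only the kept positions are assumptions made for this example.

```python
import math
from typing import List


def sparse_categorical_crossentropy(
    logits: List[List[float]], labels: List[int], ignore_index: int = -1
) -> float:
    """Mean negative log-likelihood over positions whose label != ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # ignored positions contribute nothing to the loss
        # numerically stable log-sum-exp of the row
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[label]  # -log softmax(row)[label]
        count += 1
    # assumption: return 0.0 when every position is ignored
    return total / count if count else 0.0


loss = sparse_categorical_crossentropy([[0.0, 0.0], [5.0, 1.0]], [0, -1])
print(loss)
```

Only the first row contributes here, since the second label equals `ignore_index`.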
* added model impl
* minor cleanups
* extracted weights loading into from_pretrained
* reorganized model for better weight loading
* removed lru cache for state dict loading
* start writing openelm
* progress...hit bug
* repeat_interleave support
* gqa
* add rotary embedding
* spp
* i think it runs correctly
* broken
* output is good now
* cleanups
* no io_uring on android
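For the `repeat_interleave support` and `gqa` entries in the list above: grouped-query attention shares each key/value head across several query heads, and one way to line the heads up is to repeat each key/value head in place. The sketch below shows the idea on plain Python lists; the head counts and names are hypothetical and this is not the openelm code from the commit.

```python
from typing import List, TypeVar

T = TypeVar("T")


def repeat_interleave(items: List[T], repeats: int) -> List[T]:
    """Repeat each element `repeats` times, keeping equal elements adjacent."""
    return [item for item in items for _ in range(repeats)]


# Hypothetical configuration: 8 query heads sharing 2 key/value heads.
n_heads, n_kv_heads = 8, 2
kv_heads = ["kv0", "kv1"]

# Each key/value head is repeated n_heads // n_kv_heads times so that
# query head i can be paired with expanded[i].
expanded = repeat_interleave(kv_heads, n_heads // n_kv_heads)
print(expanded)
```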
replaced all dtype.np with _to_np_dtype, defined in tensor.py.
after this, the only numpy usages are (1) Tensor(np.ndarray), (2) constructing the .numpy() output, and (3) the numpy random buffer