Francis Lam
7c5729a3bd
wmma: refactor to remove wmma_func and create TC funcs as needed (#3945)
* wmma: refactor to remove wmma_func and create TC funcs as needed
* test_linearizer: disable bf16 CUDA during emulation testing
* cstyle: clean up creation of CUDA vec dtypes
* extra/gemm: add option to accumulate to bfloat16
* cleanups
* benchmark: add CUDA bfloat16 matmul
* more cleanups
2024-03-27 16:43:09 -04:00
Francis Lam
a26090d404
search: change to use "spawn" and limit the number of tasks per child (#3862)
also clean up some examples to use __main__ and not initialize
resources outside of main
2024-03-21 21:23:36 -07:00
Francis Lam
ddbdb52f77
wmma: enable METAL half tensor cores and clean up cstyle (#3095)
* wmma: enable METAL half tensor cores and clean up cstyle
* revert simple_matmul rand changes and break line in tensor
* added metal fp16->fp32 tensor core
2024-01-12 16:25:28 -05:00
chenyu
1d730b8853
remove ACCUM_FP32 in simple_matmul.py (#3045)
* remove ACCUM_FP32 in simple_matmul.py
accumulate for half inputs is always in float
* move test llama compile speed to metal
2024-01-08 17:37:57 -05:00
George Hotz
a280cfe169
move dtypes to dtype.py (#2964)
* move dtypes to dtype.py
* fix urllib
2024-01-01 14:58:48 -08:00
George Hotz
b5fd160b39
hotfix: increase rtol on simple_matmul
2023-12-11 10:10:29 -08:00
Francis Lam
dece9958f8
wmma: clean up to make WMMA arg order consistent (#2014)
also add cache defeat to extra/gemm/simple_matmul.py
2023-10-07 17:45:40 -07:00
Francis Lam
f445e056ed
wmma: add test and tensor core shape (#1925)
2023-09-28 18:04:28 -07:00
George Hotz
e464442adf
WMMA for 7900XTX (#1563)
* go
* hip no LRU
* work
* works
* 16 TFLOPS
* 29 TFLOPS
* 30 TFLOPS
* never mind, it's 60 TFLOPS
* fix metal WMMA
* put hip alloc back
2023-08-19 09:07:23 -07:00
George Hotz
90fff82c8a
Rdna (#776)
* assembler maybe
* custom asm
* rdna3 on quiet
* trigger crashes
* fixed notes
* non-fatal rdna2 crash
* Crash4
* improve rdna sniffer
* comments
* improve sniffer
* asm
* 131 TFLOPS RDNA3
* opt simple matmul
* todos
2023-05-16 05:33:57 -07:00