George Hotz
8cbef912d2
move reshape to MathTraits (#13054)
* move reshape to MathTraits
* confirm it works in amd_uop_matmul
2025-11-02 12:56:15 +08:00
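For orientation: reshape is the user-facing movement op this refactor relocates into the shared MathTraits mixin, presumably so the same method works on UOps as well (the commit body confirms it in amd_uop_matmul). A minimal Tensor-level illustration, nothing specific to the new code path:

```python
from tinygrad import Tensor

# reshape reinterprets the same six elements as a 2x3 matrix; tinygrad is
# lazy, so this is pure index bookkeeping, not a copy
x = Tensor.arange(6).reshape(2, 3)
print(x.numpy())  # [[0 1 2] [3 4 5]]
```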
George Hotz
267be7fc5e
fp16 acc
2025-11-02 12:53:04 +08:00
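Context for the accumulator dtype change: fp16 accumulation trades precision for speed, and the precision loss is easy to demonstrate outside tinygrad with a plain numpy reduction:

```python
import numpy as np

# summing 10,000 copies of 0.1 with an fp16 accumulator stalls once the
# running total's rounding step exceeds the addend
x = np.full(10000, 0.1, dtype=np.float16)
print(x.sum(dtype=np.float16))  # far below 1000 due to fp16 rounding
print(x.sum(dtype=np.float32))  # ~1000, as expected
```

On tensor cores the fp16-accumulate path is typically rated at higher throughput than fp32 accumulation, which is the trade being made here.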
George Hotz
e98506735b
add CONTRACT support to UOp programs (#13043)
* add contract support
* use contract
* 342 tflops
2025-11-01 19:11:32 +08:00
George Hotz
65a0a31475
AMD mi350x matmul from stream (#13040)
* works
* working mfma
* 120 TFLOPS
* regs
* 192 TFLOPS
* try pipelining
* something
* notes
* contract
* linter to 3.11
* that was a bug
2025-11-01 17:55:19 +08:00
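The TFLOPS numbers in the log follow the standard GEMM accounting of 2*M*N*K floating-point operations. A rough timing harness in the same spirit (a sketch, not the benchmark used in the stream; the first iteration includes kernel compile time):

```python
import time
from tinygrad import Tensor, Device, dtypes

N = 4096
a = Tensor.randn(N, N, dtype=dtypes.half).realize()
b = Tensor.randn(N, N, dtype=dtypes.half).realize()

for _ in range(3):  # repeat so later iterations exclude compilation
  st = time.perf_counter()
  (a @ b).realize()                     # force the kernel to run
  Device[Device.DEFAULT].synchronize()  # wait for the device to finish
  el = time.perf_counter() - st
  print(f"{2 * N**3 / el / 1e12:.2f} TFLOPS")
```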
George Hotz
bc178d14a9
matmul example on metal showing off tensor core (#13033)
* matmul example on metal showing off tensor core
* flip the args of placeholder
* mat_idx
* imp
2025-10-31 19:40:36 +08:00
George Hotz
b46229ca51
use shrink in amd_matmul_uop (#13026)
* use shrink in amd_matmul_uop
* colors
2025-10-31 10:43:41 +08:00
George Hotz
512513c403
cleanup amd uop matmul (#13025)
* cleanup amd uop matmul
* remove mod
* move that out
* better variable names
* var names
* more
* render fallback
* colors
2025-10-31 10:04:45 +08:00
George Hotz
4a741e8364
modernize amd uop matmul (#13011)
* modernize amd uop matmul
* progress
* comment
* more comments
* revert that
* mac cleanups
* fix estimates
* format
2025-10-30 17:02:38 +08:00
George Hotz
25c2da1579
check SPEC=2 in CI (#12945)
* check SPEC=2 in CI
* split SPEC=2
* fast enough
2025-10-27 21:53:57 +08:00
chenyu
c5cee74706
remove BLOCK_REORDER (#12854)
not used
2025-10-21 19:10:14 -04:00
b1tg
60d7e232f2
cuda fp8 (#12782)
* cuda fp8
* tensor core
* tc test
* clean
* clean pm
2025-10-21 15:05:25 -04:00
chenyu
ae51bdd06a
remove trivial use of RANGEIFY flag (#12550)
some tests still need updating
2025-10-09 02:29:38 -04:00
chenyu
0e266f376c
ops_gpu -> ops_cl (#12103)
2025-09-10 15:15:48 -04:00
nimlgen
fb96394ff5
auto-select available compilers (#12094)
* device: auto select compilers
* fix
* metal+opencl
* nv/cuda
* test without ptx
* ptx
* fix tests
* fix
* fix test
* rename
* test + cleaner
* xx
* ops
* better test
* win?
* um?
* types
* debug
* win??
* sep rung
* wtf?
* debug
* skip win
* revert this
* types
2025-09-10 19:52:01 +03:00
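With compilers auto-selected, the visible effect is simply which backend a default Tensor lands on; easy to check:

```python
from tinygrad import Device

# resolves to whichever backend (and compiler) tinygrad found available,
# e.g. METAL, CUDA, AMD, or CPU
print(Device.DEFAULT)
```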
George Hotz
38dcadf07b
delete kernel.py (#12040)
* delete kernel.py
* delete that file
* rip and tear
* don't test search
* imports
* fix torch frontend
* not a part of regen
2025-09-05 15:52:07 -07:00
George Hotz
afad7d0cd1
remove dtype from range, it will be dtypes.index soon [pr] (#11914)
* remove dtype from range, it will be dtypes.index soon [pr]
* a few more
2025-08-29 09:52:07 -07:00
George Hotz
394c2d1db1
update Kernel API in tests + move optimize_local_size (#11907)
2025-08-28 15:12:47 -07:00
George Hotz
27701ef823
add locals support to rangeify (#11826)
2025-08-24 14:03:12 -07:00
qazal
793ace530e
update amd_uop_matmul.py import (#11581)
Using this for testing SQTT
2025-08-08 17:07:35 +03:00
George Hotz
82be8abfd2
move opt under codegen (#11569)
2025-08-07 14:19:17 -07:00
George Hotz
4f26a9ad32
check elements_per_thread in tensorcore [pr] (#11435)
2025-07-30 11:55:48 -07:00
George Hotz
1bef2d80c1
unrolls are all in the same scope (#11429)
* unrolls are all in the same scope
* fix that import
2025-07-29 16:55:37 -07:00
George Hotz
03909f2772
permute locals for HL uop matmul (#11412)
* permute locals for HL uop matmul
* parens fix that
* permutes
* 20 TFLOPS
2025-07-29 08:19:59 -07:00
George Hotz
735ad5f10d
kernel4 and 5 in uops (#11411)
* move simplify views to merge views
* add amd kernel 4
* Revert "move simplify views to merge views"
This reverts commit 1e07dff384.
* k4 in python
* kernel4 written in uops
* k5 support
* cleanups
2025-07-28 19:35:48 -07:00
George Hotz
fddc645668
HL=2 top matmul (#11406)
* HL=2 top matmul
* top colored
2025-07-28 12:32:38 -07:00
George Hotz
dfeee63d30
uop matmul work (#11388)
* uop matmul work
* works with locals
2025-07-26 21:23:55 -07:00
George Hotz
2c70eaf18c
fix load / barrier (#11386)
* fix load / barrier
* cleanups
* fix CI
2025-07-26 10:27:37 -07:00
George Hotz
466ab5a3f2
store/load not pass through index (#11381)
* noop
* fix noop
* store cat is NOOP
* store dtype is void
* stores aren't passed through anymore
* meh, skip those for ptx
* correct ptx skip
* hl runs
2025-07-25 21:01:47 -07:00
George Hotz
490a93902c
define reg doesn't have init anymore (#11365)
* define reg doesn't have init anymore
* remove that
* no special logic for dr
* fix amd uop matmul
2025-07-24 19:15:49 -07:00
George Hotz
0602b22086
kernel spec (#11359)
* kernel spec
* ops.VIEW
* work
2025-07-24 12:45:38 -07:00
George Hotz
b0dc97d1f7
write out kernel 3 in uops (#11352)
* write out kernel 3 in uops
* matmul is correct
* gemm passes spec
* bugfix to match speed
* cleanups
2025-07-23 17:32:38 -07:00
George Hotz
108aac8af4
use AddrSpace instead of local (#11314)
* use AddrSpace instead of local
* addrspace in test
2025-07-21 14:00:06 -07:00
George Hotz
842184a1ab
rename kernelize to schedule, try 2 (#11305)
2025-07-21 11:18:36 -07:00
chenyu
a0438012af
remove Kernel.get_program [pr] (#11203)
2025-07-12 20:50:29 -04:00
George Hotz
d67c8e7b42
local metal on metal in uop syntax (#11185)
* local metal on metal in uop syntax
* TODO: just put the axis_info in the kernelinfo
* local
* amd_matmul works @ 28 TFLOPS
* clean up matmul
* kernel8 works
* remove that
* locals
* axistype innovation
* work
* cleanup
* kernel3 regs
* cleanup kernel3
* work
* why is it broken
* no beam
* reenable
* permutes
2025-07-12 16:31:19 -07:00
chenyu
6283d50224
DEPRECATED_linearize -> to_program [pr] (#11198)
2025-07-12 13:46:20 -04:00
George Hotz
2893feb9f6
cleanups for kernel.py (#11143)
* cleanups for kernel.py
* fixups
2025-07-08 18:10:25 -07:00
George Hotz
856759c79c
add halide example (#10980)
* add halide example
* upd halide gemm
* partial works
* touchups
2025-06-26 16:14:57 -07:00
George Hotz
92678e59ee
move kernel to opt (#10899)
2025-06-20 15:22:28 -07:00
Sidharth N. Babu
ef14dfb277
compile fixes (#10442)
2025-06-06 18:38:37 -04:00
George Hotz
411392dfb7
move files into uop dir (#10399)
* move files into uop dir [pr]
* tinygrad.uop is a thing
* fix uop docs, no pr
* fix viz
2025-05-18 11:38:28 -07:00
chenyu
720f20865b
remove required_optimizations (#9848)
2025-04-19 16:51:16 -04:00
chenyu
f5256e0020
Kernel.apply_opts [pr] (#9917)
* Kernel.apply_opts [pr]
updated all `for opt in`. also updated a few test_linearizer tests to not implicitly depend on hand_coded_optimization
* not you yet
2025-04-17 08:00:56 -04:00
chenyu
8c6299bced
move hand_coded_optimizations to heuristic.py [pr] (#9844)
* move hand_coded_optimizations to heuristic.py [pr]
also folded all long lines
* make a copy and rename self -> k
* fix test
2025-04-10 23:40:16 -04:00
chenyu
c5db5b83b9
add SHOULD_USE_TC=1 check to simple_matmul (#9802)
* add SHOULD_USE_TC=1 check to simple_matmul
also zero-centered the random input and updated atol for tf32
* ATOL=2e-2 for HALF
2025-04-09 02:24:42 -04:00
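A hedged sketch of the kind of check described here, with zero-centered inputs and the looser half-precision tolerance; the variable names are illustrative, not simple_matmul's actual code:

```python
import numpy as np
from tinygrad import Tensor, dtypes

N = 512
# zero-centered inputs avoid a one-sided bias in the accumulated rounding error
a = np.random.rand(N, N).astype(np.float32) - 0.5
b = np.random.rand(N, N).astype(np.float32) - 0.5

c = (Tensor(a, dtype=dtypes.half) @ Tensor(b, dtype=dtypes.half)).numpy()
np.testing.assert_allclose(c, a @ b, atol=2e-2)  # ATOL=2e-2 for HALF, per the commit
```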
George Hotz
78caf55154
Revert "FP8 support on NVIDIA ( #8631 )"
...
This reverts commit 2c8e4ea865.
2025-04-09 12:27:41 +08:00
George Hotz
14928fecff
Revert "fix TF32 tensor core dropped in tc_sm89 ( #9798 )"
...
This reverts commit 7c9a96824f.
2025-04-09 12:27:39 +08:00
chenyu
7c9a96824f
fix TF32 tensor core dropped in tc_sm89 (#9798)
also add `SHOULD_USE_TC=1` to verify TC is applied in simple_matmul
2025-04-08 23:20:50 -04:00
pkotzbach
2c8e4ea865
FP8 support on NVIDIA (#8631)
* squashed fp8 commits
* tensorcore start
* minor changes
* pre-commit
* pylint
* Delete fp8mul.cu
* clean
* small bugfix
* fix test_dtype
* fix test_dtype_alu
* add EMULATE_CUDA_SM89
* fix ci
* fix test_linearizer
* fix test_linearizer
* fix swizzle
* add debug to simple_matmul
* fixed swizzle
* python emulator
* refactor python emulator
* setup fix
* numpy setup
* ml_dtypes only in emulate_cuda_sm89
* fix pylint
* fix tests
* fix mypy
* fix mypy
* fix ruff
* done python emulator
* add acc type
* tests
* mypy
* clean code
* add cuda tensor core tests to CI
* minor fix
* clean test_dtype.py
* clean cstyle.py
* clean test_ops.py
* fix test
* fix test
* whitespaces
* pylint
* pylint
* amd?
* amd?
* amd
* reduce lines
* mockgpu remove
* fix
* ruff
* ruff
* fix mypy
* ruff
* test only for cuda
* fixed formatting
* small fixes
* small fix
* least_upper_dtype if fp8s not supported
* log and reciprocal are supported for fp8s
* ops python fixes
* dtypes.fp8s use
* e4m3 + e5m2 result dtype test
* truncate linter fix
---------
Co-authored-by: pkotzbach <pawkotz@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-08 21:54:04 -04:00
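On the two fp8 formats this PR wires up: e4m3 spends its bits on mantissa, e5m2 on exponent. The ml_dtypes package the commit pulls in for CUDA emulation makes the trade visible on the CPU:

```python
import numpy as np
from ml_dtypes import float8_e4m3fn, float8_e5m2

x = np.array([0.1, 1.5, 448.0, 57344.0], dtype=np.float32)
print(x.astype(float8_e4m3fn))  # max normal is 448, so the last value overflows
print(x.astype(float8_e5m2))    # max normal is 57344: wider range, coarser steps
```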
Francis Lam
1e5d9ad8f7
extra/gemm/max_matmul: start of custom kernels for GEMM (#6926)
* extra/gemm/max_matmul: start of custom kernels for GEMM
* add an unoptimized FP16/FP16 MMA example
* add slow 3-stage fp16 acc example
* add correct 3-stage pipeline with unswizzled/flat smem input (slow)
* add acc fp16 example with 3 stages and swizzle (no bank conflicts)
* add max version of NV fp16_fp16_fp16
* fix up comments and removed unused code in max variations
* add start of no_xor example
* fix to account for UOps to Ops
2025-03-19 15:04:57 +08:00
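On the "swizzle (no bank conflicts)" variants: a common trick is to XOR the shared-memory column index with low bits of the row, so a warp walking down a column touches distinct banks. A minimal sketch of the index math only, illustrating the general technique rather than max_matmul's exact layout:

```python
def swizzle(row: int, col: int, width: int = 8) -> int:
  # XOR the column with the row's low bits: within any group of `width` rows,
  # a fixed logical column maps to `width` distinct physical columns
  return col ^ (row % width)

# walking "column 3" down 8 rows now hits 8 different columns (hence banks)
print([swizzle(r, 3) for r in range(8)])  # [3, 2, 1, 0, 7, 6, 5, 4]
```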