chenyu
0e266f376c
ops_gpu -> ops_cl ( #12103 )
2025-09-10 15:15:48 -04:00
nimlgen
fb96394ff5
auto-select available compilers ( #12094 )
...
* device: auto select compilers
* fix
* metal+opencl
* nv/cuda
* test without ptx
* ptx
* fix tests
* fix
* fix test
* rename
* test + cleaner
* xx
* ops
* better test
* win?
* um?
* types
* debug
* win??
* sep rung
* wtf?
* debug
* skip win
* revert this
* types
2025-09-10 19:52:01 +03:00
George Hotz
38dcadf07b
delete kernel.py ( #12040 )
...
* delete kernel.py
* delete that file
* rip and tear
* don't test search
* imports
* fix torch frontend
* not a part of regen
2025-09-05 15:52:07 -07:00
George Hotz
afad7d0cd1
remove dtype from range, it will be dtypes.index soon [pr] ( #11914 )
...
* remove dtype from range, it will be dtypes.index soon [pr]
* a few more
2025-08-29 09:52:07 -07:00
George Hotz
394c2d1db1
update Kernel API in tests + move optimize_local_size ( #11907 )
2025-08-28 15:12:47 -07:00
George Hotz
27701ef823
add locals support to rangeify ( #11826 )
2025-08-24 14:03:12 -07:00
qazal
793ace530e
update amd_uop_matmul.py import ( #11581 )
...
Using this for testing SQTT
2025-08-08 17:07:35 +03:00
George Hotz
82be8abfd2
move opt under codegen ( #11569 )
2025-08-07 14:19:17 -07:00
George Hotz
4f26a9ad32
check elements_per_thread in tensorcore [pr] ( #11435 )
2025-07-30 11:55:48 -07:00
George Hotz
1bef2d80c1
unrolls are all in the same scope ( #11429 )
...
* unrolls are all in the same scope
* fix that import
2025-07-29 16:55:37 -07:00
George Hotz
03909f2772
permute locals for HL uop matmul ( #11412 )
...
* permute locals for HL uop matmul
* parens fix that
* permutes
* 20 TFLOPS
2025-07-29 08:19:59 -07:00
George Hotz
735ad5f10d
kernel4 and 5 in uops ( #11411 )
...
* move simplify views to merge views
* add amd kernel 4
* Revert "move simplify views to merge views"
This reverts commit 1e07dff384.
* k4 in python
* kernel4 written in uops
* k5 support
* cleanups
2025-07-28 19:35:48 -07:00
George Hotz
fddc645668
HL=2 top matmul ( #11406 )
...
* HL=2 top matmul
* top colored
2025-07-28 12:32:38 -07:00
George Hotz
dfeee63d30
uop matmul work ( #11388 )
...
* uop matmul work
* works with locals
2025-07-26 21:23:55 -07:00
George Hotz
2c70eaf18c
fix load / barrier ( #11386 )
...
* fix load / barrier
* cleanups
* fix CI
2025-07-26 10:27:37 -07:00
George Hotz
466ab5a3f2
store/load not pass through index ( #11381 )
...
* noop
* fix noop
* store cat is NOOP
* store dtype is void
* stores aren't passed through anymore
* meh, skip those for ptx
* correct ptx skip
* hl runs
2025-07-25 21:01:47 -07:00
George Hotz
490a93902c
define reg doesn't have init anymore ( #11365 )
...
* define reg doesn't have init anymore
* remove that
* no special logic for dr
* fix amd uop matmul
2025-07-24 19:15:49 -07:00
George Hotz
0602b22086
kernel spec ( #11359 )
...
* kernel spec
* ops.VIEW
* work
2025-07-24 12:45:38 -07:00
George Hotz
b0dc97d1f7
write out kernel 3 in uops ( #11352 )
...
* write out kernel 3 in uops
* matmul is correct
* gemm passes spec
* bugfix to match speed
* cleanups
2025-07-23 17:32:38 -07:00
George Hotz
108aac8af4
use AddrSpace instead of local ( #11314 )
...
* use AddrSpace instead of local
* addrspace in test
2025-07-21 14:00:06 -07:00
George Hotz
842184a1ab
rename kernelize to schedule, try 2 ( #11305 )
2025-07-21 11:18:36 -07:00
chenyu
a0438012af
remove Kernel.get_program [pr] ( #11203 )
2025-07-12 20:50:29 -04:00
George Hotz
d67c8e7b42
local metal on metal in uop syntax ( #11185 )
...
* local metal on metal in uop syntax
* TODO: just put the axis_info in the kernelinfo
* local
* amd_matmul works @ 28 TFLOPS
* clean up matmul
* kernel8 works
* remove that
* locals
* axistype innovation
* work
* cleanup
* kernel3 regs
* cleanup kernel3
* work
* why is it broken
* no beam
* reenable
* permutes
2025-07-12 16:31:19 -07:00
chenyu
6283d50224
DEPRECATED_linearize -> to_program [pr] ( #11198 )
2025-07-12 13:46:20 -04:00
George Hotz
2893feb9f6
cleanups for kernel.py ( #11143 )
...
* cleanups for kernel.py
* fixups
2025-07-08 18:10:25 -07:00
George Hotz
856759c79c
add halide example ( #10980 )
...
* add halide example
* upd halide gemm
* partial works
* touchups
2025-06-26 16:14:57 -07:00
George Hotz
92678e59ee
move kernel to opt ( #10899 )
2025-06-20 15:22:28 -07:00
Sidharth N. Babu
ef14dfb277
compile fixes ( #10442 )
2025-06-06 18:38:37 -04:00
George Hotz
411392dfb7
move files into uop dir ( #10399 )
...
* move files into uop dir [pr]
* tinygrad.uop is a thing
* fix uop docs, no pr
* fix viz
2025-05-18 11:38:28 -07:00
chenyu
720f20865b
remove required_optimizations ( #9848 )
2025-04-19 16:51:16 -04:00
chenyu
f5256e0020
Kernel.apply_opts [pr] ( #9917 )
...
* Kernel.apply_opts [pr]
updated all `for opt in`; also updated a few test_linearizer tests to not implicitly depend on hand_coded_optimizations
* not you yet
2025-04-17 08:00:56 -04:00
chenyu
8c6299bced
move hand_coded_optimizations to heuristic.py [pr] ( #9844 )
...
* move hand_coded_optimizations to heuristic.py [pr]
also folded all long lines
* make a copy and rename self -> k
* fix test
2025-04-10 23:40:16 -04:00
chenyu
c5db5b83b9
add SHOULD_USE_TC=1 check to simple_matmul ( #9802 )
...
* add SHOULD_USE_TC=1 check to simple_matmul
also zero-centered the random input and updated atol for tf32
* ATOL=2e-2 for HALF
2025-04-09 02:24:42 -04:00
George Hotz
78caf55154
Revert "FP8 support on NVIDIA ( #8631 )"
...
This reverts commit 2c8e4ea865.
2025-04-09 12:27:41 +08:00
George Hotz
14928fecff
Revert "fix TF32 tensor core dropped in tc_sm89 ( #9798 )"
...
This reverts commit 7c9a96824f.
2025-04-09 12:27:39 +08:00
chenyu
7c9a96824f
fix TF32 tensor core dropped in tc_sm89 ( #9798 )
...
also add `SHOULD_USE_TC=1` to verify TC is applied in simple_matmul
2025-04-08 23:20:50 -04:00
pkotzbach
2c8e4ea865
FP8 support on NVIDIA ( #8631 )
...
* squashed fp8 commits
* tensorcore start
* minor changes
* pre-commit
* pylint
* Delete fp8mul.cu
* clean
* small bugfix
* fix test_dtype
* fix test_dtype_alu
* add EMULATE_CUDA_SM89
* fix ci
* fix test_linearizer
* fix test_linearizer
* fix swizzle
* add debug to simple_matmul
* fixed swizzle
* python emulator
* refactor python emulator
* setup fix
* numpy setup
* ml_dtypes only in emulate_cuda_sm89
* fix pylint
* fix tests
* fix mypy
* fix mypy
* fix ruff
* done python emulator
* add acc type
* tests
* mypy
* clean code
* add cuda tensor core tests to CI
* minor fix
* clean test_dtype.py
* clean cstyle.py
* clean test_ops.py
* fix test
* fix test
* whitespaces
* pylint
* pylint
* amd?
* amd?
* amd
* reduce lines
* mockgpu remove
* fix
* ruff
* ruff
* fix mypy
* ruff
* test only for cuda
* fixed formatting
* small fixes
* small fix
* least_upper_dtype if fp8s not supported
* log and reciprocal are supported for fp8s
* ops python fixes
* dtypes.fp8s use
* e4m3 + e5m2 result dtype test
* truncate linter fix
---------
Co-authored-by: pkotzbach <pawkotz@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-08 21:54:04 -04:00
Francis Lam
1e5d9ad8f7
extra/gemm/max_matmul: start of custom kernels for GEMM ( #6926 )
...
* extra/gemm/max_matmul: start of custom kernels for GEMM
* add an unoptimized FP16/FP16 MMA example
* add slow 3-stage fp16 acc example
* add correct 3-stage pipeline with unswizzled/flat smem input (slow)
* add acc fp16 example with 3 stages and swizzle (no bank conflicts)
* add max version of NV fp16_fp16_fp16
* fix up comments and removed unused code in max variations
* add start of no_xor example
* fix to account for UOps to Ops
2025-03-19 15:04:57 +08:00
chenyu
0e591baf43
redo simple_matmul change ( #9450 )
...
numpy does not support bfloat16
2025-03-14 17:53:52 -04:00
chenyu
b0f63d3c04
Revert "simple_matmul.py uses np to generate random ( #9438 )" ( #9449 )
...
This reverts commit 14018050c1.
2025-03-14 17:14:22 -04:00
Ignacio Sica
14018050c1
simple_matmul.py uses np to generate random ( #9438 )
...
* np generates randoms
* hotfix: use generator for int dtype
* float32 as default dtype for float generator
* use np.float32 instead of string
* add dtype= to integers generator
* change import _to_np_dtype source
2025-03-14 17:36:50 -03:00
chenyu
01e8b60911
acc_dtype -> dtype ( #9402 )
...
matched numpy and torch
2025-03-10 16:05:30 -04:00
George Hotz
a73d8717f3
fast amd gemm ( #9318 )
...
* 50 TFLOP AMD gemm
* add lds tiling
* register tiling
* flip locals
* work
* comment
* remove those
2025-03-03 12:01:14 +08:00
chenyu
2e7c2780a9
CLANG -> CPU ( #9189 )
2025-02-20 18:03:09 -05:00
George Hotz
a3c78d47b3
speed docs + upgrades [pr] ( #8964 )
...
* add some docs about speed [pr]
* better torch gemm
* enable locals on llvm/clang
* disable locals for beam speed on LLVM/CLANG
* 0x20 alignment in llvm allows ymm use
2025-02-08 17:28:52 +08:00
ignaciosica
b49a04145e
fix for int plus minor cleanup ( #8650 )
2025-01-17 22:30:39 -05:00
qazal
866dfa1f23
create_schedule([x.lazydata]) -> x.schedule() in tests ( #8449 )
2024-12-31 03:15:52 +08:00
George Hotz
c5d458ce02
BufferSpec and ProgramSpec [pr] ( #7814 )
...
* BufferSpec and ProgramSpec [pr]
* delete preallocate, it's unused
* Revert "delete preallocate, it's unused"
This reverts commit dcfcfaccde.
2024-11-21 12:18:05 +08:00
George Hotz
bc977fec53
dname -> device [pr] ( #7804 )
...
* dname -> device [pr]
* a few more
* only one left
2024-11-20 17:57:14 +08:00
George Hotz
d71fe7faa5
rename allocator methods to not conflict [pr] ( #7788 )
...
* rename allocator methods to not conflict [pr]
* forgot those
* transfer + offset
2024-11-20 00:10:29 +08:00