chenyu
d641e63189
improve min/max for AND ( #14356 )
2026-01-26 15:44:18 -05:00
chenyu
f16372487a
fix assign hazard on shrink ( #14355 )
...
* fix assign hazard on shrink
a race is possible if both the assign src and dest are shrinks
* test_nonoverlapping_shrink_assignment
2026-01-26 14:46:30 -05:00
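For illustration only, the kind of hazard this commit describes can be sketched in plain Python, with list slices standing in for shrinks of one buffer. This is not tinygrad's code: `copy_in_place`, `copy_via_snapshot`, and `ranges_overlap` are hypothetical helpers, and the assumption is simply that source and destination are two windows of equal length into the same storage.

```python
def copy_in_place(buf, src_start, dst_start, n):
    # Naive forward copy: later reads may observe values this loop already overwrote.
    for i in range(n):
        buf[dst_start + i] = buf[src_start + i]
    return buf

def copy_via_snapshot(buf, src_start, dst_start, n):
    # Read the whole source window first, then write: no read-after-write hazard.
    tmp = buf[src_start:src_start + n]
    buf[dst_start:dst_start + n] = tmp
    return buf

def ranges_overlap(a_start, b_start, n):
    # Two half-open windows of length n overlap iff each starts before the other ends.
    return a_start < b_start + n and b_start < a_start + n

naive = copy_in_place(list(range(8)), 0, 2, 4)
safe = copy_via_snapshot(list(range(8)), 0, 2, 4)
print(naive)  # [0, 1, 0, 1, 0, 1, 6, 7]
print(safe)   # [0, 1, 0, 1, 2, 3, 6, 7]
```

When `ranges_overlap` is false the two copies agree, which is the non-overlapping case that a test such as `test_nonoverlapping_shrink_assignment` would exercise.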
chenyu
145df879c1
find_permutes -> fix_assign_hazard [pr] ( #14354 )
...
some noop tweaks and comment updates
2026-01-26 14:05:19 -05:00
nimlgen
e152f1b0f5
llama: use ALL2ALL ( #14353 )
2026-01-26 22:01:53 +03:00
nimlgen
3f25eb3026
am: ih ( #14346 )
...
* am: ih
* um
* fix
* line
* no trap and fix ring
* keep
* fix
2026-01-26 20:11:04 +03:00
chenyu
823bc17fb5
failed test case for shrink overlap assigns ( #14350 )
...
* failed test case for shrink overlap assigns
current logic can create a race that results in wrong output
* skip for now
2026-01-26 11:58:45 -05:00
George Hotz
204f51e739
assembly/amd: bug fixes for PYTHON_REMU ( #14347 )
...
* default PYTHON_REMU to 1
* mockgpu
* less size
* normal compile path
* unique
* more
* fix clamp
* Change PYTHON_REMU default to 0 in _try_dlopen_remu
2026-01-27 00:48:22 +08:00
chenyu
231305603d
remove REAL_DEV [pr] ( #14337 )
...
it's just Device.DEFAULT now
2026-01-26 10:08:16 -05:00
Martin Szewieczek
9cbe99348a
func meshgrid: change param index to type str ( #14331 )
2026-01-26 10:07:56 -05:00
George Hotz
3b43d26f10
assembly/amd: emu speed ( #14344 )
...
* assembly/amd: emu speed
* fix spec
* go
* don't do this
* simpler
* no stupid consts
* hack
* simpler
* no index
* no where
* faster linearizer
* fix spec
* no index dtype
2026-01-26 22:21:34 +08:00
George Hotz
774a454bb5
assembly/amd: fix scratch SVE ( #14340 )
...
* assembly/amd: default python REMU
* mem_used
* no lane
* sve
* remove that
* needs s_code_end in tests
2026-01-26 21:03:51 +08:00
qazal
2d91fe6310
use amdgpu dsl in mmapeak ( #14342 )
...
* use amdgpu dsl in mmapeak
* don't rely on llvm for vgpr counting
* llvm roundtrip assert
* rm it, add ci
* vgpr_count
* move emulated test to amd, it needs comgr
* env
* arch
* inst._fields -> inst.operands
* vgpr offset
2026-01-26 22:03:43 +09:00
qazal
b2e2ace85b
viz: remove ci check, it's VIZ=-1/-2 ( #14343 )
2026-01-26 20:36:23 +09:00
George Hotz
be23776ba7
assembly/amd: replace pcode with ucode ( #14002 )
...
* a bunch of todos for my boy claude
* uops have types
* lil cleanups
* simpler ucode
* isNAN
* calls
* move more
* cleanup pcode_parse
* cvt functions
* fix parser bugs
* no void
* minmax
* more pcode parse
* pretty print
* transform
* comments
* move to transform
* assign/declare
* simpler norm
* single PM
* just Uops
* simpler
* more typed
* all rewrite
* less verbose
* work
* spec
* transform
* work
* simpler spec
* less spec
* bitcast
* simpler
* simp ucode
* work
* more in pcode_transform
* remove junk
* more functions
* bug
* no void assign
* load/store
* wave
* fixes
* move denorm
* move more functions
* tests
* cat is shape None
* uop syntax
* move a few more
* program_spec
* cat stuff
* assign fix clear
* unused
* nans
* fp bits
* works with simplify
* remove junk
* special
* meh
* more
* more
* update test pcode parse
* improve parser
* parse some for loops
* merge master
* dead files
* tests pass
* emu2
* better emu2
* test_plus works
* uselessly write more instructions
* use pcode
* something
* something
* bench_emu
* progress
* ds works
* work
* work
* more passing
* run compare
* bench_emu
* more pcode
* a few more
* bugfixes
* bugfix
* test fixes
* tests pass without USE_HW
* all hw tests pass
* add more hw tests
* new hw tests
* bit
* less handcode
* parse more
* consolidate pcode
* fixes
* rsrc
* lane pcode
* cleanups
* simpler
* emu bugs
* one cmp test fails
* fix decode and upd name
* fix name and test harness
* _ftz_f32
* fix denorm
* fix VOPD and use load
* fix carry bug
* no load where / just invalid
* clean
* simpler
* merge sops
* refactoring
* simplifications
* bugfixes
* new tests
* f16 sin fix
* assertion and hw tests
* cvt functions
* one more failure
* bugfixes
* bugfix + regression
* more tests
* fmac
* no manual unrolling
* ordering
* LLVM backend is a lot faster
* compile inst
* more bugs
* f16
* bugfix
* fix regression
* one clang call
* 1M inst
* scratch works
* do scratch correctly
* cleanup
* regression
* cmp
* fmamk fixes
* merge
* fix vcmpx
* unify memory
* remove unused code
* ignore oob for test
* cleanups
* fix mbs
* unify cmp
* test
* minor cleanups
* bump timeout
* fix tests
* revert the CMPLE stuff
* remove opt
* less diff
* simpler
* revert
* support multiple backends
* memset is a lot faster
* split out in bench emu
* improve timing
* timing
* cache that
* cache that
* simpler and faster
* tokenize
* binop table
* simpler
* move to parser
* tok for lambda
* refactor
* expr_parser
* delete emu2_pcode
* import cleanup
* lil
* if parse
* work
* simpler
* no v
* trig preop is faster
* durations for tests
* fix cmp bug
* sdst
* remove scratch_size hack
* null behavior
* _MXCSRContext
* bugfixes
* DEBUG >= 3
* test smem crashes my gpu
* debug
* test
* test smem
* profiler
* full inst
* bugfix
* rtag(1)
* pc is 64-bit and word
* pc is real code now
* dynamic
* more dynamic
* fix oob access
* fix crash, more dyn
* all dyn
* really all dyn
* correct null mask
* lit + format
* 21s on the tests
* 13s on the tests
* canonical name
* simm16
* more dyn
* 14s
* proper saddr dedup
* dyn
* debug 5
* better 5
* revert dynamic stuff
* that can be dyn
* negative offsets
* dyn wmma
* f16 wmma support / ops / dtype / dtype_alu
* symbolic changes not needed
* ConstFloat
* more uop.const
* __eq__
* uop tests
* fix f16
* bf16 tensor cores
* whitespace
* remove cast roundtrip
* Revert "remove cast roundtrip"
This reverts commit c5bb0381c3.
* just the fix
* remove dead paths
* llvm runs
2026-01-26 18:04:29 +08:00
George Hotz
984cdc4840
add wrapper class for the -0.0 != 0.0 issue ( #14339 )
...
* add wrapper class for the -0.0 != 0.0 issue
* fixes
* spec fix
* missed one
2026-01-26 16:52:37 +08:00
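As background for the commit above: in Python, `-0.0 == 0.0` is true and both values hash the same, so the two zeros collapse into one dict key or cache entry. A minimal sketch of a wrapper that keeps them distinct follows. The class name `FloatBits` and its bit-pattern comparison are assumptions made for illustration, not the wrapper the commit actually adds.

```python
import struct

class FloatBits:
    """Hypothetical wrapper: equality and hashing use the IEEE-754 bit pattern."""
    def __init__(self, value: float):
        self.value = value

    def _bits(self) -> bytes:
        # Pack as a little-endian 64-bit double; -0.0 and 0.0 differ in the sign bit.
        return struct.pack("<d", self.value)

    def __eq__(self, other):
        return isinstance(other, FloatBits) and self._bits() == other._bits()

    def __hash__(self):
        return hash(self._bits())

    def __repr__(self):
        return f"FloatBits({self.value!r})"

plain = {0.0: "pos"}
plain[-0.0] = "neg"                   # same key as 0.0, so the entry is overwritten
wrapped = {FloatBits(0.0): "pos"}
wrapped[FloatBits(-0.0)] = "neg"      # distinct key, both entries survive
print(len(plain), len(wrapped))       # 1 2
```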
qazal
92bfe92138
assembly/amd: fix cdna mfma xml ( #14329 )
...
* handwritten failing test
* new amdxml
* more mfma from fixes
* ci
* move arch of test integration
* alt
* amdxml human cleanup
* _TestIntegration rename to IntegrationTestBase
* it's the same problem as _LIT
* better comment
* better variable name
2026-01-26 17:51:26 +09:00
Garret Castro
6c109f4d75
LLVM: CPU threading support ( #14320 )
...
* make generic llvmrenderer class for cpu and amd
* move `tensor_cores` back to parent
* remove empty line
* restore extra matcher position
* add threading
* dont need to add core_id here
* dont move code for workitem
* cleanup
---------
Co-authored-by: TheVanadium <claude_user@ret2022.localdomain>
2026-01-26 13:12:39 +08:00
George Hotz
cc49e47ea2
tinygrad changes from ucode ( #14336 )
...
* tinygrad changes from ucode
* dtype
2026-01-26 11:30:18 +08:00
Garret Castro
8477368d07
generic LLVMRenderer class for CPU and AMD ( #14321 )
...
* make generic llvmrenderer class for cpu and amd
* move `tensor_cores` back to parent
* remove empty line
* restore extra matcher position
* cleanup
---------
Co-authored-by: TheVanadium <claude_user@ret2022.localdomain>
2026-01-26 09:11:49 +08:00
George Hotz
11ce1e847d
llama train: null device support
2026-01-26 08:53:05 +08:00
chenyu
e3601788fa
update torch backend function ( #14333 )
...
those have tensor.py implementation
2026-01-25 16:39:34 -05:00
nimlgen
9865f51e39
cupti: ref collector ( #14330 )
...
* cupti: ref collector
* ll
2026-01-25 20:35:21 +03:00
nimlgen
21ab23ae18
nv: add pma for ada ( #14328 )
...
* nv: add pma for ada
* um
* fix
* shorter
* mock
2026-01-25 17:33:37 +03:00
George Hotz
49db266b96
ReprEnum for repr roundtrips ( #14327 )
...
* ReprEnum for repr roundtrips
* dsl
* bugfixes
* vdsty fixes
* cleaner
* fix
* fix cdna fields
* tests all pass
2026-01-25 18:58:31 +08:00
qazal
bf2d9d138f
viz: simplify amdgpu cfg ( #14326 )
...
* viz: replace llvm disasm with our disasm
* it starts with more code
* then it becomes less
* simpler, cdna disassembles with decimal simm16
* s_branch is upper case, add test
* simm16s and others
2026-01-25 15:21:45 +09:00
qazal
647e527a7e
viz: replace llvm disasm with our disasm ( #14325 )
2026-01-25 13:56:56 +09:00
nimlgen
4280a8eef2
am: update fw ( #14323 )
2026-01-25 01:08:47 +03:00
chenyu
7e41da1ae8
fix generate_dataset.sh ( #14324 )
...
added `set -e` so wrong paths would fail the script, then fixed the path
2026-01-24 16:47:10 -05:00
chenyu
311bfd91d6
clean up where_on_load [pr] ( #14322 )
...
no repeated split_uop and general cleanup
2026-01-24 14:43:43 -05:00
nimlgen
8b282ba6d2
memory: reserved vram ( #14318 )
2026-01-24 19:39:24 +03:00
chenyu
00e9ba0b82
update type for split_uop and where_on_load [pr] ( #14319 )
...
also variable names in where_on_load, before logic update
2026-01-24 11:17:41 -05:00
chenyu
cb69b7b2b2
comment out fold_where_closure ( #14316 )
2026-01-24 10:15:42 -05:00
wozeparrot
d74587f16d
fa multi fix 2 ( #14314 )
2026-01-23 23:35:02 -08:00
chenyu
d9f0ad1d87
update return type for Tensor.tolist ( #14313 )
...
Sequence is incorrect because the result can be a list of lists; use Any to avoid a recursive type
2026-01-23 23:21:49 -05:00
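A small sketch of why `Any` is the pragmatic return annotation here: the result is a scalar for a 0-d input, a flat list for 1-d, and a list of lists beyond that, so a precise type would have to be recursive. The `nest` function below is a hypothetical stand-in written for this note, not `Tensor.tolist` itself.

```python
from typing import Any, List, Tuple

def nest(flat: List[float], shape: Tuple[int, ...]) -> Any:
    # 0-d: return the bare scalar, not a list.
    if not shape:
        return flat[0]
    # An empty leading dimension yields an empty list.
    if shape[0] == 0:
        return []
    # Split the flat data into shape[0] equal chunks and recurse on the remaining dims.
    step = len(flat) // shape[0]
    return [nest(flat[i * step:(i + 1) * step], shape[1:]) for i in range(shape[0])]

print(nest([7.0], ()))                                   # 7.0
print(nest([1.0, 2.0, 3.0], (3,)))                       # [1.0, 2.0, 3.0]
print(nest([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], (2, 3)))      # [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
```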
qazal
807bc40931
assembly/amd: dsl and disasm cleanup ( #14311 )
...
* rdna4 inst helper
* remove dsl aliases
2026-01-24 11:36:12 +09:00
Christopher Milan
e782d44918
WEBGPU/NIR truncates ints ( #14307 )
...
* WEBGPU truncates ints
* nir has this bug too
2026-01-23 19:28:06 -05:00
nimlgen
26220a472e
no core_id ( #14265 )
...
* no core_id
* kwargs
* est
* linters
* ugh
* revert this
* deps
* glb
* should work?
* nn
* line
* fx
* ym
* z
* d
* um?
* revert
* this one?
* first half
* um p2
* all?
* um
* cleaner
* um
2026-01-23 21:30:12 +03:00
chenyu
e65bc7a7c5
where closure folding ( #14304 )
2026-01-23 10:55:13 -05:00
chenyu
d5a3b02a9c
clean up xpow ( #14295 )
...
mostly for `ret * (base < 0).where(adj, ret.const_like(1))` -> `(base < 0).where(neg_base, ret)`, since it's good for NAN neg_base but not generic
2026-01-23 10:19:47 -05:00
qazal
b913c910c5
assembly/amd: rdna4 passing test_roundtrip ( #14300 )
...
* test_roundtrip on different archs
* failing tests
* take RDNA4 xml changes from the emu branch
* work
* min diff to disasm flat
* test_add passes, rdna4 first
* correct vgpr field for the multi dword store stuff
* amdllvm
* recompile in roundtrip, get sources from emulator
* amdllvm, 2
* clean clean
* note, don't rely on that os.environ
---------
Co-authored-by: George Hotz <geohot@gmail.com>
2026-01-23 21:33:53 +09:00
qazal
f3b0e42863
remove extra sqtt pickles in gfx1200 ( #14302 )
2026-01-23 20:13:48 +09:00
George Hotz
d116312b1a
get cdna sqtt working ( #14301 )
...
* get cdna sqtt working
* cdna parser
* wavestart/waveend
* names
* cdna
* test that
2026-01-23 18:46:15 +08:00
George Hotz
a5c4fa39d1
RDNA4 support in SQTT ( #14299 )
...
* table test
* cleanups
* dead file
* delta short
* tests
* delta test
* work
* l4 tests pass
* l0
* cdna
* print
* revert
* wave failure
* wave failure
* test
* encs
* no l0 crap
* L4
* rdna4 sqtt
* notes
* linter
2026-01-23 16:16:45 +08:00
wozeparrot
963c59ebdb
fix: pull fixes from gradacc branch ( #14296 )
2026-01-22 23:07:54 -08:00
Christopher Milan
68668b8f28
fix WEBGPU NEG ( #14298 )
...
* fix WEBGPU NEG
* add test
* parenthesize
2026-01-23 01:44:52 -05:00
qazal
3b8a7bb8c9
use existing roc.py infra for sqtt tests ( #14297 )
...
* add pc, per kernel tracing
* work
* remove those imports
* min diff
2026-01-23 14:07:11 +09:00
chenyu
5f32f7a06b
fix winograd padding order ( #14294 )
2026-01-22 23:00:14 -05:00
George Hotz
52b989c6c8
don't place consts early + fixes from anthropic challenge ( #14286 )
...
* don't place consts early
* add anthropic challenge
* with ref
* do we still have to devectorize bools?
* tests pass
* just WHERE
* fine, revert that
* fine, revert
* only index
* z3 validator doesn't support vectorized
* Revert "z3 validator doesn't support vectorized"
This reverts commit 1b7930ecb3.
* z3 not for vec
* no spec
* VLIWRenderer
* loop unrolling
* better comments
* cleanups
* skip cast
* renderer
* cleanups
* prints
* no hack
* hacks
* bump to 11
* reg warning
* lil clean
* cleaner renderer
2026-01-23 10:48:39 +09:00
chenyu
0903782bc0
remove some dead or unneeded code [pr] ( #14275 )
2026-01-22 20:05:43 -05:00
chenyu
3eb5cd7d32
stronger test_rand_is_lazy ( #14293 )
2026-01-22 18:58:53 -05:00