11880 Commits

Author SHA1 Message Date
chenyu
d641e63189 improve min/max for AND (#14356) 2026-01-26 15:44:18 -05:00
chenyu
f16372487a fix assign hazard on shrink (#14355)
* fix assign hazard on shrink

a race is possible if both the assign src and dest are shrinks

* test_nonoverlapping_shrink_assignment
2026-01-26 14:46:30 -05:00
chenyu
145df879c1 find_permutes -> fix_assign_hazard [pr] (#14354)
some noop tweaks and comment updates
2026-01-26 14:05:19 -05:00
nimlgen
e152f1b0f5 llama: use ALL2ALL (#14353) 2026-01-26 22:01:53 +03:00
nimlgen
3f25eb3026 am: ih (#14346)
* am: ih

* um

* fix

* line

* no trap and fix ring

* keep

* fix
2026-01-26 20:11:04 +03:00
chenyu
823bc17fb5 failed test case for shrink overlap assigns (#14350)
* failed test case for shrink overlap assigns

current logic can create a race resulting in wrong output

* skip for now
2026-01-26 11:58:45 -05:00
George Hotz
204f51e739 assembly/amd: bug fixes for PYTHON_REMU (#14347)
* default PYTHON_REMU to 1

* mockgpu

* less size

* normal compile path

* unique

* more

* fix clamp

* Change PYTHON_REMU default to 0 in _try_dlopen_remu
2026-01-27 00:48:22 +08:00
chenyu
231305603d remove REAL_DEV [pr] (#14337)
it's just Device.DEFAULT now
2026-01-26 10:08:16 -05:00
Martin Szewieczek
9cbe99348a func meshgrid: change param index to type str (#14331) 2026-01-26 10:07:56 -05:00
George Hotz
3b43d26f10 assembly/amd: emu speed (#14344)
* assembly/amd: emu speed

* fix spec

* go

* don't do this

* simpler

* no stupid consts

* hack

* simpler

* no index

* no where

* faster linearizer

* fix spec

* no index dtype
2026-01-26 22:21:34 +08:00
George Hotz
774a454bb5 assembly/amd: fix scratch SVE (#14340)
* assembly/amd: default python REMU

* mem_used

* no lane

* sve

* remove that

* needs s_code_end in tests
2026-01-26 21:03:51 +08:00
qazal
2d91fe6310 use amdgpu dsl in mmapeak (#14342)
* use amdgpu dsl in mmapeak

* don't rely on llvm for vgpr counting

* llvm roundtrip assert

* rm it, add ci

* vgpr_count

* move emulated test to amd, it needs comgr

* env

* arch

* inst._fields -> inst.operands

* vgpr offset
2026-01-26 22:03:43 +09:00
qazal
b2e2ace85b viz: remove ci check, it's VIZ=-1/-2 (#14343) 2026-01-26 20:36:23 +09:00
George Hotz
be23776ba7 assembly/amd: replace pcode with ucode (#14002)
* a bunch of todos for my boy claude

* uops have types

* lil cleanups

* simpler ucode

* isNAN

* calls

* move more

* cleanup pcode_parse

* cvt functions

* fix parser bugs

* no void

* minmax

* more pcode parse

* pretty print

* transform

* comments

* move to transform

* assign/declare

* simpler norm

* single PM

* just Uops

* simpler

* more typed

* all rewrite

* less verbose

* work

* spec

* transform

* work

* simpler spec

* less spec

* bitcast

* simpler

* simp ucode

* work

* more in pcode_transform

* remove junk

* more functions

* bug

* no void assign

* load/store

* wave

* fixes

* move denorm

* move more functions

* tests

* cat is shape None

* uop syntax

* move a few more

* program_spec

* cat stuff

* assign fix clear

* unused

* nans

* fp bits

* works with simplify

* remove junk

* special

* meh

* more

* more

* update test pcode parse

* improve parser

* parse some for loops

* merge master

* dead files

* tests pass

* emu2

* better emu2

* test_plus works

* uselessly write more instructions

* use pcode

* something

* something

* bench_emu

* progress

* ds works

* work

* work

* more passing

* run compare

* bench_emu

* more pcode

* a few more

* bugfixes

* bugfix

* test fixes

* tests pass without USE_HW

* all hw tests pass

* add more hw tests

* new hw tests

* bit

* less handcode

* parse more

* consolidate pcode

* fixes

* rsrc

* lane pcode

* cleanups

* simpler

* emu bugs

* one cmp test fails

* fix decode and upd name

* fix name and test harness

* _ftz_f32

* fix denorm

* fix VOPD and use load

* fix carry bug

* no load where / just invalid

* clean

* simpler

* merge sops

* refactoring

* simplifications

* bugfixes

* new tests

* f16 sin fix

* assertion and hw tests

* cvt functions

* one more failure

* bugfixes

* bugfix + regression

* more tests

* fmac

* no manual unrolling

* ordering

* LLVM backend is a lot faster

* compile inst

* more bugs

* f16

* bugfix

* fix regression

* one clang call

* 1M inst

* scratch works

* do scratch correctly

* cleanup

* regression

* cmp

* fmamk fixes

* merge

* fix vcmpx

* unify memory

* remove unused code

* ignore oob for test

* cleanups

* fix mbs

* unify cmp

* test

* minor cleanups

* bump timeout

* fix tests

* revert the CMPLE stuff

* remove opt

* less diff

* simpler

* revert

* support multiple backends

* memset is a lot faster

* split out in bench emu

* improve timing

* timing

* cache that

* cache that

* simpler and faster

* tokenize

* binop table

* simpler

* move to parser

* tok for lambda

* refactor

* expr_parser

* delete emu2_pcode

* import cleanup

* lil

* if parse

* work

* simpler

* no v

* trig preop is faster

* durations for tests

* fix cmp bug

* sdst

* remove scartch_size hack

* null behavior

* _MXCSRContext

* bugfixes

* DEBUG >= 3

* test smem crashes my gpu

* debug

* test

* test smem

* profiler

* full inst

* bugfix

* rtag(1)

* pc is 64-bit and word

* pc is real code now

* dynamic

* more dynamic

* fix oob access

* fix crash, more dyn

* all dyn

* really all dyn

* correct null mask

* lit + format

* 21s on the tests

* 13s on the tests

* canonical name

* simm16

* more dyn

* 14s

* proper saddr dedup

* dyn

* debug 5

* better 5

* revert dynamic stuff

* that can be dyn

* negative offsets

* dyn wmma

* f16 wmma support / ops / dtype / dtype_alu

* symbolic changes not needed

* ConstFloat

* more uop.const

* __eq__

* uop tests

* fix f16

* bf16 tensor cores

* whitespace

* remove cast roundtrip

* Revert "remove cast roundtrip"

This reverts commit c5bb0381c3.

* just the fix

* remove dead paths

* llvm runs
2026-01-26 18:04:29 +08:00
George Hotz
984cdc4840 add wrapper class for the -0.0 != 0.0 issue (#14339)
* add wrapper class for the -0.0 != 0.0 issue

* fixes

* spec fix

* missed one
2026-01-26 16:52:37 +08:00
qazal
92bfe92138 assembly/amd: fix cdna mfma xml (#14329)
* handwritten failing test

* new amdxml

* more mfma from fixes

* ci

* move arch of test integration

* alt

* amdxml human cleanup

* _TestIntegration rename to IntegrationTestBase

* it's the same problem as _LIT

* better comment

* better variable name
2026-01-26 17:51:26 +09:00
Garret Castro
6c109f4d75 LLVM: CPU threading support (#14320)
* make generic llvmrenderer class for cpu and amd

* move `tensor_cores` back to parent

* remove empty line

* restore extra matcher position

* add threading

* dont need to add core_id here

* dont move code for workitem

* cleanup

---------

Co-authored-by: TheVanadium <claude_user@ret2022.localdomain>
2026-01-26 13:12:39 +08:00
George Hotz
cc49e47ea2 tinygrad changes from ucode (#14336)
* tinygrad changes from ucode

* dtype
2026-01-26 11:30:18 +08:00
Garret Castro
8477368d07 generic LLVMRenderer class for CPU and AMD (#14321)
* make generic llvmrenderer class for cpu and amd

* move `tensor_cores` back to parent

* remove empty line

* restore extra matcher position

* cleanup

---------

Co-authored-by: TheVanadium <claude_user@ret2022.localdomain>
2026-01-26 09:11:49 +08:00
George Hotz
11ce1e847d llama train: null device support 2026-01-26 08:53:05 +08:00
chenyu
e3601788fa update torch backend function (#14333)
those have tensor.py implementation
2026-01-25 16:39:34 -05:00
nimlgen
9865f51e39 cupti: ref collector (#14330)
* cupti: ref collector

* ll
2026-01-25 20:35:21 +03:00
nimlgen
21ab23ae18 nv: add pma for ada (#14328)
* nv: add pma for ada

* um

* fix

* shorter

* mock
2026-01-25 17:33:37 +03:00
George Hotz
49db266b96 ReprEnum for repr roundtrips (#14327)
* ReprEnum for repr roundtrips

* dsl

* bugfixes

* vdsty fixes

* cleaner

* fix

* fix cdna fields

* tests all pass
2026-01-25 18:58:31 +08:00
qazal
bf2d9d138f viz: simplify amdgpu cfg (#14326)
* viz: replace llvm disasm with our disasm

* it starts with more code

* then it becomes less

* simpler, cdna disassembles with decimal simm16

* s_branch is upper case, add test

* simm16s and others
2026-01-25 15:21:45 +09:00
qazal
647e527a7e viz: replace llvm disasm with our disasm (#14325) 2026-01-25 13:56:56 +09:00
nimlgen
4280a8eef2 am: update fw (#14323) 2026-01-25 01:08:47 +03:00
chenyu
7e41da1ae8 fix generate_dataset.sh (#14324)
added `set -e` so wrong paths would fail the script, then fixed the path
2026-01-24 16:47:10 -05:00
chenyu
311bfd91d6 clean up where_on_load [pr] (#14322)
no repeated split_uop and general cleanup
2026-01-24 14:43:43 -05:00
nimlgen
8b282ba6d2 memory: reserved vram (#14318) 2026-01-24 19:39:24 +03:00
chenyu
00e9ba0b82 update type for split_uop and where_on_load [pr] (#14319)
also variable names in where_on_load, before logic update
2026-01-24 11:17:41 -05:00
chenyu
cb69b7b2b2 comment out fold_where_closure (#14316) 2026-01-24 10:15:42 -05:00
wozeparrot
d74587f16d fa multi fix 2 (#14314) 2026-01-23 23:35:02 -08:00
chenyu
d9f0ad1d87 update return type for Tensor.tolist (#14313)
Sequence is incorrect because the result can be a list of lists; use Any to avoid a recursive type
2026-01-23 23:21:49 -05:00
qazal
807bc40931 assembly/amd: dsl and disasm cleanup (#14311)
* rdna4 inst helper

* remove dsl aliases
2026-01-24 11:36:12 +09:00
Christopher Milan
e782d44918 WEBGPU/NIR truncates ints (#14307)
* WEBGPU truncates ints

* nir has this bug too
2026-01-23 19:28:06 -05:00
nimlgen
26220a472e no core_id (#14265)
* no core_id

* kwargs

* est

* linters

* ugh

* revert this

* deps

* glb

* should work?

* nn

* line

* fx

* ym

* z

* d

* um?

* revert

* this one?

* first half

* um p2

* all?

* um

* cleaner

* um
2026-01-23 21:30:12 +03:00
chenyu
e65bc7a7c5 where closure folding (#14304) 2026-01-23 10:55:13 -05:00
chenyu
d5a3b02a9c clean up xpow (#14295)
mostly for `ret * (base < 0).where(adj, ret.const_like(1))` -> `(base < 0).where(neg_base, ret)`, since it's good for NAN neg_base but not generic
2026-01-23 10:19:47 -05:00
qazal
b913c910c5 assembly/amd: rdna4 passing test_roundtrip (#14300)
* test_roundtrip on different archs

* failing tests

* take RDNA4 xml changes from the emu branch

* work

* min diff to disasm flat

* test_add passes, rdna4 first

* correct vgpr field for the multi dword store stuff

* amdllvm

* recompile in roundtrip, get sources from emulator

* amdllvm, 2

* clean clean

* note, don't rely on that os.environ

---------

Co-authored-by: George Hotz <geohot@gmail.com>
2026-01-23 21:33:53 +09:00
qazal
f3b0e42863 remove extra sqtt pickles in gfx1200 (#14302) 2026-01-23 20:13:48 +09:00
George Hotz
d116312b1a get cdna sqtt working (#14301)
* get cdna sqtt working

* cdna parser

* wavestart/waveend

* names

* cdna

* test that
2026-01-23 18:46:15 +08:00
George Hotz
a5c4fa39d1 RDNA4 support in SQTT (#14299)
* table test

* cleanups

* dead file

* delta short

* tests

* delta test

* work

* l4 tests pass

* l0

* cdna

* print

* revert

* wave failure

* wave failure

* test

* encs

* no l0 crap

* L4

* rdna4 sqtt

* notes

* linter
2026-01-23 16:16:45 +08:00
wozeparrot
963c59ebdb fix: pull fixes from gradacc branch (#14296) 2026-01-22 23:07:54 -08:00
Christopher Milan
68668b8f28 fix WEBGPU NEG (#14298)
* fix WEBGPU NEG

* add test

* parenthesize
2026-01-23 01:44:52 -05:00
qazal
3b8a7bb8c9 use existing roc.py infra for sqtt tests (#14297)
* add pc, per kernel tracing

* work

* remove those imports

* min diff
2026-01-23 14:07:11 +09:00
chenyu
5f32f7a06b fix winograd padding order (#14294) 2026-01-22 23:00:14 -05:00
George Hotz
52b989c6c8 don't place consts early + fixes from anthropic challenge (#14286)
* don't place consts early

* add anthropic challenge

* with ref

* do we still have to devectorize bools?

* tests pass

* just WHERE

* fine, revert that

* fine, revert

* only index

* z3 validator doesn't support vectorized

* Revert "z3 validator doesn't support vectorized"

This reverts commit 1b7930ecb3.

* z3 not for vec

* no spec

* VLIWRenderer

* loop unrolling

* better comments

* cleanups

* skip cast

* renderer

* cleanups

* prints

* no hack

* hacks

* bump to 11

* reg warning

* lil clean

* cleaner renderer
2026-01-23 10:48:39 +09:00
chenyu
0903782bc0 remove few dead or unneeded codes [pr] (#14275) 2026-01-22 20:05:43 -05:00
chenyu
3eb5cd7d32 stronger test_rand_is_lazy (#14293) 2026-01-22 18:58:53 -05:00