nimlgen
c6769badc2
mockgpu: async support ( #13868 )
* mockgpu: async support
* cpu
2025-12-29 13:18:37 +03:00
qazal
fc5278746f
mi350x assembly gemm cleanups ( #13867 )
2025-12-29 18:47:23 +09:00
George Hotz
f07c39cfa4
hwtest fixes for rdna3 dsl ( #13865 )
2025-12-28 20:42:29 -05:00
George Hotz
d9603c1bee
improve asm dsl syntax ( #13864 )
* improve asm dsl syntax
* improve asm dsl syntax
2025-12-28 20:04:59 -05:00
chenyu
f5090192c8
reorder AMD tensor core benchmark test ( #13860 )
* reorder AMD tensor core benchmark test
* disable that
2025-12-28 12:29:51 -05:00
qazal
066d96c397
print tflops in asm gemm test ( #13859 )
* print tflops in asm gemm test
* change order
2025-12-29 02:26:40 +09:00
chenyu
a03cd43e78
fix typing in compute_gradient ( #13852 )
2025-12-28 11:52:14 -05:00
chenyu
cba05acadf
re-enable TYPED=1 import test ( #13858 )
2025-12-28 11:49:06 -05:00
qazal
2cfbabdc34
mi350x 1tflop bf16 gemm in extra ( #13702 )
2025-12-28 21:45:42 +09:00
qazal
2180eee5e4
use the asm dsl in remu hwtest.py ( #13856 )
* remu hw test with the asm dsl
* simpler
* nthreads and exec mask
* cmp/cmpx
* assembler error in s_mov_b32
* vopd in dsl?
2025-12-28 11:32:41 +09:00
chenyu
784b919f7f
Revert "optim empty shard #13513 ( #13598 )" ( #13855 )
* Revert "optim empty shard #13513 (#13598 )"
This reverts commit 76d465dbc3.
* test_arange_shrink
* update test
2025-12-27 21:10:23 -05:00
anu
9b4de8abc7
fix beam in python 3.14+ ( #13836 )
* fix beam search on python 3.14
* add PickleableCount class to helpers
* change name, add test, add step
* tidy count init
2025-12-27 16:24:22 -05:00
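The `PickleableCount` helper above suggests the 3.14 breakage comes from `itertools.count` no longer being picklable (pickling of itertools iterators was deprecated in Python 3.12 and removed in 3.14). A minimal sketch of a picklable drop-in counter under that assumption; the class below is illustrative, not the helper's actual implementation:
```
import pickle

class PickleableCount:
  """Pure-Python stand-in for itertools.count whose plain attributes pickle fine."""
  def __init__(self, start=0): self.n = start
  def __iter__(self): return self
  def __next__(self):
    self.n += 1
    return self.n - 1

c = PickleableCount()
next(c); next(c)
c2 = pickle.loads(pickle.dumps(c))  # survives a pickle round trip, unlike the C iterator on 3.14+
assert next(c2) == 2
```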
chenyu
0f74909ae9
clean up rearrange ( #13851 )
2025-12-27 11:06:10 -05:00
qazal
f6c660f7fa
simplify sqtt decoder infra ( #13849 )
* more work
* simpler
2025-12-28 00:31:16 +09:00
Clément Verrier
ae013beab8
handle empty VECTORIZE in UOp.render() ( #13847 )
`UOp.render()` crashed with `IndexError: tuple index out of range` when
the UOp graph contained a `VECTORIZE` with empty `src=()`. This occurs
when reshaping to scalar shape `()`, e.g., `Tensor.ones(4).sum()`.
The bug was in the renderer's VECTORIZE pattern: `all_same(())` returns
`True` (vacuous truth), causing the code to access `x.src[0]` on an
empty tuple.
- Fix `IndexError` when calling `UOp.render()` on graphs containing
empty `VECTORIZE` nodes.
- Add test for empty `VECTORIZE` rendering.
2025-12-27 10:09:39 -05:00
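To make the vacuous-truth trap above concrete, here is a minimal standalone sketch; `all_same` is paraphrased from the commit description rather than taken from tinygrad's helpers, and the guard only illustrates the shape of the fix:
```
def all_same(items): return all(x == items[0] for x in items[1:])

src = ()                   # a VECTORIZE with empty src, e.g. after reshaping to scalar shape ()
print(all_same(src))       # True: all() over an empty sequence is vacuously true
# src[0]                   # would raise IndexError: tuple index out of range
if src and all_same(src):  # skipping the shortcut for empty src avoids the crash
  first = src[0]
```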
qazal
a2da61d096
use new style amd compiler in viz ( #13848 )
* working version, handcode gfx1100 arch
* get target from device properties
* lib in cfg test program spec
2025-12-27 23:59:30 +09:00
JINO ROHIT
1ee92003ea
minor typo ( #13846 )
2025-12-27 09:34:57 -05:00
nimlgen
276159cb87
system: add base_class to pci_scan_bus ( #13845 )
* system: add base_class to pci_scan_bus
* fix
2025-12-27 13:22:21 +03:00
Francis Lata
fac137779e
remove flux1 seed image ( #13843 )
2025-12-27 00:45:11 -05:00
qazal
f6de9095a0
switch asm tests to dsl ( #13840 )
* switch asm tests to dsl
* labeled basic blocks also work
* indenting for basic blocks
* allow define from star import
2025-12-27 02:15:16 +09:00
chenyu
ba922094f2
remove redundant check in disk_supports_fast_copyout ( #13838 )
2025-12-26 11:30:55 -05:00
George Hotz
e9f2aaba2a
simplify rdna3 asm ( #13835 )
* simplify rdna3 asm
* cleanups
* fix names
* fix tests
* fixes
* more test fixes
* type fixes
* tests pass + mypy passes
* 3.11 syntax
2025-12-26 11:21:03 -05:00
nimlgen
c44b4f9ae0
am: fix sdma warm boot ( #13837 )
2025-12-26 12:38:06 +03:00
George Hotz
c6937fa744
more work on RDNA3 asm ( #13833 )
* more llvm asm tests
* roundtrip test
* work
* more handwritten
* more handwritten
* work
* tests pass
* dual mov
* all tests pass
* all tests pass fast
2025-12-25 23:28:14 -05:00
George Hotz
f1111ac7de
move amd compilers to new style ( #13831 )
* move amd compilers to new style
* simplest diff
* AMDHIPrenderer
2025-12-25 13:42:24 -05:00
George Hotz
9d94b8c6b2
python asm dsl in extra + python REMU ( #13436 )
* having fun with python asm dsl
* rdna3
* meh
* all in rdna3
* work
* more work
* work
* integration
* tests
* simpler
* simpler
* asm
* better
* simpler
* progress
* emu
* simpler
* emu
* tests
* types
* vopd
* cleanups
* work
* memory ranges
* add tracing
* refactors
* run_asm exit
* more readable
* compare to remu
* test gemm
* bug + stale
* more tests
* refactor
* tests fix
* more ins
* more instructions
* refactor
* faster
* match case
* match case
* simpler
* work
* tests
* run_asm
* work
* bug fixes
* more emu
* alu/emu
* refactor
* no pipeline emu yet
* alu direct
* fix
* bugfixes + new test
* fix exceptions in emulators
* update gen.py
* pylint
* no pdf
* improve bench_emu
* speedups
* cleanups
* more tests
2025-12-25 13:04:14 -05:00
nimlgen
b5f3a5ad79
am: cleanup comment ( #13828 )
2025-12-25 18:00:28 +03:00
chenyu
8985a4a023
one less branch in Buffer.view [pr] ( #13829 )
2025-12-25 09:34:15 -05:00
chenyu
094753b4e0
renderer arch version cleanup [pr] ( #13830 )
2025-12-25 09:32:56 -05:00
chenyu
54af29dbdb
trange can just be a function ( #13827 )
2025-12-24 23:57:10 -05:00
qazal
a1c1684b91
set .amdhsa_kernarg_size in asm test ( #13826 )
2025-12-25 13:08:14 +09:00
chenyu
da1cb6a9ec
update llama dataloader ( #13825 )
separate creating the dataset from iterating over it, so eval data is not recreated for every eval
2025-12-24 17:42:08 -05:00
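A minimal sketch of the split described in the commit above, with hypothetical helper names rather than the dataloader's real API: the eval dataset is built once up front and only iterated on each eval.
```
def make_eval_dataset():
  print("building eval dataset")   # expensive setup; should run once, not per eval
  return list(range(16))           # stand-in for the tokenized eval samples

def iterate(dataset, bs):
  for i in range(0, len(dataset), bs):
    yield dataset[i:i+bs]          # cheap: batch an already-built dataset

eval_dataset = make_eval_dataset() # created once, outside the train/eval loop
for step in range(3):
  # ... training step ...
  for batch in iterate(eval_dataset, bs=8):
    pass                           # each eval reuses the same dataset object
```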
chenyu
a7fc0c288b
clean up BufferCopy init [pr] ( #13824 )
2025-12-24 10:40:15 -05:00
chenyu
903753c60c
llama wandb logging ( #13822 )
2025-12-24 10:24:59 -05:00
qazal
e3a646dce3
viz: skip plaintext disassemble for cfg ( #13821 )
2025-12-24 23:16:59 +09:00
chenyu
cb07c5d0e8
fewer import annotations ( #13819 )
2025-12-23 18:45:50 -05:00
George Hotz
43c6e973d8
add optional compiler in Renderer ( #13817 )
* add optional compiler in Renderer [pr]
* fix
* late init
* remove precompiled
* cleanup
2025-12-23 17:58:46 -05:00
George Hotz
8eab6175ee
get_program refactor ( #13816 )
* get_program refactor
* fix docs
* cleanup
2025-12-23 16:44:46 -05:00
George Hotz
3d3c5b2fb9
add device to program ( #13815 )
* add device to program
* from_uop
* from_uop no renderer
* simpler global_size
2025-12-23 16:15:33 -05:00
nimlgen
90b217896f
am: xgmi p2p ( #13811 )
* system: use addr space
* am: xgmi
* fix
* ugh
2025-12-23 20:11:38 +03:00
George Hotz
6439a515be
test fixups / speedups / var_vals refactor ( #13812 )
* no PYTHONPATH + llm server port 0
* llm tok speedup
* refactor var_vals
2025-12-23 12:05:59 -05:00
George Hotz
8dcba2e2cc
no full_rewrite [pr] ( #13809 )
* no full_rewrite [pr]
* fix
* fix docs
2025-12-22 23:20:01 -05:00
George Hotz
edce2303f4
rewrite to program ( #13808 )
2025-12-22 20:03:33 -05:00
George Hotz
2af2b4da5d
Revert "rewrites for renderer and compiler ( #13646 )" ( #13806 )
This reverts commit 339dadf056.
2025-12-22 19:21:33 -05:00
George Hotz
339dadf056
rewrites for renderer and compiler ( #13646 )
* rewrites for renderer and compiler
* full_rewrite_to_program
* fix pre-commit
* compiler passed into get_program
* no pkl compiler
* lib on program spec
* fix spec
* fix test
* no device
* compiler_device
* nm
* fix nir
* fix
* simplest
* fix tests
* revert
2025-12-22 18:58:43 -05:00
Daniel Xu
4edaaf19e5
Handle tied embeddings for llama 3.2 1B ( #13796 )
Previously the output.weight layer would not be loaded, and would only
contain randomly initialized values. This led to junk when doing a
forward pass.
Signed-off-by: Daniel Xu <daniel@thinkingmachines.ai>
2025-12-22 16:31:40 -05:00
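A minimal sketch of what handling tied embeddings at load time can look like; `tok_embeddings.weight` follows the usual llama checkpoint naming, and the function is illustrative rather than the repo's actual loader:
```
def tie_output_weight(state_dict):
  # llama 3.2 1B ships no separate output.weight: the LM head shares the token
  # embedding matrix, so reuse it instead of leaving the head randomly initialized.
  if "output.weight" not in state_dict:
    state_dict["output.weight"] = state_dict["tok_embeddings.weight"]
  return state_dict

weights = {"tok_embeddings.weight": [[0.1, 0.2], [0.3, 0.4]]}  # toy checkpoint
assert tie_output_weight(weights)["output.weight"] is weights["tok_embeddings.weight"]
```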
chenyu
7f1d41c9f9
delete files that import ShapeTracker ( #13805 )
2025-12-22 15:54:18 -05:00
qazal
b31373ca70
remove llvm-mca stuff from viz ( #13802 )
2025-12-23 01:41:51 +08:00
chenyu
27d899ce97
TRAIN=0 to only eval llama ( #13804 )
2025-12-22 11:55:46 -05:00
chenyu
39d962106f
update llama logging ( #13803 )
```
REWRITE_STACK_LIMIT=1000000 SMALL=1 BASEDIR=/raid/datasets/c4-8b SAMPLES=1000 BS=8 DP=8 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=8B SEQLEN=1024 PYTHONPATH=. MODEL=llama3 python3 examples/mlperf/model_train.py
1 93.44 s run, 11.8750 loss, 0.000000000001 LR, 642.43 GB used, 19644.30 GFLOPS
2 101.78 s run, 11.8750 loss, 0.000000000001 LR, 1454.57 GB used, 17039.35 GFLOPS
3 7.34 s run, 11.8750 loss, 0.000000000002 LR, 1454.57 GB used, 236258.78 GFLOPS
4 4.32 s run, 11.8750 loss, 0.000000000002 LR, 1454.57 GB used, 401488.40 GFLOPS
5 4.36 s run, 11.9375 loss, 0.000000000003 LR, 1454.57 GB used, 398116.13 GFLOPS
6 4.32 s run, 11.8750 loss, 0.000000000003 LR, 1454.57 GB used, 401878.60 GFLOPS
7 4.34 s run, 11.8750 loss, 0.000000000004 LR, 1454.57 GB used, 399822.57 GFLOPS
8 4.35 s run, 11.8750 loss, 0.000000000004 LR, 1454.57 GB used, 398512.24 GFLOPS
9 4.36 s run, 11.8750 loss, 0.000000000005 LR, 1454.57 GB used, 397832.61 GFLOPS
10 4.40 s run, 11.8750 loss, 0.000000000005 LR, 1454.57 GB used, 394520.83 GFLOPS
```
2025-12-22 11:28:29 -05:00