11640 Commits

Author SHA1 Message Date
nimlgen
c6769badc2 mockgpu: async support (#13868)
* mockgpu: async support

* cpu
2025-12-29 13:18:37 +03:00
qazal
fc5278746f mi350x assembly gemm cleanups (#13867) 2025-12-29 18:47:23 +09:00
George Hotz
f07c39cfa4 hwtest fixes for rdna3 dsl (#13865) 2025-12-28 20:42:29 -05:00
George Hotz
d9603c1bee improve asm dsl syntax (#13864)
* improve asm dsl syntax

* improve asm dsl syntax
2025-12-28 20:04:59 -05:00
chenyu
f5090192c8 reorder AMD tensor core benchmark test (#13860)
* reorder AMD tensor core benchmark test

* disable that
2025-12-28 12:29:51 -05:00
qazal
066d96c397 print tflops in asm gemm test (#13859)
* print tflops in asm gemm test

* change order
2025-12-29 02:26:40 +09:00
chenyu
a03cd43e78 fix typing in compute_gradient (#13852) 2025-12-28 11:52:14 -05:00
chenyu
cba05acadf re-enable TYPED=1 import test (#13858) 2025-12-28 11:49:06 -05:00
qazal
2cfbabdc34 mi350x 1tflop bf16 gemm in extra (#13702) 2025-12-28 21:45:42 +09:00
qazal
2180eee5e4 use the asm dsl in remu hwtest.py (#13856)
* remu hw test with the asm dsl

* simpler

* nthreads and exec mask

* cmp/cmpx

* assembler error in s_mov_b32

* vopd in dsl?
2025-12-28 11:32:41 +09:00
chenyu
784b919f7f Revert "optim empty shard #13513 (#13598)" (#13855)
* Revert "optim empty shard #13513 (#13598)"

This reverts commit 76d465dbc3.

* test_arange_shrink

* update test
2025-12-27 21:10:23 -05:00
anu
9b4de8abc7 fix beam in python 3.14+ (#13836)
* fix beam search on python 3.14

* add PickleableCount class to helpers

* change name, add test, add step

* tidy count init
2025-12-27 16:24:22 -05:00
chenyu
0f74909ae9 clean up rearrange (#13851) 2025-12-27 11:06:10 -05:00
qazal
f6c660f7fa simplify sqtt decoder infra (#13849)
* more work

* simpler
2025-12-28 00:31:16 +09:00
Clément Verrier
ae013beab8 handle empty VECTORIZE in UOp.render() (#13847)
`UOp.render()` crashed with `IndexError: tuple index out of range` when
the UOp graph contained a `VECTORIZE` with empty `src=()`. This occurs
when reshaping to scalar shape `()`, e.g., `Tensor.ones(4).sum()`.

The bug was in the renderer's VECTORIZE pattern: `all_same(())` returns
`True` (vacuous truth), causing the code to access `x.src[0]` on an
empty tuple.

- Fix `IndexError` when calling `UOp.render()` on graphs containing
  empty `VECTORIZE` nodes.
- Add test for empty `VECTORIZE` rendering.
2025-12-27 10:09:39 -05:00
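The vacuous-truth pitfall this commit fixes can be sketched in a few lines (a minimal illustration, not tinygrad's exact renderer code; `first_if_all_same` is a hypothetical helper showing the shape of the fix):

```python
def all_same(items) -> bool:
    # all() over an empty generator is vacuously True: the comparison body
    # (and hence items[0]) is never evaluated for an empty sequence
    return all(x == items[0] for x in items)

assert all_same(()) is True  # vacuous truth, no element ever compared

# without a non-empty guard, the "all same" branch is taken for src == ()
# and a subsequent src[0] access raises IndexError; the fix is to require
# at least one element before collapsing to the first one
def first_if_all_same(items):
    if items and all_same(items):
        return items[0]
    return None

assert first_if_all_same((3, 3, 3)) == 3
assert first_if_all_same(()) is None  # no IndexError on empty src
```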
qazal
a2da61d096 use new style amd compiler in viz (#13848)
* working version, handcode gfx1100 arch

* get target from device properties

* lib in cfg test program spec
2025-12-27 23:59:30 +09:00
JINO ROHIT
1ee92003ea minor typo (#13846) 2025-12-27 09:34:57 -05:00
nimlgen
276159cb87 system: add base_class to pci_scan_bus (#13845)
* system: add base_class to pci_scan_bus

* fix
2025-12-27 13:22:21 +03:00
Francis Lata
fac137779e remove flux1 seed image (#13843) 2025-12-27 00:45:11 -05:00
qazal
f6de9095a0 switch asm tests to dsl (#13840)
* switch asm tests to dsl

* labeled basic blocks also work

* indenting for basic blocks

* allow define from star import
2025-12-27 02:15:16 +09:00
chenyu
ba922094f2 remove redundant check in disk_supports_fast_copyout (#13838) 2025-12-26 11:30:55 -05:00
George Hotz
e9f2aaba2a simplify rdna3 asm (#13835)
* simplify rdna3 asm

* cleanups

* fix names

* fix tests

* fixes

* more test fixes

* type fixes

* tests pass + mypy passes

* 3.11 syntax
2025-12-26 11:21:03 -05:00
nimlgen
c44b4f9ae0 am: fix sdma warm boot (#13837) 2025-12-26 12:38:06 +03:00
George Hotz
c6937fa744 more work on RDNA3 asm (#13833)
* more llvm asm tests

* roundtrip test

* work

* more handwritten

* more handwritten

* work

* tests pass

* dual mov

* all tests pass

* all tests pass fast
2025-12-25 23:28:14 -05:00
George Hotz
f1111ac7de move amd compilers to new style (#13831)
* move amd compilers to new style

* simplest diff

* AMDHIPrenderer
2025-12-25 13:42:24 -05:00
George Hotz
9d94b8c6b2 python asm dsl in extra + python REMU (#13436)
* having fun with python asm dsl

* rdna3

* meh

* all in rdna3

* work

* more work

* work

* integration

* tests

* simpler

* simpler

* asm

* better

* simpler

* progress

* emu

* simpler

* emu

* tests

* types

* vopd

* cleanups

* work

* memory ranges

* add tracing

* refactors

* run_asm exit

* more readable

* compare to remu

* test gemm

* bug + stale

* more tests

* refactor

* tests fix

* more ins

* more instructions

* refactor

* faster

* match case

* match case

* simpler

* work

* tests

* run_asm

* work

* bug fixes

* more emu

* alu/emu

* refactor

* no pipeline emu yet

* alu direct

* fix

* bugfixes + new test

* fix exceptions in emulators

* update gen.py

* pylint

* no pdf

* improve bench_emu

* speedups

* cleanups

* more tests
2025-12-25 13:04:14 -05:00
nimlgen
b5f3a5ad79 am: cleanup comment (#13828) 2025-12-25 18:00:28 +03:00
chenyu
8985a4a023 one less branch in Buffer.view [pr] (#13829) 2025-12-25 09:34:15 -05:00
chenyu
094753b4e0 renderer arch version cleanup [pr] (#13830) 2025-12-25 09:32:56 -05:00
chenyu
54af29dbdb trange can just be a function (#13827) 2025-12-24 23:57:10 -05:00
qazal
a1c1684b91 set .amdhsa_kernarg_size in asm test (#13826) 2025-12-25 13:08:14 +09:00
chenyu
da1cb6a9ec update llama dataloader (#13825)
separate creating the dataset from iterating over it, so eval data is not recreated for each eval
2025-12-24 17:42:08 -05:00
chenyu
a7fc0c288b clean up BufferCopy init [pr] (#13824) 2025-12-24 10:40:15 -05:00
chenyu
903753c60c llama wandb logging (#13822) 2025-12-24 10:24:59 -05:00
qazal
e3a646dce3 viz: skip plaintext disassemble for cfg (#13821) 2025-12-24 23:16:59 +09:00
chenyu
cb07c5d0e8 fewer import annotations (#13819) 2025-12-23 18:45:50 -05:00
George Hotz
43c6e973d8 add optional compiler in Renderer (#13817)
* add optional compiler in Renderer [pr]

* fix

* late init

* remove precompiled

* cleanup
2025-12-23 17:58:46 -05:00
George Hotz
8eab6175ee get_program refactor (#13816)
* get_program refactor

* fix docs

* cleanup
2025-12-23 16:44:46 -05:00
George Hotz
3d3c5b2fb9 add device to program (#13815)
* add device to program

* from_uop

* from_uop no renderer

* simpler global_size
2025-12-23 16:15:33 -05:00
nimlgen
90b217896f am: xgmi p2p (#13811)
* system: use addr space

* am: xgmi

* fix

* ugh
2025-12-23 20:11:38 +03:00
George Hotz
6439a515be test fixups / speedups / var_vals refactor (#13812)
* no PYTHONPATH + llm server port 0

* llm tok speedup

* refactor var_vals
2025-12-23 12:05:59 -05:00
George Hotz
8dcba2e2cc no full_rewrite [pr] (#13809)
* no full_rewrite [pr]

* fix

* fix docs
2025-12-22 23:20:01 -05:00
George Hotz
edce2303f4 rewrite to program (#13808) 2025-12-22 20:03:33 -05:00
George Hotz
2af2b4da5d Revert "rewrites for renderer and compiler (#13646)" (#13806)
This reverts commit 339dadf056.
2025-12-22 19:21:33 -05:00
George Hotz
339dadf056 rewrites for renderer and compiler (#13646)
* rewrites for renderer and compiler

* full_rewrite_to_program

* fix pre-commit

* compiler passed into get_program

* no pkl compiler

* lib on program spec

* fix spec

* fix test

* no device

* compiler_device

* nm

* fix nir

* fix

* simplest

* fix tests

* revert
2025-12-22 18:58:43 -05:00
Daniel Xu
4edaaf19e5 Handle tied embeddings for llama 3.2 1B (#13796)
Previously the output.weight layer would not be loaded, and would only
contain randomly initialized values. This led to junk when doing a
forward pass.

Signed-off-by: Daniel Xu <daniel@thinkingmachines.ai>
2025-12-22 16:31:40 -05:00
chenyu
7f1d41c9f9 delete files that import ShapeTracker (#13805) 2025-12-22 15:54:18 -05:00
qazal
b31373ca70 remove llvm-mca stuff from viz (#13802) 2025-12-23 01:41:51 +08:00
chenyu
27d899ce97 TRAIN=0 to only eval llama (#13804) 2025-12-22 11:55:46 -05:00
chenyu
39d962106f update llama logging (#13803)
```
REWRITE_STACK_LIMIT=1000000 SMALL=1 BASEDIR=/raid/datasets/c4-8b SAMPLES=1000 BS=8 DP=8 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=8B SEQLEN=1024 PYTHONPATH=. MODEL=llama3 python3 examples/mlperf/model_train.py

    1 93.44 s run, 11.8750 loss, 0.000000000001 LR, 642.43 GB used,  19644.30 GFLOPS
    2 101.78 s run, 11.8750 loss, 0.000000000001 LR, 1454.57 GB used,  17039.35 GFLOPS
    3 7.34 s run, 11.8750 loss, 0.000000000002 LR, 1454.57 GB used, 236258.78 GFLOPS
    4 4.32 s run, 11.8750 loss, 0.000000000002 LR, 1454.57 GB used, 401488.40 GFLOPS
    5 4.36 s run, 11.9375 loss, 0.000000000003 LR, 1454.57 GB used, 398116.13 GFLOPS
    6 4.32 s run, 11.8750 loss, 0.000000000003 LR, 1454.57 GB used, 401878.60 GFLOPS
    7 4.34 s run, 11.8750 loss, 0.000000000004 LR, 1454.57 GB used, 399822.57 GFLOPS
    8 4.35 s run, 11.8750 loss, 0.000000000004 LR, 1454.57 GB used, 398512.24 GFLOPS
    9 4.36 s run, 11.8750 loss, 0.000000000005 LR, 1454.57 GB used, 397832.61 GFLOPS
   10 4.40 s run, 11.8750 loss, 0.000000000005 LR, 1454.57 GB used, 394520.83 GFLOPS
```
2025-12-22 11:28:29 -05:00