Commit Graph

9917 Commits

Author SHA1 Message Date
George Hotz
6540bb32a6 move into codegen late [pr] (#11823) 2025-08-24 10:23:25 -07:00
nimlgen
bba088ef11 amd aql queue (#11708)
* amd aql queue

* xcc

* fix

* aql better

* llvm

* no for aql

* wrap

* is_sql

* am support

* complete

* fix

* mypy

* minor
2025-08-24 19:53:00 +03:00
George Hotz
1fa09d9ede BLOCK_REORDER is context var, heuristic cleanups [pr] (#11819)
* BLOCK_REORDER is context var, heuristic cleanups [pr]

* split get opt and do opt

* oops, should be on
2025-08-24 09:41:34 -07:00
qazal
8b18cc2a94 viz memory layout cleanup (#11820)
* rename to dtype_size

* cleaner memory shape creator
2025-08-24 19:37:31 +03:00
Sieds Lykles
dd69114573 Revert "Better div nesting (#11811)" (#11818)
This reverts commit 952f729b07.
2025-08-24 18:11:24 +02:00
nimlgen
e19f901330 amd: rptr/wptr in create_queue (#11817) 2025-08-24 18:03:45 +03:00
nimlgen
d71444857e amd: apply relocs for kernel_code_entry_byte_offset for AMD_LLVM (#11816)
* amd: apply relocs for kernel_code_entry_byte_offset for AMD_LLVM

* fix
2025-08-24 17:48:40 +03:00
George Hotz
44bc7dc73d remove KernelInfo from GROUP_REDUCE (#11814) 2025-08-23 19:55:41 -07:00
George Hotz
229adfb7c3 Revert "remove KernelInfo from gpudims (#11809)" (#11813)
This reverts commit 846753f343.
2025-08-23 19:37:10 -07:00
Sieds Lykles
952f729b07 Better div nesting (#11811)
* remove check

* use fold_divmod_congruence instead of simplify

* adjust tests

* shorten line
2025-08-24 04:17:40 +02:00
Sieds Lykles
e652062f92 tweak divmod_folding condition (#11810) 2025-08-24 02:59:02 +02:00
George Hotz
846753f343 remove KernelInfo from gpudims (#11809)
* remove KernelInfo from gpudims

* that's good in there
2025-08-23 16:32:45 -07:00
Sieds Lykles
07d4ed7e4c one more symbolic add variation (#11807) 2025-08-24 01:15:04 +02:00
qazal
759ebea4eb viz: reflect timeline API boundary in names (#11808)
* define shapes once

* depth isn't an event property

* update server naming
2025-08-24 02:12:12 +03:00
George Hotz
132f09fab7 global/locals from AxisType in range (#11806) 2025-08-23 15:49:17 -07:00
qazal
0d86288bd7 viz: calculate timeline fixed points in client side (#11805)
* viz: calculate timeline fixed points in client side

* 26 bytes / event

* math
2025-08-24 01:44:40 +03:00
George Hotz
a75da49951 use AxisType for UPCAST/UNROLL (#11800)
* use AxisType for UPCAST/UNROLL

* fixes

* fix the bug

* fix hack

* bad test

* flaky test
2025-08-23 14:44:48 -07:00
qazal
2407fecdae viz bytepack format (#11792)
* viz bytepack format

Training a 1B llama yields ~20M profiler events.

With JSON serialization, the browser tries to load ~6GB into memory. This OOMs because each tab is limited to roughly 3-4GB of memory. With a packed format, we only need ~600MB.

**Design decisions:**

- Timestamps are in microseconds relative to start time. They're stored in u32, which can express up to ~1 hr of trace events.
- Strings (kernel names, metadata, etc) are deduped.
- Buffer sizes are in u64 nbytes.
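
The first two decisions can be illustrated with a minimal sketch. The record layout below (`EVENT`, three little-endian u32 fields) and the helper names `pack_events` / `unpack_events` are hypothetical and chosen only for illustration; they are not the layout this commit actually uses.

```python
import json
import struct

# Hypothetical record layout (an assumption, NOT the real viz format):
# u32 string index, u32 start offset in microseconds, u32 duration in microseconds.
EVENT = struct.Struct("<III")

def pack_events(events, start_us):
    """Pack (name, begin_us, end_us) tuples with absolute timestamps into bytes."""
    strings = {}        # name -> index; each distinct string is stored once
    body = bytearray()
    for name, begin_us, end_us in events:
        idx = strings.setdefault(name, len(strings))
        rel = begin_us - start_us   # relative timestamps keep values small
        if not 0 <= rel < 2**32:
            raise ValueError("timestamp does not fit in u32 microseconds")
        body += EVENT.pack(idx, rel, end_us - begin_us)
    # The string table is a JSON-dumped array, prefixed with its byte length.
    table = json.dumps(list(strings)).encode()
    return struct.pack("<I", len(table)) + table + bytes(body)

def unpack_events(blob, start_us):
    """Inverse of pack_events: rebuild the absolute-timestamp tuples."""
    (table_len,) = struct.unpack_from("<I", blob, 0)
    names = json.loads(blob[4:4 + table_len])
    return [(names[idx], start_us + rel, start_us + rel + dur)
            for idx, rel, dur in EVENT.iter_unpack(blob[4 + table_len:])]
```

In this sketch each event costs 12 bytes regardless of how long its name is, and a u32 of microseconds covers 2**32 µs, about 71.6 minutes, which matches the "~1 hr" limit stated above.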

More optimization possible:

- The string lookup is a JSON dumped array, we can compress this.
- Can store less for memory by moving the layout to client.

**Results**

| Workload | Events | JSON | bytepack |
|----------|--------|------|----------|
| DP=8 llama 1B train (`command: [1]`) | 24M | 5.8GB | 640MB |
| examples/beautiful_mnist.py | 16K | 3.7MB | 745KB |
| examples/gpt2.py | 55K | 12.54MB | 1.40MB |

`[1]`: `VIZ=1 FAKEDATA=1 OFFLOAD_OPTIM=1 DP=8 BS=8 GRADIENT_ACC_STEPS=2 BLOCK_REORDER=0 LR=3e-4 TRAIN_ON_VAL=1 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=1B WARMUP_STEPS=36 DECAY_STEPS=360 SEQLEN=8192 PYTHONPATH=. AMD=1 AMD_LLVM=0 MODEL=llama3 python3 examples/mlperf/model_train.py`

* python reference decoder

* 27 bytes / event, 1hr hard limit
2025-08-23 23:50:21 +03:00
qazal
b12d1d866c count bytes per kernel in test_viz (#11801)
Currently at ~100 bytes/kernel with JSON.
2025-08-23 23:35:27 +03:00
Sieds Lykles
6a50ab6b87 adjust idiv min_max (#11802)
* change div min_max

* add tests
2025-08-23 22:25:51 +02:00
chenyu
9d4cccd0f9 test_dtype_alu cleanups (#11799) 2025-08-23 15:11:17 -04:00
George Hotz
aefabaf774 add AxisType to range (#11798)
* add AxisType to range

* missed them

* fix that test

* fix that test
2025-08-23 11:15:00 -07:00
qazal
b975830424 add profile loader helper in test_viz (#11797) 2025-08-23 19:20:29 +03:00
chenyu
7123df3928 Use Tensor.logaddexp to implement Tensor.softplus (#11796)
Instead of a piecewise linear fallback, numerical stability is handled by logaddexp. JAX does this, and I think it's more elegant than torch's approach.
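
The identity behind this change is softplus(x) = log(1 + exp(x)) = logaddexp(x, 0). A scalar sketch in plain Python, assuming nothing about the actual `Tensor.logaddexp` implementation (the functions and the `beta` parameter below are illustrative stand-ins):

```python
import math

def logaddexp(a: float, b: float) -> float:
    """Stable log(exp(a) + exp(b)) = max(a, b) + log1p(exp(-|a - b|))."""
    if a == b:
        # Also covers a == b == +/-inf, where a - b would be nan.
        return a + math.log(2.0)
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def softplus(x: float, beta: float = 1.0) -> float:
    """softplus(x) = log(1 + exp(beta * x)) / beta, written via logaddexp."""
    return logaddexp(beta * x, 0.0) / beta
```

Because the exponent passed to `exp` is never positive, large inputs cannot overflow: `softplus(1000.0)` evaluates to `1000.0`, whereas the naive `math.log(1 + math.exp(1000.0))` raises `OverflowError`. No threshold switching to a linear branch is needed.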
2025-08-23 11:52:29 -04:00
qazal
aaea6b97ad viz memory: compute nbytes (#11795)
* viz memory: compute nbytes

* local map
2025-08-23 17:34:07 +03:00
qazal
58653b5eae viz: store memory scale (#11794) 2025-08-23 16:19:44 +03:00
chenyu
fb8ee02424 Tensor.logaddexp (#11793) 2025-08-23 09:15:00 -04:00
Sieds Lykles
5a6817d5f8 Fix z3 rendering of floats in indexing (#11740)
* Fix floating point comparison in indexing

* wrap in noop

* update tests

* improve rules for loading and comparing floats

* add test cast to bool
2025-08-23 05:56:19 +02:00
chenyu
4267c45db3 non-supported dtype in transcendental (#11754)
* non-supported dtype in transcendental

`CPU=1 python3 test/test_dtype_alu.py TestDTypeALU.test_bfloat16_unary` works

* test

* works on real mac
2025-08-22 23:13:45 -04:00
chenyu
e39b25cd36 upcast float exp to at least float32 (#11758)
* upcast float exp to at least float32

* unlucky seed
2025-08-22 20:16:34 -04:00
nimlgen
b057a90d49 memory: rename is_huge_page -> is_page (#11786) 2025-08-22 20:08:58 +03:00
qazal
38f0fa7bde viz: only send trace duration (#11789)
* viz: only send trace duration

* can unwrap
2025-08-22 20:00:48 +03:00
qazal
1c81ec9248 viz: rename to start/end timestamp (#11788) 2025-08-22 19:47:49 +03:00
qazal
9ff03680ba viz: store relative timestamps (#11787)
* viz: store relative timestamps

* err

* update test
2025-08-22 19:30:21 +03:00
nimlgen
698392334f system: message for eaccess as well (#11785) 2025-08-22 18:21:32 +03:00
geohotstan
1e679bd789 fix max_unpool2d inf (#11784)
* start

* add regression test for maxunpool2d
2025-08-22 08:31:24 -04:00
George Hotz
9832599c9e test_vmap + permute isn't a sint (#11783)
* test_vmap + permute isn't a sint

* order
2025-08-21 22:39:35 -07:00
George Hotz
bb8de51e5f remove unused early cleanups + contig w range [pr] (#11780)
* remove unused early cleanups [pr]

* contiguous with range

* woah, this works
2025-08-21 20:04:45 -07:00
chenyu
91a4de4ca7 fix getitem with inf in tensor (#11781) 2025-08-21 21:55:32 -04:00
George Hotz
66e9d54eed RANGEIFY=2 is partial contig (#11777) 2025-08-21 16:53:58 -07:00
Jordan Chalupka
8de6db15ac exclude .git from ruff (#11773) 2025-08-21 15:37:50 -07:00
George Hotz
5954a0975f fix some assigns on rangeify (#11774)
* fix some assigns

* llvm test

* more tests

* upd test
2025-08-21 15:15:54 -07:00
qazal
2e0eb88549 viz: add metadata to UOp tracing (#11772)
* viz: add metadata to UOp tracing

* place after tag

* optional field

* err, refcount of root must be 0
2025-08-22 00:18:45 +03:00
George Hotz
d6f9606e93 small cleanups to rangeify (#11769) 2025-08-21 11:15:09 -07:00
uuuvn
bd4a9473b0 Multihost exception handling (#11729)
Co-authored-by: wozeparrot <wozeparrot@gmail.com>
2025-08-21 13:51:49 -04:00
George Hotz
a2c7b807e0 don't bufferize 0s (#11766) 2025-08-21 10:10:56 -07:00
nimlgen
9eff7cd1d8 am: support 64bit discovery (#11768) 2025-08-21 18:28:13 +03:00
b1tg
56cd47a159 fix amd llvm bf16 tc (#11713)
* fix amd llvm bf16 tc

* is_cdna

---------

Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-08-21 09:33:28 -04:00
George Hotz
a044648111 rangeify load cleanups + multi support (#11765)
* use the old buf_uop + cleanups

* simpler handling of load

* everything needed for multi too
2025-08-20 20:55:49 -07:00
George Hotz
9f94c25a25 fix symbolic usage. use shrink, not reshape (#11762)
* fix test_var

* revert those things

* fix the ones in test tiny

* use better syntax

* it's the same, but that's clearer

* fix pad
2025-08-20 18:35:42 -07:00