George Hotz
6540bb32a6
move into codegen late [pr] ( #11823 )
2025-08-24 10:23:25 -07:00
nimlgen
bba088ef11
amd aql queue ( #11708 )
* amd aql queue
* xcc
* fix
* aql better
* llvm
* no for aql
* wrap
* is_sql
* am support
* complete
* fix
* mypy
* minor
2025-08-24 19:53:00 +03:00
George Hotz
1fa09d9ede
BLOCK_REORDER is context var, heuristic cleanups [pr] ( #11819 )
* BLOCK_REORDER is context var, heuristic cleanups [pr]
* split get opt and do opt
* oops, should be on
2025-08-24 09:41:34 -07:00
qazal
8b18cc2a94
viz memory layout cleanup ( #11820 )
* rename to dtype_size
* cleaner memory shape creator
2025-08-24 19:37:31 +03:00
Sieds Lykles
dd69114573
Revert "Better div nesting ( #11811 )" ( #11818 )
This reverts commit 952f729b07.
2025-08-24 18:11:24 +02:00
nimlgen
e19f901330
amd: rptr/wptr in create_queue ( #11817 )
2025-08-24 18:03:45 +03:00
nimlgen
d71444857e
amd: apply relocs for kernel_code_entry_byte_offset for AMD_LLVM ( #11816 )
* amd: apply relocs for kernel_code_entry_byte_offset for AMD_LLVM
* fix
2025-08-24 17:48:40 +03:00
George Hotz
44bc7dc73d
remove KernelInfo from GROUP_REDUCE ( #11814 )
2025-08-23 19:55:41 -07:00
George Hotz
229adfb7c3
Revert "remove KernelInfo from gpudims ( #11809 )" ( #11813 )
This reverts commit 846753f343.
2025-08-23 19:37:10 -07:00
Sieds Lykles
952f729b07
Better div nesting ( #11811 )
* remove check
* use fold_divmod_congruence instead of simplify
* adjust tests
* shorten line
2025-08-24 04:17:40 +02:00
Sieds Lykles
e652062f92
tweak divmod_folding condition ( #11810 )
2025-08-24 02:59:02 +02:00
George Hotz
846753f343
remove KernelInfo from gpudims ( #11809 )
* remove KernelInfo from gpudims
* that's good in there
2025-08-23 16:32:45 -07:00
Sieds Lykles
07d4ed7e4c
one more symbolic add variation ( #11807 )
2025-08-24 01:15:04 +02:00
qazal
759ebea4eb
viz: reflect timeline API boundary in names ( #11808 )
* define shapes once
* depth isn't an event property
* update server naming
2025-08-24 02:12:12 +03:00
George Hotz
132f09fab7
global/locals from AxisType in range ( #11806 )
2025-08-23 15:49:17 -07:00
qazal
0d86288bd7
viz: calculate timeline fixed points in client side ( #11805 )
* viz: calculate timeline fixed points in client side
* 26 bytes / event
* math
2025-08-24 01:44:40 +03:00
George Hotz
a75da49951
use AxisType for UPCAST/UNROLL ( #11800 )
* use AxisType for UPCAST/UNROLL
* fixes
* fix the bug
* fix hack
* bad test
* flaky test
2025-08-23 14:44:48 -07:00
qazal
2407fecdae
viz bytepack format ( #11792 )
* viz bytepack format
Training a 1B llama yields ~20M profiler events.
With JSON serialization, the browser tries to load 6GB into memory. This OOMs since each tab is limited to ~3-4GB of memory. Using a packed format, we only need ~600MB.
**Design decisions:**
- Timestamps are in microseconds relative to start time. They're stored in u32, which can express up to ~1 hr of trace events.
- Strings (kernel names, metadata, etc) are deduped.
- Buffer sizes are in u64 nbytes.
More optimization possible:
- The string lookup is a JSON dumped array, we can compress this.
- We can store less for memory events by moving the layout computation to the client.
**Results**
| | Events | JSON | bytepack |
|----------------|---------|-------------|-------------|
| DP=8 llama 1B train (`command: [1]`) | 24M | 5.8GB | 640MB |
| examples/beautiful_mnist.py | 16K | 3.7MB | 745KB |
| examples/gpt2.py | 55K | 12.54MB | 1.40MB |
`[1]`: `VIZ=1 FAKEDATA=1 OFFLOAD_OPTIM=1 DP=8 BS=8 GRADIENT_ACC_STEPS=2 BLOCK_REORDER=0 LR=3e-4 TRAIN_ON_VAL=1 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=1B WARMUP_STEPS=36 DECAY_STEPS=360 SEQLEN=8192 PYTHONPATH=. AMD=1 AMD_LLVM=0 MODEL=llama3 python3 examples/mlperf/model_train.py`
* python reference decoder
* 27 bytes / event, 1hr hard limit
2025-08-23 23:50:21 +03:00
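The packed-event layout described above (u32 timestamps in microseconds relative to trace start, a deduplicated string table, u64 buffer sizes) can be sketched as follows. The field order, the 16-byte record layout, and the helper names are illustrative assumptions, not tinygrad's actual bytepack format.

```python
import struct

# Hypothetical sketch of a packed profiler-event format along the lines
# described in the commit: u32 relative-microsecond timestamps, a deduped
# string table indexed by u32, u64 nbytes. Assumes names contain no NUL.
def encode(events, start_ts):
  strings, string_idx = [], {}
  def intern(s):
    if s not in string_idx:
      string_idx[s] = len(strings)
      strings.append(s)
    return string_idx[s]
  body = bytearray()
  for name, ts, nbytes in events:
    # u32 string index + u32 relative us + u64 nbytes = 16 bytes/event
    body += struct.pack("<IIQ", intern(name), ts - start_ts, nbytes)
  table = "\x00".join(strings).encode()
  return struct.pack("<I", len(table)) + table + bytes(body)

def decode(buf):
  (tlen,) = struct.unpack_from("<I", buf, 0)
  strings = buf[4:4+tlen].decode().split("\x00")
  events, off = [], 4+tlen
  while off < len(buf):
    sidx, rel_us, nbytes = struct.unpack_from("<IIQ", buf, off)
    events.append((strings[sidx], rel_us, nbytes))
    off += 16
  return events
```

At 16 bytes per event plus a shared string table, 24M events come out in the same few-hundred-MB range the table above reports, versus GB-scale JSON.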
qazal
b12d1d866c
count bytes per kernel in test_viz ( #11801 )
Currently at ~100 bytes/kernel with JSON.
2025-08-23 23:35:27 +03:00
Sieds Lykles
6a50ab6b87
adjust idiv min_max ( #11802 )
* change div min_max
* add tests
2025-08-23 22:25:51 +02:00
chenyu
9d4cccd0f9
test_dtype_alu cleanups ( #11799 )
2025-08-23 15:11:17 -04:00
George Hotz
aefabaf774
add AxisType to range ( #11798 )
* add AxisType to range
* missed them
* fix that test
* fix that test
2025-08-23 11:15:00 -07:00
qazal
b975830424
add profile loader helper in test_viz ( #11797 )
2025-08-23 19:20:29 +03:00
chenyu
7123df3928
Use Tensor.logaddexp to implement Tensor.softplus ( #11796 )
instead of the piecewise-linear formulation, numerical stability is handled by logaddexp. jax does this and i think it's more elegant than torch's approach
2025-08-23 11:52:29 -04:00
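The identity this commit relies on can be sketched in plain Python: softplus(x) = log(1 + e^x) = logaddexp(x, 0), where logaddexp is evaluated stably as max(a, b) + log1p(exp(-|a - b|)). The beta handling below is an assumption matching the common softplus(x, beta) definition, not tinygrad's implementation.

```python
import math

# softplus(x, beta) = (1/beta) * log(1 + e^(beta*x)) = (1/beta) * logaddexp(beta*x, 0)
# Evaluated the stable way: the naive math.log(1 + math.exp(x)) overflows
# around x ~ 710, while this form is exact for large |x|.
def softplus(x: float, beta: float = 1.0) -> float:
  a, b = beta * x, 0.0
  hi, lo = max(a, b), min(a, b)
  # logaddexp(a, b) = max(a, b) + log1p(exp(-(max - min)))
  return (hi + math.log1p(math.exp(lo - hi))) / beta
```

For x = 1000 this returns 1000.0 exactly, where the piecewise approach needs an explicit linear branch.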
qazal
aaea6b97ad
viz memory: compute nbytes ( #11795 )
* viz memory: compute nbytes
* local map
2025-08-23 17:34:07 +03:00
qazal
58653b5eae
viz: store memory scale ( #11794 )
2025-08-23 16:19:44 +03:00
chenyu
fb8ee02424
Tensor.logaddexp ( #11793 )
2025-08-23 09:15:00 -04:00
Sieds Lykles
5a6817d5f8
Fix z3 rendering of floats in indexing ( #11740 )
* Fix floating point comparison in indexing
* wrap in noop
* update tests
* improve rules for loading and comparing floats
* add test cast to bool
2025-08-23 05:56:19 +02:00
chenyu
4267c45db3
non-supported dtype in transcendental ( #11754 )
* non-supported dtype in transcendental
`CPU=1 python3 test/test_dtype_alu.py TestDTypeALU.test_bfloat16_unary` works
* test
* works on real mac
2025-08-22 23:13:45 -04:00
chenyu
e39b25cd36
upcast float exp to at least float32 ( #11758 )
* upcast float exp to at least float32
* unlucky seed
2025-08-22 20:16:34 -04:00
nimlgen
b057a90d49
memory: rename is_huge_page -> is_page ( #11786 )
2025-08-22 20:08:58 +03:00
qazal
38f0fa7bde
viz: only send trace duration ( #11789 )
* viz: only send trace duration
* can unwrap
2025-08-22 20:00:48 +03:00
qazal
1c81ec9248
viz: rename to start/end timestamp ( #11788 )
2025-08-22 19:47:49 +03:00
qazal
9ff03680ba
viz: store relative timestamps ( #11787 )
* viz: store relative timestamps
* err
* update test
2025-08-22 19:30:21 +03:00
nimlgen
698392334f
system: message for eaccess as well ( #11785 )
2025-08-22 18:21:32 +03:00
geohotstan
1e679bd789
fix max_unpool2d inf ( #11784 )
* start
* add regression test for maxunpool2d
2025-08-22 08:31:24 -04:00
George Hotz
9832599c9e
test_vmap + permute isn't a sint ( #11783 )
* test_vmap + permute isn't a sint
* order
2025-08-21 22:39:35 -07:00
George Hotz
bb8de51e5f
remove unused early cleanups + contig w range [pr] ( #11780 )
* remove unused early cleanups [pr]
* contiguous with range
* woah, this works
2025-08-21 20:04:45 -07:00
chenyu
91a4de4ca7
fix getitem with inf in tensor ( #11781 )
2025-08-21 21:55:32 -04:00
George Hotz
66e9d54eed
RANGEIFY=2 is partial contig ( #11777 )
2025-08-21 16:53:58 -07:00
Jordan Chalupka
8de6db15ac
exclude .git from ruff ( #11773 )
2025-08-21 15:37:50 -07:00
George Hotz
5954a0975f
fix some assigns on rangeify ( #11774 )
* fix some assigns
* llvm test
* more tests
* upd test
2025-08-21 15:15:54 -07:00
qazal
2e0eb88549
viz: add metadata to UOp tracing ( #11772 )
* viz: add metadata to UOp tracing
* place after tag
* optional field
* err, refcount of root must be 0
2025-08-22 00:18:45 +03:00
George Hotz
d6f9606e93
small cleanups to rangeify ( #11769 )
2025-08-21 11:15:09 -07:00
uuuvn
bd4a9473b0
Multihost exception handling ( #11729 )
Co-authored-by: wozeparrot <wozeparrot@gmail.com>
2025-08-21 13:51:49 -04:00
George Hotz
a2c7b807e0
don't bufferize 0s ( #11766 )
2025-08-21 10:10:56 -07:00
nimlgen
9eff7cd1d8
am: support 64bit discovery ( #11768 )
2025-08-21 18:28:13 +03:00
b1tg
56cd47a159
fix amd llvm bf16 tc ( #11713 )
* fix amd llvm bf16 tc
* is_cdna
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-08-21 09:33:28 -04:00
George Hotz
a044648111
rangeify load cleanups + multi support ( #11765 )
* use the old buf_uop + cleanups
* simpler handling of load
* everything needed for multi too
2025-08-20 20:55:49 -07:00
George Hotz
9f94c25a25
fix symbolic usage. use shrink, not reshape ( #11762 )
* fix test_var
* revert those things
* fix the ones in test tiny
* use better syntax
* it's the same, but that's clearer
* fix pad
2025-08-20 18:35:42 -07:00