952 Commits

Author SHA1 Message Date
qazal
0d86288bd7 viz: calculate timeline fixed points in client side (#11805)
* viz: calculate timeline fixed points in client side

* 26 bytes / event

* math
2025-08-24 01:44:40 +03:00
qazal
2407fecdae viz bytepack format (#11792)
* viz bytepack format

Training a 1B llama yields ~20M profiler events.

With JSON serialization, the browser tries to load ~6GB into memory. This OOMs, since each tab is limited to at most 3-4GB of memory. With a packed format, we only need ~600MB.

**Design decisions:**

- Timestamps are in microseconds relative to the start time. They're stored as u32, which can express up to 2^32 µs (about 71 minutes, so roughly 1 hr) of trace events.
- Strings (kernel names, metadata, etc) are deduped.
- Buffer sizes are in u64 nbytes.
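
A minimal sketch of how these three decisions could fit together in Python. The record layout (`<IIQ`), the length-prefixed JSON string table, and the names `pack_events` / `unpack_events` are assumptions for illustration only; they are not the layout used by this PR.

```python
import json
import struct

# Hypothetical record: u32 relative timestamp (us), u32 string index, u64 nbytes.
RECORD = struct.Struct("<IIQ")

def pack_events(events, start_us):
    """events: iterable of (absolute_ts_us, name, nbytes) tuples."""
    strings, index = [], {}
    body = bytearray()
    for ts_us, name, nbytes in events:
        rel = ts_us - start_us
        if not 0 <= rel < 2**32:
            raise ValueError("timestamp does not fit in u32 microseconds")
        # Dedupe strings: each distinct name is stored once and referenced by index.
        if name not in index:
            index[name] = len(strings)
            strings.append(name)
        body += RECORD.pack(rel, index[name], nbytes)
    table = json.dumps(strings).encode()
    # Layout: u32 table length, JSON string table, then fixed-size records.
    return struct.pack("<I", len(table)) + table + bytes(body)

def unpack_events(blob):
    (table_len,) = struct.unpack_from("<I", blob, 0)
    strings = json.loads(blob[4:4 + table_len])
    records = blob[4 + table_len:]
    return [(rel, strings[i], nbytes) for rel, i, nbytes in RECORD.iter_unpack(records)]
```

In this sketch every event costs a fixed 16 bytes regardless of how long its name is, which is where most of the savings over JSON would come from.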

More optimization possible:

- The string lookup is a JSON-dumped array; this can be compressed.
- Memory events can store less by moving the layout to the client.

**Results**

| Workload | Events | JSON | bytepack |
|----------------|---------|-------------|-------------|
| DP=8 llama 1B train (`command: [1]`) | 24M | 5.8GB | 640MB |
| examples/beautiful_mnist.py | 16K | 3.7MB | 745KB |
| examples/gpt2.py | 55K | 12.54MB | 1.40MB |

`[1]`: `VIZ=1 FAKEDATA=1 OFFLOAD_OPTIM=1 DP=8 BS=8 GRADIENT_ACC_STEPS=2 BLOCK_REORDER=0 LR=3e-4 TRAIN_ON_VAL=1 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=1B WARMUP_STEPS=36 DECAY_STEPS=360 SEQLEN=8192 PYTHONPATH=. AMD=1 AMD_LLVM=0 MODEL=llama3 python3 examples/mlperf/model_train.py`

* python reference decoder

* 27 bytes / event, 1hr hard limit
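
A quick back-of-the-envelope check of the two numbers above, using the rounded event count from the results table. This is plain arithmetic, not code from the PR, and it ignores the string table.

```python
events = 24_000_000        # "24M" from the results table (rounded)
bytes_per_event = 27       # from this commit message

# Estimated payload size in MB (10^6 bytes).
packed_mb = events * bytes_per_event / 1e6
print(packed_mb)           # 648.0, in line with the ~640MB reported

# Largest relative timestamp a u32 of microseconds can hold, in minutes.
u32_limit_minutes = 2**32 / 1e6 / 60
print(round(u32_limit_minutes, 2))   # 71.58
```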
2025-08-23 23:50:21 +03:00
qazal
b12d1d866c count bytes per kernel in test_viz (#11801)
Currently at ~100 bytes/kernel with JSON.
2025-08-23 23:35:27 +03:00
Sieds Lykles
6a50ab6b87 adjust idiv min_max (#11802)
* change div min_max

* add tests
2025-08-23 22:25:51 +02:00
George Hotz
aefabaf774 add AxisType to range (#11798)
* add AxisType to range

* missed them

* fix that test

* fix that test
2025-08-23 11:15:00 -07:00
qazal
b975830424 add profile loader helper in test_viz (#11797) 2025-08-23 19:20:29 +03:00
qazal
9ff03680ba viz: store relative timestamps (#11787)
* viz: store relative timestamps

* err

* update test
2025-08-22 19:30:21 +03:00
qazal
2e0eb88549 viz: add metadata to UOp tracing (#11772)
* viz: add metadata to UOp tracing

* place after tag

* optional field

* err, refcount of root must be 0
2025-08-22 00:18:45 +03:00
chenyu
be7b0b6970 TRANSCENDENTAL_SUPPORTED_DTYPES->TRANSCENDENTAL_DTYPES (#11752) 2025-08-20 10:29:36 -04:00
ttomsa
70c3f1fb29 x.where(False, True) -> !x (#11738)
* add pat

* add test
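
A scalar sanity check of the identity behind this rewrite. The `where` helper below is a hypothetical stand-in for an elementwise select; it is not tinygrad's pattern matcher.

```python
def where(cond: bool, if_true: bool, if_false: bool) -> bool:
    # Scalar model of select: pick if_true when cond holds, else if_false.
    return if_true if cond else if_false

# x.where(False, True) agrees with logical NOT for both boolean inputs.
for x in (False, True):
    assert where(x, False, True) == (not x)
```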
2025-08-19 19:08:16 -04:00
George Hotz
1d307f568c move device tests to test/device + test cleanups (#11735)
* move device tests to test/device

* test speedups

* test device

* linalg to unit

* upd

* so pytest just works

* more divide and skip

* speed

* test devectorize

* add pillow
2025-08-19 16:02:20 -07:00
George Hotz
4b3fcb4064 Revert "REDUCE_AXIS keepdim=False (#11311)" (#11718)
This reverts commit b518a7378a.
2025-08-18 13:28:53 -07:00
b1tg
b518a7378a REDUCE_AXIS keepdim=False (#11311)
* progress

* fix tests

* fix tests

* remove hack for test_symfold

* fix test_conv.py  on llvm

* hack test_cache_speed

* lint

* remove hack for helper_linearizer_opt

* tests

* fix DSP

* clean up

* remove hack for kernelize.py

* hack for test/test_multitensor.py TestMultiTensor.test_matmul_shard_none

* clean

* uop.r need reshape?

* lower_store cause fail

* fix lower?

* avoid contiguous hack

* 2134

* conv2d count

* remove unused

* hack lower

* reduced and clean up

* fix TestMultiTensor.test_matmul_shard_none

* src sync + fix TestMultiTensor.test_matmul_shard_none

* remove excluded in mop

---------

Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2025-08-18 10:09:17 -07:00
chenyu
c30a113b2a support bf16 and fp8 in Tensor.tolist (#11704)
memoryview does not support these dtypes, but casting works, so cast first
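
To illustrate why widening is safe for bf16: a bfloat16 value is the top 16 bits of an IEEE 754 float32, so converting to float32 loses nothing. The helpers below are hypothetical and written in pure Python; they are not how `Tensor.tolist` is implemented.

```python
import struct

def bf16_bits_to_float(bits: int) -> float:
    # Shift the 16 bf16 bits into the upper half of a float32 bit pattern.
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def bf16_buffer_to_list(raw: bytes) -> list:
    # Read little-endian u16 words, then widen each one to a Python float.
    return [bf16_bits_to_float(w) for (w,) in struct.iter_unpack("<H", raw)]
```

For example, the bytes `80 3f 00 c0` hold the bf16 bit patterns `0x3F80` and `0xC000`, which decode to `1.0` and `-2.0`.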
2025-08-17 15:11:13 -04:00
qazal
d762edd694 viz: define tracks in python (#11701)
* viz: defines tracks in python

* update unittests

* figuring it out

* works

* diff cleanup

* math

* y axis is back
2025-08-17 18:19:13 +03:00
qazal
c8ba48b223 show rewrite errors in viz (#11684) 2025-08-15 19:09:47 +03:00
George Hotz
560984fd8d small changes from rangeify (#11682)
* small changes from rangeify

* const like thing

* ksym
2025-08-15 08:45:52 -07:00
Sieds Lykles
06beeb6e13 Nest div even if factor is negative (#11666) 2025-08-14 13:58:59 +02:00
Sieds Lykles
661e9a2d5d div_and_mod_folding refactor (#11585)
* divmod const folding is its own function

* split nested mod optimization out of div and mod folding

* make `fold_binary_numerator` its own function

* factor out `fold_divmod_congruence`

* check sign of numerator

* add tests

* assert int on vmin and vmax

* add type: ignore

* factor out more rules

* remove div_and_mod_folding

* cached_property to property

* remove import

* add returns

* restore old order

* check sign of x.vmin and newx.vmin

* check more signs

* add some test that would have caught bugs

* better test if the div simplified

* shorten line

* replace terms_factors_const with pop_const

* move that back

* minor cleanup

* remove comments

* some cleanup
2025-08-14 11:52:42 +02:00
chenyu
4fe19eec72 Ops.TRUNC (#11659) 2025-08-13 18:40:48 -04:00
George Hotz
22bdf48cdd render ranges in viz, name gbufs with sizes. changes from rangeify (#11656)
* render ranges in viz, name gbufs with sizes. changes from rangeify

* fix unit test dtypes
2025-08-13 12:46:16 -07:00
George Hotz
d2521d828a transcendental+idiv+threefry are uop decompositions (#11636)
* transcendental+idiv+threefry are uop decompositions [pr]

* threefry decomp

* fix randomness tests

* fix webgpu

* unneeded now

* fix

* move prematcher

* all cast should probably be cast_vec
2025-08-13 09:37:12 -07:00
Sieds Lykles
4c3982c44e Take sign out of mod (#11631)
* Add rule and test

* fix tests
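
The commit message does not spell out the rule. As an assumption, the sketch below checks two identities that a rule of this kind could rely on when the remainder truncates toward zero (C semantics). `c_mod` is a hypothetical helper; note that Python's own `%` is floored, so `-7 % 3 == 2`.

```python
def c_mod(a: int, b: int) -> int:
    # Remainder with truncation toward zero: the result takes the sign of a.
    r = abs(a) % abs(b)
    return -r if a < 0 else r

for x in range(-20, 21):
    for c in range(1, 8):
        # A sign on the numerator can be pulled out of the mod.
        assert c_mod(-x, c) == -c_mod(x, c)
        # A sign on the denominator does not affect the result.
        assert c_mod(x, -c) == c_mod(x, c)
```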
2025-08-12 18:44:36 +02:00
George Hotz
ca41b5e38b skip_0 in graph rewrite [pr] (#11627)
* skip_0 in graph rewrite [pr]

* no track_rewrites on test

* use dict instead of set
2025-08-11 18:29:04 -07:00
George Hotz
996c907c0b rewrite not ready + children machinery (#11607)
* rewrite not ready + children machinery

* it doesn't like track rewrites
2025-08-10 15:28:30 -07:00
qazal
960cc6533a pass through name function args in track_rewrites (#11572) 2025-08-08 02:28:52 +03:00
George Hotz
82be8abfd2 move opt under codegen (#11569) 2025-08-07 14:19:17 -07:00
George Hotz
6ed2dfd187 delete the arange dim mismatch restriction (#11568)
* delete the arange dim mismatch restriction

* skip that test race
2025-08-07 13:46:17 -07:00
George Hotz
9764c6cdee fix mismatch reduce, try 2 (#11560)
* fix mismatch reduce, try 2

* fix heuristic

* delete that test

* don't start allowing ones
2025-08-07 07:57:58 -07:00
George Hotz
a1aa5670aa Revert "fix mismatch reduce (#11547)" (#11549)
This reverts commit 49d21a9055.
2025-08-06 22:43:15 -07:00
George Hotz
49d21a9055 fix mismatch reduce (#11547)
* fix mismatch reduce

* cleanups

* fix shape

* fix mypy

* resolve
2025-08-06 21:12:51 -07:00
George Hotz
21570545d3 move view pushing to codegen, try 2 (#11534)
* move view pushing to codegen, try 2

* fix up some linearizer tests

* fix test search

* fix test schedule

* delete that test

* fix test arange

* fix a few tests

* update tests

* push views

* ebs cleanup

* fix local/reg

* test and lint

* fix more tests

* test cleanups

* skipped that one
2025-08-06 15:58:38 -07:00
George Hotz
80d9cced07 more test cleanups (#11544)
* more test cleanups

* revert that
2025-08-06 15:05:21 -07:00
qazal
846a2826ab viz: remove TracingKey.fmt (#11482)
* viz: remove TracingKey.fmt

* remove from test too
2025-08-05 00:00:03 +03:00
leopf
4f0ee4e982 BPE tokenizer (#11415)
* BPE works

* refactor tok

* oops

* basic tests

* fix eval

* smaller diff

* fix error

* proper vocab decoding

* use regex for splitting

* escape ucatrange

* full compat

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-08-04 09:52:38 -07:00
chenyu
e0106b6b25 1/(x*c) -> (1/c)*(1/x) (#11491)
example: 2*(2*a).reciprocal() -> a.reciprocal()

# TODO: bounds for reciprocal
# TODO: should z3 work?
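
An exact-arithmetic check of the rewrite and of the example above, using `fractions.Fraction` so rounding does not interfere. With floating point the two sides may differ by rounding for a general `c`. The `reciprocal` helper is a stand-in, not the tinygrad method.

```python
from fractions import Fraction

def reciprocal(v: Fraction) -> Fraction:
    return 1 / v

c = Fraction(2)
for a in (Fraction(1, 3), Fraction(5), Fraction(-7, 2)):
    # 1/(x*c) == (1/c)*(1/x)
    assert reciprocal(a * c) == reciprocal(c) * reciprocal(a)
    # 2*(2*a).reciprocal() == a.reciprocal()
    assert 2 * reciprocal(2 * a) == reciprocal(a)
```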
2025-08-03 23:35:46 -04:00
chenyu
66be747908 few more dtype cast convenience methods (#11480) 2025-08-02 15:47:09 -04:00
chenyu
e22e5da9a5 move some test_dtype tests to unit (#11479) 2025-08-02 15:25:00 -04:00
qazal
fa66d9772d viz: show const node when it's root (#11456) 2025-08-01 01:01:58 +03:00
chenyu
d5fc6af4a2 remove unused ShapeTracker.consecutive [pr] (#11426) 2025-07-29 18:36:19 -04:00
chenyu
88c338bfcc add kernelize to keccak for each data block (#11370)
* add kernelize to keccak for each data block

test_long works now. This prevents the internal uops from growing proportionally to the data length and eventually becoming too deep.

* this?

* hash stuff

* gate test

* mv
2025-07-25 16:07:20 -04:00
chenyu
82e6de7fc6 more keccak reference tests (#11329) 2025-07-23 22:06:39 -04:00
George Hotz
e14b4fefa5 ranges on store (#11334)
* ranges on store

* fix store spec

* fix that

* fix gates

* fix tests

* fix ptx
2025-07-22 21:00:50 -07:00
chenyu
4535908679 update keccak test_long (#11331)
it should compare with arg "shake_128"
2025-07-22 16:08:01 -04:00
qazal
6668d6d241 fix word_wrap with newlines in input string [pr] (#11319) 2025-07-22 12:03:13 +03:00
George Hotz
842184a1ab rename kernelize to schedule, try 2 (#11305) 2025-07-21 11:18:36 -07:00
wozeparrot
30ce16a424 feat: failing test for long keccak (#11292) 2025-07-21 12:49:23 -04:00
nimlgen
188ed38315 replace from_mv with lightweight mv_address (#11280) 2025-07-19 13:50:51 +03:00
quortus
924bc7c9ae Fix test_uop_spec (#11259) 2025-07-16 11:02:31 +03:00
Alisher Zhubanyshev
4ef6b46b34 hcq: reduce launch overhead (#11193)
* nv: improve mmio creation speed

* add memoryview test

* fix indents

* move mv bench to `test_helpers`, remove comparison
2025-07-13 19:25:50 +03:00