Commit Graph

4184 Commits

Author SHA1 Message Date
Sieds Lykles
a3aeef45cc associative variation of where branch-merging (#11851)
* add rule and test

* change comment
2025-08-26 19:27:05 +02:00
b1tg
1dd613cb89 test float_to_bf16 round-to-even behavior (#11849)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-08-26 12:16:10 -04:00
b1tg
409399c609 fix nan in float_to_bf16 (#11843)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-08-26 11:42:25 -04:00
chenyu
f28f613f85 improved float_to_bf16 (#11848)
round instead of truncate
2025-08-26 11:14:06 -04:00
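The three float_to_bf16 commits above share one idea: convert float32 to bfloat16 by rounding to nearest-even instead of truncating the low 16 mantissa bits, with NaN handled as a special case. A minimal sketch of that technique, assuming nothing about tinygrad's actual helper (the function name and bit tricks below are illustrative):

```python
# Hedged sketch (not tinygrad's float_to_bf16): float32 -> bfloat16 via
# round-to-nearest-even on the low 16 mantissa bits, with NaN special-cased.
import struct

def float_to_bf16_bits(x: float) -> int:
    (bits,) = struct.unpack("<I", struct.pack("<f", x))  # reinterpret float32 as u32
    if (bits & 0x7FFFFFFF) > 0x7F800000:
        return (bits >> 16) | 0x0040  # NaN: force a quiet-NaN payload, never round toward inf
    # round to nearest, ties to even: add 0x7FFF plus the lsb of the bits we keep
    bits += 0x7FFF + ((bits >> 16) & 1)
    return bits >> 16

# 1.006 rounds up to 0x3f81 (~1.0078); plain truncation would give 0x3f80 (1.0)
print(hex(float_to_bf16_bits(1.006)))
```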
chenyu
337e979a59 call dtypes.as_const in Tensor(list) (#11840) 2025-08-25 22:08:26 -04:00
chenyu
ac3449b0c8 truncate_fp16 cleanup (#11838)
native `@` is default
2025-08-25 19:03:41 -04:00
qazal
a1f6823060 viz: memory layout in client side (#11830)
* viz: memory layout in client side

* update test_viz
2025-08-25 14:49:33 +03:00
Sieds Lykles
a286a1a6f7 Fast idiv try removing factors of two before cast (#11824)
* try removing factors of two

* dont return if None

* add test
2025-08-24 20:04:25 +02:00
George Hotz
6540bb32a6 move into codegen late [pr] (#11823) 2025-08-24 10:23:25 -07:00
Sieds Lykles
dd69114573 Revert "Better div nesting (#11811)" (#11818)
This reverts commit 952f729b07.
2025-08-24 18:11:24 +02:00
Sieds Lykles
952f729b07 Better div nesting (#11811)
* remove check

* use fold_divmod_congruence instead of simplify

* adjust tests

* shorten line
2025-08-24 04:17:40 +02:00
Sieds Lykles
e652062f92 tweak divmod_folding condition (#11810) 2025-08-24 02:59:02 +02:00
Sieds Lykles
07d4ed7e4c one more symbolic add variation (#11807) 2025-08-24 01:15:04 +02:00
qazal
0d86288bd7 viz: calculate timeline fixed points in client side (#11805)
* viz: calculate timeline fixed points in client side

* 26 bytes / event

* math
2025-08-24 01:44:40 +03:00
George Hotz
a75da49951 use AxisType for UPCAST/UNROLL (#11800)
* use AxisType for UPCAST/UNROLL

* fixes

* fix the bug

* fix hack

* bad test

* flaky test
2025-08-23 14:44:48 -07:00
qazal
2407fecdae viz bytepack format (#11792)
* viz bytepack format

Training a 1B llama yields ~20M profiler events.

With JSON serialization, the browser tries to load 6GB to memory. This OOMs since each tab is limited to <3-4GB memory usage. Using a packed format, we only need ~600MB.

**Design decisions:**

- Timestamps are in microseconds relative to start time. They're stored in u32, which can express up to ~1 hr of trace events.
- Strings (kernel names, metadata, etc) are deduped.
- Buffer sizes are in u64 nbytes.

More optimization is possible:

- The string lookup is a JSON-dumped array; we can compress this.
- We can store less for memory events by moving the layout to the client.

**Results**

|  | Events | JSON | bytepack |
|----------------|---------|-------------|-------------|
| DP=8 llama 1B train (`command: [1]`) | 24M | 5.8GB | 640MB |
| examples/beautiful_mnist.py | 16K | 3.7MB | 745KB |
| examples/gpt2.py | 55K | 12.54MB | 1.40MB |

`[1]`: `VIZ=1 FAKEDATA=1 OFFLOAD_OPTIM=1 DP=8 BS=8 GRADIENT_ACC_STEPS=2 BLOCK_REORDER=0 LR=3e-4 TRAIN_ON_VAL=1 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=1B WARMUP_STEPS=36 DECAY_STEPS=360 SEQLEN=8192 PYTHONPATH=. AMD=1 AMD_LLVM=0 MODEL=llama3 python3 examples/mlperf/model_train.py`

* python reference decoder

* 27 bytes / event, 1hr hard limit
2025-08-23 23:50:21 +03:00
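As a rough illustration of the design decisions listed above (u32 timestamps in microseconds relative to start, a deduplicated string table, u64 byte counts), here is a toy packer sketch; the field order, widths, and class name are assumptions, not the actual bytepack layout:

```python
# Toy sketch of the ideas above, not the real viz bytepack format: relative u32
# microsecond timestamps (so ~1 hour fits), a deduplicated string table, and
# u64 nbytes per buffer. Field order and widths here are illustrative.
import struct, json

class EventPacker:
    def __init__(self):
        self.strings: dict[str, int] = {}  # name -> index in the dedup table
        self.events = bytearray()

    def intern(self, s: str) -> int:
        return self.strings.setdefault(s, len(self.strings))

    def add_event(self, ts_us_rel: int, name: str, nbytes: int):
        assert ts_us_rel < 2**32, "u32 relative timestamps cap a trace at ~1 hr"
        # u32 timestamp, u32 string index, u64 nbytes: 16 bytes per event in this toy layout
        self.events += struct.pack("<IIQ", ts_us_rel, self.intern(name), nbytes)

    def dump(self) -> bytes:
        # the string lookup stays a JSON-dumped array here; as noted above, it could be compressed
        table = json.dumps(list(self.strings)).encode()
        return struct.pack("<I", len(table)) + table + bytes(self.events)

p = EventPacker()
p.add_event(0, "r_256_16_8", 1024)
p.add_event(42, "r_256_16_8", 4096)  # second event reuses the interned name
print(len(p.dump()))                 # far smaller than the equivalent JSON
```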
qazal
b12d1d866c count bytes per kernel in test_viz (#11801)
Currently at ~100 bytes/kernel with JSON.
2025-08-23 23:35:27 +03:00
Sieds Lykles
6a50ab6b87 adjust idiv min_max (#11802)
* change div min_max

* add tests
2025-08-23 22:25:51 +02:00
chenyu
9d4cccd0f9 test_dtype_alu cleanups (#11799) 2025-08-23 15:11:17 -04:00
George Hotz
aefabaf774 add AxisType to range (#11798)
* add AxisType to range

* missed them

* fix that test

* fix that test
2025-08-23 11:15:00 -07:00
qazal
b975830424 add profile loader helper in test_viz (#11797) 2025-08-23 19:20:29 +03:00
chenyu
7123df3928 Use Tensor.logaddexp to implement Tensor.softplus (#11796)
instead of piecewise linear, numerical stability is handled by logaddexp. jax does this and i think it's more elegant than torch's approach
2025-08-23 11:52:29 -04:00
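The identity behind the change above: softplus(x) = log(1 + exp(x)) = logaddexp(x, 0), and a numerically stable logaddexp never exponentiates a large positive argument. A small numpy sketch of that idea; tinygrad's Tensor.logaddexp and Tensor.softplus may differ in detail:

```python
# Sketch of the identity: softplus(x) = log(1 + exp(x)) = logaddexp(x, 0).
# A stable logaddexp never exponentiates a large positive number. numpy used
# for illustration; tinygrad's implementation may differ.
import numpy as np

def logaddexp(a, b):
    m = np.maximum(a, b)
    return m + np.log1p(np.exp(-np.abs(a - b)))  # exp argument is always <= 0

def softplus(x):
    return logaddexp(x, 0.0)

x = np.array([-1000.0, 0.0, 1000.0])
print(softplus(x))            # [0.0, 0.693..., 1000.0] with no overflow
print(np.log1p(np.exp(x)))    # the naive form overflows to inf at x = 1000
```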
chenyu
fb8ee02424 Tensor.logaddexp (#11793) 2025-08-23 09:15:00 -04:00
Sieds Lykles
5a6817d5f8 Fix z3 rendering of floats in indexing (#11740)
* Fix floating point comparison in indexing

* wrap in noop

* update tests

* improve rules for loading and comparing floats

* add test cast to bool
2025-08-23 05:56:19 +02:00
chenyu
e39b25cd36 upcast float exp to at least float32 (#11758)
* upcast float exp to at least float32

* unlucky seed
2025-08-22 20:16:34 -04:00
qazal
9ff03680ba viz: store relative timestamps (#11787)
* viz: store relative timestamps

* err

* update test
2025-08-22 19:30:21 +03:00
geohotstan
1e679bd789 fix max_unpool2d inf (#11784)
* start

* add regression test for maxunpool2d
2025-08-22 08:31:24 -04:00
George Hotz
9832599c9e test_vmap + permute isn't a sint (#11783)
* test_vmap + permute isn't a sint

* order
2025-08-21 22:39:35 -07:00
George Hotz
bb8de51e5f remove unused early cleanups + contig w range [pr] (#11780)
* remove unused early cleanups [pr]

* contiguous with range

* woah, this works
2025-08-21 20:04:45 -07:00
chenyu
91a4de4ca7 fix getitem with inf in tensor (#11781) 2025-08-21 21:55:32 -04:00
George Hotz
5954a0975f fix some assigns on rangeify (#11774)
* fix some assigns

* llvm test

* more tests

* upd test
2025-08-21 15:15:54 -07:00
qazal
2e0eb88549 viz: add metadata to UOp tracing (#11772)
* viz: add metadata to UOp tracing

* place after tag

* optional field

* err, refcount of root must be 0
2025-08-22 00:18:45 +03:00
George Hotz
9f94c25a25 fix symbolic usage. use shrink, not reshape (#11762)
* fix test_var

* revert those things

* fix the ones in test tiny

* use better syntax

* it's the same, but that's clearer

* fix pad
2025-08-20 18:35:42 -07:00
chenyu
5276fbc9c5 fix gather with inf values (#11760)
(mask * x) is wrong because 0*inf is nan. i feel we have a lot of those still...
2025-08-20 20:35:40 -04:00
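The failure mode the commit above describes is general: implementing a gather/select as a one-hot mask multiplied into the source yields nan whenever a masked-out element is inf, because 0 * inf is nan. A minimal numpy illustration (not the tinygrad code):

```python
# Why (mask * x) breaks with inf: 0 * inf is nan, so masked-out positions leak
# nan into the reduction. Selecting with where() avoids the multiply entirely.
# numpy sketch, not tinygrad's gather.
import numpy as np

x = np.array([1.0, np.inf, 3.0])
mask = np.array([1.0, 0.0, 0.0])                  # one-hot pick of index 0

print((mask * x).sum())                           # nan: 0.0 * inf poisons the sum
print(np.where(mask.astype(bool), x, 0.0).sum())  # 1.0: the inf is never touched
```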
George Hotz
9635592141 ** rangeify, try 3 (#11683)
* ** rangeify, try 3

* bring that over

* bufferize, don't use contig tag

* work

* ish

* fix rangeify

* flash attention is back

* fix rangeify tests

* stuff passes

* fix test_log_softmax

* more stuff passes

* progress children

* new endrange solution

* progress

* progress counter

* basic assign

* contigs only

* symbolic in schedule

* unbind_kernel

* late children

* ops fixed

* beautiful mnist is close

* that seems to work

* mnist works

* improve names

* fix bmnist

* no pcontig

* testing backward

* work

* clone movement ops

* new_range helper

* MBLOCK/MERGE

* ops tests pass

* revert mblock stuff

* cleanups...but it breaks ops

* remove reindex

* hack for relu

* disable the hacks

* more hacks

* upd

* mostly works with cleanups disabled

* ndr

* ops tests pass

* terrible hacks for indexing to work

* context mismatch

* pcontig

* split pcontig v contig

* z3 trunc

* null

* no fuse in rangeify

* ops test passes

* lnorm

* fix assign

* nd rangeify

* both should work

* tests for rangeify

* cleanups

* stores pass the pointer through

* disable pcontig for now

* PARTIAL_CONTIG is a flag
2025-08-20 14:22:44 -07:00
chenyu
d7553721d1 clean up test_dtype_alu (#11757)
remove the check that looks into schedule, only test if output matches
2025-08-20 14:36:18 -04:00
qazal
de4cb722a4 viz: add metadata and var_vals tracing (#11753)
* viz: add metadata and var_vals tracing

* add test_trace_metadata

* set TRACEMETA=1
2025-08-20 18:39:51 +03:00
chenyu
be7b0b6970 TRANSCENDENTAL_SUPPORTED_DTYPES->TRANSCENDENTAL_DTYPES (#11752) 2025-08-20 10:29:36 -04:00
ttomsa
220a2a88d7 a*(1/b) -> a/b on LLVM, CPU (#11743)
* add fdiv rewrite

* :)

* use float_lop

* use reciprocal()

* revert

* move to decompositions
2025-08-20 09:35:10 -04:00
George Hotz
12ab3f8b06 correct row_count in process replay (#11748) 2025-08-19 22:21:07 -07:00
George Hotz
8af8808c61 cleanup tests, bump caches (#11746) 2025-08-19 21:21:07 -07:00
George Hotz
00391db628 no ast for mem estimate (#11744)
* no ast for mem estimate

* skip for webgpu
2025-08-19 20:18:45 -07:00
ttomsa
70c3f1fb29 x.where(False, True) -> !x (#11738)
* add pat

* add test
2025-08-19 19:08:16 -04:00
George Hotz
1d307f568c move device tests to test/device + test cleanups (#11735)
* move device tests to test/device

* test speedups

* test device

* linalg to unit

* upd

* so pytest just works

* more divide and skip

* speed

* test devectorize

* add pillow
2025-08-19 16:02:20 -07:00
nimlgen
9c9e337c78 amd: parse soc enums (#11727)
* amd: parse soc enums

* remove from mock

* fix

* minimal amd_gpu
2025-08-19 15:06:09 +03:00
George Hotz
4b3fcb4064 Revert "REDUCE_AXIS keepdim=False (#11311)" (#11718)
This reverts commit b518a7378a.
2025-08-18 13:28:53 -07:00
b1tg
b518a7378a REDUCE_AXIS keepdim=False (#11311)
* progress

* fix tests

* fix tests

* remove hack for test_symfold

* fix test_conv.py  on llvm

* hack test_cache_speed

* lint

* remove hack for helper_linearizer_opt

* tests

* fix DSP

* clean up

* remove hack for kernelize.py

* hack for test/test_multitensor.py TestMultiTensor.test_matmul_shard_none

* clean

* uop.r need reshape?

* lower_store cause fail

* fix lower?

* avoid contiguous hack

* 2134

* conv2d count

* remove unused

* hack lower

* reduced and clean up

* fix TestMultiTensor.test_matmul_shard_none

* src sync + fix TestMultiTensor.test_matmul_shard_none

* remove excluded in mop

---------

Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2025-08-18 10:09:17 -07:00
chenyu
c30a113b2a support bf16 and fp8 in Tensor.tolist (#11704)
memoryview does not support them, but casting works fine, so we cast first
2025-08-17 15:11:13 -04:00
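To illustrate the workaround described above: Python's memoryview/struct have no bfloat16 or fp8 format codes, so tolist() can cast to a supported dtype first and read that. A hedged numpy sketch of the bf16 case; the exact cast target in tinygrad may differ:

```python
# Hedged sketch of the workaround: memoryview/struct have no bfloat16 (or fp8)
# format codes, so cast to a supported dtype first and list that. numpy's
# float32 stands in for the cast target; tinygrad's path may differ.
import numpy as np

bf16_bits = np.array([0x3F80, 0x4000, 0xC040], dtype=np.uint16)  # 1.0, 2.0, -3.0 as bf16
as_f32 = (bf16_bits.astype(np.uint32) << 16).view(np.float32)    # widen bf16 -> float32
print(as_f32.tolist())  # [1.0, 2.0, -3.0] once it's a dtype Python can read natively
```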
qazal
d762edd694 viz: define tracks in python (#11701)
* viz: defines tracks in python

* update unittests

* figuring it out

* works

* diff cleanup

* math

* y axis is back
2025-08-17 18:19:13 +03:00
George Hotz
9366a23eb0 test backward in test_tiny (#11697)
* test backward in test_tiny

* empty
2025-08-16 20:29:39 -07:00