Sieds Lykles
a3aeef45cc
associative variation of where branch-merging (#11851)
* add rule and test
* change comment
2025-08-26 19:27:05 +02:00
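A minimal sketch of the branch-merging identity such a rule presumably exploits (the exact associative variation is in the PR; this is just the base form):

```python
# Hypothetical sketch: two selects on the same condition under an
# associative op like + can merge into a single select:
#   select(c, a, b) + select(c, d, e) == select(c, a + d, b + e)
def select(c: bool, a, b):
    return a if c else b

for c in (True, False):
    a, b, d, e = 1, 2, 3, 4
    assert select(c, a, b) + select(c, d, e) == select(c, a + d, b + e)
```

Associativity is what lets a rewriter reach this form even when another term sits between the two selects, e.g. regrouping `(select(c,a,b) + x) + select(c,d,e)` before merging.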
b1tg
1dd613cb89
test float_to_bf16 round-to-even behavior (#11849)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-08-26 12:16:10 -04:00
b1tg
409399c609
fix nan in float_to_bf16 (#11843)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-08-26 11:42:25 -04:00
chenyu
f28f613f85
improved float_to_bf16 (#11848)
round instead of truncate
2025-08-26 11:14:06 -04:00
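The three float_to_bf16 commits above (round instead of truncate, NaN handling, round-to-even tests) can be illustrated with the standard bias-and-shift technique; this is a self-contained sketch, not tinygrad's actual code:

```python
import math
import struct

def float_to_bf16_bits(x: float) -> int:
    """float32 -> bf16 with round-to-nearest-even instead of truncation.
    Illustrative only, not tinygrad's implementation."""
    if math.isnan(x):
        return 0x7FC0  # canonical quiet NaN; the rounding bias below would turn NaN into inf
    (u,) = struct.unpack("<I", struct.pack("<f", x))
    # add 0x7FFF plus the LSB of the kept half: exact ties round to even
    u += 0x7FFF + ((u >> 16) & 1)
    return (u >> 16) & 0xFFFF

def bf16_bits_to_float(b: int) -> float:
    """bf16 is the top 16 bits of a float32, so widening is a lossless shift."""
    (f,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return f

assert float_to_bf16_bits(1.0) == 0x3F80
assert math.isnan(bf16_bits_to_float(float_to_bf16_bits(float("nan"))))
# 1 + 2**-9 is exactly halfway between bf16 neighbors: ties-to-even keeps 1.0
assert bf16_bits_to_float(float_to_bf16_bits(1.0 + 2**-9)) == 1.0
```

Without the explicit NaN branch, a NaN payload like 0x7F800001 carries into the exponent under the bias and comes out as infinity, which is the bug #11843 fixes.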
chenyu
337e979a59
call dtypes.as_const in Tensor(list) (#11840)
2025-08-25 22:08:26 -04:00
chenyu
ac3449b0c8
truncate_fp16 cleanup (#11838)
native `@` is default
2025-08-25 19:03:41 -04:00
qazal
a1f6823060
viz: memory layout in client side (#11830)
* viz: memory layout in client side
* update test_viz
2025-08-25 14:49:33 +03:00
Sieds Lykles
a286a1a6f7
Fast idiv: try removing factors of two before cast (#11824)
* try removing factors of two
* dont return if None
* add test
2025-08-24 20:04:25 +02:00
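A sketch of the arithmetic the title names, assuming it rests on the floor-division identity `x // (m << k) == (x >> k) // m` for non-negative `x` (helper names here are made up):

```python
def split_pow2(d: int):
    """Split a positive divisor into (odd_part, k) with d == odd_part << k."""
    k = (d & -d).bit_length() - 1  # count of trailing zero bits
    return d >> k, k

def idiv_via_shift(x: int, d: int) -> int:
    """For non-negative x: x // d == (x >> k) // odd_part.
    The cheap shift handles the power-of-two factor; only the odd part
    still needs a real division."""
    odd, k = split_pow2(d)
    return (x >> k) // odd

# exhaustive sanity check on a small range
for x in range(0, 1000, 7):
    for d in (1, 2, 6, 12, 40, 96):
        assert idiv_via_shift(x, d) == x // d
```

Pulling the factors of two out first keeps the value that reaches the remaining division (or its magic-number multiply) in a smaller range.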
George Hotz
6540bb32a6
move into codegen late [pr] (#11823)
2025-08-24 10:23:25 -07:00
Sieds Lykles
dd69114573
Revert "Better div nesting (#11811)" (#11818)
This reverts commit 952f729b07.
2025-08-24 18:11:24 +02:00
Sieds Lykles
952f729b07
Better div nesting (#11811)
* remove check
* use fold_divmod_congruence instead of simplify
* adjust tests
* shorten line
2025-08-24 04:17:40 +02:00
Sieds Lykles
e652062f92
tweak divmod_folding condition (#11810)
2025-08-24 02:59:02 +02:00
Sieds Lykles
07d4ed7e4c
one more symbolic add variation (#11807)
2025-08-24 01:15:04 +02:00
qazal
0d86288bd7
viz: calculate timeline fixed points in client side (#11805)
* viz: calculate timeline fixed points in client side
* 26 bytes / event
* math
2025-08-24 01:44:40 +03:00
George Hotz
a75da49951
use AxisType for UPCAST/UNROLL (#11800)
* use AxisType for UPCAST/UNROLL
* fixes
* fix the bug
* fix hack
* bad test
* flaky test
2025-08-23 14:44:48 -07:00
qazal
2407fecdae
viz bytepack format (#11792)
* viz bytepack format
Training a 1B llama yields ~20M profiler events.
With JSON serialization, the browser tries to load 6GB to memory. This OOMs since each tab is limited to <3-4GB memory usage. Using a packed format, we only need ~600MB.
**Design decisions:**
- Timestamps are in microseconds relative to start time. They're stored in u32, which can express up to ~1 hr of trace events.
- Strings (kernel names, metadata, etc) are deduped.
- Buffer sizes are in u64 nbytes.
More optimization possible:
- The string lookup is a JSON dumped array, we can compress this.
- Can store less for memory by moving the layout computation to the client.
**Results**
| | Events | JSON | bytepack |
|----------------|---------|-------------|-------------|
| DP=8 llama 1B train (`command: [1]`) | 24M | 5.8GB | 640MB |
| examples/beautiful_mnist.py | 16K | 3.7MB | 745KB |
| examples/gpt2.py | 55K | 12.54MB | 1.40MB |
`[1]`: `VIZ=1 FAKEDATA=1 OFFLOAD_OPTIM=1 DP=8 BS=8 GRADIENT_ACC_STEPS=2 BLOCK_REORDER=0 LR=3e-4 TRAIN_ON_VAL=1 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=1B WARMUP_STEPS=36 DECAY_STEPS=360 SEQLEN=8192 PYTHONPATH=. AMD=1 AMD_LLVM=0 MODEL=llama3 python3 examples/mlperf/model_train.py`
* python reference decoder
* 27 bytes / event, 1hr hard limit
2025-08-23 23:50:21 +03:00
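The design decisions above can be sketched as a packed encoder. The field order and widths here are illustrative (the commit lands at 27 bytes/event with more fields), but the u32 relative-microsecond timestamp, deduped string table, and u64 nbytes follow the description:

```python
import struct

class EventPacker:
    """Illustrative packed profiler-event encoder (layout is hypothetical):
    u32 timestamp in microseconds relative to trace start, u32 index into a
    deduped string table, u64 buffer size in bytes -> 16 bytes per event."""
    FMT = "<IIQ"

    def __init__(self, start_us: int):
        self.start_us, self.strings, self.index = start_us, [], {}

    def intern(self, s: str) -> int:
        """Dedupe strings (kernel names, metadata) into a shared table."""
        if s not in self.index:
            self.index[s] = len(self.strings)
            self.strings.append(s)
        return self.index[s]

    def pack(self, ts_us: int, name: str, nbytes: int) -> bytes:
        rel = ts_us - self.start_us
        assert 0 <= rel < 2**32  # u32 microseconds caps a trace at ~71 minutes
        return struct.pack(self.FMT, rel, self.intern(name), nbytes)
```

At 16 bytes/event this sketch would put 24M events around 384MB plus the string table, the same order of magnitude as the 640MB the real format reports against JSON's 5.8GB.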
qazal
b12d1d866c
count bytes per kernel in test_viz (#11801)
Currently at ~100 bytes/kernel with JSON.
2025-08-23 23:35:27 +03:00
Sieds Lykles
6a50ab6b87
adjust idiv min_max (#11802)
* change div min_max
* add tests
2025-08-23 22:25:51 +02:00
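A sketch of the kind of interval rule being adjusted, assuming a positive constant divisor: floor division is monotonically non-decreasing in the numerator, so the endpoint quotients bound the result. This is the textbook rule, not necessarily the PR's exact change:

```python
def idiv_min_max(lo: int, hi: int, d: int):
    """Bounds of x // d for x in [lo, hi] with d > 0: since floor division
    is monotonic in x, the endpoints give the min and max."""
    assert d > 0 and lo <= hi
    return lo // d, hi // d

# exhaustive check on a small range (Python // is floor division, like symbolic idiv here)
for lo in range(-8, 8):
    for hi in range(lo, 8):
        for d in (1, 2, 3, 5):
            vals = [x // d for x in range(lo, hi + 1)]
            assert (min(vals), max(vals)) == idiv_min_max(lo, hi, d)
```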
chenyu
9d4cccd0f9
test_dtype_alu cleanups (#11799)
2025-08-23 15:11:17 -04:00
George Hotz
aefabaf774
add AxisType to range (#11798)
* add AxisType to range
* missed them
* fix that test
* fix that test
2025-08-23 11:15:00 -07:00
qazal
b975830424
add profile loader helper in test_viz (#11797)
2025-08-23 19:20:29 +03:00
chenyu
7123df3928
Use Tensor.logaddexp to implement Tensor.softplus (#11796)
instead of piecewise linear; numerical stability is handled by logaddexp. jax does this and i think it's more elegant than torch's approach
2025-08-23 11:52:29 -04:00
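The identity in play: softplus(x) = log(1 + e^x) = logaddexp(x, 0). A stdlib sketch of why the logaddexp form is numerically safe (tinygrad's actual implementation lives on Tensor; this is just the scalar math):

```python
import math

def logaddexp(a: float, b: float) -> float:
    """Stable log(exp(a) + exp(b)): factor out the max so the remaining
    exponent is <= 0 and cannot overflow."""
    m = max(a, b)
    return m + math.log1p(math.exp(-abs(a - b)))

def softplus(x: float) -> float:
    """softplus(x) = log(1 + exp(x)) = logaddexp(x, 0)."""
    return logaddexp(x, 0.0)

assert softplus(1000.0) == 1000.0  # naive log(1 + exp(1000)) would overflow
assert abs(softplus(0.0) - math.log(2.0)) < 1e-12
```

For large positive x the result degrades gracefully to x, and for large negative x to 0, with no piecewise branch needed.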
chenyu
fb8ee02424
Tensor.logaddexp (#11793)
2025-08-23 09:15:00 -04:00
Sieds Lykles
5a6817d5f8
Fix z3 rendering of floats in indexing (#11740)
* Fix floating point comparison in indexing
* wrap in noop
* update tests
* improve rules for loading and comparing floats
* add test cast to bool
2025-08-23 05:56:19 +02:00
chenyu
e39b25cd36
upcast float exp to at least float32 (#11758)
* upcast float exp to at least float32
* unlucky seed
2025-08-22 20:16:34 -04:00
qazal
9ff03680ba
viz: store relative timestamps (#11787)
* viz: store relative timestamps
* err
* update test
2025-08-22 19:30:21 +03:00
geohotstan
1e679bd789
fix max_unpool2d inf (#11784)
* start
* add regression test for maxunpool2d
2025-08-22 08:31:24 -04:00
George Hotz
9832599c9e
test_vmap + permute isn't a sint (#11783)
* test_vmap + permute isn't a sint
* order
2025-08-21 22:39:35 -07:00
George Hotz
bb8de51e5f
remove unused early cleanups + contig w range [pr] (#11780)
* remove unused early cleanups [pr]
* contiguous with range
* woah, this works
2025-08-21 20:04:45 -07:00
chenyu
91a4de4ca7
fix getitem with inf in tensor (#11781)
2025-08-21 21:55:32 -04:00
George Hotz
5954a0975f
fix some assigns on rangeify (#11774)
* fix some assigns
* llvm test
* more tests
* upd test
2025-08-21 15:15:54 -07:00
qazal
2e0eb88549
viz: add metadata to UOp tracing (#11772)
* viz: add metadata to UOp tracing
* place after tag
* optional field
* err, refcount of root must be 0
2025-08-22 00:18:45 +03:00
George Hotz
9f94c25a25
fix symbolic usage. use shrink, not reshape (#11762)
* fix test_var
* revert those things
* fix the ones in test tiny
* use better syntax
* it's the same, but that's clearer
* fix pad
2025-08-20 18:35:42 -07:00
chenyu
5276fbc9c5
fix gather with inf values (#11760)
(mask * x) is wrong because 0*inf is nan. i feel we have a lot of those still...
2025-08-20 20:35:40 -04:00
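The footgun the commit body describes, in a few lines of plain Python: IEEE-754 makes 0 * inf a NaN, so a multiplicative mask cannot zero out an inf lane, while a where-style select can (`pick` is a made-up scalar stand-in for the select):

```python
import math

inf = float("inf")
# multiplicative masking breaks on inf: 0 * inf is nan by IEEE-754
assert math.isnan(0.0 * inf)

def pick(mask: bool, x: float, default: float = 0.0) -> float:
    """where-style select: never multiplies, so an inf in the masked-out
    lane cannot poison the result."""
    return x if mask else default

assert pick(False, inf) == 0.0  # mask * x would have produced nan here
assert pick(True, inf) == inf
```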
George Hotz
9635592141
** rangeify, try 3 (#11683)
* ** rangeify, try 3
* bring that over
* bufferize, don't use contig tag
* work
* ish
* fix rangeify
* flash attention is back
* fix rangeify tests
* stuff passes
* fix test_log_softmax
* more stuff passes
* progress children
* new endrange solution
* progress
* progress counter
* basic assign
* contigs only
* symbolic in schedule
* unbind_kernel
* late children
* ops fixed
* beautiful mnist is close
* that seems to work
* mnist works
* improve names
* fix bmnist
* no pcontig
* testing backward
* work
* clone movement ops
* new_range helper
* MBLOCK/MERGE
* ops tests pass
* revert mblock stuff
* cleanups...but it breaks ops
* remove reindex
* hack for relu
* disable the hacks
* more hacks
* upd
* mostly works with cleanups disabled
* ndr
* ops tests pass
* terrible hacks for indexing to work
* context mismatch
* pcontig
* split pcontig v contig
* z3 trunc
* null
* no fuse in rangeify
* ops test passes
* lnorm
* fix assign
* nd rangeify
* both should work
* tests for rangeify
* cleanups
* stores pass the pointer through
* disable pcontig for now
* PARTIAL_CONTIG is a flag
2025-08-20 14:22:44 -07:00
chenyu
d7553721d1
clean up test_dtype_alu (#11757)
remove the check that looks into schedule, only test if output matches
2025-08-20 14:36:18 -04:00
qazal
de4cb722a4
viz: add metadata and var_vals tracing (#11753)
* viz: add metadata and var_vals tracing
* add test_trace_metadata
* set TRACEMETA=1
2025-08-20 18:39:51 +03:00
chenyu
be7b0b6970
TRANSCENDENTAL_SUPPORTED_DTYPES->TRANSCENDENTAL_DTYPES (#11752)
2025-08-20 10:29:36 -04:00
ttomsa
220a2a88d7
a*(1/b) -> a/b on LLVM, CPU (#11743)
* add fdiv rewrite
* :)
* use float_lop
* use reciprocal()
* revert
* move to decompositions
2025-08-20 09:35:10 -04:00
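Besides saving an op, this rewrite changes rounding: a true divide rounds once, while multiplying by a precomputed reciprocal rounds the reciprocal and then the product. A quick check (the last-bit gap shown is a generic float64 fact, not specific to any backend):

```python
# the rewrite: a * (1.0 / b)  ->  a / b
# a true divide rounds once; the reciprocal form rounds twice,
# so the two can differ in the last bit:
a, b = 10.0, 3.0
recip_form = a * (1.0 / b)
div_form = a / b
assert div_form != recip_form           # double rounding shows up here
assert abs(div_form - recip_form) < 1e-15  # ...but only by one ulp
```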
George Hotz
12ab3f8b06
correct row_count in process replay (#11748)
2025-08-19 22:21:07 -07:00
George Hotz
8af8808c61
cleanup tests, bump caches (#11746)
2025-08-19 21:21:07 -07:00
George Hotz
00391db628
no ast for mem estimate (#11744)
* no ast for mem estimate
* skip for webgpu
2025-08-19 20:18:45 -07:00
ttomsa
70c3f1fb29
x.where(False, True) -> !x (#11738)
* add pat
* add test
2025-08-19 19:08:16 -04:00
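The pattern reads oddly until you unfold it: x.where(False, True) yields False when x holds and True otherwise, which is exactly logical NOT. A scalar sketch of the same select semantics:

```python
def where(cond: bool, a, b):
    """Scalar model of a where/select: pick a when cond is true, else b."""
    return a if cond else b

# x.where(False, True): False when x is true, True when x is false -> `not x`
for x in (True, False):
    assert where(x, False, True) == (not x)
```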
George Hotz
1d307f568c
move device tests to test/device + test cleanups (#11735)
* move device tests to test/device
* test speedups
* test device
* linalg to unit
* upd
* so pytest just works
* more divide and skip
* speed
* test devectorize
* add pillow
2025-08-19 16:02:20 -07:00
nimlgen
9c9e337c78
amd: parse soc enums (#11727)
* amd: parse soc enums
* remove from mock
* fix
* minimal amd_gpu
2025-08-19 15:06:09 +03:00
George Hotz
4b3fcb4064
Revert "REDUCE_AXIS keepdim=False (#11311)" (#11718)
This reverts commit b518a7378a.
2025-08-18 13:28:53 -07:00
b1tg
b518a7378a
REDUCE_AXIS keepdim=False (#11311)
* progress
* fix tests
* fix tests
* remove hack for test_symfold
* fix test_conv.py on llvm
* hack test_cache_speed
* lint
* remove hack for helper_linearizer_opt
* tests
* fix DSP
* clean up
* remove hack for kernelize.py
* hack for test/test_multitensor.py TestMultiTensor.test_matmul_shard_none
* clean
* uop.r need reshape?
* lower_store cause fail
* fix lower?
* avoid contiguous hack
* 2134
* conv2d count
* remove unused
* hack lower
* reduced and clean up
* fix TestMultiTensor.test_matmul_shard_none
* src sync + fix TestMultiTensor.test_matmul_shard_none
* remove excluded in mop
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2025-08-18 10:09:17 -07:00
chenyu
c30a113b2a
support bf16 and fp8 in Tensor.tolist (#11704)
memoryview does not support these dtypes, but casting works fine, so we cast first
2025-08-17 15:11:13 -04:00
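What "casting works fine" means concretely for bf16: memoryview has no bf16 format code, but every bf16 is the top half of a float32, so a lossless widening cast makes the buffer listable. A stdlib sketch (the helper name is made up):

```python
import struct

def bf16_buffer_tolist(buf: bytes) -> list:
    """Decode a little-endian bf16 buffer to a Python list of floats.
    Each bf16 is the top 16 bits of a float32, so shifting every u16 into
    the high half of a u32 is an exact cast to float32."""
    n = len(buf) // 2
    u16s = struct.unpack(f"<{n}H", buf)
    f32_bytes = struct.pack(f"<{n}I", *((b << 16) for b in u16s))
    return list(struct.unpack(f"<{n}f", f32_bytes))

assert bf16_buffer_tolist(struct.pack("<3H", 0x3F80, 0x4000, 0xC040)) == [1.0, 2.0, -3.0]
```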
qazal
d762edd694
viz: define tracks in python (#11701)
* viz: defines tracks in python
* update unittests
* figuring it out
* works
* diff cleanup
* math
* y axis is back
2025-08-17 18:19:13 +03:00
George Hotz
9366a23eb0
test backward in test_tiny (#11697)
* test backward in test_tiny
* empty
2025-08-16 20:29:39 -07:00