* add dtypes.index
* cast shape, stride and mask to dtypes.index in view.create
* move pm_lower_index_dtype to ops
* DEFINE_VAR is dtypes.index by default
* merge var_val_using_str
* remove int from commutative
* fix test_rewrite_map
* change that to dtypes.index
* change some int to index
* shorten those
* remove old cast in renderer
* cleanup
* change that back
* add comment
* delete comment
* just delete those
* view doesn't have to cast anymore
* adjust comment
* var_vals is str,int
* remove imports
* remove print
* fix test
* change var_vals in hcq
* update test_hcq
* fix multitensor _device_num var
* fix syminfer test
* shorten line
* p.vars stays list[Variable]
* shorten line
* vars is back to tuple[Variable, ...]
* change var_vals in extra
* change var_vals from shapetracker
* var_vals is str:int
* fix signature
* add rule and test
* more rules and tests
* add all four variations
* fix test
* test fixed!
* adjust comment
* add new variations
* disable intel tensor core ops count test for bigger_matmul_half
This makes rewrites visible in unit tests that call graph_rewrite directly, e.g. `VIZ=1 python3 test/test_uop_graph.py TestUOpGraph.test_in_bounds_access_gated_local`.
`@track_rewrites` remains available as an optional helper for organizing larger traces.
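As a rough illustration (not taken from the PR), a test that calls graph_rewrite directly looks roughly like the sketch below; the import path and exact pattern API vary across tinygrad versions, so treat the names as assumptions. Running it under `VIZ=1` is enough to see each rewrite step.

```python
# hedged sketch: a direct graph_rewrite call that VIZ=1 can display without
# @track_rewrites. Import paths/APIs may differ across tinygrad versions.
from tinygrad.ops import UOp, UPat, PatternMatcher, graph_rewrite

# toy rule: x * 1 -> x
pm = PatternMatcher([
  (UPat.var("x") * 1, lambda x: x),
])

a = UOp.variable("a", 0, 10)      # a symbolic variable
out = graph_rewrite(a * 1, pm)    # run with VIZ=1 to inspect this rewrite
print(out)
```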
* simple kernel to replace Kernel for postopt
* support old
* fix beam
* beaming
* beam on old
* bring tensor cores back
* raise
* postbeam
* test ops passes on mac
* skip that
* postopt default
* gate that
* fix tensor cores
* a few test fixes
* dsp fix
* tc fix
* loop
* support swap
* test_gemv
* fix beam for variable
* test opts from high level stuff
* range annoying
* compile slow
* metal slow
* better beam
* no POSTBEAM
* fix nolocals
* hc opt mostly works
* put that back
* lil
* some work
* fix that
* POSTOPT 2
* fix tests
* no postopt 2
* work
* back
* padded tensor cores
* shift_to
* postopt 0 passes?
* write PADTO
* fix padded tensor cores
* compare hcopt
* 18000 lines
* should pass tests
* fix rangeify
* put types back
* add overflows helper
* add rules
* x -> y
* check overflow of u too
* cleaner
* use alu instead of replace to preserve vectorization
* just one rule
* add test
* Modify tests and start work towards removing symbolic reshape
* Refactor symbolic reshape
* fix small error
* much cleaner + fix more tests
* Can remove this now
* Update test_symbolic_ops and test_tiny
* Couple more tests
* Unused import
* More tests and add EXPAND to Tensor.empty
* Fix test beam search
* all int
* Fix rangeify by adding shrink
* Remove OOB check and so fix test_symbolic_jit
* test_symbolic_jit doesn't need OOB Context anymore either
* Should remove that test now
* Cleanups part 1
* fix linters
* Final cleanups
* Don't reassign inside for loop
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* viz bytepack format
Training a 1B llama yields ~20M profiler events.
With JSON serialization, the browser tries to load 6GB into memory. This OOMs since each tab is limited to <3-4GB of memory usage. Using a packed format, we only need ~600MB.
**Design decisions:**
- Timestamps are in microseconds relative to the start time. They're stored as u32, which can express up to ~1 hr of trace events (2^32 µs ≈ 71 minutes).
- Strings (kernel names, metadata, etc.) are deduped.
- Buffer sizes are stored as u64 nbytes.
More optimizations are possible:
- The string lookup table is a JSON-dumped array; we could compress it.
- We could store less for memory by moving the layout to the client.
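To make the format concrete, here is a rough decoding sketch. The real bytepack layout (field order, widths, the 27-byte event record) lives in the viz code and the Python reference decoder mentioned below; the struct format here is an assumption, not the actual spec.

```python
# hedged sketch of decoding a packed event stream. The assumed per-event layout
# (u32 timestamp in µs since start, u32 index into the deduped string table,
# u64 buffer size in nbytes) is illustrative only; the real format differs.
import json, struct

def decode_events(buf: bytes, string_table_json: bytes):
  strings = json.loads(string_table_json)   # deduped strings, JSON-dumped array
  fmt = "<IIQ"                              # u32 ts_us, u32 name_idx, u64 nbytes
  step = struct.calcsize(fmt)
  for off in range(0, len(buf), step):
    ts_us, name_idx, nbytes = struct.unpack_from(fmt, buf, off)
    yield ts_us, strings[name_idx], nbytes
```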
**Results**
| Workload | Events | JSON | bytepack |
|----------------|---------|-------------|-------------|
| DP=8 llama 1B train (`command: [1]`) | 24M | 5.8GB | 640MB |
| examples/beautiful_mnist.py | 16K | 3.7MB | 745KB |
| examples/gpt2.py | 55K | 12.54MB | 1.40MB |
`[1]`: `VIZ=1 FAKEDATA=1 OFFLOAD_OPTIM=1 DP=8 BS=8 GRADIENT_ACC_STEPS=2 BLOCK_REORDER=0 LR=3e-4 TRAIN_ON_VAL=1 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=1B WARMUP_STEPS=36 DECAY_STEPS=360 SEQLEN=8192 PYTHONPATH=. AMD=1 AMD_LLVM=0 MODEL=llama3 python3 examples/mlperf/model_train.py`
* python reference decoder
* 27 bytes / event, 1hr hard limit