* more work parsing SQTT
* more minimal runner
* sep VIZ/PROFILE
* parse print new
* improve parser
* more filter
* that
* split them
* lil cleanup
* skip flaky test
* AQL in mmapeak
* trace buffer producer and consumers
* work
* generic colored util
* fix batched
* basic clicking works
* generic javascript that works for producer and consumers
* keep focused shape
* idle time
* timings for producer and consumers dedup
* from sd test
* tiny cleanups
* timeline
* work
* up to here
* assert
* list it
* work
* better viz names
* delete unused
* don't use opacity, it's multiplicative
* keep styles
* scrollbar coloring
* pyrender doesn't work here
lower all index dtypes (`beautiful_mnist` kernel `r_64_16_32_36`)
* add dtypes.index
* cast shape, stride and mask to dtypes.index in view.create
* move pm_lower_index_dtype to ops
* DEFINE_VAR is dtype.index by default
* merge var_val_using_str
* remove int from commutative
* fix test_rewrite_map
* change that to dtypes.index
* change some int to index
* shorten those
* remove old cast in renderer
* cleanup
* change that back
* add comment
* delete comment
* just delete those
* view doesn't have to cast anymore
* adjust comment
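The gist of the index-dtype change above: every shape/stride/mask value flows through a single `dtypes.index` dtype instead of raw Python ints. A minimal sketch of the casting idea, where `as_index` is a hypothetical helper name, not the actual tinygrad code:

```python
# hypothetical sketch of the view-side idea, not verbatim tinygrad code:
# sint values (int | UOp) feeding shape/stride/mask unify under dtypes.index
from tinygrad.dtype import dtypes
from tinygrad.uop.ops import UOp

def as_index(x: "int | UOp") -> UOp:
  # ints become index-typed constants, UOps get cast to dtypes.index
  return UOp.const(dtypes.index, x) if isinstance(x, int) else x.cast(dtypes.index)
```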
This enables seeing rewrites in unit tests that call graph_rewrite directly, e.g. `VIZ=1 python3 test/test_uop_graph.py TestUOpGraph.test_in_bounds_access_gated_local`.
`@track_rewrites` remains as an optional helper to organize larger traces.
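A minimal sketch of such a direct call; the `x + 0` matcher is an illustration, and the import path is an assumption (`tinygrad.ops` in older releases, `tinygrad.uop.ops` in newer ones):

```python
# a sketch, not verbatim tinygrad code; exact import path varies by version
from tinygrad.uop.ops import UOp, UPat, PatternMatcher, graph_rewrite, track_rewrites

# trivial matcher for illustration: fold x + 0 -> x
pm = PatternMatcher([(UPat.var("x") + 0, lambda x: x)])

@track_rewrites()  # optional: groups this trace under one named entry in VIZ
def simplify(sink: UOp) -> UOp:
  return graph_rewrite(sink, pm)

# with VIZ=1, every rewrite step inside graph_rewrite is recorded,
# whether or not the decorator is present
print(simplify(UOp.variable("x", 0, 10) + 0))
```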
* viz bytepack format
Training a 1B llama yields ~20M profiler events.
With JSON serialization, the browser tries to load ~6GB into memory, which OOMs since each tab is limited to roughly 3-4GB. With the packed format, we only need ~600MB.
**Design decisions:**
- Timestamps are microseconds relative to the trace start time, stored as u32: 2^32 µs ≈ 71 minutes, so a trace can hold up to ~1 hr of events.
- Strings (kernel names, metadata, etc.) are deduped into a lookup table.
- Buffer sizes are stored as u64 byte counts.
More optimization is possible:
- The string lookup table is a JSON-dumped array; we can compress this.
- The memory events could store less by moving the layout computation to the client.
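At 27 bytes/event (see the commits below), 24M events come to ~650MB, which lines up with the 640MB in the results table. A simplified sketch of what a reference decoder for a fixed-size record format can look like; the field layout here is an assumption for illustration, the real bytepack format packs more fields into its 27 bytes:

```python
# illustrative decoder for a hypothetical fixed-size event record;
# the actual bytepack layout differs, this only shows the technique
import struct

# assumed layout, little-endian: u8 kind, u32 start/end (us since trace
# start), u32 index into the deduped string table, u64 nbytes = 21 bytes
EVENT = struct.Struct("<BIIIQ")

def decode_events(buf: bytes, strings: list[str]):
  for off in range(0, len(buf) - EVENT.size + 1, EVENT.size):
    kind, st, et, name_idx, nbytes = EVENT.unpack_from(buf, off)
    yield {"kind": kind, "st_us": st, "et_us": et,
           "name": strings[name_idx], "nbytes": nbytes}
```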
**Results:**
| Workload | Events | JSON | bytepack |
|----------|--------|------|----------|
| DP=8 llama 1B train `[1]` | 24M | 5.8GB | 640MB |
| examples/beautiful_mnist.py | 16K | 3.7MB | 745KB |
| examples/gpt2.py | 55K | 12.54MB | 1.40MB |
`[1]`: `VIZ=1 FAKEDATA=1 OFFLOAD_OPTIM=1 DP=8 BS=8 GRADIENT_ACC_STEPS=2 BLOCK_REORDER=0 LR=3e-4 TRAIN_ON_VAL=1 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=1B WARMUP_STEPS=36 DECAY_STEPS=360 SEQLEN=8192 PYTHONPATH=. AMD=1 AMD_LLVM=0 MODEL=llama3 python3 examples/mlperf/model_train.py`
* python reference decoder
* 27 bytes / event, 1hr hard limit
* add mem_layout
* ui
* cleanup
* work
* debugLine work and expander
* tooltip style
* real expand device
* wheel does one thing
* diff
* shows llama oom
* add y axis
* mypy chill
* work
* unittests for the memory layout