* simple kernel to replace Kernel for postopt
* support old
* fix beam
* beaming
* beam on old
* bring tensor cores back
* raise
* postbeam
* test ops passes on mac
* skip that
* postopt default
* gate that
* fix tensor cores
* a few test fixes
* dsp fix
* tc fix
* loop
* support swap
* test_gemv
* fix beam for variable
* test opts from high level stuff
* range annoying
* compile slow
* metal slow
* better beam
* no POSTBEAM
* fix nolocals
* hc opt mostly works
* put that back
* lil
* some work
* fix that
* POSTOPT 2
* fix tests
* no postopt 2
* work
* back
* padded tensor cores
* shift_to
* postopt 0 passes?
* write PADTO
* fix padded tensor cores
* compare hcopt
* 18000 lines
* should pass tests
* fix rangeify
* put types back
* start
* tiny clean up
* whoops, didn't mean to accidentally fix this
* fix .to(device), kinda hacky and this fix makes it slower?
* merge properly
* FINALLY figured out slowness, also hack pylint for now
* add DEBUGONNX print for subgraph
* oops
* WOOOOOOOO SHAPE CACHE 50% SPEED INCREASE
* small fix, but maybe all deterministic Tensor creation in fp should be cached
* cache condition
* sliiiightly cleaner
* better abstraction?
* remove sam from model_benchmark
* remove shape cache speed up for now
* less lines
* isinstance fix
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* move device tests to test/device
* test speedups
* test device
* linalg to unit
* upd
* so pytest just works
* more divide and skip
* speed
* test devectorize
* add pillow
* fast idiv for signed ints
* Add rule and test
* fix tests
* redo fuzz_fast_idiv to do negative ints as well
* adjust comments
* remove unused imports
* start
* more
* fix onnx_runner test
* pass
* patch for disk and add domains from huggingface
* simpler docs
* revert domain changes
* rerun ci
* revert onnx ops test change
* add fix from strenum stuff
* correct way
* revert correct way to leave the fix for another PR
* test segfault
* Revert "test segfault"
This reverts commit 4e1aaf41e7.
* remove some unnecessary documentation
* test segfault again
* Revert "test segfault again"
This reverts commit 56fc5f03e7.
* try gemini suggested patch for sys._getframe
* keep trying with gemini
* revert non-working gemini suggestions and try faulthandler
* remove pythonfaulthandler
* trigger CI a few times
* minimize diff
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* start
* add reference
* this is so much slower
* this makes sense but differs from official impl, but results are still correct..?
* add a comment
* Just keep it simple for now since I don't fully get it yet
* address comments
* correct
* teeny clean up
* another small comment improvement lol
* fix ptx process replay
* keyword arg
* renderer is also optional [pr]
* test_linearizer fixup
* name function order is args,ret,kwargs
* can use opts_to_apply
* pass through p.applied_opts
* sink_arg
* now it opens devices too
* start
* remove onnx.load from compile4 and move np to dropout
* clean up and enable test
* clean up
* move WebGPU ONNX test into MacOS (WebGPU)
* leave test in ONNX (CPU)
* fix raw_data init None, and simplify onnx_runner test a little?
* THESE TESTS ARE SO UGLY UGHH
* need to really think about how to structure the test
* wow LLMs are quite something
* not always on disk now
* also add external data loading test
* cleaner tests
* minimize diff and add const folding tests
* add external data loading too
* whoops add webgpu back.. but why was it not needed in the first place?
* better comment
* move webgpu test to macos(webgpu)?
* llm english so much better than me wow
* trigger CI to check flakiness
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* bump
* thou hast implement functions
* hacked in domain support
* some clean ups
* hack quantize_onnx_test too
* add helper lol, why onnx tests why
* better dispatcher, but need tests and better naming
* flaky ci
* change some names
* small clean ups
* make it easier to clean up tests once ORT supports 1.18.0
* nits
* fix bug of Softmax_1 being registered in onnx_ops
* need a default value
* resolve_const is better name
* fix OnnxRunner.to
* use proper domain names
* file path as input and have parse be in OnnxRunner.__init__
* modelproto_to_onnxrunner -> modelproto_to_runner
* whoops, fix import
* oh flakiness again, is it because it's getting gc-ed?
* small changes
* CI flaky so just move compile4 fix in
* copy typing of onnx_load
* actually can just import onnx_load instead of onnx.load
* fix external_benchmark_openpilot
* fix onnx_runner test to use onnx_helper
* rerun CI
* try run_modelproto
* spam CI a few times
* revert run_modelproto since that's flaky also
* no external onnx_load usage except onnx.py
* cursor tab complete is evil. Snuck a darn sorted in. But does order change result? Why?
* model_benchmark 193s -> 80s, add OnnxRunner.to()...
* minimize diff and clean up
* device can be None, weird but eh
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* start LLM app, tons of clean up required. target is 200 line ollama
* kind of works
* simpler
* add k/v cache
* with SYM=1, it loops
* no rope cache
* simpler
* more cleanups
* cleanups
* works
* argparse and comments
* from gguf
* generate is a function
* no copy from cpu
* fix max context pass in
* test
* improve test
* ai2_arc
* fix 8B, use less ram
* 136 lines
* print inputs to get_program in process replay [pr]
* colors
* keep dataclass default escapes
* Revert "keep dataclass default escapes"
This reverts commit c6db7e8a7a.
* note for ast_repr
* add that back
* fix process replay diff in PYTHON device [pr]
The PYTHON backend pickles and encodes UOps, the encoded binary can't be
directly diffed in process replay.
* note
* try
* ruff check --fix
* no skip test
* hmmmmmmm I don't get this D:
* run CI again
* why is PYTHON device faster than CPU?
* run ci again and fix lint
* actually doesn't PYTHON device make sense here?
* see cpu speed again
* Revert "see cpu speed again"
This reverts commit 1e366f2256.
* trigger CI
* pretty good
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* fix extract_dataset + tests
* add CI
* sops.gz itself is same as master
* yml + gzip -c + ge
* don't commit that
* bump limit to 1000
* axis=7
* test_tiny
* squash commits
* temp fix for const tensor
* actually realizing float16 can only happen in raw_data
* .float -> cast(float) to rerun CI
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* proposal: add option to override opts in the get_program API
* update test_linearizer_rewrite
* state in uops
* update process_replay and names
* empty isn't none
* fix process replay