Commit Graph

1363 Commits

Author SHA1 Message Date
nimlgen
d747eeed32 amd logs parser based on device (#11669) 2025-08-14 19:49:33 +03:00
geohotstan
1e904155e3 Add Onnx Huggingface to test/models/test_onnx.py (#11468)
* BOOM

* cache extra/huggingface/models/

* why max buffer size is not 0

* override MAX_BUFFER_SIZE

* less models

* remove more models and change cache dir to already cached dir

* only metal

* less is more?

* remove check ops

* why is this not setting the ENVVAR

* ughhhhh just test in models

* only cpu and gpu

* only cpu actually

* just override it idk

* final

* move extra dependencies up top

* simplification

* fix print

* make README better

* revert ops_disk fix for now

* clean up test_onnx

* remove testing fashion clip model cuz sloooowwwwww

* actually let METAL run this

* fix comment mistake

* fix download path in run_models

* does this work?

* cleanup setup and teardown

* contextvar like this?

* prove model is cached

* do I need to increment DOWNLOAD_CACHE_VERSION?

* see if cached with incremented DOWNLOAD_CACHE_VERSION

* use warnings to see if the model exists

* revert DOWNLOAD_CACHE_VERSION stuff and clean up

* add retry to download

* nit
2025-08-14 11:16:41 -04:00
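The bullets above are mostly test plumbing: caching HuggingFace models, overriding MAX_BUFFER_SIZE, and finally adding a retry to the download. A hedged sketch of that last idea only; the helper name and backoff policy here are hypothetical, not the PR's actual code:

```python
# Hypothetical download-with-retry helper in the spirit of "add retry to download".
import time, urllib.request

def fetch_with_retry(url: str, attempts: int = 3, backoff: float = 2.0) -> bytes:
  for i in range(attempts):
    try:
      with urllib.request.urlopen(url) as r:
        return r.read()            # success: hand back the raw bytes
    except OSError:
      if i == attempts - 1: raise  # out of attempts, surface the error
      time.sleep(backoff ** i)     # exponential backoff between tries
  raise RuntimeError("unreachable")
```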
kevvz
e2873a3a41 [bounty] Muon optim (#11414)
* newton schulz

* add muon + move newton schulz to tensor

* compact newton schulz

* better tests

* cleanup

* add comments for muon

* cleanup

* add export with tests

* match muon optim with test optim

* cleanup

* unused import

* correct comment

* whitespace

* move export

* muon test fix

* match reference impl + tests

* remove export by moving muon device

* add credit

* cleanup

* remove print

* spacing

* spacing

* comma

* cleanup

* removal

* fix tests + optim momentum

* consistent is not / not

* more consistency

* fix test

* cleanup

* fix the nones

* remove comment

* cast

* comment

* comment

* muon teeny test

* muon flag beautiful mnist

* set steps

* steps as hyperparam

* match default test steps

* name

* large cleanup

* don't care about steps

* nesterov false default

* match each other's impl

* steps

* switch nest

* swap defaults

* update docstring

* add no nesterov test

* ban fuse_optim

* prints

* classical momentum

* alternative condition

* recon

* pre + post wd

* false default

* detach

* signature changes

* context

* swap order

* big cleanup

* 0 step instead

* parity

* remove fuse

* remove fused

* better paper

* assert message

* correct shape check + eps

* multidim

* add eps

* cleanup

* correct assert message

* lint

* better tests

* naming

* ns_steps, ns_params

* update docstring

* docstring

* match sgd and muon together

* sandwich

* add back fused

* parity

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-08-13 14:27:55 -04:00
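For context on the Muon bullets above: Muon orthogonalizes each 2-D gradient with a Newton-Schulz iteration before applying the update. A minimal NumPy sketch, assuming the coefficients from Keller Jordan's reference implementation rather than tinygrad's exact Tensor code:

```python
# Hedged sketch of zeropower-via-Newton-Schulz as used by Muon; reference
# coefficients (3.4445, -4.7750, 2.0315), illustrative only.
import numpy as np

def newton_schulz(G: np.ndarray, steps: int = 5, eps: float = 1e-7) -> np.ndarray:
  a, b, c = 3.4445, -4.7750, 2.0315
  X = G / (np.linalg.norm(G) + eps)        # Frobenius norm bounds spectral norm by 1
  if G.shape[0] > G.shape[1]: X = X.T      # iterate on the wide orientation
  for _ in range(steps):
    A = X @ X.T
    X = a * X + (b * A + c * (A @ A)) @ X  # quintic polynomial iteration
  return X.T if G.shape[0] > G.shape[1] else X
```

The iterate converges toward the nearest semi-orthogonal matrix (U Vᵀ from the SVD G = U Σ Vᵀ), which the optimizer applies in place of the raw momentum update.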
geohotstan
cf7224ce3e fully lint onnx.py (#11647)
* mypy

* ruff ruff ruff
2025-08-13 08:22:06 -07:00
geohotstan
925555b62a Fix onnx Domain bug (#11650) 2025-08-13 08:20:50 -07:00
chenyu
3fb79bb43a minor onnx cleanups (#11642) 2025-08-13 01:05:19 -04:00
chenyu
e9e5a08a04 simplify onnx cubic (#11641)
we can drop the double where and abs since we know which ranges the inputs map into
2025-08-12 19:57:31 -04:00
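To make the note above concrete: for cubic resize, the four taps around a sample sit at known distances 1+t, t, 1-t, 2-t for a fractional offset t in [0, 1), so each tap falls in exactly one branch of the Keys kernel and the generic abs/where piecewise form is redundant. A hedged sketch with the ONNX default cubic_coeff_a = -0.75; function names are illustrative:

```python
# Hedged sketch: branch-free cubic convolution weights for fractional offset t.
def cubic_weights(t: float, A: float = -0.75) -> list[float]:
  inner = lambda x: (A + 2) * x**3 - (A + 3) * x**2 + 1          # |x| <= 1 branch
  outer = lambda x: A * x**3 - 5 * A * x**2 + 8 * A * x - 4 * A  # 1 < |x| < 2 branch
  return [outer(1 + t), inner(t), inner(1 - t), outer(2 - t)]

assert abs(sum(cubic_weights(0.5)) - 1.0) < 1e-9  # the four weights sum to 1
```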
geohotstan
ad9dec25b3 combine onnx parser and onnx (#11485)
* start

* more

* fix onnx_runner test

* pass

* patch for disk and add domains from huggingface

* simpler docs

* revert domain changes

* rerun ci

* revert onnx ops test change

* add fix from strenum stuff

* correct way

* revert correct way to leave the fix for another PR

* test segfault

* Revert "test segfault"

This reverts commit 4e1aaf41e7.

* remove some unnecessary documentation

* test segfault again

* Revert "test segfault again"

This reverts commit 56fc5f03e7.

* try gemini suggested patch for sys._getframe

* keep trying with gemini

* revert not working gemini suggestions and try faulthandler

* remove pythonfaulthandler

* trigger CI a few times

* minimize diff

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-08-12 12:56:39 -04:00
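An aside on the segfault-hunting bullets above: PYTHONFAULTHANDLER refers to Python's stdlib faulthandler module, which dumps tracebacks when the interpreter crashes hard.

```python
# Enabling faulthandler in code; equivalent to running with PYTHONFAULTHANDLER=1.
import faulthandler
faulthandler.enable()  # on SIGSEGV, SIGFPE, SIGABRT, etc., dump tracebacks to stderr
```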
Joshua Kissoon
c44760c89d torch backend: fix arange, add linalg.cross, add tests (#11628) 2025-08-11 23:34:41 -04:00
geohotstan
27bcb9fd1c Support cubic mode for ONNX Resize OP (#11612)
* start

* add reference

* this is so much slower

* this makes sense but differs from the official impl, yet results are still correct...?

* add a comment

* Just keep it simple for now since I don't fully get it yet

* address comments

* correct

* teeny clean up

* another small comment improvement lol
2025-08-11 11:49:30 -04:00
geohotstan
b0dab6a4cd onnx Resize OP clean up (#11603)
* start

* slight clean up
2025-08-10 14:10:39 -04:00
chenyu
ef17af85c6 remove .float call in llama logit (#11598)
* remove .float call in llama logit

* bfloat item
2025-08-10 00:02:18 -04:00
chenyu
3e64467322 remove freqs_cis contiguous in llama (#11597) 2025-08-09 21:11:12 -04:00
qazal
793ace530e update amd_uop_matmul.py import (#11581)
Using this for testing SQTT
2025-08-08 17:07:35 +03:00
George Hotz
82be8abfd2 move opt under codegen (#11569) 2025-08-07 14:19:17 -07:00
chenyu
702e38dc19 remove FUSE_ARANGE_UINT (#11567)
also add IGNORE_OOB=1 to bert runs. Lowered BS on tinybox to 90 since 96 OOMs during eval without a reset.
2025-08-07 16:49:06 -04:00
geohotstan
1163292759 move onnx_parser into onnx (#11530) 2025-08-06 10:46:27 -04:00
nimlgen
eafc7fda12 upd perfetto (#11528) 2025-08-06 14:00:34 +03:00
nimlgen
4877aa965a ast seems to probe nv as well (#11494) 2025-08-04 11:47:07 +03:00
George Hotz
8ff03806e8 add llama layers (#11460)
* add llama layers

* add contig bw for speed
2025-07-31 16:28:04 -07:00
George Hotz
474ee9daa5 hotfix: add contiguous_backward to llama 2025-07-31 15:07:12 -07:00
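The hotfix above refers to Tensor.contiguous_backward, which leaves the forward value untouched but inserts a contiguous op on the incoming gradient during backprop. A minimal usage sketch, assuming the current tinygrad API:

```python
# Hedged sketch of contiguous_backward, the call the llama hotfix adds for speed.
from tinygrad import Tensor

x = Tensor.randn(4, 4, requires_grad=True)
y = x.permute(1, 0).contiguous_backward().sum()  # gradient made contiguous in backward
y.backward()
print(x.grad.shape)  # (4, 4)
```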
kevvz
c3cfcb50cb Add linalg_det and test for torch backend (#11405)
* add linalg_det and test

* space

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-07-30 22:04:44 -04:00
wozeparrot
825b6a2505 feat: llama3 dataloader (#11340) 2025-07-30 13:27:55 -07:00
nimlgen
5fc5bb5237 ci: clear processes (#11434)
* unified hcq_smi for management

* fix

* fix

* no reset for amd
2025-07-30 22:15:18 +03:00
George Hotz
4f26a9ad32 check elements_per_thread in tensorcore [pr] (#11435) 2025-07-30 11:55:48 -07:00
George Hotz
1bef2d80c1 unrolls are all in the same scope (#11429)
* unrolls are all in the same scope

* fix that import
2025-07-29 16:55:37 -07:00
George Hotz
03909f2772 permute locals for HL uop matmul (#11412)
* permute locals for HL uop matmul

* parens fix that

* permutes

* 20 TFLOPS
2025-07-29 08:19:59 -07:00
George Hotz
735ad5f10d kernel4 and 5 in uops (#11411)
* move simplify views to merge views

* add amd kernel 4

* Revert "move simplify views to merge views"

This reverts commit 1e07dff384.

* k4 in python

* kernel4 written in uops

* k5 support

* cleanups
2025-07-28 19:35:48 -07:00
George Hotz
fddc645668 HL=2 top matmul (#11406)
* HL=2 top matmul

* top colored
2025-07-28 12:32:38 -07:00
George Hotz
dfeee63d30 uop matmul work (#11388)
* uop matmul work

* works with locals
2025-07-26 21:23:55 -07:00
George Hotz
2c70eaf18c fix load / barrier (#11386)
* fix load / barrier

* cleanups

* fix CI
2025-07-26 10:27:37 -07:00
George Hotz
466ab5a3f2 store/load not pass through index (#11381)
* noop

* fix noop

* store cat is NOOP

* store dtype is void

* stores aren't passed through anymore

* meh, skip those for ptx

* correct ptx skip

* hl runs
2025-07-25 21:01:47 -07:00
chenyu
3d68feb67d minor onnx Gather cleanup (#11375)
removed a type ignore and one error code skip
2025-07-25 21:08:08 -04:00
George Hotz
490a93902c define reg doesn't have init anymore (#11365)
* define reg doesn't have init anymore

* remove that

* no special logic for dr

* fix amd uop matmul
2025-07-24 19:15:49 -07:00
George Hotz
0602b22086 kernel spec (#11359)
* kernel spec

* ops.VIEW

* work
2025-07-24 12:45:38 -07:00
George Hotz
b0dc97d1f7 write out kernel 3 in uops (#11352)
* write out kernel 3 in uops

* matmul is correct

* gemm passes spec

* bugfix to match speed

* cleanups
2025-07-23 17:32:38 -07:00
chenyu
86e7504111 mypy check extra/onnx.py (#11348)
instead of running tests with 3.10, add onnx to mypy, which would have caught the StrEnum regression. Several type annotations that fail mypy but don't affect running the code were skipped for now.
2025-07-23 12:42:59 -04:00
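Why the mypy check catches it: enum.StrEnum only exists from Python 3.11 on, so a mypy run pinned to 3.10 flags the import that broke the 3.10 training tests below.

```python
# Hedged illustration (the class name is hypothetical): under
#   mypy --python-version 3.10
# this import is an error, since typeshed has no enum.StrEnum before 3.11.
from enum import StrEnum

class OnnxDomain(StrEnum):
  AI_ONNX = "ai.onnx"
```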
chenyu
960da9319d Remove StrEnum in onnx for python 3.10 (#11345)
some training tests failed; looks like a parsing error?
2025-07-23 11:52:25 -04:00
George Hotz
108aac8af4 use AddrSpace instead of local (#11314)
* use AddrSpace instead of local

* addrspace in test
2025-07-21 14:00:06 -07:00
geohotstan
445ff8de56 ONNX onnx_parser and buffer_parse clean up (#11000)
* start

* remove onnx.load from compile4 and move np to dropout

* clean up and enable test

* clean up

* move WebGPU ONNX test into MacOS (WebGPU)

* leave test in ONNX (CPU)

* fix raw_data init None, and simplify onnx_runner test a little?

* THESE TESTS ARE SO UGLY UGHH

* need to really think about how to structure the test

* wow LLMs are quite something

* not always on disk now

* also add external data loading test

* cleaner tests

* minimize diff and add const folding tests

* add external data loading too

* whoops, add webgpu back... but why was it not needed in the first place?

* better comment

* move webgpu test to macos(webgpu)?

* llm english so much better than me wow

* trigger CI to check flakiness

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-07-21 15:10:25 -04:00
George Hotz
842184a1ab rename kernelize to schedule, try 2 (#11305) 2025-07-21 11:18:36 -07:00
वेदांत
e368628736 Add amin support to Tensor operations in Torch backend (#11290)
* integer div mod fix

* Revert "intiger div mod fix"

This reverts commit d5d2f201bf.

* feat arg_min support

* tests update

* test fix
2025-07-21 09:14:08 -04:00
nimlgen
cc3c1e4c14 hcq: move cpu to hcq (#11262)
* hcq: move cpu to hcq

* import time

* upd

* fix

* windows support

* hm

* cleaner

* fix timer

* fix timing

* std is ns

* skip profiler

* mypy

* cleaner

* cleanups

* after merge

* default is back
2025-07-21 15:10:38 +03:00
nimlgen
2f72be5055 nv_smi: init basic insmod/rmmod/reset cmds (#11282) 2025-07-19 15:43:03 +03:00
qazal
577e581943 fix typo in sqtt/readme (#11281) 2025-07-19 15:10:24 +03:00
geohotstan
536b254df4 Bump onnx to 1.18.0 (#11266)
* bump

* thou hast implemented functions

* hacked in domain support

* some clean ups

* hack quantize_onnx_test too

* add helper lol, why onnx tests why

* better dispatcher, but need tests and better naming

* flaky ci

* change some names

* small clean ups

* make it easier to clean up tests once ORT supports 1.18.0

* nits

* fix bug of Softmax_1 being registered in onnx_ops

* need a default value

* resolve_const is better name

* fix OnnxRunner.to

* use proper domain names
2025-07-17 15:35:41 -04:00
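On the "hacked in domain support" and "better dispatcher" bullets: ONNX nodes carry a (domain, op_type) pair, so a dispatcher naturally keys its registry on both. A hedged sketch; the registry and names are hypothetical, not the PR's actual structure:

```python
# Hypothetical (domain, op_type) dispatch table for ONNX op implementations.
from typing import Callable

OP_REGISTRY: dict[tuple[str, str], Callable] = {}

def register(domain: str, op: str):
  def deco(fn: Callable) -> Callable:
    OP_REGISTRY[(domain, op)] = fn
    return fn
  return deco

@register("", "Relu")  # the empty/default domain means ai.onnx
def relu(x): return x if x > 0 else 0

def dispatch(domain: str, op: str) -> Callable:
  return OP_REGISTRY[(domain, op)]
```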
chenyu
c8e5c4d7c3 insert_before -> insert_at [pr] (#11257)
more precise
2025-07-15 17:44:34 -04:00
chenyu
a0438012af remove Kernel.get_program [pr] (#11203) 2025-07-12 20:50:29 -04:00
George Hotz
d67c8e7b42 local metal on metal in uop syntax (#11185)
* local metal on metal in uop syntax

* TODO: just put the axis_info in the kernelinfo

* local

* amd_matmul works @ 28 TFLOPS

* clean up matmul

* kernel8 works

* remove that

* locals

* axistype innovation

* work

* cleanup

* kernel3 regs

* cleanup kernel3

* work

* why is it broken

* no beam

* reenable

* permutes
2025-07-12 16:31:19 -07:00
geohotstan
5ce278b245 OnnxRunner file as input (#10789)
* file path as input and have parse be in OnnxRunner.__init__

* modelproto_to_onnxrunner -> modelproto_to_runner

* whoops, fix import

* oh flakiness again, is it because it's getting gc-ed?

* small changes

* CI flaky so just move compile4 fix in

* copy typing of onnx_load

* actually can just import onnx_load instead of onnx.load

* fix external_benchmark_openpilot

* fix onnx_runner test to use onnx_helper

* rerun CI

* try run_modelproto

* spam CI a few times

* revert run_modelproto since that's flaky also

* no external onnx_load usage except onnx.py

* cursor tab complete is evil. Snuck a darn sorted in. But does order change result? Why?

* model_benchmark 193s -> 80s, add OnnxRunner.to()...

* minimize diff and clean up

* device can be None, weird but eh

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-07-12 14:27:46 -04:00