Commit Graph

1248 Commits

George Hotz
4e6665bda5 different way to write torch backend (#9197)
* different way to write torch backend

* both backends

* more work

* simpler code

* more work

* test both

* imply unwrap/wrap

* FORWARD_ONLY=1 TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_add works

* ready to start making test_ops work in torch backend

* backward pass, TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_add works

* FORWARD_ONLY=1 TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_simple_conv2d works

* matmul backward is broken with as_strided
2025-02-22 14:42:26 +08:00
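A minimal sketch of the wrap/unwrap idea mentioned above, assuming the backend converts tensors at the op boundary. The helper names and the numpy round-trip are illustrative only, not the PR's actual implementation:

```python
import torch
from tinygrad import Tensor

def unwrap(t: torch.Tensor) -> Tensor:
  # illustrative only: round-trip through numpy to get a tinygrad Tensor
  return Tensor(t.detach().cpu().numpy())

def wrap(t: Tensor) -> torch.Tensor:
  return torch.from_numpy(t.numpy())

def tiny_add(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
  # the real backend dispatches torch ops to tinygrad; this only shows the boundary
  return wrap(unwrap(a) + unwrap(b))

print(tiny_add(torch.ones(2), torch.ones(2)))  # tensor([2., 2.])
```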
geohotstan
6587c7879b simple fixes to onnx (#9195)
* uncontroversial changes

* cleaner _prepare_quantize
2025-02-21 13:10:06 -05:00
George Hotz
e87be0131e torch backend start (#9191)
* start torch backend

* progress

* ugh, you need cpp crap

* 1+1 works

* 1+1 works

* becoming a real backend

* ready to merge?
2025-02-21 16:57:28 +08:00
chenyu
2e7c2780a9 CLANG -> CPU (#9189) 2025-02-20 18:03:09 -05:00
chenyu
1692087db5 _one_hot_along_dim input needs to be int (#9179)
* _one_hot_along_dim input needs to be int

indexing and one-hot compare against an arange, so a non-int dtype is likely a bug
2025-02-20 09:00:43 -05:00
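For context, one-hot along a dim is built by comparing indices against an arange, which is why an integer dtype matters. A rough illustration of the pattern (not the internal helper itself):

```python
from tinygrad import Tensor, dtypes

idx = Tensor([2, 0, 1], dtype=dtypes.int32)
num_classes = 4
# compare each index with arange(num_classes); equality against integer
# positions only makes sense when the indices themselves are ints
one_hot = (idx.unsqueeze(-1) == Tensor.arange(num_classes)).where(1, 0)
print(one_hot.numpy())
# [[0 0 1 0]
#  [1 0 0 0]
#  [0 1 0 0]]
```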
George Hotz
bf36967883 cuda hooking (#9180)
* cuda hooking

* progress

* more hook cuda

* fix params

* compile + cuMemHostAlloc hook

* work

* revert that
2025-02-20 19:20:01 +08:00
chenyu
975c318dbc bert use int32 for input ids (#9173)
original data was int32 for these. float might have caused precision issues
2025-02-19 08:17:27 -05:00
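A quick illustration of the precision hazard with float ids (BERT's vocabulary ids are small, so this shows the general risk rather than a failure observed in the PR):

```python
import numpy as np

# float32 has a 24-bit significand, so integer ids above 2**24 stop being exact
token_id = 2**24 + 1
print(np.float32(token_id) == token_id)  # False: rounds to 16777216.0
print(np.int32(token_id) == token_id)    # True: exact
```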
George Hotz
7eea9b639d hotfix: add replay_pkl debugging env 2025-02-17 17:34:58 +08:00
chenyu
1fda98d14f fix import time_linearizer [pr] (#9118)
the only test that used it was skipped in CI due to being slow
2025-02-15 21:33:28 -05:00
Josh Moore
1f9d2442b9 Add Tensor.scatter_reduce (#8947)
* pytorch scatter -> scatter_reduce

* WIP scatter_reduce implementation

* _pre_scatter return type hint

* split out src, mask to satisfy linter

* Add src cast back in

* dict of lambdas instead of ifs

* sum and prod reduction ops with include_self

* add reduce arg error message

* add amax and amin reduction ops

* Fix include_self for higher dims

* Simplify

* Simplify amax and amin too

* Pull include_self logic out into _inv_mask function

* reduce arg cannot be None for scatter_reduce

* Fix self-mask issue

* Add mean reduce op

* Add tests

* any() not needed here

* remove comment

* End support for Tensor src with reduce arg in tinygrad scatter

* Process index, dim inside actual functions

* Add scatter_reduce to onnx

* Add excluded onnx ScatterElements reduction tests back in

* Save 2 lines on the mask helpers

* Update docs

* Add include_self=False tests

* cleanup

* Remove unneeded helper function

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-02-13 09:08:54 -05:00
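The semantics follow torch's scatter_reduce (the PR mirrors the reduce and include_self arguments). A quick torch reference example of what the new Tensor.scatter_reduce is expected to match:

```python
import torch

src = torch.tensor([1., 2., 3., 4.])
index = torch.tensor([0, 0, 1, 1])
out = torch.zeros(2)
# out[index[i]] = reduce(out[index[i]], src[i]) along dim 0
print(out.scatter_reduce(0, index, src, reduce="sum"))                       # tensor([3., 7.])
print(out.scatter_reduce(0, index, src, reduce="amax", include_self=False))  # tensor([2., 4.])
```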
George Hotz
74742c018f hotfix: setup_mock_nv_osx 2025-02-13 12:26:15 +08:00
chenyu
f4f56d7c15 move time_linearizer to extra.optimization.helpers [pr] (#9048)
no longer used in tinygrad
2025-02-12 15:49:58 -05:00
divinity76
bec4f59ce8 workaround f16 cast ambiguity (#8935)
For unknown reasons, without this change, trying to run "Llama 3.2 1B" produces the error below. FWIW, I don't know the performance impact of this change. I can't even get exo running, but this change lets me get further (before running into a separate issue with VRAM allocation; a story for another day, I suppose).

error: 
```
Failed to fetch completions: Error processing prompt (see logs with DEBUG>=2): Nvrtc Error 6, NVRTC_ERROR_COMPILATION <null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
            function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
    *((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
                                                                                 ^

[the same "more than one user-defined conversion" error repeats three more times, once for each remaining (half)(...) cast on this line]

4 errors detected in the compilation of "<null>".
```
2025-02-11 09:38:56 +08:00
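A hypothetical sketch of the idea behind such a workaround at the code-generation level: instead of emitting a direct bf16-to-half cast (which nvrtc reports as ambiguous above), emit an explicit intermediate cast through float. The render_cast helper below is invented for illustration and is not tinygrad's renderer code:

```python
def render_cast(val: str, src_dtype: str, dst_dtype: str) -> str:
  # hypothetical helper: route bf16 -> half through float so only one
  # user-defined conversion applies on the CUDA side
  if src_dtype == "nv_bfloat16" and dst_dtype == "half":
    return f"((half)((float){val}))"
  return f"(({dst_dtype}){val})"

print(render_cast("val0", "nv_bfloat16", "half"))  # ((half)((float)val0))
```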
nimlgen
dfc9d6827f am_smi: print power state (#9013) 2025-02-10 23:07:39 +03:00
nimlgen
f91409f038 am: fix proclogs (#9004) 2025-02-10 16:38:58 +03:00
nimlgen
c6c2373bc0 replace libpciaccess autogen with just pci regs (#8983)
* replace libpciaccess autogen with just pci regs

* add pci.py
2025-02-09 18:40:45 +03:00
George Hotz
6ffee2fca9 reduce speed example [pr] (#8978)
* reduce speed example

* fast like a nascar
2025-02-09 14:13:59 +08:00
George Hotz
a3c78d47b3 speed docs + upgrades [pr] (#8964)
* add some docs about speed [pr]

* better torch gemm

* enable locals on llvm/clang

* disable locals for beam speed on LLVM/CLANG

* 0x20 alignment in llvm allows ymm use
2025-02-08 17:28:52 +08:00
nimlgen
11d50324d8 am: tiny cleanups (#8958)
* am: start cleanups

* am
2025-02-07 23:44:43 +03:00
Ahmed Harmouche
133cacadde Autogen webgpu dawn, removing wgpu-py dependency (f16 support part 1) (#8646)
* Switch to dawn, all tests passing locally

* Use dawn-python

* Skip failing test

* Skip midcast and fix timestamp on metal ci

* Autogen webgpu

* Try fetch dawn lib again

* /usr/lib

* Without lib prefix

* Test autogen diff

* Delete webgpu support, move everything to ops_webgpu

* mypy fix

* Simplify, refactor

* Line savings

* No ResultContainer

* Type annotation for result

* Some more simplifications

* Why was this explicit sync used at all?

* Refactor: delete functions that are only used once

* Create shader module inline

* Clear unit tests cache, maybe that solves it

* That wasn't it

* Try deleting cache to pass failing weight compare

* weights_only=False for pytorch 2.6

* Simplify ctype array creation

* Remove nanosecond precision timestamps

* Simplify error handling

* Refactor, add back type annotations

* Deleted custom submit function, refactor

* read_buffer simplify

* Fix use after free, refactor

* Simplify supported_features

* Runtime docs

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-02-07 15:16:59 +08:00
nimlgen
ee1a0fb8ec am_smi: print device name (#8939) 2025-02-07 03:01:25 +03:00
chenyu
a092b6395d Tuple -> tuple, List -> list [pr] (#8936) 2025-02-06 14:21:19 -05:00
nimlgen
86feb98dcd am: add support for 7600 (#8910)
* am: start to add support for 7600

* test_tiny passes

* mmhub 3 0 2

* cleaner
2025-02-06 14:04:07 +03:00
geohotstan
057c70b05f add onnx_helpers to extra and add ort validate to benchmark_onnx (#8890)
* start

* log severity

* only change this

* change abstraction so it's more usable for huggingface

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-02-04 16:36:01 -05:00
George Hotz
56fa5c1191 dsp simulator (#8869)
* dsp simulator

* progress

* fix

* close on test tiny

* working

* less waste

* line savings

* Device DSP compiler

* mock DSP at the bottom

* DSP tests

* docker caching

* test update

* need load

* skip that test for CI DSP

* last touch

* ugh
2025-02-04 09:45:04 +08:00
geohotstan
d1aa9f30bc copy onnx_ops into onnx (#8876)
* just copy it over

* make OnnxOps a global var

* some small style stuff

* rerun CI but also some small clean up

* some comments
2025-02-03 12:15:07 -05:00
George Hotz
f484db0e63 dsp cleanups [pr] (#8866) 2025-02-03 15:18:53 +08:00
qazal
ba17786068 do not construct unmasked VALID (#8759)
* new lines that exist in codegen/ops

* update tests

* update sops.gz (13071 -> 13070 asts)

* fix viz too

* remove that TODO

* diff pruning

* mask assert + device

* work

* diff pruning

* re: fix viz too

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-28 20:51:21 +02:00
George Hotz
80089536e5 Revert "move llvm_bf16_cast to renderer for CLANG and LLVM [pr] (#8720)" (#8786)
This reverts commit af0452f116.
2025-01-28 18:59:02 +09:00
mesozoic-egg
af0452f116 move llvm_bf16_cast to renderer for CLANG and LLVM [pr] (#8720)
* handle bf16 via bitcasting for CLANG and LLVM

* On LLVM, skip float16 cast

* float32 on llvm lite, float32 elsewhere

* code format

* trigger pr

* move to rewriter

---------

Co-authored-by: Mesozoic Egg <mesozoic.egg@proton.mail>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-28 18:16:43 +09:00
nimlgen
1c608ae34f am_smi: better spacing (#8773)
* am_smi: better spacing

* not used
2025-01-27 23:01:02 +03:00
Ignacio Sica
b240f12593 [TIP-9] rename Opt's amt to arg 2 (#8770)
* rename Opt amt to arg

* ignore_beam_cache for test_tiny

* move ignore_beam_cache to test_tiny

* move to separate pr

* revert space change

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-01-27 14:19:04 -05:00
George Hotz
3ed146a5ff Revert "rename Opt amt to arg (#8767)" (#8769)
This reverts commit bf041659a5.
2025-01-27 23:46:37 +09:00
Ignacio Sica
bf041659a5 rename Opt amt to arg (#8767) 2025-01-27 23:36:47 +09:00
nimlgen
2f06eccf1d am: script and vfio msg (#8742)
* am: script and vfio msg

* use sysfs bars always for now

* tiny changes
2025-01-25 00:33:00 +03:00
nimlgen
dc10187fc0 am: add am_smi (#8739)
* am: start monitor

* cleanups

* fixes

* hmm

* progress

* cleanup
2025-01-24 20:16:19 +03:00
geohotstan
04846b91aa reorder and categorize onnx_ops (#8731)
* new order

* remove a todo

* constant node definitely has requires_grad=False

* one new line spacing

* property and graph

* oops linter
2025-01-23 13:18:54 -05:00
chenyu
49b914ee69 simpler bert acc [pr] (#8714)
logit.log_softmax().argmax(-1) is equivalent to logit.argmax(-1)
2025-01-22 10:32:19 -05:00
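The equivalence holds because log_softmax only subtracts a per-row constant (the logsumexp), which never changes the argmax. A small numpy check:

```python
import numpy as np

logits = np.array([[2.0, 5.0, 1.0], [0.3, 0.1, 0.9]])
# log_softmax(x) = x - logsumexp(x): a per-row shift preserves ordering
log_softmax = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
assert (log_softmax.argmax(-1) == logits.argmax(-1)).all()
```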
geohotstan
dd82b4c913 make onnx runner a class (#8647)
* this

* clean up

* more clean ups and improve debug msg

* more correct training toggler

* remove manual training toggling

* change some variable names

* actually just add the training toggle for LIMIT envvar too

* more refinement

* __call__ and OnnxRunner

* fix half of the pylint issues; the other half is importing from onnx while this file is onnx.py, figure it out later

* ahhhh found another mistake

* remove limit from __call__

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-01-20 10:11:05 -08:00
ignaciosica
b49a04145e fix for int plus minor cleanup (#8650) 2025-01-17 22:30:39 -05:00
geohotstan
4abe631b56 fix onnx mobilenetv2-7-quantized.onnx (#8574)
* is 67% considered fixed?

* move test up

* share function

* add qgemm too

* make sure qgemm comes out as int

* actually that note is not right

* remove qgemm (I did it wrong) and add it later lol.
2025-01-13 09:25:06 -08:00
Francis Lata
c25d5d3101 improve isin checks (#8589) 2025-01-13 12:12:31 -05:00
nimlgen
38b5ac4d4a mypy for mockgpu/cuda & dsp/run (#8575) 2025-01-12 18:25:39 +03:00
geohotstan
815c505e1d fixes from adapting tvm tests (#8570) 2025-01-11 11:38:36 -05:00
qazal
60503c8621 use CAPTURE_PROCESS_REPLAY=1 in CI [pr] (#8564) 2025-01-11 06:03:48 -05:00
George Hotz
c7acd40574 more aggressive onnx const creation [pr] (#8561) 2025-01-10 17:38:32 -08:00
George Hotz
70fa65cd95 viz fixups + scheduler option [pr] (#8557) 2025-01-10 15:09:31 -08:00
George Hotz
9833fe83d8 more work on onnx imagenet [pr] (#8552)
* more work on onnx imagenet [pr]

* working quantization

* static quant

* benchmark onnx 0 dim
2025-01-09 20:28:18 -08:00
George Hotz
5720871903 onnx consts are const [pr] (#8548) 2025-01-09 16:09:22 -08:00
geohotstan
299d333806 Add QLinearConv, QLinearMatMul, QLinearAdd, QLinearGlobalAveragePool to onnx (#8478)
* QLinearEverything

* ok ort verify passes

* this should be int instead

* cast to int then char to do wraparound

* cleaner

* move contrib ops to microsoft ops

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-09 15:08:53 -08:00
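The "cast to int then char" bullet refers to reproducing int8 wraparound: do the arithmetic in a wider integer type, then truncate back to int8. A rough numpy illustration of that behavior (not the PR's code):

```python
import numpy as np

acc = np.array([100], dtype=np.int32) + np.array([100], dtype=np.int32)  # 200, outside int8 range
wrapped = acc.astype(np.int8)  # C-style truncation gives two's-complement wraparound
print(wrapped)  # [-56]
```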