Commit Graph

870 Commits

Author SHA1 Message Date
qazal
a70d1bf439 move print_diff to process replay [pr] (#8566)
* move print_diff to process replay [pr]

* ruff rightfully complains
2025-01-11 09:28:45 -05:00
qazal
60503c8621 use CAPTURE_PROCESS_REPLAY=1 in CI [pr] (#8564) 2025-01-11 06:03:48 -05:00
chenyu
6a7f971fa0 hotfix max(DEBUG, 2) -> max(DEBUG.value, 2) [pr] (#8553) 2025-01-10 12:57:44 -05:00
George Hotz
9833fe83d8 more work on onnx imagenet [pr] (#8552)
* more work on onnx imagenet [pr]

* working quantization

* static quant

* benchmark onnx 0 dim
2025-01-09 20:28:18 -08:00
chenyu
2cbb34535c simpler allreduce script [pr] (#8551)
time everything at the tensor level and get the time from GlobalCounters.time_sum_s
2025-01-09 21:38:13 -05:00
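A minimal sketch of the timing approach this commit describes: realize work at the Tensor level, then read accumulated kernel time back from `GlobalCounters.time_sum_s`. The matmul is an illustrative stand-in for the allreduce, and kernel times are only accumulated when timing is captured (e.g. running with DEBUG=2):

```python
from tinygrad import Tensor, GlobalCounters

GlobalCounters.reset()
# stand-in workload; the real script times allreduce tensors this way
(Tensor.rand(1024, 1024) @ Tensor.rand(1024, 1024)).realize()
# time_sum_s accumulates measured kernel time across all launches
print(f"kernel time: {GlobalCounters.time_sum_s * 1e3:.3f} ms")
```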
chenyu
23c56817d8 update and clean up allreduce script [pr] (#8549)
make `run` able to run with ring only
2025-01-09 19:35:28 -05:00
geohotstan
299d333806 Add QLinearConv, QLinearMatMul, QLinearAdd, QLinearGlobalAveragePool to onnx (#8478)
* QLinearEverything

* ok ort verify passes

* this should be int instead

* cast to int then char to do wraparound

* cleaner

* move contrib ops to microsoft ops

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-09 15:08:53 -08:00
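A hedged numpy sketch of the QLinearAdd math these contrib ops implement: dequantize both inputs, add in float, requantize. `qlinear_add` is an illustrative name, not the tinygrad code, and the final int→int8 cast mirrors the "cast to int then char to do wraparound" note above rather than the spec's saturation:

```python
import numpy as np

def qlinear_add(A, A_scale, A_zp, B, B_scale, B_zp, C_scale, C_zp):
  # dequantize both quantized inputs to float
  a = (A.astype(np.int32) - A_zp) * A_scale
  b = (B.astype(np.int32) - B_zp) * B_scale
  # requantize the float sum into the output's quantized space
  c = np.round((a + b) / C_scale) + C_zp
  # "cast to int then char": int8 wraparound instead of saturating
  return c.astype(np.int32).astype(np.int8)

A = np.array([10, 200], dtype=np.uint8)
B = np.array([20, 100], dtype=np.uint8)
print(qlinear_add(A, 0.1, 0, B, 0.1, 0, 0.2, 0))  # [  15 -106] (150 wraps)
```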
chenyu
85a4397f27 fix create_schedule_with_vars usage in allreduce benchmark [pr] (#8522)
* fix create_schedule_with_vars usage in allreduce benchmark [pr]

because I didn't know how to use it...

* increase time limit because tiny17 is slow
2025-01-07 01:30:01 -05:00
chenyu
0061dc7447 fix benchmark allreduce and add to ci [pr] (#8521) 2025-01-07 00:37:59 -05:00
geohotstan
9229867fec Support asymmetrical pads for all pooling functions (#8109)
* implemented in tensor

* apply onnx tests to asymmetrical pads

* better onnx op ordering

* correct ceil_mode asymmetrical

* fix onnx_ops comments

* a few more TODOs and fix some stupidity

* fix some typing

* fix test

* mypy still a little messed up

* refactor out pad struct transformation

* add simple docs for now

* add whatever tests possible

* add tests for _resolve_pool_pads

* better err msg

* whoops didn't mean to include this

* retry CI

* enable asymmetric pads onnx tests

* better docs

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-01-05 16:01:08 -05:00
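A hedged sketch of what this PR enables: pooling with a different pad on each side. The per-side ordering in `padding` is an assumption here (guessed to follow the (left, right, top, bottom) 2D convention of Tensor.pad):

```python
from tinygrad import Tensor

x = Tensor.arange(16).reshape(1, 1, 4, 4).float()
# pad only the left column and the top row before pooling
y = x.max_pool2d(kernel_size=(2, 2), stride=2, padding=(1, 0, 1, 0))
print(y.shape, y.numpy())
```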
qazal
12fa4340b3 pickle ContextVars in process replay [pr] (#8484)
* pickle ContextVars in process replay

* add test_pickle_context_var [pr]

* more realistic
2025-01-03 23:11:54 +08:00
geohotstan
de306c615b [fixed] onnx pool cleanup (#8474)
* pool janitor duty

* actually conv allows asymmetric pads

* a little prettier
2025-01-02 16:56:10 -05:00
chenyu
6fa38367bf Revert "onnx pool ops clean up (#8471)" (#8472)
This reverts commit 241db29ede.
2025-01-02 11:04:34 -05:00
geohotstan
241db29ede onnx pool ops clean up (#8471) 2025-01-02 10:45:30 -05:00
geohotstan
c4b13e2f6d add onnx DequantizeLinear (#8468)
* is this right?

* small changes

* dont support float8

* mergeable?
2025-01-02 09:52:49 -05:00
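For reference, the DequantizeLinear formula from the ONNX spec, `y = (x - x_zero_point) * x_scale`, as a plain numpy sketch (the actual op maps this onto Tensor ops):

```python
import numpy as np

def dequantize_linear(x, x_scale, x_zero_point=0):
  # y = (x - zero_point) * scale, widened first to avoid uint8 wraparound
  return (x.astype(np.int32) - x_zero_point).astype(np.float32) * x_scale

q = np.array([0, 128, 255], dtype=np.uint8)
print(dequantize_linear(q, x_scale=0.5, x_zero_point=128))  # [-64. 0. 63.5]
```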
nimlgen
c18307e749 AM driver (#6923)
* connect to gpu

* rlc init?

* gfx comp start init

* early init is hardcoded, some progress with fw

* gart

* progress, next mqd

* ring setup, still does not execute anything

* ugh write correct reg

* pci2: vm

* pci2: start psp

* vm seems to work

* pci2: gfx start

* pci2: fix psp ring resp

* pci2: try ring

* pci2: mes and some fixes

* pci2: some progress

* pci2: progress

* pci2: mm

* pci2: discovery

* pci2: correct apertures

* pci2: b

* pci2: i

* pci2: l

* pci2: o

* pci2: cmu

* pci2: mes_kiq works

* pci2: mes

* pci2: kcq does not work(

* pci2: unhalt gfx

* ops_am

* minor

* check if amdgpu is there, or we will crash

* bring back graph, it just works

* less prints

* do not init mes (not used)

* remove unused files

* ops_am: start move into core

* ops_am: works

* clcks, but still slower

* faster + no mes_kiq

* vm frags + remove mes

* cleanup fw

* gmc tiny cleanup

* move to ops_amd

* comment out what we dont really need

* driverless

* close in speed

* am clean most of ips

* gmc to ips

* cleaner

* new vm walker

* comment old one

* remove unused autogens

* last write ups

* remove psp hardcoded values

* more

* add logs

* ih

* p2p and sdma

* vfio hal and interrupts

* smth

* amd dev iface

* minor after rebase

* bind for sdma

* Revert "bind for sdma"

This reverts commit a90766514d.

* tmp

* debug new mm

* ugh, allreduce hangs fixed

* p1

* works

* no pci.py

* cleaner a bit

* smth

* tiny cleanups

* cleaner a bit

* pciiface

* linter

* linter 2

* linter 3

* linter

* pylint

* reverted unrelated changes

* unrelated

* cmp tool

* ugh wrong fw

* clockgating

* unrelated

* alloc smaller chunks

* this

* opt sigs

* collect stat

* ops

* upd

* proclogs

* proclogs2

* vfio

* ruff

* linter pylint

* oops

* mypy p1

* mem fix

* mypy p2

* mypy p3

* mypy p4

* correct

* minor

* more tests

* linter in tests

* pci_regs header

* minor write up

* setup

* do not require libs

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-12-31 23:06:17 +03:00
chenyu
f3fdec940d Tensor.mod (#8458)
it's a Python-style mod; possibly could be cleaner with a floor div

relaxed the vmin for MOD slightly for C-style negative mod; it's more correct and might fix other bugs
2024-12-31 11:31:42 -05:00
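The distinction the commit message draws, in plain Python: a Python-style mod takes the sign of the divisor and falls out of floor division, while a C-style mod takes the sign of the dividend:

```python
import math

a, b = -7, 3
print(a % b)             # 2: Python-style, sign follows the divisor
print(a - (a // b) * b)  # 2: the same result via floor division
print(math.fmod(a, b))   # -1.0: C-style truncated mod, sign follows dividend
```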
qazal
866dfa1f23 create_schedule([x.lazydata]) -> x.schedule() in tests (#8449) 2024-12-31 03:15:52 +08:00
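The mechanical change in the tests, sketched on a toy tensor; both spellings yield the list of ScheduleItems to execute:

```python
from tinygrad import Tensor

x = Tensor([1.0, 2.0]) + 1
sched = x.schedule()  # replaces create_schedule([x.lazydata]) in tests
print(len(sched))
```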
chenyu
b7397c1322 more typing cleanups [pr] (#8376)
List, Tuple, DefaultDict
2024-12-22 05:21:03 -05:00
chenyu
18dca3c3d7 isolate train_gpt2 slow kernels [pr] (#8358)
also fixed run_linearizer with var_vals=None
2024-12-20 17:59:01 -05:00
qazal
5776ea9386 hotfix: account for all changes in process_replay early stopping [pr] (#8348) 2024-12-20 23:46:46 +08:00
geohotstan
423d823c50 add GatherND and ScatterND to onnx ops (#8241)
* implemented

* this implementation is now correct

* this is fine I guess

* better variable names

* finally correct gathernd

* add a note

* eh just leave it at this for now

* teeny adjustment
2024-12-19 00:35:04 -05:00
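GatherND semantics per the ONNX spec (batch_dims=0), as a small numpy sketch; `gather_nd` is an illustrative helper, not the tinygrad implementation:

```python
import numpy as np

def gather_nd(data, indices):
  # each length-k row of indices picks one slice of data
  k = indices.shape[-1]
  flat = [data[tuple(idx)] for idx in indices.reshape(-1, k)]
  return np.stack(flat).reshape(*indices.shape[:-1], *data.shape[k:])

data = np.array([[0, 1], [2, 3]])
print(gather_nd(data, np.array([[0, 0], [1, 1]])))  # [0 3]
```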
chenyu
9789a83064 hotfix DEBUG in speed_v_theoretical.py conv (#8266)
infinite loop with DEBUG set manually: `DEBUG=2 python test/external/speed_v_theoretical.py -k conv`

```
  File "/Users/chenyu/code/tinygrad/tinygrad/helpers.py", line 95, in __ge__
    def __ge__(self, x): return self.value >= x
                                ^^^^^^^^^^^^^^^
  [Previous line repeated 4984 more times]
RecursionError: maximum recursion depth exceeded in comparison
```
2024-12-15 19:44:45 -05:00
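A hedged reconstruction of how that recursion arises (the stand-in class below mimics tinygrad.helpers.ContextVar, and the exact assignment path in the script is a guess): when DEBUG is already >= 2, `max(DEBUG, 2)` returns the ContextVar itself rather than an int, and storing that back as the value makes every later comparison re-enter `__ge__`:

```python
class FakeContextVar:  # stand-in for tinygrad.helpers.ContextVar
  def __init__(self, value): self.value = value
  def __ge__(self, x): return self.value >= x
  def __lt__(self, x): return self.value < x  # used by max()

DEBUG = FakeContextVar(2)    # simulates running with DEBUG=2 set manually
DEBUG.value = max(DEBUG, 2)  # max returns the ContextVar, not the int 2
# Now DEBUG >= 3 evaluates self.value >= 3 with self.value being DEBUG
# itself, so __ge__ recurses until RecursionError, as in the traceback.
# The fix: max(DEBUG.value, 2) always produces a plain int.
```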
qazal
67e66ac1ab hotfix: schedule_uop in process replay (#8260)
* hotfix: schedule_uop in process replay

* notes
2024-12-15 21:24:54 +08:00
chenyu
62e19649c0 lower test_conv_3x3_256_32_32_256_256 (#8226)
tiny7 is slow
2024-12-13 17:15:53 -05:00
qazal
5864627abe process replay filter warnings [pr] (#8199) 2024-12-13 17:43:43 +08:00
George Hotz
8a04a3a77a rename LazyBuffer -> UOp [pr] (#8169)
* rename LazyBuffer -> UOp [pr]

* fix docs
2024-12-11 16:15:52 -08:00
chenyu
155f7df599 lower test_gemm_4096 expectation on green (#8152)
getting 119 sometimes, so lowered to 115
2024-12-10 18:05:12 -05:00
qazal
07b6d5cf63 assign early folding (#8093)
* assign early folding [pr]

* move to to_si

* -

* fix generate_dataset

* diff too big

* no recreation, no diff

* gzip

* new sops from tiny10

* final try
2024-12-07 17:02:55 +08:00
chenyu
564b3a3e1b onnx Bitwise ops (#8095)
free stuff!
2024-12-06 16:58:09 -05:00
chenyu
d000c08f04 fix return type of Tensor.pow (#8091)
int to the power of int should return int, etc.; it hints that we would like to have Ops.POW
2024-12-06 13:38:29 -05:00
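What the fix means at the API surface, as a small check (behavior inferred from the commit message; before this fix, pow promoted to float):

```python
from tinygrad import Tensor, dtypes

x = Tensor([2, 3], dtype=dtypes.int32)
print((x ** 2).dtype)  # dtypes.int32 after this fix, not a float dtype
```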
geohotstan
5184410fc3 combine get inputs and type_parse function in onnx [fixed] (#8081)
* 1 is simpler than 2

* variable name

* change error wording

* shapes for sequence type must be homogeneous

* bug fix for model benchmark

* fix comments too

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-12-06 12:34:47 -05:00
chenyu
b73d9a7d24 Revert "combine get inputs and type_parse function in onnx (#8069)" (#8079)
This reverts commit 074a67a6eb.
2024-12-06 08:04:21 -05:00
geohotstan
074a67a6eb combine get inputs and type_parse function in onnx (#8069)
* 1 is simpler than 2

* variable name

* change error wording

* shapes for sequence type must be homogeneous
2024-12-06 07:42:35 -05:00
chenyu
5c6ed5dba6 lower test_conv_3x3_256_32_32_256_256 expectation (#8060)
failed https://github.com/tinygrad/tinygrad/actions/runs/12182799887/job/33982676812#step:9:210
2024-12-05 10:30:56 -05:00
George Hotz
20878be2af lower test_gemv_4096_16384 expectations 2024-12-05 12:08:26 +08:00
geohotstan
5ce8090d42 simple onnx_ops cleanups (#8003)
* simple clean ups first

* more work

* kinda have adam

* ooo momentum worked nicely

* almost there

* wow.. is the onnx test wrong

* nicer optim stuff

* just skip that test

* small comment changes

* use naming convention from other parts of codebase

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-12-04 15:33:03 -05:00
chenyu
0693158d28 lower v_theoretical gemv on red (#8042)
tiny7 is still slower https://github.com/tinygrad/tinygrad/actions/runs/12166149038/job/33931736130#step:8:209
2024-12-04 13:59:40 -05:00
George Hotz
08657cb7b0 hotfix: bump expectations in speed_v_theoretical 2024-12-04 19:00:33 +08:00
George Hotz
ea65c79ba2 hotfix: don't spam BEAM debug in speed_v_theoretical 2024-12-04 18:47:16 +08:00
George Hotz
09b00b1b04 hotfix: use kernel timings instead of python timings in speed_v_theoretical 2024-12-04 18:36:17 +08:00
uuuvn
e9c5b23ba1 Use MTLCompiler directly (v2) (#7920)
* Use MTLCompiler directly (v2)

* to_block_literal and REQUEST_TYPE_COMPILE

* Rewrite command encoding

* Revert to_block_literal

* Maybe that's more readable to some people?

* Typo and comment about stdlib caching

* Update ops_metal.py

* Update ops_metal.py

* Update ops_metal.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-12-04 16:36:48 +08:00
George Hotz
09eac42fd6 cache indexed uops in st [pr] (#8008)
* cache indexed uops in st [pr]

* remove arg from range
2024-12-03 21:27:07 +08:00
George Hotz
b8bf5b2787 minor uop speedups [pr] (#8002)
* minor uop cleaner [pr]

* free uop creation speed by removing WeakValueDictionary

* a lil faster

* disable that test

* lines

* and it doesn't print non-hit patterns
2024-12-03 17:04:48 +08:00
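A generic sketch of the caching scheme this commit removed: hash-consing nodes through a `weakref.WeakValueDictionary` so structurally identical objects share one instance. Dropping it trades that dedup for cheaper object creation (names here are illustrative, not tinygrad's internals):

```python
import weakref

class Node:
  _cache = weakref.WeakValueDictionary()
  def __new__(cls, op, *src):
    key = (op, *src)
    # return the existing instance if an identical node is still alive
    if (hit := cls._cache.get(key)) is not None: return hit
    obj = super().__new__(cls)
    obj.op, obj.src = op, src
    cls._cache[key] = obj
    return obj

a, b = Node("CONST", 1), Node("CONST", 1)
print(a is b)  # True: every construction pays a dict lookup for the dedup
```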
George Hotz
0905f87b68 hotfix: print only kernel time 2024-12-03 14:25:08 +08:00
chenyu
b91fa24387 script to run regressed sd conv on metal (#7995)
* script to run regressed sd conv on metal

this and other similar `conv2d + add` kernels contributed to most of the speed regression

* # ruff: noqa: E501
2024-12-02 15:34:27 -05:00
qazal
b797aee720 uop global buf number tracking try 2 [pr] (#7912)
* uop buffer init small refactor [pr]

* add early

* this way it doesn't need late

* buffer_num

* itertools.count

* count from 0

* down to 380
2024-12-02 14:45:17 +08:00
George Hotz
cbcc1c20eb second try at block linearize (#7892)
* second try at block linearize

* weeee, works for lil matmul

* it's so beautiful

* test tiny passes

* fix bugs

* combine matching BLOCKENDS

* wrapping

* test lin failures passes

* those failures were fake

* flip sort order

* fix ptx tests

* deal with store better

* dumb ptx fix

* expect less

* reduce lines

* reduce lines

* less lines and cleaner

* no defaultdict

* tighter

* simpler block_parent_count
2024-12-02 13:43:09 +08:00
George Hotz
6c1efb9a72 hotfix: amd gemv was flaky 2024-12-02 11:08:24 +08:00
chenyu
bb23469f93 lower conv threshold on red (#7948) 2024-11-28 13:31:06 -05:00