Commit Graph

3701 Commits

Author SHA1 Message Date
qazal
b6904bbf83 Revert "split grouper into insert and finalize stages [pr] (#10222)" (#10224)
This reverts commit 2594e4db15.
2025-05-09 03:02:38 +03:00
qazal
2594e4db15 split grouper into insert and finalize stages [pr] (#10222) 2025-05-09 02:36:22 +03:00
qazal
1d0f239df7 use Tensor.train() in schedule test + typo [pr] (#10220) 2025-05-08 23:46:42 +03:00
nimlgen
267ba9b592 usbgpu: better names in copy speed benchmark (#10212) 2025-05-08 16:12:37 +03:00
hooved
7b4f05fd00 Add test for correctness of Infinity in WebGPU (#10201)
* use function for infinity instead of uniform

* test infinity math locally

* test infinity math in CI

* make pytest available to MacOS (WebGPU)

* revert to master except failing webgpu test
2025-05-08 05:20:05 -07:00
nimlgen
ba52fce4b2 usbgpu: benchmark in ci (#10208)
* usbgpu: benchmark

* usbgpu: benchmark
2025-05-08 12:02:04 +03:00
qazal
d0e3449992 remove view_supported_devices, check allocator instead [pr] (#10209) 2025-05-08 11:45:02 +03:00
George Hotz
8d4c563c01 all COPY can be clone (#10205)
* match old behavior

* simple

* it means the naive thing before the multi

* fix
2025-05-07 20:31:39 -07:00
hooved
8e76c40aea Refactor test: Enable generality in testing UOp alu expressions (#10200)
* use function for infinity instead of uniform

* test infinity math locally

* test infinity math in CI

* make pytest available to MacOS (WebGPU)

* revert to master except failing webgpu test

* isolate test refactor
2025-05-07 19:39:44 -07:00
uuuvn
10c9ede6b7 Cloud graph (#9876) 2025-05-07 11:41:41 -07:00
Sieds Lykles
2891892834 Fold constant variable (#10196)
* Add rule

* add test and comment

* merge rule
2025-05-07 11:39:44 -07:00
Sieds Lykles
8386527bb9 Take neg out of idiv (#10164)
* Add rules

* Fix tests

* Move rules lower to prevent recursion
2025-05-07 11:39:08 -07:00
Sieds Lykles
09544d4556 Add rule and test (#10189)
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-05-07 10:15:55 -04:00
nimlgen
0fbe494c6b usb: cache writes into 0xa000 (#10191)
* usb: cache writes into 0xa000

* mock

* match parent spec

* ugh
2025-05-07 16:03:35 +03:00
nimlgen
685d5c46df usbgpu: send pci write in batches (#10190)
* usbgpu: send pci write in batches

* mock
2025-05-07 14:41:56 +03:00
qazal
94e07725a6 only reorder expand if it can fuse with input (#10186)
* failing test

* only reorder expand if it can fuse with input

* (16,) is reshaped to (4, 4)
2025-05-07 18:14:31 +08:00
uuuvn
dba073e5c0 Less messy broken graph on paravirtualized metal workaround (#10182)
* Less messy broken graph on paravirtualized metal workaround

GitHub CI macOS runners use paravirtualized metal which is broken with
graph (some comments say that ICB in particular is broken but in my
testing it was fine sometimes, but other times hitting an assert inside
metal's code related to resouces, so not sure).

> Assertion failed: (resource != nil), function -[IOGPUMetalResource initWithResource:], file IOGPUMetalResource.m, line 458.

This can be reproduced locally with any virtualization software (like utm)
that can create macOS VMs with apple's own virtualization framework.

* unused import
2025-05-06 20:41:02 +03:00
nimlgen
37a7a99adb metal: fix graph when unrelated input buffers are not metal buffers (#10170)
* metal: fix graph when unrelated input buffers are not metal buffers

* tinier test
2025-05-06 11:37:16 +03:00
George Hotz
603c03bef2 fix tests for rewrite [pr] (#10167)
* fix tests for rewrite [pr]

* cleaner

* delete linearize_uop

* clean up the rest
2025-05-05 19:19:49 -07:00
wozeparrot
10437904cd refactor: ops_cloud -> ops_remote [pr] (#10166) 2025-05-05 15:59:51 -07:00
Sieds Lykles
338f33efae Fast mod (#10055)
* Enable fast mod

* Add test
2025-05-05 09:15:43 -07:00
qazal
62e86bc5ec insert Ops.FUSE for arange (#10140)
* insert Ops.FUSE for arange

* reshape does not collapse

* do not fuse reshapes

* add children

* fixups

* work

* add Ops.WHERE support to z3

* fix fuse for cast

* diff

* ugh

* don't need this anymore

* contiguous

* add always_contiguous

* there too
2025-05-05 08:32:12 +03:00
George Hotz
a0240d8c2b lil work on llvm speed (#10157)
* lil work on llvm speed

* llvm failing test

* 1e-4

* simpler failing test

* once is fine

* gpt suggests this syntax change

* bump that debug
2025-05-04 16:37:26 -07:00
George Hotz
36ccaa88a6 move merge views [pr] (#10156)
* move merge views [pr]

* move flow to __init__ [pr]
2025-05-04 14:41:47 -07:00
George Hotz
5f3f162606 cache rewrites for renderer [pr] (#10155)
* add caching to rewrites for renderer [pr]

* remove that

* update ebs
2025-05-04 13:45:15 -07:00
Sieds Lykles
848c7783a4 Sign check in div const div pattern (#10150)
* Add rule

* Relax the condition

* Add test

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-05-03 18:04:34 -04:00
George Hotz
7c33924a50 don't use real_size for mem_bytes [pr] (#10147) 2025-05-03 09:41:21 -04:00
nimlgen
45bf7c5b81 am: add allocation bench (#10135)
* init allocation bench

* sorryg

* betetr
2025-05-02 13:51:07 +03:00
Ignacio Sica
8f79492c75 fix test_tensor_cores_codegen for ptx renderer (#10119) 2025-05-01 21:52:36 -03:00
nimlgen
30bd6a619f usb gpu (#8766)
* start gpu

* progress

* fixes

* read correct

* libusb

* libusb works

* support asm24

* hmm

* one access file

* fix extra

* start AMBar

* works on am

* back to usb

* patch fw

* full fast write into a bar

* ugh, minus one gpus, next please

* mute libusb for now

* usb for asm24

* 63

* hmm

* ops

* rescan

* and gpu shoudl be there

* enumerate them?

* usbgpu bus 4, 100% reliable (draft)

* lil

* works

* comments

* add DEBUG

* cleaner

* simplest

* Revert "simplest"

This reverts commit 1d00354c16.

* Revert "cleaner"

This reverts commit c5662de956.

* assert we find gpu

* that's simpler

* this back

* simpler?

* correcT

* work

* nonsense

* works with more checks

* this works

* the 6s in the right place

* reliable now

* fix after reboot

* set config

* 1s timeouts

* close to fw loading

* streams

* usbhub works

* endpoints

* fix

* want to test tiny10

* move to tiny 10

* fix gpu

* ugly speed

* smth

* mostly broken, but signals and dmas

* do not reset gpu every time

* changes to run kernels

* ugh, not working

* t10

* pg and sc files

* some prog

* um?

* somehow it works

* patched for 24

* some tries

* minimal

* moving

* back to working

* so sloooooow

* move to controller

* usb.py rewrite

* rework

* cleaner 1

* cleaner 2

* cleaner 3

* new abstractions

* aft merge

* init controller

* cleaner 4

* cleaner 5

* patcher + tiny changes

* ignore that

* cleaner 6

* after rebase

* cleaner 7

* bring it back

* start linter war

* linter 2

* autogen was missing

* fix autogen

* typing

* better?

* mypy

* extra/legacy rename and cleaner

* shuffle

* better printing

* tiny changes and tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-05-01 18:03:47 +03:00
George Hotz
ef011ff5f9 flip Ops.COPY order [pr] (#10122)
* flip Ops.COPY order [pr]

* fix copy and support multi device copy in _device
2025-05-01 00:26:24 -04:00
chenyu
145e51247a split CAST and BITCAST in PYTHON [pr] (#10123)
CAST only needs truncate and does not require dtype fmt. added bfloat16 tests can run locally
2025-04-30 23:27:35 -04:00
Ignacio Sica
bf5fb97498 fix AMD_LLVM bf16 tc for gfx1100 (#10102)
* fix amd_llvm bf16 tc

* cleanup pattern
2025-04-30 20:06:38 -03:00
George Hotz
dd0070daab Revert "flip Ops.COPY order [pr] (#10120)" (#10121)
This reverts commit 984f09ac74.
2025-04-30 17:25:21 -04:00
George Hotz
984f09ac74 flip Ops.COPY order [pr] (#10120) 2025-04-30 16:50:18 -04:00
chenyu
17d4d258ea simple symbolic slice in llama [pr] (#10112)
support slice that has step None and stop > start
2025-04-30 14:36:35 -04:00
nimlgen
0e1beaf44f nv: align copies + better test (#10118) 2025-04-30 20:09:53 +03:00
nimlgen
2ec3b722e2 nv: fix copies larger than 4g (#10117) 2025-04-30 18:43:17 +03:00
George Hotz
d81acbeef6 multi: move shrink after copy (#10109)
* multi: move shrink after copy

* passing now
2025-04-30 10:29:51 -04:00
nimlgen
5c7d004da5 hcq: refactor int ptrs to hcqbuffers (#10105)
* hcq: refactor int ptrs to hcqbuffers

* more refactors

* linter

* use in allocator

* test fiz

* fx

* ops

* final?

* simpler

* keep this for now
2025-04-30 00:12:18 +03:00
qazal
93bf8764f2 do not open devices in lowering (#10101)
* do not open devices in lowering [pr]

* ctx=opts

* ctx

* fuzz test
2025-04-29 23:18:16 +08:00
George Hotz
c3ff308abb range has only one src now [pr] (#10100)
* range has only one op now

* fix z3 checker

* ci fix

* needs shell

* try pip ensure update

* that ensurepip is useless

* upgrade pip before cache

* windows happy?
2025-04-29 10:31:05 -04:00
George Hotz
427471550a hotfix: amd tflops to 74 and some external_benchmark_sdxl_softmax stuff 2025-04-29 09:02:27 -04:00
qazal
ad7546c931 assert in test_indexing_two_bind instead of silent fail (#10099)
* assert in test_indexing_two_bind instead of silent fail

* debuggable

* skip test_simple_train
2025-04-29 20:23:25 +08:00
qazal
cbf7347cd6 display viz rewrites with tabbing if they are subrewrites (#10097)
* display viz rewrites with tabbing if they are subrewrites

* update viz api
2025-04-29 17:57:21 +08:00
George Hotz
73c2f6602f test sdxl softmax (#10096) 2025-04-28 21:55:50 -04:00
George Hotz
eaceafecae do fusion locally (#10095)
* do fusion locally

* oops, that's the right way

* explicit delete closure
2025-04-28 20:45:37 -04:00
George Hotz
a2d0684fc1 test_attention_simple_view (#10092)
* test_attention_simple_view

* correct comment
2025-04-28 20:01:22 -04:00
Ignacio Sica
bda116d773 fix use_tensor_cores propagation (#10048)
* propagate use_tensor_cores

* add use_tensor_core to arg in test and search

* bugfix

* get TC val from ContextVar in search

* revert minor space change

* add tc emulation test to ci and benchmark

* revert

* revert whitespace change

* remove test for ptx

* add comment and remove llvm test run
2025-04-28 19:30:50 -03:00
George Hotz
d32f5e9f3a improve rendering of shapes in viz + investigate symbolic [pr] (#10091) 2025-04-28 16:44:09 -04:00