10490 Commits

Author SHA1 Message Date
qazal
d342f7688d remove some skips in test_schedule + use assertRaisesRegex [pr] (#10296) 2025-05-14 14:54:07 +03:00
qazal
40f4ce3390 enable AMD CI for TestRandomness.test_multinomial [pr] (#10295) 2025-05-14 14:32:22 +03:00
nimlgen
792853b9e2 usbgpu: enable cache for compute queue (#10294) 2025-05-14 13:05:36 +03:00
nimlgen
1218fc2230 usbgpu: enable cache for 64bit addresses (#10293) 2025-05-14 12:37:39 +03:00
qazal
1770e00c41 only CAPTURE_PROCESS_REPLAY=1 + add filterwarnings back [pr] (#10292) 2025-05-14 11:58:42 +03:00
qazal
1c97338be5 enable process replay assert for schedule [pr] (#10280)
* enable process replay assert for schedule

* start at unique+1
2025-05-14 11:10:47 +03:00
George Hotz
f1130ab3d3 openpilot benchmark test (#10290)
* openpilot benchmark test

* that
2025-05-13 22:49:28 -07:00
uuuvn
f726f79a9e Remote multi (transfer) (#10285) 2025-05-13 18:26:32 -07:00
uuuvn
7bc4864bc4 Make dev a property of Allocator (#10286)
* Make `dev` a property of `Allocator`

(this is a prereq refactor for #10285)

At least `BufferXfer.copy` accesses it assuming it's always present,
currently most devices just add this property on their own repeating
the same code over and over again.

This is also a bit footguny, see `RemoteAllocator` that named this
property `device` instead of `dev`, i could obviously just change that
in one place but doing it globally seems like a better solution (and it
reduces code duplication too).

`MallocAllocator` is a bit special, but passing `None` works just fine.

* typing

* ignore type instead of cast
2025-05-13 17:01:01 -07:00
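
The refactor described in #10286 can be pictured with a minimal sketch. Every class and function name below is hypothetical and is not taken from the repository; the only idea borrowed from the commit message is that the base allocator stores `dev` once, so subclasses no longer add it themselves, and that a host-memory allocator may pass `None`.

```python
from typing import Generic, Optional, TypeVar

DeviceT = TypeVar("DeviceT")

class ExampleAllocator(Generic[DeviceT]):
    """Hypothetical base class: stores the owning device in one place."""
    def __init__(self, dev: DeviceT):
        self.dev = dev

class ExampleDevice:
    """Hypothetical stand-in for a real device object."""
    def __init__(self, name: str):
        self.name = name

class ExampleGpuAllocator(ExampleAllocator[ExampleDevice]):
    """Inherits `dev` from the base class instead of redefining it."""

class ExampleHostAllocator(ExampleAllocator[None]):
    """Host memory has no device, so `None` is passed explicitly."""
    def __init__(self):
        super().__init__(None)

def owner_name(allocator: ExampleAllocator) -> str:
    # Generic code can rely on `dev` always being present.
    dev: Optional[ExampleDevice] = allocator.dev
    return dev.name if dev is not None else "host"
```

With a single attribute name defined on the base class, a subclass cannot accidentally call it `device` instead of `dev`.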
George Hotz
ec46f658d7 openpilot llvm test [pr] (#10288) 2025-05-13 16:51:41 -07:00
uuuvn
453b268342 Factor out remote connection and cache it (#10282)
Should be a small speed improvement but the main reason this is needed
is to have a defined ordering of RemoteRequests within one host so that
transfers won't require doing something like:
```python
src_dev.batch_submit()
dest_dev.q(Transfer(dest, src_dev.session, src))
dest_dev.batch_submit()
```
for correctness.
2025-05-13 15:02:06 -07:00
uuuvn
ddff9857b8 Remote properties is a dataclass (#10283)
Not strictly required for anything but soon there will be like 4 new
properties and having it be a huge json just seems like bad taste.

It also seems right to not have a separate endpoint for this, just
`GetProperties` request that returns a repr of this similar to how
requests are sent in `BatchRequest`.

This will also make a switch to anything other than http much simpler
if it will be required for any reason, like just a tcp stream of
`BatchRequest`s
2025-05-13 11:56:58 -07:00
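
As a rough illustration of the idea in #10283 (properties as a dataclass sent as a repr-style string), here is a self-contained sketch. The class `ExampleProperties`, its fields, and the `encode`/`decode` helpers are assumptions made for this example and do not reflect the actual code. Decoding uses `eval` with a restricted namespace, which is only acceptable when the peer is trusted.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class ExampleProperties:
    # Hypothetical fields, chosen only for illustration.
    device_count: int
    supports_transfer: bool

def encode(props: ExampleProperties) -> str:
    """Build a constructor-call string, equivalent to a dataclass repr."""
    args = ", ".join(f"{f.name}={getattr(props, f.name)!r}" for f in fields(props))
    return f"{type(props).__name__}({args})"

def decode(text: str) -> ExampleProperties:
    """Rebuild the dataclass from its string form.

    eval is unsafe on untrusted input; the namespace is limited to the
    one class that is expected, and builtins are disabled.
    """
    allowed = {"ExampleProperties": ExampleProperties}
    return eval(text, {"__builtins__": {}}, allowed)
```

Adding a new property then means adding one field, and the wire format stays independent of the transport.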
uuuvn
ba87eca0f1 Remote multi (basic) (#10269)
* Basic remote multi support

Simplest thing to be able to use remote with multiple gpus, very slow
because no transfers (copyin copyout for cross-device copies)

* tests
2025-05-13 09:52:47 -07:00
George Hotz
5f64bbc63d improve multi tests + add support for fixedvars [pr] (#10281)
* improve multi tests + add support for fixedvars [pr]

* add support for fixedvars
2025-05-13 09:27:00 -07:00
chenyu
8a906cb124 Tensor.randn_like (#10276) 2025-05-13 11:53:59 -04:00
nimlgen
eab71d70ba usbgpu: rescan pci bus every run (#10279)
* usbgpu: rescan pci bus every run

* ff
2025-05-13 18:31:42 +03:00
chenyu
c4988bc07b only run test_u32_to_f16 if it supports fp16 (#10277)
* only run test_u32_to_f16 if it supports fp16

* cleanup
2025-05-13 11:16:14 -04:00
nimlgen
9924c7d0e4 usbgpu: rebar (#10275)
* usbgpu: rebar

* cache back

* revert this

* fix

* ugh

* tt
2025-05-13 17:25:51 +03:00
uuuvn
1900c3c68a Metal multi in ci is fine actually (#10274)
Useful for testing remote multi stuff
2025-05-13 10:07:35 -04:00
nimlgen
6f42bf8b54 usbgpu: 10 steps in benchmark to hit cache (#10273) 2025-05-13 17:06:50 +03:00
chenyu
ad5cb2717d FUSE_ARANGE=1 in bert bench (#10263)
still fails, something multi related maybe

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-05-13 09:12:19 -04:00
qazal
a2d6b0afe0 fix FUSE pushing through SHRINK (#10271) 2025-05-13 11:38:53 +03:00
geohotstan
1c4ab6b991 ONNX add tests against ORT (#10270)
* start

* clean up

* indicate file location too
2025-05-13 04:03:52 -04:00
nimlgen
bb31cc4582 usbgpu: check hash in patcher (#10266) 2025-05-12 21:08:53 +03:00
uuuvn
94907d02c8 Move session to RemoteRequest (#10264)
This is a prereq refactor for cloud multi which will make it possible to
use multiple devices from cloud host instead of just one.

I will do that via changing a session to be a `tuple[token, dev_idx]`

Previously the session was in cookies, this is a problem because a single
http request can contain many RemoteRequests with potentially different
devices.

The alternatives are either:

- sending commands for different devices in separate http requests (slow)

- only adding an idx in RemoteRequest in basically the same way i added
session here, keeping session a cookie and concat in server. This is how
i've done it previously and it looks just strictly worse than having it
all be in the same place.
2025-05-12 10:06:09 -07:00
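
A small sketch of why carrying the session on each request helps, as described in #10264: one batch can then mix requests for several devices. The `ExampleRequest` type and `group_by_session` helper are hypothetical; only the `tuple[token, dev_idx]` shape of the session comes from the commit message.

```python
from dataclasses import dataclass

# (token, dev_idx), the session shape proposed in the commit message
Session = tuple[str, int]

@dataclass(frozen=True)
class ExampleRequest:
    """Hypothetical request that carries its own session."""
    session: Session
    op: str

def group_by_session(batch: list[ExampleRequest]) -> dict[Session, list[str]]:
    """Split one batch into per-device op lists, preserving request order."""
    grouped: dict[Session, list[str]] = {}
    for req in batch:
        grouped.setdefault(req.session, []).append(req.op)
    return grouped
```

A cookie would attach a single session to the whole http request, so this per-request split would not be possible without extra bookkeeping on the server.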
Sieds Lykles
02208565de add check (#10257) 2025-05-12 11:03:01 -04:00
nimlgen
08ab184dfd usbgpu: copyin over 100mb/s (#10259)
* usbgpu: over 100mb/s

* align

* h
2025-05-12 16:52:43 +03:00
Kirill R.
4c7c139102 Use cmod/cdiv in sym_infer (#10258)
* Use cmod/cdiv in sym_infer

* test

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-05-12 09:07:28 -04:00
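
For readers unfamiliar with the distinction behind #10258, the sketch below shows C-style division and remainder, which truncate toward zero, next to Python's `//` and `%`, which round toward negative infinity. The function names follow the commit title, but the bodies are an assumption written for this example and may differ from the repository's helpers.

```python
def cdiv(x: int, y: int) -> int:
    """Integer division that truncates toward zero, as in C."""
    if y == 0:
        raise ZeroDivisionError("division by zero")
    q = abs(x) // abs(y)
    # The quotient is negative exactly when the operand signs differ.
    return -q if (x < 0) != (y < 0) else q

def cmod(x: int, y: int) -> int:
    """Remainder matching cdiv, so that x == cdiv(x, y) * y + cmod(x, y)."""
    return x - cdiv(x, y) * y
```

The two conventions agree when both operands are non-negative and differ once a sign is involved: `-7 // 2` is `-4` in Python, while truncating division gives `-3`.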
chenyu
0015b3921f sleep more in CI Remove amdgpu (#10261)
see if this is less flaky
2025-05-12 08:13:44 -04:00
qazal
95c6a736a9 fix FUSE_ARANGE=1 for bert (#10255) 2025-05-12 14:44:05 +03:00
Sieds Lykles
7c4b381fbf Extra simplify valid test [pr] (#10256)
* add test

* Change the range

* add todo test
2025-05-12 07:32:03 -04:00
b1tg
7eeb35ba6f fix AMD LLVM compile error for bf16 cifar (#10254)
* fix AMD LLVM compile error

* remove llvm_bf16_cast

---------

Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-05-12 01:57:07 -04:00
uuuvn
a0ed1ec1ae Faster remote server (#10235) 2025-05-11 19:15:05 -07:00
b1tg
41f5ece877 add nsw flag (#10249)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-05-11 19:14:32 -07:00
George Hotz
8864ff894b hotfix: that repeat_kv belongs outside the if 2025-05-11 18:43:01 -07:00
George Hotz
98c84a711d min rectified flow example [pr] (#10252)
* work on minrf example

* more

* jit sample

* t is tensor not const

* fixes

* more convs

* fix dropout

* don't print

* 504

* big patch

* onehot

* touch

* use embeddings

* dumb uses final layer

* act

* non fl

* match

* tp

* 3

* of

* ppsz

* normal

* add adln

* no t

* weird transformer

* weird transformer

* contig

* actual speed fix

* dumb

* cb

* 0

* t is 0

* mort-t

* args

* dumb days are over

* readable

* contig

* no more t mask

* mask_t

* init to zero

* clean

* steps

* work

* tt

* t

* solid
2025-05-11 18:36:44 -07:00
chenyu
70c797b107 train bert tests (#10248)
added a working bert tiny test, and a failed bert FUSE_ARANGE test
2025-05-11 08:42:08 -04:00
George Hotz
b2df4cb696 add support for amdgpu-flat-work-group-size to AMD LLVM IR (#10246)
* add support for amdgpu-flat-work-group-size to AMD LLVM IR

* don't spam llvm init

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-05-10 19:11:10 -07:00
qazal
9210280811 add v_fmac_f16 vop3 instruction to remu (#10247)
* fmac vop3

* from the box
2025-05-10 23:48:25 +03:00
George Hotz
697259a8a1 amd_comgr_action_info_set_options was deprecated [pr] (#10245)
* amd_comgr_action_info_set_options was deprecated [pr]

* more standard
2025-05-10 11:59:04 -07:00
Kevin Buhler
2e0990c4e9 even spacing in viz nodes (#10168)
* even spacing in viz nodes

* precise dy value

* dominant-baseline text-after-edge

* add STROKE_WIDTH constant, delete dominant_baseline attr

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-05-10 10:35:10 +03:00
chenyu
d0e9b74f40 minor div_and_mod_folding cleanup [pr] (#10243)
remove type ignore and one walrus
2025-05-09 22:42:01 -04:00
Adam Van Ymeren
a28ca0680f update dead link (#10242) 2025-05-09 19:59:52 -04:00
nimlgen
2145bce3f9 usbgpu: copyin size is 16k (#10240)
* usbgpu: copyin size is 16k

* ush
2025-05-09 22:12:54 +03:00
Sieds Lykles
74e40aafa0 use cdiv in div and mod folding (#10216)
* use cdiv

* use cdiv and cmod there as well

* Add tests

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-05-09 12:37:24 -04:00
Sieds Lykles
8da9c070ca take gcd out of trunc div (#10238) 2025-05-09 12:08:10 -04:00
qazal
e2292f6663 TRACEMETA>=2 displays UOp metadata in VIZ (#10237) 2025-05-09 17:42:00 +03:00
qazal
d5686f33a9 delete KernelContext dataclass [pr] (#10236) 2025-05-09 17:36:21 +03:00
qazal
467daf8d4c remap UOp metadata in graph_rewrite_map [pr] (#10234)
* remap metadata in graph_rewrite_map [pr]

* fix

* merge loops

* UOp.metadata returns Metadata|None

* shorter
2025-05-09 17:20:53 +03:00
nimlgen
4c75b124b6 usb: copy into mv is faster (#10233)
* usb: copy into mv is faster

* missing

* bytes
2025-05-09 14:53:36 +03:00