Commit Graph

10417 Commits

Author SHA1 Message Date
qazal
230a369708 remove some IGNORE_OOB [pr] (#10142)
* remove some IGNORE_OOB

* remove fuzz_schedule stuff

* test with global

* add for amd ci
2025-05-03 01:16:14 +03:00
qazal
1ed5d733bd disable TRACK_MATCH_STATS for type_verify (#10141) 2025-05-02 20:59:19 +03:00
nimlgen
993f0a0e87 am: a bit faster alloc (#10138)
* am: a bit faster allocs

* am: faster allocs
2025-05-02 16:03:42 +03:00
nimlgen
81410befc2 am: remove sleep from wait_reg (#10139)
* am: remove sleep from wait_reg

* fst

* ooops
2025-05-02 15:46:29 +03:00
nimlgen
45bf7c5b81 am: add allocation bench (#10135)
* init allocation bench

* sorryg

* better
2025-05-02 13:51:07 +03:00
nimlgen
6a845c2de2 amd: fix sigs on xcc path (#10137) 2025-05-02 13:50:56 +03:00
nimlgen
bdd4dd9238 am: do not expect aligned size in valloc (#10136) 2025-05-02 12:19:59 +03:00
Ignacio Sica
8f79492c75 fix test_tensor_cores_codegen for ptx renderer (#10119) 2025-05-01 21:52:36 -03:00
nimlgen
30bd6a619f usb gpu (#8766)
* start gpu

* progress

* fixes

* read correct

* libusb

* libusb works

* support asm24

* hmm

* one access file

* fix extra

* start AMBar

* works on am

* back to usb

* patch fw

* full fast write into a bar

* ugh, minus one gpus, next please

* mute libusb for now

* usb for asm24

* 63

* hmm

* ops

* rescan

* and gpu should be there

* enumerate them?

* usbgpu bus 4, 100% reliable (draft)

* lil

* works

* comments

* add DEBUG

* cleaner

* simplest

* Revert "simplest"

This reverts commit 1d00354c16.

* Revert "cleaner"

This reverts commit c5662de956.

* assert we find gpu

* that's simpler

* this back

* simpler?

* correct

* work

* nonsense

* works with more checks

* this works

* the 6s in the right place

* reliable now

* fix after reboot

* set config

* 1s timeouts

* close to fw loading

* streams

* usbhub works

* endpoints

* fix

* want to test tiny10

* move to tiny 10

* fix gpu

* ugly speed

* smth

* mostly broken, but signals and dmas

* do not reset gpu every time

* changes to run kernels

* ugh, not working

* t10

* pg and sc files

* some prog

* um?

* somehow it works

* patched for 24

* some tries

* minimal

* moving

* back to working

* so sloooooow

* move to controller

* usb.py rewrite

* rework

* cleaner 1

* cleaner 2

* cleaner 3

* new abstractions

* aft merge

* init controller

* cleaner 4

* cleaner 5

* patcher + tiny changes

* ignore that

* cleaner 6

* after rebase

* cleaner 7

* bring it back

* start linter war

* linter 2

* autogen was missing

* fix autogen

* typing

* better?

* mypy

* extra/legacy rename and cleaner

* shuffle

* better printing

* tiny changes and tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-05-01 18:03:47 +03:00
nimlgen
7573c0ef4e amd,nv: use .cpu_view() in bind (#10131) 2025-05-01 17:46:12 +03:00
nimlgen
16e5376ae8 line limit 12800 for usb (#10130) 2025-05-01 16:57:44 +03:00
qazal
0c59c6b8c7 remove replace from Tensor assign [pr] (#10127)
* remove replace from Tensor assign

* assign is contiguous

* allow chaining view

* only assert axis
2025-05-01 19:37:55 +08:00
nimlgen
9caceda79a amd: comgr is not required (#10128) 2025-05-01 13:41:44 +03:00
nimlgen
c3d2e4a6e1 amd: use sdma to copy program (#10126)
* amd: use sdma to copy program

* rm

* ensure prog is copied

* match nv style
2025-05-01 13:04:22 +03:00
nimlgen
09f5be9bcb amd: finalize device in case of failures (#10124) 2025-05-01 10:41:15 +03:00
George Hotz
ef011ff5f9 flip Ops.COPY order [pr] (#10122)
* flip Ops.COPY order [pr]

* fix copy and support multi device copy in _device
2025-05-01 00:26:24 -04:00
chenyu
145e51247a split CAST and BITCAST in PYTHON [pr] (#10123)
CAST only needs truncate and does not require dtype fmt. Added bfloat16 tests that can run locally.
2025-04-30 23:27:35 -04:00
Ignacio Sica
bf5fb97498 fix AMD_LLVM bf16 tc for gfx1100 (#10102)
* fix amd_llvm bf16 tc

* cleanup pattern
2025-04-30 20:06:38 -03:00
George Hotz
dd0070daab Revert "flip Ops.COPY order [pr] (#10120)" (#10121)
This reverts commit 984f09ac74.
2025-04-30 17:25:21 -04:00
George Hotz
984f09ac74 flip Ops.COPY order [pr] (#10120) 2025-04-30 16:50:18 -04:00
chenyu
17d4d258ea simple symbolic slice in llama [pr] (#10112)
support slices that have step None and stop > start
2025-04-30 14:36:35 -04:00
nimlgen
b583ece8f3 amd: replace AMD_DRIVERLESS with AMD_IFACE (#10116)
* amd: replace AMD_DRIVERLESS with AMD_IFACE

* docs

* print direct err for amd_iface

* print for all
2025-04-30 20:22:02 +03:00
nimlgen
0e1beaf44f nv: align copies + better test (#10118) 2025-04-30 20:09:53 +03:00
Ignacio Sica
2941537250 cast is noop if src has dtypes.void (#10110) 2025-04-30 13:55:41 -03:00
nimlgen
fcdda4fc09 am: move boot memory to vram start (#10115) 2025-04-30 19:12:19 +03:00
nimlgen
844d5577d8 hcq: make copy_bufs and kernargs_size params configurable per device (#10114) 2025-04-30 18:43:50 +03:00
nimlgen
2ec3b722e2 nv: fix copies larger than 4g (#10117) 2025-04-30 18:43:17 +03:00
George Hotz
d81acbeef6 multi: move shrink after copy (#10109)
* multi: move shrink after copy

* passing now
2025-04-30 10:29:51 -04:00
qazal
67bd8489ad grouper cleanups [pr] (#10113) 2025-04-30 18:54:47 +08:00
nimlgen
b4c9a3d8f4 hcq: use mmio iface in copies (#10111)
* hcq: use mmio iface in copies

* linter

* fix_am

* am
2025-04-30 11:05:13 +03:00
nimlgen
5c7d004da5 hcq: refactor int ptrs to hcqbuffers (#10105)
* hcq: refactor int ptrs to hcqbuffers

* more refactors

* linter

* use in allocator

* test fix

* fx

* ops

* final?

* simpler

* keep this for now
2025-04-30 00:12:18 +03:00
chenyu
573bbb9746 Revert "remove TransformerBlock contiguous in llama (#10104)" (#10108)
This reverts commit b8d07dcc54.
2025-04-29 15:28:38 -04:00
chenyu
4a04098389 fix llama3 with nf4 quantize (#10107)
also the int8 output was wrong
2025-04-29 15:14:36 -04:00
George Hotz
9c1b80499f names for graph rewrites + null device supports exp and friends (#10106) 2025-04-29 14:28:20 -04:00
chenyu
b8d07dcc54 remove TransformerBlock contiguous in llama (#10104) 2025-04-29 14:15:39 -04:00
Ignacio Sica
9d5677c12c fix ptx linearizer bug 2 [pr] (#9967)
* check for local buffer

* hotfix

* add test_tensor_cores_emulation run for ptx
2025-04-29 14:30:07 -03:00
qazal
a59d18da21 hack for VIZ=1 with examples/llama (#10103)
* hack for VIZ=1 with examples/llama

* move it alongside BEAM=0
2025-04-29 23:42:17 +08:00
qazal
93bf8764f2 do not open devices in lowering (#10101)
* do not open devices in lowering [pr]

* ctx=opts

* ctx

* fuzz test
2025-04-29 23:18:16 +08:00
George Hotz
c3ff308abb range has only one src now [pr] (#10100)
* range has only one op now

* fix z3 checker

* ci fix

* needs shell

* try pip ensure update

* that ensurepip is useless

* upgrade pip before cache

* windows happy?
2025-04-29 10:31:05 -04:00
George Hotz
427471550a hotfix: amd tflops to 74 and some external_benchmark_sdxl_softmax stuff 2025-04-29 09:02:27 -04:00
Ignacio Sica
58cf8cd493 add support for "shared_mem" for LLVM (#10093)
* init llvm shared

* add test_tensor_cores_emulation run for llvm
2025-04-29 08:56:36 -04:00
qazal
ad7546c931 assert in test_indexing_two_bind instead of silent fail (#10099)
* assert in test_indexing_two_bind instead of silent fail

* debuggable

* skip test_simple_train
2025-04-29 20:23:25 +08:00
George Hotz
cee220a1ab always expand ssa on wheres (#9697)
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-04-29 20:08:41 +08:00
qazal
3b67f56c02 kernelize some llama realizes (#10098) 2025-04-29 18:39:56 +08:00
qazal
cbf7347cd6 display viz rewrites with tabbing if they are subrewrites (#10097)
* display viz rewrites with tabbing if they are subrewrites

* update viz api
2025-04-29 17:57:21 +08:00
George Hotz
73c2f6602f test sdxl softmax (#10096) 2025-04-28 21:55:50 -04:00
George Hotz
eaceafecae do fusion locally (#10095)
* do fusion locally

* oops, that's the right way

* explicit delete closure
2025-04-28 20:45:37 -04:00
chenyu
3eba3d6ee9 don't pass model in convert_from_huggingface and convert_from_gguf (#10094)
it only needs n_layers
2025-04-28 20:11:19 -04:00
George Hotz
a2d0684fc1 test_attention_simple_view (#10092)
* test_attention_simple_view

* correct comment
2025-04-28 20:01:22 -04:00
Ignacio Sica
bda116d773 fix use_tensor_cores propagation (#10048)
* propagate use_tensor_cores

* add use_tensor_core to arg in test and search

* bugfix

* get TC val from ContextVar in search

* revert minor space change

* add tc emulation test to ci and benchmark

* revert

* revert whitespace change

* remove test for ptx

* add comment and remove llvm test run
2025-04-28 19:30:50 -03:00