qazal
230a369708
remove some IGNORE_OOB [pr] (#10142)
* remove some IGNORE_OOB
* remove fuzz_schedule stuff
* test with global
* add for amd ci
2025-05-03 01:16:14 +03:00
qazal
1ed5d733bd
disable TRACK_MATCH_STATS for type_verify (#10141)
2025-05-02 20:59:19 +03:00
nimlgen
993f0a0e87
am: a bit faster alloc (#10138)
* am: a bit faster allocs
* am: faster allocs
2025-05-02 16:03:42 +03:00
nimlgen
81410befc2
am: remove sleep from wait_reg (#10139)
* am: remove sleep from wait_reg
* fst
* ooops
2025-05-02 15:46:29 +03:00
nimlgen
45bf7c5b81
am: add allocation bench (#10135)
* init allocation bench
* sorry
* better
2025-05-02 13:51:07 +03:00
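The bench itself isn't shown here, but a minimal sketch of what an allocation micro-benchmark like this measures, timing alloc/free round-trips through the device allocator (the device name, sizes, and iteration count are illustrative assumptions, not the values from the actual bench):

```python
import time
from tinygrad import Device

dev = Device["AMD"]  # assumes an AM-backed AMD device is visible
for size in (4096, 1 << 20, 1 << 28):
  st = time.perf_counter()
  for _ in range(100):
    buf = dev.allocator.alloc(size)  # allocate `size` bytes of device memory
    dev.allocator.free(buf, size)    # return it to the allocator
  print(f"size {size:>12}: {(time.perf_counter()-st)/100*1e6:8.2f} us per alloc/free")
```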
nimlgen
6a845c2de2
amd: fix sigs on xcc path (#10137)
2025-05-02 13:50:56 +03:00
nimlgen
bdd4dd9238
am: do not expect aligned size in valloc (#10136)
2025-05-02 12:19:59 +03:00
Ignacio Sica
8f79492c75
fix test_tensor_cores_codegen for ptx renderer (#10119)
2025-05-01 21:52:36 -03:00
nimlgen
30bd6a619f
usb gpu (#8766)
* start gpu
* progress
* fixes
* read correct
* libusb
* libusb works
* support asm24
* hmm
* one access file
* fix extra
* start AMBar
* works on am
* back to usb
* patch fw
* full fast write into a bar
* ugh, minus one gpus, next please
* mute libusb for now
* usb for asm24
* 63
* hmm
* ops
* rescan
* and gpu should be there
* enumerate them?
* usbgpu bus 4, 100% reliable (draft)
* lil
* works
* comments
* add DEBUG
* cleaner
* simplest
* Revert "simplest"
This reverts commit 1d00354c16.
* Revert "cleaner"
This reverts commit c5662de956.
* assert we find gpu
* that's simpler
* this back
* simpler?
* correct
* work
* nonsense
* works with more checks
* this works
* the 6s in the right place
* reliable now
* fix after reboot
* set config
* 1s timeouts
* close to fw loading
* streams
* usbhub works
* endpoints
* fix
* want to test tiny10
* move to tiny 10
* fix gpu
* ugly speed
* smth
* mostly broken, but signals and dmas
* do not reset gpu every time
* changes to run kernels
* ugh, not working
* t10
* pg and sc files
* some prog
* um?
* somehow it works
* patched for 24
* some tries
* minimal
* moving
* back to working
* so sloooooow
* move to controller
* usb.py rewrite
* rework
* cleaner 1
* cleaner 2
* cleaner 3
* new abstractions
* aft merge
* init controller
* cleaner 4
* cleaner 5
* patcher + tiny changes
* ignore that
* cleaner 6
* after rebase
* cleaner 7
* bring it back
* start linter war
* linter 2
* autogen was missing
* fix autogen
* typing
* better?
* mypy
* extra/legacy rename and cleaner
* shuffle
* better printing
* tiny changes and tests
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-05-01 18:03:47 +03:00
nimlgen
7573c0ef4e
amd,nv: use .cpu_view() in bind (#10131)
2025-05-01 17:46:12 +03:00
nimlgen
16e5376ae8
line limit 12800 for usb (#10130)
2025-05-01 16:57:44 +03:00
qazal
0c59c6b8c7
remove replace from Tensor assign [pr] (#10127)
* remove replace from Tensor assign
* assign is contiguous
* allow chaining view
* only assert axis
2025-05-01 19:37:55 +08:00
nimlgen
9caceda79a
amd: comgr is not required (#10128)
2025-05-01 13:41:44 +03:00
nimlgen
c3d2e4a6e1
amd: use sdma to copy program (#10126)
* amd: use sdma to copy program
* rm
* ensure prog is copied
* match nv style
2025-05-01 13:04:22 +03:00
nimlgen
09f5be9bcb
amd: finalize device in case of failures (#10124)
2025-05-01 10:41:15 +03:00
George Hotz
ef011ff5f9
flip Ops.COPY order [pr] (#10122)
* flip Ops.COPY order [pr]
* fix copy and support multi device copy in _device
2025-05-01 00:26:24 -04:00
chenyu
145e51247a
split CAST and BITCAST in PYTHON [pr] (#10123)
CAST only needs truncate and does not require dtype fmt; added bfloat16 tests that can run locally
2025-04-30 23:27:35 -04:00
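The distinction the commit body leans on: a value CAST can be done by truncation alone, while a BITCAST reinterprets raw bytes and so needs each dtype's struct fmt. A standalone sketch of the two (not the PYTHON backend's actual code):

```python
import struct

def cast_f2i(x: float) -> int:
  # value cast: float -> int32 only needs truncation toward zero
  return int(x)

def bitcast_f2i(x: float) -> int:
  # bit cast: reinterpret the same 4 bytes, which needs the struct
  # formats of both dtypes ("f" and "I"), unlike a plain CAST
  return struct.unpack("I", struct.pack("f", x))[0]

assert cast_f2i(2.75) == 2               # truncated value
assert bitcast_f2i(1.0) == 0x3F800000    # same bits reinterpreted
```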
Ignacio Sica
bf5fb97498
fix AMD_LLVM bf16 tc for gfx1100 (#10102)
...
* fix amd_llvm bf16 tc
* cleanup pattern
2025-04-30 20:06:38 -03:00
George Hotz
dd0070daab
Revert "flip Ops.COPY order [pr] ( #10120 )" ( #10121 )
This reverts commit 984f09ac74.
2025-04-30 17:25:21 -04:00
George Hotz
984f09ac74
flip Ops.COPY order [pr] (#10120)
2025-04-30 16:50:18 -04:00
chenyu
17d4d258ea
simple symbolic slice in llama [pr] (#10112)
support slices that have step None and stop > start
2025-04-30 14:36:35 -04:00
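Illustrative use, in the spirit of the kv-cache slicing in examples/llama (the exact call site is an assumption): a slice whose stop is a bound Variable, with step None and stop known to exceed start, lowers to a plain shrink:

```python
from tinygrad import Tensor, Variable

cache = Tensor.zeros(1, 8, 128, 64).contiguous()    # (bs, heads, max_ctx, head_dim)
start_pos = Variable("start_pos", 1, 128).bind(17)  # symbolic position, bound for this step
keys = cache[:, :, 0:start_pos]                     # step is None and stop > start: a pure shrink
print(keys.shape)                                   # (1, 8, start_pos, 64)
```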
nimlgen
b583ece8f3
amd: replace AMD_DRIVERLESS with AMD_IFACE (#10116)
* amd: replace AMD_DRIVERLESS with AMD_IFACE
* docs
* print direct err for amd_iface
* print for all
2025-04-30 20:22:02 +03:00
nimlgen
0e1beaf44f
nv: align copies + better test (#10118)
2025-04-30 20:09:53 +03:00
Ignacio Sica
2941537250
cast is noop if src has dtypes.void (#10110)
2025-04-30 13:55:41 -03:00
nimlgen
fcdda4fc09
am: move boot memory to vram start (#10115)
2025-04-30 19:12:19 +03:00
nimlgen
844d5577d8
hcq: make copy_bufs and kernargs_size params configurable per device (#10114)
2025-04-30 18:43:50 +03:00
nimlgen
2ec3b722e2
nv: fix copies larger than 4g (#10117)
2025-04-30 18:43:17 +03:00
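The commit has no body, but the generic pattern such a fix implies (a hypothetical sketch, not the NV queue's real API): a copy engine whose size field is 32-bit can move at most 4 GiB per command, so larger transfers are issued as multiple commands:

```python
def copy_chunked(enqueue_copy, dest: int, src: int, size: int,
                 max_chunk: int = (1 << 32) - (1 << 12)):
  # split one logical copy into commands that each fit the engine's
  # per-command limit; enqueue_copy is a hypothetical stand-in
  off = 0
  while off < size:
    step = min(size - off, max_chunk)
    enqueue_copy(dest + off, src + off, step)
    off += step
```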
George Hotz
d81acbeef6
multi: move shrink after copy (#10109)
* multi: move shrink after copy
* passing now
2025-04-30 10:29:51 -04:00
qazal
67bd8489ad
grouper cleanups [pr] (#10113)
2025-04-30 18:54:47 +08:00
nimlgen
b4c9a3d8f4
hcq: use mmio iface in copies (#10111)
* hcq: use mmio iface in copies
* linter
* fix_am
* am
2025-04-30 11:05:13 +03:00
nimlgen
5c7d004da5
hcq: refactor int ptrs to hcqbuffers (#10105)
* hcq: refactor int ptrs to hcqbuffers
* more refactors
* linter
* use in allocator
* test fix
* fx
* ops
* final?
* simpler
* keep this for now
2025-04-30 00:12:18 +03:00
chenyu
573bbb9746
Revert "remove TransformerBlock contiguous in llama ( #10104 )" ( #10108 )
This reverts commit b8d07dcc54.
2025-04-29 15:28:38 -04:00
chenyu
4a04098389
fix llama3 with nf4 quantize (#10107)
also int8 outputs are wrong
2025-04-29 15:14:36 -04:00
George Hotz
9c1b80499f
names for graph rewrites + null device supports exp and friends (#10106)
2025-04-29 14:28:20 -04:00
chenyu
b8d07dcc54
remove TransformerBlock contiguous in llama (#10104)
2025-04-29 14:15:39 -04:00
Ignacio Sica
9d5677c12c
fix ptx linearizer bug 2 [pr] (#9967)
* check for local buffer
* hotfix
* add test_tensor_cores_emulation run for ptx
2025-04-29 14:30:07 -03:00
qazal
a59d18da21
hack for VIZ=1 with examples/llama (#10103)
* hack for VIZ=1 with examples/llama
* move it alongside BEAM=0
2025-04-29 23:42:17 +08:00
qazal
93bf8764f2
do not open devices in lowering (#10101)
* do not open devices in lowering [pr]
* ctx=opts
* ctx
* fuzz test
2025-04-29 23:18:16 +08:00
George Hotz
c3ff308abb
range has only one src now [pr] (#10100)
* range has only one op now
* fix z3 checker
* ci fix
* needs shell
* try pip ensure update
* that ensurepip is useless
* upgrade pip before cache
* windows happy?
2025-04-29 10:31:05 -04:00
George Hotz
427471550a
hotfix: amd tflops to 74 and some external_benchmark_sdxl_softmax stuff
2025-04-29 09:02:27 -04:00
Ignacio Sica
58cf8cd493
add support for "shared_mem" for LLVM ( #10093 )
* init llvm shared
* add test_tensor_cores_emulation run for llvm
2025-04-29 08:56:36 -04:00
qazal
ad7546c931
assert in test_indexing_two_bind instead of silent fail (#10099)
* assert in test_indexing_two_bind instead of silent fail
* debuggable
* skip test_simple_train
2025-04-29 20:23:25 +08:00
George Hotz
cee220a1ab
always expand ssa on wheres (#9697)
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-04-29 20:08:41 +08:00
qazal
3b67f56c02
kernelize some llama realizes (#10098)
2025-04-29 18:39:56 +08:00
qazal
cbf7347cd6
display viz rewrites with tabbing if they are subrewrites (#10097)
* display viz rewrites with tabbing if they are subrewrites
* update viz api
2025-04-29 17:57:21 +08:00
George Hotz
73c2f6602f
test sdxl softmax (#10096)
2025-04-28 21:55:50 -04:00
George Hotz
eaceafecae
do fusion locally (#10095)
* do fusion locally
* oops, that's the right way
* explicit delete closure
2025-04-28 20:45:37 -04:00
chenyu
3eba3d6ee9
don't pass model in convert_from_huggingface and convert_from_gguf (#10094)
it only needs n_layers
2025-04-28 20:11:19 -04:00
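A hedged sketch of the signature change this describes: the converters only read the model's layer count, so taking n_layers directly drops the need to construct a model first (names mirror examples/llama.py; the body shown is illustrative, not the full key mapping):

```python
def convert_from_huggingface(weights: dict, n_layers: int, n_heads: int, n_kv_heads: int) -> dict:
  # build the per-layer key remapping from n_layers alone; previously this
  # came from len(model.layers), forcing callers to pass a whole model
  keymap = {
    "model.embed_tokens.weight": "tok_embeddings.weight",
    **{f"model.layers.{l}.input_layernorm.weight": f"layers.{l}.attention_norm.weight" for l in range(n_layers)},
    # remaining per-layer attention/ffn mappings elided in this sketch
  }
  return {keymap.get(k, k): v for k, v in weights.items()}
```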
George Hotz
a2d0684fc1
test_attention_simple_view (#10092)
* test_attention_simple_view
* correct comment
2025-04-28 20:01:22 -04:00
Ignacio Sica
bda116d773
fix use_tensor_cores propagation (#10048)
* propagate use_tensor_cores
* add use_tensor_core to arg in test and search
* bugfix
* get TC val from ContextVar in search
* revert minor space change
* add tc emulation test to ci and benchmark
* revert
* revert whitespace change
* remove test for ptx
* add comment and remove llvm test run
2025-04-28 19:30:50 -03:00