Commit Graph

1242 Commits

Author SHA1 Message Date
nimlgen
0788659d08 usbgpu: fast cold boot (#10260)
* usbgpu: fast cold boot

* cleaner

* assert

* xx

* compat

* fix

* fix
2025-05-14 14:58:55 +03:00
geohotstan
1c4ab6b991 ONNX add tests against ORT (#10270)
* start

* clean up

* indicate file location too
2025-05-13 04:03:52 -04:00
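The entry above adds ONNX tests that compare outputs against onnxruntime (ORT). A hedged sketch of that comparison pattern, assuming a local `model.onnx` with a single float32 input named `x` of shape (1, 3, 224, 224); the `run_onnx_with_tinygrad` helper mentioned in the comments is hypothetical and stands in for the implementation under test:

```python
import numpy as np
import onnxruntime as ort

# reference run through onnxruntime; the model path and the input name "x" are assumptions
sess = ort.InferenceSession("model.onnx")
x = np.random.randn(1, 3, 224, 224).astype(np.float32)
ref_outputs = sess.run(None, {"x": x})

# the runtime under test produces its own outputs for the same feed, e.g.
# test_outputs = run_onnx_with_tinygrad("model.onnx", {"x": x})   # hypothetical helper
# and each output is then checked against the ORT reference within a tolerance:
# for got, ref in zip(test_outputs, ref_outputs):
#   np.testing.assert_allclose(got, ref, rtol=1e-3, atol=1e-4)
```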
nimlgen
bb31cc4582 usbgpu: check hash in patcher (#10266) 2025-05-12 21:08:53 +03:00
George Hotz
8864ff894b hotfix: that repeat_kv belongs outside the if 2025-05-11 18:43:01 -07:00
George Hotz
98c84a711d min rectified flow example [pr] (#10252)
* work on minrf example

* more

* jit sample

* t is tensor not const

* fixes

* more convs

* fix dropout

* don't print

* 504

* big patch

* onehot

* touch

* use embeddings

* dumb uses final layer

* act

* non fl

* match

* tp

* 3

* of

* ppsz

* normal

* add adln

* no t

* weird transformer

* weird transformer

* contig

* actual speed fix

* dumb

* cb

* 0

* t is 0

* mort-t

* args

* dumb days are over

* readable

* contig

* no more t mask

* mask_t

* init to zero

* clean

* steps

* work

* tt

* t

* solid
2025-05-11 18:36:44 -07:00
qazal
9210280811 add v_fmac_f16 vop3 instruction to remu (#10247)
* fmac vop3

* from the box
2025-05-10 23:48:25 +03:00
nimlgen
116390083f nvme speed write example (#10230) 2025-05-09 14:20:01 +03:00
Xingyu
a21369d039 Enhance tensor random functions with dtype support (#10214)
* Enhance tensor random functions with dtype support
- Updated `aten.uniform_` and `aten.normal_` to include dtype parameter in backend.py
- Added unit tests for uniform and normal tensor generation with specific dtypes in test.py

* Refactor test name for clarity
- Renamed `test_normal_dtype` to `test_normal` in `extra/torch_backend/test.py`
- Aims to improve readability and better reflect the test's purpose
2025-05-08 20:48:07 -04:00
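A rough sketch of what the change above enables, assuming the tinygrad torch backend is importable as `extra.torch_backend.backend` and registers a `"tiny"` device (both assumptions here): in-place `uniform_`/`normal_` on tensors whose dtype is not the default float32.

```python
import torch
import extra.torch_backend.backend  # noqa: F401  # assumption: this import registers the "tiny" device

# uniform_/normal_ now respect the tensor's dtype rather than assuming the default float32
x = torch.empty(4, 4, dtype=torch.float16, device="tiny").uniform_(0.0, 1.0)
y = torch.empty(4, 4, dtype=torch.float16, device="tiny").normal_(mean=0.0, std=1.0)
print(x.dtype, y.dtype)  # torch.float16 torch.float16
```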
qazal
4ea3e373aa decode lds ops in remu (#10184) 2025-05-07 16:44:18 +08:00
Ignacio Sica
74c25bdc8b add support for ds_load_u8 in remu (#10180)
* add support for ds_load_u8 in remu

* add test for ds_load_u8
2025-05-06 20:31:00 +03:00
nimlgen
34d55857cf usbgpu: more devs in scan_pci (#10171) 2025-05-06 11:55:34 +03:00
nimlgen
30bd6a619f usb gpu (#8766)
* start gpu

* progress

* fixes

* read correct

* libusb

* libusb works

* support asm24

* hmm

* one access file

* fix extra

* start AMBar

* works on am

* back to usb

* patch fw

* full fast write into a bar

* ugh, minus one gpus, next please

* mute libusb for now

* usb for asm24

* 63

* hmm

* ops

* rescan

* and gpu should be there

* enumerate them?

* usbgpu bus 4, 100% reliable (draft)

* lil

* works

* comments

* add DEBUG

* cleaner

* simplest

* Revert "simplest"

This reverts commit 1d00354c16.

* Revert "cleaner"

This reverts commit c5662de956.

* assert we find gpu

* that's simpler

* this back

* simpler?

* correct

* work

* nonsense

* works with more checks

* this works

* the 6s in the right place

* reliable now

* fix after reboot

* set config

* 1s timeouts

* close to fw loading

* streams

* usbhub works

* endpoints

* fix

* want to test tiny10

* move to tiny 10

* fix gpu

* ugly speed

* smth

* mostly broken, but signals and dmas

* do not reset gpu every time

* changes to run kernels

* ugh, not working

* t10

* pg and sc files

* some prog

* um?

* somehow it works

* patched for 24

* some tries

* minimal

* moving

* back to working

* so sloooooow

* move to controller

* usb.py rewrite

* rework

* cleaner 1

* cleaner 2

* cleaner 3

* new abstractions

* aft merge

* init controller

* cleaner 4

* cleaner 5

* patcher + tiny changes

* ignore that

* cleaner 6

* after rebase

* cleaner 7

* bring it back

* start linter war

* linter 2

* autogen was missing

* fix autogen

* typing

* better?

* mypy

* extra/legacy rename and cleaner

* shuffle

* better printing

* tiny changes and tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-05-01 18:03:47 +03:00
chenyu
17d4d258ea simple symbolic slice in llama [pr] (#10112)
support slices that have step None and stop > start
2025-04-30 14:36:35 -04:00
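A minimal sketch of the kind of slice the entry above enables, assuming `Variable` is exported from the top-level `tinygrad` package and is used the way the llama kv-cache uses it (both assumptions; the shapes are made up):

```python
from tinygrad import Tensor, Variable

# a kv-cache-like buffer and a symbolic position, bound to a concrete value for this step
cache = Tensor.zeros(1, 128, 8, 64).contiguous().realize()
start_pos = Variable("start_pos", 1, 128).bind(32)

# slice with step None and stop > start: the stop is symbolic, so the same kernel
# can be reused for every value of start_pos instead of recompiling per step
window = cache[:, 0:start_pos]
print(window.shape)  # second dim is the symbolic start_pos, bound here to 32
```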
nimlgen
fcdda4fc09 am: move boot memory to vram start (#10115) 2025-04-30 19:12:19 +03:00
chenyu
573bbb9746 Revert "remove TransformerBlock contiguous in llama (#10104)" (#10108)
This reverts commit b8d07dcc54.
2025-04-29 15:28:38 -04:00
chenyu
b8d07dcc54 remove TransformerBlock contiguous in llama (#10104) 2025-04-29 14:15:39 -04:00
qazal
3b67f56c02 kernelize some llama realizes (#10098) 2025-04-29 18:39:56 +08:00
chenyu
3eba3d6ee9 don't pass model in convert_from_huggingface and convert_from_gguf (#10094)
it only needs n_layers
2025-04-28 20:11:19 -04:00
George Hotz
690dac79b5 don't modify the ranges on reduce rewrite (#10062)
* bug in div range folding

* simpler

* oh, this is right for indexing, but the div mod folding needs to be fixed

* reenable

* Passing test_complexity_w_unroll2 (#10068)

* Passing

* remove non_folded_divs

* Add check for negative term in div folding

* Add test

* bump that limit

* fix casted

---------

Co-authored-by: Sieds Lykles <93992551+S-Lykles@users.noreply.github.com>
2025-04-28 12:01:19 -04:00
qazal
ac37510f60 remu: only write v_cmp result if exec is set (#10084) 2025-04-28 20:31:52 +08:00
qazal
d6b436a815 remu bugfix with -0.0 negation (#10082) 2025-04-28 15:46:42 +08:00
George Hotz
ea5dddc537 reduce collapse generic (#10045)
* reduce collapse generic

* new arange folder

* new range folding

* correct with sym

* all tests pass

* indexing ops passes

* failing tests

* fix tests, remove unused

* revert that

* torch indexing is fast

* skip on webgpu

* touchups

* comments
2025-04-26 09:13:24 -04:00
qazal
e1d2b64e92 remu new instructions (#10050)
* remu new instructions

* test_ds_store_half

* test_v_mul_f16
2025-04-26 02:04:12 +03:00
qazal
bba5d0a3e4 remu refactors (#10028)
* remu refactors

* scc is sgpr 253

* remove that

* rename to vcc_lo

* run cargo test in CI

* llvm-mc

* meh

* work

* work_group work 1

* seeded_lanes is dumb

* better than seeded_lanes

* does not need to be address

* 128 sgpr per wave

* scc is sgpr, we don't know which one

* null_src once more

* derive clone, wave init is cleaner

* init comes first
2025-04-26 04:31:10 +08:00
nimlgen
0fc85a2b0a hcqfuzz: init (#10049)
* hcqfuzz: init

* fix fuzz

* linter

* graph

* that test

* update readme
2025-04-25 23:19:21 +03:00
chenyu
74c6cf8be3 lint mlperf model_train (#10038) 2025-04-24 16:19:44 -04:00
Nishant Rajadhyaksha
55942a8d8e [Bounty] moved index_tensor off cpu in torch_backend (#9916)
* moved index tensor off cpu in torch_backend

* added support for None based indexing

* fix_to_pass_tests

* fix segfault tests
2025-04-24 14:12:37 -04:00
qazal
0b482fb824 add RDNA3 parser to remu (#10025)
* llvm ref

* work

* all of them

* salu

* cleaner

* start

* vector ops

* done

* replace SMEM

* vopd

* sop1

* SOPC

* null stays null_src

* sopp

* SOPK

* sop2

* vop1

* vop2

* remove allow(dead_code)

* vopc
2025-04-24 21:34:07 +08:00
Sieds Lykles
e75be6eafc [bounty] [pr] index validation with z3 (#9981)
* index validation with z3

* Change comment

* toposort -> toposort()

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-04-24 08:06:08 -04:00
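The bounty above validates indices with z3. A hedged sketch of the underlying idea (the variable names and bounds below are made up for illustration): encode the loop variable's range, negate the in-bounds condition, and ask the solver whether any counterexample exists.

```python
import z3

ridx = z3.Int("ridx0")                       # loop index of a generated kernel
solver = z3.Solver()
solver.add(0 <= ridx, ridx < 32)             # range guaranteed for the index

# access pattern ridx*4 .. ridx*4+3 into a 128-element buffer; negate the safety condition
in_bounds = z3.And(0 <= ridx * 4, ridx * 4 + 3 < 128)
solver.add(z3.Not(in_bounds))

print(solver.check())  # unsat: no value of ridx0 can take the access out of bounds
```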
Park Jun
c3ad7b2a84 create randperm and support pytorch backend (#10019) 2025-04-24 07:29:02 -04:00
Matthew Daiter
b545338e59 isin_Tensor_out added (#10018) 2025-04-24 07:26:51 -04:00
nimlgen
1c5e353249 am: use mmio iface (#10012)
* am: use mmio iface

* linters

* fixes

* fixes + cleanups

* mute

* mypy

* style
2025-04-24 00:27:04 +03:00
Francis Lata
defa1e77f6 get the proper dataset count (#9962) 2025-04-21 12:11:37 -04:00
Francis Lata
d7e247f329 RetinaNet INITMLPERF support (#9950)
* fixes to make fake data work

* fix eval beam

* fix merge issue
2025-04-21 10:32:05 -04:00
akhuntsaria
2d423e6737 fix assertion message for supported device in export_model (#9957) 2025-04-21 09:23:44 -04:00
qazal
e20ef7196a Tensor.kernelize (#9845)
* add kernelize

* remove that

* kernelize returns self

* update abstractions2.py

* kernelize in test_schedule

* temp: assert BUFFER_VIEW's existence

* ASSIGN must have a buffer or subbuffer target

* assert and shrink

* fix

* padded setitem

* var

* toposort once

* extra

* base_buffer

* end with BUFFER_VIEW

* setitem for disk

* test_setitem_becomes_subbuffer

* mul slice test

* torch backend fix 1

* non-deterministic

* keep subbuffer
2025-04-20 20:53:49 +08:00
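A minimal sketch of the API added above, assuming the usual top-level import; per the entry, `kernelize` groups the pending lazy graph into kernels without executing it and returns the tensor itself, so calls can be chained.

```python
from tinygrad import Tensor

a, b = Tensor.rand(16), Tensor.rand(16)
out = ((a + b) * 2).kernelize()   # groups the pending graph into kernels; returns out itself

# realization later reuses the kernelized graph
print(out.numpy())
```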
chenyu
6c30948df6 hand_coded_optimizations returns list[Opt] [pr] (#9938)
new API looks like `k.apply_opts(hand_coded_optimizations(k))`
2025-04-19 20:26:59 -04:00
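A rough sketch of the API quoted above; the import paths follow the `heuristic.py` move described further down this log, but treat them as assumptions.

```python
from tinygrad import Tensor
from tinygrad.codegen.kernel import Kernel
from tinygrad.codegen.heuristic import hand_coded_optimizations

# build a kernel from the last scheduled item of a small matmul
out = Tensor.rand(64, 64) @ Tensor.rand(64, 64)
si = out.schedule()[-1]
k = Kernel(si.ast)

# hand_coded_optimizations now returns a list[Opt] that is applied explicitly
k.apply_opts(hand_coded_optimizations(k))
```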
chenyu
720f20865b remove required_optimizations (#9848) 2025-04-19 16:51:16 -04:00
qazal
16dfe0a902 upstream remu (#9921) 2025-04-18 01:57:36 +03:00
chenyu
f5256e0020 Kernel.apply_opts [pr] (#9917)
* Kernel.apply_opts [pr]

updated all `for opt in` loops. also updated a few test_linearizer tests to not implicitly depend on hand_coded_optimizations

* not you yet
2025-04-17 08:00:56 -04:00
Xingyu
047c8fd70d Add amax support to Tensor operations in Torch Backend (#9905)
* Add amax support to Tensor operations
- Implemented amax function in backend.py for tensor max operations.
- Added unit tests for amax in test.py to ensure correct functionality.

* Fix formatting in amax output function
- Adjusted spacing in the amax output lambda function in backend.py
- Improved code readability for better maintenance
2025-04-16 10:35:50 +01:00
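A short sketch of the op the entry above wires up, under the same assumptions as the earlier dtype sketch (the backend is importable as `extra.torch_backend.backend` and exposes a `"tiny"` device):

```python
import torch
import extra.torch_backend.backend  # noqa: F401  # assumption: registers the "tiny" device

x = torch.arange(12.0).reshape(3, 4).to("tiny")
print(torch.amax(x, dim=1))   # per-row maxima: [3., 7., 11.]
print(torch.amax(x))          # global maximum: 11.
```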
geohotstan
4e8f25109a Revert "ONNX add output shape validation (#9720)" (#9904)
This reverts commit ac713e04db.
2025-04-16 03:15:56 -04:00
nimlgen
7c466c24f7 am_smi: refactor to support arches (#9864)
* am_smi: refactor to support arches

* shorter
2025-04-12 20:37:01 +03:00
chenyu
8c6299bced move hand_coded_optimizations to heuristic.py [pr] (#9844)
* move hand_coded_optimizations to heuristic.py [pr]

also folded all long lines

* make a copy and rename self -> k

* fix test
2025-04-10 23:40:16 -04:00
Francis Lata
eb2e59db42 RetinaNet model type annotations and loss functions (#9822)
* add type annotations and loss functions for training

* combine sum of multiple dims inside loss functions
2025-04-10 00:31:37 -04:00
Francis Lata
7bb36d71b2 remove openimages iterate (#9820) 2025-04-09 22:54:12 -04:00
chenyu
c5db5b83b9 add SHOULD_USE_TC=1 check to simple_matmul (#9802)
* add SHOULD_USE_TC=1 check to simple_matmul

also zero-centered the random input and updated atol for tf32

* ATOL=2e-2 for HALF
2025-04-09 02:24:42 -04:00
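A hedged sketch of how the check added above might be exercised; the script path `extra/gemm/simple_matmul.py` is an assumption, while `SHOULD_USE_TC=1` and `HALF` come from this entry.

```python
import os, subprocess

# SHOULD_USE_TC=1 makes the matmul script verify that tensor cores were actually applied
env = {**os.environ, "SHOULD_USE_TC": "1", "HALF": "1"}
subprocess.run(["python3", "extra/gemm/simple_matmul.py"], env=env, check=True)
```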
George Hotz
78caf55154 Revert "FP8 support on NVIDIA (#8631)"
This reverts commit 2c8e4ea865.
2025-04-09 12:27:41 +08:00
George Hotz
14928fecff Revert "fix TF32 tensor core dropped in tc_sm89 (#9798)"
This reverts commit 7c9a96824f.
2025-04-09 12:27:39 +08:00
chenyu
7c9a96824f fix TF32 tensor core dropped in tc_sm89 (#9798)
also add `SHOULD_USE_TC=1` to verify TC is applied in simple_matmul
2025-04-08 23:20:50 -04:00