Commit Graph

1363 Commits

Author SHA1 Message Date
geohotstan
602a145f8f Add Tensor.unfold (#10518)
* yoinked 10272

* eitanturok's fixes

* hmmm should size be sint?

* add test
2025-05-26 11:15:44 -04:00
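The `Tensor.unfold` added here follows the `torch.Tensor.unfold` convention: sliding windows of a given size, advancing by a given step, along one dimension. A minimal pure-Python sketch of that semantics (a hypothetical helper for illustration, not the tinygrad implementation):

```python
def unfold_1d(xs, size, step):
    """Return sliding windows of `size` elements, advancing by `step`.

    Mirrors torch.Tensor.unfold along a single dimension; the number of
    windows produced is (len(xs) - size) // step + 1.
    """
    return [xs[i:i + size] for i in range(0, len(xs) - size + 1, step)]

windows = unfold_1d([0, 1, 2, 3, 4, 5], size=3, step=2)
# windows == [[0, 1, 2], [2, 3, 4]] -- windows overlap when step < size
```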
nimlgen
deb369417c am_smi: print device usage (#10520)
* am_smi: print device usage

* tiny comments
2025-05-26 17:17:56 +03:00
geohotstan
fd9f236a82 move test over (#10508) 2025-05-25 21:51:51 -04:00
George Hotz
941cbd3471 hotfix: amd works on arch linux w/o rocm 2025-05-24 16:47:13 -07:00
nimlgen
d90ddcc365 nv: blackwell support (#10487)
* nv: blackwell support

* fixes

* hm

* h

* fixes

* mypy

* xx

* yy

* arr

* revert

* oops

* unrelated
2025-05-24 18:23:53 +03:00
chenyu
dc6309242d WallTimeEvent for mlperf ci (#10506) 2025-05-24 10:56:03 -04:00
Panagiotis Kourouklidis
e21836952d mmapeak implementation for 7900 XTX (#10417)
* Add mmapeak implementation for 7900 XTX

* Change indentation

* Use a template instead of multiple assembly files

* Fix output formatting

* Reduce register file bank conflicts

* More accurate measurement for quick instructions

* Add support for gfx1201

* RDNA4 wmma requires fewer VGPRs

* RDNA4 does not have s_cmpk instructions

* Add v_wmma_i32_16x16x32_iu4 for gfx1201

* Add sparse wmma instructions

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-05-23 16:26:12 -07:00
George Hotz
0a313d98a0 add rocm 6.4 support (#10491)
* add rocm 6.4 support

* update to newer amdcomgr, assert lang is right

* fix aux-triple
2025-05-23 16:20:54 -07:00
Xingyu
1e0a59aca4 fix: handle buffer size calculation in to_movement_ops and add scalar assignment test in torch_backend (#10464) 2025-05-22 10:54:13 -07:00
George Hotz
577a0b4cfa openpilot compile4 (wip) (#10407)
* openpilot compile4

* add copies

* remove junk
2025-05-22 10:47:34 -07:00
qazal
7720c1aef1 hotfix: remove viz_sz.py [pr] (#10446) 2025-05-21 14:17:42 +03:00
qazal
df4cbb69e9 move fuzz_schedule.py to extra [pr] (#10444) 2025-05-21 10:07:24 +03:00
qazal
8a6fb37560 move viz /prof to extra [pr] (#10401) 2025-05-18 23:25:59 +03:00
George Hotz
411392dfb7 move files into uop dir (#10399)
* move files into uop dir [pr]

* tinygrad.uop is a thing

* fix uop docs, no pr

* fix viz
2025-05-18 11:38:28 -07:00
qazal
17f0f5e764 add v_rcp_f32_e64 to remu (#10393)
* tests from the box

* add v_rcp_f32_e64 to remu

* f32::from_bits utils

* v_cndmask_b32 tests
2025-05-18 17:08:21 +03:00
Xingyu
286b0f4051 Add equal function implementation and corresponding test (#10351)
- Implemented a new function `equal` in the torch backend to compare two tensors for equality.
- Added unit tests for the `equal` function to verify its correctness with different tensor inputs.
2025-05-16 23:39:49 -07:00
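The `equal` semantics being implemented here match `torch.equal`: two tensors compare equal only if their shapes match and every element matches. A sketch of that check over shape/flat-data pairs (hypothetical helper, not the backend code):

```python
def tensors_equal(a_shape, a_data, b_shape, b_data):
    """torch.equal-style comparison: shapes must be identical and all
    corresponding elements must be equal."""
    return a_shape == b_shape and all(x == y for x, y in zip(a_data, b_data))

# Same shape, same data -> True; same data but different shape -> False.
tensors_equal((2, 2), [1, 2, 3, 4], (2, 2), [1, 2, 3, 4])   # True
tensors_equal((4,),  [1, 2, 3, 4], (2, 2), [1, 2, 3, 4])    # False
```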
Ignacio Sica
a54fd745c3 simpler barrier match in remu (#10339)
* s_barrier

* remove s_barrier from syncs
2025-05-16 14:40:58 +03:00
wozeparrot
1ed04f993b move benchmark stat tracking to influxdb (#10185) 2025-05-15 16:14:56 -07:00
Ignacio Sica
3c453e96a9 add ds_load_b96 and ds_store_b96 instructions (#10338) 2025-05-15 18:11:08 +03:00
qazal
be8202b293 add s_abs_i32 instruction to remu (#10334) 2025-05-15 16:47:58 +03:00
nimlgen
e00679dc92 am_smi: fix layout with sleep mode (#10300) 2025-05-14 15:44:42 +03:00
nimlgen
0788659d08 usbgpu: fast cold boot (#10260)
* usbgpu: fast cold boot

* cleaner

* assert

* xx

* compat

* fix

* fix
2025-05-14 14:58:55 +03:00
geohotstan
1c4ab6b991 ONNX add tests against ORT (#10270)
* start

* clean up

* indicate file location too
2025-05-13 04:03:52 -04:00
nimlgen
bb31cc4582 usbgpu: check hash in patcher (#10266) 2025-05-12 21:08:53 +03:00
George Hotz
8864ff894b hotfix: that repeat_kv belongs outside the if 2025-05-11 18:43:01 -07:00
George Hotz
98c84a711d min rectified flow example [pr] (#10252)
* work on minrf example

* more

* jit sample

* t is tensor not const

* fixes

* more convs

* fix dropout

* don't print

* 504

* big patch

* onehot

* touch

* use embeddings

* dumb uses final layer

* act

* non fl

* match

* tp

* 3

* of

* ppsz

* normal

* add adln

* no t

* weird transformer

* weird transformer

* contig

* actual speed fix

* dumb

* cb

* 0

* t is 0

* mort-t

* args

* dumb days are over

* readable

* contig

* no more t mask

* mask_t

* init to zero

* clean

* steps

* work

* tt

* t

* solid
2025-05-11 18:36:44 -07:00
qazal
9210280811 add v_fmac_f16 vop3 instruction to remu (#10247)
* fmac vop3

* from the box
2025-05-10 23:48:25 +03:00
nimlgen
116390083f nvme speed write example (#10230) 2025-05-09 14:20:01 +03:00
Xingyu
a21369d039 Enhance tensor random functions with dtype support (#10214)
* Enhance tensor random functions with dtype support
- Updated `aten.uniform_` and `aten.normal_` to include dtype parameter in backend.py
- Added unit tests for uniform and normal tensor generation with specific dtypes in test.py

* Refactor test name for clarity
- Renamed `test_normal_dtype` to `test_normal` in `extra/torch_backend/test.py`
- Aims to improve readability and better reflect the test's purpose
2025-05-08 20:48:07 -04:00
qazal
4ea3e373aa decode lds ops in remu (#10184) 2025-05-07 16:44:18 +08:00
Ignacio Sica
74c25bdc8b add support for ds_load_u8 in remu (#10180)
* add support for ds_load_u8 in remu

* add test for ds_load_u8
2025-05-06 20:31:00 +03:00
nimlgen
34d55857cf usbgpu: more devs in scan_pci (#10171) 2025-05-06 11:55:34 +03:00
nimlgen
30bd6a619f usb gpu (#8766)
* start gpu

* progress

* fixes

* read correct

* libusb

* libusb works

* support asm24

* hmm

* one access file

* fix extra

* start AMBar

* works on am

* back to usb

* patch fw

* full fast write into a bar

* ugh, minus one gpus, next please

* mute libusb for now

* usb for asm24

* 63

* hmm

* ops

* rescan

* and gpu should be there

* enumerate them?

* usbgpu bus 4, 100% reliable (draft)

* lil

* works

* comments

* add DEBUG

* cleaner

* simplest

* Revert "simplest"

This reverts commit 1d00354c16.

* Revert "cleaner"

This reverts commit c5662de956.

* assert we find gpu

* that's simpler

* this back

* simpler?

* correct

* work

* nonsense

* works with more checks

* this works

* the 6s in the right place

* reliable now

* fix after reboot

* set config

* 1s timeouts

* close to fw loading

* streams

* usbhub works

* endpoints

* fix

* want to test tiny10

* move to tiny 10

* fix gpu

* ugly speed

* smth

* mostly broken, but signals and dmas

* do not reset gpu every time

* changes to run kernels

* ugh, not working

* t10

* pg and sc files

* some prog

* um?

* somehow it works

* patched for 24

* some tries

* minimal

* moving

* back to working

* so sloooooow

* move to controller

* usb.py rewrite

* rework

* cleaner 1

* cleaner 2

* cleaner 3

* new abstractions

* aft merge

* init controller

* cleaner 4

* cleaner 5

* patcher + tiny changes

* ignore that

* cleaner 6

* after rebase

* cleaner 7

* bring it back

* start linter war

* linter 2

* autogen was missing

* fix autogen

* typing

* better?

* mypy

* extra/legacy rename and cleaner

* shuffle

* better printing

* tiny changes and tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-05-01 18:03:47 +03:00
chenyu
17d4d258ea simple symbolic slice in llama [pr] (#10112)
support slices that have step None and stop > start
2025-04-30 14:36:35 -04:00
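Restricting to step None and stop > start is what makes the symbolic case simple: such a slice reduces to a plain shrink window. A sketch of that normalization on concrete ints (hypothetical helper, not the tinygrad code path):

```python
def slice_to_shrink(length, start, stop):
    """Normalize a step-None slice to a (start, stop) shrink window.

    Only the simple case from the commit is handled: step is implicitly 1,
    bounds are clamped to [0, length], and stop must exceed start.
    """
    start = max(0, min(start, length))
    stop = max(0, min(stop, length))
    assert stop > start, "only stop > start is supported"
    return (start, stop)

slice_to_shrink(10, 2, 7)    # (2, 7)
slice_to_shrink(10, 0, 99)   # (0, 10) -- stop clamped to length
```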
nimlgen
fcdda4fc09 am: move boot memory to vram start (#10115) 2025-04-30 19:12:19 +03:00
chenyu
573bbb9746 Revert "remove TransformerBlock contiguous in llama (#10104)" (#10108)
This reverts commit b8d07dcc54.
2025-04-29 15:28:38 -04:00
chenyu
b8d07dcc54 remove TransformerBlock contiguous in llama (#10104) 2025-04-29 14:15:39 -04:00
qazal
3b67f56c02 kernelize some llama realizes (#10098) 2025-04-29 18:39:56 +08:00
chenyu
3eba3d6ee9 don't pass model in convert_from_huggingface and convert_from_gguf (#10094)
it only needs n_layers
2025-04-28 20:11:19 -04:00
George Hotz
690dac79b5 don't modify the ranges on reduce rewrite (#10062)
* bug in div range folding

* simpler

* oh, this is right for indexing, but the div mod folding needs to be fixed

* reenable

* Passing test_complexity_w_unroll2 (#10068)

* Passing

* remove non_folded_divs

* Add check for negative term in div folding

* Add test

* bump that limit

* fix casted

---------

Co-authored-by: Sieds Lykles <93992551+S-Lykles@users.noreply.github.com>
2025-04-28 12:01:19 -04:00
qazal
ac37510f60 remu: only write v_cmp result if exec is set (#10084) 2025-04-28 20:31:52 +08:00
qazal
d6b436a815 remu bugfix with -0.0 negation (#10082) 2025-04-28 15:46:42 +08:00
George Hotz
ea5dddc537 reduce collapse generic (#10045)
* reduce collapse generic

* new arange folder

* new range folding

* correct with sym

* all tests pass

* indexing ops passes

* failing tests

* fix tests, remove unused

* revert that

* torch indexing is fast

* skip on webgpu

* touchups

* comments
2025-04-26 09:13:24 -04:00
qazal
e1d2b64e92 remu new instructions (#10050)
* remu new instructions

* test_ds_store_half

* test_v_mul_f16
2025-04-26 02:04:12 +03:00
qazal
bba5d0a3e4 remu refactors (#10028)
* remu refactors

* scc is sgpr 253

* remove that

* rename to vcc_lo

* run cargo test in CI

* llvm-mc

* meh

* work

* work_group work 1

* seeded_lanes is dumb

* better than seeded_lanes

* does not need to be address

* 128 sgpr per wave

* scc is sgpr, we don't know which one

* null_src once more

* derive clone, wave init is cleaner

* init comes first
2025-04-26 04:31:10 +08:00
nimlgen
0fc85a2b0a hcqfuzz: init (#10049)
* hcqfuzz: init

* fix fuzz

* linter

* graph

* that test

* update readme
2025-04-25 23:19:21 +03:00
chenyu
74c6cf8be3 lint mlperf model_train (#10038) 2025-04-24 16:19:44 -04:00
Nishant Rajadhyaksha
55942a8d8e [Bounty] moved index_tensor off cpu in torch_backend (#9916)
* moved index tensor off cpu in torch_backend

* added support for None based indexing

* fix_to_pass_tests

* fix segfault tests
2025-04-24 14:12:37 -04:00
qazal
0b482fb824 add RDNA3 parser to remu (#10025)
* llvm ref

* work

* all of them

* salu

* cleaner

* start

* vector ops

* done

* replace SMEM

* vopd

* sop1

* SOPC

* null stays null_src

* sopp

* SOPK

* sop2

* vop1

* vop2

* remove allow(dead_code)

* vopc
2025-04-24 21:34:07 +08:00
Sieds Lykles
e75be6eafc [bounty] [pr] index validation with z3 (#9981)
* index validation with z3

* Change comment

* toposort -> toposort()

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-04-24 08:06:08 -04:00
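"Index validation" here means proving an index expression can never go out of bounds for any value of its symbolic variables. The PR hands that query to z3; the underlying idea can be sketched z3-free by exhaustively checking a small variable range (hypothetical helper, not the tinygrad code):

```python
def index_always_valid(index_fn, var_range, size):
    """Brute-force stand-in for the SMT query: does index_fn(i) stay in
    [0, size) for every i in var_range?  The real PR asks z3 to prove
    this for symbolic ranges instead of enumerating."""
    return all(0 <= index_fn(i) < size for i in var_range)

# Index 2*i + 1 over i in [0, 4): max index is 7, so a buffer of
# size 8 is always safe, while size 7 is not.
index_always_valid(lambda i: 2 * i + 1, range(0, 4), 8)   # True
index_always_valid(lambda i: 2 * i + 1, range(0, 4), 7)   # False
```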