Commit Graph

1363 Commits

Author SHA1 Message Date
George Hotz
68053d0510 dsp stuff / sniff ioctls from snpe (#9490)
* sniff ioctls from snpe

* dump input buffers

* snpe logs from dsp

* NHWC support

* knum 3

* this run?

* revert those

---------

Co-authored-by: Comma Device <device@comma.ai>
2025-03-20 10:38:23 +08:00
geohotstan
8c0d0a122c Add return_indices to max_pool (#9506)
* wow argmax is so good

* 1 less line

* clean up and better variable names

* is this torch thing right...?

* add more tests

* slap a TODO on it

* clean ups

* prettier looking code and fix ceil mode test

* add return types and some docs

* ok that was a bad example since indices == value, just no example
2025-03-19 15:25:37 -04:00
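
A minimal sketch of the new API, assuming it mirrors torch's max_pool2d(return_indices=True) semantics (the second return value holds the flattened index of each maximum):

    from tinygrad import Tensor

    x = Tensor.arange(16).float().reshape(1, 1, 4, 4)
    out, idx = x.max_pool2d(kernel_size=(2, 2), return_indices=True)
    print(out.numpy())  # pooled maxima
    print(idx.numpy())  # argmax-style indices, usable for max_unpool-style scatter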
Francis Lam
1e5d9ad8f7 extra/gemm/max_matmul: start of custom kernels for GEMM (#6926)
* extra/gemm/max_matmul: start of custom kernels for GEMM

* add an unoptimized FP16/FP16 MMA example

* add slow 3-stage fp16 acc example

* add correct 3-stage pipeline with unswizzled/flat smem input (slow)

* add acc fp16 example with 3 stages and swizzle (no bank conflicts)

* add max version of NV fp16_fp16_fp16

* fix up comments and removed unused code in max variations

* add start of no_xor example

* fix to account for UOps to Ops
2025-03-19 15:04:57 +08:00
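
These variants chase peak tensor-core throughput. For scale, a rough harness for measuring GEMM TFLOPS in tinygrad (illustrative only, not the script's actual code; the first run includes compile time):

    import time
    from tinygrad import Tensor, dtypes

    N = 4096
    a, b = Tensor.rand(N, N, dtype=dtypes.half), Tensor.rand(N, N, dtype=dtypes.half)
    st = time.perf_counter()
    (a @ b).realize()
    et = time.perf_counter() - st
    print(f"{2 * N**3 / et * 1e-12:.2f} TFLOPS")  # a matmul does 2*N^3 FLOPs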
b1tg
a95b489a55 nanoGPT train works with tiny torch backend (#9283)
* train_shakespeare_char.py works

* move aten.where.self_out to tiny_backend_out

* fix memory leak

* corealize in the backward_hook

* Update backend.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-19 11:51:02 +08:00
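
The torch backend exposes tinygrad as a torch device, so scripts like train_shakespeare_char.py run mostly unchanged. A hedged sketch (the import path and "tiny" device name follow extra/torch_backend and are assumptions here):

    import torch
    import extra.torch_backend.backend  # assumed path; registers the "tiny" device

    x = torch.randn(4, 4, device="tiny", requires_grad=True)
    loss = (x * x).sum()
    loss.backward()  # torch autograd drives tinygrad kernels underneath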
George Hotz
117b7a16ef VALIDATE_WITH_CPU [pr] (#9488)
* VALIDATE_WITH_CPU [pr]

* fix test
2025-03-18 15:15:04 +08:00
nimlgen
a82c9332d3 am: rename soc21 to soc (#9482) 2025-03-18 08:54:26 +08:00
Anish Umale
5e58f4b65b Tiny backend test_ops fix part 3 (#9483)
* extract straightforward things from https://github.com/tinygrad/tinygrad/pull/9302

* pass dtype and device for ones_like
2025-03-17 18:01:51 -04:00
TJ
9fcef4d009 add masked_select to tensor.py (#9468)
* add masked_select to tensor.py

* fix tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-17 16:05:36 -04:00
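
A quick sketch of the new method, assuming torch semantics (a 1-D tensor of the elements where the mask is true):

    from tinygrad import Tensor

    t = Tensor([[1, 2], [3, 4]])
    mask = Tensor([[True, False], [False, True]])
    print(t.masked_select(mask).numpy())  # -> [1 4]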
geohotstan
53d6f1e1bb Add bitonic cat sort (#9422)
* poc

* repeated values fail, sigh

* is this being timed out?

* fix up down names

* bitonic v2, does this run?

* bitonic v3, faster

* bitonic v3.1, faster

* bitonic v3.1.1, same speed unlucky

* support dim and indices

* bitonic v3.2, simpler code, TODO repeated indices

* bruv gimme green for once cmon

* cat (stack) implementation, slow but maybe one day when cat is fast meow

* revert to v3.2

* bitonic v4, who let the cats out edition

* clean up variable names

* figured out repeated indices :D

* ruff check --fix

* use sort for topk

* add Tensor.sort everywhere

* fix docs and add some types

* slightly better variable names

* am I doing torch inplace correctly?

* delegate sort to values_stable

* add a contig, faster first sort

* maybe don't test_inplace

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-17 12:01:23 -04:00
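
Bitonic sort runs log^2(n) data-parallel compare-exchange stages over power-of-two sequences (inputs get padded up), which is why it maps well onto tensor ops. A usage sketch, assuming a torch-style Tensor.sort that returns (values, indices):

    from tinygrad import Tensor

    t = Tensor([3.0, 1.0, 2.0, 1.0])
    values, indices = t.sort()  # ascending by default
    print(values.numpy(), indices.numpy())  # [1. 1. 2. 3.], stable indices for repeats
    v_desc, _ = t.sort(descending=True)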
George Hotz
824c5f41ac dsp work try 3 (#9475)
* dsp work try 3

* padding
2025-03-17 16:42:12 +08:00
George Hotz
52ae9af4dd Fast DSP for MobileNetV2 (try 2) (#9467)
* Fast DSP for MobileNetV2 (try 2)

* enable fast path on uchar

* fix tests
2025-03-17 15:10:36 +08:00
George Hotz
09e7708b49 minimum change for rdna4 [pr] (#9455) 2025-03-16 13:39:24 +08:00
George Hotz
cb7a7f69c7 quantization preprocessor from DSP, should be universal (#9437)
* quantization preprocessor from DSP, should be universal

* touchups

* fix tests
2025-03-15 07:49:37 +08:00
chenyu
0e591baf43 redo simple_matmul change (#9450)
numpy does not support bfloat16
2025-03-14 17:53:52 -04:00
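
The revert/redo dance is because numpy has no native bfloat16 dtype:

    import numpy as np

    np.dtype("bfloat16")  # raises TypeError: data type 'bfloat16' not understood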
chenyu
b0f63d3c04 Revert "simple_matmul.py uses np to generate random (#9438)" (#9449)
This reverts commit 14018050c1.
2025-03-14 17:14:22 -04:00
Ignacio Sica
14018050c1 simple_matmul.py uses np to generate random (#9438)
* np generates randoms

* hotfix: use generator for int dtype

* float32 as default dtype for float generator

* use np.float32 instead of string

* add dtype= to integers generator

* change the import source of _to_np_dtype
2025-03-14 17:36:50 -03:00
geohotstan
0bed9b6cd2 benchmark huggingface onnx models (#8493)
* add ability to run with ORT=1

* test_vs_ort

* useless f

* actually have benchmark take in modelproto for more flexibility in huggingface stuff

* ok runs

* good

* oops fix benchmark_onnx __main__

* 224 as default

* add ORT=1 option to huggingface_onnx

* use Tensor to get_input

* add ability to do single onnx model testing

* better names

* merge properly...

* copy in onnx_helpers

* better

* decent script

* need to add debug tool first

* new limit usage

* why did narrowing_error come back..

* pretty decent

* revert validate change

* more ops bug fixes

* revert unnecessary changes

* fix InstanceNorm too

* remove op from O4

* minimize diff

* address old feedback

* unsure of this, just revert

* remove that assert

* working attention

* to_python_const Attention

* can't init from np constant so just do this

* final

* fix bug in attention

* attention clean ups

* add hard TODOs and REPOPATH and TRUNCATE envvar

* fix input_ids default value

* final

* fix scatter

* cleaner _prepare_quantize

* use new attention and tempfile for huggingface script

* more stats

* update

* remove outdated code

* big refactor to something usable by CI

* booooooom

* clean up

* update to using yaml as env var input

* add dry run

* try

* valid pad

* use argparser and fix gather bug

* ignore all yaml

* tiny bit more polish

* woah ignoring all yaml was not right

* typo

* decouple huggingface_onnx_run debug run from huggingface_onnx_download

* bug fix for downloading single model

* WOOOO ok much better

* oops argparse 'required' is an invalid argument for positionals

* add assert

* fix types

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-12 20:13:12 -04:00
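
The ORT=1 path cross-checks tinygrad's ONNX runner against onnxruntime. A hedged sketch of that comparison (model path and zero-filled inputs are illustrative):

    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    feeds = {i.name: np.zeros([d if isinstance(d, int) else 1 for d in i.shape], dtype=np.float32)
             for i in sess.get_inputs()}
    ref = sess.run(None, feeds)  # reference outputs to diff against tinygrad's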
Priyank Patel
4714c4f9ad torch backend multigpu - add devices and tests (#9414)
* add multi-device support and tests

* simplify
2025-03-12 11:33:11 +08:00
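
A hedged sketch of what multi-device support enables; the indexed "tiny:N" device syntax is an assumption modeled on torch's "cuda:N":

    import torch
    import extra.torch_backend.backend  # assumed import path

    a = torch.ones(2, 2, device="tiny:0")
    b = a.to("tiny:1")  # cross-device copy between two tinygrad devices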
uuuvn
e85001b6ee SQTT profiling (#9278)
* sqtt

* docs

* multi-device

* ProfileSQTTEvent

* exec update

* 256mb default

* don't let people hang their gpus

* bitfields from autogen

* asic info from mesa

* more bitfields from autogen

* SQTT_ITRACE_SE_MASK

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-11 13:19:56 +08:00
Priyank Patel
beed00eabe fix torch backend memory leak (#9395)
* fix leak, realize everything on torch optim step

* only realize a subset

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-11 10:48:20 +08:00
chenyu
01e8b60911 acc_dtype -> dtype (#9402)
matches numpy and torch
2025-03-10 16:05:30 -04:00
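
After the rename, reductions take a dtype keyword the way numpy and torch do (sketch, assuming the renamed argument):

    from tinygrad import Tensor, dtypes

    t = Tensor([100, 100, 100], dtype=dtypes.uint8)
    print(t.sum(dtype=dtypes.int32).numpy())  # request an explicit accumulation dtype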
Priyank Patel
796c3bbb23 torch: support in-place operations on views (#9371)
* add torch inplace tests

* first set of tests passing

* wrap all inplace funcs, add more tests

* fixes and wrap more functions

* fix all uint8 tests to avoid slow tests

* fix the one test

* another test, another fix

* and one more, works for ddp now

* something on contiguous, cleanup

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-03-10 23:29:00 +08:00
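
What the wrappers have to reproduce is stock torch behavior: an in-place op on a view writes through to the base tensor's storage:

    import torch

    base = torch.zeros(4)
    view = base[1:3]
    view.add_(1)  # in-place on the view must mutate base
    print(base)   # tensor([0., 1., 1., 0.])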
George Hotz
25847080f0 olmoe (from stream, wip) (#9390)
* olmoest working (but not)

* it's correct

* compare ropes

* old code wasn't wrong

* default device

* no metal

* fix permute

* working

* more minimal
2025-03-10 13:46:33 +08:00
geohotstan
1d64c12f2b add Topk to tensor (#9343)
* terrible but somewhat working impl

* linux behaves differently than macos?

* slightly better impl

* small clean up; haven't figured this out yet

* better

* torch has different behavior on linux and macos for duplicated values

* add sum docs

* fix test

* add torch return_type test

* add an exception test

* wrap_fxn instead, and move op lower in order

* better repeated values test

* rerun ci
2025-03-09 20:01:42 -04:00
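
A sketch of the new method, assuming the torch-style signature (largest k along the last dim, returning (values, indices)):

    from tinygrad import Tensor

    t = Tensor([1.0, 5.0, 3.0, 2.0])
    values, indices = t.topk(2)
    print(values.numpy(), indices.numpy())  # -> [5. 3.] [1 2]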
geohotstan
088d86691b fix onnx gather and onnx auto_pad VALID mode (#9375)
* fix gather and auto_pad

* long -> int64
2025-03-07 10:27:23 -05:00
uuuvn
b75f307234 amd: autogen ip bases (#9360) 2025-03-05 22:30:38 +03:00
Anish Umale
b3ac60ce53 Fix test_ops for tiny backend part 2 (#9358)
* extract functions from https://github.com/tinygrad/tinygrad/pull/9302

* revert gather and add aten.elu_backward

* address review

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-05 13:38:40 -05:00
Priyank Patel
f048256341 fix TORCH_DEBUG=1 sigsegv (#9352) 2025-03-05 12:24:53 +03:00
nimlgen
993ef42bd5 am: hdp cg (#9346) 2025-03-04 20:44:09 +03:00
hooved
01f7a4fadc tinychat in browser, Part 2: model export (#9274)
* load llama3-1B to WEBGPU device

* include compile script for loading llama3 to WEBGPU

* parametrize max_context in build_transformer fxn

* jit_model with two different args sets

* compile for webgpu, split weights

* load model weight parts in browser

* export all tensors from initialized transformer

* run transformer inference in browser

* enable tiktoken with llama bpe in browser

* count total tokens on client with tiktoken.js

* full client-side chat streaming, eliminate server

* revert change that enabled jitting with 2 argsets

* llama without Variable or cache_kv, for webgpu

* have client use mask tokens / whole context

* cleanup staged weights

* add tiktoken.js build script, README

* export CLANG for Q6_k to float32 decompression

* fix and test exported CLANG code for Q6_k to fp32

* revert changes to jit and export_model

* isolate clang export

* test Q6_K to float32 decompression in browser

* gguf_load now also returns t_infos and data_start

* prepare llama-1B Q6_K gguf chunks for browser

* cache and decompress quantized llama in browser

* enable separate deployment of large files

* fix kv cache and symbolic with llama wgpu

* eliminate browser lag during decompression

* hash metadata and weight chunks

* delete obsolete indexeddb cache to free disk

* add progress bar, track model download/decompress

* refactor progress callback

* skip buffer hash verification for speed

* Display progress for entire loading scope

* Report page load errors to user

* actually display errors

* skip prompt tokens already seen by model

* skip prefilling with last assistant message tokens

* on page load tell user if webgpu not enabled

* push deployed URL root to window.history

* make note of bug sources with TODO items

* isolate bug in CLANG with BEAM=2

* remove clang_bug.py from diff

* decompress q6k to f32 on webgpu instead of clang

* remove unused code

* inter-weight decomp with larger wgpu kernels

* parallelize decompression submissions

* refactor dequantize scheduling

* add progress bar back

* fix bug

* temp fix for loading GGUF Q6_K to fp16 not fp32

* fix rendering of exported CLANG

* remove weight casts, sketch js functions for clang

* get symbolic vars from jit_cache for model export

* include symbolic vars in exported CLANG

* render js for clang transformer

* toggle clang/webgpu deployment; refactor decomp

* compile and render clang Q6_K->fp16 and int8 quant

* fix rendered clang for abs(fp16), to work in wasm

* simplify clang js wrapping

* run compiled clang in worker

* prepare llama weights in workers, q6k to int8/fp16

* tinychat on clang in browser, f32/int8 weights

* move wasm inference to (now flexible) worker

* don't load redundant embeddings

* modest wasm perf gain with compile flags

* set default backend, enable backend choice/backup

* render symbolic vars in exported WEBGPU

* quantize webgpu llama to int8/f32

* improve UX arising from rendered WEBGPU

* clean up webgpu launch

* new weights split: smaller chunks, tinygrad quant.

* switch webgpu inference to int8 quant

* remove unneeded clang decompression

* eliminate unneeded kv cache transfer to wasm

* use 1 worker for simplified clang decompression

* display launch errors

* refactor: stream load weight chunks to WebGPU

* show loading chunk completion

* quantize embeddings to int8

* test float16 as input for quantization

* webgpu: use f16 source, int8 embed, eliminate q6k

* simplify split weights prep: all from state_dict

* revert change to nn.state.gguf_load

* remove unneeded decompression from webgpu client

* remove unneeded code

* decrease dl chunks from 47 to 16 MiB

* improve stability of webgpu loading on mobile

* autodetect mobile, improve load stability

* refactor: progress closure

* refactor: one unified progress bar

* remove unneeded code

* revert changes to tinygrad core library

* enforce ios18.3 nerfed max buf size

* BEAM=3 webgpu

* cache integrity, mobile save throttling

* improve mobile UX - no autozoom on prompt box

* clang: int8 from f16, remove q6k

* reduce concurrent dls on mobile to 2 for stability

* refactor: wasm backend with stream loading

* prevent race between wasm load and indexedb save

* split wasm kernels into separate modules

* js wrapper for multiple wasm module inference

* revert multi-module wasm to single module

* make mobile wasm load more stable/fast

* refactor: copy weights into wasm without crashes

* fix bug in download queue; increase mobile dls

* refactor exported clang wrapper, split weights

* remove unnecessary code

* greatly improve int8 quant quality with rounding

* eliminate mobile throttling

* increase webgpu context to 4096 tokens

* export webgpu js functions

* enable separate hosted weights for mobile/pc

* enable prompt-thread switching during generation

* stop generation when max_context is reached

* show progress bar for prefill

* tell user if webgpu fails, while wasm loads

* make loading messages more concise

* update font

* revert changes to tinychat python app launch

* cleanup quantization, add scale_dtype param

* cleanup kv cache code

* cleanup compile code

* link tok_embeddings with output in webgpu export

* refactor: export_model webgpu: symbolic vars

* refactor: export_model weight loading

* forgot to commit export_model.py

* change CLANG to CPU

* deal with pylint incorrectly failing tests

* simplify f-strings for older CI python version

* fix pre-python3.12 parser errors

* [Int32Array] not Int32Array

* cleanup webgpu compile after refactor export_model

* refactor WASM export into export_model

* merge WebGPU/WASM compile scripts

* simplify max_contexts for local deployment

* fix parser issues and whitespace

* deduplicate variable defs for non-wasm clang export

* cleanup code

* cleanup compile scripts

* simplify wasm inference wrapping

* simplify webgpu symbolic vars export

* refactor: unify export of symbolic variables

* simplify WASM export

* simplify clang/wasm export

* update README and build scripts

* separate files for browser/python apps

* restore original python tinychat app files

* browser and python tinychats share assets

* minor cleanup

* isolate compile/export model

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-04 15:53:30 +08:00
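
One bullet notes that rounding, rather than truncating, during int8 quantization greatly improves quality. A standalone sketch of symmetric per-tensor int8 quantization with round-to-nearest (not the export script's actual code):

    import numpy as np

    def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
        scale = float(np.abs(w).max()) / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # round, don't truncate
        return q, scale

    w = np.random.randn(1024).astype(np.float32)
    q, s = quantize_int8(w)
    print(np.abs(w - q.astype(np.float32) * s).max())  # max reconstruction error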
chenyu
019417743c ruff torch backend (#9341) 2025-03-03 15:15:23 -05:00
nimlgen
f9e4c638f1 torch_hook fixes (#9334) 2025-03-03 23:07:30 +03:00
Anish Umale
bafa40fe12 Tiny backend test_ops fix part1 (#9338)
* extract name methods from https://github.com/tinygrad/tinygrad/pull/9302

* t.grad.numpy() -> t.grad.cpu().numpy()

* revert TORCH_DEBUG change

* revert dtype change in aten.sum
2025-03-03 12:36:51 -05:00
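
The t.grad change reflects standard torch behavior: .numpy() only works on CPU tensors, so gradients on another device need an explicit copy first:

    import torch

    t = torch.ones(2, requires_grad=True)
    t.sum().backward()
    g = t.grad.cpu().numpy()  # bare .numpy() raises on non-CPU tensors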
Friedrich Carl Eichenroth
b4028e48ae Torch Backend Refinement (#9327)
* fix some torch tests

* fixup

* small change

* fixup

* fix test

* use default function

* add todo

* bunch of small changes

* fix tests

* more tests

* fix

* fix

* test fix

* simplify
2025-03-03 10:24:02 -05:00
George Hotz
a73d8717f3 fast amd gemm (#9318)
* 50 TFLOP AMD gemm

* add lds tiling

* register tiling

* flip locals

* work

* comment

* remove those
2025-03-03 12:01:14 +08:00
chenyu
ba4b8c2c23 Tensor.copysign (#9329) 2025-03-02 21:33:49 -05:00
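
Sketch, assuming numpy/torch copysign semantics (magnitude from self, sign from the other operand):

    from tinygrad import Tensor

    a = Tensor([1.0, -2.0, 3.0])
    print(a.copysign(Tensor([-1.0, 1.0, -1.0])).numpy())  # -> [-1.  2. -3.]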
Friedrich Carl Eichenroth
06ef9cc9f4 aten leaky_relu, div.out_mode, clamp_max, clamp_min, copysign (#9323)
* fix some torch tests

* fixup

* small change

* fixup

* fix test

* use default function

* add todo
2025-03-02 19:12:16 -05:00
nimlgen
91c421fb7d adaptive am_smi (#9319) 2025-03-02 15:45:07 +03:00
geohotstan
d9ec05cea6 Test Onnx quantization behavior (#9301)
* add DynamicDequantizeLinear and corresponding tests

* wow qlinearops are round away from zero

* this passes locally...

* again

* try

* try separate test

* round to even again

* also add QLinearMul

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-01 19:21:58 -05:00
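
The distinction these tests pin down, round-half-to-even versus round-half-away-from-zero, in plain numpy:

    import numpy as np

    x = np.array([0.5, 1.5, 2.5])
    print(np.round(x))                             # half to even ("banker's"): [0. 2. 2.]
    print(np.sign(x) * np.floor(np.abs(x) + 0.5))  # half away from zero: [1. 2. 3.]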
Priyank Patel
f4148ac46a torch fix casting and add ops for sd vae(s) (#9297)
* torch fix copy casting and add upsample op

* update cast and add test

* fix lint

* add pad for sdxl vae to work
2025-03-01 08:49:10 -05:00
chenyu
38d7aae3b7 onnx fmod (#9307) 2025-02-28 14:09:22 -05:00
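
ONNX's Mod op has an fmod attribute; the two modes differ in sign convention:

    import numpy as np

    print(np.fmod(-7, 3))  # -1: fmod keeps the sign of the dividend (C-style)
    print(np.mod(-7, 3))   #  2: mod keeps the sign of the divisor (Python-style)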
chenyu
3ae66e59a3 least_upper_float is at least default_float (#9303)
* least_upper_float is at least default_float

En route to div rounding mode: the dtype of true int division will change from int32 to default_float, which also matches torch.

* fix bert acc
2025-02-28 10:41:56 -05:00
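
The promotion rule this points at, sketched (hedged; semantics per the commit note, not verified at this commit):

    from tinygrad import Tensor, dtypes

    a, b = Tensor([3], dtype=dtypes.int32), Tensor([2], dtype=dtypes.int32)
    print((a / b).dtype)   # true division promotes to the default float, like torch
    print((a // b).dtype)  # floor division stays int32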
nimlgen
052722a7bc torch hook: address comments (#9295)
* torch hook: address comments

* failed test
2025-02-28 11:51:52 +03:00
Priyank Patel
8ae215dd3d torch backend fix manual seed warning (#9292) 2025-02-28 13:45:32 +08:00
George Hotz
ac40316692 hotfix: group cpu functions in torch backend 2025-02-28 10:39:00 +08:00
George Hotz
b32595dbbc torch examples (#9290)
* torch, fix examples/mnist

* fix vae torch example

* where out
2025-02-28 10:16:06 +08:00
hooved
3b9950241e tinychat in browser, Part 1: llama (#9273)
(development bullets identical to the Part 2 (#9274) commit message above, from "* load llama3-1B to WEBGPU device" through "* minor cleanup")

* isolate diffs to llama files

* minor cleanup

* set default scale_dtype

* set default scale_dtype for NF4 quantization

* make quantization of tok_embeds optional

* match output with tok_embeds if not quantizing

* minor change
2025-02-27 15:57:37 -05:00
chenyu
184030168d fix aten.reflection_pad2d (#9289)
tested against the torch doc example
2025-02-27 15:53:46 -05:00
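
The torch doc example referred to:

    import torch

    m = torch.nn.ReflectionPad2d(2)
    x = torch.arange(9, dtype=torch.float).reshape(1, 1, 3, 3)
    print(m(x))  # 7x7 output with mirrored borders (edge row/col not repeated)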
chenyu
0de6585df0 fix aten.normal_ arg (#9288)
should be mean and std.
2025-02-27 15:36:25 -05:00
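
The fixed signature matches stock torch, where normal_ takes mean then std:

    import torch

    t = torch.empty(3)
    t.normal_(mean=5.0, std=0.1)  # in-place fill from N(mean, std^2)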
chenyu
8ee2b460ee Tensor.var_mean (#9287) 2025-02-27 15:15:31 -05:00
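
Sketch, assuming the torch-style return order (variance first, then mean):

    from tinygrad import Tensor

    t = Tensor([[1.0, 2.0], [3.0, 4.0]])
    var, mean = t.var_mean()
    print(var.numpy(), mean.numpy())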