Commit Graph

10633 Commits

Author SHA1 Message Date
nimlgen
77a8430616 am: use smu based on discovery (#9441) 2025-03-15 02:10:45 +08:00
uuuvn
5ff90cb261 am: less magic values (#9440) 2025-03-15 02:10:35 +08:00
Ignacio Sica
459d0cd14f add arch to AMDRenderer and HIPRenderer (#9431) 2025-03-13 13:06:27 -03:00
nimlgen
357e364ab8 am: turn off unord dispatch (#9433) 2025-03-13 23:59:28 +08:00
chenyu
99b0287e4e add GROUP and GROUPTOP to test_arange (#9432)
it does not grow quadratically, but it's not 0 ops now
2025-03-13 11:28:38 -04:00
qazal
90ffa9bd45 swizzle without buffer ops try 2 [pr] (#9427)
* add DONT_PUSH_VIEWS to matchers

* swizzle without buffer ops try 2 [pr]

* swizzle reduceop

* simple failing test

* fix failing test

* s/on/for
2025-03-13 10:00:40 +01:00
qazal
4df2b6347d hotfix: bump tinybox red training CI timeout to 30 minutes (#9426) 2025-03-13 09:31:44 +01:00
George Hotz
931436204c hotfix: 12000 lines, for AMD stuff 2025-03-13 10:48:14 +08:00
George Hotz
bfc68d1953 add gep rules to simplify (#9419)
* add gep rules to simplify

* ws

* flipped direction
2025-03-13 09:46:25 +08:00
geohotstan
0bed9b6cd2 benchmark huggingface onnx models (#8493)
* add ability to ORT=1

* test_vs_ort

* useless f

* actually have benchmark take in modelproto for more flexibility in huggingface stuff

* ok runs

* good

* oops fix benchmark_onnx __main__

* 224 as default

* add ORT=1 option to huggingface_onnx

* use Tensor to get_input

* add ability to do single onnx model testing

* better names

* merge properly...

* copy in onnx_helpers

* better

* decent script

* need to add debug tool first

* new limit usage

* why did narrowing_error come back..

* pretty decent

* revert validate change

* more ops bug fixes

* revert unnecessary changes

* fix InstanceNorm too

* remove op from O4

* minimize diff

* address old feedback

* unsure of this, just revert

* remove that assert

* working attention

* to_python_const Attention

* cant init from np constant so just do this

* final

* fix bug in attention

* attention clean ups

* add hard TODOs and REPOPATH and TRUNCATE envvar

* fix input_ids default value

* final

* fix scatter

* cleaner _prepare_quantize

* use new attention and tempfile for huggingface script

* more stats

* update

* remove outdated code

* big refactor to something usable by CI

* booooooom

* clean up

* update to using yaml as env var input

* add dry run

* try

* valid pad

* use argparser and fix gather bug

* ignore all yaml

* tiny bit more polish

* woah ignoring all yaml was not right

* typo

* decouple huggingface_onnx_run debug run with huggingface_onnx_download

* bug fix for downloading single model

* WOOOO ok much better

* oops argparse 'required' is an invalid argument for positionals

* add assert

* fix types

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-12 20:13:12 -04:00
chenyu
4992958dae update bert beam params (#9423)
BEAM_MIN_PROGRESS=5 for setup speed
2025-03-12 13:00:41 -04:00
qazal
12978f0d05 reorder contiguous/assign ast rules [pr] (#9420)
* apply setitem ShapeTracker when creating store [pr]

* comments + early contiguous remove

* better

* linter
2025-03-12 12:13:27 +01:00
George Hotz
5f6d5b057d expand index isn't grouping by access size (#9418)
* expand index isn't grouping by access size

* split_load_store

* scalar vec

* +correct_load_store

* vectorized and

* correct_load_store always

* simplify before divides
2025-03-12 17:24:10 +08:00
George Hotz
815ad0b7a8 support load/store grouping in DEVECTORIZE=0 (#9409) 2025-03-12 11:34:37 +08:00
Priyank Patel
4714c4f9ad torch backend multigpu - add devices and tests (#9414)
* add multi-device support and tests

* simplify
2025-03-12 11:33:11 +08:00
chenyu
22fc0a2e36 bert sum acc in half (#9412)
also BS=96
2025-03-11 23:03:15 -04:00
nimlgen
f995b465b8 am: set doorbell offsets to nb (#9413) 2025-03-12 10:35:47 +08:00
qazal
95e0f069be hotfix: gitignore *.log [pr] (#9410) 2025-03-11 21:39:19 +01:00
nimlgen
78ebade125 Merge pull request #9408 from nimlgen/hcq_progress_during_wait
hcq: reset timer on progress in signal.wait
2025-03-11 19:40:23 +08:00
George Hotz
e174c6c3bc new devectorizer (#9331)
* new devectorizer

* lidx

* test linearizer passes

* fix images

* fix unfoldable image load

* delete unused

* improve fix_unfoldable_image_load

* working for image

* fixup types

* fixup transcendental

* cast_vec

* cleaner transcendental

* skip failing test

* err, flip that

* not devec

* sqrt
2025-03-11 18:47:56 +08:00
qazal
69fac5fe89 Merge pull request #9407 from tinygrad/no_const_after_sym
no const/view in schedule sink after sym [pr]
2025-03-11 12:24:09 +02:00
nimlgen
4d09ea4c06 hcq: reset timer on progress in signal.wait 2025-03-11 10:02:14 +00:00
qazal
fa69fd3afc no const/view in schedule sink after sym [pr] 2025-03-11 10:58:38 +01:00
George Hotz
68f062c8be cast_vec on transcendental (#9406) 2025-03-11 17:30:46 +08:00
uuuvn
e85001b6ee SQTT profiling (#9278)
* sqtt

* docs

* multi-device

* ProfileSQTTEvent

* exec update

* 256mb default

* don't let people hang their gpus

* bitfields from autogen

* asic info from mesa

* more bitfields from autogen

* SQTT_ITRACE_SE_MASK

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-11 13:19:56 +08:00
George Hotz
2780e2027e devectorize prereqs [pr] (#9404) 2025-03-11 12:33:29 +08:00
Priyank Patel
beed00eabe fix torch backend memory leak (#9395)
* fix leak, realize everything on torch optim step

* only realize a subset

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-11 10:48:20 +08:00
chenyu
01e8b60911 acc_dtype -> dtype (#9402)
matched numpy and torch
2025-03-10 16:05:30 -04:00
qazal
59dfb234eb replace hardcoded ast with tensors in TestSwizzle [pr] (#9401) 2025-03-10 19:33:57 +01:00
Priyank Patel
796c3bbb23 torch: support in-place operations on views (#9371)
* add torch inplace tests

* first set of tests passing

* wrap all inplace funcs, add more tests

* fixes and wrap more functions

* fix all uint8 tests to avoid slow tests

* fix the one test

* another test, another fix

* and one more, works for ddp now

* something on contiguous, cleanup

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-03-10 23:29:00 +08:00
qazal
2afc7759a7 sink in kernel op [pr] (#9397)
* sink in kernel op [pr]

* metadata
2025-03-10 13:13:42 +01:00
George Hotz
25847080f0 olmoe (from stream, wip) (#9390)
* olmoest working (but not)

* it's correct

* compare ropes

* old code wasn't wrong

* default device

* no metal

* fix permute

* working

* more minimal
2025-03-10 13:46:33 +08:00
geohotstan
1d64c12f2b add Topk to tensor (#9343)
* terrible but somewhat working impl

* linux behaves differently than macos?

* slightly better impl

* small clean up; haven't figured this out yet

* better

* torch has different behavior on linux and macos for duplicated values

* add sum docs

* fix test

* add torch return_type test

* add an exception test

* wrap_fxn instead, and move op lower in order

* better repeated values test

* rerun ci
2025-03-09 20:01:42 -04:00
qazal
a1f41fadf6 test_schedule cleanups + add DONT_GROUP_REDUCES [pr] (#9392)
* test_schedule cleanups + add DONT_GROUP_REDUCES [pr]

* replace with test_swizzle_reduceop

* delete duplicate tests

* test_allow_push_permutes

* one kernel tests
2025-03-09 15:01:08 +01:00
wozeparrot
b6fe5ab4dd fix: correct gfx10 ctl stack size (#9384) 2025-03-09 13:03:20 +08:00
qazal
456697d0be always create kernels for assign/contiguous/copy [pr] (#9388) 2025-03-08 15:32:06 +01:00
qazal
286b480f82 do not replace assign with the offset buffer [pr] (#9387) 2025-03-08 11:57:44 +01:00
qazal
ecfccdea8e remove views from the kernel graph minimum diff (#9385)
* remove views from the kernel graph

* notes
2025-03-08 10:14:42 +01:00
qazal
0d2762c010 prep refactor for adding buffer ops last [pr] (#9383)
* prep refactor for adding buffer ops last [pr]

* freeze buffers

* add swizzle_reduceop

* shape for reduceop_view_right

* simpler elementwise_view_right

* add shapetracker to const

* only const

* from process replay
2025-03-08 08:00:14 +01:00
b1tg
bde0347618 amd: support relocatable elf (#9380)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-08 02:21:49 +08:00
nimlgen
243078dda9 am: optimize tlb usage (#9049)
* am: optimize tlb usage

* fixes

* comments

* tiny
2025-03-07 19:37:29 +03:00
qazal
46720294d6 reorder ScheduleItem creation [pr] (#9379) 2025-03-07 17:20:53 +01:00
qazal
dc89dae994 remove unmasked valid after swizzles (#9377) 2025-03-07 16:43:16 +01:00
geohotstan
088d86691b fix onnx gather and onnx auto_pad VALID mode (#9375)
* fix gather and auto_pad

* long -> int64
2025-03-07 10:27:23 -05:00
qazal
3565c08df5 refactor to kernel ast fixup [pr] (#9376) 2025-03-07 15:47:38 +01:00
hooved
304afe0d55 tinychat in browser, Part 3: browser app (#9276)
* load llama3-1B to WEBGPU device

* include compile script for loading llama3 to WEBGPU

* parametrize max_context in build_transformer fxn

* jit_model with two different args sets

* compile for webgpu, split weights

* load model weight parts in browser

* export all tensors from initialized transformer

* run transformer inference in browser

* enable tiktoken with llama bpe in browser

* count total tokens on client with tiktoken.js

* full client-side chat streaming, eliminate server

* revert change that enabled jitting with 2 argsets

* llama without Variable or cache_kv, for webgpu

* have client use mask tokens / whole context

* cleanup staged weights

* add tiktoken.js build script, README

* export CLANG for Q6_k to float32 decompression

* fix and test exported CLANG code for Q6_k to fp32

* revert changes to jit and export_model

* isolate clang export

* test Q6_K to float32 decompression in browser

* gguf_load now also returns t_infos and data_start

* prepare llama-1B Q6_K gguf chunks for browser

* cache and decompress quantized llama in browser

* enable separate deployment of large files

* fix kv cache and symbolic with llama wgpu

* eliminate browser lag during decompression

* hash metadata and weight chunks

* delete obsolete indexeddb cache to free disk

* add progress bar, track model download/decompress

* refactor progress callback

* skip buffer hash verification for speed

* Display progress for entire loading scope

* Report page load errors to user

* actually display errors

* skip prompt tokens already seen by model

* skip prefilling with last assistant message tokens

* on page load tell user if webgpu not enabled

* push deployed URL root to window.history

* make note of bug sources with TODO items

* isolate bug in CLANG with BEAM=2

* remove clang_bug.py from diff

* decompress q6k to f32 on webgpu instead of clang

* remove unused code

* inter-weight decomp with larger wgpu kernels

* parallelize decompression submissions

* refactor dequantize scheduling

* add progress bar back

* fix bug

* temp fix for loading GGUF Q6_K to fp16 not fp32

* fix rendering of exported CLANG

* remove weight casts, sketch js functions for clang

* get symbolic vars from jit_cache for model export

* include symbolic vars in exported CLANG

* render js for clang transformer

* toggle clang/webgpu deployment; refactor decomp

* compile and render clang Q6_K->fp16 and int8 quant

* fix rendered clang for abs(fp16), to work in wasm

* simplify clang js wrapping

* run compiled clang in worker

* prepare llama weights in workers, q6k to int8/fp16

* tinychat on clang in browser, f32/int8 weights

* move wasm inference to (now flexible) worker

* don't load redundant embeddings

* modest wasm perf gain with compile flags

* set default backend, enable backend choice/backup

* render symbolic vars in exported WEBGPU

* quantize webgpu llama to int8/f32

* improve UX arising from rendered WEBGPU

* clean up webgpu launch

* new weights split: smaller chunks, tinygrad quant.

* switch webgpu inference to int8 quant

* remove unneeded clang decompression

* eliminate unneeded kv cache transfer to wasm

* use 1 worker for simplified clang decompression

* display launch errors

* refactor: stream load weight chunks to WebGPU

* show loading chunk completion

* quantize embeddings to int8

* test float16 as input for quantization

* webgpu: use f16 source, int8 embed, eliminate q6k

* simplify split weights prep: all from state_dict

* revert change to nn.state.gguf_load

* remove unneeded decompression from webgpu client

* remove unneeded code

* decrease dl chunks from 47 to 16 MiB

* improve stability of webgpu loading on mobile

* autodetect mobile, improve load stability

* refactor: progress closure

* refactor: one unified progress bar

* remove unneeded code

* revert changes to tinygrad core library

* enforce ios18.3 nerfed max buf size

* BEAM=3 webgpu

* cache integrity, mobile save throttling

* improve mobile UX - no autozoom on prompt box

* clang: int8 from f16, remove q6k

* reduce concurrent dls on mobile to 2 for stability

* refactor: wasm backend with stream loading

* prevent race between wasm load and indexedb save

* split wasm kernels into separate modules

* js wrapper for multiple wasm module inference

* revert multi-module wasm to single module

* make mobile wasm load more stable/fast

* refactor: copy weights into wasm without crashes

* fix bug in download queue; increase mobile dls

* refactor exported clang wrapper, split weights

* remove unnecessary code

* greatly improve int8 quant quality with rounding

* eliminate mobile throttling

* increase webgpu context to 4096 tokens

* export webgpu js functions

* enable separate hosted weights for mobile/pc

* enable prompt-thread switching during generation

* stop generation when max_context is reached

* show progress bar for prefill

* tell user if webgpu fails, while wasm loads

* make loading messages more concise

* update font

* revert changes to tinychat python app launch

* cleanup quantization, add scale_dtype param

* cleanup kv cache code

* cleanup compile code

* link tok_embeddings with output in webgpu export

* refactor: export_model webgpu: symbolic vars

* refactor: export_model weight loading

* forgot to commit export_model.py

* change CLANG to CPU

* deal with pylint incorrectly failing tests

* simplify f-strings for older CI python version

* fix pre-python3.12 parser errors

* [Int32Array] not Int32Array

* cleanup webgpu compile after refactor export_model

* refactor WASM export into export_model

* merge WebGPU/WASM compile scripts

* simplify max_contexts for local deployment

* fix parser issues and whitespace

* deduplicate variable defs for non-wasm clang export

* cleanup code

* cleanup compile scripts

* simplify wasm inference wrapping

* simplify webgpu symbolic vars export

* refactor: unify export of symbolic variables

* simplify WASM export

* simplify clang/wasm export

* update README and build scripts

* separate files for browser/python apps

* restore original python tinychat app files

* browser and python tinychats share assets

* minor cleanup

* isolate app layer diff

* add .gitignore for generated files

* validate CPU/WEBGPU models in python

* prevent infinite generation if validation fails

* check if exported weight files are unique

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-07 15:07:33 +08:00
hooved
136cf7b8b1 hotfix: load >2 GiB from disk on macOS (#9361)
* enable loading >2 GiB buffer from disk on macOS

* handle None case raised by mypy

* add test

* revert fix to repro bug in CI

* tell CI to run a unit test for macOS

* reapply fix
2025-03-07 14:51:58 +08:00
Friedrich Carl Eichenroth
dbdefbbe54 Typed methods in tensor.py (#9356)
* types for tensor.py

* x

* more

* remove some casts

* more typing

* fix linting issues

* -1 line

* add last type

* cast 🤙🤙
2025-03-05 20:34:18 -05:00
nimlgen
77f7ddf62a gfx10 correct ctl stack size (#9365) 2025-03-05 23:04:16 +03:00
nimlgen
c8a74b11ed am: resize bar error msg (#9366) 2025-03-05 23:02:04 +03:00