4618 Commits

Author SHA1 Message Date
George Hotz
2dea12832c add failing assign test (#3796)
* that was a hack

* tests to reveal the issue

* add assign for realized assign
2024-03-18 08:47:30 -07:00
George Hotz
086291e8c6 hotfix: add test for JIT reset 2024-03-17 21:35:49 -07:00
George Hotz
bf3e1c4df2 support pickling tensors and others (#3787)
* test pickle tensors

* pickle unrealized tensor

* pickle jit, don't save Device in every CompiledASTRunner

* real test of pickle, move delete
2024-03-17 18:29:14 -07:00
chenyu
639bd5dbfc move bf16 cast hack to Tensor.llvm_bf16_cast (#3788) 2024-03-17 18:51:22 -04:00
George Hotz
311cf2b7d3 Revert "threefry_2x32 (#2601)" (#3784)
This reverts commit db3de54bc4.
2024-03-17 10:27:20 -07:00
wozeparrot
db3de54bc4 threefry_2x32 (#2601)
* feat: initial xor

* feat: initial threefry

* feat: remove custom random

* fix: really need to install precommit

* feat: lmao forgot that this is rotate not a shift

* clean: put that there

* feat: numpy xor

* feat: quick test for xor

* feat: llvm xor

* feat: slightly working xor in torch

* feat: rand works in jit

* clean: save a line

* feat: match jax

* feat: maybe test against jax

* feat: requires_grad

* fix: fix test_symbolic_ops

* feat: lower alpha

* feat: just pad

* fix: maybe fix training tests?

* fix: fix some llvm stuff

* feat: cursed realize on the way out

* feat: testing jax

* fix: why is the jax install process not simple

* fix: maybe passing test

* fix: symbolic workarounds

* clean: still need that precommit

* fix: aaaa

* fix: more test fixes

* fix: quick fix for wgsl

* feat: need to set requires_grad on the final tensor

* feat: one more tensor

* feat: don't take forever

* feat: seeing y ci is brok

* feat: can't allocate 64GiB lmao

* fix: fix this

* feat: hope this doesn't break smth before i go to bed

* feat: don't destroy ram

* feat: int

* feat: remove jax

* feat: properish workaround?

* feat: skip slow webgpu tests

* feat: no longer fails

* feat: use dtypes

* feat: real number

* fix: torch

* fix: don't test against reference for torch

* feat: to device

* feat: fix advanced indexing

* feat: correct casting

* feat: even rng_counter

* feat: match master

* feat: this was actually bad

* fix: maybe?

* feat: store

* feat: remove realizes

* feat: somehow this is important

* feat: somehow this is also important

* feat: save a line

* fix: don't need that anymore

* feat: restore this

* fix: linter

* feat: remove realizes

* fix: realized is in base now

* fix: add back cast

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: :(

* fix: :(

* fix: not being dumb

* feat: try changing less tests

* feat: shouldn't have to change that

* feat: contiguous bumps it by one

* fix: hmm

* fix: numpy memory moment

* fix: cl_khr_fp16

* fix: torch has different tensor count

* fix: missing contiguous

* hmm: hmm

* fix: some fixes

* fix: typing

* feat: dont do that

* feat: typing fixes

* feat: why is this realize required?

* feat: ngl kinda odd typing

* feat: oh

* feat: remove realizes

* feat: why is this realize required?

* fix: hacky patch for cudacpu

* fix: without this realize pytest crashes?????

* fix: shorter line

* fix: cudacpu fixes

* fix: cudacpu fixes

* feat: real buffer

* feat: don't search when searching lmao

* fix: can't use contiguous things

* fix: no more 100GB arrays

* fix: revert

* fix: skip 7 and 10

* feat: working ish beam

* feat: minimize changes

* feat: seed 0 stable diffusion example changed

* fix: different on ci

* fix: no beam

* feat: make threefry optional

* fix: check value

* fix: unused import

* feat: threefry default

* fix: 5d

* feat: allow non upcast div

* fix: 5d better

* fix: 5d better

* fix: save all dtype

* feat: proper error

* feat: lazyop key

* fix: check float

* feat: try removing this realize now

* feat: disable threefry for uops hip tensor cores

* feat: don't need that

* feat: only check upcast

* fix: disable threefry for some metal tests

* feat: disable for metal tensor uops as well

* feat: disable for most uops

* fix: disable threefry for new uops tests

* feat: multitensor

* fix: typing

* feat: threefry default off

* feat: skip threefry half rand

* feat: restore old

* fix: bad git

* clean: ruff

* feat: bfloat16 fix

* fix: :|

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-17 10:19:33 -07:00
George Hotz
53adcb34f5 remove hip backend (#3783)
* remove hip backend

* remove unused

* rhip

* more RHIP
2024-03-17 10:12:16 -07:00
qazal
e3e89c244b multioutput uoping infra (#3706)
* linearize multioutput

* add vars to copy
2024-03-15 21:56:59 -07:00
chenyu
8ea53951c1 bfloat16 Tensor.rand (#3764)
* Tensor.rand for bfloat16

for numpy based random, generate one for float then cast for bfloat16.

close #3653

* remove realize
2024-03-15 15:05:13 -04:00
chenyu
a2d3cf64a5 move is_dtype_supported to test.helpers (#3762)
* move is_dtype_supported to test.helpers

updated all places that check if float16 is supported

* fix tests
2024-03-15 14:33:26 -04:00
chenyu
922f8319cb Run test_real_world in METAL test (#3760)
* clean up test_real_world

* skip that

* JIT=2 for metal

* all device
2024-03-15 13:56:52 -04:00
nimlgen
ba79a3c09a some hsa lines saving + fixes (#3752)
* fix write to ring + some lines

* hsa driver test
2024-03-15 18:12:18 +03:00
George Hotz
ca19eb3e82 where fold try 2 (#3748)
* where fold try 2

* assign fold

* test_where_fold works

* add gated store support to ops_python

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-15 07:46:26 -07:00
nimlgen
6b8c66e04f fix broken loops in llvm (#3751) 2024-03-15 11:57:51 +03:00
chenyu
d3a6319630 bf16 tests in test_dtype.py (#3749)
With bf16 creation and bf16 to numpy, we can test bf16 in test_dtype.
Only HIP is supported for now, as it needs bf16 buffer support. Also, the rtol is slightly larger.
2024-03-15 00:17:11 -04:00
Rohan Potdar
33c01c9db0 Fix kwargs in JIT (#3730)
* Update jit.py

* Update jit.py

* added failing test

* fix type error

* Revert to itertools

* fix sorted
2024-03-14 23:55:19 -04:00
George Hotz
641f347232 simple LoadOps.ASSIGN (#3745)
* simple LoadOps.ASSIGN

* skip that test

* don't assign in onnx ops gemm

* track cache usage

* recreate the lazybuffer to avoid the cache

* fix contigs

* skip that test

* lol

* better letters
2024-03-14 20:44:34 -07:00
chenyu
75d4344cda UOps.BITCAST (#3747)
* UOps.BITCAST

implicitly fixed no const folding for bitcast

* python backend

* ptx

* consistent llvm
2024-03-14 21:00:35 -04:00
chenyu
9a00a453c7 add test case for uop cast constant fold (#3746)
and an expected-failure bitcast fold test case. Will fix with the UOps.BITCAST refactor
2024-03-14 20:00:27 -04:00
chenyu
11c61ae044 Revert "fix const bitcast should not be constant folded (#3743)" (#3744)
This reverts commit 38ba277ac8.
2024-03-14 19:24:05 -04:00
George Hotz
d52d0b0efb test_assign_kv_cache 2024-03-14 16:17:20 -07:00
chenyu
38ba277ac8 fix const bitcast should not be constant folded (#3743)
* fix const bitcast should not be constant folded

* fixed const bf16 creation

* LLVM still broken
2024-03-14 19:13:52 -04:00
George Hotz
3527c5a9d2 add Tensor.replace (#3738)
* add Tensor.replace

* fix dtypes in that test

* should be replace

* and mixtral
2024-03-14 13:34:14 -07:00
chenyu
0ead0bdb65 script to benchmark beam v hcopt (#3737)
the goal is that big enough beam should be faster than hcopt/tc

also this failed on tc opt
NUM=2 FILTER_REDUCE=1 TEST_N=20 BEAM=4 DEBUG=2 python test/external/speed_beam_v_hcopt.py
2024-03-14 15:04:03 -04:00
chenyu
90e55a9fd1 fix buf_index not found case in _apply_tc_opt (#3739)
ValueError if src.src[0] is not a LOAD. Replaced with returning None in _apply_tc_opt and test to make sure the net output is KernelOptError.
2024-03-14 14:27:05 -04:00
nimlgen
6bf11a2ce3 fix incorrect direct store with gep (#3735)
* fix incorrect direct store with gep

* better comment

* phi as well

* dtype check there

* mypy happy?

* not used

* renames

* phi in phi
2024-03-14 20:58:50 +03:00
qazal
00c56db1a4 Fix JITItem count assert for HSAGraph (#3734)
* exclude HSA graph

* cant import HSAGraph directly
2024-03-14 14:12:35 +03:00
qazal
43953c0ba9 skip grouped store for unmatching upcasts (#3723)
* skip if upcasts dont match

* outputs match now

* this ast is hardcoded

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-14 01:18:31 -04:00
David Hou
199f7c4342 MLPerf Resnet (cleaned up) (#3573)
* this is a lot of stuff

TEST_TRAIN env for less data

don't diskcache get_train_files

debug message

no lr_scaler for fp32

comment, typo

type stuff

don't destructure proc

make batchnorm parameters float

make batchnorm parameters float

resnet18, checkpointing

hack up checkpointing to keep the names in there

oops

wandb_resume

lower lr

eval/ckpt use e+1

lars

report top_1_acc

some wandb stuff

split fw and bw steps to save memory

oops

save model when reach target

formatting

make sgd hparams consistent

just always write the cats tag...

pass X and Y into backward_step to trigger input replace

shuffle eval set to fix batchnorm eval

dataset is sorted by class, so the means and variances are all wrong

small cleanup

hack restore only one copy of each tensor

do bufs from lin after cache check (lru should handle it fine)

record epoch in wandb

more digits for topk in eval

more env vars

small cleanup

cleanup hack tricks

cleanup hack tricks

don't save ckpt for testeval

cleanup

diskcache train file glob

clean up a little

device_str

SCE into tensor

small

small

log_softmax out of resnet.py

oops

hack :(

comments

HeNormal, track gradient norm

oops

log SYNCBN to wandb

real truncnorm

less samples for truncated normal

custom init for Linear

log layer stats

small

Revert "small"

This reverts commit 988f4c1cf3.

Revert "log layer stats"

This reverts commit 9d98224585.

rename BNSYNC to SYNCBN to be consistent with cifar

optional TRACK_NORMS

fix label smoothing :/

lars skip list

only weight decay if not in skip list

comment

default 0 TRACK_NORMS

don't allocate beam scratch buffers if in cache

clean up data pipeline, unsplit train/test, put back a hack

remove print

run test_indexing on remu (#3404)

* emulated ops_hip infra

* add int4

* include test_indexing in remu

* Revert "Merge branch 'remu-dev-mac'"

This reverts commit 6870457e57, reversing
changes made to 3c4c8c9e16.

fix bad seeding

UnsyncBatchNorm2d but with synced trainable weights

label downsample batchnorm in Bottleneck

:/

:/

i mean... it runs... its hits the acc... its fast...

new unsyncbatchnorm for resnet

small fix

don't do assign buffer reuse for axis change

* remove changes

* remove changes

* move LARS out of tinygrad/

* rand_truncn rename

* whitespace

* stray whitespace

* no more gnorms

* delete some dataloading stuff

* remove comment

* clean up train script

* small comments

* move checkpointing stuff to mlperf helpers

* if WANDB

* small comments

* remove whitespace change

* new unsynced bn

* clean up prints / loop vars

* whitespace

* undo nn changes

* clean up loops

* rearrange getenvs

* cpu_count()

* PolynomialLR whitespace

* move he_normal out

* cap warmup in polylr

* rearrange wandb log

* realize both x and y in data_get

* use double quotes

* combine prints in ckpts resume

* take UBN from cifar

* running_var

* whitespace

* whitespace

* typo

* if instead of ternary for resnet downsample

* clean up dataloader cleanup a little?

* separate rng for shuffle

* clean up imports in model_train

* clean up imports

* don't realize copyin in data_get

* remove TESTEVAL (train dataloader didn't get freed every loop)

* adjust wandb_config entries a little

* clean up wandb config dict

* reduce lines

* whitespace

* shorter lines

* put shm unlink back, but it doesn't seem to do anything

* don't pass seed per task

* monkeypatch batchnorm

* the reseed was wrong

* add epoch number to desc

* don't unsyncedbatchnorm if syncbn=1

* put back downsample name

* eval every epoch

* Revert "the reseed was wrong"

This reverts commit 3440a07dff3f40e8a8d156ca3f1938558a59249f.

* cast lr in onecycle

* support fp16

* cut off kernel if expand after reduce

* test polynomial lr

* move polynomiallr to examples/mlperf

* working PolynomialDecayWithWarmup + tests.......

add lars_util.py, oops

* keep lars_util.py as intact as possible, simplify our interface

* no more half

* polylr and lars were merged

* undo search change

* override Linear init

* remove half stuff from model_train

* update scheduler init with new args

* don't divide by input mean

* mistake in resnet.py

* restore whitespace in resnet.py

* add test_data_parallel_resnet_train_step

* move initializers out of resnet.py

* unused imports

* log_softmax to model output in test to fix precision flakiness

* log_softmax to model output in test to fix precision flakiness

* oops, don't realize here

* is None

* realize initializations in order for determinism

* BENCHMARK flag for number of steps

* add resnet to benchmark.yml

* return instead of break

* missing return

* cpu_count, rearrange benchmark.yml

* unused variable

* disable tqdm if BENCHMARK

* getenv WARMUP_EPOCHS

* unlink disktensor shm file if exists

* terminate instead of join

* properly shut down queues

* use hip in benchmark for now

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-14 00:53:41 -04:00
George Hotz
56b914fc8c hotfix: test_assign_contiguous 2024-03-13 17:49:54 -07:00
chenyu
4d6ec41adb failed test cases for bf16 Tensor.full (#3729)
fixable with float const then cast to bf16. cast folding with bitcast is incorrectly skipped
2024-03-13 20:46:45 -04:00
George Hotz
838afbc351 assign tests (#3728) 2024-03-13 17:04:55 -07:00
chenyu
6793db169b bfloat16 tensor creation from list and numpy (#3724) 2024-03-13 18:44:05 -04:00
qazal
337cd53444 multioutput ScheduleItem (#3699)
* refactor realize.py

* update docs

* update test_sched

* update runners and devices

* update openpilot and unit tests

* cleanup runner lowering

* update more tests
2024-03-13 08:59:38 -07:00
nimlgen
08064a0e29 add SEED env to fuzz_linearizer (#3713)
* add SEED env to test/external/fuzz_linearizer.py

* found some

* more platforms
2024-03-13 18:08:42 +03:00
chenyu
e1b2a82d89 fix st.real_size can be negative if valid is always false (#3708)
two followups after this. (1) if a buffer is never accessed in kernel, it can be removed from input (2) real_size can be smaller conditional on valid being true (the old validhack stuff)
2024-03-12 20:34:07 -04:00
Francis Lam
b6e2495fdd kernel: limit shared memory usage when adding opts (#3705)
* kernel: limit shared memory usage when adding opts

* search: remove unnecessary limit on search space

apply_opt will do the more correct check
2024-03-12 17:06:21 -04:00
George Hotz
2024b24f35 add some graph tests (#3702)
* add some graph tests

* PatternMatcher class

* speedup

* const cast test

* fix tests

* itertools chain
2024-03-12 09:49:47 -07:00
chenyu
f599c6e7f4 test output dtypes match in test_ops (#3703)
need to cast some torch output to int32 because torch default returns int64 for index related function

close #2797
2024-03-12 12:44:40 -04:00
chenyu
02ca067bdf use default_float.np to construct test data in test_ops (#3701)
first step of #2797
2024-03-12 11:58:20 -04:00
Patrick Tsai
971d7f5d7c O(n) arange attempt (#3530)
* It works?

* Clamp correctly

* Refactor

* Make code better

* Undo some stuff

* First step to trying to make floats work

* Floats work in Python op but not metal because int div is different

Python integer division was implemented as // which rounds towards
negative infinity, but C integer division rounds towards 0 so there
is an off-by-1 division error

* arange does cumsum with ints and then multiplies by step

This is so loop optimization can remain int only

* Undo a lot of symbolic changes

* Final check

* Cleanup

* There can be multiple phis

* Fix multiple phi op removal

* const sets dtype correctly

* Fix bugs

* Fix a couple bugs and add loop vars to resolve

* missed one

* Don't trim too many ops

* Fix symbolic test

* Use ones instead of full

* Delete test

* Lint passes

* max node error

* Small updates to loop logic

* Remove unnecessary changes

* We are getting somewhere

* Simple case

* Fix

* rm, prn

* Better

* If NumNode doesn't work then continue

* clamp is needed for arange(256)

* Move everything into the optim fn

* Replace correctly

* Order optimizations better

* Delete

* mypy

* Test for simplification

* Rename

* Fix test

* update test description

* Undo more

* Cleanup

* No replaced_ops map

* Fix lint

* AssertionError

* back again

* Reinstate assertion

* Return true and make diff not as big

* Bigger range for test

* Change cumsum impl

* fix bug

* make big cumsum work

* lint

* Undo cumsum 2-stage removal

* No while helper

* optional min/max clamping

* floats work

* rm giant arange test

* fix python cast None

* Check phi parents

* one phi allowed per where

* Fix one phi per where

* Rework iteration

* Delete assertions

* convert to int

* Try mul -1 instead of neg for hip..?

* Remove one phi per where requirements

* one accum only

* Lint

* should simplify a loop at a time

* Don't get rid of loop explicitly

* Need to iterate backwards

* lint

* unary neg

* Make optim work for onnx and sum_pad_collapse

* Better message

* filter alu ops correctly

* Fix the limiter

* lint and simplify

* Add it back

* off by one error

* test wheres and phis

* test max ops and non-if stuff

* <=

* cast_scalar

* Oops

* Change test

* Pass loop uops instead of a modified map

* Cut param transfer between linearizer and uops

* Fix issues

* Fix lint

* fix efficientnet python 3.8 invalid syntax

* distinct vars in seen_vars

* accurate var names

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-11 16:09:20 -07:00
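
Two points from the commit message above (the `//` versus C division mismatch, and building arange from an integer cumsum multiplied by the step) can be illustrated in plain Python. This is only a sketch under stated assumptions: `c_div` and `arange_via_cumsum` are hypothetical helper names introduced here, not functions from the repository, and the actual change operates on uops rather than Python lists.

```python
import math
from itertools import accumulate

def c_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero, as C does.

    Hypothetical helper for illustration only.
    """
    q = abs(a) // abs(b)
    # The quotient is positive when the signs match, negative otherwise.
    return q if (a < 0) == (b < 0) else -q

def arange_via_cumsum(start, stop, step=1):
    """Build arange from an integer running count, then scale by step.

    Hypothetical helper: the counting stays integer-only and the
    (possibly float) step is applied afterwards.
    """
    n = max(0, math.ceil((stop - start) / step))
    counts = accumulate([1] * n)  # integer cumsum of ones: 1, 2, ..., n
    return [start + (c - 1) * step for c in counts]

# Python's // floors, so it differs from C for negative operands.
print(-7 // 2, c_div(-7, 2))
print(arange_via_cumsum(0.0, 1.0, 0.25))
```

The two division results differ by one whenever the operands have opposite signs and the division is inexact, which matches the off-by-1 error the message describes.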
qazal
aec4c4f01b linearizer ast as a tuple of lazyops (#3689)
* multi store op linearizer

* currently we do only one output per kernel

* named opts
2024-03-11 15:39:04 -07:00
Skosh
e8c350fdac fix: make Tensor.rand produce correct values for float16 (#3654)
* fix: make Tensor.rand produce correct values for float16

Due to precision loss when casting to float16, the data distribution created by custom_random isn't in the interval ]0, 1[ as intended, but in the interval ]0, 1], which causes Tensor.randn to incorrectly generate values of infinity.

The solution uses a scaling value to make sure the values stay under 1 when using half precision.

Closes #3611

* update implementation to truncate to closest f16 value to 1

* chore: fix whitespace

* test larger distribution

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-10 18:48:00 -04:00
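
The rounding problem described in the commit message above can be reproduced with the standard library alone. The sketch below uses `struct`'s half-precision format to round a value to float16; `to_f16` and `clamp_below_one` are hypothetical names for illustration, and the clamp is only one way to keep values under 1, not necessarily what the commit does.

```python
import struct

def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# float16 has a 10-bit mantissa, so values just below 1 are spaced 2**-11 apart.
LARGEST_F16_BELOW_ONE = 1.0 - 2.0 ** -11

def clamp_below_one(x: float) -> float:
    """Hypothetical fix: cast to float16, then cap at the largest value below 1."""
    return min(to_f16(x), LARGEST_F16_BELOW_ONE)

u = 0.9999                 # inside ]0, 1[ at higher precision
print(to_f16(u))           # rounds up to 1.0 in half precision
print(clamp_below_one(u))  # stays strictly below 1
```

Any value closer to 1 than to `LARGEST_F16_BELOW_ONE` rounds up to exactly 1.0 when cast, which is how the upper bound of the interval becomes inclusive.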
George Hotz
44a67bf783 constant folding (#3675)
* constant fold

* bool math

* fix ptx
2024-03-10 14:47:24 -07:00
George Hotz
25aede6fd9 truncate for exec_alu (#3674) 2024-03-10 14:19:04 -07:00
Francis Lata
957ae9b594 Fix Tensor's __repr__ for printing out grad (#3673)
* update check for Tensor's __repr__ with grad

* add test for repr with grad bugfix
2024-03-10 17:04:29 -04:00
George Hotz
69ca7f7bf9 changes for teenygrad (#3665)
* changes for teenygrad

* upd

* simpler test
2024-03-09 15:30:34 -08:00
Maximilian Wolf
8ae85b2cf5 add inference_mode context manager with decorator support (#3621)
* add inference_mode context manager with decorator support

* change val to mode for train and inference_mode

* fix wrong rename
2024-03-09 08:38:26 -08:00
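
A context manager that also works as a decorator, as the commit title above describes, can be built on `contextlib.ContextDecorator`. The following is a minimal sketch and an assumption, not the repository's implementation: `Config.training` and `inference_mode_sketch` are hypothetical names standing in for whatever global state the real `inference_mode` toggles.

```python
from contextlib import ContextDecorator

class Config:
    """Hypothetical stand-in for a global training flag."""
    training = True

class inference_mode_sketch(ContextDecorator):
    """Temporarily switch training off (mode=True) or on (mode=False)."""

    def __init__(self, mode: bool = True):
        self.mode = mode

    def __enter__(self):
        # Remember the previous state so it can be restored on exit.
        self.prev = Config.training
        Config.training = not self.mode
        return self

    def __exit__(self, *exc):
        Config.training = self.prev
        return False  # do not swallow exceptions

# Used as a context manager:
with inference_mode_sketch():
    print(Config.training)

# Used as a decorator:
@inference_mode_sketch()
def predict():
    return Config.training

print(predict(), Config.training)
```

Inheriting from `ContextDecorator` means the same `__enter__`/`__exit__` pair serves both usages, so the two forms cannot drift apart.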
Obada Khalili
b5cbf1792a Fix Tensor.cumsum when axis of length 0 is selected (#3473)
* fix Tensor.cumsum when axis of length 0 is selected

* add cumsum regression test

* define padding left size in a separate line
2024-03-09 08:26:41 -08:00
chenyu
915f98791c use custom KernelOptError in kernel opt (#3661)
be more specific about invalid kernel opt, used that in test_linearizer_failures.

make BEAM kernel search work even with assertion disabled.

`BEAM=2 python3 -O examples/llama.py  --temperature=0 --count=10 --prompt="Hello." --timing`
2024-03-08 15:36:16 -05:00