Commit Graph

1015 Commits

Author SHA1 Message Date
chenyu
e12bc85014 use BS=128 and BS=768 for resent benchmark (#3815)
50% more hcopt perf with this one weird trick
2024-03-18 23:49:55 -04:00
chenyu
20681d5c4a remove old dist multigpu (#3811) 2024-03-18 18:31:05 -04:00
chenyu
5dd048a378 remove HIP in core tinygrad (#3810)
* remove HIP in core tinygrad

ci test uses device RHIP and HSA compiler (LinearizerOpt), so fine to remove HIP from tc.
Also updated README and EMULATE tc test flag

* EMULATE_CUDA
2024-03-18 18:19:27 -04:00
wozeparrot
a0ab755317 threefry again (#3785)
* feat: initial xor

* feat: initial threefly

* feat: remove custom random

* fix: really need to install precommit

* feat: lmao forgot that this is rotate not a shift

* clean: put that there

* feat: numpy xor

* feat: quick test for xor

* feat: llvm xor

* feat: slightly working xor in torch

* feat: rand works in jit

* clean: save a line

* feat: match jax

* feat: maybe test against jax

* feat: requires_grad

* fix: fix test_symbolic_ops

* feat: lower alpha

* feat: just pad

* fix: maybe fix training tests?

* fix: fix some llvm stuff

* feat: cursed realize on the way out

* feat: testing jax

* fix: why is the jax install process not simple

* fix: maybe passing test

* fix: symbolic workarounds

* clean: still need that precommit

* fix: aaaa

* fix: more test fixes

* fix: quick fix for wgsl

* feat: need to set requires_grad on the final tensor

* feat: one more tensor

* feat: don't take forever

* feat: seeing y ci is brok

* feat: can't allocate 64GiB lmao

* fix: fix this

* feat: hope this doesn't break smth before i go to bed

* feat: don't destroy ram

* feat: int

* feat: remove jax

* feat: properish workaround?

* feat: skip slow webgpu tests

* feat: no longer fails

* feat: use dtypes

* feat: real number

* fix: torch

* fix: don't test against reference for torch

* feat: to device

* feat: fix advanced indexing

* feat: correct casting

* feat: even rng_counter

* feat: match master

* feat: this was actually bad

* fix: maybe?

* feat: store

* feat: remove realizes

* feat: somehow this is important

* feat: somehow this is also important

* feat: save a line

* fix: don't need that anymore

* feat: restore this

* fix: linter

* feat: remove realizes

* fix: realized is in base now

* fix: add back cast

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: :(

* fix: :(

* fix: not being dumb

* feat: try changing less tests

* feat: shouldn't have to change that

* feat: contiguous bumps it by one

* fix: hmm

* fix: numpy memory moment

* fix: cl_khr_fp16

* fix: torch has different tensor count

* fix: missing contiguous

* hmm: hmm

* fix: some fixes

* fix: typing

* feat: dont do that

* feat: typing fixes

* feat: why is this realize required?

* feat: ngl kinda odd typing

* feat: oh

* feat: remove realizes

* feat: why is this realize required?

* fix: hacky patch for cudacpu

* fix: without this realize pytest crashes?????

* fix: shorter line

* fix: cudacpu fixes

* fix: cudacpu fixes

* feat: real buffer

* feat: don't search when searching lmao

* fix: can't use contiguous things

* fix: no more 100GB arrays

* fix: revert

* fix: skip 7 and 10

* feat: working ish beam

* feat: minimize changes

* feat: seed 0 stable diffusion example changed

* fix: different on ci

* fix: no beam

* feat: make threefry optional

* fix: check value

* fix: unused import

* feat: threefry default

* fix: 5d

* feat: allow non upcast div

* fix: 5d better

* fix: 5d better

* fix: save all dtype

* feat: proper error

* feat: lazyop key

* fix: check float

* feat: try removing this realize now

* feat: disable threefry for uops hip tensor cores

* feat: don't need that

* feat: only check upcast

* fix: disable threefry for some metal tests

* feat: disable for metal tensor uops as well

* feat: disable for most uops

* fix: disable threefry for new uops tests

* feat: multitensor

* fix: typing

* feat: threefry default off

* feat: skip threefry half rand

* feat: restore old

* fix: bad git

* clean: ruff

* feat: bfloat16 fix

* fix: :|

* feat: restore old

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-18 16:47:07 -04:00
chenyu
1711274654 7B llama on 4 gpus on benchmark (#3804) 2024-03-18 14:32:37 -04:00
George Hotz
311cf2b7d3 Revert "threefry_2x32 (#2601)" (#3784)
This reverts commit db3de54bc4.
2024-03-17 10:27:20 -07:00
wozeparrot
db3de54bc4 threefry_2x32 (#2601)
* feat: initial xor

* feat: initial threefly

* feat: remove custom random

* fix: really need to install precommit

* feat: lmao forgot that this is rotate not a shift

* clean: put that there

* feat: numpy xor

* feat: quick test for xor

* feat: llvm xor

* feat: slightly working xor in torch

* feat: rand works in jit

* clean: save a line

* feat: match jax

* feat: maybe test against jax

* feat: requires_grad

* fix: fix test_symbolic_ops

* feat: lower alpha

* feat: just pad

* fix: maybe fix training tests?

* fix: fix some llvm stuff

* feat: cursed realize on the way out

* feat: testing jax

* fix: why is the jax install process not simple

* fix: maybe passing test

* fix: symbolic workarounds

* clean: still need that precommit

* fix: aaaa

* fix: more test fixes

* fix: quick fix for wgsl

* feat: need to set requires_grad on the final tensor

* feat: one more tensor

* feat: don't take forever

* feat: seeing y ci is brok

* feat: can't allocate 64GiB lmao

* fix: fix this

* feat: hope this doesn't break smth before i go to bed

* feat: don't destroy ram

* feat: int

* feat: remove jax

* feat: properish workaround?

* feat: skip slow webgpu tests

* feat: no longer fails

* feat: use dtypes

* feat: real number

* fix: torch

* fix: don't test against reference for torch

* feat: to device

* feat: fix advanced indexing

* feat: correct casting

* feat: even rng_counter

* feat: match master

* feat: this was actually bad

* fix: maybe?

* feat: store

* feat: remove realizes

* feat: somehow this is important

* feat: somehow this is also important

* feat: save a line

* fix: don't need that anymore

* feat: restore this

* fix: linter

* feat: remove realizes

* fix: realized is in base now

* fix: add back cast

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: :(

* fix: :(

* fix: not being dumb

* feat: try changing less tests

* feat: shouldn't have to change that

* feat: contiguous bumps it by one

* fix: hmm

* fix: numpy memory moment

* fix: cl_khr_fp16

* fix: torch has different tensor count

* fix: missing contiguous

* hmm: hmm

* fix: some fixes

* fix: typing

* feat: dont do that

* feat: typing fixes

* feat: why is this realize required?

* feat: ngl kinda odd typing

* feat: oh

* feat: remove realizes

* feat: why is this realize required?

* fix: hacky patch for cudacpu

* fix: without this realize pytest crashes?????

* fix: shorter line

* fix: cudacpu fixes

* fix: cudacpu fixes

* feat: real buffer

* feat: don't search when searching lmao

* fix: can't use contiguous things

* fix: no more 100GB arrays

* fix: revert

* fix: skip 7 and 10

* feat: working ish beam

* feat: minimize changes

* feat: seed 0 stable diffusion example changed

* fix: different on ci

* fix: no beam

* feat: make threefry optional

* fix: check value

* fix: unused import

* feat: threefry default

* fix: 5d

* feat: allow non upcast div

* fix: 5d better

* fix: 5d better

* fix: save all dtype

* feat: proper error

* feat: lazyop key

* fix: check float

* feat: try removing this realize now

* feat: disable threefry for uops hip tensor cores

* feat: don't need that

* feat: only check upcast

* fix: disable threefry for some metal tests

* feat: disable for metal tensor uops as well

* feat: disable for most uops

* fix: disable threefry for new uops tests

* feat: multitensor

* fix: typing

* feat: threefry default off

* feat: skip threefry half rand

* feat: restore old

* fix: bad git

* clean: ruff

* feat: bfloat16 fix

* fix: :|

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-17 10:19:33 -07:00
George Hotz
53adcb34f5 remove hip backend (#3783)
* remove hip backend

* remove unused

* rhip

* more RHIP
2024-03-17 10:12:16 -07:00
chenyu
77febb44e6 llama 7B on 6 gpus benchmark (#3773) 2024-03-16 11:38:52 -04:00
George Hotz
0870dd5b3b hotfix: switch resnet training from HIP -> HSA in CI 2024-03-15 13:35:52 -07:00
chenyu
8ea53951c1 bfloat16 Tensor.rand (#3764)
* Tensor.rand for bfloat16

for numpy based random, generate one for float then cast for bfloat16.

close #3653

* remove realize
2024-03-15 15:05:13 -04:00
chenyu
a2d3cf64a5 move is_dtype_supported to test.helpers (#3762)
* move is_dtype_supported to test.helpers

updated all places that check if float16 is supports

* fix tests
2024-03-15 14:33:26 -04:00
chenyu
922f8319cb Run test_real_world in METAL test (#3760)
* clean up test_real_world

* skip that

* JIT=2 for metal

* all device
2024-03-15 13:56:52 -04:00
George Hotz
5b3d8a886e split tinybox benchmark into two (#3741)
* split tinybox benchmark into two

* symlinks
2024-03-14 14:12:32 -07:00
David Hou
199f7c4342 MLPerf Resnet (cleaned up) (#3573)
* this is a lot of stuff

TEST_TRAIN env for less data

don't diskcache get_train_files

debug message

no lr_scaler for fp32

comment, typo

type stuff

don't destructure proc

make batchnorm parameters float

make batchnorm parameters float

resnet18, checkpointing

hack up checkpointing to keep the names in there

oops

wandb_resume

lower lr

eval/ckpt use e+1

lars

report top_1_acc

some wandb stuff

split fw and bw steps to save memory

oops

save model when reach target

formatting

make sgd hparams consistent

just always write the cats tag...

pass X and Y into backward_step to trigger input replace

shuffle eval set to fix batchnorm eval

dataset is sorted by class, so the means and variances are all wrong

small cleanup

hack restore only one copy of each tensor

do bufs from lin after cache check (lru should handle it fine)

record epoch in wandb

more digits for topk in eval

more env vars

small cleanup

cleanup hack tricks

cleanup hack tricks

don't save ckpt for testeval

cleanup

diskcache train file glob

clean up a little

device_str

SCE into tensor

small

small

log_softmax out of resnet.py

oops

hack :(

comments

HeNormal, track gradient norm

oops

log SYNCBN to wandb

real truncnorm

less samples for truncated normal

custom init for Linear

log layer stats

small

Revert "small"

This reverts commit 988f4c1cf3.

Revert "log layer stats"

This reverts commit 9d98224585.

rename BNSYNC to SYNCBN to be consistent with cifar

optional TRACK_NORMS

fix label smoothing :/

lars skip list

only weight decay if not in skip list

comment

default 0 TRACK_NORMS

don't allocate beam scratch buffers if in cache

clean up data pipeline, unsplit train/test, put back a hack

remove print

run test_indexing on remu (#3404)

* emulated ops_hip infra

* add int4

* include test_indexing in remu

* Revert "Merge branch 'remu-dev-mac'"

This reverts commit 6870457e57, reversing
changes made to 3c4c8c9e16.

fix bad seeding

UnsyncBatchNorm2d but with synced trainable weights

label downsample batchnorm in Bottleneck

:/

:/

i mean... it runs... its hits the acc... its fast...

new unsyncbatchnorm for resnet

small fix

don't do assign buffer reuse for axis change

* remove changes

* remove changes

* move LARS out of tinygrad/

* rand_truncn rename

* whitespace

* stray whitespace

* no more gnorms

* delete some dataloading stuff

* remove comment

* clean up train script

* small comments

* move checkpointing stuff to mlperf helpers

* if WANDB

* small comments

* remove whitespace change

* new unsynced bn

* clean up prints / loop vars

* whitespace

* undo nn changes

* clean up loops

* rearrange getenvs

* cpu_count()

* PolynomialLR whitespace

* move he_normal out

* cap warmup in polylr

* rearrange wandb log

* realize both x and y in data_get

* use double quotes

* combine prints in ckpts resume

* take UBN from cifar

* running_var

* whitespace

* whitespace

* typo

* if instead of ternary for resnet downsample

* clean up dataloader cleanup a little?

* separate rng for shuffle

* clean up imports in model_train

* clean up imports

* don't realize copyin in data_get

* remove TESTEVAL (train dataloader didn't get freed every loop)

* adjust wandb_config entries a little

* clean up wandb config dict

* reduce lines

* whitespace

* shorter lines

* put shm unlink back, but it doesn't seem to do anything

* don't pass seed per task

* monkeypatch batchnorm

* the reseed was wrong

* add epoch number to desc

* don't unsyncedbatchnorm is syncbn=1

* put back downsample name

* eval every epoch

* Revert "the reseed was wrong"

This reverts commit 3440a07dff3f40e8a8d156ca3f1938558a59249f.

* cast lr in onecycle

* support fp16

* cut off kernel if expand after reduce

* test polynomial lr

* move polynomiallr to examples/mlperf

* working PolynomialDecayWithWarmup + tests.......

add lars_util.py, oops

* keep lars_util.py as intact as possible, simplify our interface

* no more half

* polylr and lars were merged

* undo search change

* override Linear init

* remove half stuff from model_train

* update scheduler init with new args

* don't divide by input mean

* mistake in resnet.py

* restore whitespace in resnet.py

* add test_data_parallel_resnet_train_step

* move initializers out of resnet.py

* unused imports

* log_softmax to model output in test to fix precision flakiness

* log_softmax to model output in test to fix precision flakiness

* oops, don't realize here

* is None

* realize initializations in order for determinism

* BENCHMARK flag for number of steps

* add resnet to bechmark.yml

* return instead of break

* missing return

* cpu_count, rearrange benchmark.yml

* unused variable

* disable tqdm if BENCHMARK

* getenv WARMUP_EPOCHS

* unlink disktensor shm file if exists

* terminate instead of join

* properly shut down queues

* use hip in benchmark for now

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-14 00:53:41 -04:00
chenyu
f30fb192b7 resnet eval on tinybox ci (#3714) 2024-03-13 13:26:30 -04:00
chenyu
d69170e27e add llama 2 70B in ci and verify output (#3682)
* add llama 2 70B in ci and verify output

* ln -s llama2 dir
2024-03-11 12:48:22 -04:00
chenyu
e10ee2ed3f llama beam tinybox ci (#3680) 2024-03-11 01:35:39 -04:00
chenyu
bad6adaf8c add mixtral and 6 gpus cifar to tinybox ci (#3676)
* add mixtral and 6 gpus cifar to tinybox ci

* print total ram used at the end of loading
2024-03-10 18:25:31 -04:00
qazal
bdd62c7fd8 make the bf16 include dynamic (#3642)
* dynamic prefix

* add common ones above

these are common dtypes

aesthetics

* regression test

fuzz it

test

* run in CI

* use .append

* faster
2024-03-07 10:31:35 -05:00
David Hou
0afaf70d57 lars optimizer + tests (#3631)
* lars optimizer + tests

* fix skip list!

* use id to compare in skip list

* go back to using set

* Tensor(bool) * Tensor(bool) is and

* don't lint external/mlperf_resnet

* whitespace

* add external_test_optim to opencl tests

* give mlperf task a name

* mlperf under onnx

* remove track_gnorm

* contiguous instead of realize

* assert momentum and weight decay positive

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-06 18:11:01 -05:00
George Hotz
81baf3eed3 bring ptx back (#3623)
* bring ptx back

* ptx back

* fix define var

* fix a few bugs

* bugfixes

* fixes

* fix llvm bug

* fix test bug
2024-03-06 13:34:21 -08:00
George Hotz
568353fa84 hotfix: bump line count to 6500 2024-03-06 07:52:18 -08:00
chenyu
3c3f846c45 tinybox benchmark with HSA (#3603)
* tinybox benchmark with HSA

* torch cuda init can fail

* no TORCHCUDA

* print torch version

* LD_PRELOAD="/opt/rocm/lib/libhsa-runtime64.so"
2024-03-05 11:03:52 -05:00
chenyu
957e9800f1 llama + beam to mac benchmark, full cifar to nvidia benchmark (#3612)
would merge if it's also ~1 minute. btw why is gpt2 beam not slower in the first beam run?
2024-03-04 21:35:57 -05:00
chenyu
c3b8d285aa cleanup uops (#3605)
using `is` to compare with enums, remove long lines and slightly more compact
2024-03-04 11:03:14 -05:00
chenyu
8e5d60a322 add more gpt2 variant in mac/nvidia benchmark (#3599) 2024-03-03 17:55:30 -05:00
George Hotz
770707b376 hotfix: gpuocelot no rebuild 2024-03-02 15:57:38 -08:00
Francis Lam
162dfb07d9 fuzz_linearizer: fix uops and add to test.yml (#3588) 2024-03-02 15:03:42 -08:00
Francis Lam
e17f1821a7 wmma: add CUDA tensor core and fix test_speed_v_torch failure (#3544) 2024-03-01 17:51:02 -08:00
chenyu
b7e555f6c0 run test_linearizer_failures on PYTHON backend (#3565)
* run test_linearizer_failures on PYTHON backend

only test 1, some have hanging issues and gated store is not implemented

* --durations=20

* two less slow ones
2024-03-01 17:00:18 -05:00
George Hotz
5a6e151844 no barrier side effect (#3550)
* no barrier side effect

* finish barrier removal
2024-02-29 18:10:04 -08:00
George Hotz
2c19ab6561 define var (#3548)
* define var

* remove vars from there

* fix python symbolic ops

* fix llvm

* pypath
2024-02-29 16:43:27 -08:00
chenyu
978a997d1f print nvidia-smi in CI benchmark (#3546) 2024-02-29 17:31:37 -05:00
George Hotz
e7cda40d52 Revert "hotfix: disable metal graph"
This reverts commit 3541602877.
2024-02-28 16:25:12 -08:00
George Hotz
3541602877 hotfix: disable metal graph 2024-02-28 10:33:34 -08:00
George Hotz
c34d382a1e bump to macos-14 M1 (#3520)
* bump to macos-14 M1

* bump cache key

* no -n auto

* jit=2

* real tensor cores
2024-02-28 10:28:25 -08:00
George Hotz
7698781389 Revert "wmma: add CUDA tensor core (#3464)" (#3474)
This reverts commit e9cef13f0b.
2024-02-22 11:58:16 +01:00
Francis Lam
e9cef13f0b wmma: add CUDA tensor core (#3464) 2024-02-22 11:57:08 +01:00
wozeparrot
57678012e1 Upload correct benchmark artifact (#3471)
* fix: correct filename

* fix: why is this .py?
2024-02-22 01:14:16 -05:00
chenyu
7c0fc40123 enable test IMAGE=2 PYTHON=1 python3 test/test_ops.py TestOps.test_simple_conv2d (#3468) 2024-02-21 18:30:12 -05:00
chenyu
77d2a4c12a regenerate kernel dataset after reduce arg to axis change (#3467)
```
./extra/optimization/generate_dataset.sh
gzip /tmp/sops
mv /tmp/sops.gz extra/datasets/
```
2024-02-21 18:16:13 -05:00
George Hotz
871ba73e65 _reduce_op is axis based now (#3462)
* _reduce_op is axis based now

* axis_

* update lin failures

* disable that

* fix shape
2024-02-21 16:36:31 +01:00
chenyu
02683a8659 gate the cast before movements in lazy (#3452)
it made gpt2 slower (2ms -> 2.5ms on 3090, 7ms -> 8ms on M1 Max with BEAM=2).
disabled it in gpt2 benchmark before understanding the full issue
2024-02-20 09:36:22 -05:00
qazal
7864fb69d1 delete MovementOps (#3434)
* delete MovementOps

* keep extra/to_movement_ops.py
2024-02-19 23:21:44 +01:00
Patrick Tsai
ac9d94a068 Cast correctly in python emulator (dtype tests pass) (#3446)
* Cast correctly in python emulator

* Update test yml and fix lint

* make ruff pass

* mypy passes

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-02-19 13:34:02 +01:00
George Hotz
b1c0d8c99d remove cpu and torch backends (#3399)
* remove cpu and torch backends

* don't copy to cpu

* use clang instead of cpu

* multitensor gathers on the first device

* clang is cpu + use default

* fixup

* bugfix
2024-02-15 16:55:39 +01:00
Obada Khalili
75f7e21a80 Make tests in test/test_ops.py pass for Python emulator (#3384)
* fix OverflowError in UnaryOps.EXP2

* avoid accessing outputs for void uops

* skip execution for UOps.IF and UOps.ENDIF

* initialize bytearray to the correct size in UOps.DEFINE_LOCAL

* validate len of input that has .sz > 1

* remove comment in code

* reinitialize loop of already iterated

* validate first value in input to be a list for inputs with .sz > 1

* add python ops tests to CI

* skip long runtime tests for PYTHON backend

* respect dtype.sz arg in UOps.CONST, and remove incorrect validation in UOps.STORE

* use math.inf instead of float('int')

* handle 0 args to UnaryOPs.LOG2

* handle load op with default of .sz > 1

* initialize the loop correctly using UOps.LOOP arg

* remove unnecessary TODO comment

* remove newline

* select a subset of 22 ops tests to skip in CI when PYTHON=1

* handle gated UOps.LOAD referencing values that have .sz > 1

* Revert "select a subset of 22 ops tests to skip in CI when PYTHON=1"

This reverts commit 7674fee81d.

* skip tests in python backend CI command

* push fix lost in conflict resolve

* Revert "skip long runtime tests for PYTHON backend"

This reverts commit 5dd2a0376e.

* clear loop state after last iteration
2024-02-15 16:40:25 +01:00
qazal
49cb1fee54 run test_indexing on remu (#3404)
* emulated ops_hip infra

* add int4

* include test_indexing in remu

* Revert "Merge branch 'remu-dev-mac'"

This reverts commit 6870457e57, reversing
changes made to 3c4c8c9e16.
2024-02-15 11:52:40 +01:00
qazal
27f4de2ce4 delete half_prekernel (#3388)
* generic rendering of half and bf16

hotfix

* fix uops + regression test

* fix the test for metal's half4

* uop.uop fixup

* mypy with --strict-equality, fix ops_gpu
2024-02-14 15:40:48 +01:00