Commit Graph

618 Commits

Author SHA1 Message Date
chenyu
aa093efa43 fix handcode_resnet50_opt flops count (#4184) 2024-04-15 22:13:45 -04:00
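For reference, handcode_resnet50_opt-style flop estimates count two ops (a multiply and an add) per MAC. A minimal sketch of the conv2d arithmetic, using resnet50's stem conv as a worked example:

```
# 2 ops (mul + add) per multiply-accumulate
def conv2d_flops(bs, cin, cout, hout, wout, kh, kw):
  return 2 * bs * cout * hout * wout * cin * kh * kw

# resnet50's stem: 7x7 conv, 3->64 channels, 224x224 input striding to 112x112
print(conv2d_flops(1, 3, 64, 112, 112, 7, 7) / 1e9)  # ~0.236 GFLOPS
```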
chenyu
d5b67c1ca3 log resnet TRAIN_BEAM / EVAL_BEAM (#4181)
also run eval in benchmark mode if either one is positive
2024-04-15 19:29:08 -04:00
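A minimal sketch of what env-var beam flags like these look like, using tinygrad's getenv helper (the variable handling around the training loop is an assumption):

```
from tinygrad.helpers import getenv

TRAIN_BEAM, EVAL_BEAM = getenv("TRAIN_BEAM"), getenv("EVAL_BEAM")
print(f"TRAIN_BEAM={TRAIN_BEAM} EVAL_BEAM={EVAL_BEAM}")   # log both up front
benchmark_eval = TRAIN_BEAM > 0 or EVAL_BEAM > 0          # eval runs in benchmark mode if either is set
```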
chenyu
6a2168e698 TRAIN_BEAM and EVAL_BEAM for resnet (#4177)
working on measuring compile time
2024-04-15 14:57:21 -04:00
David Hou
593c90d7d6 Resnet fp16 training with fp32 master weight copy (#4144)
* add casts to layers

* FLOAT flag

* detach

* no_grad for eval

* whitespace

* explicit fp32 initialization

* oops

* whitespace

* put back config['DEFAULT_FLOAT']

* bad

* live dangerously (don't hide bugs)

* don't bundle changes

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-14 11:25:08 -04:00
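The master-weight technique referenced here keeps an fp32 copy of every parameter and only casts down for compute, so fp16 rounding never accumulates in the weights themselves. A minimal numpy sketch of the idea, not the actual layer code:

```
import numpy as np

master = np.random.randn(64, 64).astype(np.float32)      # fp32 master copy of one layer

def step(grad_fp16, lr=0.1):
  global master
  master = master - lr * grad_fp16.astype(np.float32)    # the update itself runs in fp32
  return master.astype(np.float16)                       # fp16 copy used for the next forward

w16 = step(np.random.randn(64, 64).astype(np.float16))
```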
chenyu
e20d6f9221 correct resnet estimate time (#4169)
7.99 hours was rendered as 7h0m.
2024-04-14 02:21:46 -04:00
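One plausible way a "7.99 hours renders as 7h0m" bug arises is truncating the fractional hours before converting them to minutes; the actual code may differ:

```
est = 7.99                                  # estimated hours
print(f"{int(est)}h{int(est % 1) * 60}m")   # buggy: int(0.99) == 0, renders "7h0m"
print(f"{int(est)}h{int(est % 1 * 60)}m")   # fixed: truncate after the *60, renders "7h59m"
```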
George Hotz
ebc94c9d6c rewrite the jit in the context of new schedule (#4162)
* rewrite the jit in the context of new schedule

* mypy better

* fix placeholder

* tests

* all functionality should work

* fix tests

* no CacheCollector
2024-04-12 21:54:36 -07:00
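The core of a capture-and-replay jit: run the function once to record the kernel schedule, then replay that schedule with new inputs bound. A toy sketch of the shape of it (capture_schedule and launch are hypothetical stand-ins; TinyJit itself does much more, e.g. input buffer placeholders):

```
class ToyJit:
  def __init__(self, fn):
    self.fn, self.kernels = fn, None
  def __call__(self, *args):
    if self.kernels is None:
      # first call: run the function and capture the kernels it schedules
      self.kernels = capture_schedule(self.fn, *args)  # hypothetical tracer
    for k in self.kernels:
      k.launch(*args)                                  # replay: no re-scheduling, no re-compiling
```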
George Hotz
216eb235e5 hotfix: cast mnist to float 2024-04-09 19:30:03 -07:00
George Hotz
fea774f669 spend 5 lines to bring mnist into the repo (#4122) 2024-04-09 19:24:57 -07:00
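An mnist loader really can be a handful of lines: fetch the gzipped idx file and slice off the header. A sketch assuming the common cvdf mirror URL (and the hotfix above exists because these images come out as uint8, not float):

```
import gzip, urllib.request, numpy as np

def fetch_mnist(url="https://storage.googleapis.com/cvdf-datasets/mnist/train-images-idx3-ubyte.gz"):
  raw = gzip.decompress(urllib.request.urlopen(url).read())
  return np.frombuffer(raw, dtype=np.uint8)[16:].reshape(-1, 28, 28)  # skip the 16-byte idx header
```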
chenyu
92c0675ccf setitem initial support (#4093)
* wip setitem

it's an eager assign to output shapetracker view

* cleanups and tests

* more cleanups
2024-04-07 20:35:22 -04:00
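As the commit body says, setitem is an eager assign into the sliced view of the target; usage might look like this (a sketch assuming a contiguous realized target, exact semantics at the time may differ):

```
from tinygrad import Tensor

t = Tensor.zeros(4, 4).contiguous().realize()
t[2:3] = Tensor.ones(1, 4)   # eagerly assigns into the shapetracker view of row 2
print(t.numpy()[2])          # [1. 1. 1. 1.]
```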
George Hotz
97c402d69e use imagenet spawn (#4096) 2024-04-06 08:34:10 -07:00
George Hotz
fffd9b05f5 mock mnist data for imagenet trainer (#4095)
* mock mnist data for imagenet

* move print and test

* needed to reshape
2024-04-06 08:08:40 -07:00
George Hotz
93824e59eb support MOCKDATA=1 for resnet (#4090)
* mockdata for resnet

* fix eval, revert hsa
2024-04-05 17:19:18 -07:00
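MOCKDATA-style runs replace the dataloader with random batches shaped like the real data, so benchmarks don't need imagenet on disk. A minimal sketch, with shapes assumed from standard resnet training:

```
import numpy as np

def mock_batch(bs=128):
  x = np.random.randn(bs, 3, 224, 224).astype(np.float32)  # fake images
  y = np.random.randint(0, 1000, size=bs)                  # fake labels
  return x, y
```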
George Hotz
bec2aaf404 add beautiful_mnist_multigpu example 2024-04-02 00:54:04 +00:00
chenyu
aa76d566c2 cleanup mamba (#4004)
make it read nicer, clean up some movement methods, and simplify some math.
the 790M, 1.4B, and 2.8B models do not really run.
sampling is not implemented.
jit is incorrect.
some dead code / wrong code paths, and leftover code copied from torch.
2024-03-30 02:50:13 -04:00
chenyu
c71627fee6 move GlobalCounter to helpers (#4002)
break circular import between ops and buffer
2024-03-30 00:30:30 -04:00
chenyu
ecf38f498e beam search resnet eval too in BENCHMARK (#4000) 2024-03-29 21:07:23 -04:00
reddyn12
9b5e15db6e Mamba Implementation (#3456)
* first commit

* state back to orig

* mamba comparisons

* rm file

* rename file

* use Tensor.einsum and make default model 370M

* Cleaned code and made a comparison test

* Simplify pull request. Only has 1 mamba implementation now.

* Update prompt

* rm whitespaces

* last space

* remove Einops dependency

* rm unused code

* add tests

* rm print statement

* rm imports

* skip CLANG

* Update skipIf description

* skip model test in CI and add CLANG fix

* rm Device import

* don't be stupid

* Fix conv assign

When the prompt is too short, the logic for the conv_state assign messes up. This can be fixed by padding the tokenized array to a min length of 4. I padded using the empty string token, but I don't know if proper practice is to use the PAD token

* fix p1

* temp

* fix jit import

---------

Co-authored-by: schlimeszn <schlimeszn@gmail.com>
Co-authored-by: reddyn <nikidsniper@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-28 17:49:12 -07:00
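At its core the mamba block is a selective state-space recurrence, h_t = A h_{t-1} + B_t x_t and y_t = C_t h_t, with B and C selected per timestep. A minimal numpy sketch of that loop (shapes and discretization are simplified assumptions; the einsum version in the PR is vectorized):

```
import numpy as np

def ssm_scan(x, A, B, C):              # x: (T, D), A: (D, N), B and C: (T, D, N)
  h, ys = np.zeros_like(A), []
  for t in range(len(x)):
    h = A * h + B[t] * x[t][:, None]   # per-channel state update
    ys.append((h * C[t]).sum(-1))      # project the state back to one value per channel
  return np.stack(ys)

T, D, N = 8, 4, 2
y = ssm_scan(np.random.randn(T, D), np.random.rand(D, N) * 0.9,
             np.random.randn(T, D, N), np.random.randn(T, D, N))
```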
chenyu
b47f6cebb2 LinearizerOptions -> CompilerOptions (#3978) 2024-03-28 17:50:23 -04:00
David Hou
4b95350c41 fp16 resnet (without expand backwards sum in float, doesn't work) (#3816)
* fp16 resnet

* cast running mean and var back to default float

* extra cast

* check symbolic no overflow

* add linearizer failure

* loss scaler after grad contig

* oops

* i think this works

* don't loss scale fp32

* remove overflow test case

* remove symbolic bounds check

* loss scaler should be float

* temporarily disable padto cuz bug

shruggie

* make running stats in batchnorm float32?

* calculate lars stuff in fp32?

* oops

* remove most changes

* move loss scaler out of optimizer

* no more FP16 var

* oops

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-28 01:25:37 -04:00
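The loss scaler referenced here is the standard fp16 trick: multiply the loss by a large constant before backward so small gradients don't flush to zero, then divide the gradients by the same constant before the optimizer step. A numpy demonstration of the underflow it works around (2**13 as the scale is an assumption):

```
import numpy as np

SCALE = np.float32(2**13)
print(np.float16(1e-8))                   # -> 0.0, underflows fp16's subnormal range
print(np.float16(1e-8 * SCALE) / SCALE)   # scaled up first: survives as ~1e-08
```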
Francis Lam
16a1d43f6f llama: prevent device initialization outside of __main__ (#3966)
* llama: prevent device initialization outside of __main__

causes HSA resource leaks in child compile processes

* llama: fix loading with multiple devices
2024-03-27 19:19:38 -04:00
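The fix pattern: any device setup stays under the __main__ guard, so "spawn"-ed children re-import the module without touching HSA. A generic sketch:

```
import multiprocessing as mp

def compile_task(n):  # child work; module-level code must not initialize devices
  return n * n

if __name__ == "__main__":
  # device initialization would go here, invisible to the spawned children
  with mp.get_context("spawn").Pool(processes=4) as pool:
    print(pool.map(compile_task, range(8)))
```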
George Hotz
68ca4d4276 split to schedule.py (#3949)
* split to schedule.py

* split
2024-03-26 21:02:46 -07:00
George Hotz
150ea2eb76 create engine folder and move code (#3948)
* retry

* older tf

* that
2024-03-26 20:38:03 -07:00
Arseny Kapoulkine
cb6e7b57a6 examples: Fix parameter bandwidth accounting for quantized LLama (#3930)
Instead of assuming every parameter is 2 bytes, just add up tensor sizes
in bytes
2024-03-25 18:41:05 -04:00
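The corrected accounting sums the actual size of each tensor instead of computing 2 * n_params. A sketch assuming a state-dict-like mapping whose values expose .shape and .dtype.itemsize:

```
from math import prod

def param_bytes(state_dict):
  return sum(prod(p.shape) * p.dtype.itemsize for p in state_dict.values())

# bandwidth estimate: param_bytes(state_dict) / seconds_per_token
```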
chenyu
d651835ef5 verify beautiful_mnist.py eval acc and put into benchmark ci (#3926)
* verify beautiful_mnist and put in ci

* 97.5 for eval verification
2024-03-25 16:47:49 -04:00
chenyu
83f39a8ceb env var to change default float (#3902)
* env var to change default float to fp16 or bf16

looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights.

working on default bf16 too.
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
    __bf16 cast0 = (nv_bfloat16)(val0);
```

remove that in cifar

* DEFAULT_FLOAT

* default of default

* unit test

* don't check default

* tests work on linux
2024-03-24 20:33:57 -04:00
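A minimal sketch of what a DEFAULT_FLOAT switch looks like; the exact lookup in tinygrad may differ:

```
import os
from tinygrad import dtypes

default_float = getattr(dtypes, os.getenv("DEFAULT_FLOAT", "float32").lower())
# DEFAULT_FLOAT=HALF python3 train.py      -> dtypes.half
# DEFAULT_FLOAT=BFLOAT16 python3 train.py  -> dtypes.bfloat16
```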
wozeparrot
9a9cac58f9 add lars to nn (#3750)
* feat: add lars

* feat: don't remove this comment

* clean: smaller diff

* clean: shorter line

* feat: remove mlperf lars, switch resnet

* fix: fully remove mlperf lars

* clean: comment

* feat: contiguous

* feat: no weight decay on skip params

* feat: optimizergroup

* feat: classic momentum

* fix: pylint

* clean: move comment

* fix: correct algo

* feat: lrschedulergroup

* feat: skip list tests

* feat: :| forgot that params are a thing

* feat: remove skip_list params from main params

* feat: set moment

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-24 11:43:12 -04:00
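LARS scales each layer's update by a trust ratio ||w|| / (||g|| + wd*||w||), so layers whose gradients are large relative to their weights take proportionally smaller steps. A minimal numpy sketch of one step, with momentum and the skip list omitted:

```
import numpy as np

def lars_step(w, g, lr=1.0, wd=1e-4, eps=1e-9):
  w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
  trust = w_norm / (g_norm + wd * w_norm + eps) if w_norm > 0 and g_norm > 0 else 1.0
  return w - lr * trust * (g + wd * w)
```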
chenyu
e22d78b3d2 training cifar with BF16 on CUDA (#3905)
* training cifar with BF16 on CUDA

memory usage is between float and half due to numpy calls in dataset preprocessing, which convert to float.

* simpler bf16 functions

* bf16 cifar works for HSA too just very slow

* simpler bf16 functions, we love cuda
2024-03-24 01:37:47 -04:00
chenyu
24d004a89b hotfix check ckpts before writing achieved model (#3901)
this killed tinybox green run
2024-03-23 17:16:38 -04:00
chenyu
f7f67e0cc5 simple fix llama shard with quantize (#3882)
copy the scale to all devices for now. naive sharding does not work because the scale needs an expand to really save memory.

70B does not work due to HSA_STATUS_ERROR_OUT_OF_RESOURCES.

`python3 examples/llama.py --gen 2 --size 13B --shard 6 --prompt "Hello." --count 10 --temperature 0 --timing --quantize`

13B on 6 GPUs uses 47 GB vs. 34 GB quantized
2024-03-22 18:15:37 -04:00
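For context, this quantization scheme stores int8 weights with one float scale per output row; the scale is tiny next to the weights, which is why copying it to every device is an acceptable stopgap. A sketch (the axis choice is an assumption):

```
import numpy as np

w = np.random.randn(512, 512).astype(np.float32)
scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one float per output row
w_q = np.round(w / scale).astype(np.int8)             # int8 weights, ~4x smaller
w_deq = w_q.astype(np.float32) * scale                # dequantized on the fly at matmul time
```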
Francis Lam
a26090d404 search: change to use "spawn" and limit the number of tasks per child (#3862)
also clean up some examples to use __main__ and not initialize
resources outside of main
2024-03-21 21:23:36 -07:00
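maxtasksperchild bounds how many tasks a worker runs before being replaced, so anything a task leaks is reclaimed when the child exits. A sketch of the combination:

```
import multiprocessing as mp

def beam_task(n):  # stand-in for a compile/search task
  return n * n

if __name__ == "__main__":
  ctx = mp.get_context("spawn")                             # fresh interpreter per worker
  with ctx.Pool(processes=4, maxtasksperchild=16) as pool:  # recycle workers every 16 tasks
    print(pool.map(beam_task, range(64)))
```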
Anurag Lamsal
4e0819e40b fixing the benchmark not printing in handcode resnet50 opt example (#3850) 2024-03-21 00:55:31 -04:00
chenyu
9d1d08fbb0 show llama bandwidth with timing (#3844) 2024-03-20 17:19:15 -04:00
chenyu
dccefab23f remove mixtral weight to clang first (#3792)
seems fine without it now
2024-03-17 23:33:17 -04:00
chenyu
5ac1fa933f apply the same fix_bf16 in llama and coder (#3789)
* apply the same fix_bf16 in llama and coder

did not realize the same logic was in llama too.
really fix #2775

* flag for native SUPPORT_BF16 cast
2024-03-17 21:25:24 -04:00
chenyu
639bd5dbfc move bf16 cast hack to Tensor.llvm_bf16_cast (#3788) 2024-03-17 18:51:22 -04:00
chenyu
9255332d9e use llvm as bridge to fix_bf16 loading (#3774)
This is how bf16 load is tested in test_bf16_disk_write_read now and it should fix #2775.
I tested that it fixed loading coder using PYTHON backend.

Will separate this special bf16 load vs. regular bf16 support
2024-03-16 15:22:19 -04:00
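Underlying all of these bf16 loads is the fact that bfloat16 is just the top 16 bits of a float32, so widening is a shift and a bitcast. The commit routes this through llvm; the numpy equivalent of the bit trick:

```
import numpy as np

def bf16_to_f32(raw_u16):
  return (raw_u16.astype(np.uint32) << 16).view(np.float32)

print(bf16_to_f32(np.array([0x3f80], dtype=np.uint16)))  # [1.]
```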
chenyu
e1c5aa9cce estimated resnet training time for BENCHMARK (#3769) 2024-03-15 22:36:58 -04:00
chenyu
4bd5535d72 update mlperf resnet default hparams (#3758)
we might be able to use a higher lr given the smaller BS, but this is good.

Trained to 75.9%
https://wandb.ai/chenyuxyz/tinygrad-examples_mlperf/runs/xi2f48se/overview
2024-03-15 12:09:26 -04:00
George Hotz
641f347232 simple LoadOps.ASSIGN (#3745)
* simple LoadOps.ASSIGN

* skip that test

* don't assign in onnx ops gemm

* track cache usage

* recreate the lazybuffer to avoid the cache

* fix contigs

* skip that test

* lol

* better letters
2024-03-14 20:44:34 -07:00
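LoadOps.ASSIGN writes a computed value back into an existing buffer instead of allocating a new one, which is what lets optimizer steps update weights in place. Usage might look like this (a sketch against the current Tensor.assign API):

```
from tinygrad import Tensor

w = Tensor.ones(4).contiguous().realize()
w.assign(w * 2).realize()   # reuses w's buffer rather than allocating a fresh output
print(w.numpy())            # [2. 2. 2. 2.]
```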
chenyu
557c7a5c54 fix yolov8.py (#3742)
replaced an `assign` with `replace`, and added '.png' to the output if the input URL does not contain an extension
2024-03-14 17:33:45 -04:00
George Hotz
3527c5a9d2 add Tensor.replace (#3738)
* add Tensor.replace

* fix dtypes in that test

* should be replace

* and mixtral
2024-03-14 13:34:14 -07:00
David Hou
199f7c4342 MLPerf Resnet (cleaned up) (#3573)
* this is a lot of stuff

TEST_TRAIN env for less data

don't diskcache get_train_files

debug message

no lr_scaler for fp32

comment, typo

type stuff

don't destructure proc

make batchnorm parameters float

make batchnorm parameters float

resnet18, checkpointing

hack up checkpointing to keep the names in there

oops

wandb_resume

lower lr

eval/ckpt use e+1

lars

report top_1_acc

some wandb stuff

split fw and bw steps to save memory

oops

save model when reach target

formatting

make sgd hparams consistent

just always write the cats tag...

pass X and Y into backward_step to trigger input replace

shuffle eval set to fix batchnorm eval

dataset is sorted by class, so the means and variances are all wrong

small cleanup

hack restore only one copy of each tensor

do bufs from lin after cache check (lru should handle it fine)

record epoch in wandb

more digits for topk in eval

more env vars

small cleanup

cleanup hack tricks

cleanup hack tricks

don't save ckpt for testeval

cleanup

diskcache train file glob

clean up a little

device_str

SCE into tensor

small

small

log_softmax out of resnet.py

oops

hack :(

comments

HeNormal, track gradient norm

oops

log SYNCBN to wandb

real truncnorm

fewer samples for truncated normal

custom init for Linear

log layer stats

small

Revert "small"

This reverts commit 988f4c1cf3.

Revert "log layer stats"

This reverts commit 9d98224585.

rename BNSYNC to SYNCBN to be consistent with cifar

optional TRACK_NORMS

fix label smoothing :/

lars skip list

only weight decay if not in skip list

comment

default 0 TRACK_NORMS

don't allocate beam scratch buffers if in cache

clean up data pipeline, unsplit train/test, put back a hack

remove print

run test_indexing on remu (#3404)

* emulated ops_hip infra

* add int4

* include test_indexing in remu

* Revert "Merge branch 'remu-dev-mac'"

This reverts commit 6870457e57, reversing
changes made to 3c4c8c9e16.

fix bad seeding

UnsyncBatchNorm2d but with synced trainable weights

label downsample batchnorm in Bottleneck

:/

:/

i mean... it runs... it hits the acc... it's fast...

new unsyncbatchnorm for resnet

small fix

don't do assign buffer reuse for axis change

* remove changes

* remove changes

* move LARS out of tinygrad/

* rand_truncn rename

* whitespace

* stray whitespace

* no more gnorms

* delete some dataloading stuff

* remove comment

* clean up train script

* small comments

* move checkpointing stuff to mlperf helpers

* if WANDB

* small comments

* remove whitespace change

* new unsynced bn

* clean up prints / loop vars

* whitespace

* undo nn changes

* clean up loops

* rearrange getenvs

* cpu_count()

* PolynomialLR whitespace

* move he_normal out

* cap warmup in polylr

* rearrange wandb log

* realize both x and y in data_get

* use double quotes

* combine prints in ckpts resume

* take UBN from cifar

* running_var

* whitespace

* whitespace

* typo

* if instead of ternary for resnet downsample

* clean up dataloader cleanup a little?

* separate rng for shuffle

* clean up imports in model_train

* clean up imports

* don't realize copyin in data_get

* remove TESTEVAL (train dataloader didn't get freed every loop)

* adjust wandb_config entries a little

* clean up wandb config dict

* reduce lines

* whitespace

* shorter lines

* put shm unlink back, but it doesn't seem to do anything

* don't pass seed per task

* monkeypatch batchnorm

* the reseed was wrong

* add epoch number to desc

* don't unsyncedbatchnorm if syncbn=1

* put back downsample name

* eval every epoch

* Revert "the reseed was wrong"

This reverts commit 3440a07dff3f40e8a8d156ca3f1938558a59249f.

* cast lr in onecycle

* support fp16

* cut off kernel if expand after reduce

* test polynomial lr

* move polynomiallr to examples/mlperf

* working PolynomialDecayWithWarmup + tests.......

add lars_util.py, oops

* keep lars_util.py as intact as possible, simplify our interface

* no more half

* polylr and lars were merged

* undo search change

* override Linear init

* remove half stuff from model_train

* update scheduler init with new args

* don't divide by input mean

* mistake in resnet.py

* restore whitespace in resnet.py

* add test_data_parallel_resnet_train_step

* move initializers out of resnet.py

* unused imports

* log_softmax to model output in test to fix precision flakiness

* log_softmax to model output in test to fix precision flakiness

* oops, don't realize here

* is None

* realize initializations in order for determinism

* BENCHMARK flag for number of steps

* add resnet to benchmark.yml

* return instead of break

* missing return

* cpu_count, rearrange benchmark.yml

* unused variable

* disable tqdm if BENCHMARK

* getenv WARMUP_EPOCHS

* unlink disktensor shm file if exists

* terminate instead of join

* properly shut down queues

* use hip in benchmark for now

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-14 00:53:41 -04:00
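The PolynomialDecayWithWarmup schedule this PR lands warms up linearly, then decays polynomially to an end lr. A sketch (power=2 per the mlperf resnet reference; the other constants are assumptions):

```
def poly_lr(step, base_lr, warmup_steps, total_steps, end_lr=1e-4, power=2.0):
  if step < warmup_steps:                                     # linear warmup, capped at base_lr
    return base_lr * (step + 1) / warmup_steps
  frac = (total_steps - step) / (total_steps - warmup_steps)  # 1 -> 0 over the decay phase
  return (base_lr - end_lr) * frac ** power + end_lr
```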
chenyu
3d9b882d37 hotfix unlink /dev/shm/resnet_X if it already exists (#3726) 2024-03-13 18:53:03 -04:00
chenyu
ad1d873f8d fix llama shard convo mode (#3716) 2024-03-13 12:07:02 -04:00
qazal
337cd53444 multioutput ScheduleItem (#3699)
* refactor realize.py

* update docs

* update test_sched

* update runners and devices

* update openpilot and unit tests

* cleanup runner lowering

* update more tests
2024-03-13 08:59:38 -07:00
David Hou
2befdf86d9 dataloader worker/shm cleanup (#3710) 2024-03-12 21:44:24 -04:00
chenyu
b13457e4a7 explicit dtypes in hlb_cifar (#3707)
prepared the bfloat16 change. added float() and cast(default_float) in whitening, and explicitly set dtype in various places that convert between numpy and Tensor
2024-03-12 18:20:23 -04:00
qazal
aec4c4f01b linearizer ast as a tuple of lazyops (#3689)
* multi store op linearizer

* currently we do only one output per kernel

* named opts
2024-03-11 15:39:04 -07:00
rnxyfvls
490c5a3ec3 examples/stable_diffusion: support model checkpoints without alphas_cumprod key (#3681)
* examples/stable_diffusion: support model checkpoints without alphas_cumprod key

(which is most models on civitai)

* fix indent

---------

Co-authored-by: a <a@a.aa>
2024-03-11 16:05:52 -04:00
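When a checkpoint omits alphas_cumprod, it can be recomputed from the beta schedule; for SD v1's scaled-linear schedule that is (constants from the standard v1 config):

```
import numpy as np

betas = np.linspace(0.00085 ** 0.5, 0.0120 ** 0.5, 1000) ** 2  # scaled-linear beta schedule
alphas_cumprod = np.cumprod(1.0 - betas)
```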
chenyu
d69170e27e add llama 2 70B in ci and verify output (#3682)
* add llama 2 70B in ci and verify output

* ln -s llama2 dir
2024-03-11 12:48:22 -04:00