1179 Commits

Author SHA1 Message Date
chenyu
a1940ced77 remove the assign hack in whisper (#4240)
no longer needed, the commented test case was removed too
2024-04-20 23:56:44 -04:00
chenyu
3f126c7664 fix examples vits / conversation.py (#4239)
it was passing a const numpy array into Tensor.arange
2024-04-20 23:29:12 -04:00
George Hotz
cd88afc98b datasets isn't a feature + filter docstrings (#4228)
* datasets isn't a feature

* filter docstrings in sz
2024-04-19 16:16:10 +04:00
George Hotz
d99b512084 llm.c timing (#4219)
* add timing info

* fix malloc

* 8s with beam
2024-04-19 12:43:21 +04:00
George Hotz
39b60a25f0 more llm c work (#4207)
* more llm c work

* print nicely

* fake load pretrained

* select warmups

* output c code
2024-04-18 22:20:44 +04:00
chenyu
f7416916df update resnet hparams based on BS=1632 RCP (#4210)
https://github.com/mlcommons/logging/blob/master/mlperf_logging/rcp_checker/training_4.0.0/rcps_resnet.json
2024-04-18 12:01:46 -04:00
George Hotz
fa57c3e7ce continue llm.c (#4190)
* continue llm.c

* export more

* progress on llm.c

* simpler optim, names work
2024-04-18 10:57:54 +04:00
Francis Lata
3644077a42 [MLPerf][UNet3D] Add DICE loss + metrics (#4204)
* add DICE loss and metrics

* update dice to include reference implementation's link

* remove unused imports

* remove unnecessary test file and update pred + label for metrics and losses test

* add tests to CI + add exclusion of mlperf_unet3d

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-17 20:09:33 -04:00
chenyu
cd801a15f3 scipy.signal.gaussian -> scipy.signal.windows.gaussian (#4205)
fixed unet3d model_eval, will add to CI after merging new dice loss
2024-04-17 19:15:37 -04:00
David Hou
1dbf3b2b19 Benchmarks for individual resnet layers (#4182)
* resnet individual layer benchmarks!

* small

* 1 and 2

* mem_used

* no ci

* better conv print

* defaults

* prints

* adjust

* adjust

* adjust

* benchmark only one layer example

* tensor.training, zero_grad, sum instead of mean, last mem, last kernel count

* default jitcnt=1

* scale flops/kernels with jitcnt

* add note about jitcnt memory

* touchup
2024-04-16 13:53:18 -04:00
George Hotz
55ae73e951 Replicate llm.c in tinygrad (#4179)
* write llm.c and add a few new methods to tensor

* training works

* add jit

* tests for new functions

* test tolist

* simple fix for onnx test failures (#4186)

* write llm.c and add a few new methods to tensor

* training works

* add jit

* tests for new functions

* bump line count to 7500

* simplest fix

* safenumpy tolist for now

---------

Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>

---------

Co-authored-by: geohotstan <135171913+geohotstan@users.noreply.github.com>
2024-04-16 15:40:48 +04:00
chenyu
aa093efa43 fix handcode_resnet50_opt flops count (#4184)
2024-04-15 22:13:45 -04:00
chenyu
d5b67c1ca3 log resnet TRAIN_BEAM / EVAL_BEAM (#4181)
also run eval in benchmark mode if either one is positive
2024-04-15 19:29:08 -04:00
chenyu
6a2168e698 TRAIN_BEAM and EVAL_BEAM for resnet (#4177)
working on measuring compile time
2024-04-15 14:57:21 -04:00
David Hou
593c90d7d6 Resnet fp16 training with fp32 master weight copy (#4144)
* add casts to layers

* FLOAT flag

* detach

* no_grad for eval

* whitespace

* explicit fp32 initialization

* oops

* whitespace

* put back config['DEFAULT_FLOAT']

* bad

* live dangerously (don't hide bugs)

* don't bundle changes

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-14 11:25:08 -04:00
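As a rough illustration of why an fp32 master copy matters when training in fp16, here is a standalone sketch. It is not the code from #4144: the helpers `to_fp16` and `to_fp32` and the step size are made up for the example. It applies 100 small updates to a weight near 1.0. Updating the fp16 value directly loses every step to rounding, while accumulating in fp32 and casting down afterwards keeps them.

```python
import struct

def to_fp16(x: float) -> float:
    # Round a Python float to the nearest IEEE half-precision value.
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x: float) -> float:
    # Round a Python float to the nearest IEEE single-precision value.
    return struct.unpack("f", struct.pack("f", x))[0]

# Hypothetical per-step update, smaller than half the fp16 spacing near 1.0 (2**-10).
STEP = 1e-4

w_fp16_only = to_fp16(1.0)  # weight stored and updated in fp16 only
master = to_fp32(1.0)       # fp32 master copy of the same weight

for _ in range(100):
    w_fp16_only = to_fp16(w_fp16_only + STEP)  # each update rounds back to 1.0
    master = to_fp32(master + STEP)            # updates accumulate in fp32

# The fp16 weight used by the forward pass is re-derived from the master copy.
w_from_master = to_fp16(master)

print(w_fp16_only, master, w_from_master)
```

The fp16-only weight never moves, while the weight derived from the fp32 master reflects the accumulated updates.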
chenyu
e20d6f9221 correct resnet estimate time (#4169)
7.99 hours was rendered as 7h0m.
2024-04-14 02:21:46 -04:00
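A minimal sketch of formatting a fractional hour estimate so the minutes are not dropped. The function name `format_hours` is hypothetical and this is not necessarily how #4169 fixed it. The idea is to convert to whole minutes first, then split into hours and minutes.

```python
def format_hours(hours: float) -> str:
    # Convert to whole minutes first so the fractional part of the hour is kept.
    total_minutes = round(hours * 60)
    h, m = divmod(total_minutes, 60)
    return f"{h}h{m}m"

print(format_hours(7.99))
```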
George Hotz
ebc94c9d6c rewrite the jit in the context of new schedule (#4162)
* rewrite the jit in the context of new schedule

* mypy better

* fix placeholder

* tests

* all functionality should work

* fix tests

* no CacheCollector
2024-04-12 21:54:36 -07:00
George Hotz
216eb235e5 hotfix: cast mnist to float
2024-04-09 19:30:03 -07:00
George Hotz
fea774f669 spend 5 lines to bring mnist into the repo (#4122)
2024-04-09 19:24:57 -07:00
chenyu
92c0675ccf setitem initial support (#4093)
* wip setitem

it's an eager assign to output shapetracker view

* cleanups and tests

* more cleanups
2024-04-07 20:35:22 -04:00
George Hotz
97c402d69e use imagenet spawn (#4096)
2024-04-06 08:34:10 -07:00
George Hotz
fffd9b05f5 mock mnist data for imagenet trainer (#4095)
* mock mnist data for imagenet

* move print and test

* needed to reshape
2024-04-06 08:08:40 -07:00
George Hotz
93824e59eb support MOCKDATA=1 for resnet (#4090)
* mockdata for resnet

* fix eval, revert hsa
2024-04-05 17:19:18 -07:00
George Hotz
bec2aaf404 add beautiful_mnist_multigpu example
2024-04-02 00:54:04 +00:00
chenyu
aa76d566c2 cleanup mamba (#4004)
make it read nicer, clean up some movement methods, and simplify some math.
the 790m, 1.4b, and 2.8b models do not really run.
sampling is not implemented.
jit is incorrect.
some dead code / wrong code paths and stuff copied from torch remain.
2024-03-30 02:50:13 -04:00
chenyu
c71627fee6 move GlobalCounter to helpers (#4002)
break circular import between ops and buffer
2024-03-30 00:30:30 -04:00
chenyu
ecf38f498e beam search resnet eval too in BENCHMARK (#4000)
2024-03-29 21:07:23 -04:00
reddyn12
9b5e15db6e Mamba Implementation (#3456)
* first commit

* state back to orig

* mamba comparisons

* rm file

* rename file

* use Tensor.einsum and make default model 370M

* Cleaned code and made a comparison test

* Simplify pull request. Only has 1 mamba implementation now.

* Update prompt

* rm whitespaces

* last space

* remove Einops dependency

* rm unused code

* add tests

* rm print statement

* rm imports

* skip CLANG

* Update skipIf description

* skip model test in CI and add CLANG fix

* rm Device import

* don't be stupid

* Fix conv assign

When the prompt is too short, the conv_state assign logic breaks. This can be fixed by padding the tokenized array to a minimum length of 4. I padded with the empty-string token, but I'm not sure whether the proper practice is to use the PAD token instead.

* fix p1

* temp

* fix jit import

---------

Co-authored-by: schlimeszn <schlimeszn@gmail.com>
Co-authored-by: reddyn <nikidsniper@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-28 17:49:12 -07:00
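The short-prompt workaround described in the "Fix conv assign" note above can be sketched as follows. This is a hedged illustration, not the code from #3456: the function name `pad_to_min_length`, the choice of left-padding, and the `pad_id` value are all assumptions, since the commit message states only the minimum length of 4.

```python
from typing import List

def pad_to_min_length(tokens: List[int], pad_id: int, min_len: int = 4) -> List[int]:
    # Left-pad (an assumption; the source does not say which side) so the
    # sequence has at least `min_len` tokens. Longer inputs are returned unchanged.
    missing = max(0, min_len - len(tokens))
    return [pad_id] * missing + list(tokens)

# `pad_id=0` is a placeholder; a real tokenizer would supply the actual id.
print(pad_to_min_length([17, 42], pad_id=0))
```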
chenyu
b47f6cebb2 LinearizerOptions -> CompilerOptions (#3978)
2024-03-28 17:50:23 -04:00
David Hou
4b95350c41 fp16 resnet (without expand backwards sum in float, doesn't work) (#3816)
* fp16 resnet

* cast running mean and var back to default float

* extra cast

* check symbolic no overflow

* add linearizer failure

* loss scaler after grad contig

* oops

* i think this works

* don't loss scale fp32

* remove overflow test case

* remove symbolic bounds check

* loss scaler should be float

* temporarily disable padto cuz bug

shruggie

* make running stats in batchnorm float32?

* calculate lars stuff in fp32?

* oops

* remove most changes

* move loss scaler out of optimizer

* no more FP16 var

* oops

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-28 01:25:37 -04:00
Francis Lam
16a1d43f6f llama: prevent device initialization outside of __main__ (#3966)
* llama: prevent device initialization outside of __main__

causes HSA resources leakages in child compile processes

* llama: fix loading with multiple devices
2024-03-27 19:19:38 -04:00
George Hotz
68ca4d4276 split to schedule.py (#3949)
* split to schedule.py

* split
2024-03-26 21:02:46 -07:00
George Hotz
150ea2eb76 create engine folder and move code (#3948)
* retry

* older tf

* that
2024-03-26 20:38:03 -07:00
Arseny Kapoulkine
cb6e7b57a6 examples: Fix parameter bandwidth accounting for quantized LLama (#3930)
Instead of assuming every parameter is 2 bytes, just add up tensor sizes
in bytes
2024-03-25 18:41:05 -04:00
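To illustrate the accounting change described in #3930, the sketch below compares the old "2 bytes per parameter" estimate with summing actual sizes in bytes. It is a standalone example, not the code from the commit: `ParamInfo` and both function names are hypothetical stand-ins for real tensors, and the shapes are invented.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ParamInfo:
    # Hypothetical stand-in for a tensor: element count and bytes per element.
    numel: int
    itemsize: int

def naive_param_bytes(params: List[ParamInfo]) -> int:
    # Old assumption: every parameter is stored in 2 bytes (fp16).
    return 2 * sum(p.numel for p in params)

def actual_param_bytes(params: List[ParamInfo]) -> int:
    # New accounting: add up each tensor's real size in bytes.
    return sum(p.numel * p.itemsize for p in params)

# Invented example: an int8-quantized weight matrix plus an fp16 scale vector.
params = [ParamInfo(numel=4096 * 4096, itemsize=1), ParamInfo(numel=4096, itemsize=2)]

print(naive_param_bytes(params), actual_param_bytes(params))
```

With quantized weights the naive estimate roughly doubles the bytes read per token, which inflates any bandwidth figure derived from it.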
chenyu
d651835ef5 verify beautiful_mnist.py eval acc and put into benchmark ci (#3926)
* verify beautiful_mnist and put in ci

* 97.5 for eval verification
2024-03-25 16:47:49 -04:00
chenyu
83f39a8ceb env var to change default float (#3902)
* env var to change default float to fp16 or bf16

looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights.

working on default bf16 too.
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
    __bf16 cast0 = (nv_bfloat16)(val0);
```

remove that in cifar

* DEFAULT_FLOAT

* default of default

* unit test

* don't check default

* tests work on linux
2024-03-24 20:33:57 -04:00
wozeparrot
9a9cac58f9 add lars to nn (#3750)
* feat: add lars

* feat: don't remove this comment

* clean: smaller diff

* clean: shorter line

* feat: remove mlperf lars, switch resnet

* fix: fully remove mlperf lars

* clean: comment

* feat: contiguous

* feat: no weight decay on skip params

* feat: optimizergroup

* feat: classic momentum

* fix: pylint

* clean: move comment

* fix: correct algo

* feat: lrschedulergroup

* feat: skip list tests

* feat: :| forgot that params are a thing

* feat: remove skip_list params from main params

* feat: set moment

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-24 11:43:12 -04:00
chenyu
e22d78b3d2 training cifar with BF16 on CUDA (#3905)
* training cifar with BF16 on CUDA

memory usage is between float and half due to numpy calls on dataset preprocessing, which converts into float.

* simpler bf16 functions

* bf16 cifar works for HSA too just very slow

* simpler bf16 functions, we love cuda
2024-03-24 01:37:47 -04:00
chenyu
24d004a89b hotfix check ckpts before writing achieved model (#3901)
this killed tinybox green run
2024-03-23 17:16:38 -04:00
chenyu
f7f67e0cc5 simple fix llama shard with quantize (#3882)
copy scale to all devices for now. naive sharding does not work because scale needs expand to really save memory.

70B does not work due to HSA_STATUS_ERROR_OUT_OF_RESOURCES.

`python3 examples/llama.py --gen 2 --size 13B --shard 6 --prompt "Hello." --count 10 --temperature 0 --timing --quantize`

13B on 6 gpus uses 47 GB v.s. 34 GB quantized
2024-03-22 18:15:37 -04:00
Francis Lam
a26090d404 search: change to use "spawn" and limit the number of tasks per child (#3862)
also clean up some examples to use __main__ and not initialize
resources outside of main
2024-03-21 21:23:36 -07:00
Anurag Lamsal
4e0819e40b fixing the benchmark not printing in handcode resnet50 opt example (#3850)
2024-03-21 00:55:31 -04:00
chenyu
9d1d08fbb0 show llama bandwidth with timing (#3844)
2024-03-20 17:19:15 -04:00
chenyu
dccefab23f remove mixtral weight to clang first (#3792)
seems fine without it now
2024-03-17 23:33:17 -04:00
chenyu
5ac1fa933f apply the same fix_bf16 in llama and coder (#3789)
* apply the same fix_bf16 in llama and coder

did not realize the same logic was in llama too.
really fix #2775

* flag for native SUPPORT_BF16 cast
2024-03-17 21:25:24 -04:00
chenyu
639bd5dbfc move bf16 cast hack to Tensor.llvm_bf16_cast (#3788)
2024-03-17 18:51:22 -04:00
chenyu
9255332d9e use llvm as bridge to fix_bf16 loading (#3774)
This is how bf16 load is tested in test_bf16_disk_write_read now and it should fix #2775.
I tested that it fixed loading coder using PYTHON backend.

Will separate this special bf16 load v.s. regular bf16 support
2024-03-16 15:22:19 -04:00
chenyu
e1c5aa9cce estimated resnet training time for BENCHMARK (#3769)
2024-03-15 22:36:58 -04:00
chenyu
4bd5535d72 update mlperf resnet default hparams (#3758)
we might be able to have higher lr given smaller BS, but this is good.

Trained to 75.9%
https://wandb.ai/chenyuxyz/tinygrad-examples_mlperf/runs/xi2f48se/overview
2024-03-15 12:09:26 -04:00
George Hotz
641f347232 simple LoadOps.ASSIGN (#3745)
* simple LoadOps.ASSIGN

* skip that test

* don't assign in onnx ops gemm

* track cache usage

* recreate the lazybuffer to avoid the cache

* fix contigs

* skip that test

* lol

* better letters
2024-03-14 20:44:34 -07:00