Commit Graph

104 Commits

Author SHA1 Message Date
chenyu
e701106a64 remove FUSE_ARANGE (#12511)
it was the default already
2025-10-08 04:54:07 -04:00
George Hotz
7129419500 fix cifar training in RANGEIFY (#12355)
* fix cifar training in RANGEIFY

* even more wino fuse

* bugfix

* test to show issue
2025-09-30 15:59:19 +08:00
Sieds Lykles
5b73076e48 assert benchmark times (#12042)
* assert jitted times in openpilot

* better error

* better error

* add ASSERT_MIN_STEP_TIME to more models

* t is step_times

* update benchmark times

* update times
2025-09-09 23:40:02 +02:00
chenyu
e39b25cd36 upcast float exp to at least float32 (#11758)
* upcast float exp to at least float32

* unlucky seed
2025-08-22 20:16:34 -04:00
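A minimal sketch of the pattern named in the commit above, assuming a hypothetical helper `exp_f32`; the exponential of half-precision inputs is evaluated in float32 and cast back afterwards:

```
from tinygrad import Tensor, dtypes

def exp_f32(x: Tensor) -> Tensor:
  # hypothetical illustration: evaluate exp in at least float32, then return to the input dtype
  compute_dtype = dtypes.float32 if x.dtype in (dtypes.float16, dtypes.bfloat16) else x.dtype
  return x.cast(compute_dtype).exp().cast(x.dtype)

x = Tensor([1.0, 2.0, 3.0], dtype=dtypes.float16)
print(exp_f32(x).numpy())  # the intermediate exp runs at float32 precision
```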
Nino Risteski
bc15e98f5c clean up unused imports in examples and update CI linting (#11024)
* clean up unused imports in examples

* enable unused import checking in examples

* lint

* ignore F541 and F841 - focus on unused imports only

* clean up

* restore tinygrad.frontend.torch for TINY_BACKEND

* tiny change
2025-06-30 08:21:27 -07:00
Alexey Zaytsev
230ad3a460 [bounty] Don't use numpy inside hlb_cifar10 training loop (#10777)
* Don't use numpy inside hlb_cifar10 training loop

* Lint it

* jit it

* Drop the last half-batch

* Use gather for random_crop and reuse perms

* Wrap train_cifar in FUSE_ARANGE context

* No need to pass FUSE_ARANGE=1 to hlb_cifar10.py

* Add cutmix to jittable augmentations

* Remove .contiguous() from fetch_batches

* Fix indexing boundary

---------

Co-authored-by: Irwin1138 <irwin1139@gmail.com>
2025-06-23 17:24:56 -07:00
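A hedged sketch of the context-manager pattern the bullets above describe (FUSE_ARANGE was later made the default and removed, per the top commit); `train_cifar` here stands in for the example's training entry point:

```
from tinygrad.helpers import Context

def train_cifar():
  ...  # placeholder for the hlb_cifar10 training loop

# assumption: FUSE_ARANGE was still a ContextVar at this point in the history
with Context(FUSE_ARANGE=1):
  train_cifar()
```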
wozeparrot
0d86f8d375 fix failed threefry (#10646) 2025-06-05 17:17:42 -07:00
George Hotz
b3b43a82c4 remove Tensor.no_grad, it's meaningless now [pr] (#10556) 2025-05-28 22:20:02 -07:00
wozeparrot
12a1ccc680 clean: double import (#10345) 2025-05-15 20:15:09 -07:00
wozeparrot
1ed04f993b move benchmark stat tracking to influxdb (#10185) 2025-05-15 16:14:56 -07:00
George Hotz
e82ba1454b MultiLazyBuffer is UOp [pr] (#8662)
* MultiLazyBuffer is UOp [pr]

* this is new mlb

* this is the idea

* progress

* multitensor works

* more movement ops

* this

* MultiLazyBuffer is UOp

* cleanups

* multi axis

* fix more tests

* work

* not that

* add multi grad and move shard to ops

* mops not views

* no double contig

* sweet, all mt tests passing

* port old logic

* remove lbs

* fix realized

* whitespace

* assign tweak

* test_assign_kv_cache_multi passes

* fix is_realized

* fix JIT for multi

* just a few more lines i'll pay them back soon i swear please bro just a few more

* no split reduceop for multi
2025-01-24 13:28:55 +09:00
geohotstan
f8056a74d6 combine pad2d with pad (#7677)
* I have pad2d, I have pad, uuh~, pad2dpad~

* fix some small things

* strategically placed cast hack

* fix more

* fix more more

* tests

* periods
2024-11-14 17:56:02 +08:00
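A hedged sketch of the merged padding API, assuming it behaves like current tinygrad: `Tensor.pad` accepts both per-dimension (before, after) pairs and the flat torch-style spec that pad2d used to take:

```
from tinygrad import Tensor

t = Tensor.ones(1, 1, 3, 3)
a = t.pad(((0, 0), (0, 0), (1, 1), (1, 1)))  # per-dimension (before, after) pairs
b = t.pad((1, 1, 1, 1))                      # flat torch-style spec, last dimensions first
assert a.shape == b.shape == (1, 1, 5, 5)
```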
George Hotz
d9c62a33c3 add cifar to datasets.py (#6210) 2024-08-20 11:42:49 -07:00
George Hotz
8390feb7b9 optim.OptimizerGroup in hlb_cifar (#5401) 2024-07-11 20:14:36 -07:00
George Hotz
5ba611787d move image into tensor.py. delete features (#4603)
* move image into tensor.py

* change setup.py

* openpilot tests need pythonpath now
2024-05-15 10:50:25 -07:00
David Hou
c0a048c044 batchnorm d(var)/d(mean) = 0 (#4430)
* d(var)/d(mean) = 0

* drop the number in test_schedule!
2024-05-05 00:25:45 -04:00
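The identity in the title follows from the batch statistics alone; a short sketch of the argument in the usual batchnorm notation:

```
\mu = \frac{1}{N}\sum_i x_i, \qquad
\sigma^2 = \frac{1}{N}\sum_i (x_i - \mu)^2
\quad\Longrightarrow\quad
\frac{\partial \sigma^2}{\partial \mu}
  = -\frac{2}{N}\sum_i (x_i - \mu)
  = -2\Big(\frac{1}{N}\sum_i x_i - \mu\Big) = 0
```

Since the term vanishes, the backward pass can treat the batch mean as a constant when differentiating the variance and still produce the same gradients.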
David Hou
593c90d7d6 Resnet fp16 training with fp32 master weight copy (#4144)
* add casts to layers

* FLOAT flag

* detach

* no_grad for eval

* whitespace

* explicit fp32 initialization

* oops

* whitespace

* put back config['DEFAULT_FLOAT']

* bad

* live dangerously (don't hide bugs)

* don't bundle changes

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-14 11:25:08 -04:00
chenyu
c71627fee6 move GlobalCounter to helpers (#4002)
break circular import between ops and buffer
2024-03-30 00:30:30 -04:00
David Hou
4b95350c41 fp16 resnet (without expand backwards sum in float, doesn't work) (#3816)
* fp16 resnet

* cast running mean and var back to default float

* extra cast

* check symbolic no overflow

* add linearizer failure

* loss scaler after grad contig

* oops

* i think this works

* don't loss scale fp32

* remove overflow test case

* remove symbolic bounds check

* loss scaler should be float

* temporarily disable padto cuz bug

shruggie

* make running stats in batchnorm float32?

* calculate lars stuff in fp32?

* oops

* remove most changes

* move loss scaler out of optimizer

* no more FP16 var

* oops

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-28 01:25:37 -04:00
chenyu
83f39a8ceb env var to change default float (#3902)
* env var to change default float to fp16 or bf16

looking for standard names for these. we already have FLOAT16, which does something to IMAGE, and HALF, which converts the weights.

working on default bf16 too.
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
    __bf16 cast0 = (nv_bfloat16)(val0);
```

remove that in cifar

* DEFAULT_FLOAT

* default of default

* unit test

* don't check default

* tests work on linux
2024-03-24 20:33:57 -04:00
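A hedged usage sketch of the env var described above, assuming the accepted values match the dtype names (e.g. HALF, BFLOAT16):

```
# DEFAULT_FLOAT=HALF python3 examples/hlb_cifar10.py
from tinygrad import Tensor, dtypes

print(dtypes.default_float)      # reflects DEFAULT_FLOAT; float32 when unset
print(Tensor([1.0, 2.0]).dtype)  # new float tensors pick up the default
```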
chenyu
e22d78b3d2 training cifar with BF16 on CUDA (#3905)
* training cifar with BF16 on CUDA

memory usage is between float and half due to numpy calls in dataset preprocessing, which convert to float.

* simpler bf16 functions

* bf16 cifar works for HSA too just very slow

* simpler bf16 functions, we love cuda
2024-03-24 01:37:47 -04:00
Francis Lam
a26090d404 search: change to use "spawn" and limit the number of tasks per child (#3862)
also clean up some examples to use __main__ and not initialize
resources outside of main
2024-03-21 21:23:36 -07:00
chenyu
b13457e4a7 explicit dtypes in hlb_cifar (#3707)
prepared the bfloat16 change. added float() and cast(default_float) in whitening, and explicitly set dtype in the various places that convert between numpy and Tensor
2024-03-12 18:20:23 -04:00
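A minimal sketch of the pattern described above, with a hypothetical float64 array standing in for the whitening eigenvectors; the dtype is made explicit when crossing the numpy/Tensor boundary:

```
import numpy as np
from tinygrad import Tensor, dtypes

eigvecs = np.linalg.eigh(np.eye(3))[1]           # hypothetical preprocessing output, float64
W = Tensor(eigvecs, dtype=dtypes.default_float)  # explicit dtype instead of inheriting float64
print(W.dtype)
```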
David Hou
d16aa89561 don't allow MLB assigns with different axes (#3557)
* allow LB <- MLB assign, but don't reuse buffer

* update test

* update test

* assign assert axes are the same

* update tests to manually shard running stats

* unused import
2024-03-01 07:59:06 -05:00
David Hou
e5385eecfc UnsyncedBatchNorm with synced trainable weights for hlb cifar (#3472)
* UnsyncedBatchNorm with synced trainable weights for hlb cifar

* multitensor reshape tests

* test mlb assign change axis

* E501

* argfix axis

* don't import batchnorm from hlb_cifar in test_multitensor

* pass num_devices to UnsyncedBatchNorm in test, allow UnsyncedBatchNorm to be used with LB

* add backprop test for UnsyncedBatchNorm

* break out MLB assign and reshape changes

* manually shard running mean and running var

* don't shard unless syncbn=0

* replace nn.BatchNorm2d with UnsyncedBatchNorm

* don't increment num_batches_tracked if not tracking running stats

* update tests

* oops

* Revert "oops"

This reverts commit 5e8a67a535.

* Revert "update tests"

This reverts commit 7ebf65d89a.

* Revert "don't increment num_batches_tracked if not tracking running stats"

This reverts commit 78de0ea9ee.

* Revert "replace nn.BatchNorm2d with UnsyncedBatchNorm"

This reverts commit d03da53da7.

* don't increment num_batched_tracked if not tracking running stats

* oops

* test_batchnorm_axis

* compare against torch

* types

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-02-29 22:52:07 -05:00
chenyu
d8ad9e5660 verify eval acc for hlb_cifar training (#3344)
set to 93% to reduce flakiness for now
2024-02-07 19:19:59 -05:00
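A hedged sketch of what the verification could look like at the end of the training script; `eval_acc` and the env var name `TARGET_EVAL_ACC_PCT` are assumptions here, not confirmed names:

```
from tinygrad.helpers import getenv

eval_acc = 93.5  # hypothetical final evaluation accuracy in percent
target = getenv("TARGET_EVAL_ACC_PCT", 93.0)  # assumed flag name
assert eval_acc >= target, f"eval accuracy {eval_acc:.2f}% is below the {target}% target"
```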
chenyu
18e854cdbf shrink MLB on sharded axis (#3255)
* shrink MLB on sharded axis

use a onehot structure to store the real partition. the goal is an unsynced batchnorm2d that can run on multiple GPUs for training.

draft version in https://github.com/chenyuxyz/tinygrad/pull/109

* SYNCBN flag

* test unclean shrinks

* UnsyncedBatchNorm reuses BatchNorm

* more robust pad arg check

* better types

* more tests!

* 6 gpus in benchmark

* disable slow GPUS=6 benchmark
2024-01-31 21:48:25 -05:00
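A conceptual sketch of the "onehot structure" mentioned above, not the MultiLazyBuffer internals: each device keeps a same-shaped buffer plus a one-hot mask marking its real slice along the sharded axis, so the full tensor is the masked sum across devices:

```
import numpy as np

full = np.arange(8, dtype=np.float32)
# per-device buffers; only the masked region is meaningful on each device
shards = [full.copy(), full.copy()]
masks = [np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=np.float32),
         np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=np.float32)]
restored = sum(s * m for s, m in zip(shards, masks))
assert np.allclose(restored, full)
```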
chenyu
b0a755288f cifar EVAL_BS set default value to BS (#3274)
less compile time for eval due to the cache. 500 was also a slow, uneven number for 6 GPUs. eval time 5.9s -> 3.4s
2024-01-29 17:37:12 -05:00
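A minimal sketch of the default described above, assuming the usual getenv-based flags in the example script:

```
from tinygrad.helpers import getenv

BS = getenv("BS", 512)
EVAL_BS = getenv("EVAL_BS", BS)  # default to the training batch size so eval reuses cached kernels
```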
chenyu
9e5409be6c cifar move GlobalCounters.reset() before shard (#3217)
* cifar move GlobalCounters.reset() before shard

also shard mini batch inplace

* don't eval with DISABLE_BACKWARD
2024-01-23 16:07:43 -05:00
chenyu
3c179cc27c cifar only shuffle data at epoch start (#3216)
save 1ms CPU time per batch. also only shuffle training set
2024-01-23 14:41:22 -05:00
chenyu
8465938d29 minor hlb_cifar cleanups (#3208)
mostly cosmetic. LATEBEAM=4 on a single 7900 XTX: 59.2 seconds
2024-01-22 12:38:39 -05:00
chenyu
827b7a3c64 cleanup pad_reflect and make_square_mask in hlb_cifar (#3206)
removed some complicated-looking code. no wall time difference
2024-01-22 11:30:46 -05:00
chenyu
99884f4c98 cifar flags for RANDOM_CROP, RANDOM_FLIP, and CUTMIX (#3204)
experimenting with different setups; would also like to jit the data augmentation next
2024-01-22 01:12:51 -05:00
chenyu
836883fedc comment out cutmix in hlb_cifar (#3201)
it's a no-op with multi GPU and fewer STEPS. also, the patch was selected from the whole dataset, not from the same batch
2024-01-21 22:24:53 -05:00
chenyu
e52a609240 make WINO a context var, and LATEWINO in hlb_cifar (#3161) 2024-01-17 20:21:26 -05:00
chenyu
589c16756f hlb_cifar multi gpu training (#3150)
* cifar train with multi gpu

* GPUS=1 is noop
2024-01-16 14:38:45 -05:00
chenyu
b9d470577c gelu -> quick_gelu in hlb_cifar (#3147)
89 -> 86 seconds, same eval acc
2024-01-16 02:03:37 -05:00
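The swap above replaces the tanh-based GELU with the sigmoid approximation; a small sketch comparing the two (1.702 is the standard quick-GELU constant):

```
from tinygrad import Tensor

x = Tensor.randn(1024)
quick = x * (x * 1.702).sigmoid()              # quick_gelu: a single sigmoid
print((quick - x.gelu()).abs().max().numpy())  # small approximation error, cheaper to compute
```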
chenyu
ec5a212b0a modernize hlb_cifar (#3146)
* modernize hlb_cifar

do more things in Tensor space instead of numpy, clean up dtypes and use more Tensor methods.

* eigens are float64
2024-01-16 01:35:11 -05:00
chenyu
22920a7e55 add LATEBEAM to hlb_cifar (#3142)
still too slow to search on tinybox though
2024-01-15 23:26:03 -05:00
Yixiang Gao
8e1fd6ae9d test works 2024-01-03 07:22:01 -08:00
Yixiang Gao
4f89f8b73a make sure the old hyp breaks the test 2024-01-03 07:13:54 -08:00
Yixiang Gao
b753d280f7 move hyp out of the train so it can be imported 2024-01-02 15:56:17 -08:00
Yixiang Gao
2e4d9ad936 adjust div factor to avoid underflow 2024-01-02 13:47:13 -08:00
George Hotz
a280cfe169 move dtypes to dtype.py (#2964)
* move dtypes to dtype.py

* fix urllib
2024-01-01 14:58:48 -08:00
George Hotz
c81ce9643d move globalcounters to ops (#2960)
* move globalcounters to ops

* missed a few

* sick of that failing
2024-01-01 14:21:02 -08:00
chenyu
6d7e9e0a56 hotfix convert Y_train to int before passing into index (#2850) 2023-12-19 11:40:56 -05:00
chenyu
0723f26c80 dtypes.default_float and dtypes.default_int (#2824) 2023-12-18 12:21:44 -05:00
George Hotz
c6eb618013 tests from new lazy branch (#2774)
* tests from new lazy branch

* fix lin 11

* that was needed

* doesn't fail

* mark

* meant that

* llvm passes
2023-12-14 23:06:39 -08:00
qazal
ab2d4d8d29 Fix cl import in the copy_speed test and cifar example (#2586)
* fix CL import

* update test to only run on GPU

* update hlb_cifar too
2023-12-03 09:22:07 -08:00
George Hotz
2c363b5f0b new style device (#2530)
* cpu tests pass

* torch works

* works

* metal works

* fix ops_disk

* metal jit works

* fix openpilot

* llvm and clang work

* fix webgpu

* docs are rly broken

* LRU works on metal

* delete comment

* revert name to ._buf. LRU only on Compiled

* changes

* allocator

* allocator, getting closer

* lru alloc

* LRUAllocator

* all pass

* metal

* cuda

* test examples

* linearizer

* test fixes

* fix custom + clean realize

* fix hip

* skip tests

* fix tests

* fix size=0

* fix MOCKHIP

* fix thneed

* copy better

* simple

* old style metal copy

* fix thneed

* np reshape

* give cuda a device
2023-11-30 17:07:16 -08:00