Commit Graph

171 Commits

chenyu
7948b05738 fix uneven shard with shrink and pad args on sharded axis (#5131)
it's incorrect to assume that the first (len(devices)-1) shards all have the same size, e.g. sharding a size-2 axis across 4 devices gives (1, 1, 0, 0)
2024-06-24 16:55:50 -04:00
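
A rough plain-Python sketch of the uneven shard sizing described in the commit message above; the helper name is illustrative and this is not tinygrad's actual sharding code:

    import math

    def shard_sizes(dim, n_devices):
        # each shard takes ceil(dim / n_devices) elements until the axis is exhausted,
        # so trailing shards can be smaller or even empty -- the first n_devices-1 shards
        # are not guaranteed to be equal-sized
        per = math.ceil(dim / n_devices)
        sizes, left = [], dim
        for _ in range(n_devices):
            sizes.append(min(per, left))
            left -= sizes[-1]
        return sizes

    assert shard_sizes(2, 4) == [1, 1, 0, 0]   # the example from the commit message
    assert shard_sizes(7, 4) == [2, 2, 2, 1]
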
chenyu
4a7d403777 cleanup test_multitensor (#5118)
renamed d_zero, d0, d1, d2, ... to d0, d1, d2, d3 and reused some multi device tuples
2024-06-23 20:54:22 -04:00
chenyu
c0ba5e0dfb multi copy_to_device return the copy on same device if possible (#5117)
previously it always returned the copy from the first device
2024-06-23 20:25:56 -04:00
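
A tiny sketch of the selection idea described above (hypothetical helper, not the real multi-tensor copy_to_device code): prefer the shard that already lives on the target device instead of always copying from the first one.

    def pick_source_shard(shard_devices, target_device):
        # return the index of the shard already on the target device, if any;
        # otherwise fall back to the first shard as before
        for i, d in enumerate(shard_devices):
            if d == target_device: return i
        return 0

    assert pick_source_shard(("GPU:0", "GPU:1", "GPU:2"), "GPU:1") == 1   # no cross-device copy needed
    assert pick_source_shard(("GPU:0", "GPU:1", "GPU:2"), "CPU") == 0
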
chenyu
b886d250fb improve test_dropout_on_shard (#4912)
tested some basic properties, plus minor formatting for a few Tensor.training setups
2024-06-11 11:36:02 -04:00
George Hotz
35e53c0809 add sharded arange test (#4908) 2024-06-11 10:58:33 +02:00
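
A sketch of what such a test might check, assuming current Tensor.shard / .numpy() behavior; the two-device setup is illustrative:

    from tinygrad import Tensor, Device
    import numpy as np

    devices = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))   # two shards on the default backend
    t = Tensor.arange(10).shard(devices, axis=0)                 # shard the arange across devices
    np.testing.assert_equal(t.numpy(), np.arange(10))            # should round-trip to the same values
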
chenyu
e33efd6a3d test cases for multitensor adds const (#4892)
Tested that a const stays const in the AST. Also removed the TODO in _to_const_val.
2024-06-08 22:57:48 -04:00
nimlgen
e78a9bf3f2 support view in nv/amd (#4812)
* support view in nv/amd

* fix amd

* fix

* run test on nv/amd
2024-06-03 22:11:52 +03:00
qazal
637f482588 configure derandomizing CI tests (#4793) 2024-05-31 17:06:58 +03:00
George Hotz
07b350a8f4 new uops is an actual graph (#4560)
* new uops is an actual graph

* it's way slower

* simpler

* fix define acc

* render_loop unique

* ops test pass

* add pattern matcher back, there's bugs

* rewrite

* use priority queue

* recursive children

* fix tests

* fix tests with SINK

* fix abstractions

* fix assembly

* simpler

* link define_acc

* fix DEFINE_ACC placement

* type verify

* full cmp

* fix cmp

* ACCESS_ACC

* insert DEFINE_ACC

* fix PHI

* recursive rewrite

* fix many tests

* sum collapse

* more patterns

* correct change

* fold arange

* fix that lin test

* space

* big folding rule works

* close

* has more maxes, meh

* cached node replace

* set changed

* simplest folding yet

* works

* works

* DIV

* all tests pass

* del

* fuzz linearizer fails

* sum_collapse

* test depth 2 cf

* fix lin test 14

* fix clang depth

* disable that

* failure 14 is fixed

* fix ptx

* failure 27 is fixed

* fix llama

* run_cnt

* Revert "Optimize PTX gated loads index calculation (#4304)"

This reverts commit d97d5a7689.

* fix uops loop

* fix ptx bugs

* add barrier

* print

* mem_type in ptx direct

* bypass tests that fail in CI but pass locally

* ptx remove ptr_ar

* more ptx passing

* fix ptx tests

* assert compile support

* remove model inference benchmark from red
2024-05-17 18:00:18 -07:00
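
A toy sketch of the recursive, rule-based rewriting idea this PR builds around (the real UOp graph and pattern matcher are different; this only illustrates rewriting a small expression graph to a fixed point):

    def rewrite(node, rules):
        # node is either a leaf or a tuple (op, *operands); rewrite operands first,
        # then keep applying rules at this node until no rule matches
        if isinstance(node, tuple):
            node = (node[0], *(rewrite(n, rules) for n in node[1:]))
        for rule in rules:
            new = rule(node)
            if new is not None: return rewrite(new, rules)
        return node

    rules = [
        lambda n: n[1] if isinstance(n, tuple) and n[0] == "mul" and n[2] == 1 else None,  # x*1 -> x
        lambda n: n[1] if isinstance(n, tuple) and n[0] == "add" and n[2] == 0 else None,  # x+0 -> x
    ]
    assert rewrite(("add", ("mul", "x", 1), 0), rules) == "x"
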
nimlgen
eb9689336e nv mockgpu (#4600)
* mockgpu nv

* works

* comment that out

* fix merge

* setup gpuocelot

* install packages

* not run all of them

* passes

* fix ci

* almost

* should pass

* linter

* linter 2

* try this?

* ugn, not supported

* ci

* remove ticket from description

* better descs
2024-05-15 23:46:08 +03:00
George Hotz
5ba611787d move image into tensor.py. delete features (#4603)
* move image into tensor.py

* change setup.py

* openpilot tests need pythonpath now
2024-05-15 10:50:25 -07:00
George Hotz
2f970a4fc2 all realize 2 (#4527)
* all realize 2

* tests fixup

* fix more tests

* fix openpilot

* fix tests

* unneeded
2024-05-10 22:43:09 -07:00
George Hotz
89e119bc58 move Allocator to buffer.py (#4502)
* move Allocator to buffer.py

* move those to realize

* memory file

* cleanup
2024-05-09 19:45:56 -07:00
George Hotz
c9e84ed0da refactor to Program class (#4476)
* refactor to Program class

* switch to Program

* fix tests

* smaller diff

* self.p

* more tests

* fix metal test

* tests

* fix openpilot

* move that to linearizer

* p.launchdims
2024-05-09 17:29:07 -07:00
George Hotz
17faae091b optimizer shouldn't be run without training (#4460)
* optimizer shouldn't be run without training

* set training in relevant tests

* fix multitensor

* that too
2024-05-06 15:34:12 -07:00
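
A minimal sketch of the intended usage after this change, assuming current tinygrad APIs: the optimizer step is only valid while Tensor.training is set.

    from tinygrad import Tensor
    from tinygrad.nn import Linear
    from tinygrad.nn.optim import SGD
    from tinygrad.nn.state import get_parameters

    model = Linear(4, 2)
    opt = SGD(get_parameters(model), lr=0.01)

    Tensor.training = True                   # opt.step() asserts training mode is on
    loss = model(Tensor.rand(8, 4)).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    Tensor.training = False
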
George Hotz
9fc4465557 subbuffer support (#4397)
* subbuffer support

* diskbuffer offset

* cuda subbuffer works

* use subbuffer

* more subbuffer tests

* consecutive

* cast

* consec

* offset

* view is a better name

* offset is in nbytes

* fix view + memory planner

* delete unused DiskRunner

* reverse order

* no subbuffers on unrealized consts

* only enabled for disk

* don't reverse memory

* view supported devices

* pickle buffer view

* ring jit

* support extra view inputs in jit

* fix JIT=2 issue

* test copy jit

* p2p isn't an option anymore

* fix dep tracking issue

* fix mypy

* fix pickle

* from_nv is contents now
2024-05-03 18:05:57 -07:00
George Hotz
c8a2047377 testing for all reduce (#4387) 2024-05-02 06:34:10 -07:00
chenyu
f363f39e83 fix dtype of const folded sum (#4349)
a const-folded sum should return the same dtype as a regular sum, which can differ from the input dtype
2024-04-29 11:40:45 -04:00
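
A test-style sketch of the invariant, assuming Tensor.full and Tensor.arange accept a dtype kwarg: the dtype of a const-foldable sum should match the dtype of a regular sum.

    from tinygrad import Tensor, dtypes

    const_sum   = Tensor.full((16,), 1, dtype=dtypes.int8).sum()   # foldable: every element is the same
    regular_sum = Tensor.arange(16, dtype=dtypes.int8).sum()       # not foldable
    assert const_sum.dtype == regular_sum.dtype                    # both may widen past int8, but identically
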
George Hotz
50e780a588 multitensor shouldn't recompile (#4164)
* multitensor shouldn't recompile

* type annotations

* fix tests

* outcount in reduce
2024-04-13 00:03:48 -07:00
uuuvn
2b81d9b334 Fix broken test (#4104) 2024-04-07 12:02:12 -04:00
uuuvn
bb7567b365 Fix metal (#4101) 2024-04-07 05:21:19 -07:00
George Hotz
a337922c44 more work on kfd (#4079)
* more work on kfd

* fix multitensor test on kfd

* stuff
2024-04-05 08:36:36 -07:00
chenyu
82440d3416 don't call contiguous for unpadded const into multi tensor (#4032)
* don't call contiguous for unpadded const into multi tensor

fixed multi const folding for sharded const.
still wip, need to be careful that this does not break multi device cache somewhere

* ehh need a memory test for that

* simple sharded memory test
2024-04-01 19:22:14 -04:00
George Hotz
9eef44521b ScheduleItem uses Buffer (#3995)
* schedule Buffer

* update

* update tests

* master

* works

* remove LoadOps.WAIT

* fix compile2

* bad test

* rename and note
2024-03-29 20:50:27 -07:00
George Hotz
68ca4d4276 split to schedule.py (#3949)
* split to schedule.py

* split
2024-03-26 21:02:46 -07:00
George Hotz
150ea2eb76 create engine folder and move code (#3948)
* retry

* older tf

* that
2024-03-26 20:38:03 -07:00
George Hotz
03899a74bb increase atol on reset train 2024-03-24 15:17:31 -07:00
wozeparrot
9a9cac58f9 add lars to nn (#3750)
* feat: add lars

* feat: don't remove this comment

* clean: smaller diff

* clean: shorter line

* feat: remove mlperf lars, switch resnet

* fix: fully remove mlperf lars

* clean: comment

* feat: contiguous

* feat: no weight decay on skip params

* feat: optimizergroup

* feat: classic momentum

* fix: pylint

* clean: move comment

* fix: correct algo

* feat: lrschedulergroup

* feat: skip list tests

* feat: :| forgot that params are a thing

* feat: remove skip_list params from main params

* feat: set moment

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-24 11:43:12 -04:00
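
A rough NumPy sketch of one LARS update with classic momentum (not necessarily the exact optimizer code added here; hyperparameter names are illustrative):

    import numpy as np

    def lars_step(w, g, m, lr=1.0, momentum=0.9, wd=1e-4, trust_coef=0.001):
        # scale the step per layer by trust_coef * ||w|| / (||g|| + wd * ||w||),
        # then apply weight decay and classic (heavy-ball) momentum
        w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
        trust = trust_coef * w_norm / (g_norm + wd * w_norm) if w_norm > 0 and g_norm > 0 else 1.0
        m[:] = momentum * m + trust * lr * (g + wd * w)
        w -= m
        return w, m
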
uuuvn
6729f20aab Ring allreduce try 2 (#3852)
* Ring allreduce v3

* Configurable size, number of gpus and jit in benchmark

* ScheduleBarrier v0

* GB/s that make sense

* ScheduleBarrier v0.1

* Fallback on 2 GPUs

* ScheduleBarrier v0.2

* ScheduleBarrier v0.3

* ScheduleBarrier v0.3.1

* ScheduleBarrier v0.3.2

* Replace ScheduleBarrier with automatic optimization

* unused import

* fix comment

* typing

* better fallback

* python 3.8

* RING=2 and use ContextVar

* DEBUG >= 2 and change name

* linter

* type

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-21 19:17:51 -04:00
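
A self-contained plain-Python sketch of the ring allreduce idea behind both attempts: a reduce-scatter pass followed by an all-gather pass, where each "device" only exchanges one chunk per step with its ring neighbor.

    import numpy as np

    def ring_allreduce(chunks):
        # chunks[d][c] is chunk c held by device d; one chunk per device
        n = len(chunks)
        data = [[np.array(c, dtype=float) for c in dev] for dev in chunks]
        for s in range(n - 1):                       # reduce-scatter: accumulate partial sums around the ring
            for d in range(n):
                c = (d - s) % n
                data[(d + 1) % n][c] = data[(d + 1) % n][c] + data[d][c]
        for s in range(n - 1):                       # all-gather: circulate the finished chunks
            for d in range(n):
                c = (d + 1 - s) % n
                data[(d + 1) % n][c] = data[d][c]
        return data

    chunks = [[np.full(2, float(i + j)) for j in range(4)] for i in range(4)]
    out = ring_allreduce(chunks)
    expected = [sum(dev[j] for dev in chunks) for j in range(4)]
    assert all(np.array_equal(out[d][j], expected[j]) for d in range(4) for j in range(4))
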
George Hotz
8cb5215885 Revert "Ring allreduce in multitensor (#3000)" (#3840)
This reverts commit c5bf9e4c96.
2024-03-20 11:41:49 -07:00
uuuvn
c5bf9e4c96 Ring allreduce in multitensor (#3000)
* Ring allreduce v3

* Configurable size, number of gpus and jit in benchmark

* ScheduleBarrier v0

* GB/s that make sense

* ScheduleBarrier v0.1

* Fallback on 2 GPUs

* ScheduleBarrier v0.2

* ScheduleBarrier v0.3

* ScheduleBarrier v0.3.1

* ScheduleBarrier v0.3.2

* Replace ScheduleBarrier with automatic optimization

* unused import

* fix comment

* typing

* better fallback

* python 3.8

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-20 11:20:01 -07:00
George Hotz
3527c5a9d2 add Tensor.replace (#3738)
* add Tensor.replace

* fix dtypes in that test

* should be replace

* and mixtral
2024-03-14 13:34:14 -07:00
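
A minimal sketch of why this is useful, assuming Tensor.replace swaps the underlying data while keeping the same Python object: references held elsewhere (e.g. in a parameter list) keep pointing at the same Tensor.

    from tinygrad import Tensor

    w = Tensor.rand(4, 4)
    params = [w]                      # something else holds a reference to this exact object
    w.replace(Tensor.zeros(4, 4))     # swap the contents in place; shape stays the same
    assert params[0] is w
    assert params[0].numpy().sum() == 0
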
David Hou
199f7c4342 MLPerf Resnet (cleaned up) (#3573)
* this is a lot of stuff

TEST_TRAIN env for less data

don't diskcache get_train_files

debug message

no lr_scaler for fp32

comment, typo

type stuff

don't destructure proc

make batchnorm parameters float

make batchnorm parameters float

resnet18, checkpointing

hack up checkpointing to keep the names in there

oops

wandb_resume

lower lr

eval/ckpt use e+1

lars

report top_1_acc

some wandb stuff

split fw and bw steps to save memory

oops

save model when reach target

formatting

make sgd hparams consistent

just always write the cats tag...

pass X and Y into backward_step to trigger input replace

shuffle eval set to fix batchnorm eval

dataset is sorted by class, so the means and variances are all wrong

small cleanup

hack restore only one copy of each tensor

do bufs from lin after cache check (lru should handle it fine)

record epoch in wandb

more digits for topk in eval

more env vars

small cleanup

cleanup hack tricks

cleanup hack tricks

don't save ckpt for testeval

cleanup

diskcache train file glob

clean up a little

device_str

SCE into tensor

small

small

log_softmax out of resnet.py

oops

hack :(

comments

HeNormal, track gradient norm

oops

log SYNCBN to wandb

real truncnorm

less samples for truncated normal

custom init for Linear

log layer stats

small

Revert "small"

This reverts commit 988f4c1cf3.

Revert "log layer stats"

This reverts commit 9d98224585.

rename BNSYNC to SYNCBN to be consistent with cifar

optional TRACK_NORMS

fix label smoothing :/

lars skip list

only weight decay if not in skip list

comment

default 0 TRACK_NORMS

don't allocate beam scratch buffers if in cache

clean up data pipeline, unsplit train/test, put back a hack

remove print

run test_indexing on remu (#3404)

* emulated ops_hip infra

* add int4

* include test_indexing in remu

* Revert "Merge branch 'remu-dev-mac'"

This reverts commit 6870457e57, reversing
changes made to 3c4c8c9e16.

fix bad seeding

UnsyncBatchNorm2d but with synced trainable weights

label downsample batchnorm in Bottleneck

:/

:/

i mean... it runs... it hits the acc... it's fast...

new unsyncbatchnorm for resnet

small fix

don't do assign buffer reuse for axis change

* remove changes

* remove changes

* move LARS out of tinygrad/

* rand_truncn rename

* whitespace

* stray whitespace

* no more gnorms

* delete some dataloading stuff

* remove comment

* clean up train script

* small comments

* move checkpointing stuff to mlperf helpers

* if WANDB

* small comments

* remove whitespace change

* new unsynced bn

* clean up prints / loop vars

* whitespace

* undo nn changes

* clean up loops

* rearrange getenvs

* cpu_count()

* PolynomialLR whitespace

* move he_normal out

* cap warmup in polylr

* rearrange wandb log

* realize both x and y in data_get

* use double quotes

* combine prints in ckpts resume

* take UBN from cifar

* running_var

* whitespace

* whitespace

* typo

* if instead of ternary for resnet downsample

* clean up dataloader cleanup a little?

* separate rng for shuffle

* clean up imports in model_train

* clean up imports

* don't realize copyin in data_get

* remove TESTEVAL (train dataloader didn't get freed every loop)

* adjust wandb_config entries a little

* clean up wandb config dict

* reduce lines

* whitespace

* shorter lines

* put shm unlink back, but it doesn't seem to do anything

* don't pass seed per task

* monkeypatch batchnorm

* the reseed was wrong

* add epoch number to desc

* don't use unsyncedbatchnorm if syncbn=1

* put back downsample name

* eval every epoch

* Revert "the reseed was wrong"

This reverts commit 3440a07dff3f40e8a8d156ca3f1938558a59249f.

* cast lr in onecycle

* support fp16

* cut off kernel if expand after reduce

* test polynomial lr

* move polynomiallr to examples/mlperf

* working PolynomialDecayWithWarmup + tests.......

add lars_util.py, oops

* keep lars_util.py as intact as possible, simplify our interface

* no more half

* polylr and lars were merged

* undo search change

* override Linear init

* remove half stuff from model_train

* update scheduler init with new args

* don't divide by input mean

* mistake in resnet.py

* restore whitespace in resnet.py

* add test_data_parallel_resnet_train_step

* move initializers out of resnet.py

* unused imports

* log_softmax to model output in test to fix precision flakiness

* log_softmax to model output in test to fix precision flakiness

* oops, don't realize here

* is None

* realize initializations in order for determinism

* BENCHMARK flag for number of steps

* add resnet to bechmark.yml

* return instead of break

* missing return

* cpu_count, rearrange benchmark.yml

* unused variable

* disable tqdm if BENCHMARK

* getenv WARMUP_EPOCHS

* unlink disktensor shm file if exists

* terminate instead of join

* properly shut down queues

* use hip in benchmark for now

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-14 00:53:41 -04:00
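
Of the many pieces above, the LR schedule is the most self-contained; a rough sketch of polynomial decay with linear warmup (illustrative names, not the exact examples/mlperf PolynomialDecayWithWarmup):

    def poly_lr_with_warmup(step, base_lr, end_lr, warmup_steps, total_steps, power=2.0):
        # linear warmup to base_lr, then polynomial decay down to end_lr
        if step < warmup_steps:
            return base_lr * (step + 1) / warmup_steps
        t = min(step - warmup_steps, total_steps - warmup_steps)
        frac = 1 - t / (total_steps - warmup_steps)
        return (base_lr - end_lr) * (frac ** power) + end_lr
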
qazal
337cd53444 multioutput ScheduleItem (#3699)
* refactor realize.py

* update docs

* update test_sched

* update runners and devices

* update openpilot and unit tests

* cleanup runner lowering

* update more tests
2024-03-13 08:59:38 -07:00
chenyu
4552248c84 fix Tensor.to preserves grad.data (#3636) 2024-03-06 21:44:49 -05:00
chenyu
8f10bfa2ff ban __bool__ on Tensor (#3632)
* ban __bool__ on Tensor

avoid misuse

* test case

* fix tests

* fix more tests
2024-03-06 17:12:35 -05:00
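
A sketch of the misuse this guards against, assuming truth-testing a Tensor now raises (the exact exception type is an implementation detail):

    from tinygrad import Tensor

    t = Tensor([1, 2, 3])
    try:
        if t:                      # ambiguous: any()? all()? non-empty? so it is banned
            pass
    except Exception as e:
        print("bool(Tensor) is banned:", e)
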
Elias Wahl
a1507c7fd4 Fix Tensor.dropout() with multigpu (#3619)
* Tensor.rand with multilazybuffer

* remove recursive + test

* whitespace

* another whitespace. Sorry

* remove else

* Canonicalize multidevice tuple + Remove src
2024-03-05 18:26:21 -05:00
David Hou
d16aa89561 don't allow MLB assigns with different axes (#3557)
* allow LB <- MLB assign, but don't reuse buffer

* update test

* update test

* assign assert axes are the same

* update tests to manually shard running stats

* unused import
2024-03-01 07:59:06 -05:00
chenyu
cfd23f398d Revert "don't allow MLB assigns with different axes (#3483)" (#3554)
This reverts commit f19d8bb7b4.
2024-02-29 23:13:07 -05:00
David Hou
f19d8bb7b4 don't allow MLB assigns with different axes (#3483)
* allow LB <- MLB assign, but don't reuse buffer

* update test

* update test

* assign assert axes are the same
2024-02-29 23:04:12 -05:00
David Hou
e5385eecfc UnsyncedBatchNorm with synced trainable weights for hlb cifar (#3472)
* UnsyncedBatchNorm with synced trainable weights for hlb cifar

* multitensor reshape tests

* test mlb assign change axis

* E501

* argfix axis

* don't import batchnorm from hlb_cifar in test_multitensor

* pass num_devices to UnsyncedBatchNorm in test, allow UnsyncedBatchNorm to be used with LB

* add backprop test for UnsyncedBatchNorm

* break out MLB assign and reshape changes

* manually shard running mean and running var

* don't shard unless syncbn=0

* replace nn.BatchNorm2d with UnsyncedBatchNorm

* don't increment num_batches_tracked if not tracking running stats

* update tests

* oops

* Revert "oops"

This reverts commit 5e8a67a535.

* Revert "update tests"

This reverts commit 7ebf65d89a.

* Revert "don't increment num_batches_tracked if not tracking running stats"

This reverts commit 78de0ea9ee.

* Revert "replace nn.BatchNorm2d with UnsyncedBatchNorm"

This reverts commit d03da53da7.

* don't increment num_batches_tracked if not tracking running stats

* oops

* test_batchnorm_axis

* compare against torch

* types

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-02-29 22:52:07 -05:00
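
A NumPy sketch of the core idea (not the hlb_cifar/tinygrad code): each device normalizes with its own shard's statistics, while the trainable gamma/beta are shared across devices.

    import numpy as np

    def unsynced_batchnorm(xs_per_device, gamma, beta, eps=1e-5):
        # xs_per_device: one (N, C) shard per device; no cross-device mean/var sync,
        # but every shard uses the same (synced) trainable gamma and beta
        outs = []
        for x in xs_per_device:
            mean, var = x.mean(axis=0), x.var(axis=0)
            outs.append(gamma * (x - mean) / np.sqrt(var + eps) + beta)
        return outs
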
David Hou
5cfcc2a8d7 support MLB reshaping on-axis for evenly sharded (#3484)
* support MLB reshaping on-axis for evenly sharded

* update test

* not -> !=
2024-02-23 07:51:36 -05:00
George Hotz
2e60012bcf move create schedule and delete old API (#3377)
* move create schedule and delete old API

* fix test multitensor
2024-02-12 18:10:45 +01:00
chenyu
18e854cdbf shrink MLB on sharded axis (#3255)
* shrink MLB on sharded axis

use a one-hot structure to store the real partition. The goal is an unsynced BatchNorm2d that can run on multiple GPUs for training.

draft version in https://github.com/chenyuxyz/tinygrad/pull/109

* SYNCBN flag

* test unclean shrinks

* UnsyncedBatchNorm reuses BatchNorm

* more robust pad arg check

* better types

* more tests!

* 6 gpus in benchmark

* disable slow GPUS=6 benchmark
2024-01-31 21:48:25 -05:00
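
A plain-Python sketch of the per-device bookkeeping a shrink on the sharded axis needs (the real multi-tensor code tracks this with a one-hot partition structure; this helper is illustrative):

    def shrink_on_sharded_axis(bounds, start, end):
        # bounds: per-device (lo, hi) ranges on the sharded axis; a global shrink (start, end)
        # becomes a per-device shrink clipped to each shard, possibly empty on some devices
        per_device = []
        for lo, hi in bounds:
            s, e = max(start, lo), min(end, hi)
            per_device.append((s - lo, max(e - lo, s - lo)))   # local coordinates
        return per_device

    # two devices holding [0,4) and [4,8); shrinking to [2,6) keeps [2,4) of the first shard and [0,2) of the second
    assert shrink_on_sharded_axis([(0, 4), (4, 8)], 2, 6) == [(2, 4), (0, 2)]
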
George Hotz
47f9887ce4 hip events work (#3229)
* hip events work

* event
2024-01-24 11:49:53 -08:00
George Hotz
e2e4632aea LoadOps SYNC (#3223)
* LoadOps SYNC and WAIT

* no wait, only sync

* DEBUG >= 1

* track cross device
2024-01-23 21:59:18 -08:00
chenyu
c4b5661146 fuzz length for multitensor reduce test case (#3190)
so that the uneven case is not just length 0 and can have other positive values
2024-01-20 00:44:38 -05:00
chenyu
fdb1c2b1d9 move reduce over 0 len axis logic to lazy.py (#3188)
* move reduce over 0 len axis logic to lazy.py

this fixes the uneven shard reduce case where the uneven shard has length 0

* fix interpreted backends

* fix backwards for 0 shape tensors too
2024-01-20 00:13:03 -05:00
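
The key fact this relies on, shown with NumPy for concreteness: reducing over a zero-length axis yields the reduction's identity, which is exactly what an empty shard should contribute to an uneven multi-device reduce.

    import numpy as np

    x = np.zeros((0, 3))
    assert np.array_equal(x.sum(axis=0), np.zeros(3))   # sum over an empty axis -> 0, the identity
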
George Hotz
254a7372fe buffer copy refactor (#3187) 2024-01-19 20:21:24 -08:00
chenyu
cb4cfc078a parameterize multitensor tests for reduce (#3181)
reduce over uneven shards is currently incorrect
2024-01-19 14:03:01 -05:00