1248 Commits

Author SHA1 Message Date
George Hotz
d438d5698d bring buffer back to device (#4517) 2024-05-10 11:22:31 -07:00
George Hotz
1e843d495e cleaning up search with Program (#4500)
* cleaning up search

* fix tests

* test fix

* minor compiler cleanup
2024-05-09 19:01:53 -07:00
George Hotz
c9e84ed0da refactor to Program class (#4476)
* refactor to Program class

* switch to Program

* fix tests

* smaller diff

* self.p

* more tests

* fix metal test

* tests

* fix openpilot

* move that to linearizer

* p.launchdims
2024-05-09 17:29:07 -07:00
Francis Lam
c8595a9655 update sops.gz, fix tests and add new linearizer test (#4437)
* update sops.gz, fix tests and add new linearizer test

* remove METAL CI skip for test_failure_22

* re-add skip to METAL CI to test_failure_22
2024-05-05 17:31:25 -04:00
George Hotz
12be536c06 Clang graph (#4424)
* clang graph runner

* render_dtype

* name it ClangGraph

* JIT=2

* JIT=2 goes there

* JIT as context var
2024-05-05 09:54:12 -07:00
George Hotz
cb7289f9c9 remove clang program header (#4422)
* remove clang program header

* proper max

* bools are numbers

* fix compile enet
2024-05-04 08:38:01 -07:00
chenyu
22376e53b7 resnet mlperf logging (#4361)
* resnet mlperf logging

* cropping too much?
2024-05-02 00:00:04 -04:00
George Hotz
8bcf533a84 gitignore open-images-v6TEST 2024-05-01 13:55:38 +00:00
Elias Wahl
27613dd881 MLPerf BERT: Main training loop (#4288)
* BERT language modeling head + trunc normal initializers

* add train loop + helpers

* shuffle in dataloaders + slight changes in main loop

* beam change

* Minor changes

* random.shuffle

* HParam update

* Use deque for dataloader

* wandb bert project name

* half fixes

* BENCHMARK + remove epoch

* cast + print()

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-29 14:35:27 -04:00
geohotstan
bf412aeb80 use tolist instead of numpy for extracting parameters in onnx (#4333)
* still some numpy left

* all pass

* oops indent

* fix up safe_python

* to_python_const
2024-04-29 10:48:20 -04:00
Francis Lata
bb849a57d1 [MLPerf] UNet3D dataloader (#4343)
* add support for train/val datasets for kits19

* split dataset into train and val sets

* add tests for kits19 dataloader

* add MLPerf dataset tests to CI

* update unet3d model_eval script

* fix linting

* add nibabel

* fix how mock dataset gets created

* update ref implementation with permalink and no edits

* clean up test and update rand_flip implementation

* cleanups
2024-04-28 22:34:18 -04:00
chenyu
82d0ed3cf3 cap default dataset wikipedia max_workers to 32 (#4345)
64 on tinybox OOM
2024-04-28 21:55:21 -04:00
geohotstan
bc36940c28 fix (#4319) 2024-04-28 16:29:04 +08:00
chenyu
5ae252ae83 use at least float32 for optim.lr (#4297)
* use at least float32 for optim.lr

when doing mixed precision training (float32 weight, default_float=half), still use float32 to store lr.
it would have been upcasted later in actual weight update, but would have lost precision.
this improved resnet convergence significantly

* undo type annotation
2024-04-25 14:42:28 -04:00
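Note on 5ae252ae83 (#4297): the precision loss described in that commit message can be sketched without tinygrad. The snippet below is only an illustration using Python's standard `struct` module, not the code from the commit. The learning rate value `1e-5` is a hypothetical choice, since the commit does not state which values were used. It stores the value as IEEE 754 half and single precision, reads it back, and compares the relative error.

```python
import struct

def round_trip(value: float, fmt: str) -> float:
    """Store `value` in the given struct float format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

def relative_error(value: float, fmt: str) -> float:
    """Relative error introduced by storing `value` in the given format."""
    return abs(round_trip(value, fmt) - value) / value

lr = 1e-5  # hypothetical learning rate, not taken from the commit

err_half = relative_error(lr, "<e")    # "e" = IEEE 754 half precision (float16)
err_single = relative_error(lr, "<f")  # "f" = IEEE 754 single precision (float32)

print(f"float16 stored lr: {round_trip(lr, '<e')!r}, relative error {err_half:.2e}")
print(f"float32 stored lr: {round_trip(lr, '<f')!r}, relative error {err_single:.2e}")
```

Upcasting the float16 value to float32 afterwards cannot recover the lost bits, which is consistent with the commit's remark that the later upcast "would have lost precision".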
George Hotz
38f97aa0fe rename rawbufs to bufs in ExecItem (#4274) 2024-04-24 11:27:27 +08:00
nimlgen
f3b4dff7c9 KFDProgram -> AMDProgram (#4268) 2024-04-24 00:29:50 +03:00
Elias Wahl
69341144ba Wikipedia preprocessing script (#4229)
* Preprocessing script

* short seq prob

* comments + env vars

* Add preprocessing reference. Add test

* lint fix + add eval test support

* whitespaces

* point to commit

* comment

* rename

* better comments
2024-04-23 10:28:01 -04:00
George Hotz
9a95781d51 renamed (#4260) 2024-04-23 09:00:28 +04:00
George Hotz
2ae4f45272 WIP PM4 Support (#4110)
* pm4 kernel launch works

* disable USE_THREAD_DIMENSIONS

* add kernel code

* work on real pm4

* pm4 signal

* same

* gate pm4

* hcq tests pass

* ops passes

* pm4 is closer

* pm4 debug (#4165)

* start debug tests passing

* prg

* smth

* hdp flush

* cleaner 1

* do not need this

* logs not need

* small things

* linter

* remove AQL

* test hcq

* fix tests

* it's subtracting, it shouldn't be -1

* pm4 changes (#4251)

* not need this anymore

* sdma signal with non atomic

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-04-23 08:31:27 +04:00
Francis Lam
bbb0ad4800 wmma: widen TC usage in search by using PADTO on TC axes when possible (#4216)
* wmma: widen TC usage in search by using PADTO on TC axes when possible

* test: start tests for the new padding TC behavior

* search: upgrade padded TC search to TC_OPT >= 2

* test: add behavior and correctness test for padded TC

added optional argument to apply_tensor_core to set TC_OPT level

* linearizer: add tests for the PADTO behavior and docs
2024-04-22 16:50:31 -04:00
nimlgen
e6227bdb15 nv driver (#4044)
* start

* fix err 93

* gpu

* ioctl mappings

* alloc like cuda

* semaphores

* wait for semaphores value

* start ops_nv

* very simple kernels work

* init several gpus

* qmd dumper

* dirty, but most of kernels work

* always all test_ops

* progress, more tests, stable

* test_ops passes, gpt2 works

but with big fifo, wrap of fifo doesn't work, i think it's something coherency related

* need better sync

* fix sync

* alloc2

* all tests pass!

* cleanup 1

* cleanup

* multigpu, simple transfer

* fix sync

* correct init

* nv_gpu autogen + sync bug fix

* clean extra/nv_gpu_driver

* p2p

* clean up

* remove old gen

* small fixes

* cleanup

* cleanup 2

* small fixes

* bigger queue size

* cleanups

* wait

* fixed signals for devs

* fix hang + parallel beam

* small fixes

* detect when local memory is big in kernel

* correct assert

* small fixes

* correct tls size est

* one va space

* less lines

* shorter

* save 2 lines

* save some lines

* remove type ignores

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-22 19:50:20 +04:00
Elias Wahl
2ecd61e3e2 monkey patching (#4214) 2024-04-18 19:20:52 -04:00
chenyu
cd801a15f3 scipy.signal.gaussian -> scipy.signal.windows.gaussian (#4205)
fixed unet3d model_eval, will add to CI after merging new dice loss
2024-04-17 19:15:37 -04:00
Elias Wahl
6eef8ee22a Wikipedia download script for MLPerf BERT training (#4202)
* wikipedia download script

* add link

* checksum valueError

* ops
2024-04-17 16:34:57 -04:00
Francis Lam
c91b7b1739 test: add fuzz_matmul and better debugging for simple_matmul (#4199)
also show unoptimized shape in verify_kernel
2024-04-16 23:40:31 -04:00
George Hotz
55ae73e951 Replicate llm.c in tinygrad (#4179)
* write llm.c and add a few new methods to tensor

* training works

* add jit

* tests for new functions

* test tolist

* simple fix for onnx test failures (#4186)

* write llm.c and add a few new methods to tensor

* training works

* add jit

* tests for new functions

* bump line count to 7500

* simplest fix

* safenumpy tolist for now

---------

Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>

---------

Co-authored-by: geohotstan <135171913+geohotstan@users.noreply.github.com>
2024-04-16 15:40:48 +04:00
George Hotz
b7e281cf10 JitItem -> ExecItem (#4146)
* JitItem -> ExecItem

* execitem in realize

* cleaner

* JITRunner -> Runner
2024-04-11 08:24:57 -07:00
George Hotz
e79a11b99c hotfix: revert llama change 2024-04-10 20:13:15 -07:00
George Hotz
2e6c39b0b2 Do less realizes (#4141)
* less realize

* corealize jit inputs

* prints

* print before we run
2024-04-10 19:50:50 -07:00
geohotstan
fe88591890 update onnx to 1.16.0 (#4127)
* update

* pass tests and skip tests
2024-04-10 11:19:13 -04:00
Francis Lam
46850a0269 search: add a BEAM_COMPARE env to optionally not compare to hc/tc (#4107)
* search: add a BEAM_COMPARE env to optionally not compare to hc/tc

setting BEAM_COMPARE=0 will prevent additional memory allocation
needed to do the timing tests assuming the BEAM result is in
the diskcache.

* change to optionally use Buffer.allocate
2024-04-08 18:54:01 -04:00
chenyu
f8dc82a8a7 use single tensor for llama kv cache (#4108)
similar to optimization in gpt2
2024-04-08 00:38:32 -04:00
chenyu
92c0675ccf setitem initial support (#4093)
* wip setitem

it's an eager assign to output shapetracker view

* cleanups and tests

* more cleanups
2024-04-07 20:35:22 -04:00
geohotstan
183708b3fd broadcast expand to match torch (#4085)
* initial version

* heh gimme grrrreen

* version 2

* clean ups

* some test confusion

* fix onnx

* rename to _broadcast_tensors

* improved errors and test

* fixed?

* some test fixup

* version 3 lol

* comments

* cleaner

* add failure test for expand to 0 test

* 1 more assertRaises test

* make err msg better

* also rewrite the expand onnx op? :s
2024-04-07 16:23:13 -04:00
George Hotz
fffd9b05f5 mock mnist data for imagenet trainer (#4095)
* mock mnist data for imagenet

* move print and test

* needed to reshape
2024-04-06 08:08:40 -07:00
geohotstan
dafa42e864 clean up (#4081)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-05 11:57:44 -04:00
nimlgen
d6ba44bc1e kfd free buffers (#4027)
* kfd free buffers

* unmap

* all test passes

* better pm4

* forgot these

* invalidate only range

* better cache

* forgot

* comments

* fixes
2024-04-01 15:50:58 -07:00
Francis Lam
dcb58d3bed extra/gemm/simple_matvec: add simple_matvec.py (#4021)
we can test with this or add it to CI for benchmarks
2024-03-31 16:38:52 -04:00
chenyu
d3f27761b0 move const folding of ADD/SUB/MUL from tensor to lazy (#4020)
* move const folding of ADD/SUB/MUL from tensor to lazy

will do div and pow separately.

* fix onnx adding with None
2024-03-31 16:35:36 -04:00
George Hotz
2abb474d43 kfd driver wip (#3912)
* kfd driver wip

* cleanups

* kfd almost ready to ring doorbell

* ding dong?

* issues with signals

* something

* works

* ops kfd

* add amd_signal_t

* works...sometimes

* program runs

* _gpu_alloc cleanup

* cleanups

* work

* header + enable profiling (#3959)

* header + enable profiling

* just cleaner

* measure

* only local time domain

* remove old comments

* fix with master

* elf parsing (#3965)

* elf parsing

* fix kernels with private

* not used

* clean up

* clean up 2

* add flags

* kfd sdma (#3970)

* working sdma

* remove driver, shorter

* all commands we might need

* svm

* kfd remove hardcoded values (#4007)

* remove hardcoded values

* match above line

* 7k lines + revert hsa

* update that from origin

* fix sdma reg gen

* not the updated SDMA

* compiler_opts

* don't require kfd_ioctl

* get ioctls from python

* get ioctls from python

* remove build_sdma_command

* merge into 64-bit fields

* shorter

* fix property spelling and off by one

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-30 15:08:12 -07:00
Francis Lam
04746022b1 extra/gemm/hip_matmul: fix to use new HSA devices and no headers (#3999)
* extra/gemm/hip_matmul: fix to use new HSA devices and no headers

* remove compile_hip import
2024-03-30 15:42:23 -04:00
chenyu
c71627fee6 move GlobalCounter to helpers (#4002)
break circular import between ops and buffer
2024-03-30 00:30:30 -04:00
Akshit Talwar
0affbbf81c update amx gemm (#3991) 2024-03-29 11:45:03 -04:00
George Hotz
9a6ac2a50a create the buffer with the LazyBuffer (#3977)
* create the buffer with the LazyBuffer

* fixes

* hack underlying buffer when we change dtype

* we only care about allocated buffers

* asserts
2024-03-28 19:31:28 -07:00
chenyu
b47f6cebb2 LinearizerOptions -> CompilerOptions (#3978) 2024-03-28 17:50:23 -04:00
David Hou
4b95350c41 fp16 resnet (without expand backwards sum in float, doesn't work) (#3816)
* fp16 resnet

* cast running mean and var back to default float

* extra cast

* check symbolic no overflow

* add linearizer failure

* loss scaler after grad contig

* oops

* i think this works

* don't loss scale fp32

* remove overflow test case

* remove symbolic bounds check

* loss scaler should be float

* temporarily disable padto cuz bug

shruggie

* make running stats in batchnorm float32?

* calculate lars stuff in fp32?

* oops

* remove most changes

* move loss scaler out of optimizer

* no more FP16 var

* oops

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-28 01:25:37 -04:00
Francis Lam
7c5729a3bd wmma: refactor to remove wmma_func and create TC funcs as needed (#3945)
* wmma: refactor to remove wmma_func and create TC funcs as needed

* test_linearizer: disable bf16 CUDA during emulation testing

* cstyle: clean up creation of CUDA vec dtypes

* extra/gemm: add option to accumulate to bfloat16

* cleanups

* benchmark: add CUDA bfloat16 matmul

* more cleanups
2024-03-27 16:43:09 -04:00
George Hotz
68ca4d4276 split to schedule.py (#3949)
* split to schedule.py

* split
2024-03-26 21:02:46 -07:00
George Hotz
150ea2eb76 create engine folder and move code (#3948)
* retry

* older tf

* that
2024-03-26 20:38:03 -07:00
George Hotz
778d17fbd3 intel matmul (#3830)
* almost right

* intel xmx
2024-03-25 22:37:20 -07:00