Commit Graph

1242 Commits

Author SHA1 Message Date
George Hotz
0ba629c7b9 add world dataset (#2045) 2023-10-11 15:54:30 -07:00
George Hotz
0c3b6f13a8 Latest opt (#2044)
* split out actions

* rl algorithm
2023-10-11 15:46:14 -07:00
George Hotz
41bfeb2c1e start work on auto opt (#2034)
* start work on auto opt

* lin failure

* not beating hcopt

* greedy

* timing is fast

* codegen.search

* greedy search in handcode_opt

* track running gflops

* clean up those files

* no failure
2023-10-11 12:54:53 -07:00
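
The three commits above build a search over a discrete action space of kernel optimizations: apply an action, time the kernel, keep it only if it beats the current best. A toy sketch of that greedy loop, with a synthetic cost function standing in for a real timed GPU run (action names and the cost model are illustrative, not tinygrad's):

```python
# Toy greedy search over an optimization action space. A "kernel" here is
# just a tuple of applied actions; time_kernel is a made-up cost model
# standing in for an actual timed GPU run.
ACTIONS = ["upcast", "local", "unroll", "group"]

def time_kernel(kernel):
    # pretend each distinct action helps, with a penalty past three actions
    return 1.0 / (1 + len(set(kernel))) + 0.05 * max(0, len(kernel) - 3)

def greedy_search(kernel=()):
    best, best_t = kernel, time_kernel(kernel)
    while True:
        # try every action from the current best; keep the single biggest win
        t, k = min((time_kernel(best + (a,)), best + (a,)) for a in ACTIONS)
        if t >= best_t:
            return best, best_t   # no action improves: stop
        best, best_t = k, t

print(greedy_search())  # e.g. (('group', 'local', 'unroll'), 0.25)
```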
chenyu
1c980517c5 s/var_vals_from_ast/vars_from_ast (#2038) 2023-10-10 20:21:55 -07:00
George Hotz
f139060103 Rewrite hand coded opt with action space (#2030)
* tests passing

* hand coded opt with new abstractions

* simpler opts

* split out tensor cores
2023-10-10 07:38:38 -07:00
George Hotz
16ca8410f8 op logger + replay (#2021)
* logops

* fix dtype printing

* needs inf

* ops dataset

* minor improvements

* 12k kernels

* opt can compile

* graph flops
2023-10-08 15:10:18 -07:00
George Hotz
8db92bd060 fix tvm gemm example 2023-10-08 05:57:41 -07:00
Francis Lam
dece9958f8 wmma: clean up to make WMMA arg order consistent (#2014)
also add cache defeat to extra/gemm/simple_matmul.py
2023-10-07 17:45:40 -07:00
George Hotz
6ee9cae44f don't extract CIFAR every time / use the cache 2023-10-07 12:33:50 -07:00
George Hotz
dea8bb0938 triton isn't tested, and allows this refactor (#2007)
* triton isn't tested

* cuda buffer
2023-10-07 07:29:59 -07:00
Roelof van Dijk
26fcc8dff6 fix: remove runtime imports (#1982)
fix: import what is used

probably monkeypatched

fix: import

revert selective import
2023-10-07 05:23:08 -07:00
George Hotz
f54959e5cd move print tree into graph (#2003)
* move print tree into graph

* add winograd profiling test

* change pre-commit to run ruff first
2023-10-07 04:39:21 -07:00
Ahmed Harmouche
2114dc13d1 Allow multi-input model export (#1995)
* Allow multi-input model export

* Add model export unit test

* Fix efficientnet compilation

* Only run model export test on JIT supported devices

* Skip export model test if not EXPORT_SUPPORTED_DEVICE
2023-10-07 04:13:34 -07:00
George Hotz
ffa33d743a good changes from openpilot_compile2 (#2000)
* good changes from openpilot_compile2

* float32 image type was wrong

* cleaner way to write that + a test
2023-10-06 13:33:24 -07:00
Francis Lam
0ba75c4370 optimizer: add matvec optimizations (#1972)
* optimizer: add matvec optimizations

* renderer: fix alignment of shared memory in opencl
2023-10-04 14:16:27 -07:00
George Hotz
de5d603ec1 corealize + remove realize from lazybuffer (#1968)
* corealize + remove realize from lazybuffer

* fix multigpu

* fix graph
2023-10-04 10:59:31 -07:00
nimlgen
2ea1dd3e87 no process() in Linearizer (#1966)
* no process() in Linearizer

* more process() clean up
2023-10-04 07:18:42 -07:00
George Hotz
717451a244 Revert "optimizer: add matvec optimizations (#1753)" (#1959)
This reverts commit f520323054.
2023-10-03 00:28:42 -07:00
Francis Lam
f520323054 optimizer: add matvec optimizations (#1753)
* optimizer: add matvec optimizations

* Update optimizer.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-10-03 00:01:59 -07:00
David Hou
8e9db88474 expand after expr_idxs in Linearizer.global_load (#1818)
* small changes

* expand in terms of substitute, directly expand g_idxs g_valid

* delete expand_ops

* don't compare using hash

* any instead of in

thanks gijskoning

Co-authored-by: Gijs Koning <gijs-koning@live.nl>

* support tc

* testing code

* no more create_rednode

* maxsize none in view/node

* oops

* undo

* typing

* oops

* oops

* lmao

* lmao

* add expand multi test

* Node.iter_idxs

* type

* type

* delete checks!

* clean up a little?

* expand_idx in symbolic

* un-golf

* play around with types >.>

* test_substitute and also remove an incorrect test?

* get rid of range

* Update symbolic.py

* split out view cache change

* split out flat components change

* reduce diff

* reduce diff

* add some float4 tests

* fix

---------

Co-authored-by: Gijs Koning <gijs-koning@live.nl>
2023-09-29 10:33:34 -07:00
Francis Lam
f445e056ed wmma: add test and tensor core shape (#1925) 2023-09-28 18:04:28 -07:00
Yixiang Gao
094d3d71be with Tensor.train() (#1935)
* add with.train

* remove the rest TODOs

* fix pyflake

* fix pyflake error

* fix mypy
2023-09-28 18:02:31 -07:00
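
The `with Tensor.train()` pattern this PR introduces is a context manager that flips tinygrad's global `Tensor.training` flag and restores it on exit, so dropout and batchnorm switch modes without manual bookkeeping. Rough usage (import path as of this era of the codebase):

```python
from tinygrad.tensor import Tensor

x = Tensor.randn(4, 16)
with Tensor.train():            # Tensor.training is True inside the block
    y = x.dropout(0.5)          # dropout active in training mode
print(Tensor.training)          # flag restored (False) after the block
```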
George Hotz
c36d0e3bd8 tvm import hook 2023-09-28 09:24:32 -07:00
George Hotz
adab724caa schedule2, keep the tests working with small changes (#1932)
* lazy cleanups

* ast functions take in LazyOps

* op instead of self.op

* _base for mops

* fix contiguous

* start schedule

* test_schedule

* fix openpilot

* more tests

* bugfix and test skip

* work

* make sure things get freed

* fix zerosized tensors

* fix failing test

* fix ceil and friends

* fix openpilot

* disable training

* disable test collectives
2023-09-28 09:14:43 -07:00
nimlgen
45f02393f0 HipGraph support (#1880)
* init hip graph

* optimize args update

* cache symbolic in jit

* remove NOSTAT

* init BasicBatchExecutor

* symbolic infer cache per jit instance

* basicbatchexec is default for compiled

* batch_exec is taken from ASTRunner

* no infer cache

* batched execution of hip graph

* add comment about hip graph batches

* readable hip graph
2023-09-24 20:14:36 +08:00
Szymon Ożóg
58296c079d Make Triton work again (#1547)
* Move ops_triton to runtime and remove errors from deprecated code

* Remove deprecated AST Kernel

* Remove deprecated buffer

* Add TritonProgram

* Triton Buffer

* Use RawCUDABuffer

* triton_compile

* Added new parameter

* pass _buf to program

* remove deprecated include

* Added triton tests

* Deprecated includes removed

* remove double print

* Disable float4 support

* Disable float4 support

* variable load fix

* Track local size

* Add pycuda to triton dependencies

* Merge test.yml

* install cuda packages for testing

* merge double package install

* remove emulated from triton tests

* upscale local index to power of 2 and add masking

* cuda envs

* Add TernaryOps

* ConstOp loading

* proper function name

* remove deprecated variables

* get global program from name

* const ops match local shape

* Enable test_nn

* remove deprecated import

* fix linter error

* Add wait logic

* Add local size override

* accumulate local shapes instead of using max shape

* Merge triton tests into global tests

* fix envs in testing

* Old testing routine

* split file into renderer and program

* remove print and starting whitespace

* pretty ptx print on debug 5

* linter errors

* ignore triton saturation tests

* ignore test example

* remove pytorch cpu extra index

* Add triton to existing testing routine

* use triton tests

* disable cuda backend in triton tests

* use cudacpu in tests

* print used device

* Print device default

* Remove print

* ensure we are running triton backend

* update variable signatures

* update dtypes for load

* infinity render fixed

* limit global size

* negative infinity now properly rendered

* split chain with parentheses for and node

* Add option to disable shared memory, disable for triton

* missing import

* Properly index and mask conditional load

* use mask only if not loading a block pointer

* nan support

* fix symbolic tests to include chain split

* proper masking for stores

* Implemented bool dtype

* Add mod

* fix loads for variables with valid range

* merge triton with cuda runtime

* merge from master

* run triton tests with cuda

* Correct target when running from triton

* conftest with triton compiler config

* use triton nightly

* verbose tests for triton

* capture stdout

* fix function depth when exiting multiple loops

* add render valid function for readability

* fix mask for local loops

* add _arg_int32 datatype

* fix dims for conditional loads

* enable non float stores

* correct variable dtypes

* fix type for arg_int32

* remove junk

* Added get max function for range based var.max

* remove deprecated code

* Fix triton ptxas path

* Fix testing for CI

* clamp local size by max local size instead of always running max

* Disable matmul test in triton cpu

* rerun tests

* Disable broken test in triton cpu

* whitespace removed

* rerun tests again

* Disable TestSymbolicOps for triton

* update to new uops

* linter fix

* ignore test/extra

* linting fix

* Update tinygrad/renderer/triton.py

Co-authored-by: Gijs Koning <gijs-koning@live.nl>

* remove deprecated line

* quotes type fix

* linter

* Remove unnecessary lines

* UnaryOps.NEG

* dont define constants

* Linting fix

* Disable tests that are broken in ocelot

* remove trailing whitespace

* reduce line count

* linting fix

* update to new uast

* New looping style

* Update to new uast

* make AST runner work with triton

* linting fix

* set renderer var for testing

* disable local for ocelot

* reenable all tests for ocelot

* Pass shared to cuda

* Don't group if the backend doesn't support shared mem

* use working gpuocelot branch

* enable all tests

* enable local for ocelot

* cleanup

* Update test.yml

* update cache key

* reenable test symbolic and extra

* Update test.yml

* Revert "Update test.yml" (rerun tests)

This reverts commit 98c0630ee5.

* Revert "fix symbolic tests to include chain split"

This reverts commit 22a9a4c9cd.

* Revert "split chain with parentheses for and node"

This reverts commit 7499a7004e.

* use global size from linearizer

* rename newvar to dtype to match other renderers

* join program start lines

* simplify code that adds axis to local dims

* assign r[u] in ssa

* We no longer need to replace target in src

* we no longer need to cast indices to int by hand

* Update triton.py (rerun tests)

* Update triton.py (rerun tests)

* Update triton.py (rerun tests)

---------

Co-authored-by: Gijs Koning <gijs-koning@live.nl>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-09-23 14:17:12 +08:00
qazal
d0e752003d fixes (#1893) 2023-09-22 07:20:27 +08:00
wozeparrot
009a99a0b1 feat: way cleaner hip wrapper (#1895) 2023-09-22 07:20:03 +08:00
kormann
864746d6aa polish print_tree (#1868)
* fix

* isinstance
2023-09-21 11:13:10 +08:00
chenyu
3ec301c2d7 apply view.py patch (#1844) 2023-09-10 17:32:15 -07:00
kormann
7ac65a93b4 utils.printtree (#1816)
* utils.printtree

* linter compliance

* rename to print_tree
2023-09-07 23:08:57 -07:00
George Hotz
4613c9e77c add tvm example, formatting (#1813)
* add tvm example

* no realize
2023-09-07 11:50:41 -07:00
Pavol Rusnak
52a92bf95d use class Foo: instead of class Foo(): (#1797)
* use class Foo: instead of class Foo():

* add ruff linter, copy settings from .flake8 to ruff.toml
2023-09-06 12:20:25 -07:00
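
The class change is purely syntactic: empty parentheses on a class with no explicit bases are redundant in Python 3, and the newly added ruff config can enforce their removal. For example:

```python
class Foo():    # old: redundant empty parentheses
    pass

class Foo:      # new: identical semantics, less noise
    pass
```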
geohotstan
9af5645ba3 onnx full passing (#1076)
* 1

* 83 failed

* learning how git works

* lol idk

* zero shape aaaa

* space lol

* aaa

* test check

* haha

* fixed gather

* 73 failing

* 71 failing

* 68 failing

* added some debug

* fking resize

* lol

* 62 failing

* 58 failing, fucking did nearest resize hell yeah

* clean up

* 56 failing

* janitor duty

* lol

* 53 failing

* hi mom

* 50 failing

* added linear interp, but coord_trans is wrong

* did lin interpolation woohoo

* 43 failing

* 40 failing

* temporary Gather fix

* 39 failing

* fixed slice onnxver<10

* 37 failing

* 35 failing

* excluded tests that use float64

* 32 failing with hacks

* added _batchnorm() for 3D 5D batchnorm, 29 failing

* changed ALLOWED_KERNEL_COUNT from 199 to 207

* added improved Gather op, reverted ALLOWED_KERNEL_COUNT commit

* support Round op

* added storage_order/indices maxpool, 27 failing

* support maxunpool, 25 failures

* support Gradient, 23 failures

* merged new where

* added Adam

* cleanups

* added Momentum and Nesterov Momentum

* added Adagrad

* support sequence_type, 20 failing

* ugh git

* I give up on cubic interp :D, 9 failing

* sexy 1 liner gather, much improved, wow

* polished gather to make it shine bright like a diamond

* clean 1 liner for gather

* improved readability of gather

* uhh

* clean up

* more clean up

* WHITEspace

* implemented SoftmaxCrossEntropyLoss op

* added comments and cleaned up if statements

* update

* thank based wozeparrot for pow and new GatherElements

* CPU and TORCH all pass | cast float64 -> float32 for all fromCPU()

* _nearest_gather() failing on yolo

* reverted ops_cpu change and added assert in Resize

* added comments for resize for multiple channels

* oops

* merge

* test

* switched np.pad to Tensor.pad for constant padding

* gah

* gah2

* sexy reflect pad with movementops -> add

* delete commented out lines

* edge mode pad sexy as well

* trying out model_benchmark

* revert gitignore change lol

* init

* Revert "init"

This reverts commit 682bf2073a.

* wrote cast workaround for CPU, CPU and TORCH all pass

* wrote cast workaround for CPU, CPU and TORCH all pass

* skipped tests w/ 0 shape for METAL and GPU

* excluded tests for CLANG, CPU, TORCH, CLANG pass

* fixed hacky ConvTranspose

* gotta figure out autopad

* UOps.STORE support cast bool -> float

* small fix for fast gather

* reverted 0 shape skipped tests

* oops missed a file

* added comment

* fixed slice op hack

* First commit to pr

* More trig ops

* More trig ops

* format

* isinf support

* More ops

* changed onnx_ops to use our new gather :D

* Det op bug fix

* rebase

* fixed some tests

* det broken and slow

* fixed compress to use new gather

* implemented argmax argmin

* support variable types in type_proto

* support Upsample and Identity sequence

* we support float64 now and tinygrad supports automatic broadcasting

* added EyeLike op

* resize does support multiple channels now actually

* yolov8 onnx runs successfully

* added batch size 1

* oops

* finally fixed type_proto I think

* fixed some llvm bugs

* del whitespaces

* added ZenginU Format PR

* test

* oops

* added float64 exclude tests back

* more skipped tests

* try

* ok openpilot pass

* flake8 pass

* woooooohooo

* revert external_model_benchmark changes

* perf tested gather

* removed promote types from ops_cpu

* numerical errors from 1681 are fixed

---------

Co-authored-by: ZenginU <umutzengin00@gmail.com>
2023-09-05 13:23:32 -07:00
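
Several of the gather commits above converge on the classic "one-liner": build a one-hot mask by comparing indices against an arange, then reduce, which expresses gather purely through the broadcastable ops tinygrad already has. A NumPy sketch of that pattern (illustrative, not the exact onnx_ops code):

```python
import numpy as np

def gather_rows(x, idx):
    # one-hot compare-and-reduce: mask[i, j] == 1 exactly when idx[i] == j
    mask = (idx[:, None] == np.arange(x.shape[0])[None, :]).astype(x.dtype)
    return mask @ x     # each output row selects one input row

x = np.arange(12, dtype=np.float32).reshape(4, 3)
print(gather_rows(x, np.array([2, 0, 3])))   # rows 2, 0 and 3 of x
```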
George Hotz
56abe04e4b disable assembly (#1755) 2023-09-04 09:41:20 -07:00
wozeparrot
bf05534c6e hip multidevice (#1728)
* feat: hip multidevice support + p2p

* feat: default device
2023-09-01 06:46:13 -07:00
Karan Handa
a8aa13dc91 [ready] Replacing os with pathlib (#1708)
* replace os.path with pathlib

* safe convert dirnames to pathlib

* replace all os.path.join

* fix cuda error

* change main chunk

* Reviewer fixes

* fix vgg

* Fixed everything

* Final fixes

* ensure consistency

* Change all parent.parent... to parents
2023-08-30 10:41:08 -07:00
nimlgen
1c0449e190 add cache collector (#1595)
* init cache collector

* add test_cache_collector.py

* switch GlobalCounters.cache to CacheCollector

* init jit models test

* jitted SD

* add debug msg to print loaded bufs count

* moved cache collector to jit

* clearer SD

* no double device import
2023-08-28 19:59:55 -07:00
George Hotz
a6d842af7a move device to ops (#1646)
* move device to ops

* mlops types

* 2 lines
2023-08-23 08:30:17 -07:00
George Hotz
718ced296c move state to nn/state (#1619) 2023-08-22 07:36:24 -07:00
Umut Zengin
f720682beb np.argmax to Tensor.argmax (#1608)
* to tensor argmax

* removed keepdim

* training update
2023-08-21 15:22:29 -07:00
Yixiang Gao
4d54afb6df sparse cat cross entropy (#1597)
* add sparse cat cross entropy

* minor fix

* add log_softmax into loss function

* add test

* update docs

* fix training loss

* add device
2023-08-21 14:14:54 -07:00
George Hotz
2e60920317 Revert "sparse cat cross entropy (#1591)" (#1596)
This reverts commit f0ee850e98.
2023-08-21 10:04:26 -07:00
Yixiang Gao
f0ee850e98 sparse cat cross entropy (#1591)
* add sparse cat cross entropy

* minor fix

* add log_softmax into loss function

* add test

* update docs
2023-08-21 09:56:41 -07:00
Yixiang Gao
8d6662a741 .cpu().numpy() -> .numpy() (#1594)
* .cpu().numpy() -> .numpy()

* restore ops_torch

* restore test_speed_v_torch
2023-08-21 09:53:29 -07:00
George Hotz
e464442adf WMMA for 7900XTX (#1563)
* go

* hip no LRU

* work

* works

* 16 TFLOPS

* 29 TFLOPS

* 30 TFLOPS

* never mind, it's 60 TFLOPS

* fix metal WMMA

* put hip alloc back
2023-08-19 09:07:23 -07:00
chenyu
ae39cf84ab Symbolic Shape JIT main PR (#1353)
* Symbolic Shape JIT

update tests

2 variables symbolic ops, adding more tests

test passing

cleanup

* more test cases

* single flag

* review update

* jit attention one piece

* realize

* symbolic_jit test for cuda

* old artifact

* works with cuda gpu but failed ci

* CUDACPU
2023-08-18 14:39:55 -07:00
wozeparrot
50decf0d45 train cifar using multigpu (#1529)
* feat: train cifar using multigpu

* feat: split eval batch across 5

* feat: cleaner allreduce

* feat: 93.88%

* feat: cleaner batch chunking from bert

* feat: cleaner grad sync

* feat: tinygrad argmax

* feat: make it work with different gpu counts

* feat: move some stuff into the normal __init__

* feat: autodetect gpu count

* feat: move import inside
2023-08-18 09:35:44 -07:00
wozeparrot
15150d60c4 fix: small fix for lru on hip (#1567) 2023-08-18 09:18:38 -07:00
Ethan Sorrell
cb62911f6b PTX Reintegration and Passing Tests (#1512)
* move assembly, assembly_ptx

* successful but broken rendering of ptx asm

* clear ins before render asm

* slightly less broken :')

* we needed thread syncs

* fix float16 loading, rounding modifiers and other casting stuff, passing casts_from_half

* Fix runtime_args for gpuocelot

* our casts were flipped on both ends

* more casting

* add ternary where op

* dealing with storing/loading bool

* add test for casting to bool from negative

* Fix args.valid on ConstOp

* add to CI, TODO: fix runtime_args for test_uops

* fix placement of runtime_args to work with lazy.Device

* undo ci changes so I can push

* fix lints

* start cleanup and fix things we broke fixing lints

* add checks for PTX specific asm instructions

* revert added test -- doesn't pass on llvm

* skip tests for underflow,overflow

* another fix for how we're setting runtime args

* Less broken cleanup

* add to CI

* add more env variables for ci test

* fix ci to install pycuda for ptx

* ci: copy cuda test command

* cleanup

* assert to make sure we're actually running ptx in ci

* remove test assert

* move is_ptx arg

* move assembly, assembly_ptx back to extras

* fix imports

* initial merge fixes

* clear registers, fix UOps.LOAD with invalid value

* draft merge fixes

* remove prints

* quick lint and merge fixes

* cleanup

* remove PTXProgram wrapper

* final cleanup

* temp change for ci rerun

* ci rerun

* rollback ISA version
2023-08-16 16:20:20 -07:00