Commit Graph

476 Commits

Author SHA1 Message Date
chenyu
4ecd5789ab #include <tgmath.h> in ops_clang (#3927)
* different clang sqrt/log2/exp2/sin function based on dtype

fixed softmax_argmax issue in #3552 for clang.

* tgmath.h

* revert those
2024-03-25 17:48:57 -04:00
chenyu
dc508022a9 clean up clang src header (#3925)
don't need to define int64 and uchar
2024-03-25 15:18:35 -04:00
nimlgen
f2a9ea4ea9 lru allocator for copyin host buffers (#3918)
* lru allocator for copyin host buffers

* linter happy
2024-03-25 15:57:18 +03:00
George Hotz
e0e234bf94 hotfix, str compare version for cuda 2024-03-24 20:35:24 -07:00
Arseny Kapoulkine
715850aef9 Fix sm89 PTX=1 compilation (#3915)
* Fix sm89 PTX=1 compilation

The minimum PTX version that supports sm89 is 7.8 (the same version also
supports sm90); without this, ptxas fails when running tinygrad with
PTX=1 on an RTX 4090.

* Use int(arch[3:]) for forward compat with SM10.0 if that happens
2024-03-24 20:32:29 -07:00
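The `int(arch[3:])` note matters because comparing arch strings lexically breaks at SM 10.0 ("sm_100" < "sm_89" as strings). A hedged sketch of the idea, with the version fallback for older archs being an assumption, not from the commit:

```python
def ptx_version_for(arch: str) -> str:
    """Hypothetical sketch: pick the minimum PTX ISA version for an arch
    string like "sm_89". Parsing the capability numerically keeps the
    comparison correct for a future "sm_100" (100, not the string "100",
    which would sort before "89")."""
    cc = int(arch[3:])   # "sm_89" -> 89, "sm_100" -> 100
    if cc >= 89:
        return "7.8"     # first PTX ISA supporting sm_89/sm_90, per the commit
    return "7.5"         # placeholder for older archs (an assumption)
```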
chenyu
2e39f57594 move lines around in ops_python wmma (#3911) 2024-03-24 17:14:26 -04:00
chenyu
8c8b57fd5f cleanup ops python (#3908)
i just want to merge lars!
2024-03-24 11:36:31 -04:00
sekstini
7c3632fd1e add --minimal flag to nvrtc (#3899) 2024-03-23 16:38:31 -07:00
nimlgen
4e18dd78d3 faster program start in llvm (#3897) 2024-03-23 15:20:15 +03:00
George Hotz
46a3501cec nv ioctl sniffer (#3892)
* nv ioctl sniffer

* unused import

* Update __init__.py

* that work

* that fix it
2024-03-23 00:29:30 -07:00
nimlgen
16e31f7f0d init multidevice cuda graph (#3858)
* init multidevice cuda graph

* cuda just works!

* clean

* linter happier

* liners happy

* update transfer inputs

* do not change free

* useless check for cuda

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-22 13:49:48 -07:00
chenyu
1c51d586ea replace raise Exception with specific errors (#3874) 2024-03-22 12:32:21 -04:00
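The point of that change: a bare `raise Exception` can only be caught with a blanket `except Exception`, while a specific error type lets callers handle exactly the failure they expect. A hypothetical before/after illustration (the error name here is invented, not tinygrad's):

```python
class InvalidDtypeError(ValueError):
    """Hypothetical specific error; callers can catch it precisely."""

def check_dtype(dt):
    # before: raise Exception("bad dtype") -- catchable only via a bare except
    if dt is None:
        raise InvalidDtypeError("bad dtype")
```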
nimlgen
8ef5490ec8 cuda transfer + async copyin (#3873) 2024-03-22 09:01:37 -07:00
Szymon Ożóg
624bc89910 PTX - implement float4, ptr arithmetic and other speed improvements (#3775)
* ptx float4 implementation

* remove from cache when trimming uops

* Gate for float4

* Linting fix

* disable test reasonable time for ptx

* import getenv

* Update uops.py

* linter

* Add div test for half

* upcast if op does not support operation

* fix offset

* Run only if dtype supported

* zero out registers when accessing by pred + cleanup

* Remove trailing whitespace

* revert

* spacing fix

* move cache clearing outside loop

* did this suddenly start working?

* unused import removed

* Remove cast

* Use pattern matching

* linting

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-22 08:54:02 -07:00
George Hotz
f4055439dc don't include hip common (#3851)
* don't install hip common

* only that

* Revert "only that"

This reverts commit 85f22015d9.

* less

* needed

* sep comgr

* header file

* 6.0.2

* update hsa

* hsakmt

* Revert "hsakmt"

This reverts commit d3a118078e.
2024-03-22 08:50:50 -07:00
nimlgen
b78352b423 do not create structs every call in CUDAProgram (#3855)
* do not create structs in cuda

* fix graph

* linter

* do not exec twice

* fix graph
2024-03-21 17:51:40 +03:00
nimlgen
e5745c1a0d fix nan on multigpus cuda (#3854) 2024-03-21 15:21:55 +03:00
nimlgen
85691c8e20 fix hsa sync issue (#3847)
* fix hsa sync issue

* linter
2024-03-21 04:00:30 +03:00
nimlgen
2d54e4d747 clean up hsa driver (#3818)
* clean up driver

* remove returns
2024-03-20 00:17:41 +03:00
chenyu
5dd048a378 remove HIP in core tinygrad (#3810)
* remove HIP in core tinygrad

CI tests use device RHIP and the HSA compiler (LinearizerOpt), so it's fine to remove HIP from tc.
Also updated README and the EMULATE tc test flag

* EMULATE_CUDA
2024-03-18 18:19:27 -04:00
nimlgen
629757eaa1 hotfix: update inputs of correct transfers in hsagraph (#3800)
* hotfix: update inputs of correct transfers in hsagraph

* test it

* run in ci?
2024-03-18 15:52:27 -04:00
qazal
d79a1d315b add outbufs back (#3803)
* update outcounts

* update JIT

* refactor search

* hsa uses outcount
2024-03-18 10:30:53 -07:00
nimlgen
e78df485c7 update inputs for transfers in hsagraph (#3560) 2024-03-18 18:01:04 +03:00
George Hotz
53adcb34f5 remove hip backend (#3783)
* remove hip backend

* remove unused

* rhip

* more RHIP
2024-03-17 10:12:16 -07:00
George Hotz
2a14d1b5e0 Revert "add outbufs info to CompiledASTRunner (#3781)" (#3782)
This reverts commit 722dd4276c.
2024-03-17 09:47:23 -07:00
qazal
722dd4276c add outbufs info to CompiledASTRunner (#3781)
* add outbufs

* Revert "add outbufs"

This reverts commit 5f4c0668f5.

* simplify
2024-03-17 07:52:20 -07:00
nimlgen
91e181ee02 make alignment readable (#3766) 2024-03-15 23:18:40 +03:00
nimlgen
ba79a3c09a some hsa lines saving + fixes (#3752)
* fix write to ring + some lines

* hsa driver test
2024-03-15 18:12:18 +03:00
George Hotz
ca19eb3e82 where fold try 2 (#3748)
* where fold try 2

* assign fold

* test_where_fold works

* add gated store support to ops_python

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-15 07:46:26 -07:00
chenyu
75d4344cda UOps.BITCAST (#3747)
* UOps.BITCAST

implicitly fixed no const folding for bitcast

* python backend

* ptx

* consistent llvm
2024-03-14 21:00:35 -04:00
nimlgen
4b01c44579 hotfix: sdma/aql are visible again (#3733) 2024-03-14 10:33:22 +03:00
nimlgen
0f050b1028 hsa profiler (#3711)
* hsa profiler

* simpler

* profile

* copy -> is_copy

* print when saved

* faster

* do not create structs

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-13 21:19:22 -07:00
qazal
337cd53444 multioutput ScheduleItem (#3699)
* refactor realize.py

* update docs

* update test_sched

* update runners and devices

* update openpilot and unit tests

* cleanup runner lowering

* update more tests
2024-03-13 08:59:38 -07:00
Francis Lam
b6e2495fdd kernel: limit shared memory usage when adding opts (#3705)
* kernel: limit shared memory usage when adding opts

* search: remove unnecessary limit on search space

apply_opt will do the more correct check
2024-03-12 17:06:21 -04:00
nimlgen
798970cfad fix gpu hangs when exiting while aql queues are executing (#3700) 2024-03-12 19:23:23 +03:00
nimlgen
dd1a1c12df rocm path in autogen (#3697) 2024-03-12 14:06:43 +03:00
Patrick Tsai
971d7f5d7c O(n) arange attempt (#3530)
* It works?

* Clamp correctly

* Refactor

* Make code better

* Undo some stuff

* First step to trying to make floats work

* Floats work in Python op but not metal because int div is different

Python integer division was implemented as //, which rounds toward
negative infinity, but C integer division rounds toward 0, so there
was an off-by-one division error
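The rounding mismatch described above is easy to see directly; `math.trunc` reproduces C's truncating division in Python:

```python
import math

# Python's // floors toward negative infinity
assert -7 // 2 == -4
# C-style integer division truncates toward zero
assert math.trunc(-7 / 2) == -3
# the two agree when signs match or the division is exact
assert 7 // 2 == math.trunc(7 / 2) == 3
```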

* arange does cumsum with ints and then multiplies by step

This is so loop optimization can remain int only
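The cumsum-then-scale idea above can be sketched as follows; this is an illustrative stand-in, not the commit's implementation, and assumes a positive step:

```python
import math

def arange_int_cumsum(start, stop, step):
    """Hypothetical sketch: accumulate an integer index (a cumsum of ones)
    inside the loop and multiply by step only when producing the value, so
    the loop-carried state stays int even for a float step."""
    n = max(0, math.ceil((stop - start) / step))
    out, idx = [], 0
    for _ in range(n):
        out.append(start + idx * step)  # scale by step at the edge
        idx += 1                        # int-only accumulation
    return out
```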

* Undo a lot of symbolic changes

* Final check

* Cleanup

* There can be multiple phis

* Fix multiple phi op removal

* const sets dtype correctly

* Fix bugs

* Fix a couple bugs and add loop vars to resolve

* missed one

* Don't trim too many ops

* Fix symbolic test

* Use ones instead of full

* Delete test

* Lint passes

* max node error

* Small updates to loop logic

* Remove unnecessary changes

* We are getting somewhere

* Simple case

* Fix

* rm, prn

* Better

* If NumNode doesn't work then continue

* clamp is needed for arange(256)

* Move everything into the optim fn

* Replace correctly

* Order optimizations better

* Delete

* mypy

* Test for simplification

* Rename

* Fix test

* update test description

* Undo more

* Cleanup

* No replaced_ops map

* Fix lint

* AssertionError

* back again

* Reinstate assertion

* Return true and make diff not as big

* Bigger range for test

* Change cumsum impl

* fix bug

* make big cumsum work

* lint

* Undo cumsum 2-stage removal

* No while helper

* optional min/max clamping

* floats work

* rm giant arange test

* fix python cast None

* Check phi parents

* one phi allowed per where

* Fix one phi per where

* Rework iteration

* Delete assertions

* convert to int

* Try mul -1 instead of neg for hip..?

* Remove one phi per where requirements

* one accum only

* Lint

* should simplify a loop at a time

* Don't get rid of loop explicitly

* Need to iterate backwards

* lint

* unary neg

* Make optim work for onnx and sum_pad_collapse

* Better message

* filter alu ops correctly

* Fix the limiter

* lint and simplify

* Add it back

* off by one error

* test wheres and phis

* test max ops and non-if stuff

* <=

* cast_scalar

* Oops

* Change test

* Pass loop uops instead of a modified map

* Cut param transfer between linearizer and uops

* Fix issues

* Fix lint

* fix efficientnet python 3.8 invalid syntax

* distinct vars in seen_vars

* accurate var names

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-11 16:09:20 -07:00
Francis Lam
9f13960f72 search: catch RuntimeError when timing acted_lins (#3664)
when compilation succeeds but runtime fails due to thread limits
on METAL, this allows a beam search to proceed, treating the failure
the same way as a compile failure.
2024-03-11 16:14:03 -04:00
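The pattern in that commit generalizes: during an autotuning search, a candidate kernel that compiles but crashes at runtime should be scored as infinitely slow rather than aborting the search. A hypothetical sketch (function names invented for illustration):

```python
import math

def safe_time(compile_fn, run_fn, candidate):
    """Hypothetical sketch: score a BEAM-search candidate, treating a
    runtime failure (e.g. a METAL thread-limit violation) the same way
    as a compile failure so the search can continue."""
    try:
        prog = compile_fn(candidate)
    except Exception:
        return math.inf            # compile failure: prune this candidate
    try:
        return run_fn(prog)        # measured runtime in some unit
    except RuntimeError:
        return math.inf            # runtime failure: prune it the same way
```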
nimlgen
76ade20b89 hsa driver tiny cleanups (#3684) 2024-03-11 22:32:43 +03:00
George Hotz
ac02e7347d ptx timing vs cuda timing (#3659) 2024-03-08 10:17:49 -08:00
uuuvn
daa4034e80 No more metal flakiness (#3643) 2024-03-08 08:54:44 -08:00
George Hotz
6e50582e62 working to improve ptx (#3647)
* working to improve ptx

* fix compile fail
2024-03-07 12:39:31 -08:00
George Hotz
81baf3eed3 bring ptx back (#3623)
* bring ptx back

* ptx back

* fix define var

* fix a few bugs

* bugfixes

* fixes

* fix llvm bug

* fix test bug
2024-03-06 13:34:21 -08:00
nimlgen
3db826e195 hsa in lin opts (#3602) 2024-03-04 06:17:32 -08:00
nimlgen
640dc0fc51 hsa flush hdp (#3591)
* hsa flush hdp

* use _alloc()
2024-03-03 04:55:07 -08:00
George Hotz
aa9b013d79 add constant folding for WHERE in uops (#3584)
* add constant folding for WHERE in uops

* prereqs for generic constant folding

* fix test

* disable slow overflow logic

* make that test faster
2024-03-02 10:37:14 -08:00
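Constant folding for WHERE means a select whose condition is a compile-time constant collapses to one branch, letting the dead side drop out of the uop graph. A minimal hypothetical sketch of the transform (tuples stand in for uops; not tinygrad's representation):

```python
def fold_where(cond, true_val, false_val):
    """Hypothetical sketch: fold a WHERE (ternary select) when the
    condition is a known boolean constant; stay symbolic otherwise."""
    if cond is True:
        return true_val
    if cond is False:
        return false_val
    return ("WHERE", cond, true_val, false_val)
```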
nimlgen
3b7e3fa2e4 fix sync in hsa graph (#3582) 2024-03-02 07:37:51 -08:00
Francis Lam
e17f1821a7 wmma: add CUDA tensor core and fix test_speed_v_torch failure (#3544) 2024-03-01 17:51:02 -08:00
chenyu
b7e555f6c0 run test_linearizer_failures on PYTHON backend (#3565)
* run test_linearizer_failures on PYTHON backend

only test 1; some have hanging issues and gated store is not implemented

* --durations=20

* two less slow ones
2024-03-01 17:00:18 -05:00
George Hotz
2c19ab6561 define var (#3548)
* define var

* remove vars from there

* fix python symbolic ops

* fix llvm

* pypath
2024-02-29 16:43:27 -08:00