Commit Graph

525 Commits

Author | SHA1 | Message | Date
George Hotz
60e3aa5cb1 more docs (#4271)
* more work on docs

* CompilerOptions is dataclass
2024-04-24 10:52:42 +08:00
nimlgen
f3b4dff7c9 KFDProgram -> AMDProgram (#4268) 2024-04-24 00:29:50 +03:00
George Hotz
9a95781d51 renamed (#4260) 2024-04-23 09:00:28 +04:00
George Hotz
2ae4f45272 WIP PM4 Support (#4110)
* pm4 kernel launch works

* disable USE_THREAD_DIMENSIONS

* add kernel code

* work on real pm4

* pm4 signal

* same

* gate pm4

* hcq tests pass

* ops passes

* pm4 is closer

* pm4 debug (#4165)

* start debug tests passing

* prg

* smth

* hdp flush

* cleaner 1

* do not need this

* logs not needed

* small things

* linter

* remove AQL

* test hcq

* fix tests

* it's subtracting, it shouldn't be -1

* pm4 changes (#4251)

* not needed anymore

* sdma signal with non atomic

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-04-23 08:31:27 +04:00
nimlgen
e6227bdb15 nv driver (#4044)
* start

* fix err 93

* gpu

* ioctl mappings

* alloc like cuda

* semaphores

* wait for semaphores value

* start ops_nv

* very simple kernels work

* init several gpus

* qmd dumper

* dirty, but most of kernels work

* always all test_ops

* progress, more tests, stable

* test_ops passes, gpt2 works

but with big fifo, wrap of fifo doesn't work, i think it's something coherency related

* need better sync

* fix sync

* alloc2

* all tests pass!

* cleanup 1

* cleanup

* multigpu, simple transfer

* fix sync

* correct init

* nv_gpu autogen + sync bug fix

* clean extra/nv_gpu_driver

* p2p

* clean up

* remove old gen

* small fixes

* cleanup

* cleanup 2

* small fixes

* bigger queue size

* cleanups

* wait

* fixed signals for devs

* fix hang + parallel beam

* small fixes

* detect when local memory is big in kernel

* correct assert

* small fixes

* correct tls size est

* one va space

* less lines

* shorter

* save 2 lines

* save some lines

* remove type ignores

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-22 19:50:20 +04:00
Micah Zoltu
7bc862767c Improves error message when CUDA module fails to load. (#4243) 2024-04-21 11:10:14 -04:00
George Hotz
b9570d6100 clean up update stats (#4226)
* WIP: clean up update stats

* line savings now

* fix graphs

* fix tests

* tighter prints

* remove extra jit=false

* debug=2 means wait

* that won't update stats

* still wait
2024-04-19 15:41:30 +04:00
nimlgen
4ed6b42a8a fix kernargs check in kfd (#4194) 2024-04-17 00:44:50 +03:00
George Hotz
b6e7243bfa hotfix: skip slow pre-commit test 2024-04-16 11:48:43 +04:00
nimlgen
24a27a01a9 hotfix: CUDA_P2P works (#4155) 2024-04-12 18:20:12 +03:00
nimlgen
5a57b48134 cuda p2p enable when available (#4153) 2024-04-12 16:21:54 +03:00
George Hotz
bbda20c0db CompiledASTRunner -> CompiledRunner (#4148) 2024-04-11 08:49:52 -07:00
George Hotz
b7e281cf10 JitItem -> ExecItem (#4146)
* JitItem -> ExecItem

* execitem in realize

* cleaner

* JITRunner -> Runner
2024-04-11 08:24:57 -07:00
George Hotz
081dd1573f hotfix: keep CUDA D2D copy behind the CUDA_P2P flag 2024-04-10 21:36:48 +00:00
George Hotz
af5984df43 cudagraph memcpy through host (#4137) 2024-04-10 13:17:17 -07:00
George Hotz
ee457a4b20 no more underlying diskbuffer, that's just the device (#4129) 2024-04-10 08:32:25 -07:00
Felix Kuehling
38ae4194a6 Fixes for ops_kfd (#4105)
* kfd_ops: Fix GPU node discovery on NUMA systems

Ignore potentially multiple CPU NUMA nodes and any GPU nodes that are
not accessible because of device cgroups (see the filter sketch after
this entry).

Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>

* kfd_ops: Format the GFX arch target name correctly

The target version in sysfs properties is a decimal representation with
two digits per component.

The format for LLVM GFX target names is a bit quirky for historical
reasons. It uses one digit for the minor version and stepping. When it
ran out of decimal digits for the stepping on gfx90X it started using
hexadecimal there. But the major version is still decimal and went
double digit in GFX10.

Make sure to parse and format it accordingly for all supported GPUs
(see the conversion sketch after this entry).

Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>

---------

Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
2024-04-09 13:21:21 -07:00
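
The node filter described in 38ae4194a6 can be sketched as follows. The sysfs topology path and the `simd_count`/`drm_render_minor` properties are standard KFD; the accessibility test via opening the render node is an assumption for illustration, not the commit's verbatim code:

```python
import glob, os

def visible_gpu_nodes():
  gpus = []
  for node in sorted(glob.glob("/sys/devices/virtual/kfd/kfd/topology/nodes/*")):
    with open(f"{node}/properties") as f:
      props = dict(line.split() for line in f if line.strip())
    if int(props.get("simd_count", "0")) == 0: continue  # CPU NUMA node: no compute units
    try:  # device cgroups can hide the render node even though topology lists it
      os.close(os.open(f"/dev/dri/renderD{props['drm_render_minor']}", os.O_RDWR))
    except OSError:
      continue
    gpus.append(node)
  return gpus
```

The target-version conversion from the second part of the commit is mechanical once the packing is known; a minimal sketch (the example values are illustrative):

```python
def gfx_arch(target_version: int) -> str:
  # sysfs packs the version as decimal with two digits per component:
  # major*10000 + minor*100 + stepping, e.g. 90012 -> (9, 0, 12)
  major, minor, stepping = target_version // 10000, (target_version // 100) % 100, target_version % 100
  # LLVM names keep the major version decimal but spend one (hex) digit
  # each on minor and stepping, so 90012 -> "gfx90c", 100300 -> "gfx1030"
  return f"gfx{major:d}{minor:x}{stepping:x}"

assert gfx_arch(90012) == "gfx90c"
assert gfx_arch(100300) == "gfx1030"
```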
George Hotz
ae849d12d7 numpy device + pickle it (#4120) 2024-04-09 13:19:30 -07:00
andresgit
7fd12aba85 graph remove input buffer references (#4100)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-08 16:49:16 -04:00
George Hotz
444d2a7487 hotfix: fix SDMA read_pointer_address in KFD 2024-04-07 13:13:15 +00:00
uuuvn
bb7567b365 Fix metal (#4101) 2024-04-07 05:21:19 -07:00
George Hotz
8739d33fe9 kfd: disable copy_from_fd while debugging (#4091)
* kfd: disable copy_from_fd while debugging

* increase timeout to a minute
2024-04-05 18:02:58 -07:00
George Hotz
164329a8ea address kfd feedback (#4087)
* address kfd feedback

* signals cleanup

* signals cleanup

* handle 2 doorbell pages correctly

* signal reset cleanup

* signals cleanup

* more GTT

* cleanups

* minor cleanups
2024-04-05 15:24:41 -07:00
George Hotz
a337922c44 more work on kfd (#4079)
* more work on kfd

* fix multitensor test on kfd

* stuff
2024-04-05 08:36:36 -07:00
George Hotz
28ec6c67be hotfix: hlb_cifar KFD works 2024-04-05 02:19:14 +00:00
chenyu
1de9778949 import Buffer and BufferOption from tinygrad.buffer (#4076) 2024-04-04 22:12:23 -04:00
George Hotz
3de855ea50 don't use SVM memory in KFD (#4072)
* don't use SVM memory in KFD

* copy from fd

* cleanups

* transfer

* hacks

* ops_hsa

* tighter API
2024-04-04 17:33:21 -07:00
George Hotz
3e72d745ea hotfix: make KFD timings right 2024-04-04 05:55:29 +00:00
George Hotz
58d162315c Continuing KFD work (#4065)
* cleanups

* fix kernargs ptr

* mypy passes
2024-04-03 22:48:13 -07:00
chenyu
d219aba962 prepend CLANG_PROGRAM_HEADER in ClangCompiler.render instead of compile (#4063)
src header should be part of the rendered output, and DEBUG=4 includes the header this way
2024-04-03 23:17:56 -04:00
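
A minimal sketch of the idea in d219aba962, with stand-in names and header contents: prepending at render time means compile receives exactly the source a DEBUG=4 dump shows.

```python
CLANG_PROGRAM_HEADER = "#include <math.h>\n"  # stand-in for the real header

def render(kernel_src: str) -> str:
  # the header is now part of the rendered source itself
  return CLANG_PROGRAM_HEADER + kernel_src

def compile(rendered_src: str) -> bytes:
  # no header injection here anymore: this compiles exactly what a
  # DEBUG=4 dump of the rendered source would show
  return rendered_src.encode()  # stand-in for the actual clang invocation

print(render("float f(float x) { return sqrt(x); }"))
```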
George Hotz
7181ffd630 HWCopyQueue in KFD (#4042)
* HWCopyQueue in KFD

* hw compute queue

* test

* move test

* more tests

* fix wait

* fix multimap

* mes crash

* tests pass but slow

* stuff is working

* one more test
2024-04-03 20:14:24 -07:00
Léo
e879e16c48 docs: add warning message for conda users when using METAL (#3917)
* docs: add warning message for conda users when using METAL

* fix: conda metal warning too long. disabled line length check

* docs: changed conda METAL warning to include DISABLE_COMPILER_CACHE=1

* fix(metal): now detecting invalid library magic (see the sketch after this entry)

* format: removed noqa E501

* fix(metal): conda error line len

* fix: typo

---------

Co-authored-by: Léo Paillé <leo.paille@enseirb-matmeca.fr>
2024-04-02 09:22:24 -07:00
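
The magic check added in e879e16c48 can be sketched like this: compiled .metallib binaries begin with the b"MTLB" magic bytes, and a conda toolchain can hand back something else that previously failed later with a confusing error. The function name and message here are illustrative:

```python
def validate_metal_library(lib: bytes) -> bytes:
  # compiled .metallib binaries start with the "MTLB" magic bytes
  if lib[:4] != b"MTLB":
    raise RuntimeError("Invalid Metal library. Consider DISABLE_COMPILER_CACHE=1 (common under conda)")
  return lib
```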
George Hotz
506b1c5892 multigpu works (#4040) 2024-04-02 08:29:37 -07:00
George Hotz
7425a0c646 CommandQueue is the future (#3950)
* start of command queue

* cq work

* runs

* cleanup

* outs set

* read is gone

* future buffer work

* command queue is better

* command queue works

* loadops

* delete unneeded

* command queue works

* upd

* fix tests

* use CommandQueue in compile

* delay sync
2024-04-01 17:35:48 -07:00
chenyu
0a34d6016b move exec_alu from uops to ops (#4033)
will use this for const folding in lazy too
2024-04-01 17:20:53 -07:00
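
An `exec_alu`-style helper just evaluates an op on plain python values, which is why moving it out of uops lets the lazy layer reuse it for const folding. A self-contained sketch with an illustrative op subset (the real signature also takes a dtype):

```python
import operator
from enum import Enum, auto

class BinaryOps(Enum): ADD = auto(); MUL = auto(); MAX = auto()  # illustrative subset

python_alu = {BinaryOps.ADD: operator.add, BinaryOps.MUL: operator.mul, BinaryOps.MAX: max}

def exec_alu(op, operands):
  # evaluate the op on python constants: usable both when simplifying
  # uops and when const folding in the lazy graph
  return python_alu[op](*operands)

assert exec_alu(BinaryOps.MUL, (3, 4)) == 12
```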
nimlgen
d6ba44bc1e kfd free buffers (#4027)
* kfd free buffers

* unmap

* all test passes

* better pm4

* forgot these

* invalidate only range

* better cache

* forgot

* comments

* fixes
2024-04-01 15:50:58 -07:00
nimlgen
7fa233e8c9 kfd fix kernels with private memory (#4018)
* kfd fix kernels with private memory

* linter
2024-04-01 00:01:30 +03:00
George Hotz
2abb474d43 kfd driver wip (#3912)
* kfd driver wip

* cleanups

* kfd almost ready to ring doorbell

* ding dong?

* issues with signals

* something

* works

* ops kfd

* add amd_signal_t

* works...sometimes

* program runs

* _gpu_alloc cleanup

* cleanups

* work

* header + enable profiling (#3959)

* header + enable profiling

* just cleaner

* measure

* only local time domain

* remove old comments

* fix with master

* elf parsing (#3965)

* elf parsing

* fix kernels with private

* not used

* clean up

* clean up 2

* add flags

* kfd sdma (#3970)

* working sdma

* remove driver, shorter

* all commands we might need

* svm

* kfd remove hardcoded values (#4007)

* remove hardcoded values

* match above line

* 7k lines + revert hsa

* update that from origin

* fix sdma reg gen

* not the updated SDMA

* compiler_opts

* don't require kfd_ioctl

* get ioctls from python

* get ioctls from python

* remove build_sdma_command

* merge into 64-bit fields

* shorter

* fix property spelling and off by one

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-30 15:08:12 -07:00
nimlgen
478c040e1c hsa terminate without exceptions (#4006)
* hsa terminate without exceptions

* cleaner

* linter
2024-03-30 16:03:46 +03:00
chenyu
d9ff636cf5 use is to compare with enum (#3993)
* use is to compare with enum

currently it's mixed between `==` and `is`, moved all to `is`

* more
2024-03-29 13:02:56 -04:00
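
The reason `is` is safe here: enum members are singletons, so identity comparison is correct and slightly stricter than `==`. A small illustration (the op enum is a stand-in):

```python
from enum import Enum, auto

class UnaryOps(Enum):  # stand-in for tinygrad's op enums
  EXP2 = auto(); LOG2 = auto()

op = UnaryOps.EXP2
assert op is UnaryOps.EXP2         # identity holds: members are singletons
assert not (op is UnaryOps.LOG2)
# `==` also works, but `is` can't be fooled by a custom __eq__ and makes
# the "same member" intent explicit
```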
uuuvn
8a40d7d423 Shape changing bitcast and assert bitcast in disk (#3973)
* Shape changing bitcast

* only support it on disk

* basic test

* more tests

* RuntimeError instead of assert

* create unique temp files

* move tests that use disk to test_disk_tensor

* linter

* remove assert on error messages

* that's RuntimeError now

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-28 21:49:10 -07:00
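
A usage sketch of the feature from 8a40d7d423, assuming the disk device and `Tensor.bitcast` API of that era (the path is hypothetical): reinterpreting the bytes under a different dtype scales the size by the itemsize ratio, and it is only permitted on disk tensors.

```python
from tinygrad import Tensor, dtypes

t = Tensor.empty(4, dtype=dtypes.float32, device="disk:/tmp/weights.bin")  # hypothetical file
u = t.bitcast(dtypes.uint16)  # 4 * 4 bytes reinterpreted as 8 * 2 bytes
print(u.shape)                # (8,): the shape changed with the dtype
# on a non-disk device the same call raises a RuntimeError instead
```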
chenyu
b47f6cebb2 LinearizerOptions -> CompilerOptions (#3978) 2024-03-28 17:50:23 -04:00
George Hotz
2cfcb5623a hotfix: d was removed from buffer 2024-03-28 13:39:02 -07:00
George Hotz
42b9d999ea Buffer isn't always allocated (#3974)
* buffer alloc

* allocate

* missing allocates

* last one
2024-03-28 13:33:47 -07:00
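
After 42b9d999ea, constructing a `Buffer` records metadata only and device memory arrives via an explicit allocate step. A hedged usage sketch (the import path follows 1de9778949 above; exact signatures may differ):

```python
from tinygrad import dtypes
from tinygrad.buffer import Buffer

buf = Buffer("CLANG", 16, dtypes.float32)  # no device memory reserved yet
buf.allocate()                             # backing allocation happens here
```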
Francis Lam
7c5729a3bd wmma: refactor to remove wmma_func and create TC funcs as needed (#3945)
* wmma: refactor to remove wmma_func and create TC funcs as needed

* test_linearizer: disable bf16 CUDA during emulation testing

* cstyle: clean up creation of CUDA vec dtypes

* extra/gemm: add option to accumulate to bfloat16

* cleanups

* benchmark: add CUDA bfloat16 matmul

* more cleanups
2024-03-27 16:43:09 -04:00
chenyu
6c7df1445b enforce UOps.CONST arg has python type based on dtype (#3952)
added an assert in uops, removed the cast in renderer
2024-03-27 01:41:38 -04:00
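
The invariant enforced in 6c7df1445b can be sketched independently of the uop machinery (the dtype names and mapping are illustrative): the arg must already carry the python type matching its dtype, so renderers no longer need to cast.

```python
def check_const_arg(dtype: str, arg) -> None:
  expected = {"float32": float, "int32": int, "bool": bool}[dtype]  # illustrative mapping
  assert isinstance(arg, expected), f"CONST arg {arg!r} should be {expected.__name__} for {dtype}"

check_const_arg("float32", 1.0)   # fine
# check_const_arg("float32", 1) would raise: the renderer no longer casts
```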
George Hotz
150ea2eb76 create engine folder and move code (#3948)
* retry

* older tf

* that
2024-03-26 20:38:03 -07:00
nimlgen
e2d6f76723 _alloc and _free with options (#3934)
* _alloc has options

* linter

* fix hsa
2024-03-26 09:11:41 -07:00
nimlgen
739f47eb0f check on cuEventSynchronize (#3933) 2024-03-26 16:14:38 +03:00
chenyu
4ecd5789ab #include <tgmath.h> in ops_clang (#3927)
* different clang sqrt/log2/exp2/sin function based on dtype

fixed softmax_argmax issue in #3552 for clang.

* tgmath.h

* revert those
2024-03-25 17:48:57 -04:00