* tensor cores
* Merge from master
* faster program start in llvm (#3897)
* Fix the result permutation in einsum (#3895)
* Fix permutation of result indices in einsum.
* Delete stray line used for breaking tests
* Fix linter error by renaming twice-used variable
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* touchup einsum (#3900)
don't need rhs_letters
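As a quick illustration of the einsum permutation fix above, here is a minimal sketch (assuming `Tensor.einsum` with numpy as the reference) where the output subscripts are a permutation of the natural result order:

```python
# minimal sketch: output subscripts "ki" permute the natural "ik" result order,
# which is the kind of case the permutation fix covers; numpy is the reference
from tinygrad import Tensor
import numpy as np

a, b = Tensor.rand(2, 3), Tensor.rand(3, 4)
out = Tensor.einsum("ij,jk->ki", a, b)              # shape (4, 2), i.e. (a @ b).T
ref = np.einsum("ij,jk->ki", a.numpy(), b.numpy())
assert np.allclose(out.numpy(), ref, atol=1e-5)
```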
* hotfix check ckpts before writing achieved model (#3901)
this killed the tinybox green run
* replace dtype.name str with render_dtype (#3903)
fixed a bf16 cast issue, since it does not have `.name`.
also more robust if there are language-specific type overrides
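A hypothetical sketch of the idea (not this PR's actual `render_dtype`): keep the dtype-to-source-string mapping in the renderer instead of relying on `dtype.name`, so a backend can override language-specific spellings such as bf16. The mapping contents here are assumptions.

```python
# hypothetical sketch, not the PR's implementation: the renderer owns the
# dtype -> source-string mapping, so backend-specific names override dtype.name
from tinygrad.dtype import dtypes, DType

CUDA_TYPE_MAP: dict[DType, str] = {dtypes.bfloat16: "__nv_bfloat16", dtypes.half: "half"}  # assumed names

def render_dtype(dt: DType) -> str:
  # fall back to the generic name only when no backend override exists
  return CUDA_TYPE_MAP.get(dt, dt.name)
```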
* add --minimal flag to nvrtc (#3899)
* wmma: fix the AMD TC threads to split the first 16 threads (#3904)
previously it incorrectly aliased all 16 threads into the size-8 upcast
on the store alias. now it splits them properly into 8, with the
remaining 2 mapped to the correct local stride
* training cifar with BF16 on CUDA (#3905)
* training cifar with BF16 on CUDA
memory usage is between float and half because numpy calls in dataset preprocessing convert to float (see the sketch below).
* simpler bf16 functions
* bf16 cifar works for HSA too, just very slowly
* simpler bf16 functions, we love cuda
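A minimal sketch of the memory note above, assuming a numpy-preprocessed batch: numpy has no bfloat16, so data arrives as float32 and is cast on-device, which is why peak memory lands between float and half.

```python
# numpy preprocessing yields float32; the cast to bfloat16 happens on-device
from tinygrad import Tensor, dtypes
import numpy as np

batch = np.random.rand(64, 3, 32, 32).astype(np.float32)   # preprocessing output
x = Tensor(batch).cast(dtypes.bfloat16)                     # cast for BF16 training
print(x.dtype)                                              # dtypes.bfloat16
```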
* include negative float in test_dtype (#3884)
* include negative float in test_dtype
* that is UB
* too annoying
* pack can overflow
* add to benchmark
* change var name to satisfy mypy
* spacing
* Update to new TensorCore format
* Spacing
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: Alejandro F Queiruga <33233447+afqueiruga@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: sekstini <127142660+sekstini@users.noreply.github.com>
Co-authored-by: Francis Lam <flam@alum.mit.edu>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* mulacc
* Move more stuff to pattern matcher
* disable callables in the == check
* disable function passing in pattern matcher
* Add set of dtypes pattern matching + refactor mulacc pattern
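A hypothetical standalone sketch of the mulacc idea (not tinygrad's PatternMatcher API): match an ADD whose operand is a MUL, restrict the match to an assumed set of dtypes, and rewrite the pair into a single MULACC. Op names, the dtype set, and the UOp shape are all assumptions.

```python
# hypothetical standalone sketch; op names, dtype set, and UOp layout are assumptions
from dataclasses import dataclass

@dataclass(frozen=True)
class UOp:
  op: str
  dtype: str
  src: tuple = ()

FLOATS = {"float32", "half", "bfloat16"}  # assumed dtype set for the match

def fold_mulacc(u: UOp) -> UOp:
  # ADD(MUL(a, b), c) -> MULACC(a, b, c) when the dtypes line up
  if u.op == "ADD" and u.dtype in FLOATS and len(u.src) == 2:
    for mul, other in ((u.src[0], u.src[1]), (u.src[1], u.src[0])):
      if isinstance(mul, UOp) and mul.op == "MUL" and mul.dtype == u.dtype:
        return UOp("MULACC", u.dtype, (*mul.src, other))
  return u
```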
* HWCopyQueue in KFD
* hw compute queue
* test
* move test
* more tests
* fix wait
* fix multimap
* mes crash
* tests pass but slow
* stuff is working
* one more test
* uops const fold rules to prevent tautological compare warnings
`bool < false` is false, `true < bool` is false, `a == a` is true, `a != a` is false
* not true for NaN
* and NaN does not work with LLVM
* full truth table test
* revert a==a
* comments and indents
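A quick standalone illustration of why the `a == a` fold had to be reverted: IEEE-754 comparisons involving NaN are all false except `!=`, while the boolean ordering bounds still fold safely.

```python
# NaN breaks a == a -> true; the boolean ordering folds remain safe
nan = float("nan")
assert (nan == nan) is False and (nan != nan) is True and (nan < nan) is False
# nothing compares below False, and True compares below nothing
assert all((b < False) is False and (True < b) is False for b in (False, True))
```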
* docs: add warning message for conda users when using METAL
* fix: conda metal warning too long. disabled line length check
* docs: changed conda METAL warning to include DISABLE_COMPILER_CACHE=1
* fix(metal): now detecting invalid library magic
* format: removed noqa E501
* fix(metal): conda error line len
* fix: typo
---------
Co-authored-by: Léo Paillé <leo.paille@enseirb-matmeca.fr>
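A minimal sketch of the workaround the warning points conda users to, assuming the `DISABLE_COMPILER_CACHE` environment variable is set before tinygrad is imported:

```python
# disable the compiler cache so a stale/invalid cached Metal library is never loaded
import os
os.environ["DISABLE_COMPILER_CACHE"] = "1"

from tinygrad import Tensor
print((Tensor.ones(4) + 1).numpy())   # should run on METAL without the cache
```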
* Embedding is in one kernel
* embedding is one kernel
* rm extra line
* newline
* bert test counts state vars?
* add a test?
* move items around
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
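A minimal usage sketch for the Embedding change above (the parameter values are arbitrary): a lookup over a batch of token ids, which after this change is expected to realize as a single kernel.

```python
from tinygrad import Tensor
from tinygrad.nn import Embedding

emb = Embedding(vocab_size=1000, embed_size=64)
ids = Tensor([[1, 2, 3], [4, 5, 6]])   # (batch, sequence) token indices
out = emb(ids)
print(out.shape)                        # (2, 3, 64)
```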
* don't call contiguous for unpadded const into multi tensor
fixed multi const folding for sharded const.
still WIP; need to be careful that this does not break the multi-device cache somewhere
* ehh need a memory test for that
* simple sharded memory test
* fix _to_const_val and const folding around it
is_unrealized_contiguous_const is too strict and almost never hit if the const is expanded.
it suffices to check that there is no pad (see the sketch below)
* that test is folded
* test_const_folding
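A minimal sketch of the sharded-const case, assuming two virtual devices of the default backend: an unpadded const sharded across devices can fold into later ops without being forced contiguous.

```python
from tinygrad import Tensor, Device

devices = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))
x = Tensor.rand(4, 4).realize().shard(devices, axis=0)
ones = Tensor.ones(4, 4).shard(devices, axis=0)    # sharded const, no pad
print((x * ones).realize().numpy().shape)          # (4, 4); the multiply by 1 can fold away
```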
* kfd driver wip
* cleanups
* kfd almost ready to ring doorbell
* ding dong?
* issues with signals
* something
* works
* ops kfd
* add amd_signal_t
* works...sometimes
* program runs
* _gpu_alloc cleanup
* cleanups
* work
* header + enable profiling (#3959)
* header + enable profiling
* just cleaner
* measure
* only local time domain
* remove old comments
* fix with master
* elf parsing (#3965)
* elf parsing
* fix kernels with private
* not used
* clean up
* clean up 2
* add flags
* kfd sdma (#3970)
* working sdma
* remove driver, shorter
* all commands we might need
* svm
* kfd remove hardcoded values (#4007)
* remove hardcoded values
* match above line
* 7k lines + revert hsa
* update that from origin
* fix sdma reg gen
* not the updated SDMA
* compiler_opts
* don't require kfd_ioctl
* get ioctls from python
* get ioctls from python
* remove build_sdma_command
* merge into 64-bit fields
* shorter
* fix property spelling and off by one
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
make it read nicer and clean up some movement methods and math simplifications.
the 790M, 1.4B, and 2.8B models do not really run.
sampling is not implemented.
the JIT is incorrect.
there is some dead code, wrong code paths, and leftover code copied from torch.