* uops const fold rules to prevent tautological compare warnings
`bool < false` is false, `true < bool` is false, `a == a` is true, `a != a` is false
* not true for nan
* and nan does not work with llvm
* full truth table test
* revert a==a
* comments and indents
* docs: add warning message for conda users when using METAL
* fix: conda metal warning too long. disabled line length check
* docs: changed conda METAL warning to include DISABLE_COMPILER_CACHE=1
* fix(metal): now detecting invalid library magic
* format: removed noqa E501
* fix(metal): conda error line len
* fix: typo
---------
Co-authored-by: Léo Paillé <leo.paille@enseirb-matmeca.fr>
* Embedding is in one kernel
* embedding is one kernel
* rm extra line
* newline
* bert test counts state vars?
* add a test?
* move items around
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
* don't call contiguous for unpadded const into multi tensor
fixed multi const folding for sharded const.
still wip, need to be careful that this does not break multi device cache somewhere
* ehh need a memory test for that
* simple sharded memory test
* fix _to_const_val and const folding around it
is_unrealized_contiguous_const is too strict and almost never hit if const is expanded.
suffice to check if there's no pad
* that test is folded
* test_const_folding
* kfd driver wip
* cleanups
* kfd almost ready to ring doorbell
* ding dong?
* issues with signals
* something
* works
* ops kfd
* add amd_signal_t
* works...sometimes
* program runs
* _gpu_alloc cleanup
* cleanups
* work
* header + enable profiling (#3959)
* header + enable profiling
* just cleaner
* measure
* only local time domain
* remove old comments
* fix with master
* elf parsing (#3965)
* elf parsing
* fix kernels with private
* not used
* clean up
* clean up 2
* add flags
* kfd sdma (#3970)
* working sdma
* remove driver, shorter
* all commands we might need
* svm
* kfd remove hardcoded values (#4007)
* remove hardcoded values
* match above line
* 7k lines + revert hsa
* update that from origin
* fix sdma reg gen
* not the updated SDMA
* compiler_opts
* don't require kfd_ioctl
* get ioctls from python
* get ioctls from python
* remove build_sdma_command
* merge into 64-bit fields
* shorter
* fix property spelling and off by one
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
make it read nicer and cleanup some movement methods and math simplification.
790m, 1.4b, 2.8b model does not really run.
sampling is not implemented.
jit is incorrect.
some deadcode / wrong code path and copied from torch stuff stuff.
* always use f32 for source of randn
fixed bfloat16 randn to not have inf.
don't really care about float64. threefry is float32 based too
* HSA is broken
* test case Tensor.randn should be finite
there's a hack to fix float16, need a generic solution that works with bf16 and threefry
* skip not supported
* bfloat16 local is wrong
* skip RHIP
* Shape changing bitcast
* only support it on disk
* basic test
* more tests
* RuntimeError instead of assert
* create unique temp files
* move tests that use disk to test_disk_tensor
* linter
* remove assert on error messages
* that's RuntimeError now
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>