* kfd_ops: Fix GPU node discovery on NUMA systems
Ignore CPU NUMA nodes (there may be more than one) and any GPU nodes
that are not accessible because of device cgroups.
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
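For illustration only, a rough sketch of that discovery logic. The sysfs paths and property names come from the KFD topology, but the helper name and the exact accessibility check (here: whether the node's `properties` file can be read) are assumptions, not necessarily the check used in this change:
```python
# Sketch: enumerate usable GPU nodes from the KFD sysfs topology.
import os, glob

def visible_gpu_nodes(base="/sys/class/kfd/kfd/topology/nodes"):
  nodes = []
  for node_dir in sorted(glob.glob(os.path.join(base, "*"))):
    try:
      props = {}
      with open(os.path.join(node_dir, "properties")) as f:
        for line in f:
          parts = line.split()
          if len(parts) == 2: props[parts[0]] = int(parts[1])
    except OSError:
      continue  # node not readable, e.g. hidden by a device cgroup
    if props.get("simd_count", 0) == 0:
      continue  # CPU NUMA node: no compute units, not a GPU
    nodes.append((int(os.path.basename(node_dir)), props))
  return nodes
```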
* kfd_ops: Format the GFX arch target name correctly
The target version in sysfs properties is a decimal representation with
two digits per component.
The format for LLVM GFX target names is a bit quirky for historical
reasons. It uses one digit each for the minor version and the stepping.
When it ran out of decimal digits for the stepping on gfx90X, it switched
to hexadecimal there. The major version, however, is still decimal and
went to two digits in GFX10.
Make sure to parse and format it accordingly for all supported GPUs.
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
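A small sketch of that parsing (the helper name is hypothetical): the sysfs value packs major/minor/stepping as two decimal digits each, while the LLVM name keeps the major in decimal and renders minor and stepping as single hex digits:
```python
# Sketch: convert a sysfs gfx_target_version (two decimal digits per
# component) into an LLVM-style gfx target name.
def gfx_arch_name(target_version: int) -> str:
  major = target_version // 10000          # decimal, two digits from GFX10 on
  minor = (target_version // 100) % 100    # single digit in the name
  stepping = target_version % 100          # hex once it passed 9 (e.g. gfx90a)
  return f"gfx{major}{minor:x}{stepping:x}"

assert gfx_arch_name(90010) == "gfx90a"
assert gfx_arch_name(100300) == "gfx1030"
```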
---------
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
* init
* add failed case
* fix: temp comment out MULACC cast
* is this right?
* add test case
* oops, forgot to get rid of temp test
* WOOOOOO TOOK OUT 2 TRANSPOSES IN GATHER YAY
* cleaner
* comment cleanup
* update docs
* resolve conflict
* oops
* SUPA FAST
* comment out a test
* del some print statements
* use new broadcast stuff
* more clean up
* move try except
* skip fancy indexing for python backend test_ops
* search: add a BEAM_COMPARE env to optionally not compare to hc/tc
Setting BEAM_COMPARE=0 prevents the additional memory allocation
needed for the timing tests, assuming the BEAM result is already in
the diskcache.
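A rough sketch of the gating idea; `getenv` is tinygrad's env helper, but the function and variable names below are illustrative, not the actual search code:
```python
# Illustrative only: skip the hc/tc timing comparison (and its buffer
# allocations) when BEAM_COMPARE=0, trusting the cached BEAM result.
from tinygrad.helpers import getenv

def pick_kernel(beam_choice, baseline_choices, time_fn):
  if not getenv("BEAM_COMPARE", 1):
    return beam_choice  # assume the BEAM result in the diskcache is good enough
  # otherwise time every candidate, which needs extra scratch buffers
  return min([beam_choice] + baseline_choices, key=time_fn)
```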
* change to optionally use Buffer.allocate
* initial version
* heh gimme grrrreen
* version 2
* clean ups
* some test confusion
* fix onnx
* rename to _broadcast_tensors
* improved errors and test
* fixed?
* some test fixup
* version 3 lol
* comments
* cleaner
* add failure test for expand to 0 test
* 1 more assertRaises test
* make err msg better
* also rewrite the expand onnx op? :s
The annoying thing about removing all of FlopCounter is that, for devices that do not support local, the matmul index ALU count is huge.
We can remove the dtype first.
Also sneaks in updating the `ruff` command to `ruff check`.
* tensor cores
* Merge from master
* faster program start in llvm (#3897)
* Fix the result permutation in einsum (#3895)
* Fix permutation of result indices in einsum.
* Delete stray line used for breaking tests
* Fix linter error by renaming twice-used variable
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
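As a generic illustration of what that result permutation does (not the actual implementation, and assuming a tensor type with a `permute` method): after the contraction, the surviving axes come out in whatever order they were computed, so they must be permuted to match the order of the letters in the output subscript.
```python
# Generic sketch: reorder the contracted result's axes to match the
# output subscripts requested by the einsum formula.
def fix_output_order(result, computed_letters, output_letters):
  # e.g. computed_letters="ij", output_letters="ji" -> permute axes (1, 0)
  perm = tuple(computed_letters.index(l) for l in output_letters)
  return result if perm == tuple(range(len(perm))) else result.permute(perm)
```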
* touchup einsum (#3900)
don't need rhs_letters
* hotfix check ckpts before writing achieved model (#3901)
this killed the tinybox green run
* replace dtype.name str with render_dtype (#3903)
fixed a bf16 cast issue, since bf16 does not have a `.name`.
also more robust if there are language-specific type overrides
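A toy sketch of the idea, not tinygrad's renderer API: resolve the language-level type string through a per-backend mapping instead of `dtype.name`, so types like bfloat16 that need a backend-specific spelling still render.
```python
# Toy example: map dtype keys to backend type names instead of relying on
# dtype.name, so backend-specific overrides (e.g. bf16 on CUDA) work.
CUDA_TYPE_MAP = {"float32": "float", "half": "half", "bfloat16": "__nv_bfloat16"}

def render_dtype_name(dtype_key: str, type_map=CUDA_TYPE_MAP) -> str:
  return type_map.get(dtype_key, dtype_key)  # fall back to the generic name
```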
* add --minimal flag to nvrtc (#3899)
* wmma: fix the AMD TC threads to split the first 16 threads (#3904)
previously it was incorrectly aliasing 16 into the size-8 upcast
on the store alias. now it splits it properly into 8 and maps the
remaining 2 onto the correct local stride
* training cifar with BF16 on CUDA (#3905)
* training cifar with BF16 on CUDA
memory usage is between float and half because the numpy calls in dataset preprocessing convert to float.
* simpler bf16 functions
* bf16 cifar works for HSA too, just very slow
* simpler bf16 functions, we love cuda
* include negative float in test_dtype (#3884)
* include negative float in test_dtype
* that is ub
* too annoying
* pack can overflow
* add to benchmark
* change var name to satisfy mypy
* spacing
* Update to new TensorCore format
* Spacing
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: Alejandro F Queiruga <33233447+afqueiruga@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: sekstini <127142660+sekstini@users.noreply.github.com>
Co-authored-by: Francis Lam <flam@alum.mit.edu>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* mulacc
* Move more stuff to pattern matcher
* exclude callables from the == check
* disable function passing in pattern matcher
* Add set of dtypes pattern matching + refactor mulacc pattern