* noop
* fix noop
* store cat is NOOP
* store dtype is void
* stores aren't passed through anymore
* meh, skip those for ptx
* correct ptx skip
* hl runs
* local metal on metal in uop syntax
* TODO: just put the axis_info in the kernelinfo
* local
* amd_matmul works @ 28 TFLOPS
* clean up matmul
* kernel8 works
* remove that
* locals
* axistype innovation
* work
* cleanup
* kernel3 regs
* cleanup kernel3
* work
* why is it broken
* no beam
* reenable
* permutes
* Kernel.apply_opts [pr]
updated all `for opt in`. also updated a few test_linearizer tests to not implicitly depend on hand_coded_optimization
* not you yet
* extra/gemm/max_matmul: start of custom kernels for GEMM
* add an unoptimized FP16/FP16 MMA example
* add slow 3-stage fp16 acc example
* add correct 3-stage pipeline with unswizzled/flat smem input (slow)
* add acc fp16 example with 3 stages and swizzle (no bank conflicts)
* add max version of NV fp16_fp16_fp16
* fix up comments and remove unused code in max variations
* add start of no_xor example
* fix to account for UOps to Ops
* np generates randoms
* hotfix: use generator for int dtype
* float32 as default dtype for float generator
* use np.float32 instead of string
* add dtype= to integers generator
* change import _to_np_dtype source
* add some docs about speed [pr]
* better torch gemm
* enable locals on llvm/clang
* disable locals for beam speed on LLVM/CLANG
* 0x20 alignment in llvm allows ymm use
* calling qualcomm dsp from python
* include so files
* add include file
* adsprpc.py
* running with adsprpc
* work
* 32-bit support in elf
* compilation works
* ion
* msm_ion
* working DSP backend
* getting 500 MFLOPS on matmul
* beam works with timing
* move to autogen
* disasm
* progress
* simple tests pass
* qcom_dsp
* more dsp autogen
* progress
* some progress
* works w/o lib
* checkpoint
* no lib
* ugh, better
* cleaner, but with lib. test good, but with the hack
* remove autogens
* small
* push
* simpler
* revert this
* run_3
* simpler
* android
* handle
* run it
* why?
* run2
* to gen
* cc
* cleaner
* elf
* part of autogen
* comment
* no lib
* autogen
* linter
* bug reproducer
* cleaner
* this repro is almost empty and doesn't work!!!!
* with this test_ops passes, no crashes anymore
* cleaner
* linter
* renames
* shorter
* remove contextlib
* ugh
* mypy
* cleaner
* cleaner
* remove import
* conn
* import
* revert this
* remove heavy .so
* shorter alloc
* not true anymore
---------
Co-authored-by: Comma Device <device@comma.ai>
Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: George Hotz <george@comma.ai>
* fixed xmx demo
* i think i'm invoking the DPAS but it's slow
* compiler build arg to stop register spilling, indicated where to fix flop counter
* don't mind this
* do NOT mind me
* do not mind me
* do not view
* i will add bf16 later
* in process of figuring out tc fields
* we figured out the fields!!!
* added check for cl device vendor, added separate IntelRenderer
* remove tc thread_local_aliases
* cleaning debris before draft pr
* edits for linter
* deduping and checking device extensions
* i will find more line reductions in other places
* before merge upstream
* double grf size in compiler to fix register spilling (bandaid), device checking changes
* tc python emulation
* fixed emulation
* tests for emulated intel tensor core
* TC=0, 1 working on upstream, fixed perf
* test
* debris
* check for specialized cl device when we canonicalize device
* bf16 support, tc=3 test added
* address tests
* revert half2 loads on intel tc, cleanup
* linter
* fold_expanded revert
* lint, whitespace fix
* cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too
* make line shorter, no need for noqa E501
* removed device intel
* fix python emulation
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>