* Revert "late gate creation for STORE [run_process_replay] (#6373)"
This reverts commit c26744de9f.
* Revert "gated store rewrite to UOps.IF (#5976)"
This reverts commit 48061e8400.
* almost working with relu, even hackable... but acc size is wrong, fix needed
* upcast based on threads, change thread size to 4x4
* revert wrongfully commented assert
* fix tc load indexing
* modify for size 8
* fix bug for size 8
* Revert "fix bug for size 8"
This reverts commit cdb3f5df85.
* Revert "modify for size 8"
This reverts commit 3ef0904bd9.
* good kernel with changes in lowerer
* revert "good kernel with changes in lowerer"
This reverts commit 975e2b5a4e.
* good kernel for relu!
* refactor lowerer changes
* add amx context var to helper
* clean up amx flag
* improve lowerer changes readability
* improve check for amx
* revert lowerer if
* add float4 type rendering for clang
* add amx definitions
* enable indexing for clang if amx
* working amx example, wrong because of dims
* almost works for float 16, need to spot using double load in amx
* cleaner render_kernel
* revert changes in simple_matmul and delete env
* add new var upcast_offset to get_optimized_ast
* change axis for axes
* invert if in rendering phi
* fix some bugs
* fix linearizer tests
* fix vec/get pat for amx
* remove clang tc if amx is disabled
* add ops_python support
* refactor into one complementary function in ops_python
* add job for EMULATE_AMX
* improve checking for AMX in UPCAST and TC extra ops
* fix lint issue
* commit before refactor into self-contained AMX
* start refactor by removing special rendering for AMX
* all ready for amx handcoded kernel
* working poc, most straightforward amx support
* avoid local opts for tc if amx
* fix merge bugs
* skip test for clang
* skip tc hand-coded opts if amx
* remove hardcoded ops_python values
* remove hardcoded sizes for amx kernel
* fix ops_python bug where dim was hard-coded
* change contract for vectorize
* working without changes in lowerer
* revert changes in gep rendering
* fix ops_python
* modify comment
* skip test if clang for different type accumulation
* move rename and bug fix to a separate PR
* fix wrong path for test
* addmm not implemented in torch for cpu
* change struct for vector; equally slow but cleaner
* revert modified test
* simplify wmma rendering
* minor change
* noqa:501
* add length 16 for AMX
* fix vectorized half issue
* fix error
* remove comment
* change set for dedup
* split test of tensor_core_extra_ops so that cases that don't require locals run for AMX
* add amx reference
* load acc into amx registers
* fix dtype rendering and remove noqa
* moved tests change into another pr
* add real AMX job for CI and fix bug
* fix ops_python bug
* fix test class
* remove real AMX tests and fix uops_stats test
* remove wrong test
* acc folding
* hotfix: bug
* fix float4 tests for amx
* hack for fixing flops counting
* hotfix: mypy
* add flop counts test for amx
* improve test_float4_multidim_amx
* improve test_float4_multidim_amx
* improve test_float4_multidim_unaligned_load_amx
* nits tests
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* document UOps.IF [run_process_replay]
* this will be a block of STOREs after merge_gates
* now i can enable the assert
* more docs
* raw code block
* cname
* cleanup
* Revert "cname"
This reverts commit d823f87561.
* Core change to gate stores in IFs
* Updates to cstyle renderer to handle IFs around STOREs
* Make uops asserts happy
* Add tests and fix newly broken tests
* make ruff happy
* make mypy happy
* Simplify renderer to have all gated stores use IF
* Revert some changes
* Make test_where_fold happy
* Revert unnecessary handling of ifs rendering. Was included before when changes weren't fully built out
* Rewrite graph to have IFs be dependent on RANGEs if STORE is already dependent on RANGE
* Re-change broken test
* Make ifs be grouped together
* get non-merged IFs working. All tests pass except grouping related ifs together
* Fix tests by making the IF UOp dependent on the correct node of the STORE UOp
* Changes to uopgraph
* Simplify graph rewrite logic
* Changes to get test_padto_where_multireduce working
* Simplify uops.store renderer
* Make test_padto_where_multireduce pass but now other tests fail
* Clean up uopgraph from scratch work
* Ignore pseudo IF srcs when rendering
* Attempt to fix llvm tests
* rm comment
* reduce lines
* Add line to make mypy happy :(
* llvmir fix pt 1
* Mods after rebasing to master
* Fix llvmir
* Fix ptx tests
* Fix other ptx tests
* Move changes from uops.py to ops.py
* rm uops.py
* Fix TestGateStoreRewrite tests
* Get multireduce tests working
* reset to remote branch
* Fix linearizer tests
* uop_graph test patch
* Add comment to create_gate
* hotfix: uncomment those tests
* Attempt to fix ptx tests by including whitespace inside if block
* Patch from remote tinybox. Tests passing here
* Min changes to get some ptx tests passing
* Changes after rebase
* Exclude ifs and endifs from ptx
* IF conditional branching within ptx
* Save lines on delete_redundant_gates
* Simplify merge_gates
* rm noqa
* Remove unnecessary checks when merging gates
* Fix ops error msg
* Smarter check for if/endif in llvmir
* simplify delete redundant gates to only have 2 returns
* spacing
* Smarter check at beginning of merge_gates
* patches from comments
* Remove need for merge_gates
* include proper srcs in IF from the get-go
* test expand ifs dumb will result in 4 ifs, not 1 now
* Make tests happy
* Fix uops stats
* rm merge_gates method. Will add back in separate PR
* Spacing
* cleaner error msg
* Fix uops rendering when expanding. test_failure_43
* patch tests
* undo changes in delete_redundant_gates
* process replay attempt
* re-intro deletion of redundant gates
* fix addition of gates when they get nested in stores and loads
* patch tests
* smarter init of IF srcs when adding gate to STORE
* make ruff happy
* Resp to comment
* include all src[2]'s srcs in IF for gated store
* add reference of the storing value to the gate's src
* minor patch after rebasing
* change ptx renderer
---------
Co-authored-by: qazal <qazal.software@gmail.com>
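
The gated-store commits above wrap a conditional STORE in an IF block instead of handling the gate inside the store itself. As a rough illustration only, the idea at the C-style rendering level could look like the sketch below. `render_gated_store` and its string arguments are hypothetical names made up for this example and are not the renderer's actual API.

```python
from typing import Optional

def render_gated_store(buf: str, idx: str, val: str, gate: Optional[str] = None) -> str:
    """Render a store as C-style text, wrapping it in an if-block when gated.

    Hypothetical helper for illustration; not tinygrad's real renderer.
    """
    store = f"{buf}[{idx}] = {val};"
    if gate is None:
        # Ungated stores are emitted unchanged.
        return store
    # Gated stores become the body of an if-block keyed on the gate expression.
    return f"if ({gate}) {{\n  {store}\n}}"

print(render_gated_store("data0", "gidx0", "val0", gate="gidx0<3"))
```

Keeping the condition in a separate block, rather than inside the store, is what lets later passes reason about redundant gates independently of the store.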
* math trait [run_process_replay]
* const -> const_like
* Revert "const -> const_like"
This reverts commit 85727c83d3.
* add MathTrait to LazyBuffer
* clean up function
* fixup the rest of function
* fix custom function
* mlb math trait
* fix that test
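
For readers unfamiliar with the pattern, a math trait is a mixin that derives arithmetic operators from a couple of primitive methods, so several classes (here, presumably LazyBuffer and its multi-device counterpart) can share one operator implementation. The sketch below is a minimal illustration under that assumption. `MathTraitSketch`, `Expr`, and the string op names are hypothetical and do not reproduce the actual tinygrad classes.

```python
from dataclasses import dataclass

class MathTraitSketch:
    """Hypothetical mixin: subclasses supply alu() and const(), operators come for free."""
    def alu(self, op, *src): raise NotImplementedError
    def const(self, value): raise NotImplementedError

    def _lift(self, x):
        # Turn plain Python numbers into constants of the implementing type.
        return x if isinstance(x, MathTraitSketch) else self.const(x)

    def __add__(self, x): return self.alu("ADD", self._lift(x))
    def __mul__(self, x): return self.alu("MUL", self._lift(x))
    def __neg__(self): return self * -1

@dataclass(frozen=True)
class Expr(MathTraitSketch):
    """Toy expression node used only to exercise the mixin."""
    op: str
    src: tuple = ()
    arg: object = None

    def alu(self, op, *src): return Expr(op, (self, *src))
    def const(self, value): return Expr("CONST", arg=value)

x = Expr("VAR", arg="x")
print(x + 2)
```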
* qcom: driver init
* autogen stubs for msm_kgsl also fixup ioctls to show numbers instead of _IOW macros
* autogen: add adreno commands and registers
* ops_qcom: QcomAllocator + signals
* fix EDEADLK in hwqueue, init timestamps, use opencl compiler for qcom
* qcom: we do not really need all these constants; input/output is enough
* qcom: perfctr for CS (do not really need all the rest)
* qcom: HALFREGFOOTPRINT and FULLREGFOOTPRINT are set to be around max
* qcom: explicitly set instruction len based on the shader size
* ops_qcom: Program init
extracts shader from OpenCL binary
sets input/output buffers
allocates stack
sets cs mode
runs shader
* use data64_le from helpers
* ops_qcom: use fill_kernargs for filling i/o buffers
* ops_qcom: add QcomCopyQueue just for api & set kernargs_args_offset
* new signals & fix exec
* add QCOM to the list of supported devices
* correct QcomComputeQueue._wait using CP_WAIT_REG_MEM
* fix exec, synchronize before copyout
* correct setting num_units for ST_SHADER
* fix gpu hangs on sigs with CP_MEM_WRITE, it is uncached mem anyway
* extract offsets to kernel arguments from opencl binary
* extract constants values and offsets from opencl binary
* handle KGSL_MEMFLAGS_USE_CPU_MAP correctly
* align kernel name to 4 bytes when skipping kernel opencl struct
* skip to consts directly using an offset from opencl binary header
* fix alloc
* get halfreg and fullreg from opencl bin
* set unmultiplied global sizes as kernel group in HLSQ_CS_NDRANGE
* parse prg offset from OpenCL binary
* save loc with HLSQ_CS_CNTL. set this with HLSQ_CONTROL_2_REG
* support for vals in _fill_kernargs
* support 16-bit constants
* use KGSL_CONTEXT_NO_FAULT_TOLERANCE for contexts
this helps to not fall down when executing big kernels
/* Don't time out if the context has disabled it */
if (drawobj->context->flags & KGSL_CONTEXT_NO_FAULT_TOLERANCE)
return;
* minor changes of _exec
* QCOMRenderer
* disable HCQGraph for demo. TODO: support HCQ update api
* support HCQ
- remove copy queue
- add updates
- add strides for buffs and vars for QCOM
* bufs_stride
* clean ups
* linter
* call super().__init__(value) in QcomSignal
* disable=unused-import
* mypy
* type ignore when queue is on the device
* fix
* query gpu_id.
Will be useful for selecting commands e.g. CP_EVENT_WRITE vs
CP_EVENT_WRITE7
* working timestamps
* free context after device is done
* move gpu stack to the device
* reserve some space with lib_gpu for gpu to write to
this fixes test_interpolate_bilinear
* exclude tests that fail with GPU=1 on qualcomm
* lint
* unmap mem in _gpu_free
* ctxt priority and preemption policy
* remove old qcom
* pass size to self.device.allocator.free
* skip tests only on qcom
* use kgsl and adreno defines instead of numeric vals
* use allocator for allocating lib_gpu
* update to QcomArgsState from master
* intermediate commit while conquering images
* enable image tests on qcom
* fix shader disasm size, dump textures stuff
* working images
* allow signals to be 0
* set branchstack from OpenCL binary
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* set shared memory size from OpenCL binary
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* update images in QcomArgsState & less loc for images
* set stack sizes from OpenCL binary
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* stack allocation based on OpenCL binary
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* better autogen for kgsl and adreno. no more bitshifts
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* cleanup commit for parse cl lib
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* don't forget actual generated files
* refactor + less loc
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* device.py back
* lint
* ruff
* timestamp divisor
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* fix tex fmt & round global size
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* dtypes
* 19.2MHz
* -1 loc in _update_exec
* remove noqa
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
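
Several of the QCOM commits above touch small pieces of arithmetic: splitting 64-bit addresses into little-endian 32-bit words (`data64_le`), aligning the kernel name to 4 bytes, and a timestamp divisor alongside a 19.2MHz figure. The sketch below illustrates each one. It assumes that 19.2MHz is the tick rate of the GPU timestamp counter, which the log does not state, and `round_up` and `ticks_to_us` are names chosen for this example.

```python
from typing import Tuple

TIMESTAMP_HZ = 19_200_000  # assumption: 19.2MHz is the timestamp counter frequency

def data64_le(value: int) -> Tuple[int, int]:
    """Split a 64-bit value into (low 32 bits, high 32 bits)."""
    return (value & 0xFFFFFFFF, value >> 32)

def round_up(n: int, align: int) -> int:
    """Round n up to the next multiple of align (e.g. 4-byte alignment)."""
    return (n + align - 1) // align * align

def ticks_to_us(ticks: int) -> float:
    """Convert counter ticks to microseconds under the assumed frequency."""
    return ticks * 1_000_000 / TIMESTAMP_HZ

print([hex(w) for w in data64_le(0x1122334455667788)])
print(round_up(13, 4), ticks_to_us(192))
```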