Commit Graph

4618 Commits

Author SHA1 Message Date
chenyu
1941e66cc9 real strides with uops (#6365)
* real strides with uops [run_process_replay]

* compare with old

* Revert "compare with old"

This reverts commit f53a8d4276.

* make those @unittest.expectedFailure
2024-09-09 03:06:27 -04:00
chenyu
ac98f5056e move lt-folding to a function [run_process_replay] (#6422)
and added more tests (some failed to match symbolic)
2024-09-09 02:04:52 -04:00
qazal
ff8a9ac3c1 test new style gated store rendering (#6413)
* test new style gated store rendering

* switch to lidx

* make lidx optional

* fixup [run_process_replay]
2024-09-09 13:59:22 +08:00
George Hotz
90fb17304f put rewrite back in ops [run_process_replay] (#6421) 2024-09-09 13:53:51 +08:00
qazal
442150a8df more ast_const for hardcoding consts [run_process_replay] (#6418) 2024-09-09 11:35:08 +08:00
chenyu
25af78c593 failed uop_symbolic divmod test by variable (#6414) 2024-09-08 23:08:58 -04:00
chenyu
ad05302232 tests of real_stride of symbolic shape (#6409)
these would have failed in #6365
2024-09-08 21:37:19 -04:00
qazal
935b4ddff6 use ast_const in test_linearizer asts [run_process_replay] (#6407) 2024-09-09 08:46:58 +08:00
qazal
9a67ec6174 refactor to list of kernels [run_process_replay] (#6403) 2024-09-08 17:19:45 +08:00
chenyu
7df4373fd9 tensor reduction touchup (#6402)
- fixing spacing
- use get_args to get valid Literal values and raise ValueError to match, and a test for that
- use `Y` to be consistent
2024-09-08 03:55:51 -04:00
Irakli Salia
2e01efc35f tensor roll (#6375)
* tensor roll function and tests

* fix type annotations

* reduce line count

* more readable
2024-09-07 05:14:28 +08:00
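The roll semantics this commit adds are the same wrap-around shift NumPy and PyTorch implement. A minimal sketch of that behavior on a plain Python list (illustrative only, not tinygrad's implementation):

```python
# Hedged sketch: the wrap-around shift that tensor `roll` is expected to
# match. Elements move `shift` positions along the sequence; anything that
# falls off the end wraps back to the front.
def roll(xs, shift):
    n = len(xs)
    if n == 0:
        return xs
    shift %= n  # shifts larger than the length wrap around
    return xs[-shift:] + xs[:-shift] if shift else xs[:]

print(roll([1, 2, 3, 4, 5], 2))  # → [4, 5, 1, 2, 3]
```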
Tim Becker
dfb818788e Support reduction parameter in more loss functions (#6302) 2024-09-07 05:11:20 +08:00
chenyu
26c5d8346a remove Variable from UOp.DEFINE_VAR (#6393)
now it's just arg = (expr as str, min as UOp.const, max as UOp.const)
2024-09-06 05:55:19 -04:00
chenyu
9ed2b8b818 fix DEFINE_VAR setup in test_uop_graph [run_process_replay] (#6392)
making sure arg always has 3 items
2024-09-06 05:32:12 -04:00
George Hotz
282af21b95 hotfix: DEBUG_EXPAND -1 and NOOPT in benchmark schedule 2024-09-06 17:22:30 +08:00
chenyu
9a9fea7b8c move DEFINE_VAR min/max from src to arg (#6388)
new arg is (Variable, min as CONST, max as CONST)
2024-09-06 05:01:02 -04:00
qazal
f1bd2a5519 fix BUFFER_UOPS sts in verify_ast [run_process_replay] (#6389) 2024-09-06 16:59:22 +08:00
chenyu
cc05016fa8 move test_pattern_matcher to test/unit (#6386) 2024-09-06 03:22:43 -04:00
George Hotz
86d34daac9 UOps.PHI -> UOps.ASSIGN [run_process_replay] (#6383) 2024-09-06 12:38:35 +08:00
chenyu
002303c145 fix output of truncate_fp16 (#6381)
make sure the non-inf path returns the truncated value
2024-09-05 22:55:43 -04:00
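The fix above is about fp16 truncation returning the truncated value on the non-inf path. A hedged sketch of that behavior using the standard library's IEEE half-precision `struct` format (`'e'`), not tinygrad's actual `truncate_fp16`:

```python
import math
import struct

# Hedged sketch: round-trip a float through IEEE half precision. Values
# outside the fp16 range overflow to +/-inf; everything else comes back
# truncated to the nearest representable fp16 value (the non-inf path).
def truncate_fp16(x: float) -> float:
    try:
        return struct.unpack('e', struct.pack('e', x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

print(truncate_fp16(1.0))  # → 1.0 (exactly representable)
print(truncate_fp16(0.1))  # close to 0.1, but truncated to fp16 precision
```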
George Hotz
c88329244b create rewrite.py [run_process_replay] (#6379)
* create rewrite.py [run_process_replay]

* fix tests

* not in rewrite or ops

* skip flaky test
2024-09-06 10:51:01 +08:00
George Hotz
66e7e51c79 Revert beam failure (#6376)
* Revert "late gate creation for STORE [run_process_replay] (#6373)"

This reverts commit c26744de9f.

* Revert "gated store rewrite to UOps.IF (#5976)"

This reverts commit 48061e8400.
2024-09-06 09:36:44 +08:00
ignaciosica
c15506fc35 [WIP] amx support as TC (#5693)
* almost working with relu, even hackable... but acc size is wrong, fix needed

* upcast based on threads, change thread size to 4x4

* revert wrongfully commented assert

* fix tc load indexing

* modify for size 8

* fix bug for size 8

* Revert "fix bug for size 8"

This reverts commit cdb3f5df85.

* Revert "modify for size 8"

This reverts commit 3ef0904bd9.

* good kernel with changes in lowerer

* revert "good kernel with changes in lowerer"

This reverts commit 975e2b5a4e.

* good kernel for relu!

* refactor lowerer changes

* add amx context var to helper

* clean up amx flag

* improve lowerer changes readability

* improve check for amx

* revert lowerer if

* add float4 type rendering for clang

* add amx definitions

* enable indexing for clang if amx

* working amx example, wrong because of dims

* almost works for float16; need to stop using double load in amx

* cleaner render_kernel

* revert changes in simple_matmul and delete env

* add new var upcast_offset to get_optimized_ast

* change axis for axes

* invert if in rendering phi

* fix some bugs

* fix linearizer tests

* fix vec/get pat for amx

* remove clang tc if amx is disabled

* add ops_python support

* refactor into one complementary function in ops_python

* add job for EMULATE_AMX

* improve checking for AMX in UPCAST and TC extra ops

* fix lint issue

* commit before refactor into autocontained AMX

* start refactor by removing special rendering for AMX

* all ready for amx handcoded kernel

* working poc, most straightforward amx support

* avoid local opts for tc if amx

* fix merge bugs

* skip test for clang

* skip tc hand-coded opts if amx

* remove hardcoded ops_python values

* remove hardcoded sizes for amx kernel

* fix ops_python bug where dim was hard-coded

* change contract for vectorize

* working without changes in lowerer

* revert changes in gep rendering

* fix ops_python

* modify comment

* skip test if clang for different type accumulation

* move rename and bug fix to a separate PR

* fix wrong path for test

* addmm not implemented in torch for cpu

* change struct for vector; equally slow but cleaner

* revert modified test

* simplify wmma rendering

* minor change

* noqa:501

* add length 16 for AMX

* fix vectorized half issue

* fix error

* remove comment

* change set for dedup

* split test of tensor_core_extra_ops so that cases that don't require locals run for AMX

* add amx reference

* load acc into amx registers

* fix dtype rendering and remove noqa

* moved tests change into another pr

* add real AMX job for CI and fix bug

* fix ops_python bug

* fix test class

* remove real AMX tests and fix uops_stats test

* remove wrong test

* acc folding

* hotfix: bug

* fix float4 tests for amx

* hack for fixing flops counting

* hotfix: mypy

* add flop counts test for amx

* improve test_float4_multidim_amx

* improve test_float4_multidim_amx

* improve test_float4_multidim_unaligned_load_amx

* nits tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-09-06 09:01:10 +08:00
qazal
c26744de9f late gate creation for STORE [run_process_replay] (#6373) 2024-09-06 03:32:19 +08:00
Ian Paul
48061e8400 gated store rewrite to UOps.IF (#5976)
* Core change to gate stores in IFs

* Updates to cstyle renderer to handle IFs around STOREs

* Make uops asserts happy

* Add tests and fix newly broken tests

* make ruff happy

* make mypy happy

* Simplify renderer to have all gated stores use IF

* Revert some changes

* Make test_where_fold happy

* Revert unnecessary handling of ifs rendering. Was included before when changes weren't fully built out

* Rewrite graph to have IFs be dependent on RANGEs if STORE is already dependent on RANGE

* Re-change broken test

* Make ifs be grouped together

* get non-merged IFs working. All tests pass except grouping related ifs together

* Fix tests by making the IF UOp dependent on the correct node of the STORE UOp

* Changes to uopgraph

* Simplify graph rewrite logic

* Changes to get test_padto_where_multireduce working

* Simplify uops.store renderer

* Make test_padto_where_multireduce pass but now other tests fail

* Clean up uopgraph from scratch work

* Ignore pseudo IF srcs when rendering

* Attempt to fix llvm tests

* rm comment

* reduce lines

* Add line to make mypy happy :(

* llvmir fix pt 1

* Mods after rebasing to master

* Fix llvmir

* Fix ptx tests

* Fix other ptx tests

* Move changes from uops.py to ops.py

* rm uops.py

* Fix TestGateStoreRewrite tests

* Get multireduce tests working

* reset to remote branch

* Fix linearizer tests

* uop_graph test patch

* Add comment to create_gate

* hotfix: uncomment those tests

* Attempt to fix ptx tests by including whitespace inside if block

* Patch from remote tinybox. Tests passing here

* Min changes to get some ptx tests passing

* Changes after rebase

* Exclude ifs and endifs from ptx

* IF conditional branching within ptx

* Save lines on delete_redundant_gates

* Simplify merge_gates

* rm noqa

* Remove unnecessary checks when merging gates

* Fix ops error msg

* Smarter check for if/endif in llvmir

* simplify delete redundant gates to only have 2 returns

* spacing

* Smarter check at beginning of merge_gates

* patches from comments

* Remove need for merge_gates

* include proper srcs in IF from the get-go

* test expand ifs dumb will result in 4 ifs, not 1 now

* Make tests happy

* Fix uops stats

* rm merge_gates method. Will add back in separate PR

* Spacing

* cleaner error msg

* Fix uops rendering when expanding. test_failure_43

* patch tests

* undo changes in delete_redundant_gates

* process replay attempt

* re-intro deletion of redundant gates

* fix addition of gates when they get nested in stores and loads

* patch tests

* smarter init of IF srcs when adding gate to STORE

* make ruff happy

* Resp to comment

* include all src[2]'s srcs in IF for gated store

* add reference of the storing value to the gate's src

* minor patch after rebasing

* change ptx renderer

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-09-06 01:05:30 +08:00
nimlgen
a1a15b54c9 qcom cache flush (#6367)
* qcom cache flush

* bench

* linter

* move
2024-09-05 13:23:39 +03:00
chenyu
62f9f273f7 increase test_profile_multidev_transfer threshold (#6370)
flaky, bumped to 16000 for CI
2024-09-05 05:49:32 -04:00
George Hotz
e882294c02 uops touchups [run_process_replay] (#6368)
* uops touchups [run_process_replay]

* those are classmethods

* oops, kwargs

* no kwargs there
2024-09-05 17:22:32 +08:00
George Hotz
a28ed7ba4d math trait [run_process_replay] (#6364)
* math trait [run_process_replay]

* const -> const_like

* Revert "const -> const_like"

This reverts commit 85727c83d3.

* add MathTrait to LazyBuffer

* clean up function

* fixup the rest of function

* fix custom function

* mlb math trait

* fix that test
2024-09-05 16:19:17 +08:00
George Hotz
4a51c28ee7 switch const to const_like [run_process_replay] (#6356)
* const like

* no more _const

* missed one

* mypy ops.py

* file missing

* const_like

* fix image and test uop graph [run_process_replay]

* fix ptx
2024-09-05 13:57:54 +08:00
George Hotz
0d6922edb4 faster local tests. copy torch permuted to default device [run_process_replay] (#6363) 2024-09-05 13:57:20 +08:00
chenyu
6fd24561d1 distribute MUL const into ADD for int (#6361)
pre-req for real_stride
2024-09-05 01:36:57 -04:00
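The rewrite this commit names is the exact distributive law for integers. A hedged sketch (not tinygrad's actual pattern-matcher rule) of why it helps: distributing a constant over a sum turns an index like `4*(x+1)` into `4*x + 4`, exposing the constant stride and offset that `real_stride` needs to read off:

```python
# Hedged sketch: for integers, a constant factor distributes exactly over a
# sum, so c*(t0 + t1 + ...) can be rewritten to c*t0 + c*t1 + ... without
# changing the result. (For floats this is unsafe in general: rounding of
# the intermediate products can differ.)
def distribute_mul(c: int, terms: list) -> list:
    return [c * t for t in terms]

c, terms = 4, [7, 1]  # stands in for 4*(x+1) with x=7
assert c * sum(terms) == sum(distribute_mul(c, terms))
print(distribute_mul(c, terms))  # → [28, 4]
```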
qazal
e7f6b654ad cleanup uop eq asserts for swizzle [run_process_replay] (#6362)
* cleanup uop eq asserts for swizzle [run_process_replay]

* more stuff
2024-09-05 13:36:36 +08:00
Oleg Rybalko
64f1384f5b Einsum ellipsis support (#6333)
* working ellipsis expansion

* refactor

* fix commas in output

* add capital letters

* refactor
2024-09-05 10:08:55 +08:00
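One way to think about ellipsis support: `"..."` can be pre-expanded into explicit batch letters so the core einsum logic only ever sees named subscripts. A hedged sketch of that expansion (helper name and approach hypothetical, not this PR's code; assumes the formula has an explicit `->` output):

```python
import string

# Hedged sketch: expand "..." in an einsum formula into fresh subscript
# letters, given each operand's number of dimensions.
def expand_ellipsis(formula: str, ndims: tuple) -> str:
    inputs, out = formula.split("->")
    ins = inputs.split(",")
    # letters not already used in the formula are free to name batch dims
    free = [c for c in string.ascii_letters if c not in formula]
    # batch rank = the most extra dims any "..." operand carries
    n_batch = max(nd - len(s.replace("...", "")) for s, nd in zip(ins, ndims) if "..." in s)
    batch = "".join(free[:n_batch])
    def repl(s, nd):
        if "..." not in s:
            return s
        k = nd - len(s.replace("...", ""))  # this operand's batch dims, right-aligned
        return s.replace("...", batch[n_batch - k:])
    return ",".join(repl(s, nd) for s, nd in zip(ins, ndims)) + "->" + out.replace("...", batch)

print(expand_ellipsis("...ij,...jk->...ik", (4, 4)))  # → abij,abjk->abik
```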
nimlgen
326a77336e qcom remove some tests skips (#6353) 2024-09-04 15:38:18 +03:00
qazal
99018a4aa1 minor schedule differ utils [run_process_replay] (#6348)
* minor schedule differ utils [run_process_replay]

* rm
2024-09-04 03:41:38 +08:00
nimlgen
3adb76894d validate image=2 float16=1 openpilot benchmark (#6346)
* validate image=2 float16=1 openpilot

* linter

* linter2
2024-09-03 20:13:40 +03:00
qazal
2f00bf0c78 conv bw in one kernel with graph_rewrite (#6330)
* double reduce merger

* add test_fold_conv_relu_backward_ast_rewrite

* a correctness test to iterate on

* merge axes the other way around

* better
2024-09-03 03:53:53 +08:00
Vyacheslav Pachkov
4c33192a8b add qcom runtime (#5213)
* qcom: driver init

* autogen stubs for msm_kgsl; also fixup ioctls to show numbers instead of _IOW macros

* autogen: add adreno commands and registers

* ops_qcom: QcomAllocator + signals

* fix EDEADLK in hwqueue, init timestamps, use opencl compiler for qcom

* qcom: we do not really need all these constants input/output is enough

* qcom: perfctr for CS (do not really need all the rest)

* qcom: HALFREGFOOTPRINT and FULLREGFOOTPRINT are set to be around max

* qcom: explicitly set instruction len based on the shader size

* ops_qcom: Program init

extracts shader from open cl binary
sets input/output buffers
allocates stack
sets cs mode
runs shader

* use data64_le from helpers

* ops_qcom: use fill_kernargs for filling i/o buffers

* ops_qcom: add QcomCopyQueue just for api & set kernargs_args_offset

* new signals & fix exec

* add QCOM to the list of supported devices

* correct QcomComputeQueue._wait using CP_WAIT_REG_MEM

* fix exec, synchronize before copyout

* correct setting num_units for ST_SHADER

* fix gpu hangs on sigs with CP_MEM_WRITE, it is uncached mem anyway

* extract offsets to kernel arguments from opencl binary

* extract constants values and offsets from opencl binary

* handle KGSL_MEMFLAGS_USE_CPU_MAP correctly

* align kernel name to 4 bytes when skipping kernel opencl struct

* skip to consts directly using an offset from opencl binary header

* fix alloc

* get halfreg and fullreg from opencl bin

* set unmultipled global sizes as kernel group in HLSQ_CS_NDRANGE

* parse prg offset from open cl binary

* save loc with HLSQ_CS_CNTL. set this with HLSQ_CONTROL_2_REG

* support for vals in _fill_kernargs

* support 16-bit constants

* use KGSL_CONTEXT_NO_FAULT_TOLERANCE for contexts

this helps to not fall down when executing big kernels

    /* Don't time out if the context has disabled it */
    if (drawobj->context->flags & KGSL_CONTEXT_NO_FAULT_TOLERANCE)
        return;

* minor changes of _exec

* QCOMRenderer

* disable HCQGraph for demo. TODO: support HCQ update api

* support HCQ

- remove copy queue
- add updates
- add strides for buffs and vars for QCOM

* bufs_stride

* clean ups

* linter

* call super().__init__(value) in QcomSignal

* disable=unused-import

* mypy

* type ignore when queue is on the device

* fix

* query gpu_id.
Will be useful for selecting commands e.g. CP_EVENT_WRITE vs
CP_EVENT_WRITE7

* working timestamps

* free context after device is done

* move gpu stack to the device

* reserve some space with lib_gpu for gpu to write to

this fixes test_interpolate_bilinear

* exclude tests that fails with GPU=1 on qualcomm

* lint

* unmap mem in _gpu_free

* ctxt priority and preemption policy

* remove old qcom

* pass size to self.device.allocator.free

* skip tests only on qcom

* use kgsl and adreno defines instead of numeric vals

* use allocator for allocating lib_gpu

* update to QcomArgsState from master

* intermediate commit while conquering images

* enable image tests on qcom

* fix shader disasm size, dump textures stuff

* working images

* allow signals to be 0

* set branchstack from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* set shared memory size from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* update images in QcomArgsState & less loc for images

* set stack sizes from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* stack allocation based on OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* better autogen for kgsl and adreno. no more bitshifts

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* cleanup commit for parse cl lib

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* dont forget actual generated files

* refactor + less loc

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* device.py back

* lint

* ruff

* timestamp divisor

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* fix tex fmt & round global size

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* dtypes

* 19.2MHz

* -1 loc in _update_exec

* remove noqa

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-09-02 19:35:47 +03:00
George Hotz
406ec8240e hotfix: lin_fail_41 passes on my M3 Max 2024-08-31 11:46:46 -07:00
Roelof van Dijk
ad4b3b457f bump limit for test_llama_embedding_opt (#6332) 2024-08-31 10:03:43 -04:00
George Hotz
72939901fc hotfix: ebs print kernel names 2024-08-29 21:20:36 -07:00
George Hotz
365babe391 precompute early_reject [run_process_replay] (#6327)
* precompute early_reject [run_process_replay]

* features for ebs

* fix ocelot cache
2024-08-29 18:26:24 -07:00
George Hotz
385904526f remove more rules [run_process_replay] (#6326)
* remove more rules [run_process_replay]

* disable invalid test

* ptx needs that str
2024-08-29 16:27:10 -07:00
qazal
539654fbe1 graph_rewrite complexity tests [run_process_replay] (#6317) 2024-08-29 22:39:08 +03:00
qazal
07942ef361 Proposal: Better UOps.SWIZZLE (#6309)
* better UOps.SWIZZLE

* test_swizzle_rewrite

* add it to docs

* show a diff

* a lil more verbose

* two teeny notes

* hotfix: sink
2024-08-29 15:39:48 +03:00
qazal
dd4e5f1c8d process replay rewrite (#6284)
* process replay rewrite

p2

* start some unittests + exceptions and exits

* shebang

* remove extra kernel init
2024-08-29 15:08:27 +03:00
pedro
7de4eac8f7 add support and tests for nearest modes in interpolate, adapt uint8 bilinear to torch implementation (#6308)
* add `nearest` mode to interpolate

matching pytorch `nearest` which is knowingly buggy

+ relevant TestsOps

* add `nearest-exact` mode to interpolate

matching pytorch `nearest-exact`

+ relevant TestOps

* fix uint8 bilinear interpolation

by matching custom torch implementation

* implement uint8 lerp with torch interpolation trick

without converting it to float
2024-08-28 21:59:51 -07:00
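The last bullet describes blending uint8 values without converting to float. A hedged sketch in the spirit of that trick (not the PR's actual code): express the blend weight in fixed point (`w/256`) and stay in integer arithmetic throughout:

```python
# Hedged sketch: integer lerp between two uint8 values with a fixed-point
# weight w in [0, 256]. Computes a + (b - a) * w/256 entirely in ints,
# so no float conversion is needed.
def lerp_u8(a: int, b: int, w: int) -> int:
    return a + (((b - a) * w) >> 8)

print(lerp_u8(0, 255, 128))  # → 127 (roughly halfway)
```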
qazal
ec34d9ee36 start benchmarking ast graph rewrite (#6297)
* ast_rewrite to ctx var

* add external_benchmark_ast

* refactor to asts

* track lazybuffers

* more work

* record checkpoint

* cleanup
2024-08-27 18:18:44 +03:00
Max-We
ab2714423b Add einsum tests (#6286)
Co-authored-by: Maximilian Weichart <maximilian.weichart@icloud.com>
2024-08-26 09:09:25 -07:00