Commit Graph

2445 Commits

Author SHA1 Message Date
qazal
c26744de9f late gate creation for STORE [run_process_replay] (#6373) 2024-09-06 03:32:19 +08:00
Ian Paul
48061e8400 gated store rewrite to UOps.IF (#5976)
* Core change to gate stores in IFs

* Updates to cstyle renderer to handle IFs around STOREs

* Make uops asserts happy

* Add tests and fix newly broken tests

* make ruff happy

* make mypy happy

* Simplify renderer to have all gated stores use IF

* Revert some changes

* Make test_where_fold happy

* Revert unnecessary handling of ifs rendering. Was included before when changes weren't fully built out

* Rewrite graph to have IFs be dependent on RANGEs if STORE is already dependent on RANGE

* Re-change broken test

* Make ifs be grouped together

* get non-merged IFs working. ALl tests pass except grouping related ifs together

* Fix tests by making the IF UOp dependent on the correct node of the STORE UOp

* Changes to uopgraph

* Simplify graph rewrite logic

* Changes to get test_padto_where_multireduce working

* Simplify uops.store renderer

* Make test_padto_where_multireduce pass but now other tests fail

* Clean up uopgraph from scrach work

* Ignore sudo IF srcs when rendering

* Attempt to fix llvm tests

* rm comment

* reduce lines

* Add line to make mypy happy :(

* llvmir fix pt 1

* Mods after rebasing to master

* Fix llvmir

* Fix ptx tests

* Fix other ptx tests

* Move changes from uops.py to ops.py

* rm uops.py

* Fix TestGateStoreRewrite tests

* Get multireduce tests working

* reset to remote branch

* Fix linearizer tests

* uop_graph test patch

* Add comment to create_gate

* hotfix: uncomment those tests

* Attempt to fix ptx tests by including whitespace inside if block

* Patch from remote tinybox. Tests passing here

* Min changes to get some ptx tests passsing

* Changes after rebase

* Exclude ifs and endifs from ptx

* IF conditional branching within ptx

* Save lines on delete_redundant_gates

* Simplify merge_gates

* rm noqa

* Remove unnecessary checks when merging gates

* Fix ops error msg

* Smarter check for if/endif in llvmir

* simplify delete redundant gates to only have 2 returns

* spacing

* Smarter check at beginning of merge_gates

* patches from comments

* Remove need for merge_gates

* include proper srcs in IF from the get-go

* test expand ifs dumb will result in 4 ifs, not 1 now

* Make tests happy

* Fix uops stats

* rm merge_gates method. Will add back in separate PR

* Spacing

* cleaner error msg

* Fix uops rendering when expanding. test_failure_43

* patch tests

* undo changes in delete_redundant_gates

* process replay attempt

* re-intro deletion of redundant gates

* fix addition of gates when they get nested in stores and loads

* patch tests

* smarter init of IF srcs when adding gate to STORE

* make ruff happy

* Resp to comment

* include all src[2]'s srcs in IF for gated store

* add reference of the storing value to the gate's src

* minor patch after rebasing

* change ptx renderer

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-09-06 01:05:30 +08:00
nimlgen
a1a15b54c9 qcom cache flush (#6367)
* qcom cache flush

* bench

* linter

* move
2024-09-05 13:23:39 +03:00
chenyu
62f9f273f7 increase test_profile_multidev_transfer threshold (#6370)
flaky, bumpped to 16000 for CI
2024-09-05 05:49:32 -04:00
George Hotz
e882294c02 uops touchups [run_process_replay] (#6368)
* uops touchups [run_process_replay]

* those are classmethods

* oops, kwargs

* no kwargs there
2024-09-05 17:22:32 +08:00
George Hotz
a28ed7ba4d math trait [run_process_replay] (#6364)
* math trait [run_process_replay]

* const -> const_like

* Revert "const -> const_like"

This reverts commit 85727c83d3.

* add MathTrait to LazyBuffer

* clean up function

* fixup the rest of function

* fix custom function

* mlb math trait

* fix that test
2024-09-05 16:19:17 +08:00
George Hotz
4a51c28ee7 switch const to const_like [run_process_replay] (#6356)
* const like

* no more _const

* missed one

* mypy ops.py

* file missing

* const_like

* fix image and test uop graph [run_process_replay]

* fix ptx
2024-09-05 13:57:54 +08:00
George Hotz
0d6922edb4 faster local tests. copy torch permuted to defautl device [run_process_replay] (#6363) 2024-09-05 13:57:20 +08:00
chenyu
6fd24561d1 distribute MUL const into ADD for int (#6361)
pre-req for real_stride
2024-09-05 01:36:57 -04:00
qazal
e7f6b654ad cleanup uop eq asserts for swizzle [run_process_replay] (#6362)
* cleanup uop eq asserts for swizzle [run_process_replay]

* more stuff
2024-09-05 13:36:36 +08:00
Oleg Rybalko
64f1384f5b Einsum ellipsis support (#6333)
* working ellipsis expansion

* refactor

* fix commas in output

* add capital letters

* refactor
2024-09-05 10:08:55 +08:00
nimlgen
326a77336e qcom remove some tests skips (#6353) 2024-09-04 15:38:18 +03:00
qazal
99018a4aa1 minor schedule differ utils [run_process_replay] (#6348)
* minor schedule differ utils [run_process_replay]

* rm
2024-09-04 03:41:38 +08:00
nimlgen
3adb76894d validate image=2 float16=1 openpilot benchmark (#6346)
* validate image=2 float=16 openpilot

* linter

* linter2
2024-09-03 20:13:40 +03:00
qazal
2f00bf0c78 conv bw in one kernel with graph_rewrite (#6330)
* double reduce merger

* add test_fold_conv_relu_backward_ast_rewrite

* a correctness test to iterate on

* merge axes the other way around

* better
2024-09-03 03:53:53 +08:00
Vyacheslav Pachkov
4c33192a8b add qcom runtime (#5213)
* qcom: driver init

* autogen stubs for msm_kgsl also fixup ioctls to show numbers instead of _IOW macros

* autogen: add adreno commands and registers

* ops_qcom: QcomAllocator + signals

* fix EDEADLK in hwqueue, init timestamps, use opencl compiler for qcom

* qcom: we do not really need all these constants input/output is enough

* qcom: perfctr for CS (do not really need all the rest)

* qcom: HALFREGFOOTPRINT and FULLREGFOOTPRINT are set to be around max

* qcom: explicitly set instruction len based on the shader size

* ops_qcom: Program init

extracts shader from open cl binary
sets input/output buffers
allocates stack
sets cs mode
runs shader

* use data64_le from helpers

* ops_qcom: use fill_kernargs for filling i/o buffers

* ops_qcom: add QcomCopyQueue just for api & set kernargs_args_offset

* new signals & fix exec

* add QCOM to the list of supported devices

* correct QcomComputeQueue._wait using CP_WAIT_REG_MEM

* fix exec, synchronize before copyout

* correct setting num_units for ST_SHADER

* fix gpu hangs on sigs with CP_MEM_WRITE, it is uncached mem anyway

* extract offsets to kernel arguments from opencl binary

* extract constants values and offsets from opencl binary

* handle KGSL_MEMFLAGS_USE_CPU_MAP correctly

* align kernel name to 4 bytes when skipping kernel opencl struct

* skip to consts directly using an offset from opencl binary header

* fix alloc

* get halfreg and fullreg from opencl bin

* set unmultipled global sizes as kernel group in HLSQ_CS_NDRANGE

* parse prg offset from open cl binary

* save loc with HLSQ_CS_CNTL. set this with HLSQ_CONTROL_2_REG

* support for vals in _fill_kernargs

* support 16-bit constants

* use KGSL_CONTEXT_NO_FAULT_TOLERANCE for contexts

this helps to not fall down when executing big kernels

    /* Don't time out if the context has disabled it */
    if (drawobj->context->flags & KGSL_CONTEXT_NO_FAULT_TOLERANCE)
        return;

* minor changes of _exec

* QCOMRenderer

* disable HCQGraph for demo. TOOD: support HCQ update api

* support HCQ

- remove copy queue
- add updates
- add strides for buffs and vars for QCOM

* bufs_stride

* clean ups

* linter

* call super().__init__(value) in QcomSignal

* disable=unused-import

* mypy

* type ignore when queue is on the device

* fix

* query gpu_id.
Will be useful for selecting commands e.g. CP_EVENT_WRITE vs
CP_EVENT_WRITE7

* working timestamps

* free context after device is done

* move gpu stack to the device

* reserve some space with lib_gpu for gpu to write to

this fixes test_interpolate_bilinear

* exclude tests that fails with GPU=1 on qualcomm

* lint

* unmap mem in _gpu_free

* ctxt priority and preemtion policy

* remove old qcom

* pass size to self.device.allocator.free

* skip tests only on qcom

* use kgsl and adreno defines instead of numeric vals

* use allocator for allocating lib_gpu

* update to QcomArgsState from master

* intermediate commit while conquering images

* enable image tests on qcom

* fix shader disasm size, dump textures stuff

* working images

* allow signals to be 0

* set branchstack from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* set shared memory size from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* update images in QcomArgsState & less loc for images

* set stack sizes from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* stack allocation based on OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* better autogen for kgsl and adreno. no more bitshifts

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* cleanup commit for parse cl lib

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* dont forget actual generated files

* refactor + less loc

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* device.py back

* lint

* ruff

* timestamp divisor

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* fix tex fmt & round global size

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* dtypes

* 19.2MHz

* -1 loc in _update_exec

* remove noqa

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-09-02 19:35:47 +03:00
George Hotz
406ec8240e hotfix: lin_fail_41 passes on my M3 Max 2024-08-31 11:46:46 -07:00
Roelof van Dijk
ad4b3b457f bump limit for test_llama_embedding_opt (#6332) 2024-08-31 10:03:43 -04:00
George Hotz
72939901fc hotfix: ebs print kernel names 2024-08-29 21:20:36 -07:00
George Hotz
365babe391 precompute early_reject [run_process_replay] (#6327)
* precompute early_reject [run_process_replay]

* features for ebs

* fix ocelot cache
2024-08-29 18:26:24 -07:00
George Hotz
385904526f remove more rules [run_process_replay] (#6326)
* remove more rules [run_process_replay]

* disable invalid test

* ptx needs that str
2024-08-29 16:27:10 -07:00
qazal
539654fbe1 graph_rewrite complexity tests [run_process_replay] (#6317) 2024-08-29 22:39:08 +03:00
qazal
07942ef361 Proposal: Better UOps.SWIZZLE (#6309)
* better UOps.SWIZZLE

* test_swizzle_rewrite

* add it to docs

* show a diff

* a lil more verbose

* two teeny notes

* hotfix: sink
2024-08-29 15:39:48 +03:00
qazal
dd4e5f1c8d process replay rewrite (#6284)
* process replay rewrite

p2

* start some unittests + exceptions and exits

* shebang

* remove extra kernel init
2024-08-29 15:08:27 +03:00
pedro
7de4eac8f7 add support and tests for nearest modes in interpolate, adapt uint8 bilinear to torch implementation (#6308)
* add `nearest` mode to interpolate

matching pytorch `nearest` which is knowingly buggy

+ relevant TestsOps

* add `nearest-exact` mode to interpolate

matching pytorch `nearest-exact`

+ relevant TestOps

* fix uint8 bilinear interpolation

by matching custom torch implementation

* implement uint8 lerp with torch interpolation trick

without converting it to float
2024-08-28 21:59:51 -07:00
qazal
ec34d9ee36 start benchmarking ast graph rewrite (#6297)
* ast_rewrite to ctx var

* add external_benchmark_ast

* refactor to asts

* track lazybuffers

* more work

* record checkpoint

* cleanup
2024-08-27 18:18:44 +03:00
Max-We
ab2714423b Add einsum tests (#6286)
Co-authored-by: Maximilian Weichart <maximilian.weichart@icloud.com>
2024-08-26 09:09:25 -07:00
chenyu
b76f0c875e lazy const fold idiv 1 (#6285) 2024-08-26 10:29:59 -04:00
chenyu
af7c04ff57 Tensor.__floordiv__ (#6283)
support Tensor.__floordiv__ and friends
2024-08-26 09:43:40 -04:00
qazal
d2f8eeed2e make [compare_schedule] the default [run_process_replay] (#6273)
* make [compare_schedule] the default

* capture ctx

* logging

* set capture to false
2024-08-26 21:40:03 +08:00
CaltropHungerton
002f60b4c3 fix intel wmma flop counting, add flop counting tests for different tensor cores (#6192)
* fix wmma flop counting on intel, add count tests

* half

* add half gemm

* Update test.yml

* one test

* Update test_uops_stats.py

* Update test_uops_stats.py

* Update test_uops_stats.py

* smaller matrix, use unittest skipUnless decorator
2024-08-25 18:37:05 -07:00
qazal
f0cc8ca5f2 generic st_fixup in scheduler graph rewrite [compare_schedule] (#6278) 2024-08-25 11:02:17 +03:00
gswangg
3cf507ae7f remove extra.ops and LazyOp support from Kernel (#6267)
* remove extra.ops and BufferOps

* remove extra.ops and LazyOp support in Kernel
2024-08-24 16:44:38 +03:00
qazal
ccb05d8baa fixup neg tests [run_process_replay] (#6268) 2024-08-24 16:35:43 +03:00
gswangg
ea76b93814 migrate test_linearizer_dumb.py to UOp AST (#6241)
* add imports and update test_unmerged_ifs to UOp AST

* test_max_simplify_and_cancel

* test_expander_new_srcs

* test_llama_embedding

* test_unaligns_idxs

* test_unrolled_float4_align

* test_upcasted_stores_out_of_order

* remove LazyOp

* remove extra/ops and replace ReduceOps.SUM with BinaryOps.ADD
2024-08-24 16:27:29 +03:00
gswangg
e44653e25a migrate test_linearizer_failures.py to UOp AST (#6240)
* add imports and update test_failure_1 to UOp AST

* update test_failure_2 with UOp AST

* update test_failure_3

* test_failure_5

* test_failure_6

* test_failure_7

* test_failure_8

* test_failure_9

* test_failure_10

* test_failure_11

* test_failure_12

* test_failure_12_multireduce

* uncomment skip and migrate test_failure_13

* test_failure_14

* test_failure_15

* test_failure_16

* test_failure_17

* test_failure_18

* test_failure_19

* test_failure_20

* test_failure_21

* test_failure_22

* test_failure_23

* test_failure_24

* test_failure_25

* test_failure_26

* test_failure_27

* test_failure_28

* test_failure_29

* test_failure_30

* test_failure_31

* test_failure_32

* test_failure_33

* test_failure_34

* test_failure_36

* test_failure_37

* test_failure_38

* test_update_39

* test_failure_40

* test_failure_41

* test_failure_42

* test_failure_43

* test_failure_44

* test_failure_45

* test_failure_46

* test_failure_47

* test_failure_48

* test_failure_49

* test_failure_50

* remove LazyOp

* reskip test_failure_22

* remove extra/ops

* replace ReduceOps with BinaryOps

* fixup that import

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-08-24 16:26:58 +03:00
gswangg
1dc6040877 migrate test_search.py to UOp AST (#6245)
* add imports and update test_kernel_count with UOp AST

* test_filter_global_buffer

* remove LazyOp

* remove extra.ops and ReduceOps

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-08-24 16:13:53 +03:00
qazal
ae23540d6e refresh process replay schedule ref in reset.py (#6265) 2024-08-24 16:12:51 +03:00
gswangg
7be5eede71 migrate test_linearizer_overflows.py to UOp AST (#6244)
* add imports, remove ConstBuffer, and update test_overflow_1 with UOp AST

* test_overflow_2

* test_overflow_3

* test_overflow_4

* test_overflow_5

* test_overflow_6

* test_overflow_7

* TestLinearizerOverflowAlt::test_overflow_1

* TestLinearizerOverflowAlt::test_overflow_2

* remove LazyOp

* remove extra.ops

* remove ReduceOps
2024-08-24 16:10:29 +03:00
chenyu
943ab97d24 fix Tensor.prod for multitensor (#6264) 2024-08-24 08:52:24 -04:00
qazal
bcb2f1caa3 init REDUCE_AXIS with BinaryOps (#6256)
* REDUCE_AXIS arg with BinaryOps

* more work in kernel.py
fixup sops.gz

* fix TestGraphRewriteEfficiency
2024-08-24 11:28:41 +03:00
chenyu
da5cf11859 fix acc init value for MUL (#6263) 2024-08-23 23:19:44 -04:00
George Hotz
26498b322e add BEAM to external_benchmark_schedule.py 2024-08-23 18:10:46 -07:00
George Hotz
53a73038e3 hotfix: TestGraphRewriteEfficiency.test_create_many_uops 2024-08-23 15:51:57 -07:00
chenyu
590c0922b6 Tensor.prod (#6250)
* Tensor.prod

a new reduce op!

* onnx ReduceProd
2024-08-23 10:06:32 -04:00
qazal
78d6bd8b41 start graph rewrite in the scheduler (#6248)
* start graph rewrite in the scheduler

* test: enable it

* test timings

* only fails in multi reduce

* more isolated tests
2024-08-23 13:15:55 +03:00
George Hotz
238896ca02 loooking into graph rewrite speed (#6239)
* loooking into graph rewrite speed

* track, replace is slow

* if all same, no permutations [run_process_replay]

* types so compile works

* no implied comprehension

* TRACK_MATCH_STATS=2
2024-08-22 13:17:55 -07:00
chenyu
e745e16441 remove UnaryOps.NEG (#6238)
* Remove UnaryOps.NEG

generated new dataset with
```
time JIT=2 PYTHONPATH=. ./extra/optimization/generate_dataset.sh
gzip /tmp/sops
mv /tmp/sops.gz extra/datasets/
```

* fix that
2024-08-22 14:21:39 -04:00
nimlgen
6c4ddd6260 hcq skip tests when no multidev (#6235)
* hcq skip tests when no multidev

* linter

* a bit higher tinout
2024-08-22 18:27:16 +03:00
chenyu
08539f08b0 fix UOp repr with Variable in arg (#6236) 2024-08-22 11:06:33 -04:00