Commit Graph

5780 Commits

Author SHA1 Message Date
nimlgen
bf645d62b3 qcom docs (#6338) 2024-09-02 20:42:20 +03:00
nimlgen
d22b46a2ac qcom in benchmarks (#6337) 2024-09-02 19:59:11 +03:00
Vyacheslav Pachkov
4c33192a8b add qcom runtime (#5213)
* qcom: driver init

* autogen stubs for msm_kgsl; also fix up ioctls to show numbers instead of _IOW macros
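
(For reference: _IOW just packs direction, argument size, ioctl type, and number into one integer, so emitting plain numbers loses nothing. A sketch of the Linux asm-generic encoding:)

    # Linux ioctl encoding: bits 31-30 direction, 29-16 arg size, 15-8 type, 7-0 number
    _IOC_WRITE = 1

    def _IOW(ioc_type: int, nr: int, size: int) -> int:
        return (_IOC_WRITE << 30) | (size << 16) | (ioc_type << 8) | nr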

* autogen: add adreno commands and registers

* ops_qcom: QcomAllocator + signals

* fix EDEADLK in hwqueue, init timestamps, use opencl compiler for qcom

* qcom: we do not really need all these constants; input/output is enough

* qcom: perfctr for CS (do not really need all the rest)

* qcom: HALFREGFOOTPRINT and FULLREGFOOTPRINT are set to be around max

* qcom: explicitly set instruction len based on the shader size

* ops_qcom: Program init

extracts the shader from the OpenCL binary
sets input/output buffers
allocates stack
sets cs mode
runs shader
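
(A toy sketch of the extraction step. The header layout below is made up for illustration; the real Qualcomm OpenCL binary parsing lives in ops_qcom:)

    import struct

    # Toy illustration only: pull a blob out of a container via (offset, size)
    # fields in a little-endian header. The actual header format is ops_qcom's.
    def extract_shader(lib: bytes, header_off: int = 0) -> bytes:
        off, size = struct.unpack_from("<II", lib, header_off)
        return lib[off:off + size]

    demo = struct.pack("<II", 8, 4) + b"\xde\xad\xbe\xef"
    assert extract_shader(demo) == b"\xde\xad\xbe\xef"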

* use data64_le from helpers
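
(data64_le splits a 64-bit value into two 32-bit words, low dword first, for packing into command buffers; tinygrad's helper is essentially:)

    def data64_le(data: int) -> tuple[int, int]:
        # low 32 bits first, then high 32: little-endian dword order for packets
        return (data & 0xFFFFFFFF, data >> 32)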

* ops_qcom: use fill_kernargs for filling i/o buffers

* ops_qcom: add QcomCopyQueue just for api & set kernargs_args_offset

* new signals & fix exec

* add QCOM to the list of supported devices

* correct QcomComputeQueue._wait using CP_WAIT_REG_MEM

* fix exec, synchronize before copyout

* correct setting num_units for ST_SHADER

* fix GPU hangs on signals with CP_MEM_WRITE; it is uncached memory anyway

* extract offsets to kernel arguments from opencl binary

* extract constant values and offsets from the OpenCL binary

* handle KGSL_MEMFLAGS_USE_CPU_MAP correctly

* align kernel name to 4 bytes when skipping kernel opencl struct
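
(the standard power-of-two round-up, for reference:)

    def align_up(x: int, a: int = 4) -> int:
        # round x up to the next multiple of a (a must be a power of two)
        return (x + a - 1) & ~(a - 1)

    assert align_up(5) == 8 and align_up(8) == 8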

* skip to consts directly using an offset from opencl binary header

* fix alloc

* get halfreg and fullreg from opencl bin

* set unmultiplied global sizes as kernel group in HLSQ_CS_NDRANGE

* parse prg offset from the OpenCL binary

* save loc with HLSQ_CS_CNTL. set this with HLSQ_CONTROL_2_REG

* support for vals in _fill_kernargs

* support 16-bit constants

* use KGSL_CONTEXT_NO_FAULT_TOLERANCE for contexts

this helps avoid crashing when executing big kernels

    /* Don't time out if the context has disabled it */
    if (drawobj->context->flags & KGSL_CONTEXT_NO_FAULT_TOLERANCE)
        return;
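
(the quoted check is from the kgsl kernel driver. A hedged sketch of requesting the flag at context creation; constants copied from msm_kgsl.h, but double-check the ioctl encoding against your header:)

    import ctypes, fcntl

    class kgsl_drawctxt_create(ctypes.Structure):
        _fields_ = [("flags", ctypes.c_uint32), ("drawctxt_id", ctypes.c_uint32)]

    KGSL_CONTEXT_NO_FAULT_TOLERANCE = 0x00000200  # from msm_kgsl.h
    IOCTL_KGSL_DRAWCTXT_CREATE = ((3 << 30) | (ctypes.sizeof(kgsl_drawctxt_create) << 16)
                                  | (0x09 << 8) | 0x13)  # _IOWR(0x09, 0x13, ...)

    def create_ctx(fd: int, flags: int = KGSL_CONTEXT_NO_FAULT_TOLERANCE) -> int:
        req = kgsl_drawctxt_create(flags=flags)
        fcntl.ioctl(fd, IOCTL_KGSL_DRAWCTXT_CREATE, req)
        return req.drawctxt_id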

* minor changes of _exec

* QCOMRenderer

* disable HCQGraph for demo. TODO: support HCQ update api

* support HCQ

- remove copy queue
- add updates
- add strides for buffs and vars for QCOM

* bufs_stride

* clean ups

* linter

* call super().__init__(value) in QcomSignal

* disable=unused-import

* mypy

* type ignore when queue is on the device

* fix

* query gpu_id. Will be useful for selecting commands, e.g. CP_EVENT_WRITE vs CP_EVENT_WRITE7
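
(e.g. something along these lines; opcodes are passed in, and the generation cutoff is an assumption, not taken from the driver:)

    # Illustrative only: dispatch the packet choice on the queried gpu_id.
    def event_write_op(gpu_id: int, cp_event_write: int, cp_event_write7: int) -> int:
        return cp_event_write7 if gpu_id >= 700 else cp_event_write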

* working timestamps

* free context after device is done

* move gpu stack to the device

* reserve some space with lib_gpu for gpu to write to

this fixes test_interpolate_bilinear

* exclude tests that fail with GPU=1 on Qualcomm

* lint

* unmap mem in _gpu_free

* context priority and preemption policy

* remove old qcom

* pass size to self.device.allocator.free

* skip tests only on qcom

* use kgsl and adreno defines instead of numeric vals

* use allocator for allocating lib_gpu

* update to QcomArgsState from master

* intermediate commit while conquering images

* enable image tests on qcom

* fix shader disasm size, dump textures stuff

* working images

* allow signals to be 0

* set branchstack from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* set shared memory size from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* update images in QcomArgsState & less loc for images

* set stack sizes from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* stack allocation based on OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* better autogen for kgsl and adreno. no more bitshifts

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* cleanup commit for parse cl lib

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* don't forget the actual generated files

* refactor + less loc

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* device.py back

* lint

* ruff

* timestamp divisor

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* fix tex fmt & round global size

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* dtypes

* 19.2MHz
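
(19.2 MHz is the always-on counter rate, hence the timestamp divisor above:)

    def ticks_to_us(ticks: int) -> float:
        # always-on counter runs at 19.2 MHz -> 19.2 ticks per microsecond
        return ticks / 19.2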

* -1 loc in _update_exec

* remove noqa

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-09-02 19:35:47 +03:00
nimlgen
8e2a3fc165 raise line count to 9300 for qcom (#6336) 2024-09-02 18:57:57 +03:00
George Hotz
e6ae332a26 hotfix: FIX_METAL_ICB isn't needed on M3 2024-08-31 11:50:02 -07:00
George Hotz
406ec8240e hotfix: lin_fail_41 passes on my M3 Max 2024-08-31 11:46:46 -07:00
Roelof van Dijk
ad4b3b457f bump limit for test_llama_embedding_opt (#6332) 2024-08-31 10:03:43 -04:00
George Hotz
72939901fc hotfix: ebs print kernel names 2024-08-29 21:20:36 -07:00
George Hotz
365babe391 precompute early_reject [run_process_replay] (#6327)
* precompute early_reject [run_process_replay]

* features for ebs

* fix ocelot cache
2024-08-29 18:26:24 -07:00
George Hotz
385904526f remove more rules [run_process_replay] (#6326)
* remove more rules [run_process_replay]

* disable invalid test

* ptx needs that str
2024-08-29 16:27:10 -07:00
George Hotz
23081c4580 folding rules cleanup [run_process_replay] (#6325)
* folding rules cleanup [run_process_replay]

* tighten combine
2024-08-29 15:15:44 -07:00
George Hotz
56cd25e43f dearg consts [run_process_replay] (#6324) 2024-08-29 15:00:22 -07:00
nimlgen
9b616cb33e HCQArgsState lifetime docs (#6323) 2024-08-30 00:31:49 +03:00
Roelof van Dijk
56b7fadc2f perf: skip type verify with -O (#6319) 2024-08-29 13:47:51 -07:00
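(How the -O skip works, in miniature: python3 -O sets __debug__ to False and strips assert statements. A sketch, not the exact tinygrad guard:)

    def type_verify(uops):  # stand-in for the real, expensive check
        print("verifying", len(uops), "uops")

    if __debug__:  # False under `python3 -O`, so the check disappears entirely
        type_verify([])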
qazal
7a08b881ed st_fixup explicit UOp init [run_process_replay] (#6320) 2024-08-29 23:21:10 +03:00
qazal
539654fbe1 graph_rewrite complexity tests [run_process_replay] (#6317) 2024-08-29 22:39:08 +03:00
qazal
07942ef361 Proposal: Better UOps.SWIZZLE (#6309)
* better UOps.SWIZZLE

* test_swizzle_rewrite

* add it to docs

* show a diff

* a lil more verbose

* two teeny notes

* hotfix: sink
2024-08-29 15:39:48 +03:00
qazal
8c50ef8b7c start uop docs (#6291)
* start uop docs

* only need show_labels

* sink comes first

* hotfix: invalid

* touchups

* 2 space indent works

* limit some buffer uops

* better BARRIER doc, Op -> UOp when it makes sense.

* make KernelInfo optional

* more work

relative links don't work

* this can be local in multi reduce+pads

* add UOps.SHAPETRACKER details

* UOps.CONST both types

* nit: local buffer isn't device Buffer, habit

* nit2: dtype -> DType
2024-08-29 15:22:39 +03:00
qazal
dd4e5f1c8d process replay rewrite (#6284)
* process replay rewrite

p2

* start some unittests + exceptions and exits

* shebang

* remove extra kernel init
2024-08-29 15:08:27 +03:00
pedro
7de4eac8f7 add support and tests for nearest modes in interpolate, adapt uint8 bilinear to torch implementation (#6308)
* add `nearest` mode to interpolate

matching pytorch `nearest`, which is known to be buggy

+ relevant TestOps

* add `nearest-exact` mode to interpolate

matching pytorch `nearest-exact`

+ relevant TestOps

* fix uint8 bilinear interpolation

by matching custom torch implementation

* implement uint8 lerp with torch interpolation trick

without converting it to float
2024-08-28 21:59:51 -07:00
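(A fixed-point sketch of that lerp trick: weights in Q8, intermediate math in int32, no float tensor data. The exact rounding torch uses may differ:)

    import numpy as np

    def lerp_u8(a: np.ndarray, b: np.ndarray, w: float) -> np.ndarray:
        wi = int(w * 256)                 # weight in Q8 fixed point
        d = b.astype(np.int32) - a        # signed difference, no overflow
        return (a + ((d * wi + 128) >> 8)).astype(np.uint8)

    x, y = np.uint8([0, 100, 200]), np.uint8([255, 110, 100])
    print(lerp_u8(x, y, 0.5))  # midpoints, computed without float tensors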
George Hotz
638b4843da fix for metal ICB issue on M1/M2 [run_process_replay] (#6313)
* this is a working fix

* better comment

* repro
2024-08-28 21:31:14 -07:00
wozeparrot
cb61cfce24 feat: example and extra tweaks (#6310) 2024-08-28 19:26:11 -07:00
wozeparrot
ea5b7910b7 AMD support gfx103x (#5926) 2024-08-28 14:17:08 -07:00
gswangg
94a72d44d2 update CI tests in extra with UOp AST (#6290) 2024-08-28 22:26:50 +03:00
Tobias Fischer
3517aa89d9 sdxl batched inference fixes (#6293) 2024-08-28 07:44:58 -04:00
Roelof van Dijk
85591bd1ae no need for functools here (#6303) 2024-08-28 01:19:57 -07:00
nimlgen
b1e5343133 nv better error msg for p2p failure (#6301)
* nv better error msg for p2p failure

* linter

* from

* mypy
2024-08-28 01:40:45 +03:00
nimlgen
ac303146ca nv ensure qmd addr is less than 40 bits (#6288) 2024-08-27 20:47:38 +03:00
George Hotz
5ed6c6ef3e hotfix: 220V 15A -> 220V 20A 2024-08-27 10:20:43 -07:00
qazal
ec34d9ee36 start benchmarking ast graph rewrite (#6297)
* ast_rewrite to ctx var

* add external_benchmark_ast

* refactor to asts

* track lazybuffers

* more work

* record checkpoint

* cleanup
2024-08-27 18:18:44 +03:00
qazal
552fbd5527 update llm.c with UOp ast [run_process_replay] (#6296) 2024-08-27 15:04:54 +03:00
Tobias Fischer
211bfb6d8a fixed batched clip computation (#6292) 2024-08-26 20:48:15 -04:00
ignaciosica
3918f6eea0 refactor amd render_kernel (#6223)
* refactor amd render_kernel

* fix spacing

* add half alias back

* use itemsize * 8 instead of fixed values

* reverting because it broke, as 32 was no longer the default

* remove comment

* remove nested tuples

* hotfix: prefix.append

* hotfix2: is not None

* more diff cleanups

* hotfix 4: spacing changes must not be in the same diff

* revert wmma dtype rendering

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-08-27 00:28:36 +08:00
ignaciosica
3132449086 refactor _make_{cuda/clang}_dtype into render_vector_prefix (#6287) 2024-08-26 09:14:44 -07:00
Max-We
ab2714423b Add einsum tests (#6286)
Co-authored-by: Maximilian Weichart <maximilian.weichart@icloud.com>
2024-08-26 09:09:25 -07:00
chenyu
b76f0c875e lazy const fold idiv 1 (#6285) 2024-08-26 10:29:59 -04:00
chenyu
af7c04ff57 Tensor.__floordiv__ (#6283)
support Tensor.__floordiv__ and friends
2024-08-26 09:43:40 -04:00
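(Usage once this lands; the reflected form is inferred from "and friends" and is an assumption:)

    from tinygrad import Tensor

    t = Tensor([7, 8, 9])
    print((t // 2).tolist())   # [3, 4, 4]
    print((20 // t).tolist())  # reflected form, assuming __rfloordiv__ is among the "friends"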
qazal
d2f8eeed2e make [compare_schedule] the default [run_process_replay] (#6273)
* make [compare_schedule] the default

* capture ctx

* logging

* set capture to false
2024-08-26 21:40:03 +08:00
qazal
067aeaeb2f single arange fusion with graph rewrite (#6160) 2024-08-26 18:18:16 +08:00
qazal
b4381e9777 uop output_st is Optional [run_process_replay] (#6282) 2024-08-26 17:58:55 +08:00
qazal
1c0456af89 add UOps.SWIZZLE (#6271)
* add UOps.SWIZZLE

* flip swizzle init

* generic st_fixup
2024-08-26 16:08:51 +08:00
CaltropHungerton
002f60b4c3 fix intel wmma flop counting, add flop counting tests for different tensor cores (#6192)
* fix wmma flop counting on intel, add count tests

* half

* add half gemm

* Update test.yml

* one test

* Update test_uops_stats.py

* Update test_uops_stats.py

* Update test_uops_stats.py

* smaller matrix, use unittest skipUnless decorator
2024-08-25 18:37:05 -07:00
Tobias Fischer
331b0f5477 new clip gather (#6277) 2024-08-25 19:27:24 -04:00
qazal
f0cc8ca5f2 generic st_fixup in scheduler graph rewrite [compare_schedule] (#6278) 2024-08-25 11:02:17 +03:00
qazal
70015bd89c move permute_reduces to uop movementops [run_process_replay] (#6272) 2024-08-25 10:25:51 +03:00
chenyu
b86907c6c7 UOp.const(x.dtype, y) -> x.const(y) [run_process_replay] (#6276) 2024-08-24 21:39:50 -04:00
chenyu
00282afa41 identity element of binary ops (#6275)
helper for the value a reduce acc is initialized to (0 for ADD, 1 for MUL, and -inf for MAX)
2024-08-24 18:10:19 -04:00
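(A tiny sketch of such a helper; names are illustrative, not the actual tinygrad API:)

    import math

    def identity_element(op: str) -> float:
        # the accumulator init value: combining it with x leaves x unchanged
        return {"ADD": 0, "MUL": 1, "MAX": -math.inf}[op]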
qazal
ee245b48a9 refactor reduceop swizzling (prep for UOps.SWIZZLE) [compare_schedule] (#6269) 2024-08-24 18:17:19 +03:00
gswangg
3cf507ae7f remove extra.ops and LazyOp support from Kernel (#6267)
* remove extra.ops and BufferOps

* remove extra.ops and LazyOp support in Kernel
2024-08-24 16:44:38 +03:00
qazal
ccb05d8baa fixup neg tests [run_process_replay] (#6268) 2024-08-24 16:35:43 +03:00