Commit Graph

1242 Commits

George Hotz
2f2f933e50 fix buffer shape regression from onnx (#6678) 2024-09-23 16:58:42 +08:00
qazal
982086f54c UOps.VALID try 2 (#6623)
* make UOps.VALID compile

* fixable tests

* bufs dedup

* cleanup the CONST spec

* regenerate dataset with graph_rewrite

```py
# rewrite a CONST that carries a ShapeTracker source into VALID(st).where(const, 0)
def rewrite_const(const:UOp, st_src:UOp) -> UOp:
  st: ShapeTracker = st_src.arg
  return UOp(UOps.VALID, dtypes.bool, (st.to_uop(),)).where(UOp.const(const.dtype, const.arg), UOp.const(const.dtype, 0))
pm = PatternMatcher([(UPat(UOps.CONST, name="const", src=(UPat(UOps.SHAPETRACKER, name="st_src"),)), rewrite_const)])
```

* rm arg

* remove arg

* revert arg removal

This reverts commit 2c35c75c95.

* red test_pickle_define_var
2024-09-21 14:19:25 +08:00
Tobias Fischer
c1bbd15bd9 Sharded SDXL Inference (#6328)
* initial sharding fixes

* sigma device fix

* emptyline space fix

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-09-21 01:26:43 -04:00
nimlgen
641586cb87 qcom updated ioctl (#6627) 2024-09-20 18:02:40 +08:00
George Hotz
c4d5575c61 beat mlx at resnet 18 (#6611)
* work to beat mlx at resnet18 [run_process_replay]

* pruning

* wino sometimes

* shorter

* comment
2024-09-20 11:28:01 +08:00
qazal
9295bc0189 viz more work [run_process_replay] (#6568)
* infra

* found it

* real work

* bring those back

* cleanup test_viz

* comment that out
2024-09-17 19:27:09 +08:00
nimlgen
81a4a9623c add qcom dsp runtime (#6112)
* calling qualcomm dsp from python

* include so files

* add include file

* adsprpc.py

* running with adsprpc

* work

* 32-bit support in elf

* compilation works

* ion

* msm_ion

* working DSP backend

* getting 500 MFLOPS on matmul

* beam works with timing

* move to autogen

* disasm

* progress

* simple tests pass

* qcom_dsp

* more dsp autogen

* progress

* some progress

* works w/o lib

* checkpoint

* no lib

* ugh, better

* cleaner, but with lib. test good, but with the hack

* remove autogens

* small

* push

* simpler

* revert this

* run_3

* simpler

* android

* handle

* run it

* why?

* run2

* to gen

* cc

* cleaner

* elf

* part of autogen

* comment

* no lib

* autogen

* linter

* bug reproducer

* cleaner

* this repro is almost empty and doesn't work!!!!

* with this test_ops passes, no crashes anymore

* cleaner

* linter

* renames

* shorter

* remove contextlib

* ugh

* mypy

* cleaner

* cleaner

* remove import

* conn

* import

* revert this

* remove heavy .so

* shorter alloc

* not true anymore

---------

Co-authored-by: Comma Device <device@comma.ai>
Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: George Hotz <george@comma.ai>
2024-09-13 21:01:33 +03:00
George Hotz
bdd0c06f29 add void type to uop (#6471)
* unwrap_dtype maybe

* uopgraph stuff that hardcoded None

* test_ops passes

* dtypes.py fixups

* update test_linearizer and friends

* more ast updates

* test_beam and test_schedule too

* add void type to uop [run_process_replay]

* remove dumb casts

* start making it green

* more cast cleanups

* more cls methods to fix

* regenerate dataset

* split UOp and NOp const

* maybe that too

* fix docs

* update test_uop_symbolic

* test_verify_ast

* new sops with no diff

* meh, type_ignore is alright

* remove that assert

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-09-11 18:16:28 +08:00
Francis Lata
b7ce9a1530 UNet3D MLPerf (#3470)
* add training set transforms

* add DICE cross entropy loss

* convert pred and label to Tensor when calculating DICE score

* cleanups and allow train dataset batching

* fix DICE CE loss calculation

* jitted training step

* clean up DICE CE loss calculation

* initial support for sharding

* Revert "initial support for sharding"

This reverts commit e3670813b8.

* minor updates

* cleanup imports

* add support for sharding

* apply temp patch to try to avoid OOM

* revert cstyle changes

* add gradient acc

* hotfix

* add FP16 support

* add ability to train on smaller image sizes

* add support for saving and loading checkpoints + cleanup some various modes

* fix issue with using smaller patch size + update W&B logging

* disable LR_WARMUP_EPOCHS

* updates

* minor cleanups

* cleanup

* update order of transformations

* more cleanups

* realize loss

* cleanup

* more cleanup

* some cleanups

* add RAM usage

* minor cleanups

* add support for gradient accumulation

* cleanup imports

* minor updates to not use GA_STEPS

* remove FP16 option since it's available now globally

* update multi-GPU setup

* add timing logs for training loop

* go back to using existing dataloader and add ability to preprocess data to save time

* clean up optimization and re-enable JIT and multi-GPU support for training and evaluation

* free train and eval steps memory

* cleanups and scale batch size based on the number of GPUs

* fix GlobalCounters import

* fix seed

* fix W&B setup

* update batch size default size

* add back metric divergence check

* put back JIT on UNet3d eval

* move dataset preprocessing inside training code

* add test for dice_loss

* add config logging support to W&B and other cleanups

* change how default float is getting retrieved

* remove TinyJit import duplicate

* update config logging to W&B and remove JIT on eval_step

* no need for caching preprocessed data anymore

* fix how evaluation is ran and how often

* add support for LR scaling

* fix issue with gaussian being moved to scipy.signal.windows

* remove DICE loss unit test

* fix issue where loss isn't compatible with multiGPU

* add individual BEAM control for train and eval steps

* fix ndimage scipy import

* add BENCHMARK

* cleanups on BENCHMARK + fix on rand_flip augmentation during training

* cleanup train and eval BEAM envs

* add checkpointing support after every eval

* cleanup model_eval

* disable grad during eval

* use new preprocessing dataset mechanism

* remove unused import

* use training and inference_mode contexts

* start eval after benchmarking

* add data fetching time

* cleanup decorators

* more cleanups on training script

* add message during benchmarking mode

* realize when reassigning LR on scheduler and update default number of epochs

* add JIT on eval step

* remove JIT on eval_step

* add train dataloader for unet3d

* move checkpointing to be done after every epoch

* revert removal of JIT on unet3d inference

* save checkpoint if metric is not successful

* Revert "add train dataloader for unet3d"

This reverts commit c166d129df.

* Revert "Revert "add train dataloader for unet3d""

This reverts commit 36366c65d2.

* hotfix: seed was defaulting to a value of 0

* fix SEED value

* remove the usage of context managers for setting BEAM and going from training to inference

* support new stack API for calculating eval loss and metric

* Revert "remove the usage of context managers for setting BEAM and going from training to inference"

This reverts commit 2c0ba8d322.

* check training and test preprocessed folders separately

* clean up imports and log FUSE_CONV_BW

* use train and val preprocessing constants

* add kits19 dataset setup script

* update to use the new test decorator for disabling grad

* update kits19 dataset setup script

* add docs on how to train the model

* set default value for BASEDIR

* add detailed instruction about BASEDIR usage

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-09-10 04:37:28 -04:00
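
The DICE cross entropy loss referenced in the bullets above is not spelled out in the commit message. As a rough orientation only, a minimal binary-case sketch in tinygrad could look like the following; the names dice_score and dice_ce_loss, the smoothing term, and the equal averaging of the two terms are assumptions, not the MLPerf reference implementation:

```py
from tinygrad import Tensor

def dice_score(pred: Tensor, label: Tensor, smooth: float = 1e-6) -> Tensor:
  # soft dice: 2*|pred*label| / (|pred| + |label|), smoothed to avoid division by zero
  intersection = (pred * label).sum()
  return (2 * intersection + smooth) / (pred.sum() + label.sum() + smooth)

def dice_ce_loss(logits: Tensor, label: Tensor) -> Tensor:
  # average of (1 - dice) on the sigmoid outputs and binary cross entropy on the logits
  dice = 1 - dice_score(logits.sigmoid(), label)
  ce = logits.binary_crossentropy_logits(label)
  return (dice + ce) / 2
```
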
George Hotz
904f6a63fa Revert "Revert "cleanup process_replay/* namings [run_process_replay] (#6429)…" (#6442)
This reverts commit eda177da84.
2024-09-10 07:00:16 +08:00
George Hotz
dbd4536167 Revert "add UOps.VALID (#6387)" (#6441)
This reverts commit 8186e4e7d6.
2024-09-09 21:33:00 +08:00
George Hotz
eda177da84 Revert "cleanup process_replay/* namings [run_process_replay] (#6429)" (#6437)
This reverts commit f4e83b30b4.
2024-09-09 18:52:36 +08:00
qazal
f4e83b30b4 cleanup process_replay/* namings [run_process_replay] (#6429) 2024-09-09 16:59:04 +08:00
George Hotz
8186e4e7d6 add UOps.VALID (#6387)
* uops valid

* broke full_shape

* fixup that st (hardcoded asts still red)

* fixup DEFINE_VAR

debug

more debug

* start moving stuff to ast_const

* move test_linearizer

* move test_linearizer_failures to ast_const

* fixup test_schedule

* small diff change

* regenerate dataset

* fixup test_multitensor

* regen dataset try 2

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-09-09 16:58:43 +08:00
qazal
c5bae55ec8 new generate_dataset.sh (#6423)
* new generate_dataset.sh

* keep those there

* test: rm expected failures

* rename to extract
2024-09-09 15:13:07 +08:00
geohotstan
65da03e186 remove _slice [run_process_replay] (#6395)
* try

* pass

* clean up

* done

* I'm becoming dumber

* clean up 2

* remove useless max

* useless but make computer brrr [run_process_replay]

* try process replay

* try again

* 1 less line, just use pad2d
2024-09-08 09:12:39 +08:00
George Hotz
8f6d0485e7 hotfix: resnet to obj.device 2024-09-06 13:06:02 +08:00
George Hotz
9d72119a0c minor resnet cleanups (#6382)
* minor resnet cleanups

* that should have been long

* jit

* meh
2024-09-06 12:50:21 +08:00
George Hotz
86d34daac9 UOps.PHI -> UOps.ASSIGN [run_process_replay] (#6383) 2024-09-06 12:38:35 +08:00
George Hotz
72be31cb56 remove mla [run_process_replay] (#6357)
* remove mla

* other bad uses of const
2024-09-05 10:37:46 +08:00
Vyacheslav Pachkov
4c33192a8b add qcom runtime (#5213)
* qcom: driver init

* autogen stubs for msm_kgsl; also fixup ioctls to show numbers instead of _IOW macros

* autogen: add adreno commands and registers

* ops_qcom: QcomAllocator + signals

* fix EDEADLK in hwqueue, init timestamps, use opencl compiler for qcom

* qcom: we do not really need all these constants; input/output is enough

* qcom: perfctr for CS (do not really need all the rest)

* qcom: HALFREGFOOTPRINT and FULLREGFOOTPRINT are set to be around max

* qcom: explicitly set instruction len based on the shader size

* ops_qcom: Program init

extracts shader from open cl binary
sets input/output buffers
allocates stack
sets cs mode
runs shader

* use data64_le from helpers

* ops_qcom: use fill_kernargs for filling i/o buffers

* ops_qcom: add QcomCopyQueue just for api & set kernargs_args_offset

* new signals & fix exec

* add QCOM to the list of supported devices

* correct QcomComputeQueue._wait using CP_WAIT_REG_MEM

* fix exec, synchronize before copyout

* correct setting num_units for ST_SHADER

* fix gpu hangs on sigs with CP_MEM_WRITE, it is uncached mem anyway

* extract offsets to kernel arguments from opencl binary

* extract constants values and offsets from opencl binary

* handle KGSL_MEMFLAGS_USE_CPU_MAP correctly

* align kernel name to 4 bytes when skipping kernel opencl struct

* skip to consts directly using an offset from opencl binary header

* fix alloc

* get halfreg and fullreg from opencl bin

* set unmultiplied global sizes as kernel group in HLSQ_CS_NDRANGE

* parse prg offset from open cl binary

* save loc with HLSQ_CS_CNTL. set this with HLSQ_CONTROL_2_REG

* support for vals in _fill_kernargs

* support 16-bit constants

* use KGSL_CONTEXT_NO_FAULT_TOLERANCE for contexts

this helps to not fall down when executing big kernels

    /* Don't time out if the context has disabled it */
    if (drawobj->context->flags & KGSL_CONTEXT_NO_FAULT_TOLERANCE)
        return;

* minor changes of _exec

* QCOMRenderer

* disable HCQGraph for demo. TODO: support HCQ update api

* support HCQ

- remove copy queue
- add updates
- add strides for buffs and vars for QCOM

* bufs_stride

* clean ups

* linter

* call super().__init__(value) in QcomSignal

* disable=unused-import

* mypy

* type ignore when queue is on the device

* fix

* query gpu_id.
Will be useful for selecting commands e.g. CP_EVENT_WRITE vs
CP_EVENT_WRITE7

* working timestamps

* free context after device is done

* move gpu stack to the device

* reserve some space with lib_gpu for gpu to write to

this fixes test_interpolate_bilinear

* exclude tests that fails with GPU=1 on qualcomm

* lint

* unmap mem in _gpu_free

* ctxt priority and preemption policy

* remove old qcom

* pass size to self.device.allocator.free

* skip tests only on qcom

* use kgsl and adreno defines instead of numeric vals

* use allocator for allocating lib_gpu

* update to QcomArgsState from master

* intermediate commit while conquering images

* enable image tests on qcom

* fix shader disasm size, dump textures stuff

* working images

* allow signals to be 0

* set branchstack from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* set shared memory size from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* update images in QcomArgsState & less loc for images

* set stack sizes from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* stack allocation based on OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* better autogen for kgsl and adreno. no more bitshifts

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* cleanup commit for parse cl lib

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* dont forget actual generated files

* refactor + less loc

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* device.py back

* lint

* ruff

* timestamp divisor

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* fix tex fmt & round global size

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* dtypes

* 19.2MHz

* -1 loc in _update_exec

* remove noqa

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-09-02 19:35:47 +03:00
wozeparrot
cb61cfce24 feat: example and extra tweaks (#6310) 2024-08-28 19:26:11 -07:00
gswangg
94a72d44d2 update CI tests in extra with UOp AST (#6290) 2024-08-28 22:26:50 +03:00
Tobias Fischer
3517aa89d9 sdxl batched inference fixes (#6293) 2024-08-28 07:44:58 -04:00
Tobias Fischer
211bfb6d8a fixed batched clip computation (#6292) 2024-08-26 20:48:15 -04:00
Tobias Fischer
331b0f5477 new clip gather (#6277) 2024-08-25 19:27:24 -04:00
qazal
bcb2f1caa3 init REDUCE_AXIS with BinaryOps (#6256)
* REDUCE_AXIS arg with BinaryOps

* more work in kernel.py
fixup sops.gz

* fix TestGraphRewriteEfficiency
2024-08-24 11:28:41 +03:00
qazal
0d4887e9df use UOps.WMMA everywhere (#6255)
* add UOps.WMMA_AXIS

* delete ReduceOps.WMMA from ops
2024-08-23 15:03:26 -04:00
chenyu
590c0922b6 Tensor.prod (#6250)
* Tensor.prod

a new reduce op!

* onnx ReduceProd
2024-08-23 10:06:32 -04:00
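
As a quick illustration of the new reduce op added above, a minimal usage sketch (the axis/keepdim arguments are assumed to mirror the other Tensor reductions like sum and max):

```py
from tinygrad import Tensor

t = Tensor([[1., 2., 3.], [4., 5., 6.]])
print(t.prod().item())         # product over all elements -> 720.0
print(t.prod(axis=1).numpy())  # per-row products -> [6., 120.]
```
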
chenyu
e745e16441 remove UnaryOps.NEG (#6238)
* Remove UnaryOps.NEG

generated new dataset with
```
time JIT=2 PYTHONPATH=. ./extra/optimization/generate_dataset.sh
gzip /tmp/sops
mv /tmp/sops.gz extra/datasets/
```

* fix that
2024-08-22 14:21:39 -04:00
Francis Lam
7376b67e36 extra/gemm/triton_nv_matmul: fix Program arguments (#6212)
remove op_estimate
2024-08-20 14:05:38 -07:00
Francis Lata
8fd8b970b0 update URL to eval cases from recent MLPerf file movements (#6201) 2024-08-20 08:43:13 -04:00
chenyu
9db2d0d5c6 fix some type error in onnx [run_process_replay] (#6153) 2024-08-17 19:54:20 -04:00
chenyu
7c9c8ce22f use TensorProto enum in onnx dtype mapping [run_process_replay] (#6151) 2024-08-17 17:58:40 -04:00
George Hotz
9bc81c6db4 UOps.SHAPETRACKER (#6129)
* UOps.SHAPETRACKER [run_process_replay]

* no process replay
2024-08-16 23:26:34 -07:00
George Hotz
89c7989659 no shapetracker in ops [run_process_replay] (#6117) 2024-08-16 17:23:27 -07:00
George Hotz
74ee9febec remove iter from uopgraph (#6110)
* remove iter from uopgraph

* linearize returns uops

* fix tests

* linearize in linearize

* tests fix

* touchup

* test failures
2024-08-16 15:58:29 -07:00
qazal
28c75bf2a6 merge uops with ops (#6111)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-08-16 18:17:57 -04:00
qazal
c23d44c779 AST is UOp (#6030)
* most of the work from the uops2 branch

* schedule

* realize

* kernel

* lowerer

* search

* green

* merge uops with ops

* Revert "merge uops with ops"

This reverts commit 1408a59f12.

* fix benchmark

* remove extra dedup
2024-08-16 22:09:00 +03:00
CaltropHungerton
38fb1e14a2 Intel XMX Tensor Core Support (#5622)
* fixed xmx demo

* i think i'm invoking the DPAS but it's slow

* compiler build arg to stop register spilling, indicated where to fix flop counter

* don't mind this

* do NOT mind me

* do not mind me

* do not view

* i will add bf16 later

* in process of figuring out tc fields

* we figured out the fields!!!

* added check for cl device vendor, added separate IntelRenderer

* remove tc thread_local_aliases

* cleaning debris before draft pr

* edits for linter

* deduping and checking device extensions

* i will find more line reductions in other places

* before merge upstream

* double grf size in compiler to fix register spilling (bandaid), device checking changes

* tc python emulation

* fixed emulation

* tests for emulated intel tensor core

* TC=0, 1 working on upstream, fixed perf

* test

* debris

* check for specialized cl device when we canonicalize device

* bf16 support, tc=3 test added

* address tests

* revert half2 loads on intel tc, cleanup

* linter

* fold_expanded revert

* lint, whitespace fix

* cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too

* make line shorter, no need for noqa E501

* removed device intel

* fix python emulation

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-08-16 09:19:21 -07:00
nimlgen
7ab531aede autogen cleanup (#6064)
* start autogen cleanup

* nvgpu

* better?

* better

* amd part

* gpu regen

* fix mockgpu amd

* nv

* amd fix linter

* remove import

* ugh

* nv on master

* amd on master
2024-08-14 20:20:35 +03:00
wozeparrot
059cf2a90d feat: autogen from kernel register offset headers (#6056) 2024-08-12 14:08:35 -07:00
wozeparrot
dc2617bffd feat: use more correct reg for local dims (#6048) 2024-08-12 11:15:37 -07:00
chenyu
e6c7c3e499 update pylint path to check indent/space for all (#6022)
also fixed many errors. it was not checking nested dirs. exclude autogen for now.

can we use ruff for this?
2024-08-10 14:41:09 -04:00
wozeparrot
d269bc95fa faster tinychat (#5993) 2024-08-08 19:16:26 -07:00
George Hotz
bc55c8a30e pmatmul example + GB/s bugfix [run_process_replay] (#5974)
* pmatmul example + bugfix

* improve pmatmul

* Update real_pmatmul.py
2024-08-07 22:32:11 -07:00
George Hotz
bf8ec23b00 hotfix: contiguous on precompute_freqs_cis 2024-08-07 14:40:56 -07:00
wozeparrot
5808e8a30f mockgpu remu changes (#5925) 2024-08-05 19:26:58 -07:00
wozeparrot
6740a0a6a0 hip_ioctl changes (#5917) 2024-08-05 11:58:38 -07:00
chenyu
996ff0c135 pow(2) -> square in RMSNorm [run_process_replay] (#5901)
reads nicer in metadata
2024-08-04 14:21:31 -04:00