Commit Graph

1363 Commits

Author SHA1 Message Date
chenyu
e745e16441 remove UnaryOps.NEG (#6238)
* Remove UnaryOps.NEG

generated new dataset with
```
time JIT=2 PYTHONPATH=. ./extra/optimization/generate_dataset.sh
gzip /tmp/sops
mv /tmp/sops.gz extra/datasets/
```

* fix that
2024-08-22 14:21:39 -04:00
Francis Lam
7376b67e36 extra/gemm/triton_nv_matmul: fix Program arguments (#6212)
remove op_estimate
2024-08-20 14:05:38 -07:00
Francis Lata
8fd8b970b0 update URL to eval cases from recent MLPerf file movements (#6201) 2024-08-20 08:43:13 -04:00
chenyu
9db2d0d5c6 fix some type error in onnx [run_process_replay] (#6153) 2024-08-17 19:54:20 -04:00
chenyu
7c9c8ce22f use TensorProto enum in onnx dtype mapping [run_process_replay] (#6151) 2024-08-17 17:58:40 -04:00
George Hotz
9bc81c6db4 UOps.SHAPETRACKER (#6129)
* UOps.SHAPETRACKER [run_process_replay]

* no process replay
2024-08-16 23:26:34 -07:00
George Hotz
89c7989659 no shapetracker in ops [run_process_replay] (#6117) 2024-08-16 17:23:27 -07:00
George Hotz
74ee9febec remove iter from uopgraph (#6110)
* remove iter from uopgraph

* linearize returns uops

* fix tests

* linearize in linearize

* tests fix

* touchup

* test failures
2024-08-16 15:58:29 -07:00
qazal
28c75bf2a6 merge uops with ops (#6111)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-08-16 18:17:57 -04:00
qazal
c23d44c779 AST is UOp (#6030)
* most of the work from the uops2 branch

* schedule

* realize

* kernel

* lowerer

* search

* green

* merge uops with ops

* Revert "merge uops with ops"

This reverts commit 1408a59f12.

* fix benchmark

* remove extra dedup
2024-08-16 22:09:00 +03:00
CaltropHungerton
38fb1e14a2 Intel XMX Tensor Core Support (#5622)
* fixed xmx demo

* i think i'm invoking the DPAS but it's slow

* compiler build arg to stop register spilling, indicated where to fix flop counter

* don't mind this

* do NOT mind me

* do not mind me

* do not view

* i will add bf16 later

* in process of figuring out tc fields

* we figured out the fields!!!

* added check for cl device vendor, added seperate IntelRenderer

* remove tc thread_local_aliases

* cleaning debris before draft pr

* edits for linter

* deduping and checking device extensions

* i will find more line reductions in other places

* before merge upstream

* double grf size in compiler to fix register spilling (bandaid), device checking changes

* tc python emulation

* fixed emulation

* tests for emulated intel tensor core

* TC=0, 1 working on upstream, fixed perf

* test

* debris

* check for specialized cl device when we canonicalize device

* bf16 support, tc=3 test added

* address tests

* revert half2 loads on intel tc, cleanup

* linter

* fold_expanded revert

* lint, whitespace fix

* cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too

* make line shorter, no need for noqa E501

* removed device intel

* fix python emulation

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-08-16 09:19:21 -07:00
nimlgen
7ab531aede autogen cleanup (#6064)
* start autogen cleanup

* nvgpu

* better?

* better

* amd part

* gpu regen

* fix mockgpu amd

* nv

* amd fix linter

* remove import

* ugh

* nv on master

* amd on master
2024-08-14 20:20:35 +03:00
wozeparrot
059cf2a90d feat: autogen from kernel register offset headers (#6056) 2024-08-12 14:08:35 -07:00
wozeparrot
dc2617bffd feat: use more correct reg for local dims (#6048) 2024-08-12 11:15:37 -07:00
chenyu
e6c7c3e499 update pylint path to check indent/space for all (#6022)
also fixed many errors. it was not checking nested dirs. exclude autogen for now.

can we use ruff for this?
2024-08-10 14:41:09 -04:00
wozeparrot
d269bc95fa faster tinychat (#5993) 2024-08-08 19:16:26 -07:00
George Hotz
bc55c8a30e pmatmul example + GB/s bugfix [run_process_replay] (#5974)
* pmatmul example + bugfix

* improve pmatmul

* Update real_pmatmul.py
2024-08-07 22:32:11 -07:00
George Hotz
bf8ec23b00 hotfix: contiguous on precompute_freqs_cis 2024-08-07 14:40:56 -07:00
wozeparrot
5808e8a30f mockgpu remu changes (#5925) 2024-08-05 19:26:58 -07:00
wozeparrot
6740a0a6a0 hip_ioctl changes (#5917) 2024-08-05 11:58:38 -07:00
chenyu
996ff0c135 pow(2) -> square in RMSNorm [run_process_replay] (#5901)
reads nicer in metadata
2024-08-04 14:21:31 -04:00
Elias Wahl
4a114756f6 New BERT dataloader (#5881)
* One file == One topic

* update test

* new dataloader

* update train script

* get index is faster
2024-08-02 15:12:23 -04:00
nimlgen
34168a64e3 optimize nv profiler (#5856)
* nv profiler fix

* cleanup hcq a bit

* fixes

* fix

* typo

* all signals put timestamp

* a bit cleaner

* merge fields

* type

* import

* tiny fix
2024-08-01 23:57:45 +03:00
Vyacheslav Pachkov
610e454132 fix opencl_ioctl on comma (#5814)
- remove unused code
- add CP_REG_TO_MEM opcode
- fixed parse_cmd_buf for more than 1 command object by correcting
an offset
- fixed memory mappings for cases when memory was allocated with
KGSL_MEMFLAGS_USE_CPU_MAP.
KGSL_MEMFLAGS_USE_CPU_MAP: If set on call and return, the returned GPU
address will be 0. Calling mmap() will set the GPU address.
So there are no IOCTL_KGSL_GPUOBJ_INFO ioctls for that type of memory
and it resulted to crash right after get_mem.
2024-07-30 20:44:06 -07:00
David Hou
9a485f36e4 shard kvcache (#5830) 2024-07-30 20:29:54 -07:00
George Hotz
4e89d45513 hotfix: put contiguous back in llama 2024-07-30 18:43:48 -07:00
George Hotz
21c5e8e1b7 extreme llama speed, 57.34 tok/s (#5827)
* extreme llama speed

* mergable
2024-07-30 18:32:09 -07:00
George Hotz
e6879035a0 work to make GEMV fast (#5824)
* work to make GEMV fast

* half8 cast

* align struct

* fix amd

* float8 is a later problem
2024-07-30 17:41:40 -07:00
Francis Lata
ce61be16f1 clean up how preprocessed folder is defined (#5813) 2024-07-30 12:35:26 -04:00
chenyu
471b188d79 fix mypy errors in latest mypy (#5794)
* fix mypy errors in latest mypy

mypy has stricter partial and api arg checks now

* PYTHONPATH="."
2024-07-29 14:53:30 -04:00
nimlgen
ea27ec4cd0 nv switch classlist_v2 to classlist (#5763)
* nv switch classlist_v2 to classlist

* support in mockgpu

* fix mockgpu
2024-07-28 20:24:42 +03:00
chenyu
3686b6726a move GraphException to jit.py (#5744)
same place where GraphRunner is defined
2024-07-26 19:01:12 -04:00
George Hotz
489a5b99a5 hotfix: triton_nv_matmul touchups 2024-07-24 23:24:29 +00:00
George Hotz
bf24be4c8c triton gets 163 TFLOPS on 4090 2024-07-24 18:32:29 +00:00
George Hotz
4d47968580 fix acc folding for NV tensor cores (#5658)
* fix acc folding for NV tensor cores

* fix correctness of reduce_before_expand
2024-07-23 13:03:02 -07:00
nimlgen
08a9c0ae5e hcq cache invalidation for beam (#5630)
* nv full cache invalidation

* the same command on amd

* linter

* fix amd

* nv no hardcoded consts

* beam default
2024-07-22 18:13:17 +03:00
George Hotz
6c6d74d922 parallel mcts (#5626)
* start work on parallel mcts

* compile was linearizing twice

* typing + more early stopping

* fix compiler error
2024-07-21 14:53:23 -07:00
George Hotz
ef179087a4 mcts exit condition wasn't right, also use it with BEAM>=100 (#5619)
* mcts exit condition wasn't right, also use it with BEAM>=100

* mcts touchups

* clean up sample
2024-07-21 10:16:47 -07:00
George Hotz
0f67ef4674 mcts graph and dedup support (#5618)
* mcts graph and dedup support

* usable graph

* mcts colors

* C=4 seems better

* C=3 even better

* sample_tree

* backprop is external function

* late expand to match algo
2024-07-20 23:29:14 -07:00
chenyu
eddc5bcfd7 MCTS tweaks (#5616)
MCTS 500 is competitive with BEAM=8 on resnet on M1 Max.
- increment trial times even with compiled error and runtime error.
- use best time of children as the node value.
2024-07-20 19:45:59 -07:00
George Hotz
1113e47f96 print best in MCTS + light up the winner in hcopt 2024-07-20 09:39:36 -07:00
George Hotz
ac99ecd94e use statistics.median for timing (#5606) 2024-07-20 08:37:32 -07:00
George Hotz
06e336bccb mcts search (#5598)
* mcts search

* mcts cleanups

* mcts cleanup

* random shuffle children order

* mcts in handcode_opt

* src and remove_node

* debug 3 to print ast

* print the type

* mcts in extra
2024-07-19 21:38:39 -07:00
Tobias Fischer
72da3fe7e6 added clip vision model (#5595)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-19 18:35:51 -04:00
George Hotz
fa7e734b49 MetaOps.KERNEL (#5543) 2024-07-17 19:41:23 -07:00
Francis Lam
2d53abb04a test/external/fuzz_linearizer: fix for new AST changes (#5519)
* test/external/fuzz_linearizer: fix for new AST changes

also add beautiful_mnist failures

* add CLANG and LLVM to test_failure_35 failed_platforms

* fix test_linearizer_failure names
2024-07-17 00:08:07 -04:00
Tobias Fischer
85d4ca7caa FID Inception Model (#5516)
* added model impl

* minor cleanups

* extracted weights loading into from_pretrained

* reorganized model for better weight loading

* removed lru cache for state dict loading
2024-07-16 23:12:03 -04:00
chenyu
28972418c4 s/get_linearizer/get_kernel [run_process_replay] (#5467) 2024-07-13 20:32:22 -04:00
George Hotz
03c2dc8bd7 lowerer is kernel [run_process_replay] (#5437) 2024-07-12 18:50:55 -07:00
chenyu
00813a92a0 update Tensor.eye api to match torch (#5433)
* update Tensor.eye api to match torch

input is n for nrows and optional m for ncols

* space

* fix onnx
2024-07-12 20:25:12 -04:00