Commit Graph

10633 Commits

Author SHA1 Message Date
nimlgen
17fa6e7619 disk: better error desc when not opened (#9028) 2025-02-11 16:31:04 +03:00
nimlgen
166670a2f2 nv: fill grid/block sizes (#9025) 2025-02-11 16:30:30 +03:00
qazal
c80603285e bring back some things from the fix_kernel_ops diff [pr] (#9027)
* bring fix_kernel_ops back [pr]

* fix
2025-02-11 14:20:31 +01:00
George Hotz
9209b85c91 add UOps.CAT (#9022)
* add UOps.CAT [pr]

* comment + no pr
2025-02-11 19:50:37 +08:00
George Hotz
a521260b7a dont reduce the ptr size, sz is base for unaligned [pr] (#9023) 2025-02-11 19:50:23 +08:00
George Hotz
d0d58a6771 add CUSTOM support to cstyle (#9020) 2025-02-11 18:02:58 +08:00
George Hotz
fb698920f1 revert scheduler change (#9019)
* Revert "cleanup ast rewriter [pr] (#9012)"

This reverts commit bf0bcb2d5a.

* Revert "kernel op cleanups + use ScheduleItem [pr] (#9009)"

This reverts commit c52cd2b437.

* Revert "construct the schedule sink 2 (#8925)"

This reverts commit cfd3db7862.
2025-02-11 11:34:12 +08:00
George Hotz
16e9e4db37 make llvm opt the default (#9017) 2025-02-11 10:08:45 +08:00
divinity76
bec4f59ce8 workaround f16 cast ambiguity (#8935)
For unknown reasons, without this change I get the error below when trying to execute "Llama 3.2 1B". FWIW, I do not know the performance impact of this change. I can't even get exo running, but this change allows me to *get further* (before running into a separate issue with VRAM allocation; a story for another day, I suppose). A short sketch of the ambiguity and the workaround follows this entry, after the error log.

error: 
```
Failed to fetch completions: Error processing prompt (see logs with DEBUG>=2): Nvrtc Error 6, NVRTC_ERROR_COMPILATION <null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
            function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
    *((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
                                                                                 ^

<null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
            function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
    *((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
                                                                                                ^

<null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
            function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
    *((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
                                                                                                               ^

<null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
            function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
    *((half4*)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3)));
                                                                                                                              ^

4 errors detected in the compilation of "<null>".
```
2025-02-11 09:38:56 +08:00
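For context, the ambiguity above arises because `__nv_bfloat16` provides implicit conversions to float and to several integer types, while `__half` has a constructor for each of them, so a plain `(half)` cast from a bfloat16 value gives NVRTC more than one equally good user-defined conversion path. Below is a minimal sketch of the failing cast and the usual workaround, an explicit intermediate cast through float; it assumes CUDA 11+ headers and an sm_80 target and is only an illustration of the issue, not the actual patch in #8935.

```cuda
// bf16_to_half.cu -- compile with: nvcc -arch=sm_80 -c bf16_to_half.cu
#include <cuda_fp16.h>   // __half / half, with __half(float), __half(int), ... constructors
#include <cuda_bf16.h>   // __nv_bfloat16, with conversions to float and integer types

__global__ void bf16_to_half(const __nv_bfloat16 *in, half *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  // out[i] = (half)in[i];          // ambiguous: more than one user-defined
  //                                // conversion from "nv_bfloat16" to "half" applies
  out[i] = (half)((float)in[i]);    // forcing the float path selects __half::__half(float)
}
```

Casting through float is lossless here (every bfloat16 value is exactly representable as a float), so the workaround only changes how the conversion is spelled, not its result.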
chenyu
b741a9aae7 update doc of Tensor.tolist (#9016)
it returns a single value for a const tensor
2025-02-10 16:51:23 -05:00
Joel
04e64765c4 Minor typo in ReadMe (#9015) 2025-02-10 15:30:20 -05:00
chenyu
6c39aa4a6b adjust cuda ci test targets (#9014) 2025-02-10 15:29:59 -05:00
nimlgen
dfc9d6827f am_smi: print power state (#9013) 2025-02-10 23:07:39 +03:00
qazal
bf0bcb2d5a cleanup ast rewriter [pr] (#9012) 2025-02-10 19:07:59 +01:00
chenyu
586e48d696 a few more backward tests now pass (#9010) 2025-02-10 12:46:21 -05:00
chenyu
f9898f7554 update gpuocelot commit (#9011) 2025-02-10 12:18:44 -05:00
qazal
c52cd2b437 kernel op cleanups + use ScheduleItem [pr] (#9009) 2025-02-10 17:54:30 +01:00
chenyu
25fa5e4d5f enable backward tests in test_std_one_in_axis [pr] (#9007)
still one correction=0 case is broken

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-02-10 10:44:05 -05:00
qazal
d426f1ad6e don't open devices in lowering (#9008) 2025-02-10 15:28:51 +01:00
qazal
cfd3db7862 construct the schedule sink 2 (#8925)
* work

* delete preload

* fix metadata

* this can keep existing

* assign pruning

* dedup early

* bfs

* cycle asserts

* move assign check

* once
2025-02-10 22:23:02 +08:00
nimlgen
3e005ca0c2 am: resize bar0 to max supported (#9006) 2025-02-10 16:48:44 +03:00
nimlgen
07cb7e701c am: fix gfx usage at 100% (#9003)
* am: fix gfx usage at 100%

* not need

* not needed

* fix power con

* not supported on 7600
2025-02-10 16:48:23 +03:00
nimlgen
f91409f038 am: fix proclogs (#9004) 2025-02-10 16:38:58 +03:00
qazal
cd77e51810 fix tensor realization bug in #8975 (#8984)
* fix tensor realization bug in #8975

* that's a reshape now

* work

* works

* give those tests better names

* test when multiple mops result in the same ShapeTracker

* test_become_existing_buf_complex is enough

* that too
2025-02-10 13:51:30 +01:00
qazal
b17ec42b56 remove const_arg (#9002)
* remove const_arg

* use -m pytest

* remove test_const_arg test, variable arg on CONST does not exist.

* use base in test_const_dtype
2025-02-10 12:45:11 +01:00
George Hotz
0568720a68 delete revectorize (#9000)
* delete revectorize

* test vectorized LLVM/CLANG

* idk about that

* was that the segfault?
2025-02-10 18:32:35 +08:00
qazal
fd9f9ec772 realized base tensors become RESHAPE(BUFFER) [pr] (#8994) 2025-02-10 10:17:54 +01:00
George Hotz
910ae260cd dsp float4 fold + revectorize [pr] (#8995)
* dsp float4 fold [pr]

* revectorize

* fix reg issue

* no bool vectorize

* cleanups

* no need for that
2025-02-10 12:14:32 +08:00
George Hotz
e618efce22 COMMUTATIVE flipping is only for ints (#8996)
* COMMUTATIVE flipping is only for ints [pr]

* no pr

* comm fixes this
2025-02-10 12:01:28 +08:00
George Hotz
2983285315 use HEX_REG_QEMU_INSN_CNT from qemu as a DSP timer [pr] (#8993)
* use HEX_REG_QEMU_INSN_CNT from qemu as a DSP timer [pr]

* add quantize test to dsp

* fix tests

* older onnx

* debug, let's see what's happening
2025-02-10 11:07:35 +08:00
chenyu
9119716761 update Tensor.maximum (#8992)
now it's just broadcast and UOp.maximum
2025-02-09 21:26:27 -05:00
nimlgen
88add71c25 amd: increase sdma copy size (#8989)
* amd: increase sdma max copy size

* rm this

* fix

* fx

* ops
2025-02-09 20:53:35 +03:00
qazal
7eba5fb413 Tensor.empty is RESHAPE(BUFFER) (#8987)
* empty is RESHAPE(BUFFER)

* eh

* add test_empty_buf

* can we unsupport this

* linter

* Revert "can we unsupport this"

This reverts commit 0f71e1aadb.
2025-02-09 18:42:51 +01:00
qazal
44479f8ad6 raise ValueError in view reshape for negative dims [pr] (#8988) 2025-02-09 17:27:15 +01:00
nimlgen
c6c2373bc0 replace libpciaccess autogen with just pci regs (#8983)
* replace libpciaccess autogen with just pci regs

* add pci.py
2025-02-09 18:40:45 +03:00
qazal
55351ebb31 minimal failing test for #8975 [pr] (#8982) 2025-02-09 14:10:37 +01:00
nimlgen
e5a3f60fc2 am: remove libpciaccess dep (#8980)
* am: remove libpciaccess dep

* offset in mockhwiface

* op

* fake regions
2025-02-09 16:06:55 +03:00
nimlgen
52a69dd5e9 Revert "use am in training benchmarks (#8965)" (#8981)
This reverts commit 107e616857.
2025-02-09 15:43:45 +03:00
George Hotz
0b26cee2f1 fix some slow tests [pr] (#8979) 2025-02-09 15:57:04 +08:00
George Hotz
208097d488 try reducing testing deps [pr] (#8976)
* reduce testing deps

* break out test models

* add PR to models, add models to metal

* okay, not that

* mac cleanup

* mac typo

* other typo
2025-02-09 15:22:32 +08:00
George Hotz
6ffee2fca9 reduce speed example [pr] (#8978)
* reduce speed example

* fast like a nascar
2025-02-09 14:13:59 +08:00
Samuel Ayala
ac3765c043 use getpass instead of os.getlogin() (#8972) 2025-02-08 23:29:26 +03:00
qazal
308516e439 fix viz paginate + cleanups [pr] (#8973)
* fix viz paginate [pr]

* cleanups

* remove the extra font definition

* more work

* none for the first graph
2025-02-08 20:26:57 +01:00
nimlgen
107e616857 use am in training benchmarks (#8965)
* am in training benchmarks

* fix

* not needed anymore
2025-02-08 20:20:47 +03:00
nimlgen
79de980565 am: do not fork pci bars (#8969) 2025-02-08 19:03:17 +03:00
chenyu
0cac941af1 move xpow to sym instead of late_rewrite (#8968)
does not need to be in late_rewrite and can be simplified further
2025-02-08 10:09:24 -05:00
qazal
e7182bbb2c fix "fatal bad object" log in process replay [pr] (#8966) 2025-02-08 11:57:38 +01:00
uuuvn
9b9c1e14da Late MTLCompiler load (#8963)
Moved loading MTLCompiler (and trying to load normal llvm before it) into MetalCompiler, like in CPUProgram, using a helper
2025-02-08 17:29:23 +08:00
George Hotz
a3c78d47b3 speed docs + upgrades [pr] (#8964)
* add some docs about speed [pr]

* better torch gemm

* enable locals on llvm/clang

* disable locals for beam speed on LLVM/CLANG

* 0x20 alignment in llvm allows ymm use
2025-02-08 17:28:52 +08:00
George Hotz
5bdd6a1cc4 increase CI speed with more runners [pr] (#8961)
* increase CI speed with more runners [pr]

* splits + cleanups [pr]

* more runners

* need that dep

* split that too

* can't be minimal

* move test readme

* bugfix + naming

* one more split

* bump to 22.04
2025-02-08 09:04:36 +08:00