nimlgen
08ab184dfd
usbgpu: copyin over 100mb/s (#10259)
* usbgpu: over 100mb/s
* align
* h
2025-05-12 16:52:43 +03:00
Kirill R.
4c7c139102
Use cmod/cdiv in sym_infer (#10258)
* Use cmod/cdiv in sym_infer
* test
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-05-12 09:07:28 -04:00
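Several commits in this log (#10258, #10216, #10178) move symbolic div/mod folding onto cdiv/cmod helpers. As background, a minimal illustrative sketch of C-style truncated division and remainder versus Python's floor-based `//` and `%` — these are stand-in definitions for exposition, not tinygrad's actual helpers:

```python
def cdiv(x: int, y: int) -> int:
    # C-style division truncates toward zero; Python's // floors toward -inf,
    # so the two disagree whenever exactly one operand is negative.
    q = abs(x) // abs(y)
    return -q if (x < 0) != (y < 0) else q

def cmod(x: int, y: int) -> int:
    # C-style remainder takes the sign of the dividend,
    # chosen so that cdiv(x, y) * y + cmod(x, y) == x always holds.
    return x - cdiv(x, y) * y

# Python:  -7 // 2 == -4   and  -7 % 2 == 1
# C-style: cdiv(-7, 2) == -3  and  cmod(-7, 2) == -1
```

Folding rules derived with floor-division identities are not valid for truncated division around negative values, which is why the tests in these PRs exercise negative operands.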
chenyu
0015b3921f
sleep more in CI Remove amdgpu (#10261)
see if this is less flaky
2025-05-12 08:13:44 -04:00
qazal
95c6a736a9
fix FUSE_ARANGE=1 for bert (#10255)
2025-05-12 14:44:05 +03:00
Sieds Lykles
7c4b381fbf
Extra simplify valid test [pr] (#10256)
* add test
* Change the range
* add todo test
2025-05-12 07:32:03 -04:00
b1tg
7eeb35ba6f
fix AMD LLVM compile error for bf16 cifar (#10254)
* fix AMD LLVM compile error
* remove llvm_bf16_cast
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-05-12 01:57:07 -04:00
uuuvn
a0ed1ec1ae
Faster remote server (#10235)
2025-05-11 19:15:05 -07:00
b1tg
41f5ece877
add nsw flag (#10249)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-05-11 19:14:32 -07:00
George Hotz
8864ff894b
hotfix: that repeat_kv belongs outside the if
2025-05-11 18:43:01 -07:00
George Hotz
98c84a711d
min rectified flow example [pr] (#10252)
* work on minrf example
* more
* jit sample
* t is tensor not const
* fixes
* more convs
* fix dropout
* don't print
* 504
* big patch
* onehot
* touch
* use embeddings
* dumb uses final layer
* act
* non fl
* match
* tp
* 3
* of
* ppsz
* normal
* add adln
* no t
* weird transformer
* weird transformer
* contig
* actual speed fix
* dumb
* cb
* 0
* t is 0
* mort-t
* args
* dumb days are over
* readable
* contig
* no more t mask
* mask_t
* init to zero
* clean
* steps
* work
* tt
* t
* solid
2025-05-11 18:36:44 -07:00
chenyu
70c797b107
train bert tests (#10248)
added a working bert tiny test, and a failed bert FUSE_ARANGE test
2025-05-11 08:42:08 -04:00
George Hotz
b2df4cb696
add support for amdgpu-flat-work-group-size to AMD LLVM IR (#10246)
* add support for amdgpu-flat-work-group-size to AMD LLVM IR
* don't spam llvm init
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-05-10 19:11:10 -07:00
qazal
9210280811
add v_fmac_f16 vop3 instruction to remu (#10247)
* fmac vop3
* from the box
2025-05-10 23:48:25 +03:00
George Hotz
697259a8a1
amd_comgr_action_info_set_options was deprecated [pr] (#10245)
* amd_comgr_action_info_set_options was deprecated [pr]
* more standard
2025-05-10 11:59:04 -07:00
Kevin Buhler
2e0990c4e9
even spacing in viz nodes (#10168)
* even spacing in viz nodes
* precise dy value
* dominant-baseline text-after-edge
* add STROKE_WIDTH constant, delete dominant_baseline attr
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-05-10 10:35:10 +03:00
chenyu
d0e9b74f40
minor div_and_mod_folding cleanup [pr] (#10243)
remove type ignore and one walrus
2025-05-09 22:42:01 -04:00
Adam Van Ymeren
a28ca0680f
update dead link (#10242)
2025-05-09 19:59:52 -04:00
nimlgen
2145bce3f9
usbgpu: copyin size is 16k (#10240)
* usbgpu: copyin size is 16k
* ush
2025-05-09 22:12:54 +03:00
Sieds Lykles
74e40aafa0
use cdiv in div and mod folding (#10216)
* use cdiv
* use cdiv and cmod there as well
* Add tests
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-05-09 12:37:24 -04:00
Sieds Lykles
8da9c070ca
take gcd out of trunc div (#10238)
2025-05-09 12:08:10 -04:00
qazal
e2292f6663
TRACEMETA>=2 displays UOp metadata in VIZ (#10237)
2025-05-09 17:42:00 +03:00
qazal
d5686f33a9
delete KernelContext dataclass [pr] (#10236)
2025-05-09 17:36:21 +03:00
qazal
467daf8d4c
remap UOp metadata in graph_rewrite_map [pr] (#10234)
* remap metadata in graph_rewrite_map [pr]
* fix
* merge loops
* UOp.metadata returns Metadata|None
* shorter
2025-05-09 17:20:53 +03:00
nimlgen
4c75b124b6
usb: copy into mv is faster (#10233)
* usb: copy into mv is faster
* missing
* bytes
2025-05-09 14:53:36 +03:00
nimlgen
d08ce62553
hcq: do not reread signal in wait (#10232)
2025-05-09 14:38:36 +03:00
nimlgen
0464a31000
usbgpu: no overrun check needed (#10231)
2025-05-09 14:20:24 +03:00
nimlgen
116390083f
nvme speed write example (#10230)
2025-05-09 14:20:01 +03:00
chenyu
9846435c2e
fix test_div_numerator_negative (#10229)
the simplification was wrong with negative const_factor
2025-05-09 06:19:59 -04:00
chenyu
cba508c8c3
update uop symbolic tests (#10228)
clean up TODOs and update tests
2025-05-09 01:55:53 -04:00
chenyu
56def6c319
better bound for mod negative number (#10227)
2025-05-09 01:19:47 -04:00
chenyu
99f6d89dfb
tighter idiv bound for symbolic denominator (#10226)
2025-05-08 22:38:56 -04:00
uuuvn
82a6160ff7
Detect metal paravirtualization bug via device name instead of CI (#10225)
2025-05-08 19:31:47 -07:00
Xingyu
a21369d039
Enhance tensor random functions with dtype support (#10214)
* Enhance tensor random functions with dtype support
- Updated `aten.uniform_` and `aten.normal_` to include dtype parameter in backend.py
- Added unit tests for uniform and normal tensor generation with specific dtypes in test.py
* Refactor test name for clarity
- Renamed `test_normal_dtype` to `test_normal` in `extra/torch_backend/test.py`
- Aims to improve readability and better reflect the test's purpose
2025-05-08 20:48:07 -04:00
qazal
b6904bbf83
Revert "split grouper into insert and finalize stages [pr] (#10222)" (#10224)
This reverts commit 2594e4db15.
2025-05-09 03:02:38 +03:00
qazal
2594e4db15
split grouper into insert and finalize stages [pr] (#10222)
2025-05-09 02:36:22 +03:00
George Hotz
0b7e3e86d0
single device copy [pr] (#10221)
* single device copy [pr]
* simpler
2025-05-08 15:23:22 -07:00
qazal
1d0f239df7
use Tensor.train() in schedule test + typo [pr] (#10220)
2025-05-08 23:46:42 +03:00
qazal
ff2aa6d0b2
buffer in create_kernel is optional [pr] (#10218)
* buffer in create_kernel is optional [pr]
* pylint
2025-05-08 22:35:55 +03:00
qazal
40560e77c2
minor grouper + viz fixup [pr] (#10217)
* minor grouper + viz fixup [pr]
* gitignore mypy_cache
* reorder create_kernels
* replace with realized
* use tensor_map + viz before spec
* lint
* add that back
2025-05-08 21:39:44 +03:00
George Hotz
0411b09763
small changes from new multi [pr] (#10213)
2025-05-08 07:04:27 -07:00
Sieds Lykles
a0580e8d3c
Cleanup in div_and_mod_folding [pr] (#10178)
* Refactor binary var simplification
* Simplify the congruence logic
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-05-08 06:25:32 -07:00
nimlgen
267ba9b592
usbgpu: better names in copy speed benchmark (#10212)
2025-05-08 16:12:37 +03:00
hooved
7b4f05fd00
Add test for correctness of Infinity in WebGPU (#10201)
* use function for infinity instead of uniform
* test infinity math locally
* test infinity math in CI
* make pytest available to MacOS (WebGPU)
* revert to master except failing webgpu test
2025-05-08 05:20:05 -07:00
nimlgen
e24fe1c746
usbgpu: pci cache (#10207)
2025-05-08 14:31:01 +03:00
nimlgen
7d6ed1b1e9
hotfix: mac ci (#10210)
* fixed?
* cmnt
2025-05-08 14:13:23 +03:00
nimlgen
ba52fce4b2
usbgpu: benchmark in ci (#10208)
* usbgpu: benchmark
* usbgpu: benchmark
2025-05-08 12:02:04 +03:00
qazal
d0e3449992
remove view_supported_devices, check allocator instead [pr] (#10209)
2025-05-08 11:45:02 +03:00
nimlgen
5a7f6b4d8e
am: fix launch on rdna4 (#10206)
2025-05-08 09:46:12 +03:00
George Hotz
8d4c563c01
all COPY can be clone (#10205)
* match old behavior
* simple
* it means the naive thing before the multi
* fix
2025-05-07 20:31:39 -07:00
hooved
8e76c40aea
Refactor test: Enable generality in testing UOp alu expressions (#10200)
* use function for infinity instead of uniform
* test infinity math locally
* test infinity math in CI
* make pytest available to MacOS (WebGPU)
* revert to master except failing webgpu test
* isolate test refactor
2025-05-07 19:39:44 -07:00