qazal
9e2089dcd4
don't raise Exception in process replay [pr] (#10392)
...
* don't raise Exception in process replay [pr]
* continue generating diffs unless [pr] is set, exit(1) otherwise
* change
* works
2025-05-18 11:23:23 +03:00
chenyu
9b4e2a75cd
symlink datasets in mlperf workflow (#10391)
2025-05-18 03:26:05 -04:00
uuuvn
f20c5aac1f
Use itertools.count instead of manual increment in remote (#10389)
...
Similar to how it's done with `UOp.unique_num`, looks a bit nicer
2025-05-18 00:15:37 -07:00
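The pattern is small enough to show; a minimal sketch, with illustrative class and method names rather than the actual remote code:
```python
import itertools

# before: a manually incremented counter
class Ids:
  def __init__(self): self.next_id = 0
  def new(self) -> int:
    ret = self.next_id
    self.next_id += 1
    return ret

# after: itertools.count() yields 0, 1, 2, ... on successive next() calls,
# the same pattern UOp.unique_num uses
class IdsCount:
  def __init__(self): self.counter = itertools.count()
  def new(self) -> int: return next(self.counter)
```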
qazal
0294bfe507
simpler can_pad (#10364)
...
* simpler can_pad [pr]
* 3 kernels
* tests
* less kernels
2025-05-18 10:00:07 +03:00
George Hotz
c91f2c4580
use float32 for sgd momentum (#10387)
2025-05-17 21:56:44 -07:00
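The rationale: momentum accumulates many small gradient contributions, and those round away in half precision, so the buffer is kept in float32 even when the weights are not. A hedged numpy sketch of one step, not tinygrad's optimizer code:
```python
import numpy as np

def sgd_momentum_step(param: np.ndarray, grad: np.ndarray, buf: np.ndarray,
                      lr: float = 0.01, momentum: float = 0.9):
  # buf stays float32 regardless of param dtype; accumulating it in float16
  # would lose small gradient contributions to rounding
  buf = momentum * buf + grad.astype(np.float32)
  param = (param.astype(np.float32) - lr * buf).astype(param.dtype)
  return param, buf
```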
George Hotz
305a3231c4
fix beam none if buf is optimized out (#10388)
2025-05-17 21:50:33 -07:00
George Hotz
6f77b938d7
Move getbits tests into test_helpers (#10382)
2025-05-17 17:04:00 -07:00
George Hotz
6ebfb505e9
docs: fix crossentropy name (#10377)
2025-05-17 16:39:14 -07:00
George Hotz
0b733ba75e
multi device training with GPT2 [pr] (#10375)
...
* multi device training with GPT2 [pr]
* Update grouper.py
2025-05-17 15:33:56 -07:00
George Hotz
6ec88d94df
add tests for multi ram usage [pr] (#10376)
2025-05-17 15:33:40 -07:00
uuuvn
5a18eab908
Fix __del__ in remote program (#10372)
...
Similar to #10341, broke after the hypothesis unpin
2025-05-17 21:29:44 +03:00
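The usual failure mode: at interpreter shutdown, attributes and module globals may already be torn down when `__del__` runs. A generic defensive sketch (the `conn` handle is hypothetical, not the PR's code):
```python
class RemoteProgram:
  def __del__(self):
    # at shutdown, anything __del__ relies on may already be gone,
    # so guard and never let teardown raise
    try:
      conn = getattr(self, "conn", None)
      if conn is not None: conn.close()
    except Exception:
      pass
```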
वेदांत
2453d99050
rms matching pytorch implementation (#10319)
...
* rms matching pytorch implementation
* pre commit fix
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-05-17 08:23:11 -07:00
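For reference, PyTorch's RMSNorm normalizes by the root mean square over the last dimension. A minimal numpy sketch of those semantics (the float32 upcast is a common choice, assumed here rather than taken from the PR):
```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
  # x / sqrt(mean(x^2) + eps) * weight, eps inside the sqrt as in torch.nn.RMSNorm
  x32 = x.astype(np.float32)
  rms = np.sqrt(np.mean(x32 * x32, axis=-1, keepdims=True) + eps)
  return (x32 / rms * weight).astype(x.dtype)
```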
nimlgen
da2b1834b4
hotfix: metal graph var vals (#10370)
2025-05-17 17:22:55 +03:00
qazal
e054b53a75
kernel count tests for pad [pr] (#10369)
...
* kernel count tests for pads
* handcoded rand one kernel
* comment
* prerealize device rng counter
* test_rand_handcoded generates /0
* remove track_rewrites
2025-05-17 17:20:46 +03:00
nimlgen
90c4bb10c0
fixedvars in all graphs (#10365)
...
* cuda fixedvars
* metal: fixedvars
* f
* ups
* count fixedvars
2025-05-17 16:18:52 +03:00
chenyu
efa8dfe7fb
test cron job to run resnet (#10368)
2025-05-17 08:57:02 -04:00
uuuvn
2c706d363e
Remote higher timeout and overridable via REMOTE_TIMEOUT (#10367)
...
sometimes a minute is not enough; 5 minutes should be, but if it isn't
for some huge workload it can be overridden
2025-05-17 15:30:49 +03:00
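tinygrad reads knobs like this through `helpers.getenv`; a sketch of the pattern (the 300-second default follows the commit message, but check the remote runtime for the real code):
```python
from tinygrad.helpers import getenv

# 5 minutes by default; override for huge workloads, e.g. REMOTE_TIMEOUT=600
REMOTE_TIMEOUT = getenv("REMOTE_TIMEOUT", 300)
```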
nimlgen
4fa1837916
metal: do not require icb fix on m3+ (#10366)
2025-05-17 15:30:40 +03:00
Xingyu
286b0f4051
Add equal function implementation and corresponding test (#10351)
...
- Implemented a new function `equal` in the torch backend to compare two tensors for equality.
- Added unit tests for the `equal` function to verify its correctness with different tensor inputs.
2025-05-16 23:39:49 -07:00
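A hedged sketch of `torch.equal`-style semantics (same shape, every element equal); the PR's exact code may differ:
```python
from tinygrad import Tensor

def equal(a: Tensor, b: Tensor) -> bool:
  # torch.equal returns False on shape mismatch rather than broadcasting
  if a.shape != b.shape: return False
  return bool((a == b).all().item())
```
So `equal(Tensor([1, 2]), Tensor([1, 2]))` is True while `equal(Tensor([1, 2]), Tensor([1, 3]))` is False.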
George Hotz
e13f2a3092
multi is O(1) (#10183)
...
* multi is O(1)
* allreduce
* no new uops needed
* junk
* something
* simple
* that's really what i want
* closer
* inject _device_num
* pretty print
* cleanups
* this
* early dnum
* ops allreduce is good
* ish
* device is the tuple and this is fine
* simpler
* progress
* copy_multi
* work
* more tests
* more tests pass
* work
* no None axis
* tests
* no none multi
* type fixes
* pre commit passes
* lil
* remove this
* mlperf dataloader on mac
* that test was wrong
* unbind
* support DEBUG=2
* realize
* only unbind bound vars
* don't include fixedvars
* graph test
* one test
* fixedvars in hcq
* new ring reduce
* ring reduce
* simpler ring
* mselect
* mselect doesn't work
* Revert "mselect doesn't work"
This reverts commit c78b77bd7d.
* Revert "mselect"
This reverts commit bb2e430ac3.
* simpler
* fixups
* no optional
* fix jit
* move things around
* cleanup multi
* simpler multi
* simpler reshape
2025-05-16 23:14:23 -07:00
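The ring reduce in the bullets is the standard ring allreduce: N devices each own one of N chunks, partial sums travel around the ring for N-1 steps (reduce-scatter), then the finished chunks circulate for another N-1 steps (allgather), so each device only ever exchanges data with its neighbor. A toy numpy simulation of the data movement, with no real devices involved:
```python
import numpy as np

def ring_allreduce(shards: list[np.ndarray]) -> list[np.ndarray]:
  # shards[i] is "device" i's full local array; afterwards every device holds the sum
  n = len(shards)
  chunks = [list(np.array_split(s.copy(), n)) for s in shards]
  # reduce-scatter: after n-1 steps, device i owns the full sum of chunk (i+1) % n
  for step in range(n - 1):
    sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy()) for i in range(n)]
    for i, c, data in sends: chunks[(i + 1) % n][c] += data
  # allgather: rotate the completed chunks around the ring for another n-1 steps
  for step in range(n - 1):
    sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n]) for i in range(n)]
    for i, c, data in sends: chunks[(i + 1) % n][c] = data
  return [np.concatenate(c) for c in chunks]
```
Each step moves only 1/N of the data per device, which is what makes the ring preferable to naive all-to-all copies.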
George Hotz
e1a40e8040
add hcq fixedvars support [pr] (#10356)
...
* add hcq fixedvars support [pr]
* different test
* fixedvars are only for comp_queues
* fix hcq varvals
2025-05-16 22:05:53 -07:00
George Hotz
11b5895c85
hotfix: schedule timing in tensor.py
2025-05-16 20:10:32 -07:00
uuuvn
64409a8bda
Remote beam (#10357)
...
* Use renderer properties instead of `.device`
* Remote beam
2025-05-16 18:59:22 -07:00
George Hotz
7cc35a031b
don't use UOp.multi in Tensor.rand (#10362)
2025-05-16 16:09:36 -07:00
George Hotz
7703dbef99
view substitute [pr] (#10360)
2025-05-16 15:08:24 -07:00
Elnur Rakhmatullin
de2b323d97
Fixed a typo in "simplify" (#10358)
2025-05-16 14:45:14 -07:00
Harald Schäfer
ee5258328a
You never want multiple backends (#10354)
2025-05-16 13:10:39 -07:00
George Hotz
876d2275a1
changes from new multi (#10353)
...
* changes from new multi
* revert hcq change
2025-05-16 13:07:29 -07:00
wozeparrot
66e00c04dd
fix: skip kernel timing tests on ci cuda (#10348)
2025-05-16 11:48:06 -07:00
Ignacio Sica
a54fd745c3
simpler barrier match in remu (#10339)
...
* s_barrier
* remove s_barrier from syncs
2025-05-16 14:40:58 +03:00
qazal
e9e5b54e43
grouper cleanups and merge with insert_kernels [pr] (#10349)
...
* grouper cleanups and merge with insert_kernels [pr]
* remove that
2025-05-16 14:39:56 +03:00
b1tg
caded2f413
llvm diagnostic error (#10267)
...
* llvm diagnostic info
* use decorator
* better error reporting
* fix mypy
* collect all diag msgs
* test diag error
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-05-16 02:03:20 -04:00
George Hotz
a4a25720b2
add test_multitensor_jit_input [pr] (#10347)
2025-05-15 20:47:57 -07:00
chenyu
c798f2f427
brew --quiet to suppress already installed warnings (#10346)
...
example: https://github.com/tinygrad/tinygrad/actions/runs/15057000247
2025-05-15 23:31:18 -04:00
wozeparrot
12a1ccc680
clean: double import (#10345)
2025-05-15 20:15:09 -07:00
wozeparrot
1ed04f993b
move benchmark stat tracking to influxdb (#10185)
2025-05-15 16:14:56 -07:00
wozeparrot
f59ecf2116
fix: mockgpu cuda timing (#10343)
2025-05-15 14:14:14 -07:00
nimlgen
a825608dc2
hcq: fix progs' __del__ when shutdown (#10341)
...
* debug ci
* better?
* and mute this?
* revrt that
2025-05-15 23:26:48 +03:00
Ignacio Sica
47b3055fe2
set fail-fast behavior (#10336)
2025-05-15 11:24:45 -07:00
uuuvn
c2bf2c6bb0
Remote offset (#10311)
...
For memory savings from the memory planner. Also, for some reason, it makes
hlb cifar on mac noticeably faster.
master:
```
3 210.12 ms run, 4.34 ms python, 205.78 ms REMOTE, 2075.90 loss, 0.002698 LR, 2.07 GB used, 1558.41 GFLOPS, 327.45 GOPS
4 210.40 ms run, 4.33 ms python, 206.07 ms REMOTE, 2481.94 loss, 0.002262 LR, 2.07 GB used, 1556.34 GFLOPS, 327.45 GOPS
5 188.08 ms run, 4.41 ms python, 183.67 ms REMOTE, 1967.49 loss, 0.001827 LR, 2.07 GB used, 1741.00 GFLOPS, 327.45 GOPS
6 211.19 ms run, 4.26 ms python, 206.93 ms REMOTE, 1511.62 loss, 0.001392 LR, 2.07 GB used, 1550.51 GFLOPS, 327.45 GOPS
```
this:
```
3 189.05 ms run, 4.50 ms python, 184.55 ms REMOTE, 2075.90 loss, 0.002698 LR, 1.60 GB used, 1732.08 GFLOPS, 327.45 GOPS
4 187.81 ms run, 4.11 ms python, 183.71 ms REMOTE, 2481.94 loss, 0.002262 LR, 1.60 GB used, 1743.49 GFLOPS, 327.45 GOPS
5 186.70 ms run, 4.09 ms python, 182.62 ms REMOTE, 1967.49 loss, 0.001827 LR, 1.60 GB used, 1753.89 GFLOPS, 327.45 GOPS
6 187.18 ms run, 4.06 ms python, 183.12 ms REMOTE, 1511.62 loss, 0.001392 LR, 1.60 GB used, 1749.36 GFLOPS, 327.45 GOPS
```
(`PYTHONPATH=. REMOTE=1 REMOTEDEV=METAL BS=256 STEPS=10 python examples/hlb_cifar10.py`)
Couldn't reliably reproduce the speedup on tinybox though.
2025-05-15 11:20:01 -07:00
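The memory savings come from sub-buffers: once the protocol carries an offset, the planner can pack several tensors into one backing allocation instead of giving each its own. A toy sketch of the idea, not the remote protocol itself:
```python
import numpy as np

backing = np.zeros(1 << 20, dtype=np.uint8)  # one shared arena

def subbuf(offset: int, size: int) -> np.ndarray:
  # zero-copy view: same allocation, different offset
  return backing[offset:offset + size]

x = subbuf(0, 4096)     # one tensor's storage
y = subbuf(4096, 4096)  # another tensor's storage in the same arena
```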
Ignacio Sica
3c453e96a9
add ds_load_b96 and ds_store_b96 instructions (#10338)
2025-05-15 18:11:08 +03:00
qazal
be8202b293
add s_abs_i32 instruction to remu (#10334)
2025-05-15 16:47:58 +03:00
nimlgen
5efbe1c947
print offset only for subbuf (#10332)
2025-05-15 15:35:19 +03:00
qazal
7cfe367c07
failing test for slow embedding kernel with FUSE_ARANGE=1 [pr] (#10330)
2025-05-15 14:58:11 +03:00
nimlgen
5f03688280
usbgpu: remove max_read_len (#10328)
2025-05-15 14:49:58 +03:00
qazal
27b3dbe67e
remove FUSE_ARANGE_UINT [pr] (#10324)
2025-05-15 14:39:54 +03:00
qazal
0a45cd0cbe
grouper: merge views in fuse elementwise (#10325)
...
* grouper: merge views in fuse elementwise
* with gradient api
2025-05-15 13:17:09 +03:00
qazal
89d8d5b25e
add dims check in FUSE_ARANGE (#10323)
2025-05-15 11:33:21 +03:00
qazal
8fad0f0124
grouper: check for unsafe PAD in FUSE (#10322)
2025-05-15 10:53:44 +03:00
chenyu
f008e5f233
test_dtype_alu should cast bf16 input (#10320)
...
when testing ALU ops for bfloat16, the inputs should be cast to bfloat16 first; otherwise the numpy reference carries both input-conversion error and ALU error, which is less accurate
2025-05-15 01:11:39 -04:00
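A minimal numpy sketch of the idea: round the inputs to bfloat16 precision before computing the reference, so the comparison measures only the ALU's error (simple truncation shown here; real bfloat16 rounds to nearest even):
```python
import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
  # emulate bfloat16 by dropping the low 16 bits of the float32 representation
  return (x.astype(np.float32).view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

def bf16_reference(op, a: np.ndarray, b: np.ndarray) -> np.ndarray:
  # cast the inputs first (the fix), then apply the op and round the result
  return to_bf16(op(to_bf16(a), to_bf16(b)))
```
e.g. `bf16_reference(np.add, xs, ys)` as the reference for a bfloat16 add.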