Commit Graph

8863 Commits

qazal
9e2089dcd4 don't raise Exception in process replay [pr] (#10392)
* don't raise Exception in process replay [pr]

* continue generating diffs unless [pr] is set, exit(1) otherwise

* change

* works
2025-05-18 11:23:23 +03:00
chenyu
9b4e2a75cd symlink datasets in mlperf workflow (#10391) 2025-05-18 03:26:05 -04:00
uuuvn
f20c5aac1f Use itertools.count instead of manual increment in remote (#10389)
Similar to how it's done with `UOp.unique_num`; looks a bit nicer
2025-05-18 00:15:37 -07:00
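
A minimal sketch of the swap this commit describes (the class and method names here are illustrative, not tinygrad's actual remote code): `itertools.count` moves the counter state into an iterator, so ID generation becomes a single `next()` call.

```
import itertools

# manual increment: counter state lives on the object, +1 at every use site
class SessionManual:
  def __init__(self): self._next_num = 0
  def new_id(self) -> int:
    ret = self._next_num
    self._next_num += 1
    return ret

# itertools.count: the iterator owns the state; each next() yields 0, 1, 2, ...
class SessionCounted:
  def __init__(self): self._ids = itertools.count()
  def new_id(self) -> int: return next(self._ids)

s = SessionCounted()
print(s.new_id(), s.new_id())  # 0 1
```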
qazal
0294bfe507 simpler can_pad (#10364)
* simpler can_pad [pr]

* 3 kernels

* tests

* less kernels
2025-05-18 10:00:07 +03:00
George Hotz
c91f2c4580 use float32 for sgd momentum (#10387) 2025-05-17 21:56:44 -07:00
George Hotz
305a3231c4 fix beam none if buf is optimized out (#10388) 2025-05-17 21:50:33 -07:00
George Hotz
6f77b938d7 Move getbits tests into test_helpers (#10382) 2025-05-17 17:04:00 -07:00
George Hotz
6ebfb505e9 docs: fix crossentropy name (#10377) 2025-05-17 16:39:14 -07:00
George Hotz
0b733ba75e multi device training with GPT2 [pr] (#10375)
* multi device training with GPT2 [pr]

* Update grouper.py
2025-05-17 15:33:56 -07:00
George Hotz
6ec88d94df add tests for multi ram usage [pr] (#10376) 2025-05-17 15:33:40 -07:00
uuuvn
5a18eab908 Fix __del__ in remote program (#10372)
Similar to #10341, broke after the hypothesis unpin
2025-05-17 21:29:44 +03:00
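
The shutdown hazard behind this fix, sketched with hypothetical names (`RemoteProgram` and `session.send` stand in for the real remote API): during interpreter finalization, attributes and module globals may already be gone, so a finalizer has to guard everything it touches.

```
import sys

class RemoteProgram:  # hypothetical stand-in for the real remote program class
  def __init__(self, session, name:str):
    self.session, self.name = session, name
  def __del__(self):
    # __del__ can run during interpreter shutdown, when module globals and other
    # objects may already be torn down, so guard every lookup it makes
    session = getattr(self, "session", None)
    if session is None or sys.is_finalizing(): return
    try: session.send("ProgramFree", self.name)  # hypothetical RPC call
    except Exception: pass  # best effort: never raise out of __del__
```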
वेदांत
2453d99050 rms matching pytorch implementation (#10319)
* rms matching pytorch implementation

* pre commit fix

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-05-17 08:23:11 -07:00
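
For reference, PyTorch's `nn.RMSNorm` normalizes by the root mean square with eps added inside the square root; a numpy sketch of that formula:

```
import numpy as np

def rms_norm(x:np.ndarray, weight:np.ndarray, eps:float=1e-6) -> np.ndarray:
  # PyTorch nn.RMSNorm: x * rsqrt(mean(x^2, last dim) + eps) * weight.
  # note eps goes *inside* the sqrt; putting it outside is a common mismatch
  return x / np.sqrt(np.mean(np.square(x), axis=-1, keepdims=True) + eps) * weight

x = np.random.randn(2, 8).astype(np.float32)
print(rms_norm(x, np.ones(8, dtype=np.float32)).shape)  # (2, 8)
```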
nimlgen
da2b1834b4 hotfix: metal graph var vals (#10370) 2025-05-17 17:22:55 +03:00
qazal
e054b53a75 kernel count tests for pad [pr] (#10369)
* kernel count tests for pads

* handcoded rand one kernel

* comment

* prerealize device rng counter

* test_rand_handcoded generates /0

* remove track_rewrites
2025-05-17 17:20:46 +03:00
nimlgen
90c4bb10c0 fixedvars in all graphs (#10365)
* cuda fixedvars

* metal: fixedvars

* f

* ups

* count fixedvars
2025-05-17 16:18:52 +03:00
chenyu
efa8dfe7fb test cron job to run resnet (#10368) 2025-05-17 08:57:02 -04:00
uuuvn
2c706d363e Remote higher timeout and overridable via REMOTE_TIMEOUT (#10367)
Sometimes a minute is not enough; five minutes should be, but if it isn't for
some huge workload, the timeout can be overridden.
2025-05-17 15:30:49 +03:00
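
A minimal sketch of the override pattern (plain `os.environ` here; tinygrad has its own env helpers): default to five minutes, and let the environment raise it for huge workloads.

```
import os

# default to 300s (five minutes); REMOTE_TIMEOUT=900 in the environment overrides
REMOTE_TIMEOUT = float(os.environ.get("REMOTE_TIMEOUT", 300))

# e.g. passed to whatever client talks to the remote device:
# resp = http_session.get(url, timeout=REMOTE_TIMEOUT)
```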
nimlgen
4fa1837916 metal: do not require icb fix on m3+ (#10366) 2025-05-17 15:30:40 +03:00
Xingyu
286b0f4051 Add equal function implementation and corresponding test (#10351)
- Implemented a new function `equal` in the torch backend to compare two tensors for equality.
- Added unit tests for the `equal` function to verify its correctness with different tensor inputs.
2025-05-16 23:39:49 -07:00
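
One plausible shape for such an `equal`, sketched with tinygrad `Tensor`s and `torch.equal` semantics (True iff same shape and all elements equal); this is not necessarily the PR's exact implementation:

```
from tinygrad import Tensor

def equal(a:Tensor, b:Tensor) -> bool:
  # torch.equal semantics: True iff same shape and all elements equal
  if a.shape != b.shape: return False
  return bool((a == b).all().item())

print(equal(Tensor([1, 2, 3]), Tensor([1, 2, 3])))  # True
print(equal(Tensor([1, 2, 3]), Tensor([1, 2, 4])))  # False
```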
George Hotz
e13f2a3092 multi is O(1) (#10183)
* multi is O(1)

* allreduce

* no new uops needed

* junk

* something

* simple

* that's really what i want

* closer

* inject _device_num

* pretty print

* cleanups

* this

* early dnum

* ops allreduce is good

* ish

* device is the tuple and this is fine

* simpler

* progress

* copy_multi

* work

* more tests

* more tests pass

* work

* no None axis

* tests

* no none multi

* type fixes

* pre commit passes

* lil

* remove this

* mlperf dataloader on mac

* that test was wrong

* unbind

* support DEBUG=2

* realize

* only unbind bound vars

* don't include fixedvars

* graph test

* one test

* fixedvars in hcq

* new ring reduce

* ring reduce

* simpler ring

* mselect

* mselect doesn't work

* Revert "mselect doesn't work"

This reverts commit c78b77bd7d.

* Revert "mselect"

This reverts commit bb2e430ac3.

* simpler

* fixups

* no optional

* fix jit

* move things around

* cleanup multi

* simpler multi

* simpler reshape
2025-05-16 23:14:23 -07:00
George Hotz
e1a40e8040 add hcq fixedvars support [pr] (#10356)
* add hcq fixedvars support [pr]

* different test

* fixedvars are only for comp_queues

* fix hcq varvals
2025-05-16 22:05:53 -07:00
George Hotz
11b5895c85 hotfix: schedule timing in tensor.py 2025-05-16 20:10:32 -07:00
uuuvn
64409a8bda Remote beam (#10357)
* Use renderer properties instead of `.device`

* Remote beam
2025-05-16 18:59:22 -07:00
George Hotz
7cc35a031b don't use UOp.multi in Tensor.rand (#10362) 2025-05-16 16:09:36 -07:00
George Hotz
7703dbef99 view substitute [pr] (#10360) 2025-05-16 15:08:24 -07:00
Elnur Rakhmatullin
de2b323d97 Fixed a typo in "simplify" (#10358) 2025-05-16 14:45:14 -07:00
Harald Schäfer
ee5258328a You never want multiple backends (#10354) 2025-05-16 13:10:39 -07:00
George Hotz
876d2275a1 changes from new multi (#10353)
* changes from new multi

* revert hcq change
2025-05-16 13:07:29 -07:00
wozeparrot
66e00c04dd fix: skip kernel timing tests on ci cuda (#10348) 2025-05-16 11:48:06 -07:00
Ignacio Sica
a54fd745c3 simpler barrier match in remu (#10339)
* s_barrier

* remove s_barrier from syncs
2025-05-16 14:40:58 +03:00
qazal
e9e5b54e43 grouper cleanups and merge with insert_kernels [pr] (#10349)
* grouper cleanups and merge with insert_kernels [pr]

* remove that
2025-05-16 14:39:56 +03:00
b1tg
caded2f413 llvm diagnostic error (#10267)
* llvm diagnostic info

* use decorator

* better error reporting

* fix mypy

* collect all diag msgs

* test diag error

---------

Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-05-16 02:03:20 -04:00
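
A hedged sketch of the decorator-plus-collection pattern the bullets describe, with hypothetical names (`DiagnosticError`, `diag_msgs`, `Compiler`); a real backend would register an LLVM diagnostic callback that appends to the list:

```
import functools

class DiagnosticError(RuntimeError): pass  # hypothetical error type

def collect_diagnostics(fn):
  # run the wrapped compile, then raise one error carrying *all* collected
  # diagnostic messages instead of only the first (or a silent failure)
  @functools.wraps(fn)
  def wrapper(self, *args, **kwargs):
    self.diag_msgs: list[str] = []  # a diagnostic callback would append here
    ret = fn(self, *args, **kwargs)
    if self.diag_msgs: raise DiagnosticError("\n".join(self.diag_msgs))
    return ret
  return wrapper

class Compiler:  # hypothetical backend
  @collect_diagnostics
  def compile(self, src:str) -> bytes:
    if "bad" in src: self.diag_msgs.append(f"error compiling: {src}")
    return b""
```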
George Hotz
a4a25720b2 add test_multitensor_jit_input [pr] (#10347) 2025-05-15 20:47:57 -07:00
chenyu
c798f2f427 brew --quiet to suppress already installed warnings (#10346)
example https://github.com/tinygrad/tinygrad/actions/runs/15057000247
2025-05-15 23:31:18 -04:00
wozeparrot
12a1ccc680 clean: double import (#10345) 2025-05-15 20:15:09 -07:00
wozeparrot
1ed04f993b move benchmark stat tracking to influxdb (#10185) 2025-05-15 16:14:56 -07:00
wozeparrot
f59ecf2116 fix: mockgpu cuda timing (#10343) 2025-05-15 14:14:14 -07:00
nimlgen
a825608dc2 hcq: fix progs' __del__ when shutdown (#10341)
* debug ci

* better?

* and mute this?

* revert that
2025-05-15 23:26:48 +03:00
Ignacio Sica
47b3055fe2 set fail-fast behavior (#10336) 2025-05-15 11:24:45 -07:00
uuuvn
c2bf2c6bb0 Remote offset (#10311)
For memory savings from the memory planner. Also, for some reason it makes hlb
cifar on mac noticeably faster.

master:
```
  3  210.12 ms run,    4.34 ms python,  205.78 ms REMOTE, 2075.90 loss, 0.002698 LR, 2.07 GB used,   1558.41 GFLOPS,    327.45 GOPS
  4  210.40 ms run,    4.33 ms python,  206.07 ms REMOTE, 2481.94 loss, 0.002262 LR, 2.07 GB used,   1556.34 GFLOPS,    327.45 GOPS
  5  188.08 ms run,    4.41 ms python,  183.67 ms REMOTE, 1967.49 loss, 0.001827 LR, 2.07 GB used,   1741.00 GFLOPS,    327.45 GOPS
  6  211.19 ms run,    4.26 ms python,  206.93 ms REMOTE, 1511.62 loss, 0.001392 LR, 2.07 GB used,   1550.51 GFLOPS,    327.45 GOPS
```

this:
```
  3  189.05 ms run,    4.50 ms python,  184.55 ms REMOTE, 2075.90 loss, 0.002698 LR, 1.60 GB used,   1732.08 GFLOPS,    327.45 GOPS
  4  187.81 ms run,    4.11 ms python,  183.71 ms REMOTE, 2481.94 loss, 0.002262 LR, 1.60 GB used,   1743.49 GFLOPS,    327.45 GOPS
  5  186.70 ms run,    4.09 ms python,  182.62 ms REMOTE, 1967.49 loss, 0.001827 LR, 1.60 GB used,   1753.89 GFLOPS,    327.45 GOPS
  6  187.18 ms run,    4.06 ms python,  183.12 ms REMOTE, 1511.62 loss, 0.001392 LR, 1.60 GB used,   1749.36 GFLOPS,    327.45 GOPS
```

(`PYTHONPATH=. REMOTE=1 REMOTEDEV=METAL BS=256 STEPS=10 python examples/hlb_cifar10.py`)

Couldn't reliably reproduce the speedup on tinybox, though.
2025-05-15 11:20:01 -07:00
Ignacio Sica
3c453e96a9 add ds_load_b96 and ds_store_b96 instructions (#10338) 2025-05-15 18:11:08 +03:00
qazal
be8202b293 add s_abs_i32 instruction to remu (#10334) 2025-05-15 16:47:58 +03:00
nimlgen
5efbe1c947 print offset only for subbuf (#10332) 2025-05-15 15:35:19 +03:00
qazal
7cfe367c07 failing test for slow embedding kernel with FUSE_ARANGE=1 [pr] (#10330) 2025-05-15 14:58:11 +03:00
nimlgen
5f03688280 usbgpu: remove max_read_len (#10328) 2025-05-15 14:49:58 +03:00
qazal
27b3dbe67e remove FUSE_ARANGE_UINT [pr] (#10324) 2025-05-15 14:39:54 +03:00
qazal
0a45cd0cbe grouper: merge views in fuse elementwise (#10325)
* grouper: merge views in fuse elementwise

* with gradient api
2025-05-15 13:17:09 +03:00
qazal
89d8d5b25e add dims check in FUSE_ARANGE (#10323) 2025-05-15 11:33:21 +03:00
qazal
8fad0f0124 grouper: check for unsafe PAD in FUSE (#10322) 2025-05-15 10:53:44 +03:00
chenyu
f008e5f233 test_dtype_alu should cast bf16 input (#10320)
when testing ALU ops for bfloat16, inputs should be cast to bfloat16 first; otherwise the numpy reference carries both input-quantization error and ALU error, making it less accurate
2025-05-15 01:11:39 -04:00
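
A sketch of the fix's idea (the `to_bf16` helper here truncates rather than rounds, so it is illustrative only): quantize the inputs to bfloat16 precision before computing the numpy reference, so the reference measures ALU error alone.

```
import numpy as np

def to_bf16(x:np.ndarray) -> np.ndarray:
  # keep only the top 16 bits of the float32 pattern (truncation, not the
  # round-to-nearest-even a real cast does; sketch only)
  return (x.astype(np.float32).view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

a = np.array([1.1, 2.3], dtype=np.float32)
b = np.array([3.7, 0.9], dtype=np.float32)
ref_bad  = a * b                    # reference from exact inputs: two error sources
ref_good = to_bf16(a) * to_bf16(b)  # cast inputs first, as the fix does
```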