Commit Graph

8888 Commits

Author SHA1 Message Date
George Hotz
ceb9d94eab Update AGENTS.md 2025-05-19 17:59:59 -07:00
George Hotz
9389edf7ac hotfix: add AGENTS.md 2025-05-19 17:48:42 -07:00
uuuvn
ec9955c956 Use REAL_DEV for test skips (#10420)
This should fix remote CPU test flakiness (the segfaults were in
`test_data_parallel_resnet_train_step`, which is skipped on CPU but wasn't
skipped on remote CPU).
2025-05-19 17:32:14 -07:00
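A minimal sketch of the idea behind the commit above, assuming a `REAL_DEV`-style value that resolves what REMOTE is actually backed by; the resolution logic and test body here are illustrative stand-ins, not tinygrad's real helpers:
```
# Illustrative only: skip based on the real backing device, so REMOTE-on-CPU
# gets skipped just like plain CPU. The resolution below is a hypothetical stand-in.
import os, unittest

DEV = os.getenv("DEV", "CPU")
REAL_DEV = os.getenv("REMOTE_BACKEND", "CPU") if DEV == "REMOTE" else DEV  # assumed resolution

class TestDataParallel(unittest.TestCase):
  @unittest.skipIf(REAL_DEV == "CPU", "segfaults / too slow on CPU, including remote CPU")
  def test_data_parallel_resnet_train_step(self):
    pass  # the actual multi-device training step lives in tinygrad's test suite
```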
nimlgen
9a199ccd81 am: try to modprobe vfio (#10418)
* am: try to modprobe vfio

* fix
2025-05-19 23:46:50 +03:00
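A rough sketch of what "try to modprobe vfio" usually looks like: attempt to load the module but treat failure (missing modprobe, no permissions, module unavailable) as non-fatal. The function name and error handling below are illustrative, not the AM driver's actual code:
```
# Best-effort module load: never crash setup if modprobe isn't available
# or the vfio module can't be loaded.
import subprocess

def try_modprobe_vfio() -> bool:
  try:
    subprocess.run(["modprobe", "vfio"], check=True, capture_output=True)
    return True
  except (FileNotFoundError, PermissionError, subprocess.CalledProcessError) as e:
    print(f"warning: modprobe vfio failed: {e}")
    return False
```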
chenyu
67d1364106 update LOGMLPERF in red resnet run_and_time (#10416) 2025-05-19 13:23:33 -04:00
Sieds Lykles
db09676250 Don't simplify gate in gate, fix FUSE_ARANGE=1 python test/test_ops.py TestOps.test_scatter_add (#10411)
* substitute out index

* Add test

* change comment
2025-05-19 13:16:21 -04:00
chenyu
116d9e6306 run mlperf resnet on red box (#10413)
also made pushes to the `update_mlperf` branch trigger it
2025-05-19 12:48:36 -04:00
George Hotz
f1fe1f93c1 hotfix: 14000 lines 2025-05-19 09:40:53 -07:00
qazal
90eb3c0e5d add MobileNetV2 benchmark to comma CI (#10250)
* add MobileNetV2 to comma CI

* symlink imagenet

* also the signature

* comment that out

* need imagenetmock

* same train and test set

* quantize on CPU=1

* verbose

* need __hexagon_divsf3

* 0x858d6c15

* quant cpu + CC=clang-19
2025-05-19 18:22:50 +03:00
qazal
f9a5ad24c5 faster viz to_program [pr] (#10410)
* faster viz to_program [pr]

* Callable
2025-05-19 12:27:49 +03:00
qazal
cc8dda1d75 move multi_map to grouper rewrite pass (#10409)
* move multi_map to grouper rewrite pass

* delete that
2025-05-19 10:44:06 +03:00
George Hotz
b06291077c no amdgpu kernel driver (#10408)
* no amdgpu kernel driver

* don't test hip

* lower req
2025-05-18 20:52:39 -07:00
George Hotz
4b1f1a47bb hotfix: allow ModuleNotFoundError in metal llvm import 2025-05-18 20:46:31 -07:00
chenyu
485e80da69 run_and_time for resnet ci (#10405) 2025-05-18 23:39:57 -04:00
qazal
d1eeb19437 count viz javascript in lines (#10403)
* count viz javascript in lines

* don't count }

* it's javascript

* share with autogen
2025-05-18 19:34:00 -07:00
qazal
260d194523 merge insert_fuse and do_fuse [pr] (#10406) 2025-05-19 04:44:36 +03:00
uuuvn
33cf33902a Slightly less slow remote copyin (#10404)
Bytes concat is slow; don't do it if the data is already present in `self._h`.

Also, don't cast the memoryview into bytes (a copy, +100ms) before it's needed.

This mitigates shard copying before shrink.

master:
```
*** REMOTE     6 copy 1073.74M,  REMOTE <- METAL           arg  2 mem  2.15 GB tm    806.84ms/   829.61ms (     0.00 GFLOPS    1.3|1.3     GB/s)
*** REMOTE:    7 copy 1073.74M, REMOTE: <- METAL           arg  2 mem  3.22 GB tm    797.41ms/  1627.02ms (     0.00 GFLOPS    1.3|1.3     GB/s)
*** REMOTE:    8 copy 1073.74M, REMOTE: <- METAL           arg  2 mem  4.29 GB tm    677.89ms/  2304.91ms (     0.00 GFLOPS    1.6|1.6     GB/s)
*** REMOTE:    9 copy 1073.74M, REMOTE: <- METAL           arg  2 mem  5.37 GB tm    659.81ms/  2964.72ms (     0.00 GFLOPS    1.6|1.6     GB/s)
*** REMOTE:   10 copy 1073.74M, REMOTE: <- METAL           arg  2 mem  6.44 GB tm    679.21ms/  3643.93ms (     0.00 GFLOPS    1.6|1.6     GB/s)
*** REMOTE:   11 copy 1073.74M, REMOTE: <- METAL           arg  2 mem  7.52 GB tm    673.90ms/  4317.83ms
```

this:
```
*** REMOTE     6 copy 1073.74M,  REMOTE <- METAL           arg  2 mem  2.15 GB tm    867.06ms/   895.58ms (     0.00 GFLOPS    1.2|1.2     GB/s)
*** REMOTE:    7 copy 1073.74M, REMOTE: <- METAL           arg  2 mem  3.22 GB tm    433.35ms/  1328.93ms (     0.00 GFLOPS    2.5|2.5     GB/s)
*** REMOTE:    8 copy 1073.74M, REMOTE: <- METAL           arg  2 mem  4.29 GB tm    433.19ms/  1762.12ms (     0.00 GFLOPS    2.5|2.5     GB/s)
*** REMOTE:    9 copy 1073.74M, REMOTE: <- METAL           arg  2 mem  5.37 GB tm    432.71ms/  2194.83ms (     0.00 GFLOPS    2.5|2.5     GB/s)
*** REMOTE:   10 copy 1073.74M, REMOTE: <- METAL           arg  2 mem  6.44 GB tm    433.68ms/  2628.51ms (     0.00 GFLOPS    2.5|2.5     GB/s)
*** REMOTE:   11 copy 1073.74M, REMOTE: <- METAL           arg  2 mem  7.52 GB tm    432.91ms/  3061.42ms
```

The 430ms is basically all sha256 time.
2025-05-18 16:20:43 -07:00
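A sketch of the two optimizations described above, assuming `self._h` is the connection's receive buffer (the helper and its names below are made up): only concatenate when the payload isn't already buffered, and hand back a zero-copy memoryview instead of casting to bytes up front.
```
# Illustrative sketch, not the real tinygrad remote code.
def extract_payload(buf: bytearray, datalen: int, read_more) -> memoryview:
  # buf plays the role of self._h; read_more() fetches more bytes from the socket
  while len(buf) < datalen:
    buf += read_more()                 # concat only when the data isn't already present
  return memoryview(buf)[:datalen]     # zero-copy view; bytes(...) here would copy ~1 GiB
```
The memoryview defers the copy until (and unless) a caller actually needs `bytes`.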
qazal
e55ee28b29 little smaller viz/worker.js [pr] (#10402) 2025-05-18 23:44:46 +03:00
qazal
8a6fb37560 move viz /prof to extra [pr] (#10401) 2025-05-18 23:25:59 +03:00
George Hotz
411392dfb7 move files into uop dir (#10399)
* move files into uop dir [pr]

* tinygrad.uop is a thing

* fix uop docs, no pr

* fix viz
2025-05-18 11:38:28 -07:00
uuuvn
0f825e12f2 Remote fixedvars (#10371)
* amd mockgpu graph support

For testing remote graph stuff (prompted by #10371) in CI

* Remote fixedvars

Somehow none of the existing tests failed when fixedvars were added; looking
into what to add as a regression test for this

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-05-18 09:57:13 -07:00
uuuvn
27c12be471 amd mockgpu graph support (#10385)
For testing remote graph stuff (prompted by #10371) in CI
2025-05-18 09:43:16 -07:00
George Hotz
a3308e145d hotfix: remote print -> DEBUG=3 2025-05-18 09:09:04 -07:00
qazal
04b23087d8 grouper tests from fuse_arange_default [pr] (#10394) 2025-05-18 18:42:43 +03:00
qazal
17f0f5e764 add v_rcp_f32_e64 to remu (#10393)
* tests from the box

* add v_rcp_f32_e64 to remu

* f32::from_bits utils

* v_cndmask_b32 tests
2025-05-18 17:08:21 +03:00
qazal
9e2089dcd4 don't raise Exception in process replay [pr] (#10392)
* don't raise Exception in process replay [pr]

* continue generating diffs unless [pr] is set, exit(1) otherwise

* change

* works
2025-05-18 11:23:23 +03:00
chenyu
9b4e2a75cd symlink datasets in mlperf workflow (#10391) 2025-05-18 03:26:05 -04:00
uuuvn
f20c5aac1f Use itertools.count instead of manual increment in remote (#10389)
Similar to how it's done with `UOp.unique_num`; it looks a bit nicer
2025-05-18 00:15:37 -07:00
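For illustration, the pattern the commit swaps in (class and method names below are made up):
```
import itertools

# before: a manually incremented counter
class SessionIdsManual:
  def __init__(self): self._next = 0
  def new_id(self) -> int:
    ret, self._next = self._next, self._next + 1
    return ret

# after: itertools.count() yields 0, 1, 2, ... each time next() is called
class SessionIds:
  def __init__(self): self._ids = itertools.count()
  def new_id(self) -> int: return next(self._ids)
```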
qazal
0294bfe507 simpler can_pad (#10364)
* simpler can_pad [pr]

* 3 kernels

* tests

* less kernels
2025-05-18 10:00:07 +03:00
George Hotz
c91f2c4580 use float32 for sgd momentum (#10387) 2025-05-17 21:56:44 -07:00
George Hotz
305a3231c4 fix beam none if buf is optimized out (#10388) 2025-05-17 21:50:33 -07:00
George Hotz
6f77b938d7 Move getbits tests into test_helpers (#10382) 2025-05-17 17:04:00 -07:00
George Hotz
6ebfb505e9 docs: fix crossentropy name (#10377) 2025-05-17 16:39:14 -07:00
George Hotz
0b733ba75e multi device training with GPT2 [pr] (#10375)
* multi device training with GPT2 [pr]

* Update grouper.py
2025-05-17 15:33:56 -07:00
George Hotz
6ec88d94df add tests for multi ram usage [pr] (#10376) 2025-05-17 15:33:40 -07:00
uuuvn
5a18eab908 Fix __del__ in remote program (#10372)
Similar to #10341; broke after the hypothesis unpin
2025-05-17 21:29:44 +03:00
वेदांत
2453d99050 rms matching pytorch implementation (#10319)
* rms matching pytorch implementation

* pre commit fix

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-05-17 08:23:11 -07:00
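For context, the usual PyTorch RMSNorm formulation such a change matches is `x * rsqrt(mean(x^2) + eps) * weight`, with `eps` inside the square root. A minimal tinygrad-style sketch of that formula (not necessarily the exact change in this PR):
```
from tinygrad import Tensor

def rms_norm(x: Tensor, weight: Tensor, eps: float = 1e-6) -> Tensor:
  # normalize by the root mean square over the last axis, eps inside the rsqrt
  return x * (x.square().mean(axis=-1, keepdim=True) + eps).rsqrt() * weight
```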
nimlgen
da2b1834b4 hotfix: metal graph var vals (#10370) 2025-05-17 17:22:55 +03:00
qazal
e054b53a75 kernel count tests for pad [pr] (#10369)
* kernel count tests for pads

* handcoded rand one kernel

* comment

* prerealize device rng counter

* test_rand_handcoded generates /0

* remove track_rewrites
2025-05-17 17:20:46 +03:00
nimlgen
90c4bb10c0 fixedvars in all graphs (#10365)
* cuda fixedvars

* metal: fixevars

* f

* ups

* count fixedvars
2025-05-17 16:18:52 +03:00
chenyu
efa8dfe7fb test cron job to run resnet (#10368) 2025-05-17 08:57:02 -04:00
uuuvn
2c706d363e Remote higher timeout and overridable via REMOTE_TIMEOUT (#10367)
Sometimes a minute is not enough; 5 minutes should be, but if it isn't
for some huge workload, it can be overridden.
2025-05-17 15:30:49 +03:00
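A sketch of the override pattern described above, assuming tinygrad's `getenv` helper and a 5-minute default (the exact default value and call site are assumptions):
```
from tinygrad.helpers import getenv

# 5 minutes by default; e.g. REMOTE_TIMEOUT=1800 (seconds) raises it for huge workloads
REMOTE_TIMEOUT = getenv("REMOTE_TIMEOUT", 300)
```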
nimlgen
4fa1837916 metal: do not require icb fix on m3+ (#10366) 2025-05-17 15:30:40 +03:00
Xingyu
286b0f4051 Add equal function implementation and corresponding test (#10351)
- Implemented a new function `equal` in the torch backend to compare two tensors for equality.
- Added unit tests for the `equal` function to verify its correctness with different tensor inputs.
2025-05-16 23:39:49 -07:00
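As a rough sketch of what such an `equal` typically does (True only when shapes match and every element compares equal, mirroring `torch.equal` semantics), together with a small unit test; this is illustrative, not the backend's actual code:
```
import unittest
from tinygrad import Tensor

def equal(a: Tensor, b: Tensor) -> bool:
  # same shape and all elements identical
  return a.shape == b.shape and bool((a == b).all().item())

class TestEqual(unittest.TestCase):
  def test_equal(self):
    self.assertTrue(equal(Tensor([1, 2, 3]), Tensor([1, 2, 3])))
    self.assertFalse(equal(Tensor([1, 2, 3]), Tensor([1, 2, 4])))
    self.assertFalse(equal(Tensor([1, 2]), Tensor([1, 2, 3])))

if __name__ == "__main__":
  unittest.main()
```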
George Hotz
e13f2a3092 multi is O(1) (#10183)
* multi is O(1)

* allreduce

* no new uops needed

* junk

* something

* simple

* that's really what i want

* closer

* inject _device_num

* pretty print

* cleanups

* this

* early dnum

* ops allreduce is good

* ish

* device is the tuple and this is fine

* simpler

* progress

* copy_multi

* work

* more tests

* more tests pass

* work

* no None axis

* tests

* no none multi

* type fixes

* pre commit passes

* lil

* remove this

* mlperf dataloader on mac

* that test was wrong

* unbind

* support DEBUG=2

* realize

* only unbind bound vars

* don't include fixedvars

* graph test

* one test

* fixedvars in hcq

* new ring reduce

* ring reduce

* simpler ring

* mselect

* mselect doesn't work

* Revert "mselect doesn't work"

This reverts commit c78b77bd7d.

* Revert "mselect"

This reverts commit bb2e430ac3.

* simpler

* fixups

* no optional

* fix jit

* move things around

* cleanup multi

* simpler multi

* simpler reshape
2025-05-16 23:14:23 -07:00
George Hotz
e1a40e8040 add hcq fixedvars support [pr] (#10356)
* add hcq fixedvars support [pr]

* different test

* fixedvars are only for comp_queues

* fix hcq varvals
2025-05-16 22:05:53 -07:00
George Hotz
11b5895c85 hotfix: schedule timing in tensor.py 2025-05-16 20:10:32 -07:00
uuuvn
64409a8bda Remote beam (#10357)
* Use renderer properties instead of `.device`

* Remote beam
2025-05-16 18:59:22 -07:00
George Hotz
7cc35a031b don't use UOp.multi in Tensor.rand (#10362) 2025-05-16 16:09:36 -07:00
George Hotz
7703dbef99 view substitute [pr] (#10360) 2025-05-16 15:08:24 -07:00