George Hotz
ceb9d94eab
Update AGENTS.md
2025-05-19 17:59:59 -07:00
George Hotz
9389edf7ac
hotfix: add AGENTS.md
2025-05-19 17:48:42 -07:00
uuuvn
ec9955c956
Use REAL_DEV for test skips ( #10420 )
This should fix remote CPU test flakiness (the segfaults were in
`test_data_parallel_resnet_train_step`, which is skipped on CPU but wasn't
skipped on remote CPU)
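The idea behind the fix can be sketched as follows; `resolve_real_dev` and the `REMOTE_BACKEND` variable here are hypothetical stand-ins for however tinygrad looks through the REMOTE proxy, not the actual helper:

```python
import os, unittest

def resolve_real_dev(default_dev: str) -> str:
  # hypothetical: a REMOTE device proxies some underlying backend, so
  # test-skip decisions should be based on that backend, not on "REMOTE"
  if default_dev == "REMOTE":
    return os.environ.get("REMOTE_BACKEND", "CPU")
  return default_dev

class TestDataParallel(unittest.TestCase):
  # skipping on the *real* device catches both CPU and remote-CPU runs
  @unittest.skipIf(resolve_real_dev("REMOTE") == "CPU", "segfaults on CPU")
  def test_data_parallel_resnet_train_step(self):
    pass
```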
2025-05-19 17:32:14 -07:00
nimlgen
9a199ccd81
am: try to modprobe vfio ( #10418 )
* am: try to modprobe vfio
* fix
2025-05-19 23:46:50 +03:00
chenyu
67d1364106
update LOGMLPERF in red resnet run_and_time ( #10416 )
2025-05-19 13:23:33 -04:00
Sieds Lykles
db09676250
Don't simplify gate in gate, fix `FUSE_ARANGE=1 python test/test_ops.py TestOps.test_scatter_add` ( #10411 )
* substitute out index
* Add test
* change comment
2025-05-19 13:16:21 -04:00
chenyu
116d9e6306
run mlperf resnet on red box ( #10413 )
also made pushes to the `update_mlperf` branch trigger the run
2025-05-19 12:48:36 -04:00
George Hotz
f1fe1f93c1
hotfix: 14000 lines
2025-05-19 09:40:53 -07:00
qazal
90eb3c0e5d
add MobileNetV2 benchmark to comma CI ( #10250 )
* add MobileNetV2 to comma CI
* symlink imagenet
* also the signature
* comment that out
* need imagenetmock
* same train and test set
* quantize on CPU=1
* verbose
* need __hexagon_divsf3
* 0x858d6c15
* quant cpu + CC=clang-19
2025-05-19 18:22:50 +03:00
qazal
f9a5ad24c5
faster viz to_program [pr] ( #10410 )
* faster viz to_program [pr]
* Callable
2025-05-19 12:27:49 +03:00
qazal
cc8dda1d75
move multi_map to grouper rewrite pass ( #10409 )
* move multi_map to grouper rewrite pass
* delete that
2025-05-19 10:44:06 +03:00
George Hotz
b06291077c
no amdgpu kernel driver ( #10408 )
* no amdgpu kernel driver
* don't test hip
* lower req
2025-05-18 20:52:39 -07:00
George Hotz
4b1f1a47bb
hotfix: allow ModuleNotFoundError in metal llvm import
2025-05-18 20:46:31 -07:00
chenyu
485e80da69
run_and_time for resnet ci ( #10405 )
2025-05-18 23:39:57 -04:00
qazal
d1eeb19437
count viz javascript in lines ( #10403 )
* count viz javascript in lines
* don't count }
* it's javascript
* share with autogen
2025-05-18 19:34:00 -07:00
qazal
260d194523
merge insert_fuse and do_fuse [pr] ( #10406 )
2025-05-19 04:44:36 +03:00
uuuvn
33cf33902a
Slightly less slow remote copyin ( #10404 )
bytes concat is slow; don't do it if the data is already present in `self._h`.
Also don't cast the memoryview into bytes (a copy, +100ms) before it's needed.
This mitigates shard copying before shrink.
master:
```
*** REMOTE 6 copy 1073.74M, REMOTE <- METAL arg 2 mem 2.15 GB tm 806.84ms/ 829.61ms ( 0.00 GFLOPS 1.3|1.3 GB/s)
*** REMOTE: 7 copy 1073.74M, REMOTE: <- METAL arg 2 mem 3.22 GB tm 797.41ms/ 1627.02ms ( 0.00 GFLOPS 1.3|1.3 GB/s)
*** REMOTE: 8 copy 1073.74M, REMOTE: <- METAL arg 2 mem 4.29 GB tm 677.89ms/ 2304.91ms ( 0.00 GFLOPS 1.6|1.6 GB/s)
*** REMOTE: 9 copy 1073.74M, REMOTE: <- METAL arg 2 mem 5.37 GB tm 659.81ms/ 2964.72ms ( 0.00 GFLOPS 1.6|1.6 GB/s)
*** REMOTE: 10 copy 1073.74M, REMOTE: <- METAL arg 2 mem 6.44 GB tm 679.21ms/ 3643.93ms ( 0.00 GFLOPS 1.6|1.6 GB/s)
*** REMOTE: 11 copy 1073.74M, REMOTE: <- METAL arg 2 mem 7.52 GB tm 673.90ms/ 4317.83ms
```
this:
```
*** REMOTE 6 copy 1073.74M, REMOTE <- METAL arg 2 mem 2.15 GB tm 867.06ms/ 895.58ms ( 0.00 GFLOPS 1.2|1.2 GB/s)
*** REMOTE: 7 copy 1073.74M, REMOTE: <- METAL arg 2 mem 3.22 GB tm 433.35ms/ 1328.93ms ( 0.00 GFLOPS 2.5|2.5 GB/s)
*** REMOTE: 8 copy 1073.74M, REMOTE: <- METAL arg 2 mem 4.29 GB tm 433.19ms/ 1762.12ms ( 0.00 GFLOPS 2.5|2.5 GB/s)
*** REMOTE: 9 copy 1073.74M, REMOTE: <- METAL arg 2 mem 5.37 GB tm 432.71ms/ 2194.83ms ( 0.00 GFLOPS 2.5|2.5 GB/s)
*** REMOTE: 10 copy 1073.74M, REMOTE: <- METAL arg 2 mem 6.44 GB tm 433.68ms/ 2628.51ms ( 0.00 GFLOPS 2.5|2.5 GB/s)
*** REMOTE: 11 copy 1073.74M, REMOTE: <- METAL arg 2 mem 7.52 GB tm 432.91ms/ 3061.42ms
```
The 430ms is basically all sha256 time.
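The copy-avoidance being described can be illustrated in isolation; this is a generic sketch of memoryview-vs-bytes behavior, not the remote code itself:

```python
data = bytearray(1 << 20)  # stand-in for a received 1 MiB payload
mv = memoryview(data)

# slicing a memoryview is O(1): it makes a view, not a copy of the payload
view = mv[16:]
assert view.nbytes == len(data) - 16

# bytes(...) materializes a full copy; on a ~1 GiB shard that copy is the
# ~100ms the message mentions, so it should be deferred until actually needed
copied = bytes(view)
assert copied == bytes(data[16:])
```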
2025-05-18 16:20:43 -07:00
qazal
e55ee28b29
little smaller viz/worker.js [pr] ( #10402 )
2025-05-18 23:44:46 +03:00
qazal
8a6fb37560
move viz /prof to extra [pr] ( #10401 )
2025-05-18 23:25:59 +03:00
George Hotz
411392dfb7
move files into uop dir ( #10399 )
* move files into uop dir [pr]
* tinygrad.uop is a thing
* fix uop docs, no pr
* fix viz
2025-05-18 11:38:28 -07:00
uuuvn
0f825e12f2
Remote fixedvars ( #10371 )
* amd mockgpu graph support
For testing remote graph stuff (prompted by #10371) in CI
* Remote fixedvars
Somehow none of the existing tests failed when fixedvars were added; looking
into what to add as a regression test for this
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-05-18 09:57:13 -07:00
uuuvn
27c12be471
amd mockgpu graph support ( #10385 )
For testing remote graph stuff (prompted by #10371) in CI
2025-05-18 09:43:16 -07:00
George Hotz
a3308e145d
hotfix: remote print -> DEBUG=3
2025-05-18 09:09:04 -07:00
qazal
04b23087d8
grouper tests from fuse_arange_default [pr] ( #10394 )
2025-05-18 18:42:43 +03:00
qazal
17f0f5e764
add v_rcp_f32_e64 to remu ( #10393 )
* tests from the box
* add v_rcp_f32_e64 to remu
* f32::from_bits utils
* v_cndmask_b32 tests
2025-05-18 17:08:21 +03:00
qazal
9e2089dcd4
don't raise Exception in process replay [pr] ( #10392 )
* don't raise Exception in process replay [pr]
* continue generating diffs unless [pr] is set, exit(1) otherwise
* change
* works
2025-05-18 11:23:23 +03:00
chenyu
9b4e2a75cd
symlink datasets in mlperf workflow ( #10391 )
2025-05-18 03:26:05 -04:00
uuuvn
f20c5aac1f
Use itertools.count instead of manual increment in remote ( #10389 )
Similar to how it's done with `UOp.unique_num`; looks a bit nicer
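The before/after pattern looks roughly like this (a generic sketch, not the actual remote session code):

```python
import itertools

# before: manual increment, two statements of bookkeeping per id
class SessionManual:
  def __init__(self): self._num = 0
  def next_id(self) -> int:
    ret = self._num
    self._num += 1
    return ret

# after: itertools.count hands out the same 0, 1, 2, ... sequence
class SessionCount:
  def __init__(self): self._nums = itertools.count()
  def next_id(self) -> int: return next(self._nums)

a, b = SessionManual(), SessionCount()
assert [a.next_id() for _ in range(3)] == [b.next_id() for _ in range(3)] == [0, 1, 2]
```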
2025-05-18 00:15:37 -07:00
qazal
0294bfe507
simpler can_pad ( #10364 )
* simpler can_pad [pr]
* 3 kernels
* tests
* less kernels
2025-05-18 10:00:07 +03:00
George Hotz
c91f2c4580
use float32 for sgd momentum ( #10387 )
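The motivation is the usual mixed-precision one: a float16 momentum buffer loses small gradient contributions, so the buffer is kept in float32. A minimal numpy sketch of that idea (not tinygrad's optimizer code; the function name is made up):

```python
import numpy as np

def sgd_momentum_step(param, grad, buf, lr=0.01, momentum=0.9):
  # accumulate momentum in float32 even when params/grads are float16
  buf[:] = momentum * buf + grad.astype(np.float32)
  new_param = param.astype(np.float32) - lr * buf
  return new_param.astype(param.dtype)

p = np.ones(4, dtype=np.float16)
g = np.full(4, 1e-4, dtype=np.float16)
buf = np.zeros(4, dtype=np.float32)
for _ in range(10):
  p = sgd_momentum_step(p, g, buf)
# the buffer stays float32; params keep their original dtype
assert buf.dtype == np.float32 and p.dtype == np.float16
```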
2025-05-17 21:56:44 -07:00
George Hotz
305a3231c4
fix beam none if buf is optimized out ( #10388 )
2025-05-17 21:50:33 -07:00
George Hotz
6f77b938d7
Move getbits tests into test_helpers ( #10382 )
2025-05-17 17:04:00 -07:00
George Hotz
6ebfb505e9
docs: fix crossentropy name ( #10377 )
2025-05-17 16:39:14 -07:00
George Hotz
0b733ba75e
multi device training with GPT2 [pr] ( #10375 )
* multi device training with GPT2 [pr]
* Update grouper.py
2025-05-17 15:33:56 -07:00
George Hotz
6ec88d94df
add tests for multi ram usage [pr] ( #10376 )
2025-05-17 15:33:40 -07:00
uuuvn
5a18eab908
Fix __del__ in remote program ( #10372 )
Similar to #10341, broke after the hypothesis unpin
2025-05-17 21:29:44 +03:00
वेदांत
2453d99050
rms matching pytorch implementation ( #10319 )
* rms matching pytorch implementation
* pre commit fix
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
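PyTorch's RMSNorm computes `x / sqrt(mean(x**2) + eps) * weight`, with the mean taken over the normalized dimensions; a plain-Python sketch of the semantics being matched (illustrative only, not the tinygrad code):

```python
import math

def rms_norm(x, weight, eps=1e-6):
  # mean of squares over the last dimension, then scale, then elementwise weight
  ms = sum(v * v for v in x) / len(x)
  inv = 1.0 / math.sqrt(ms + eps)
  return [v * inv * w for v, w in zip(x, weight)]

out = rms_norm([3.0, 4.0], [1.0, 1.0])
# mean of squares of [3, 4] is 12.5, so out[0] == 3 / sqrt(12.5 + eps)
assert abs(out[0] - 3.0 / math.sqrt(12.5 + 1e-6)) < 1e-12
```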
2025-05-17 08:23:11 -07:00
nimlgen
da2b1834b4
hotfix: metal graph var vals ( #10370 )
2025-05-17 17:22:55 +03:00
qazal
e054b53a75
kernel count tests for pad [pr] ( #10369 )
* kernel count tests for pads
* handcoded rand one kernel
* comment
* prerealize device rng counter
* test_rand_handcoded generates /0
* remove track_rewrites
2025-05-17 17:20:46 +03:00
nimlgen
90c4bb10c0
fixedvars in all graphs ( #10365 )
* cuda fixedvars
* metal: fixedvars
* f
* ups
* count fixedvars
2025-05-17 16:18:52 +03:00
chenyu
efa8dfe7fb
test cron job to run resnet ( #10368 )
2025-05-17 08:57:02 -04:00
uuuvn
2c706d363e
Remote higher timeout and overridable via REMOTE_TIMEOUT ( #10367 )
Sometimes a minute is not enough; five minutes should be, but if it isn't for
some huge workload, it can be overridden.
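An env-overridable timeout like this usually reduces to a one-liner (a sketch; tinygrad has its own `getenv` helper, so the real code likely differs):

```python
import os

def remote_timeout(env=None) -> float:
  # default is 5 minutes; REMOTE_TIMEOUT (in seconds) overrides it
  env = os.environ if env is None else env
  return float(env.get("REMOTE_TIMEOUT", 300))

assert remote_timeout({}) == 300.0
assert remote_timeout({"REMOTE_TIMEOUT": "900"}) == 900.0
```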
2025-05-17 15:30:49 +03:00
nimlgen
4fa1837916
metal: do not require icb fix on m3+ ( #10366 )
2025-05-17 15:30:40 +03:00
Xingyu
286b0f4051
Add equal function implementation and corresponding test ( #10351 )
- Implemented a new function `equal` in the torch backend to compare two tensors for equality.
- Added unit tests for the `equal` function to verify its correctness with different tensor inputs.
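`torch.equal` returns a single bool: identical shape, then elementwise equality. A numpy sketch of those semantics (not the actual backend code, which would operate on tinygrad tensors):

```python
import numpy as np

def equal(a: np.ndarray, b: np.ndarray) -> bool:
  # shapes must match exactly before comparing elements, like torch.equal;
  # the short-circuit also avoids broadcasting mismatched shapes
  return a.shape == b.shape and bool((a == b).all())

assert equal(np.arange(4), np.arange(4))
assert not equal(np.arange(4), np.arange(4).reshape(2, 2))
assert not equal(np.array([1, 2]), np.array([1, 3]))
```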
2025-05-16 23:39:49 -07:00
George Hotz
e13f2a3092
multi is O(1) ( #10183 )
* multi is O(1)
* allreduce
* no new uops needed
* junk
* something
* simple
* that's really what i want
* closer
* inject _device_num
* pretty print
* cleanups
* this
* early dnum
* ops allreduce is good
* ish
* device is the tuple and this is fine
* simpler
* progress
* copy_multi
* work
* more tests
* more tests pass
* work
* no None axis
* tests
* no none multi
* type fixes
* pre commit passes
* lil
* remove this
* mlperf dataloader on mac
* that test was wrong
* unbind
* support DEBUG=2
* realize
* only unbind bound vars
* don't include fixedvars
* graph test
* one test
* fixedvars in hcq
* new ring reduce
* ring reduce
* simpler ring
* mselect
* mselect doesn't work
* Revert "mselect doesn't work"
This reverts commit c78b77bd7d.
* Revert "mselect"
This reverts commit bb2e430ac3.
* simpler
* fixups
* no optional
* fix jit
* move things around
* cleanup multi
* simpler multi
* simpler reshape
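The "ring reduce" bullets refer to the standard ring all-reduce, where each of N devices exchanges one chunk per step with its ring neighbor, keeping per-device traffic independent of N. A toy single-process simulation of the algorithm (not tinygrad's implementation):

```python
def ring_allreduce(bufs):
  # bufs: one flat list of floats per simulated device, all the same length,
  # split into one chunk per device
  n, cs = len(bufs), len(bufs[0]) // len(bufs)
  # reduce-scatter: after n-1 steps, device d holds the full sum of chunk (d+1)%n
  for s in range(n - 1):
    for d in range(n):
      c = (d - s) % n
      for i in range(cs):
        bufs[(d + 1) % n][c * cs + i] += bufs[d][c * cs + i]
  # allgather: circulate the fully reduced chunks around the ring
  for s in range(n - 1):
    for d in range(n):
      c = (d + 1 - s) % n
      bufs[(d + 1) % n][c * cs:(c + 1) * cs] = bufs[d][c * cs:(c + 1) * cs]

bufs = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
ring_allreduce(bufs)
assert bufs == [[12.0, 15.0, 18.0]] * 3  # every device ends with the full sum
```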
2025-05-16 23:14:23 -07:00
George Hotz
e1a40e8040
add hcq fixedvars support [pr] ( #10356 )
* add hcq fixedvars support [pr]
* different test
* fixedvars are only for comp_queues
* fix hcq varvals
2025-05-16 22:05:53 -07:00
George Hotz
11b5895c85
hotfix: schedule timing in tensor.py
2025-05-16 20:10:32 -07:00
uuuvn
64409a8bda
Remote beam ( #10357 )
* Use renderer properties instead of `.device`
* Remote beam
2025-05-16 18:59:22 -07:00
George Hotz
7cc35a031b
don't use UOp.multi in Tensor.rand ( #10362 )
2025-05-16 16:09:36 -07:00
George Hotz
7703dbef99
view substitute [pr] ( #10360 )
2025-05-16 15:08:24 -07:00