George Hotz
7c630a9a53
hotfix: fix llama spacing + fix hcq
2024-05-10 15:10:13 +00:00
George Hotz
58e7256ce9
restore hcq graph ( #4513 )
...
* Reapply "hcq graph (#4380 )" (#4512 )
This reverts commit 06c1e7498e .
* bring back hcq graph
2024-05-10 07:45:05 -07:00
George Hotz
06c1e7498e
Revert "hcq graph ( #4380 )" ( #4512 )
...
This reverts commit 84a2e2b8c1.
2024-05-10 07:18:09 -07:00
nimlgen
84a2e2b8c1
hcq graph ( #4380 )
...
* start hcq graph
* hack-fix sync on amd
* nv
* fix nv
* multigraph
* fixes
* temp fix for graph
* this is not needed
* fix
* cleaner
* linter
* fix none
* faster cuda copy
* faster amd copy
* temp nv fixes
* alloc on gpu
* exp: faster amd
* Revert "exp: faster amd"
This reverts commit 2e4cfd1f7d8a33634c50fb5655cff1b40269d28c.
* revert, unrelated
* not in this pr
* linter
2024-05-10 07:15:12 -07:00
qazal
2b7ab60584
dfs fusion ( #4491 )
...
* use continue
* simplify
* flip
* track r
* derive forced_realize
* scheduler needs comments
2024-05-10 17:00:48 +03:00
qazal
bd8bb82555
move fusion out of child iteration ( #4509 )
2024-05-10 12:03:32 +03:00
qazal
ff216a383a
refactor fused children ( #4508 )
...
* realized_children -> group
* use a set
2024-05-10 11:49:23 +03:00
chenyu
b399d98e41
fix resnet eval ( #4507 )
2024-05-10 00:49:00 -04:00
wozeparrot
a602dc67d3
feat: more mlperf fixes ( #4505 )
2024-05-09 20:50:20 -07:00
chenyu
0e8aa0e288
use fake data in beam searching resnet ( #4504 )
2024-05-09 23:43:50 -04:00
George Hotz
5bfc33948a
hotfix: only run optimize_local_size once
2024-05-09 20:01:53 -07:00
wozeparrot
29daea4e60
fix: core count and os ( #4503 )
2024-05-09 19:55:07 -07:00
George Hotz
89e119bc58
move Allocator to buffer.py ( #4502 )
...
* move Allocator to buffer.py
* move those to realize
* memory file
* cleanup
2024-05-09 19:45:56 -07:00
George Hotz
1e843d495e
cleaning up search with Program ( #4500 )
...
* cleaning up search
* fix tests
* test fix
* minor compiler cleanup
2024-05-09 19:01:53 -07:00
chenyu
d3dc332c2e
Tensor.logsumexp ( #4442 )
...
the subtract-max part should be shared with safe softmax (see the sketch after this entry)
cleaner
2024-05-09 20:49:06 -04:00
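The subtract-max trick the logsumexp commit above mentions is the standard numerically stable formulation; here is a minimal plain-Python sketch (the function name and the use of math instead of tinygrad Tensors are just for illustration):

```python
import math

def logsumexp(xs):
    # subtracting the max keeps exp() in range; the same trick is what makes
    # a "safe" softmax numerically stable
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

print(logsumexp([1000.0, 1000.0]))  # ~1000.693; a naive exp(1000) would overflow
```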
chenyu
78b298aa2a
move 0d tensor reduce axis check to _reduce ( #4499 )
2024-05-09 20:29:55 -04:00
George Hotz
c9e84ed0da
refactor to Program class ( #4476 )
...
* refactor to Program class
* switch to Program
* fix tests
* smaller diff
* self.p
* more tests
* fix metal test
* tests
* fix openpilot
* move that to linearizer
* p.launchdims
2024-05-09 17:29:07 -07:00
chenyu
5de4a46f10
re-enable gpt2 half/beam mac benchmark ( #4496 )
...
* re-enable gpt2 half/beam mac benchmark
from the fuzzer it seems to be flaky due to a numerical issue, not a kernel bug. we used to have half in the split reduce.
ran this on an M1 Max for 20 loops and it's fine
* that should be jitted
2024-05-09 19:15:32 -04:00
nimlgen
a2e2ba380c
nv tune shmem size ( #4495 )
...
* nv tune shmem size
* compare them
* linter
* linter2
2024-05-10 00:35:01 +03:00
chenyu
ef93e41a15
resnet mlperf systems add tinygrad commit and python / runtime versions ( #4494 )
2024-05-09 16:04:15 -04:00
chenyu
b5afdfbc5b
first draft resnet mlperf readme ( #4493 )
...
* start readme
* something
2024-05-09 15:51:44 -04:00
chenyu
047c7f3e5b
polish resnet mlperf logging ( #4490 )
...
don't include the time to save the final checkpoint in the run time, plus some cosmetic ordering changes
2024-05-09 13:04:24 -04:00
chenyu
d78e159aa3
resnet logging move RUN_START to start of the script ( #4488 )
2024-05-09 12:32:32 -04:00
chenyu
1bcb58479d
resnet setup power cap red box gpu to 350W ( #4484 )
...
1%-2% faster
2024-05-08 23:32:41 -04:00
chenyu
0ed755bcf5
resnet use EVAL_BS=192 ( #4482 )
...
* resnet use EVAL_BS=192
also lower the green run BEAM_MIN_PROGRESS from 10 to 5
* BEAM_MIN_PROGRESS 5 is too close to the setup limit
2024-05-08 22:29:27 -04:00
chenyu
1f6bf9d2f7
real diskcache_clear in model_train resnet ( #4445 )
...
clear the cache if INITMLPERF is set or when running run_and_time; dev_beam and dev_run do not clear the cache
2024-05-08 19:06:09 -04:00
chenyu
1b4645bea6
hotfix resnet move init_start to start of the script ( #4481 )
2024-05-08 19:03:52 -04:00
wozeparrot
a347ae94d6
feat: remove wandb ( #4480 )
2024-05-08 15:31:16 -07:00
qazal
00c309dfe2
trigger tc in remu ( #4479 )
2024-05-08 23:23:46 +03:00
nimlgen
e14d5b6fd7
nv fix oob qmd ptr ( #4478 )
...
* nv fix oob qmd ptr
* test kernargs no oob
2024-05-08 23:11:04 +03:00
chenyu
db7e15c46f
hotfix resnet only log epoch start with RUNMLPERF ( #4477 )
2024-05-08 15:14:41 -04:00
chenyu
062c6dd65d
mlperf logging, truncate dir in logs and log seed ( #4475 )
2024-05-08 12:54:02 -04:00
chenyu
b62a65b617
redo faster sparse_categorical_crossentropy ( #4461 )
...
also update the default LR and DECAY in resnet, which help convergence
2024-05-08 11:21:43 -04:00
Elias Wahl
e87460c7e2
bump version ( #4474 )
2024-05-08 07:48:42 -07:00
Szymon Ożóg
4eb6aef73c
Speed up graph rewrite ( #4473 )
...
* Speed up graph rewrite
* Bring back old name
2024-05-08 07:15:15 -07:00
Nicklas Boman
cc33947fa5
Update links in new docs ( #4363 )
...
point the tensor and nn links to tensor.md and nn.md
2024-05-08 06:13:00 -07:00
chenyu
36a1f38049
lazy folding: mul -1 is neg, and neg neg is noop ( #4472 )
2024-05-08 01:52:22 -04:00
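The two folding rules named in the commit above can be sketched as a toy rewriter; this is only a plain-Python illustration (Neg, fold_mul, and fold_neg are hypothetical names), not tinygrad's lazy-graph code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Neg:
    src: object

def fold_neg(x):
    # neg of neg is a no-op: return the original source
    return x.src if isinstance(x, Neg) else Neg(x)

def fold_mul(x, const):
    # mul by -1 folds into a neg instead of a real multiply
    return fold_neg(x) if const == -1 else ("mul", x, const)

print(fold_mul("a", -1))            # Neg(src='a')
print(fold_neg(fold_mul("a", -1)))  # 'a' -- the double negation folds away
```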
chenyu
c508eb7425
revert the removal of CAST_BEFORE_VIEW ( #4471 )
...
this brings back most of the memory gain for resnet.
2024-05-08 00:14:29 -04:00
George Hotz
5dbab7fae6
bring thneed back ( #4467 )
...
* bring thneed back
* simple thneed
* bug fixes in new thneed
* needs_load
* context
* move that there
* fix thneed size
* fix CI
* one memory planner
* assert on buffer size
2024-05-07 20:55:03 -07:00
chenyu
7eb035e7c5
stronger test case for half mean overflow ( #4470 )
2024-05-07 22:40:09 -04:00
chenyu
ca7300c783
fix half mean and its backward ( #4469 )
...
* fix half mean and its backward
cast to sum_acc_type, sum, div, then cast back (see the sketch after this entry)
* mean dtype tests
2024-05-07 21:46:41 -04:00
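The cast, sum, div, cast-back pattern described in the half-mean commit above can be sketched with numpy; sum_acc_type is tinygrad's notion, and float32 is only assumed here as the wider accumulator:

```python
import numpy as np

def half_mean(x: np.ndarray) -> np.ndarray:
    # accumulate in a wider dtype so the sum cannot overflow float16's 65504
    # max, divide, then cast the result back to the input dtype
    return (x.astype(np.float32).sum() / x.size).astype(x.dtype)

x = np.full(100_000, 1.0, dtype=np.float16)
print(x.sum())       # inf: a float16 accumulator overflows past 65504
print(half_mean(x))  # 1.0
```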
Francis Lam
7da1b41f38
fuzz_linearizer: add FUZZ_REQUIRE_TC option to require TC in opts ( #4468 )
...
useful for checking late opts that come after TC, such as GROUP.
2024-05-07 17:14:21 -04:00
chenyu
46a793111b
test for LazyBuffer._view when mask out and degrade into const ( #4465 )
...
changed the condition from all masked dims being 0 to any masked dim being 0. it's a no-op because the shapetracker rewrites the whole mask to 0 if any dim has a 0 as part of canonicalization (see the sketch after this entry)
2024-05-07 12:56:23 -04:00
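A toy sketch of the canonicalization the commit above relies on; this is not tinygrad's ShapeTracker, and mask_is_const is a hypothetical helper used only to illustrate the "any dim has 0" condition:

```python
def mask_is_const(mask):
    # mask holds a (begin, end) valid range per dim; an empty range in any
    # dim means no element is ever valid, so the view degrades into a const
    return any(end - begin == 0 for begin, end in mask)

print(mask_is_const(((0, 4), (2, 2))))  # True: dim 1 has an empty valid range
print(mask_is_const(((0, 4), (1, 3))))  # False: every dim keeps valid elements
```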
nimlgen
a1d350a810
nv timeline semaphores ( #4464 )
...
* nv timeline semaphores
* nv hcq fixes
2024-05-07 17:31:19 +03:00
nimlgen
e3bb85fd0e
amd timeline semaphores ( #4416 )
...
* amd timeline semaphores
* v2
* fixes
* reset signals
* fix
* rollover test
* small fixes
* linter
* copyin
2024-05-07 11:17:32 +03:00
George Hotz
17faae091b
optimizer shouldn't be run without training ( #4460 )
...
* optimizer shouldn't be run without training
* set training in relevant tests
* fix multitensor
* that too
2024-05-06 15:34:12 -07:00
qazal
35dfbc6354
rand_for_dtype helper ( #4459 )
2024-05-07 00:03:42 +03:00
nimlgen
a3140c9767
nv boost subdevice ( #4456 )
2024-05-06 23:05:20 +03:00
Francis Lam
47750e65fd
kernel: un-reverse the order of the local indices ( #4454 )
...
no change to performance or behavior. new LOCALS are added to the
left side of the LOCALS block (to the left of the first_reduce).
2024-05-06 15:21:27 -04:00
chenyu
5e036cd0b3
test unary and more reduces in test_flopcounter ( #4455 )
...
cannot really catch a spec change error without testing the new spec explicitly, but we don't intend to change the lazy spec lightly
another possible way to catch a reduce flopcounter shape change would be type checking InterpretedFlopCounter and throwing an error if `in` results in `Never`
2024-05-06 15:15:16 -04:00