qazal
c5f5755328
correctness test for multireduce nested locals ( #4682 )
...
* nested locals test
* move st
2024-05-22 19:35:35 +03:00
chenyu
bc9be39dec
set timeout in search _try_compile_linearized_w_idx ( #4677 )
2024-05-22 12:30:31 -04:00
qazal
d12d412e8b
revert uops dtype in pattern matcher ( #4681 )
...
This reverts commit 5f84cbb5df.
2024-05-22 14:45:51 +03:00
Elias Wahl
acc0039cfc
Resume fix + scheduler for non weight decay params ( #4679 )
...
* move ckpt dir
* fix resume. Add scheduler group
2024-05-21 19:38:13 -04:00
chenyu
0f21aa0416
example kernel that triggers Memory access fault for resnet on red ( #4678 )
2024-05-21 18:59:36 -04:00
qazal
5f84cbb5df
keep UOps.CAST in PHI-GEP fold for unmatching dtypes ( #4674 )
...
* these should be val.dtype
* cast float4 and float2 to root
* document tests
* 2 args
* fix assert
* match dtype
* no extra lines
* better fix
2024-05-21 14:59:49 -04:00
qazal
458a3961eb
catch compile errors in uops tests ( #4672 )
...
* use helper and compile
* llama beam=2
* ast length
* skip float4, fix hsa
* use empty tensors
2024-05-21 12:20:35 +03:00
wozeparrot
00432496d7
feat: tinyboxgreen ( #4366 )
...
* feat: tinyboxgreen
* feat: tinyboxgreenv2
* fix symlink weights
* fix: remove llama 2 70b for now
* feat: naming
* fix: remove extra cifar steps
* feat: disable mixtral on nvidia
2024-05-20 22:39:34 -04:00
Timmy
de733d73cf
Multireduce Linearizer Tests ( #4665 )
...
* updated tests
* make sure the upcasting tests actually causes the problem
* diff cleanup
* use UOpGraph utils
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2024-05-21 02:43:25 +03:00
chenyu
5e3fbbb33e
llama3 example add manual seed and log seed ( #4667 )
2024-05-20 19:09:57 -04:00
chenyu
8c99cc17f5
remove link to old adding_new_accelerators.md ( #4666 )
...
fix #4657
2024-05-20 19:05:23 -04:00
chenyu
c4089d169f
update BEAM_LOCAL_MAX to 1024 ( #4664 )
...
we used 1024 for the mlperf submission and the resulting step time is 20% faster. the default should not be worse
2024-05-20 18:06:32 -04:00
chenyu
704cb1d8a0
fix conversation.py quantize ( #4663 )
...
it used to be True for int8, now it's a string for int8 or nf4
2024-05-20 17:36:37 -04:00
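The change above (a boolean flag becoming a string choice) is a common argparse migration. A minimal hypothetical sketch, assuming an argparse-based CLI with a `--quantize` option (the option name and choices here mirror the commit message, but the actual conversation.py interface is not shown in this log):

```python
import argparse

# hypothetical sketch: `--quantize` moves from a store_true boolean flag
# to a string option restricted to the two quantization modes mentioned
parser = argparse.ArgumentParser()
parser.add_argument("--quantize", type=str, default=None, choices=("int8", "nf4"))

args = parser.parse_args(["--quantize", "nf4"])
print(args.quantize)  # nf4
```

Code that previously tested the flag for truthiness (`if args.quantize:`) keeps working, but comparisons like `args.quantize == True` would silently break, which is the kind of breakage such a fix addresses.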
chenyu
ae861325ce
update llama sample for mac 32 input buffer limit ( #4662 )
...
set the default sampling param for function call to 0, and top k in llama3 to 25.
2024-05-20 17:23:39 -04:00
Elias Wahl
993091adfa
loss scaler + nan fixes ( #4661 )
2024-05-20 17:08:35 -04:00
qazal
b33c827aed
UOps.RANGE toposort spec ( #4660 )
...
* use iterator
* nested loops and outer loads
* uop after phi
2024-05-20 23:38:20 +03:00
qazal
0d9e623d83
consolidate uops tests ( #4659 )
...
* merge uoptimize
* move tests
* fix skip message
2024-05-20 21:42:31 +03:00
Szymon Ożóg
1e7b7b2c3c
Fix flop counting for mulacc ( #4640 )
...
* Fix flop counting for mulacc
* add test_simple_mulacc
* Update test_uops_stats.py
* Update test_uops_stats.py
* revert test_mulacc
* Test for MULACC vs MUL+ADD
2024-05-20 12:06:00 -04:00
wozeparrot
b144d4b460
new llama3 example ( #4576 )
2024-05-19 22:42:23 -07:00
nimlgen
c9f7f2da70
nv hcq bind api ( #4629 )
...
* hcq bind api for nv
* linter
* linter
* add test
* small comment
2024-05-19 23:17:10 +03:00
qazal
d308f4fa9a
correctly insert UOps.END* in fuzz result ( #4653 )
2024-05-19 21:10:28 +03:00
chenyu
456aa0b656
update test_search kernel count ( #4652 )
...
integration test that beaming 1 kernel increments kernel count by 1, and moved existing test_kernel_count to TestTimeLinearizer
2024-05-19 13:54:52 -04:00
qazal
954718e6bf
reorder DEFINE_GLOBAL in fuzz_uops ( #4651 )
...
* globals base
* test: opt out of DEFINE_GLOBAL
* do it like ExecItem
2024-05-19 20:51:31 +03:00
Léo
967e35f8b8
fix(beam): GlobalCounters kernel count increasing when clearing l2 ( #4598 )
...
* fix(beam): GlobalCounters kernel count increasing when clearing l2
* fix: removed the NOSTATS var by adding do_update_stats to Tensor.realize()
* test(search): regression test for _time_program, should not increment kernel_count
* fix(test_search): unused var and now properly checking when l2 is cleared
* fix(test_search): added assert message
* fix(test_search): now testing public beam api for kcount
* ruff fixes
---------
Co-authored-by: Léo Paillé <leo.paille@enseirb-matmeca.fr>
2024-05-19 10:03:47 -07:00
George Hotz
4753283221
LOOP -> RANGE ( #4650 )
2024-05-19 06:40:20 -07:00
chenyu
286b4dbdf2
compile raise CompileError and skip only RuntimeError in multiprocess… ( #4646 )
...
* compile raise CompileError and skip only RuntimeError in multiprocess beam
a renderer error with multiprocess should not be skipped by beam
* use `==` for dtype to dtype comparison
* that needs to be is
* typo
2024-05-19 00:25:25 -04:00
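The `==` vs `is` back-and-forth in the bullets above hinges on a Python distinction worth spelling out: `==` calls `__eq__` (value equality), while `is` compares object identity. A minimal hypothetical sketch (this is not tinygrad's actual DType class, just an illustration of the distinction):

```python
# hypothetical DType stand-in to illustrate value equality vs identity
class DType:
    def __init__(self, name): self.name = name
    # `==` goes through __eq__, so two distinct objects can compare equal
    def __eq__(self, other): return isinstance(other, DType) and self.name == other.name
    def __hash__(self): return hash(self.name)

a, b = DType("float32"), DType("float32")
print(a == b)   # True  -- same value via __eq__
print(a is b)   # False -- two distinct objects, identity differs
```

When dtypes are interned singletons, `is` works and is faster; when they can be constructed independently, only `==` is correct, which is why the choice had to be revisited in the commit.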
chenyu
8a0d1ca7bb
CI test timeout 20 min -> 10 min ( #4645 )
...
if it takes more than 10 minutes, setup usually fails anyway. also updated matmul_kfd -> matmul_amd in benchmark
2024-05-18 13:58:28 -04:00
qazal
b0cb02f719
uops fuzzing infra ( #4641 )
...
* base with bfs
* find paths
* get last
* try blocks
* Revert "try blocks"
This reverts commit 25f8e3fe85.
* this should be simpler
* full exec
* support debug
* fix lint
* add todo
* copy in_degree
2024-05-18 20:19:57 +03:00
qazal
bf8f855838
assert kernel counts in unsupported fusions ( #4643 )
...
* replace with comments
* not relevant
* update comment
* custom exception maybe
* fix LoadOps.VIEW
2024-05-18 20:14:37 +03:00
qazal
a5204fe89d
refactor UOps.CONST ( #4639 )
...
* delete more
* nit: dont need assign
* can this be simpler
* use scalars
* always cast
* clang needs cast
* format
2024-05-18 10:07:36 +03:00
qazal
d0a2d40df3
root cause fix for UOps.CONST bad args ( #4638 )
...
* delete that
* real fix
2024-05-18 09:15:25 +03:00
George Hotz
9b464e34ea
increase speed of uops ( #4637 )
...
* increase speed of uops
* not equal
* minor speedup
2024-05-17 21:04:39 -07:00
George Hotz
b74cc1d01a
uops cleanup ( #4634 )
...
* def add cleanup
* minor speedup
* add back ptx speed
* a little faster
* merge that
* only linearize once for ptx
* two graph rewrites for ptx, bug?
2024-05-17 20:02:38 -07:00
George Hotz
07b350a8f4
new uops is an actual graph ( #4560 )
...
* new uops is an actual graph
* it's way slower
* simpler
* fix define acc
* render_loop unique
* ops test pass
* add pattern matcher back, there's bugs
* rewrite
* use priority queue
* recursive children
* fix tests
* fix tests with SINK
* fix abstractions
* fix assembly
* simpler
* link define_acc
* fix DEFINE_ACC placement
* type verify
* full cmp
* fix cmp
* ACCESS_ACC
* insert DEFINE_ACC
* fix PHI
* recursive rewrite
* fix many tests
* sum collapse
* more patterns
* correct change
* fold arange
* fix that lin test
* space
* big folding rule works
* close
* has more maxes, meh
* cached node replace
* set changed
* simplest folding yet
* works
* works
* DIV
* all tests pass
* del
* fuzz linearizer fails
* sum_collapse
* test depth 2 cf
* fix lin test 14
* fix clang depth
* disable that
* failure 14 is fixed
* fix ptx
* failure 27 is fixed
* fix llama
* run_cnt
* Revert "Optimize PTX gated loads index calculation (#4304 )"
This reverts commit d97d5a7689.
* fix uops loop
* fix ptx bugs
* add barrier
* print
* mem_type in ptx direct
* bypass tests that fail in CI but pass locally
* ptx remove ptr_ar
* more ptx passing
* fix ptx tests
* assert compile support
* remove model inference benchmark from red
2024-05-17 18:00:18 -07:00
nimlgen
daf57af3eb
move tc to renderers ( #4631 )
...
* move tc to renderers
* missed import
* fix typo
* fix
* fix imports
* remove from tests
* fix 4607
* nv emulate timestamp
* time is int
* correct time
2024-05-18 00:36:29 +03:00
chenyu
d70988dddf
add blob and raw=true for image in docs showcase ( #4632 )
...
this should render the image correctly
2024-05-17 16:57:15 -04:00
nimlgen
10cf8e459b
hcq update queue in place ( #4626 )
...
* do not self wait in hcq
* faster enqueue
* comments
* tests
* linter
* fix typo
2024-05-17 22:18:20 +03:00
chenyu
ca1df20fa9
benchmark name fix - resnet eval is on eval data ( #4628 )
2024-05-17 12:56:12 -04:00
chenyu
c86adabe15
time with real global buffers in search ( #4621 )
...
* filter fake buffers in search
* test that
* update test
2024-05-17 12:36:23 -04:00
chenyu
e5d4e6a8aa
BEAM=2 in green CI for 100 TFLOPS ( #4624 )
2024-05-16 23:28:28 -04:00
chenyu
b3dd885ffb
cleanup double import from tinygrad.device in tensor.py ( #4620 )
2024-05-16 14:21:22 -04:00
uuuvn
639ea5b0f2
Metal linearizer failure 22 is flaky not just on CI ( #4617 )
...
* METAL doesn't fail anymore, not just on CI
* oops
2024-05-16 11:31:23 -04:00
qazal
f3f2b96583
pick schedule tests from external_test_opt ( #4615 )
...
* conv tests
* misc
* that shouldnt const fold
2024-05-16 15:43:41 +03:00
qazal
13200c6894
check simple_pads in all views ( #4614 )
2024-05-16 14:34:39 +03:00
qazal
0b464df605
base change scheduling spec ( #4613 )
...
* spec and kernel cnt
* dont use half
* skip half
2024-05-16 13:30:49 +03:00
nimlgen
65f7e3b3ab
nv setup constbuf4 ( #4511 )
...
* nv correct constbuf 4
* compare results to cuda
* test fixed
* failed kernel
* repro
* revert this change
2024-05-16 10:42:35 +03:00
chenyu
04f2327ca3
fix abs of diff of uint ( #4411 )
2024-05-15 18:39:11 -04:00
chenyu
2119e0456d
redo simpler abs and sign ( #4611 )
...
moved Sign logic to function.py, and backward always returns 0 to match torch.
rewrite abs as `self * self.sign()`, so its backward also matches torch.
2024-05-15 18:19:46 -04:00
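The abs rewrite described above can be sketched in plain Python (a simplified illustration, not tinygrad's actual Function classes): with sign(0) defined as 0, rewriting abs as `x * sign(x)` gives d/dx |x| = sign(x) under the product rule (the sign factor is treated as a constant in backward), so the gradient at x = 0 is 0, matching torch:

```python
def sign(x: float) -> float:
    # sign with sign(0) = 0; per the commit, backward of sign always returns 0
    return float((x > 0) - (x < 0))

def abs_via_sign(x: float) -> float:
    # abs rewritten as x * sign(x); its derivative is then sign(x),
    # which is 0 at x = 0 (torch's convention for abs'(0))
    return x * sign(x)

print(abs_via_sign(-3.0), abs_via_sign(0.0), abs_via_sign(2.5))  # 3.0 0.0 2.5
```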
nimlgen
eb9689336e
nv mockgpu ( #4600 )
...
* mockgpu nv
* works
* comment that out
* fix merge
* setup gpuocelot
* install packages
* not run all of them
* passes
* fix ci
* almost
* should pass
* linter
* linter 2
* try this?
* ugn, not supported
* ci
* remove ticket from description
* better descs
2024-05-15 23:46:08 +03:00
chenyu
3c11ca452e
skip CLANG test casts between double and half for now ( #4609 )
...
started breaking after the GitHub CI image update
2024-05-15 16:17:06 -04:00