chenyu
8aee3f5a9a
docs: split, chunk, pad2d, flatten, unflatten ( #4706 )
2024-05-23 20:34:40 -04:00
wozeparrot
2c56aa7fe0
activation function docs ( #4705 )
2024-05-23 17:12:16 -07:00
nimlgen
27abbd5b2b
signal pool for nv/amd ( #4701 )
...
* signal pool
* useless
2024-05-24 02:09:52 +03:00
Francis Lam
49225522aa
wmma: chain unrolled WMMAs and phi only at the end ( #4703 )
...
* wmma: chain unrolled WMMAs and phi only at the end
* fix linter and tests
* reduce lines
2024-05-23 17:50:18 -04:00
chenyu
eb714a600d
fix UOps.CAST noop for vectorized dtypes ( #4704 )
...
* ==
* add test
* not lazyop
* use str comparison for PtrDType
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2024-05-23 17:33:29 -04:00
Szymon Ożóg
00bc2b738c
Fix tensor cores in PTX ( #4698 )
2024-05-23 16:27:51 -04:00
chenyu
38bc38cdff
fix llama example quantize ( #4699 )
...
* fix llama example quantize
import quantize layers from new example llama3
add to mac benchmark
* fix that
* save the files
2024-05-23 15:35:26 -04:00
qazal
532c9e08e3
proposal: PHI nodes in TC shouldn't have children inside the loop ( #4694 )
...
* expectations from UOpGraph
* one with children
* minimal repro
* replace
2024-05-23 15:11:26 -04:00
chenyu
afb426acaf
docs: gather, cat, stack, repeat, squeeze, unsqueeze ( #4697 )
...
* docs: gather, cat, stack, repeat, squeeze, unsqueeze
repeat can take separate args now to match torch
* new style for multi examples
2024-05-23 14:20:19 -04:00
chenyu
ce46a7e83f
raise CompileError in metal if newLibraryWithSource_options_error_ fails ( #4695 )
2024-05-23 12:52:46 -04:00
Timmy
871a3292f4
Refactors linearizer acc to a Dict ( #4675 )
...
* dict accs refactor
* bug
* linters
* fix line length limit
* renaming do_reduce to reduce_acc b/c it's the acc for whatever reduce we are doing
* reduce_acc is None
* x.op and reduce_acc is not None
* delete extra check
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2024-05-23 19:05:23 +03:00
chenyu
72560e30fe
add CACHELEVEL=0 to tinybox green GEMM BEAM ( #4693 )
...
* add CACHELEVEL=0 to tinybox green GEMM BEAM
* BEAM=4 is more stable
2024-05-22 23:59:50 -04:00
Yury Zhuravlev
af56f0e68a
fix HSA/KFD load for system-wide installation ( #4218 )
...
Co-authored-by: wozeparrot <wozeparrot@gmail.com>
2024-05-22 20:33:21 -07:00
nimlgen
12339f6564
disable cuda test in ci ( #4630 )
...
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-05-22 23:23:32 -04:00
Szymon Ożóg
9a9963ba7b
Remove uops deepcopy from PTX ( #4671 )
...
* Remove uops deepcopy from PTX
* Update test
* Fix test
* fix for non-ptx
* Clean
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-05-22 23:14:17 -04:00
chenyu
47aba47f64
update Torch.gather api ( #4692 )
...
* update Torch.gather api
gather(self, dim, index) to match torch
* fix that
2024-05-22 21:54:06 -04:00
chenyu
792a494eb8
fix various examples ( #4691 )
...
* fix examples that used ax1 and ax2 for transpose
* fix that
* update those
2024-05-22 20:43:21 -04:00
wozeparrot
30b07f3c5d
reduce ops ( #4690 )
2024-05-22 16:20:56 -07:00
chenyu
a46be6cfef
docs for transpose ( #4689 )
...
* docs for transpose
change the arg from ax1, ax2 to dim0, dim1 too
* too clever
2024-05-22 18:44:33 -04:00
chenyu
86da83f86d
move movement op docs ( #4688 )
2024-05-22 18:09:14 -04:00
qazal
498cf3e7e0
fuzzer path search for DEFINE_ACC ( #4656 )
...
* insert acc
* add test_ops
* find toposorts
* todo - not yet ready
* remove the import
* atol and childless children
2024-05-23 00:50:01 +03:00
qazal
f11a81f707
isolated test for BEAM=2 llama wrong uops toposort ( #4687 )
...
* add ast
* skip test in CI
2024-05-23 00:47:37 +03:00
wozeparrot
6020595eb0
more tensor.py docs ( #4686 )
...
wow much docs
2024-05-22 21:28:26 +00:00
Francis Lam
721f9f6acf
test/external/verify_kernel: fix LOGKERNS variable name in comments ( #4685 )
...
should've been changed with the LOGKERN to LOGKERNS change
2024-05-22 17:08:40 -04:00
chenyu
f8f97562e0
remove File Specific Variables from env_vars.md ( #4684 )
2024-05-22 17:00:14 -04:00
chenyu
225dcab3be
prepend _ to broadcast_shape and deepwalk ( #4683 )
...
* prepend `_` to broadcast_shape and deepwalk
internal only
* that too
2024-05-22 16:39:05 -04:00
qazal
c5f5755328
correctness test for multireduce nested locals ( #4682 )
...
* nested locals test
* move st
2024-05-22 19:35:35 +03:00
chenyu
bc9be39dec
set timeout in search _try_compile_linearized_w_idx ( #4677 )
2024-05-22 12:30:31 -04:00
qazal
d12d412e8b
revert uops dtype in pattern matcher ( #4681 )
...
This reverts commit 5f84cbb5df .
2024-05-22 14:45:51 +03:00
Elias Wahl
acc0039cfc
Resume fix + scheduler for non weight decay params ( #4679 )
...
* move ckpt dir
* fix resume. Add scheduler group
2024-05-21 19:38:13 -04:00
chenyu
0f21aa0416
example kernel that triggers Memory access fault for resnet on red ( #4678 )
2024-05-21 18:59:36 -04:00
qazal
5f84cbb5df
keep UOps.CAST in PHI-GEP fold for unmatching dtypes ( #4674 )
...
* these should be val.dtype
* cast float4 and float2 to root
* document tests
* 2 args
* fix assert
* match dtype
* no extra lines
* better fix
2024-05-21 14:59:49 -04:00
qazal
458a3961eb
catch compile errors in uops tests ( #4672 )
...
* use helper and compile
* llama beam=2
* ast length
* skip float4, fix hsa
* use empty tensors
2024-05-21 12:20:35 +03:00
wozeparrot
00432496d7
feat: tinyboxgreen ( #4366 )
...
* feat: tinyboxgreen
* feat: tinyboxgreenv2
* fix symlink weights
* fix: remove llama 2 70b for now
* feat: naming
* fix: remove extra cifar steps
* feat: disable mixtral on nvidia
2024-05-20 22:39:34 -04:00
Timmy
de733d73cf
Multireduce Linearizer Tests ( #4665 )
...
* updated tests
* make sure the upcasting tests actually cause the problem
* diff cleanup
* use UOpGraph utils
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2024-05-21 02:43:25 +03:00
chenyu
5e3fbbb33e
llama3 example add manual seed and log seed ( #4667 )
2024-05-20 19:09:57 -04:00
chenyu
8c99cc17f5
remove link to old adding_new_accelerators.md ( #4666 )
...
fix #4657
2024-05-20 19:05:23 -04:00
chenyu
c4089d169f
update BEAM_LOCAL_MAX to 1024 ( #4664 )
...
we used 1024 for the mlperf submission and the resulting step time is 20% faster. the default should not be worse
2024-05-20 18:06:32 -04:00
chenyu
704cb1d8a0
fix conversation.py quantize ( #4663 )
...
it used to be true for int8, now it's a string for int8 or nf4
2024-05-20 17:36:37 -04:00
chenyu
ae861325ce
update llama sample for mac 32 input buffer limit ( #4662 )
...
set default sampling params to function call to 0, and top k in llama3 to 25.
2024-05-20 17:23:39 -04:00
Elias Wahl
993091adfa
loss scaler + nan fixes ( #4661 )
2024-05-20 17:08:35 -04:00
qazal
b33c827aed
UOps.RANGE toposort spec ( #4660 )
...
* use iterator
* nested loops and outer loads
* uop after phi
2024-05-20 23:38:20 +03:00
qazal
0d9e623d83
consolidate uops tests ( #4659 )
...
* merge uoptimize
* move tests
* fix skip message
2024-05-20 21:42:31 +03:00
Szymon Ożóg
1e7b7b2c3c
Fix flop counting for mulacc ( #4640 )
...
* Fix flop counting for mulacc
* add test_simple_mulacc
* Update test_uops_stats.py
* Update test_uops_stats.py
* revert test_mulacc
* Test for MULACC vs MUL+ADD
2024-05-20 12:06:00 -04:00
wozeparrot
b144d4b460
new llama3 example ( #4576 )
2024-05-19 22:42:23 -07:00
nimlgen
c9f7f2da70
nv hcq bind api ( #4629 )
...
* hcq bind api for nv
* linter
* linter
* add test
* small comment
2024-05-19 23:17:10 +03:00
qazal
d308f4fa9a
correctly insert UOps.END* in fuzz result ( #4653 )
2024-05-19 21:10:28 +03:00
chenyu
456aa0b656
update test_search kernel count ( #4652 )
...
integration test that beaming 1 kernel increments kernel count by 1, and moved existing test_kernel_count to TestTimeLinearizer
2024-05-19 13:54:52 -04:00
qazal
954718e6bf
reorder DEFINE_GLOBAL in fuzz_uops ( #4651 )
...
* globals base
* test: opt out of DEFINE_GLOBAL
* do it like ExecItem
2024-05-19 20:51:31 +03:00
Léo
967e35f8b8
fix(beam): GlobalCounters kernel count increasing when clearing l2 ( #4598 )
...
* fix(beam): GlobalCounters kernel count increasing when clearing l2
* fix: removed the NOSTATS var by adding do_update_stats to Tensor.realize()
* test(search): regression test for _time_program, should not increment kernel_count
* fix(test_search): unused var and now properly checking when l2 is cleared
* fix(test_search): added assert message
* fix(test_search): now testing public beam api for kcount
* ruff fixes
---------
Co-authored-by: Léo Paillé <leo.paille@enseirb-matmeca.fr>
2024-05-19 10:03:47 -07:00