tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-02-05 20:24:57 -05:00

Author	SHA1	Message	Date
David Hou	2befdf86d9	dataloader worker/shm cleanup (#3710 )	2024-03-12 21:44:24 -04:00
chenyu	e1b2a82d89	fix st.real_size can be negative if valid is always false (#3708 ) two followups after this. (1) if a buffer is never accessed in kernel, it can be removed from input (2) real_size can be smaller conditional on valid being true (the old validhack stuff)	2024-03-12 20:34:07 -04:00
chenyu	b13457e4a7	explicit dtypes in hlb_cifar (#3707 ) prepared bfloat16 change. added float() and cast(default_float) in whitening, explicitly set dtype in various places that convert between numpy and Tensor	2024-03-12 18:20:23 -04:00
Francis Lam	b6e2495fdd	kernel: limit shared memory usage when adding opts (#3705 ) * kernel: limit shared memory usage when adding opts * search: remove unnecessary limit on search space apply_opt will do the more correct check	2024-03-12 17:06:21 -04:00
George Hotz	2024b24f35	add some graph tests (#3702 ) * add some graph tests * PatternMatcher class * speedup * const cast test * fix tests * itertools chain	2024-03-12 09:49:47 -07:00
chenyu	f599c6e7f4	test output dtypes match in test_ops (#3703 ) need to cast some torch output to int32 because torch default returns int64 for index related function close #2797	2024-03-12 12:44:40 -04:00
nimlgen	798970cfad	fix gpu hangs when exiting while aql queues are executing (#3700 )	2024-03-12 19:23:23 +03:00
chenyu	02ca067bdf	use default_float.np to construct test data in test_ops (#3701 ) first step of #2797	2024-03-12 11:58:20 -04:00
George Hotz	6755a9254f	constant fold pattern match (#3696 ) * constant fold pattern match * match * better match * fix bug in pattern * more folding	2024-03-12 08:48:07 -07:00
nimlgen	dd1a1c12df	rocm path in autogen (#3697 )	2024-03-12 14:06:43 +03:00
Patrick Tsai	971d7f5d7c	O(n) arange attempt (#3530 ) * It works? * Clamp correctly * Refactor * Make code better * Undo some stuff * First step to trying to make floats work * Floats work in Python op but not metal because int div is different Python integer division was implemented as // which rounds towards negative infinity, but C integer division rounds towards 0 so there is an off-by-1 division error * arange does cumsum with ints and then multiplies by step This is so loop optimization can remain int only * Undo a lot of symbolic changes * Final check * Cleanup * There can be multiple phis * Fix multiple phi op removal * const sets dtype correctly * Fix bugs * Fix a couple bugs and add loop vars to resolve * missed one * Don't trim too many ops * Fix symbolic test * Use ones instead of full * Delete test * Lint passes * max node error * Small updates to loop logic * Remove unnecessary changes * We are getting somewhere * Simple case * Fix * rm, prn * Better * If NumNode doesn't work then continue * clamp is needed for arange(256) * Move everything into the optim fn * Replace correctly * Order optimizations better * Delete * mypy * Test for simplification * Rename * Fix test * update test description * Undo more * Cleanup * No replaced_ops map * Fix lint * AssertionError * back again * Reinstate assertion * Return true and make diff not as big * Bigger range for test * Change cumsum impl * fix bug * make big cumsum work * lint * Undo cumsum 2-stage removal * No while helper * optional min/max clamping * floats work * rm giant arange test * fix python cast None * Check phi parents * one phi allowed per where * Fix one phi per where * Rework iteration * Delete assertions * convert to int * Try mul -1 instead of neg for hip..? * Remove one phi per where requirements * one accum only * Lint * should simplify a loop at a time * Don't get rid of loop explicitly * Need to iterate backwards * lint * unary neg * Make optim work for onnx and sum_pad_collapse * Better message * filter alu ops correctly * Fix the limiter * lint and simplify * Add it back * off by one error * test wheres and phis * test max ops and non-if stuff * <= * cast_scalar * Oops * Change test * Pass loop uops instead of a modified map * Cut param transfer between linearizer and uops * Fix issues * Fix lint * fix efficientnet python 3.8 invalid syntax * distinct vars in seen_vars * accurate var names --------- Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com> Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-03-11 16:09:20 -07:00
George Hotz	a5d023dff8	reciprocal mlop (#3694 )	2024-03-11 16:08:46 -07:00
George Hotz	3af1c1051a	Revert "bring reciprocal back (#3687 )" (#3692 ) This reverts commit `bcf6fbd3b2`.	2024-03-11 15:55:14 -07:00
George Hotz	ef44c8959b	Revert "rewrite recip to div (#3690 )" (#3691 ) This reverts commit `2b089bfd18`.	2024-03-11 15:54:58 -07:00
George Hotz	2b089bfd18	rewrite recip to div (#3690 ) * rewrite recip to div * fix bug in uops add	2024-03-11 15:52:24 -07:00
qazal	aec4c4f01b	linearizer ast as a tuple of lazyops (#3689 ) * multi store op linearizer * currently we do only one output per kernel * named opts	2024-03-11 15:39:04 -07:00
chenyu	d0bcc9a66b	replace all `if dim < 0: dim += self.ndim` with _resolve_dim (#3688 )	2024-03-11 17:33:36 -04:00
George Hotz	bcf6fbd3b2	bring reciprocal back (#3687 ) * bring reciprocal back * better * explicit dtype for recip * llvm tighter * sigmoid can use RECIP	2024-03-11 14:19:54 -07:00
Francis Lam	9f13960f72	search: catch RuntimeError when timing acted_lins (#3664 ) when compilation succeeds, but runtime fails due to thread limits on METAL, this allows a beam search to proceed, treating this the same way as a compile failure.	2024-03-11 16:14:03 -04:00
rnxyfvls	490c5a3ec3	examples/stable_diffusion: support model checkpoints without alphas_cumprod key (#3681 ) * examples/stable_diffusion: support model checkpoints without alphas_cumprod key (which is most models on civitai) * fix indent --------- Co-authored-by: a <a@a.aa>	2024-03-11 16:05:52 -04:00
Francis Lam	3219a527d6	search: add a tool that beam searches one or more kernels (#3685 )	2024-03-11 16:02:17 -04:00
chenyu	b68fbd7d81	View.__add__ to merge_view (#3686 ) verified the cases that used real_strides are redundant	2024-03-11 15:52:34 -04:00
nimlgen	76ade20b89	hsa driver tiny cleanups (#3684 )	2024-03-11 22:32:43 +03:00
chenyu	d69170e27e	add llama 2 70B in ci and verify output (#3682 ) * add llama 2 70B in ci and verify output * ln -s llama2 dir	2024-03-11 12:48:22 -04:00
chenyu	e10ee2ed3f	llama beam tinybox ci (#3680 )	2024-03-11 01:35:39 -04:00
George Hotz	3415b0ee54	hotfix: mixtral copies norms together for 2% speed	2024-03-11 01:28:03 +00:00
Skosh	e8c350fdac	fix: make Tensor.rand produce correct values for float16 (#3654 ) * fix: make Tensor.rand produce correct values for float16 Due to precision loss when casting to float16, the data distribution created by custom_random isn't correctly in the interval ]0, 1[, but instead in the interval ]0, 1], which causes the Tensor.randn to incorrectly generate values of infinity. The solution uses a scaling value to make sure the values stay under 1, when using half precision. Closes #3611 * update implementation to truncate to closest f16 value to 1 * chore: fix whitespace * test larger distribution --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-03-10 18:48:00 -04:00
chenyu	bad6adaf8c	add mixtral and 6 gpus cifar to tinybox ci (#3676 ) * add mixtral and 6 gpus cifar to tinybox ci * print total ram used at the end of loading	2024-03-10 18:25:31 -04:00
George Hotz	44a67bf783	constant folding (#3675 ) * constant fold * bool math * fix ptx	2024-03-10 14:47:24 -07:00
George Hotz	25aede6fd9	truncate for exec_alu (#3674 )	2024-03-10 14:19:04 -07:00
Francis Lata	957ae9b594	Fix Tensor's __repr__ for printing out grad (#3673 ) * update check for Tensor's __repr__ with grad * add test for repr with grad bugfix	2024-03-10 17:04:29 -04:00
George Hotz	0f16729023	RDNA3: restore launch bounds (#3672 ) * bring launch bounds back * works * that second flag didn't do anything * fix linter	2024-03-10 10:27:52 -07:00
chenyu	d7452c2a20	clean up llvmir builder (#3671 ) ``` _block -> block builder._block.module -> builder.module var_dtype -> dtype ```	2024-03-09 21:19:36 -05:00
George Hotz	1143c62519	tensor.py touchups (#3667 ) * tensor.py touchups * put back	2024-03-09 16:12:20 -08:00
George Hotz	69ca7f7bf9	changes for teenygrad (#3665 ) * changes for teenygrad * upd * simpler test	2024-03-09 15:30:34 -08:00
Quentin Wach	89b8b5d549	Fix missing import. (#3666 )	2024-03-09 14:55:23 -08:00
Maximilian Wolf	8ae85b2cf5	add inference_mode context manager with decorator support (#3621 ) * add inference_mode context manager with decorator support * change val to mode for train and inference_mode * fix wrong rename	2024-03-09 08:38:26 -08:00
Obada Khalili	b5cbf1792a	Fix `Tensor.cumsum` when axis of length 0 is selected (#3473 ) * fix Tensor.cumsum when axis of length 0 is selected * add cumsum regression test * define padding left size in a separate line	2024-03-09 08:26:41 -08:00
chenyu	915f98791c	use custom KernelOptError in kernel opt (#3661 ) be more specific about invalid kernel opt, used that in test_linearizer_failures. make BEAM kernel search work even with assertion disabled. `BEAM=2 python3 -O examples/llama.py --temperature=0 --count=10 --prompt="Hello." --timing`	2024-03-08 15:36:16 -05:00
George Hotz	ac02e7347d	ptx timing vs cuda timing (#3659 )	2024-03-08 10:17:49 -08:00
uuuvn	daa4034e80	No more metal flakiness (#3643 )	2024-03-08 08:54:44 -08:00
chenyu	e25879d50e	don't get new var_val for the same ast in fuzz_linearizer (#3657 ) fixed result comparison for kernels with variables	2024-03-08 09:49:24 -05:00
chenyu	1130c73844	add FUZZ_NTH to fuzz_linearizer (#3656 ) * add FUZZ_NTH to fuzz_linearizer also update tests in test_linearizer_failures to not just run on METAL * update failures for HIP/HSA * test_failure_21 LLVM PADTO	2024-03-08 09:16:49 -05:00
David Hou	9f66dcf718	PolynomialDecayWithWarmup + tests (#3649 ) * working PolynomialDecayWithWarmup + tests....... add lars_util.py, oops * keep lars_util.py as intact as possible, simplify our interface * whitespace * clean up * clean up * asserts * test polylr for full resnet training run * add comment * rename * fix do_optim * don't cast lr * info * calculate from train_files * skip it	2024-03-07 18:53:36 -05:00
chenyu	57df8e8d82	update fuzz_linearizer (#3648 ) included non-reduce kernel and kernel with variables. green msg when everything passed it's possible that creating rawbufs failed due to memory error, included that in failure cases	2024-03-07 18:41:22 -05:00
chenyu	b282a45e39	fix direct store float4 with same vin (#3652 ) In a kernel that stores expanded value, the vin of float4 can come from same source, and we only remove once in that case.	2024-03-07 18:11:50 -05:00
chenyu	a66ffec6d3	update kernel dataset to exclude the disktensor ones (#3651 ) disk tensor load contains big offset and is not meant to be run by gpu. repro steps ``` time ./extra/optimization/generate_dataset.sh gzip /tmp/sops mv /tmp/sops.gz extra/datasets/ ```	2024-03-07 17:35:19 -05:00
chenyu	fcf4a5ccf2	fix example that calls Tensor.__bool__ (#3650 ) also removed `.cpu()` calls in mask_rcnn so `python3 examples/mlperf/model_spec.py` runs	2024-03-07 16:59:26 -05:00
George Hotz	6e50582e62	working to improve ptx (#3647 ) * working to improve ptx * fix compile fail	2024-03-07 12:39:31 -08:00
Zaffer	1853ec9a02	add tests for bfloat16 on HIP (#3638 ) * Fix bug in login functionality * Remove HSA backend test and add bfloat16 dtype tests that run in CI * Skip tests on HIPCPU * skip tests causing segfault on LLVM backend * Exclude bfloat16 tests causing segfaults in LLVM backend * move bf16 cast tests to only test on HIP	2024-03-07 10:45:36 -08:00
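
The message of commit 971d7f5d7c notes that Python's `//` rounds towards negative infinity while C integer division rounds towards 0, which gives an off-by-1 result for operands of mixed sign. The sketch below illustrates that difference only; it is not code from the repository, and the helper name `c_style_div` is hypothetical.

```python
def c_style_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero, as C does.

    Hypothetical helper for illustration; not part of tinygrad.
    """
    q = abs(a) // abs(b)                      # magnitude of the quotient
    return q if (a < 0) == (b < 0) else -q    # negative only when signs differ


if __name__ == "__main__":
    # Compare Python floor division with C-style truncating division.
    for a, b in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
        print(f"a={a:>2} b={b:>2}  a // b = {a // b:>2}  c_style_div = {c_style_div(a, b):>2}")
```

The two results agree when both operands have the same sign and differ by one when the signs differ and the division is inexact, which matches the off-by-1 error described in the commit message.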


10633 Commits