tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-01-23 05:48:08 -05:00

Author	SHA1	Message	Date
chenyu	aeaf7894a7	more generic version of #6548 (#6549 ) x(-1)<0 can be generalized to x(-1)<c, 473 -> 462 valids	2024-09-16 23:17:16 -04:00
chenyu	596f41eb46	simple drop image valid case (#6548 ) * simple drop image valid case started unit test, 530 -> 473 valids * cleanup	2024-09-16 22:54:07 -04:00
chenyu	798be6bb74	add gated read_image count in openpilot compile2 (#6546 ) 530 to go	2024-09-16 21:17:00 -04:00
George Hotz	cd90092f14	graph rewrite tests (#6519 ) * more graph rewrite tests * more complex test cases * more tests * more tests * cleanups * 9600 lines * cleanups	2024-09-15 17:29:16 +08:00
qazal	c5bae55ec8	new generate_dataset.sh (#6423 ) * new generate_dataset.sh * keep those there * test: rm expected failures * rename to extract	2024-09-09 15:13:07 +08:00
George Hotz	4b128da525	hotfix: line count to 9500	2024-09-06 09:10:03 +08:00
ignaciosica	c15506fc35	[WIP] amx support as TC (#5693 ) * almost working with relu, even hackable... but acc size is wrong, fix needed * upcast based on threads, change thread size to 4x4 * revert wrongfully commented assert * fix tc load indexing * modify for size 8 * fix bug for size 8 * Revert "fix bug for size 8" This reverts commit `cdb3f5df85`. * Revert "modify for size 8" This reverts commit `3ef0904bd9`. * good kernel with changes in lowerer * revert "good kernel with changes in lowerer" This reverts commit `975e2b5a4e`. * good kernel for relu! * refactor lowerer changes * add amx context var to helper * clean up amx flag * improve lowerer changes readability * improve check for amx * revert lowerer if * add float4 type rendering for clang * add amx definitions * enable indexing for clang if amx * working amx example, wrong because of dims * almost works for float 16, need to spot using double load in amx * cleaner render_kernel * revert chages in simple_matmul and delete env * add new var upcast_offset to get_optimized_ast * change axis for axes * invert if in rendering phi * fix some bugs * fix linearizer tests * fix vec/get pat for amx * remove clang tc if amx is disabled * add ops_python support * refactor into one complementary function in ops_python * add job for EMUALTE_AMX * improve checking for AMX in UPCAST and TC extra ops * fix lint issue * commit before refactor into autocontained AMX * start refactor by removing special rendering for AMX * all ready for amx handcoded kernel * working poc, most straightforward amx support * avoid local opts for tc if amx * fix merge bugs * skip test for clang * skip tc hand-coded opts if amx * remove hardcoded ops_python values * remove hardcoded sizes for amx kernel * fix ops_python bug where dim was hard-coded * change contract for vectorize * working without changes in lowerer * revert changes in gep rendering * fix ops_python * modify comment * skip test if clang for different type accumulation * move 
rename and bug for seperate pr * fix wrong path for test * addmm not implemented in torch for cpu * change struct for vector; equally slow but cleaner * revert modified test * simply wmma rendering * minor change * noqa:501 * add length 16 for AMX * fix vectorized half issue * fix error * remove comment * change set for dedup * split test of tensor_core_extra_ops so that cases that dont require locals run for AMX * add amx reference * load acc into amx registers * fix dtype rendering and remove noqa * moved tests change into another pr * add real AMX job for CI and fix bug * fix ops_python bug * fix test class * remove real AMX tests and fix uops_stats test * remove wrong test * acc folding * hotfix: bug * fix float4 tests for amx * hack for fixing flops counting * hotfix: mypy * add flop counts test for amx * improve test_float4_multidim_amx * improve test_float4_multidim_amx * improve test_float4_multidim_unaligned_load_amx * nits tests --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-09-06 09:01:10 +08:00
nimlgen	d22b46a2ac	qcom in benchmarks (#6337 )	2024-09-02 19:59:11 +03:00
nimlgen	8e2a3fc165	raise lines count to 9300 for qcom (#6336 )	2024-09-02 18:57:57 +03:00
George Hotz	365babe391	precompute early_reject [run_process_replay] (#6327 ) * precompute early_reject [run_process_replay] * features for ebs * fix ocelot cache	2024-08-29 18:26:24 -07:00
CaltropHungerton	002f60b4c3	fix intel wmma flop counting, add flop counting tests for different tensor cores (#6192 ) * fix wmma flop counting on intel, add count tests * half * add half gemm * Update test.yml * one test * Update test_uops_stats.py * Update test_uops_stats.py * Update test_uops_stats.py * smaller matrix, use unittest skipUnless decorator	2024-08-25 18:37:05 -07:00
chenyu	e745e16441	remove UnaryOps.NEG (#6238 ) * Remove UnaryOps.NEG generated new dataset with `time JIT=2 PYTHONPATH=. ./extra/optimization/generate_dataset.sh`, `gzip /tmp/sops`, `mv /tmp/sops.gz extra/datasets/` * fix that	2024-08-22 14:21:39 -04:00
CaltropHungerton	38fb1e14a2	Intel XMX Tensor Core Support (#5622 ) * fixed xmx demo * i think i'm invoking the DPAS but it's slow * compiler build arg to stop register spilling, indicated where to fix flop counter * don't mind this * do NOT mind me * do not mind me * do not view * i will add bf16 later * in process of figuring out tc fields * we figured out the fields!!! * added check for cl device vendor, added seperate IntelRenderer * remove tc thread_local_aliases * cleaning debris before draft pr * edits for linter * deduping and checking device extensions * i will find more line reductions in other places * before merge upstream * double grf size in compiler to fix register spilling (bandaid), device checking changes * tc python emulation * fixed emulation * tests for emulated intel tensor core * TC=0, 1 working on upstream, fixed perf * test * debris * check for specialized cl device when we canonicalize device * bf16 support, tc=3 test added * address tests * revert half2 loads on intel tc, cleanup * linter * fold_expanded revert * lint, whitespace fix * cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too * make line shorter, no need for noqa E501 * removed device intel * fix python emulation --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-08-16 09:19:21 -07:00
George Hotz	e8ae9af962	bump line count to 9000. we should be here a while	2024-08-16 08:46:36 -07:00
chenyu	7d46fb0c83	load balance NV benchmark ci (#6107 )	2024-08-16 10:08:08 -04:00
chenyu	a41c9dd12c	test py.typed as a package (#6094 ) * test py.typed as a package * try this? * and this * try that? * add this back * cleanup	2024-08-15 11:19:08 -04:00
qazal	30035df5a4	add metal process replay back (#6068 ) test this new one	2024-08-14 12:29:56 +03:00
qazal	9d2ea94fe9	temp: disable process replay on metal (#6062 )	2024-08-13 16:31:55 +03:00
nimlgen	8f787785d9	fix openpilot benchmark (#6049 )	2024-08-12 21:12:32 +03:00
chenyu	e6c7c3e499	update pylint path to check indent/space for all (#6022 ) also fixed many errors. it was not checking nested dirs. exclude autogen for now. can we use ruff for this?	2024-08-10 14:41:09 -04:00
George Hotz	cfb04c67d1	run unit tests separate from others (and only once) (#6020 ) * run unit tests separate from others * ignore unit tests elsewhere	2024-08-10 11:17:56 -07:00
qazal	266afad8ed	hotfix: skip schedule capture in benchmarks (#6012 )	2024-08-10 17:13:53 +03:00
qazal	24c7c41ce0	diff LazyBuffer schedules in process replay (#5996 ) * start diff printing * this should be 2 * add to process_replay.py * enable schedule capture * arange diff is process replay	2024-08-09 14:16:43 +03:00
George Hotz	3d445039c2	hotfix: 8800 lines for AMX+intel tc	2024-08-06 17:50:26 -07:00
chenyu	adba5efc64	enable llama 2 70B in tinybox green CI (#5905 ) runnable with MAX_CONTEXT=256	2024-08-04 18:48:46 -04:00
George Hotz	7348c40d9d	sampling time sync (8700 lines) (#5843 ) * sampling time sync * jitter matrix * comment * pass mypy * line count	2024-08-02 14:44:35 -07:00
wozeparrot	acadccf344	comma benchmark (#5518 )	2024-08-02 14:36:54 -07:00
chenyu	f27f949a5d	Revert "revert some UOp IDIV bound (#5863 )" (#5871 ) This reverts commit `0c8d202348`.	2024-08-01 21:38:31 -04:00
chenyu	df138bc558	Revert "revert a mod pattern (#5864 )" (#5870 ) This reverts commit `5c8de2d044`.	2024-08-01 20:44:26 -04:00
chenyu	1b0314d9ef	Revert "remove one more UOp mod pattern (#5865 )" (#5868 ) This reverts commit `b03b8e18c2`.	2024-08-01 20:28:35 -04:00
chenyu	b03b8e18c2	remove one more UOp mod pattern (#5865 ) fixed UOP_IS_SYMBOLIC=1 test_failure_40	2024-08-01 18:29:04 -04:00
chenyu	5c8de2d044	revert a mod pattern (#5864 ) fixed UOP_IS_SYMBOLIC=1 linearizer failure 47	2024-08-01 17:24:26 -04:00
chenyu	0c8d202348	revert some UOp IDIV bound (#5863 ) * revert some UOp IDIV bound breaks conv with UOP_IS_SYMBOLIC, added some conv tests in CI * those are correct * skip slow ones	2024-08-01 15:09:06 -04:00
George Hotz	5eedd9e3ad	raise the line ceiling to 8600. USE LINES CAREFULLY	2024-07-31 09:56:39 -07:00
wozeparrot	eebb1b9922	feat: temperature 0 llama3 benchmark (#5806 )	2024-07-30 12:05:36 -07:00
chenyu	cb6718347f	`python -m mkdocs build --strict` in CI (#5800 )	2024-07-29 16:46:30 -04:00
chenyu	be3899d211	hotfix increase ci timeout to 20 minutes (#5799 ) when cache is clear it takes time to populate cache	2024-07-29 16:25:27 -04:00
chenyu	471b188d79	fix mypy errors in latest mypy (#5794 ) * fix mypy errors in latest mypy mypy has stricter partial and api arg checks now * PYTHONPATH="."	2024-07-29 14:53:30 -04:00
George Hotz	0392123e6e	TC=2 still sets tensor cores (and TC=3 support for locals) (#5780 ) * TC=2 still sets tensor cores * add TC=3 support for using locals * bugfix * lines + TC=3 tests * CUDA can use threads, fix fuzz linearizer	2024-07-28 16:16:53 -07:00
qazal	3e49d86c01	process replay diffs 3 things now (#5731 ) * github api infra * process replay is 3 parts now * parse benchmarks * add gh_token * complete diff * move process replay tests * last successful run * add tempdir * skip master	2024-07-27 12:52:20 +03:00
qazal	57b4a8e98d	assert process replay asserts (#5737 ) * assert process replay asserts * one ci job is fine * test: Revert "separate process replay main loop (#5734)" This reverts commit `94d578396f`. * mac sed needs that * Revert "test: Revert "separate process replay main loop (#5734)"" This reverts commit `e4ad7684d5`. * disable process replay capture * save time * amd is tiny * send to /dev/null	2024-07-27 12:07:50 +03:00
George Hotz	db1d093b29	reenable LLaMA-3 8B BEAM on NV (#5746 )	2024-07-26 16:56:41 -07:00
chenyu	eff7c5fd2c	halve kernel counts in metal Fuzz Test linearizer (#5716 ) the test time has increased to 3 minutes	2024-07-25 14:35:11 -04:00
chenyu	7c8fe0fe47	skip interpolate tests for PYTHON=1 (#5664 )	2024-07-23 18:47:15 -04:00
George Hotz	e3f00ac77d	Fix cuda tc emu test (#5663 ) * fix acc folding for NV tensor cores * fix correctness of reduce_before_expand * fix test emulated CUDA tensor cores * test_gemm_fp16 on some devices	2024-07-23 15:04:25 -07:00
qazal	fdfc0015a7	[run_process_replay] for opencl/openpilot (#5009 ) * lil reset script * find the prg * use lower_schedule_item * add process replay back * cleanups	2024-07-18 19:42:33 +03:00
wozeparrot	6ccb2390c3	feat: update_benchmark_staging (#5529 )	2024-07-17 20:40:57 -07:00
George Hotz	d3b098299d	add failing regression test for image (#5540 ) * add failing regression test for image * tg type * simpler test * don't realize image to image casts caused issue * simple pad	2024-07-17 17:27:18 -07:00
wozeparrot	218e157f00	benchmark on update_benchmark_staging (#5541 )	2024-07-17 17:11:52 -07:00
Alessandro Benetti	13e200b437	add strict mkdocs check (#5497 )	2024-07-15 14:21:37 -07:00

1092 Commits