tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-02-14 16:44:59 -05:00

Author	SHA1	Message	Date
chenyu	4ecd5789ab	#include <tgmath.h> in ops_clang (#3927 ) * different clang sqrt/log2/exp2/sin function based on dtype fixed softmax_argmax issue in #3552 for clang. * tgmath.h * revert those	2024-03-25 17:48:57 -04:00
chenyu	dc508022a9	clean up clang src header (#3925 ) don't need to define int64 and uchar	2024-03-25 15:18:35 -04:00
nimlgen	f2a9ea4ea9	lru allocator for copyin host buffers (#3918 ) * lru allocator for copyin host buffers * linter happy	2024-03-25 15:57:18 +03:00
George Hotz	e0e234bf94	hotfix, str compare version for cuda	2024-03-24 20:35:24 -07:00
Arseny Kapoulkine	715850aef9	Fix sm89 PTX=1 compilation (#3915 ) * Fix sm89 PTX=1 compilation The minimum PTX version that supports sm89 is 7.8 (same version also supports sm90); without this ptxas fails when running tinygrad with PTX=1 on RTX 4090. * Use int(arch[3:]) for forward compat with SM10.0 if that happens	2024-03-24 20:32:29 -07:00
chenyu	2e39f57594	move lines around in ops_python wmma (#3911 )	2024-03-24 17:14:26 -04:00
chenyu	8c8b57fd5f	cleanup ops python (#3908 ) i just want to merge lars!	2024-03-24 11:36:31 -04:00
sekstini	7c3632fd1e	add --minimal flag to nvrtc (#3899 )	2024-03-23 16:38:31 -07:00
nimlgen	4e18dd78d3	faster program start in llvm (#3897 )	2024-03-23 15:20:15 +03:00
George Hotz	46a3501cec	nv ioctl sniffer (#3892 ) * nv ioctl sniffer * unused import * Update __init__.py * that work * that fix it	2024-03-23 00:29:30 -07:00
nimlgen	16e31f7f0d	init multidevice cuda graph (#3858 ) * init multidevice cuda graph * cuda just works! * clean * linter happier * liners happy * update transfer inputs * do not change free * useless check for cuda --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-03-22 13:49:48 -07:00
chenyu	1c51d586ea	replace raise Exception with specific errors (#3874 )	2024-03-22 12:32:21 -04:00
nimlgen	8ef5490ec8	cuda tranfer + async copyin (#3873 )	2024-03-22 09:01:37 -07:00
Szymon Ożóg	624bc89910	PTX - implement float 4, ptr arithmetics and other speed improvements (#3775 ) * ptx float4 implementation * remove from cache when trimming uops * Gate for float4 * Linting fix * disable test reasonable time for ptx * import getenv * Update uops.py * linter * Add div test for half * upcast if op does not support operation * fix offset * Run only if dtype supported * zero out registers when accessing by pred + cleanup * Remove trailing whitespace * revert * spacing fix * move cache clearing outside loop * did this suddenly start working? * unused import removed * Remove cast * Use pattern matching * linting --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-03-22 08:54:02 -07:00
George Hotz	f4055439dc	don't include hip common (#3851 ) * don't install hip common * only that * Revert "only that" This reverts commit `85f22015d9`. * less * needed * sep comgr * header file * 6.0.2 * update hsa * hsakmt * Revert "hsakmt" This reverts commit `d3a118078e`.	2024-03-22 08:50:50 -07:00
nimlgen	b78352b423	do not create structs every call in CUDAProgram (#3855 ) * do not create structs in cuda * fix graph * linter * do not exec twice * fix graph	2024-03-21 17:51:40 +03:00
nimlgen	e5745c1a0d	fix nan on multigpus cuda (#3854 )	2024-03-21 15:21:55 +03:00
nimlgen	85691c8e20	fix hsa sync issue (#3847 ) * fix hsa sync issue * linter	2024-03-21 04:00:30 +03:00
nimlgen	2d54e4d747	clean up hsa driver (#3818 ) * clean up driver * remove returns	2024-03-20 00:17:41 +03:00
chenyu	5dd048a378	remove HIP in core tinygrad (#3810 ) * remove HIP in core tinygrad ci test uses device RHIP and HSA compiler (LinearizerOpt), so fine to remove HIP from tc. Also updated README and EMULATE tc test flag * EMULATE_CUDA	2024-03-18 18:19:27 -04:00
nimlgen	629757eaa1	hotfix: update inputs of correct transfers in hsagraph (#3800 ) * hotfix: update inputs of correct transfers in hsagraph * test it * run in ci?	2024-03-18 15:52:27 -04:00
qazal	d79a1d315b	add outbufs back (#3803 ) * update outcounts * update JIT * refactor search * hsa uses outcount	2024-03-18 10:30:53 -07:00
nimlgen	e78df485c7	update inputs for transfers in hsagraph (#3560 )	2024-03-18 18:01:04 +03:00
George Hotz	53adcb34f5	remove hip backend (#3783 ) * remove hip backend * remove unused * rhip * more RHIP	2024-03-17 10:12:16 -07:00
George Hotz	2a14d1b5e0	Revert "add outbufs info to CompiledASTRunner (#3781 )" (#3782 ) This reverts commit `722dd4276c`.	2024-03-17 09:47:23 -07:00
qazal	722dd4276c	add outbufs info to CompiledASTRunner (#3781 ) * add outbufs * Revert "add outbufs" This reverts commit `5f4c0668f5`. * simplify	2024-03-17 07:52:20 -07:00
nimlgen	91e181ee02	make alignment readable (#3766 )	2024-03-15 23:18:40 +03:00
nimlgen	ba79a3c09a	some hsa lines saving + fixes (#3752 ) * fix write to ring + some lines * hsa driver test	2024-03-15 18:12:18 +03:00
George Hotz	ca19eb3e82	where fold try 2 (#3748 ) * where fold try 2 * assign fold * test_where_fold works * add gated store support to ops_python --------- Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>	2024-03-15 07:46:26 -07:00
chenyu	75d4344cda	UOps.BITCAST (#3747 ) * UOps.BITCAST implicitly fixed no const folding for bitcast * python backend * ptx * consistent llvm	2024-03-14 21:00:35 -04:00
nimlgen	4b01c44579	hotfix: sdma/aql are visible again (#3733 )	2024-03-14 10:33:22 +03:00
nimlgen	0f050b1028	hsa profiler (#3711 ) * hsa profiler * simpler * profile * copy -> is_copy * print when saved * faster * do not create structs --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-03-13 21:19:22 -07:00
qazal	337cd53444	multioutput ScheduleItem (#3699 ) * refactor realize.py * update docs * update test_sched * update runners and devices * update openpilot and unit tests * cleanup runner lowering * update more tests	2024-03-13 08:59:38 -07:00
Francis Lam	b6e2495fdd	kernel: limit shared memory usage when adding opts (#3705 ) * kernel: limit shared memory usage when adding opts * search: remove unnecessary limit on search space apply_opt will do the more correct check	2024-03-12 17:06:21 -04:00
nimlgen	798970cfad	fix gpu hangs when exiting while aql queues are executing (#3700 )	2024-03-12 19:23:23 +03:00
nimlgen	dd1a1c12df	rocm path in autogen (#3697 )	2024-03-12 14:06:43 +03:00
Patrick Tsai	971d7f5d7c	O(n) arange attempt (#3530 ) * It works? * Clamp correctly * Refactor * Make code better * Undo some stuff * First step to trying to make floats work * Floats work in Python op but not metal because int div is different Python integerdivision was implemented as // which rounds towards negative infinity, but C integer division rounds towards 0 so there is an off-by-1 division error * arange does cumsum with ints and then multiplies by step This is so loop optimization can remain int only * Undo a lot of symbolic changes * Final check * Cleanup * There can be multiple phis * Fix multiple phi op removal * const sets dtype correctly * Fix bugs * Fix a couple bugs and add loop vars to resolve * missed one * Don't trim too many ops * Fix symbolic test * Use ones instead of full * Delete test * Lint passes * max node error * Small updates to loop logic * Remove unnecessary changes * We are getting somewhere * Simple case * Fix * rm, prn * Better * If NumNode doesn't work then continue * clamp is needed for arange(256) * Move everything into the optim fn * Replace correctly * Order optimizations better * Delete * mypy * Test for simplification * Rename * Fix test * update test description * Undo more * Cleanup * No replaced_ops map * Fix lint * AssertionError * back again * Reinstate assertion * Return true and make diff not as big * Bigger range for test * Change cumsum impl * fix bug * make big cumsum work * lint * Undo cumsum 2-stage removal * No while helper * optional min/max clamping * floats work * rm giant arange test * fix python cast None * Check phi parents * one phi allowed per where * Fix one phi per where * Rework iteration * Delete assertions * convert to int * Try mul -1 instead of neg for hip..? 
* Remove one phi per where requirements * one accum only * Lint * should simplify a loop at a time * Don't get rid of loop explcitly * Need to iterate backwards * lint * unary neg * Make optim work for onnx and sum_pad_collapse * Better message * filter alu ops correctly * Fix the limiter * lint and simplify * Add it back * off by one error * test wheres and phis * test max ops and non-if stuff * <= * cast_scalar * Oops * Change test * Pass loop uops instead of a modified map * Cut param transfer between linearizer and uops * Fix issues * Fix lint * fix efficientnet python 3.8 invalid syntax * distinct vars in seen_vars * accurate var names --------- Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com> Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-03-11 16:09:20 -07:00
Francis Lam	9f13960f72	search: catch RuntimeError when timing acted_lins (#3664 ) when compilation succeeds, but runtime fails due to thread limits on METAL, this allows a beam search to proceed, treating this the same way as a compile failure.	2024-03-11 16:14:03 -04:00
nimlgen	76ade20b89	hsa driver tiny cleanups (#3684 )	2024-03-11 22:32:43 +03:00
George Hotz	ac02e7347d	ptx timing vs cuda timing (#3659 )	2024-03-08 10:17:49 -08:00
uuuvn	daa4034e80	No more metal flakiness (#3643 )	2024-03-08 08:54:44 -08:00
George Hotz	6e50582e62	working to improve ptx (#3647 ) * working to improve ptx * fix compile fail	2024-03-07 12:39:31 -08:00
George Hotz	81baf3eed3	bring ptx back (#3623 ) * bring ptx back * ptx back * fix define var * fix a few bugs * bugfixes * fixes * fix llvm bug * fix test bug	2024-03-06 13:34:21 -08:00
nimlgen	3db826e195	hsa in lin opts (#3602 )	2024-03-04 06:17:32 -08:00
nimlgen	640dc0fc51	hsa flush hdp (#3591 ) * hsa flush hdp * use _alloc()	2024-03-03 04:55:07 -08:00
George Hotz	aa9b013d79	add constant folding for WHERE in uops (#3584 ) * add constant folding for WHERE in uops * prereqs for generic constant folding * fix test * disable slow overflow logic * make that test faster	2024-03-02 10:37:14 -08:00
nimlgen	3b7e3fa2e4	fix sync in hsa graph (#3582 )	2024-03-02 07:37:51 -08:00
Francis Lam	e17f1821a7	wmma: add CUDA tensor core and fix test_speed_v_torch failure (#3544 )	2024-03-01 17:51:02 -08:00
chenyu	b7e555f6c0	run test_linearizer_failures on PYTHON backend (#3565 ) * run test_linearizer_failures on PYTHON backend only test 1, some have hanging issues and gated store is not implemented * --durations=20 * two less slow ones	2024-03-01 17:00:18 -05:00
George Hotz	2c19ab6561	define var (#3548 ) * define var * remove vars from there * fix python symbolic ops * fix llvm * pypath	2024-02-29 16:43:27 -08:00
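
Note on commit 971d7f5d7c (#3530): its message records an off-by-one caused by Python's `//` rounding toward negative infinity while C integer division rounds toward zero. The following is a minimal sketch of that difference, not the code from the commit; the helper name `c_style_div` is hypothetical and exists only for illustration.

```python
def c_style_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero, as C does.

    Hypothetical helper for illustration; not taken from tinygrad.
    """
    q = abs(a) // abs(b)                       # magnitude of the quotient
    return q if (a >= 0) == (b >= 0) else -q   # negate when the signs differ


if __name__ == "__main__":
    # Python's // floors, so the two results differ whenever the operands
    # have opposite signs and the division is inexact.
    for a, b in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
        print(f"{a} / {b}: python floor = {a // b}, c-style trunc = {c_style_div(a, b)}")
```

When both operands are non-negative the two conventions agree, which is consistent with the mismatch only showing up in some cases.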

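Note on commit 715850aef9 (#3915): its message states that PTX 7.8 is the minimum version supporting sm89 and that `int(arch[3:])` is used for forward compatibility. A rough sketch of why the integer parse matters is shown here. It assumes the arch string looks like `sm_89`; the function names and the `fallback` version for older architectures are hypothetical and not confirmed by the log.

```python
def sm_number(arch: str) -> int:
    """Parse the numeric part of an arch string such as 'sm_89' (assumed format)."""
    return int(arch[3:])


def min_ptx_version(arch: str, fallback: str = "7.5") -> str:
    """Return "7.8" for sm89 and newer, per the commit message.

    The `fallback` value for older architectures is an assumption.
    """
    return "7.8" if sm_number(arch) >= 89 else fallback


if __name__ == "__main__":
    # Comparing the raw strings would misorder a three-digit arch:
    print("100" >= "89")                # False: lexicographic comparison
    print(sm_number("sm_100") >= 89)    # True: numeric comparison
    print(min_ptx_version("sm_89"), min_ptx_version("sm_90"), min_ptx_version("sm_86"))
```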

476 Commits