tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-01-14 17:38:06 -05:00

Author	SHA1	Message	Date
George Hotz	846753f343	remove KernelInfo from gpudims (#11809 ) * remove KernelInfo from gpudims * that's good in there	2025-08-23 16:32:45 -07:00
Sieds Lykles	07d4ed7e4c	one more symbolic add variation (#11807 )	2025-08-24 01:15:04 +02:00
qazal	759ebea4eb	viz: reflect timeline API boundary in names (#11808 ) * define shapes once * depth isn't an event property * update server naming	2025-08-24 02:12:12 +03:00
George Hotz	132f09fab7	global/locals from AxisType in range (#11806 )	2025-08-23 15:49:17 -07:00
qazal	0d86288bd7	viz: calculate timeline fixed points in client side (#11805 ) * viz: calculate timeline fixed points in client side * 26 bytes / event * math	2025-08-24 01:44:40 +03:00
George Hotz	a75da49951	use AxisType for UPCAST/UNROLL (#11800 ) * use AxisType for UPCAST/UNROLL * fixes * fix the bug * fix hack * bad test * flaky test	2025-08-23 14:44:48 -07:00
qazal	2407fecdae	viz bytepack format (#11792 )	2025-08-23 23:50:21 +03:00
  * viz bytepack format

    Training a 1B llama yields ~20M profiler events. With JSON serialization, the browser tries to load 6GB to memory. This OOMs since each tab is limited to <3-4GB memory usage. Using a packed format, we only need ~600MB.

    Design decisions:
    - Timestamps are in microseconds relative to start time. They're stored in u32, which can express up to ~1 hr of trace events.
    - Strings (kernel names, metadata, etc) are deduped.
    - Buffer sizes are in u64 nbytes.

    More optimization possible:
    - The string lookup is a JSON dumped array, we can compress this.
    - Can store less for memory by moving the layout to client.

    Results

    |                                      | Events | JSON    | bytepack |
    |--------------------------------------|--------|---------|----------|
    | DP=8 llama 1B train (`command: [1]`) | 24M    | 5.8GB   | 640MB    |
    | examples/beautiful_mnist.py          | 16K    | 3.7MB   | 745KB    |
    | examples/gpt2.py                     | 55K    | 12.54MB | 1.40MB   |

    `[1]`: `VIZ=1 FAKEDATA=1 OFFLOAD_OPTIM=1 DP=8 BS=8 GRADIENT_ACC_STEPS=2 BLOCK_REORDER=0 LR=3e-4 TRAIN_ON_VAL=1 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=1B WARMUP_STEPS=36 DECAY_STEPS=360 SEQLEN=8192 PYTHONPATH=. AMD=1 AMD_LLVM=0 MODEL=llama3 python3 examples/mlperf/model_train.py`

  * python reference decoder
  * 27 bytes / event, 1hr hard limit
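The design decisions in the viz bytepack commit (2407fecdae) can be illustrated with a minimal sketch. The record layout below (`name_id`, `start_us`, `dur_us`, `nbytes`) and the helper names `pack_events` / `unpack_events` are hypothetical and chosen only for illustration; they are not tinygrad's actual format, which the commit reports at 27 bytes per event. The sketch shows the three ideas the message names: u32 microsecond timestamps relative to the trace start (2^32 microseconds is roughly 71.6 minutes, hence the "~1 hr" limit), a deduplicated string table stored as a JSON array, and u64 byte counts.

```python
import json
import struct

# Hypothetical fixed-size record, little-endian, no padding:
# name_id u32, start_us u32 (relative to trace start), dur_us u32, nbytes u64
EVENT = struct.Struct("<IIIQ")
U32_MAX = 2**32 - 1  # about 71.6 minutes when counted in microseconds

def pack_events(events, t0_us):
    """Pack (name, start_us, end_us, nbytes) tuples with absolute timestamps."""
    strings, index, body = [], {}, bytearray()
    for name, start_us, end_us, nbytes in events:
        # Deduplicate strings: each distinct name is stored once in the table.
        if name not in index:
            index[name] = len(strings)
            strings.append(name)
        rel = start_us - t0_us
        if not 0 <= rel <= U32_MAX:
            raise ValueError("timestamp does not fit in u32 microseconds")
        body += EVENT.pack(index[name], rel, end_us - start_us, nbytes)
    # String table is a JSON-dumped array, prefixed with its byte length.
    header = json.dumps(strings).encode()
    return struct.pack("<I", len(header)) + header + bytes(body)

def unpack_events(blob):
    """Reference decoder for the hypothetical layout above."""
    (hlen,) = struct.unpack_from("<I", blob, 0)
    strings = json.loads(blob[4:4 + hlen])
    return [(strings[i], st, dur, nb)
            for i, st, dur, nb in EVENT.iter_unpack(blob[4 + hlen:])]
```

In this sketch each event costs a fixed 20 bytes plus a one-time cost per distinct string, which is the same mechanism that lets the packed trace in the commit shrink relative to JSON.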
qazal	b12d1d866c	count bytes per kernel in test_viz (#11801 ) Currently at ~100 bytes/kernel with JSON.	2025-08-23 23:35:27 +03:00
Sieds Lykles	6a50ab6b87	adjust idiv min_max (#11802 ) * change div min_max * add tests	2025-08-23 22:25:51 +02:00
chenyu	9d4cccd0f9	test_dtype_alu cleanups (#11799 )	2025-08-23 15:11:17 -04:00
George Hotz	aefabaf774	add AxisType to range (#11798 ) * add AxisType to range * missed them * fix that test * fix that test	2025-08-23 11:15:00 -07:00
qazal	b975830424	add profile loader helper in test_viz (#11797 )	2025-08-23 19:20:29 +03:00
chenyu	7123df3928	Use Tensor.logaddexp to implement Tensor.softplus (#11796 ) instead of piecewise linear, numerical is handled by logaddexp. jax does this and i think it's more elegant than torch's approach	2025-08-23 11:52:29 -04:00
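The identity behind the softplus commit (7123df3928) is softplus(x) = log(1 + exp(x)) = logaddexp(x, 0). A minimal scalar sketch in plain Python follows; the functions `logaddexp` and `softplus` here are illustrative stand-ins written for this note, not tinygrad's `Tensor` methods, and the stable formulation used is the standard max-plus-log1p one, which may differ from tinygrad's implementation.

```python
import math

def logaddexp(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    if math.isinf(m):
        # Covers +inf in either argument and the case where both are -inf.
        return m
    # exp(-|a - b|) is in (0, 1], so it cannot overflow.
    return m + math.log1p(math.exp(-abs(a - b)))

def softplus(x: float) -> float:
    """softplus(x) = log(1 + exp(x)), expressed as logaddexp(x, 0)."""
    return logaddexp(x, 0.0)
```

A naive `math.log(1 + math.exp(1000.0))` raises `OverflowError`, while the form above never exponentiates a positive number, so no piecewise threshold is needed.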
qazal	aaea6b97ad	viz memory: compute nbytes (#11795 ) * viz memory: compute nbytes * local map	2025-08-23 17:34:07 +03:00
qazal	58653b5eae	viz: store memory scale (#11794 )	2025-08-23 16:19:44 +03:00
chenyu	fb8ee02424	Tensor.logaddexp (#11793 )	2025-08-23 09:15:00 -04:00
Sieds Lykles	5a6817d5f8	Fix z3 rendering of floats in indexing (#11740 ) * Fix floating point comparison in indexing * wrap in noop * update tests * improve rules for loading and comparing floats * add test cast to bool	2025-08-23 05:56:19 +02:00
chenyu	4267c45db3	non-supported dtype in transcendental (#11754 ) * non-supported dtype in transcendental `CPU=1 python3 test/test_dtype_alu.py TestDTypeALU.test_bfloat16_unary` works * test * works on real mac	2025-08-22 23:13:45 -04:00
chenyu	e39b25cd36	upcast float exp to at least float32 (#11758 ) * upcast float exp to at least float32 * unlucky seed	2025-08-22 20:16:34 -04:00
nimlgen	b057a90d49	memory: rename is_huge_page -> is_page (#11786 )	2025-08-22 20:08:58 +03:00
qazal	38f0fa7bde	viz: only send trace duration (#11789 ) * viz: only send trace duration * can unwrap	2025-08-22 20:00:48 +03:00
qazal	1c81ec9248	viz: rename to start/end timestamp (#11788 )	2025-08-22 19:47:49 +03:00
qazal	9ff03680ba	viz: store relative timestamps (#11787 ) * viz: store relative timestamps * err * update test	2025-08-22 19:30:21 +03:00
nimlgen	698392334f	system: message for eaccess as well (#11785 )	2025-08-22 18:21:32 +03:00
geohotstan	1e679bd789	fix max_unpool2d inf (#11784 ) * start * add regression test for maxunpool2d	2025-08-22 08:31:24 -04:00
George Hotz	9832599c9e	test_vmap + permute isn't a sint (#11783 ) * test_vmap + permute isn't a sint * order	2025-08-21 22:39:35 -07:00
George Hotz	bb8de51e5f	remove unused early cleanups + contig w range [pr] (#11780 ) * remove unused early cleanups [pr] * contiguous with range * woah, this works	2025-08-21 20:04:45 -07:00
chenyu	91a4de4ca7	fix getitem with inf in tensor (#11781 )	2025-08-21 21:55:32 -04:00
George Hotz	66e9d54eed	RANGEIFY=2 is partial contig (#11777 )	2025-08-21 16:53:58 -07:00
Jordan Chalupka	8de6db15ac	exclude .git from ruff (#11773 )	2025-08-21 15:37:50 -07:00
George Hotz	5954a0975f	fix some assigns on rangeify (#11774 ) * fix some assigns * llvm test * more tests * upd test	2025-08-21 15:15:54 -07:00
qazal	2e0eb88549	viz: add metadata to UOp tracing (#11772 ) * viz: add metadata to UOp tracing * place after tag * optional field * err, refcount of root must be 0	2025-08-22 00:18:45 +03:00
George Hotz	d6f9606e93	small cleanups to rangeify (#11769 )	2025-08-21 11:15:09 -07:00
uuuvn	bd4a9473b0	Multihost exception handling (#11729 ) Co-authored-by: wozeparrot <wozeparrot@gmail.com>	2025-08-21 13:51:49 -04:00
George Hotz	a2c7b807e0	don't bufferize 0s (#11766 )	2025-08-21 10:10:56 -07:00
nimlgen	9eff7cd1d8	am: support 64bit discovery (#11768 )	2025-08-21 18:28:13 +03:00
b1tg	56cd47a159	fix amd llvm bf16 tc (#11713 ) * fix amd llvm bf16 tc * is_cdna --------- Co-authored-by: b1tg <b1tg@users.noreply.github.com> Co-authored-by: chenyu <chenyu@fastmail.com>	2025-08-21 09:33:28 -04:00
George Hotz	a044648111	rangeify load cleanups + multi support (#11765 ) * use the old buf_uop + cleanups * simpler handling of load * everything needed for multi too	2025-08-20 20:55:49 -07:00
George Hotz	9f94c25a25	fix symbolic usage. use shrink, not reshape (#11762 ) * fix test_var * revert those things * fix the ones in test tiny * use better syntax * it's the same, but that's clearer * fix pad	2025-08-20 18:35:42 -07:00
chenyu	5276fbc9c5	fix gather with inf values (#11760 ) (mask * x) is wrong because 0*inf is nan. i feel we have a lot of those still...	2025-08-20 20:35:40 -04:00
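The failure mode named in the gather commit (5276fbc9c5) is easy to reproduce with scalars. The sketch below uses two hypothetical helpers, `select_by_multiply` and `select_by_where`, written only for this note; they are not tinygrad code. They contrast masking by multiplication with an explicit select.

```python
import math

def select_by_multiply(mask, values):
    # Masking by multiplication: an unselected inf still takes part in
    # arithmetic, and 0 * inf is nan under IEEE 754.
    return sum(m * x for m, x in zip(mask, values))

def select_by_where(mask, values):
    # Explicit select: unselected values never enter the arithmetic.
    return sum(x if m else 0.0 for m, x in zip(mask, values))

values = [1.0, math.inf, 3.0]
mask = [1, 0, 0]  # pick only the first element
```

With these inputs the multiply version yields nan even though the inf was not selected, while the select version yields the chosen element.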
wozeparrot	b979162c5d	llama3 eval train (#11706 )	2025-08-20 19:56:35 -04:00
chenyu	dbd3b67657	clamp GRAD_CLIP_NORM in llama (#11761 )	2025-08-20 19:55:50 -04:00
George Hotz	9635592141	** rangeify, try 3 (#11683 ) * ** rangeify, try 3 * bring that over * bufferize, don't use contig tag * work * ish * fix rangeify * flash attention is back * fix rangeify tests * stuff passes * fix test_log_softmax * more stuff passes * progress children * new endrange solution * progress * progress counter * basic assign * contigs only * symbolic in schedule * unbind_kernel * late children * ops fixed * beautiful mnist is close * that seems to work * mnist works * improve names * fix bmnist * no pcontig * testing backward * work * clone movement ops * new_range helper * MBLOCK/MERGE * ops tests pass * revert mblock stuff * cleanups...but it breaks ops * remove reindex * hack for relu * disable the hacks * more hacks * upd * mostly works with cleanups disabled * ndr * ops tests pass * terrible hacks for indexing to work * context mismatch * pcontig * split pcontig v contig * z3 trunc * null * no fuse in rangeify * ops test passes * lnorm * fix assign * nd rangeify * both should work * tests for rangeify * cleanups * stores pass the pointer through * disable pcontig for now * PARTIAL_CONTIG is a flag	2025-08-20 14:22:44 -07:00
chenyu	d7553721d1	clean up test_dtype_alu (#11757 ) remove the check that looks into schedule, only test if output matches	2025-08-20 14:36:18 -04:00
chenyu	5f08a3e928	hotfix: cast half to float in Tensor.tolist (#11755 ) workaround for python < 3.12	2025-08-20 12:18:35 -04:00
qazal	de4cb722a4	viz: add metadata and var_vals tracing (#11753 ) * viz: add metadata and var_vals tracing * add test_trace_metadata * set TRACEMETA=1	2025-08-20 18:39:51 +03:00
nimlgen	6589c9e643	hcq: better errors for ifaces (#11751 ) * hcq: better errors for ifaces * fix linter * typo * space	2025-08-20 17:50:51 +03:00
chenyu	be7b0b6970	TRANSCENDENTAL_SUPPORTED_DTYPES->TRANSCENDENTAL_DTYPES (#11752 )	2025-08-20 10:29:36 -04:00
ttomsa	220a2a88d7	a(1/b) -> a/b on LLVM, CPU (#11743 ) add fdiv rewrite * :) * use float_lop * use reciprocal() * revert * move to decompositions	2025-08-20 09:35:10 -04:00
George Hotz	12ab3f8b06	correct row_count in process replay (#11748 )	2025-08-19 22:21:07 -07:00

11106 Commits