* add dtypes.index
* cast shape, stride and mask to dtypes.index in view.create
* move pm_lower_index_dtype to ops
* DEFINE_VAR is dtypes.index by default
* merge var_val_using_str
* remove int from commutative
* fix test_rewrite_map
* change that to dtypes.index
* change some int to index
* shorten those
* remove old cast in renderer
* cleanup
* change that back
* add comment
* delete comment
* just delete those
* view doesn't have to cast anymore
* adjust comment
* var_vals is str,int
* remove imports
* remove print
* fix test
* change var_vals in hcq
* update test_hcq
* fix multitensor _device_num var
* fix syminfer test
* shorten line
* p.vars stays list[Variable]
* shorten line
* vars is back to tuple[Variable, ...]
* change var_vals in extra
* change var_vals from shapetracker
* var_vals is str:int
* fix signature
* add rule and test
* more rules and tests
* add all four variations
* fix test
* test fixed!
* adjust comment
* add new variations
* disable intel tensor core ops count test for bigger_matmul_half
This makes rewrites visible in unit tests that call graph_rewrite directly, e.g. `VIZ=1 python3 test/test_uop_graph.py TestUOpGraph.test_in_bounds_access_gated_local`.
`@track_rewrites` remains available as an optional helper for organizing larger traces.
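As a rough illustration (not taken from the PR), a test that calls graph_rewrite directly looks roughly like the sketch below; the import path and exact pattern API vary across tinygrad versions, so treat the names as assumptions. Running it under `VIZ=1` is enough to see each rewrite step.

```python
# hedged sketch: a direct graph_rewrite call that VIZ=1 can display without
# @track_rewrites. Import paths/APIs may differ across tinygrad versions.
from tinygrad.ops import UOp, UPat, PatternMatcher, graph_rewrite

# toy rule: x * 1 -> x
pm = PatternMatcher([
  (UPat.var("x") * 1, lambda x: x),
])

a = UOp.variable("a", 0, 10)      # a symbolic variable
out = graph_rewrite(a * 1, pm)    # run with VIZ=1 to inspect this rewrite
print(out)
```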
* simple kernel to replace Kernel for postopt
* support old
* fix beam
* beaming
* beam on old
* bring tensor cores back
* raise
* postbeam
* test ops passes on mac
* skip that
* postopt default
* gate that
* fix tensor cores
* a few test fixes
* dsp fix
* tc fix
* loop
* support swap
* test_gemv
* fix beam for variable
* test opts from high level stuff
* range annoying
* compile slow
* metal slow
* better beam
* no POSTBEAM
* fix nolocals
* hc opt mostly works
* put that back
* lil
* some work
* fix that
* POSTOPT 2
* fix tests
* no postopt 2
* work
* back
* padded tensor cores
* shift_to
* postopt 0 passes?
* write PADTO
* fix padded tensor cores
* compare hcopt
* 18000 lines
* should pass tests
* fix rangeify
* put types back
* add overflows helper
* add rules
* x -> y
* check overflow of u too
* cleaner
* use alu instead of replace to preserve vectorization
* just one rule
* add test
* Modify tests and start work towards removing symbolic reshape
* Refactor symbolic reshape
* fix small error
* much cleaner + fix more tests
* Can remove this now
* Update test_symbolic_ops and test_tiny
* Couple more tests
* Unused import
* More tests and add EXPAND to Tensor.empty
* Fix test beam search
* all int
* Fix rangeify by adding shrink
* Remove OOB check and so fix test_symbolic_jit
* test_symbolic_jit doesn't need OOB Context anymore either
* Should remove that test now
* Cleanups part 1
* fix linters
* Final cleanups
* Don't reassign inside for loop
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* viz bytepack format
Training a 1B llama yields ~20M profiler events.
With JSON serialization, the browser tries to load 6GB into memory. This OOMs since each tab is limited to <3-4GB of memory usage. Using a packed format, we only need ~600MB.
**Design decisions:**
- Timestamps are in microseconds relative to the start time. They're stored as u32, which can express up to ~1 hr of trace events (2^32 µs ≈ 71 minutes).
- Strings (kernel names, metadata, etc.) are deduped.
- Buffer sizes are stored as u64 nbytes.
More optimizations are possible:
- The string lookup table is a JSON-dumped array; we could compress it.
- We could store less for memory by moving the layout to the client.
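To make the format concrete, here is a rough decoding sketch. The real bytepack layout (field order, widths, the 27-byte event record) lives in the viz code and the Python reference decoder mentioned below; the struct format here is an assumption, not the actual spec.

```python
# hedged sketch of decoding a packed event stream. The assumed per-event layout
# (u32 timestamp in µs since start, u32 index into the deduped string table,
# u64 buffer size in nbytes) is illustrative only; the real format differs.
import json, struct

def decode_events(buf: bytes, string_table_json: bytes):
  strings = json.loads(string_table_json)   # deduped strings, JSON-dumped array
  fmt = "<IIQ"                              # u32 ts_us, u32 name_idx, u64 nbytes
  step = struct.calcsize(fmt)
  for off in range(0, len(buf), step):
    ts_us, name_idx, nbytes = struct.unpack_from(fmt, buf, off)
    yield ts_us, strings[name_idx], nbytes
```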
**Results**
| Workload | Events | JSON | bytepack |
|----------------|---------|-------------|-------------|
| DP=8 llama 1B train (`command: [1]`) | 24M | 5.8GB | 640MB |
| examples/beautiful_mnist.py | 16K | 3.7MB | 745KB |
| examples/gpt2.py | 55K | 12.54MB | 1.40MB |
`[1]`: `VIZ=1 FAKEDATA=1 OFFLOAD_OPTIM=1 DP=8 BS=8 GRADIENT_ACC_STEPS=2 BLOCK_REORDER=0 LR=3e-4 TRAIN_ON_VAL=1 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=1B WARMUP_STEPS=36 DECAY_STEPS=360 SEQLEN=8192 PYTHONPATH=. AMD=1 AMD_LLVM=0 MODEL=llama3 python3 examples/mlperf/model_train.py`
* python reference decoder
* 27 bytes / event, 1hr hard limit