tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-01-14 09:28:04 -05:00

Author	SHA1	Message	Date
chenyu	e745e16441	remove UnaryOps.NEG (#6238 ) * Remove UnaryOps.NEG generated new dataset with ``` time JIT=2 PYTHONPATH=. ./extra/optimization/generate_dataset.sh gzip /tmp/sops mv /tmp/sops.gz extra/datasets/ ``` * fix that	2024-08-22 14:21:39 -04:00
Francis Lam	7376b67e36	extra/gemm/triton_nv_matmul: fix Program arguments (#6212 ) remove op_estimate	2024-08-20 14:05:38 -07:00
Francis Lata	8fd8b970b0	update URL to eval cases from recent MLPerf file movements (#6201 )	2024-08-20 08:43:13 -04:00
chenyu	9db2d0d5c6	fix some type error in onnx [run_process_replay] (#6153 )	2024-08-17 19:54:20 -04:00
chenyu	7c9c8ce22f	use TensorProto enum in onnx dtype mapping [run_process_replay] (#6151 )	2024-08-17 17:58:40 -04:00
George Hotz	9bc81c6db4	UOps.SHAPETRACKER (#6129 ) * UOps.SHAPETRACKER [run_process_replay] * no process replay	2024-08-16 23:26:34 -07:00
George Hotz	89c7989659	no shapetracker in ops [run_process_replay] (#6117 )	2024-08-16 17:23:27 -07:00
George Hotz	74ee9febec	remove iter from uopgraph (#6110 ) * remove iter from uopgraph * linearize returns uops * fix tests * linearize in linearize * tests fix * touchup * test failures	2024-08-16 15:58:29 -07:00
qazal	28c75bf2a6	merge uops with ops (#6111 ) Co-authored-by: chenyu <chenyu@fastmail.com>	2024-08-16 18:17:57 -04:00
qazal	c23d44c779	AST is UOp (#6030 ) * most of the work from the uops2 branch * schedule * realize * kernel * lowerer * search * green * merge uops with ops * Revert "merge uops with ops" This reverts commit `1408a59f12`. * fix benchmark * remove extra dedup	2024-08-16 22:09:00 +03:00
CaltropHungerton	38fb1e14a2	Intel XMX Tensor Core Support (#5622 ) * fixed xmx demo * i think i'm invoking the DPAS but it's slow * compiler build arg to stop register spilling, indicated where to fix flop counter * don't mind this * do NOT mind me * do not mind me * do not view * i will add bf16 later * in process of figuring out tc fields * we figured out the fields!!! * added check for cl device vendor, added seperate IntelRenderer * remove tc thread_local_aliases * cleaning debris before draft pr * edits for linter * deduping and checking device extensions * i will find more line reductions in other places * before merge upstream * double grf size in compiler to fix register spilling (bandaid), device checking changes * tc python emulation * fixed emulation * tests for emulated intel tensor core * TC=0, 1 working on upstream, fixed perf * test * debris * check for specialized cl device when we canonicalize device * bf16 support, tc=3 test added * address tests * revert half2 loads on intel tc, cleanup * linter * fold_expanded revert * lint, whitespace fix * cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too * make line shorter, no need for noqa E501 * removed device intel * fix python emulation --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-08-16 09:19:21 -07:00
nimlgen	7ab531aede	autogen cleanup (#6064 ) * start autogen cleanup * nvgpu * better? * better * amd part * gpu regen * fix mockgpu amd * nv * amd fix linter * remove import * ugh * nv on master * amd on master	2024-08-14 20:20:35 +03:00
wozeparrot	059cf2a90d	feat: autogen from kernel register offset headers (#6056 )	2024-08-12 14:08:35 -07:00
wozeparrot	dc2617bffd	feat: use more correct reg for local dims (#6048 )	2024-08-12 11:15:37 -07:00
chenyu	e6c7c3e499	update pylint path to check indent/space for all (#6022 ) also fixed many errors. it was not checking nested dirs. exclude autogen for now. can we use ruff for this?	2024-08-10 14:41:09 -04:00
wozeparrot	d269bc95fa	faster tinychat (#5993 )	2024-08-08 19:16:26 -07:00
George Hotz	bc55c8a30e	pmatmul example + GB/s bugfix [run_process_replay] (#5974 ) * pmatmul example + bugfix * improve pmatmul * Update real_pmatmul.py	2024-08-07 22:32:11 -07:00
George Hotz	bf8ec23b00	hotfix: contiguous on precompute_freqs_cis	2024-08-07 14:40:56 -07:00
wozeparrot	5808e8a30f	mockgpu remu changes (#5925 )	2024-08-05 19:26:58 -07:00
wozeparrot	6740a0a6a0	hip_ioctl changes (#5917 )	2024-08-05 11:58:38 -07:00
chenyu	996ff0c135	pow(2) -> square in RMSNorm [run_process_replay] (#5901 ) reads nicer in metadata	2024-08-04 14:21:31 -04:00
Elias Wahl	4a114756f6	New BERT dataloader (#5881 ) * One file == One topic * update test * new dataloader * update train script * get index is faster	2024-08-02 15:12:23 -04:00
nimlgen	34168a64e3	optimize nv profiler (#5856 ) * nv profiler fix * cleanup hcq a bit * fixes * fix * typo * all signals put timestamp * a bit cleaner * merge fields * type * import * tiny fix	2024-08-01 23:57:45 +03:00
Vyacheslav Pachkov	610e454132	fix opencl_ioctl on comma (#5814 ) - remove unused code - add CP_REG_TO_MEM opcode - fixed parse_cmd_buf for more than 1 command object by correcting an offset - fixed memory mappings for cases when memory was allocated with KGSL_MEMFLAGS_USE_CPU_MAP. KGSL_MEMFLAGS_USE_CPU_MAP: If set on call and return, the returned GPU address will be 0. Calling mmap() will set the GPU address. So there are no IOCTL_KGSL_GPUOBJ_INFO ioctls for that type of memory and it resulted to crash right after get_mem.	2024-07-30 20:44:06 -07:00
David Hou	9a485f36e4	shard kvcache (#5830 )	2024-07-30 20:29:54 -07:00
George Hotz	4e89d45513	hotfix: put contiguous back in llama	2024-07-30 18:43:48 -07:00
George Hotz	21c5e8e1b7	extreme llama speed, 57.34 tok/s (#5827 ) * extreme llama speed * mergable	2024-07-30 18:32:09 -07:00
George Hotz	e6879035a0	work to make GEMV fast (#5824 ) * work to make GEMV fast * half8 cast * align struct * fix amd * float8 is a later problem	2024-07-30 17:41:40 -07:00
Francis Lata	ce61be16f1	clean up how preprocessed folder is defined (#5813 )	2024-07-30 12:35:26 -04:00
chenyu	471b188d79	fix mypy errors in latest mypy (#5794 ) * fix mypy errors in latest mypy mypy has stricter partial and api arg checks now * PYTHONPATH="."	2024-07-29 14:53:30 -04:00
nimlgen	ea27ec4cd0	nv switch classlist_v2 to classlist (#5763 ) * nv switch classlist_v2 to classlist * support in mockgpu * fix mockgpu	2024-07-28 20:24:42 +03:00
chenyu	3686b6726a	move GraphException to jit.py (#5744 ) same place where GraphRunner is defined	2024-07-26 19:01:12 -04:00
George Hotz	489a5b99a5	hotfix: triton_nv_matmul touchups	2024-07-24 23:24:29 +00:00
George Hotz	bf24be4c8c	triton gets 163 TFLOPS on 4090	2024-07-24 18:32:29 +00:00
George Hotz	4d47968580	fix acc folding for NV tensor cores (#5658 ) * fix acc folding for NV tensor cores * fix correctness of reduce_before_expand	2024-07-23 13:03:02 -07:00
nimlgen	08a9c0ae5e	hcq cache invalidation for beam (#5630 ) * nv full cache invalidation * the same command on amd * linter * fix amd * nv no hardcoded consts * beam default	2024-07-22 18:13:17 +03:00
George Hotz	6c6d74d922	parallel mcts (#5626 ) * start work on parallel mcts * compile was linearizing twice * typing + more early stopping * fix compiler error	2024-07-21 14:53:23 -07:00
George Hotz	ef179087a4	mcts exit condition wasn't right, also use it with BEAM>=100 (#5619 ) * mcts exit condition wasn't right, also use it with BEAM>=100 * mcts touchups * clean up sample	2024-07-21 10:16:47 -07:00
George Hotz	0f67ef4674	mcts graph and dedup support (#5618 ) * mcts graph and dedup support * usable graph * mcts colors * C=4 seems better * C=3 even better * sample_tree * backprop is external function * late expand to match algo	2024-07-20 23:29:14 -07:00
chenyu	eddc5bcfd7	MCTS tweaks (#5616 ) MCTS 500 is competitive with BEAM=8 on resnet on M1 Max. - increment trial times even with compiled error and runtime error. - use best time of children as the node value.	2024-07-20 19:45:59 -07:00
George Hotz	1113e47f96	print best in MCTS + light up the winner in hcopt	2024-07-20 09:39:36 -07:00
George Hotz	ac99ecd94e	use statistics.median for timing (#5606 )	2024-07-20 08:37:32 -07:00
George Hotz	06e336bccb	mcts search (#5598 ) * mcts search * mcts cleanups * mcts cleanup * random shuffle children order * mcts in handcode_opt * src and remove_node * debug 3 to print ast * print the type * mcts in extra	2024-07-19 21:38:39 -07:00
Tobias Fischer	72da3fe7e6	added clip vision model (#5595 ) Co-authored-by: chenyu <chenyu@fastmail.com>	2024-07-19 18:35:51 -04:00
George Hotz	fa7e734b49	MetaOps.KERNEL (#5543 )	2024-07-17 19:41:23 -07:00
Francis Lam	2d53abb04a	test/external/fuzz_linearizer: fix for new AST changes (#5519 ) * test/external/fuzz_linearizer: fix for new AST changes also add beautiful_mnist failures * add CLANG and LLVM to test_failure_35 failed_platforms * fix test_linearizer_failure names	2024-07-17 00:08:07 -04:00
Tobias Fischer	85d4ca7caa	FID Inception Model (#5516 ) * added model impl * minor cleanups * extracted weights loading into from_pretrained * reorganized model for better weight loading * removed lru cache for state dict loading	2024-07-16 23:12:03 -04:00
chenyu	28972418c4	s/get_linearizer/get_kernel [run_process_replay] (#5467 )	2024-07-13 20:32:22 -04:00
George Hotz	03c2dc8bd7	lowerer is kernel [run_process_replay] (#5437 )	2024-07-12 18:50:55 -07:00
chenyu	00813a92a0	update Tensor.eye api to match torch (#5433 ) * update Tensor.eye api to match torch input is n for nrows and optional m for ncols * space * fix onnx	2024-07-12 20:25:12 -04:00

... 11 12 13 14 15 ...

1363 Commits