tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-01-21 04:47:56 -05:00

Author	SHA1	Message	Date
George Hotz	4e6665bda5	different way to write torch backend (#9197 ) * different way to write torch backend * both backends * more work * simpler code * more work * test both * imply unwrap/wrap * FORWARD_ONLY=1 TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_add works * ready to start making test_ops work in torch backend * backward pass, TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_add works * FORWARD_ONLY=1 TINY_BACKEND=1 python3 test/test_ops.py TestOps.test_simple_conv2d works * matmul backward is broken with as_strided	2025-02-22 14:42:26 +08:00
geohotstan	6587c7879b	simple fixes to onnx (#9195 ) * uncontroversial changes * cleaner _prepare_quantize	2025-02-21 13:10:06 -05:00
George Hotz	e87be0131e	torch backend start (#9191 ) * start torch backend * progress * ugh, you need cpp crap * 1+1 works * 1+1 works * becoming a real backend * ready to merge?	2025-02-21 16:57:28 +08:00
chenyu	2e7c2780a9	CLANG -> CPU (#9189 )	2025-02-20 18:03:09 -05:00
chenyu	1692087db5	_one_hot_along_dim input needs to be int (#9179 ) * _one_hot_along_dim input needs to be int indexing and onehot compare with arange, and non-int dtype is likely a bug	2025-02-20 09:00:43 -05:00
George Hotz	bf36967883	cuda hooking (#9180 ) * cuda hooking * progress * more hook cuda * fix params * compile + cuMemHostAlloc hook * work * revert that	2025-02-20 19:20:01 +08:00
chenyu	975c318dbc	bert use int32 for input ids (#9173 ) original data was int32 for these. float might have caused precision issues	2025-02-19 08:17:27 -05:00
George Hotz	7eea9b639d	hotfix: add replay_pkl debugging env	2025-02-17 17:34:58 +08:00
chenyu	1fda98d14f	fix import time_linearizer [pr] (#9118 ) only test that used it was skipped in CI due to being slow	2025-02-15 21:33:28 -05:00
Josh Moore	1f9d2442b9	Add `Tensor.scatter_reduce` (#8947 ) * pytorch scatter -> scatter_reduce * WIP scatter_reduce implementation * _pre_scatter return type hint * split out src, mask to satisfy linter * Add src cast back in * dict of lambdas instead of ifs * sum and prod reduction ops with include_self * add reduce arg error message * add amax and amin reduction ops * Fix include_self for higher dims * Simplify * Simplify amax and amin too * Pull include_self logic out into _inv_mask function * reduce arg cannot be None for scatter_reduce * Fix self-mask issue * Add mean reduce op * Add tests * any() not needed here * remove comment * End support for Tensor src with reduce arg in tinygrad scatter * Process index, dim inside actual functions * Add scatter_reduce to onnx * Add excluded onnx ScatterElements reduction tests back in * Save 2 lines on the mask helpers * Update docs * Add include_self=False tests * cleanup * Remove unneeded helper function --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2025-02-13 09:08:54 -05:00
George Hotz	74742c018f	hotfix: setup_mock_nv_osx	2025-02-13 12:26:15 +08:00
chenyu	f4f56d7c15	move time_linearizer to extra.optimization.helpers [pr] (#9048 ) no longer used in tinygrad	2025-02-12 15:49:58 -05:00
divinity76	bec4f59ce8	workaround f16 cast ambiguity (#8935 ) for unknown reasons, without this, when trying to execute "Llama 3.2 1B", I get the error below. Fwiw I do not know the performance impact for this change. I can't even get exo running, but this change allows me to /get further/ (before running into a separate issue with vram allocation? story for another day i suppose) error: ``` Failed to fetch completions: Error processing prompt (see logs with DEBUG>=2): Nvrtc Error 6, NVRTC_ERROR_COMPILATION <null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies: function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp) function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp) function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp) function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp) function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp) function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp) function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp) ((half4)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3))); ^ <null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies: function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp) function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp) function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp) function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp) function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp) function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp) function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp) ((half4)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3))); ^ <null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies: function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp) function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp) function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp) function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp) function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp) function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp) function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp) ((half4)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3))); ^ <null>(18): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies: function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp) function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp) function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp) function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp) function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp) function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp) function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp) ((half4)((data0+(alu0+(gidx1<<14)+(lidx0<<11)+alu1)))) = make_half4(((half)(val0)),((half)(val1)),((half)(val2)),((half)(val3))); ^ 4 errors detected in the compilation of "<null>". ```	2025-02-11 09:38:56 +08:00
nimlgen	dfc9d6827f	am_smi: print power state (#9013 )	2025-02-10 23:07:39 +03:00
nimlgen	f91409f038	am: fix proclogs (#9004 )	2025-02-10 16:38:58 +03:00
nimlgen	c6c2373bc0	replace libpciaccess autogen with just pci regs (#8983 ) * replace libpciaccess autogen with just pci regs * add pci.py	2025-02-09 18:40:45 +03:00
George Hotz	6ffee2fca9	reduce speed example [pr] (#8978 ) * reduce speed example * fast like a nascar	2025-02-09 14:13:59 +08:00
George Hotz	a3c78d47b3	speed docs + upgrades [pr] (#8964 ) * add some docs about speed [pr] * better torch gemm * enable locals on llvm/clang * disable locals for beam speed on LLVM/CLANG * 0x20 alignment in llvm allows ymm use	2025-02-08 17:28:52 +08:00
nimlgen	11d50324d8	am: tiny cleanups (#8958 ) * am: start cleanups * am	2025-02-07 23:44:43 +03:00
Ahmed Harmouche	133cacadde	Autogen webgpu dawn, removing wgpu-py dependency (f16 support part 1) (#8646 ) * Switch to dawn, all tests passing locally * Use dawn-python * Skip failing test * Skip midcast and fix timestamp on metal ci * Autogen webgpu * Try fetch dawn lib again * /usr/lib * Without lib prefix * Test autogen diff * Delete webgpu support, move everything to ops_webgpu * mypy fix * Simplify, refactor * Line savings * No ResultContainer * Type annotation for result * Some more simplifications * Why was this explicit sync used at all? * Refactor: delete functions that are only used once * Create shader module inline * Clear unit tests cache, maybe that solves it * That wasn't it * Try deleting cache to pass failing weight compare * weights_only=False for pytorch 2.6 * Simplify ctype array creation * Remove nanosecond precision timestamps * Simplify error handling * Refactor, add back type annotations * Deleted custom submit function, refactor * read_buffer simplify * Fix use after free, refactor * Simplify supported_features * Runtime docs --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2025-02-07 15:16:59 +08:00
nimlgen	ee1a0fb8ec	am_smi: print device name (#8939 )	2025-02-07 03:01:25 +03:00
chenyu	a092b6395d	Tuple -> tuple, List -> list [pr] (#8936 )	2025-02-06 14:21:19 -05:00
nimlgen	86feb98dcd	am: add support for 7600 (#8910 ) * am: start to add support for 7600 * test_tiny passes * mmhub 3 0 2 * cleaner	2025-02-06 14:04:07 +03:00
geohotstan	057c70b05f	add onnx_helpers to extra and add ort validate to benchmark_onnx (#8890 ) * start * log severity * only change this * change abstraction so it's more usable for huggingface --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2025-02-04 16:36:01 -05:00
George Hotz	56fa5c1191	dsp simulator (#8869 ) * dsp simulator * progress * fix * close on test tiny * working * less waste * line savings * Device DSP compiler * mock DSP at the bottom * DSP tests * docker caching * test update * need load * skip that test for CI DSP * last touch * ugh	2025-02-04 09:45:04 +08:00
geohotstan	d1aa9f30bc	copy onnx_ops into onnx (#8876 ) * just copy it over * make OnnxOps a global var * some small style stuff * rerun CI but also some small clean up * some comments	2025-02-03 12:15:07 -05:00
George Hotz	f484db0e63	dsp cleanups [pr] (#8866 )	2025-02-03 15:18:53 +08:00
qazal	ba17786068	do not construct unmasked VALID (#8759 ) * new lines that exist in codegen/ops * update tests * update sops.gz (13071 -> 13070 asts) * fix viz too * remove that TODO * diff pruning * mask assert + device * work * diff pruning * re: fix viz too --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2025-01-28 20:51:21 +02:00
George Hotz	80089536e5	Revert "move llvm_bf16_cast to renderer for CLANG and LLVM [pr] (#8720 )" (#8786 ) This reverts commit `af0452f116`.	2025-01-28 18:59:02 +09:00
mesozoic-egg	af0452f116	move llvm_bf16_cast to renderer for CLANG and LLVM [pr] (#8720 ) * handle bf16 via bitcasting for CLANG and LLVM * On LLVM, skip float16 cast * float32 on llvm lite, float32 elsewhere * code format * trigger pr * move to rewriter --------- Co-authored-by: Mesozoic Egg <mesozoic.egg@proton.mail> Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2025-01-28 18:16:43 +09:00
nimlgen	1c608ae34f	am_smi: better spacing (#8773 ) * am_smi: better spacing * not used	2025-01-27 23:01:02 +03:00
Ignacio Sica	b240f12593	[TIP-9] rename Opt's amt to arg 2 (#8770 ) * rename Opt amt to arg * ignore_beam_cache for test_tiny * move ignore_beam_cache to test_tiny * move to separate pr * revert space change --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2025-01-27 14:19:04 -05:00
George Hotz	3ed146a5ff	Revert "rename Opt amt to arg (#8767 )" (#8769 ) This reverts commit `bf041659a5`.	2025-01-27 23:46:37 +09:00
Ignacio Sica	bf041659a5	rename Opt amt to arg (#8767 )	2025-01-27 23:36:47 +09:00
nimlgen	2f06eccf1d	am: script and vfio msg (#8742 ) * am: script and vfio msg * use sysfs bars always for now * tiny chnages	2025-01-25 00:33:00 +03:00
nimlgen	dc10187fc0	am: add am_smi (#8739 ) * am: start monitor * cleanups * fixes * hmm * progress * cleanup	2025-01-24 20:16:19 +03:00
geohotstan	04846b91aa	reorder and categorize onnx_ops (#8731 ) * new order * remove a todo * constant node is definitely requires_grad false * one new line spacing * property and graph * oops linter	2025-01-23 13:18:54 -05:00
chenyu	49b914ee69	simpler bert acc [pr] (#8714 ) logit.log_softmax().argmax(-1) is equivalent to logit.argmax(-1)	2025-01-22 10:32:19 -05:00
geohotstan	dd82b4c913	make onnx runner a class (#8647 ) * this * clean up * more clean ups and improve debug msg * more correct training toggler * remove manual training toggling * change some variable names * actually just add the training toggle for LIMIT envvar too * more refinement * __call__ and OnnxRunner * fix half pylint, other half is importing from onnx while this file is onnx.py, figure out later * ahhhh found another mistake * remove limit from __call__ --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2025-01-20 10:11:05 -08:00
ignaciosica	b49a04145e	fix for int plus minor cleanup (#8650 )	2025-01-17 22:30:39 -05:00
geohotstan	4abe631b56	fix onnx mobilenetv2-7-quantized.onnx (#8574 ) * is 67% considered fixed? * move test up * share function * add qgemm too * make sure qgemm comes out as int * actually that note is not right * remove qgemm (I did it wrong) and add it later lol.	2025-01-13 09:25:06 -08:00
Francis Lata	c25d5d3101	improve isin checks (#8589 )	2025-01-13 12:12:31 -05:00
nimlgen	38b5ac4d4a	mypy for mockgpu/cuda & dsp/run (#8575 )	2025-01-12 18:25:39 +03:00
geohotstan	815c505e1d	fixes from adapting tvm tests (#8570 )	2025-01-11 11:38:36 -05:00
qazal	60503c8621	use CAPTURE_PROCESS_REPLAY=1 in CI [pr] (#8564 )	2025-01-11 06:03:48 -05:00
George Hotz	c7acd40574	more aggressive onnx const creation [pr] (#8561 )	2025-01-10 17:38:32 -08:00
George Hotz	70fa65cd95	viz fixups + scheduler option [pr] (#8557 )	2025-01-10 15:09:31 -08:00
George Hotz	9833fe83d8	more work on onnx imagenet [pr] (#8552 ) * more work on onnx imagenet [pr] * working quantization * static quant * benchmark onnx 0 dim	2025-01-09 20:28:18 -08:00
George Hotz	5720871903	onnx consts are const [pr] (#8548 )	2025-01-09 16:09:22 -08:00
geohotstan	299d333806	Add QLinearConv, QLinearMatMul, QLinearAdd, QLinearGlobalAveragePool to onnx (#8478 ) * QLinearEverything * ok ort verify passes * this should be int instead * cast to int then char to do wraparound * cleaner * move contrib ops to microsoft ops --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2025-01-09 15:08:53 -08:00

... 5 6 7 8 9 ...

1248 Commits