tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-01-22 13:28:06 -05:00

Author	SHA1	Message	Date
chenyu	a1940ced77	remove the assign hack in whisper (#4240 ) no longer needed, the commented test case was removed too	2024-04-20 23:56:44 -04:00
chenyu	3f126c7664	fix examples vits / converstion.py (#4239 ) it was passing a const numpy array into Tensor.arange	2024-04-20 23:29:12 -04:00
George Hotz	cd88afc98b	datasets isn't a feature + filter docstrings (#4228 ) * datasets isn't a feature * filter docstrings in sz	2024-04-19 16:16:10 +04:00
George Hotz	d99b512084	llm.c timing (#4219 ) * add timing info * fix malloc * 8s with beam	2024-04-19 12:43:21 +04:00
George Hotz	39b60a25f0	more llm c work (#4207 ) * more llm c work * print nicely * fake load pretrained * select warmups * output c code	2024-04-18 22:20:44 +04:00
chenyu	f7416916df	update resnet hparams based on BS=1632 RCP (#4210 ) https://github.com/mlcommons/logging/blob/master/mlperf_logging/rcp_checker/training_4.0.0/rcps_resnet.json	2024-04-18 12:01:46 -04:00
George Hotz	fa57c3e7ce	continue llm.c (#4190 ) * continue llm.c * export more * progress on llm.c * simpler optim, names work	2024-04-18 10:57:54 +04:00
Francis Lata	3644077a42	[MLPerf][UNet3D] Add DICE loss + metrics (#4204 ) * add DICE loss and metrics * update dice to include reference implementation's link * remove unused imports * remove unnecessary test file and update pred + label for metrics and losses test * add tests to CI + add exclusion of mlperf_unet3d --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-04-17 20:09:33 -04:00
chenyu	cd801a15f3	scipy.signal.gaussian -> scipy.signal.windows.gaussian (#4205 ) fixed unet3d model_eval, will add to CI after merging new dice loss	2024-04-17 19:15:37 -04:00
David Hou	1dbf3b2b19	Benchmarks for individual resnet layers (#4182 ) * resnet individual layer benchmarks! * small * 1 and 2 * mem_used * no ci * better conv print * defaults * prints * adjust * adjust * adjust * benchmark only one layer example * tensor.training, zero_grad, sum instead of mean, last mem, last kernel count * default jitcnt=1 * scale flops/kernels with jitcnt * add note about jitcnt memory * touchup	2024-04-16 13:53:18 -04:00
George Hotz	55ae73e951	Replicate llm.c in tinygrad (#4179 ) * write llm.c and add a few new methods to tensor * training works * add jit * tests for new functions * test tolist * simple fix for onnx test failures (#4186) * write llm.c and add a few new methods to tensor * training works * add jit * tests for new functions * bump line count to 7500 * simplest fix * safenumpy tolist for now --------- Co-authored-by: George Hotz <geohot@gmail.com> Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com> --------- Co-authored-by: geohotstan <135171913+geohotstan@users.noreply.github.com>	2024-04-16 15:40:48 +04:00
chenyu	aa093efa43	fix handcode_resnet50_opt flops count (#4184 )	2024-04-15 22:13:45 -04:00
chenyu	d5b67c1ca3	log resnet TRAIN_BEAM / EVAL_BEAM (#4181 ) also run eval in benchmark mode if either one is positive	2024-04-15 19:29:08 -04:00
chenyu	6a2168e698	TRAIN_BEAM and EVAL_BEAM for resnet (#4177 ) working on measuring compile time	2024-04-15 14:57:21 -04:00
David Hou	593c90d7d6	Resnet fp16 training with fp32 master weight copy (#4144 ) * add casts to layers * FLOAT flag * detach * no_grad for eval * whitespace * explicit fp32 initialization * oops * whitespace * put back config['DEFAULT_FLOAT'] * bad * live dangerously (don't hide bugs) * don't bundle changes --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-04-14 11:25:08 -04:00
chenyu	e20d6f9221	correct resnet estimate time (#4169 ) 7.99 hours was rendered as 7h0m.	2024-04-14 02:21:46 -04:00
George Hotz	ebc94c9d6c	rewrite the jit in the context of new schedule (#4162 ) * rewrite the jit in the context of new schedule * mypy better * fix placeholder * tests * all functionality should work * fix tests * no CacheCollector	2024-04-12 21:54:36 -07:00
George Hotz	216eb235e5	hotfix: cast mnist to float	2024-04-09 19:30:03 -07:00
George Hotz	fea774f669	spend 5 lines to bring mnist into the repo (#4122 )	2024-04-09 19:24:57 -07:00
chenyu	92c0675ccf	setitem initial support (#4093 ) * wip setitem it's an eager assign to output shapetracker view * cleanups and tests * more cleanups	2024-04-07 20:35:22 -04:00
George Hotz	97c402d69e	use imagenet spawn (#4096 )	2024-04-06 08:34:10 -07:00
George Hotz	fffd9b05f5	mock mnist data for imagenet trainer (#4095 ) * mock mnist data for imagenet * move print and test * needed to reshape	2024-04-06 08:08:40 -07:00
George Hotz	93824e59eb	support MOCKDATA=1 for resnet (#4090 ) * mockdata for resnet * fix eval, revert hsa	2024-04-05 17:19:18 -07:00
George Hotz	bec2aaf404	add beautiful_mnist_multigpu example	2024-04-02 00:54:04 +00:00
chenyu	aa76d566c2	cleanup mamba (#4004 ) make it read nicer and cleanup some movement methods and math simplification. 790m, 1.4b, 2.8b model does not really run. sampling is not implemented. jit is incorrect. some deadcode / wrong code path and copied from torch stuff stuff.	2024-03-30 02:50:13 -04:00
chenyu	c71627fee6	move GlobalCounter to helpers (#4002 ) break circular import between ops and buffer	2024-03-30 00:30:30 -04:00
chenyu	ecf38f498e	beam search resnet eval too in BENCHMARK (#4000 )	2024-03-29 21:07:23 -04:00
reddyn12	9b5e15db6e	Mamba Implementation (#3456 ) * first commit * state back to orig * mamba comparisions * rm file * rename file * use Tensor.einsum and mke default model 370M * Cleaned code and made a comparision test * Simplyfy pull request. Only has 1 mamba implementation now. * Update prompt * rm whitespaces * last space * remove Einops dependency * rm unused code * add tests * rm print statement * rm imports * skip CLANG * Update skipIf description * skip model test in CI and add CLANG fix * rm Device import * don't be stupid * Fix conv assign When the prompt is too short, the logic for conv_state assign messes up. This can be fixed when padding the tokenized array to min length of 4. I padded using the empty string token, but idk if proper practice is to use the PAD token * fix p1 * temp * fix jit import --------- Co-authored-by: schlimeszn <schlimeszn@gmail.com> Co-authored-by: reddyn <nikidsniper@gmail.com> Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-03-28 17:49:12 -07:00
chenyu	b47f6cebb2	LinearizerOptions -> CompilerOptions (#3978 )	2024-03-28 17:50:23 -04:00
David Hou	4b95350c41	fp16 resnet (without expand backwards sum in float, doesn't work) (#3816 ) * fp16 resnet * cast running mean and var back to default float * extra cast * check symbolic no overflow * add linearizer failure * loss scaler after grad contig * oops * i think this works * don't loss scale fp32 * remove overflow test case * remove symbolic bounds check * loss scaler should be float * temporarily disable padto cuz bug shruggie * make running stats in batchnorm float32? * calculate lars stuff in fp32? * oops * remove most changes * move loss scaler out of optimizer * no more FP16 var * oops --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-03-28 01:25:37 -04:00
Francis Lam	16a1d43f6f	llama: prevent device initialization outside of __main__ (#3966 ) * llama: prevent device initialization outside of __main__ causes HSA resources leakages in child compile processes * llama: fix loading with multiple devices	2024-03-27 19:19:38 -04:00
George Hotz	68ca4d4276	split to schedule.py (#3949 ) * split to schedule.py * split	2024-03-26 21:02:46 -07:00
George Hotz	150ea2eb76	create engine folder and move code (#3948 ) * retry * older tf * that	2024-03-26 20:38:03 -07:00
Arseny Kapoulkine	cb6e7b57a6	examples: Fix parameter bandwidth accounting for quantized LLama (#3930 ) Instead of assuming every parameter is 2 bytes, just add up tensor sizes in bytes	2024-03-25 18:41:05 -04:00
chenyu	d651835ef5	verify beautiful_mnist.py eval acc and put into benchmark ci (#3926 ) * verify beautiful_mnist and put in ci * 97.5 for eval verification	2024-03-25 16:47:49 -04:00
chenyu	83f39a8ceb	env var to change default float (#3902 ) * env var to change default float to fp16 or bf16 looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights. working on default bf16 too. ``` RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined __bf16 cast0 = (nv_bfloat16)(val0); ``` remove that in cifar * DEFAULT_FLOAT * default of default * unit test * don't check default * tests work on linux	2024-03-24 20:33:57 -04:00
wozeparrot	9a9cac58f9	add lars to nn (#3750 ) * feat: add lars * feat: don't remove this comment * clean: smaller diff * clean: shorter line * feat: remove mlperf lars, switch resnet * fix: fully remove mlperf lars * clean: comment * feat: contiguous * feat: no weight decay on skip params * feat: optimizergroup * feat: classic momentum * fix: pylint * clean: move comment * fix: correct algo * feat: lrschedulergroup * feat: skip list tests * feat: :| forgot that params are a thing * feat: remove skip_list params from main params * feat: set moment --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-03-24 11:43:12 -04:00
chenyu	e22d78b3d2	training cifar with BF16 on CUDA (#3905 ) * training cifar with BF16 on CUDA memory usage is between float and half due to numpy calls on dataset preprocessing, which converts into float. * simpler bf16 functions * bf16 cifar works for HSA too just very slow * simpler bf16 functions, we love cuda	2024-03-24 01:37:47 -04:00
chenyu	24d004a89b	hotfix check ckpts before writing achieved model (#3901 ) this killed tinybox green run	2024-03-23 17:16:38 -04:00
chenyu	f7f67e0cc5	simple fix llama shard with quantize (#3882 ) copy scale on all device for now. naive sharding does not work because scale needs expand to really save memory. 70B does not work due to HSA_STATUS_ERROR_OUT_OF_RESOURCES. `python3 examples/llama.py --gen 2 --size 13B --shard 6 --prompt "Hello." --count 10 --temperature 0 --timing --quantize` 13B on 6 gpus uses 47 GB v.s. 34 GB quantized	2024-03-22 18:15:37 -04:00
Francis Lam	a26090d404	search: change to use "spawn" and limit the number of tasks per child (#3862 ) also clean up some examples to use __main__ and not initialize resources outside of main	2024-03-21 21:23:36 -07:00
Anurag Lamsal	4e0819e40b	fixing the benchmark not printing in handcode resnet50 opt example (#3850 )	2024-03-21 00:55:31 -04:00
chenyu	9d1d08fbb0	show llama bandwith with timing (#3844 )	2024-03-20 17:19:15 -04:00
chenyu	dccefab23f	remove mixtral weight to clang first (#3792 ) seems fine without it now	2024-03-17 23:33:17 -04:00
chenyu	5ac1fa933f	apply the same fix_bf16 in llama and coder (#3789 ) * apply the same fix_bf16 in llama and coder did not realize the same logic was in llama too. really fix #2775 * flag for native SUPPORT_BF16 cast	2024-03-17 21:25:24 -04:00
chenyu	639bd5dbfc	move bf16 cast hack to Tensor.llvm_bf16_cast (#3788 )	2024-03-17 18:51:22 -04:00
chenyu	9255332d9e	use llvm as bridge to fix_bf16 loading (#3774 ) This is how bf16 load is tested in test_bf16_disk_write_read now and it should fix #2775. I tested that it fixed loading coder using PYTHON backend. Will separate this special bf16 load v.s. regular bf16 support	2024-03-16 15:22:19 -04:00
chenyu	e1c5aa9cce	estimated resnet training time for BENCHMARK (#3769 )	2024-03-15 22:36:58 -04:00
chenyu	4bd5535d72	update mlperf resnet default hparams (#3758 ) we might be able to have higher lr given smaller BS, but this is good. Trained to 75.9% https://wandb.ai/chenyuxyz/tinygrad-examples_mlperf/runs/xi2f48se/overview	2024-03-15 12:09:26 -04:00
George Hotz	641f347232	simple LoadOps.ASSIGN (#3745 ) * simple LoadOps.ASSIGN * skip that test * don't assign in onnx ops gemm * track cache usage * recreate the lazybuffer to avoid the cache * fix contigs * skip that test * lol * better letters	2024-03-14 20:44:34 -07:00


1179 Commits