Commit Graph

2028 Commits

George Hotz
f265e8523a movement ops aren't really ops (#1056) 2023-06-26 15:01:28 -07:00
Rayan Hatout
65cbaa3429 no need to slice A and B twice in LLaMa complex multiplication (#1054) 2023-06-26 14:42:58 -07:00
George Hotz
571089f10e Back off minor speed stuff for simplicity (#1053)
* passing in buffers doesn't increase speed

* functools.reduce

* no more get_buffers
2023-06-26 14:42:17 -07:00
Rayan Hatout
dedbd970aa Optimizations in lazy.py (#987)
* optimizations in lazy.py

* make mypy happy with stubs and fix the graph import hack

* merge conflict in helpers.py
2023-06-26 13:55:42 -07:00
Roelof van Dijk
8bea6b6d35 perf/refactor_weakops (#1052)
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-06-26 10:13:33 -07:00
Roelof van Dijk
8c65f9324c refactor: print formatting for llama timing (#1050)
* refactor: print formatting for llama timing, report median and individual runs

* feat: back to mean

* fix: whitespace

* fix: add mean to print

---------

Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-06-26 09:49:31 -07:00
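
A minimal sketch of the report shape this entry describes — per-run times plus a summary statistic (the PR tried the median, then went back to the mean); the helper name and formatting are illustrative, not tinygrad's actual code:

```python
from statistics import mean

def report_timings(times_ms):
    # Print every individual run, then summarize with the mean.
    for i, t in enumerate(times_ms):
        print(f"run {i}: {t:8.2f} ms")
    print(f"mean: {mean(times_ms):8.2f} ms over {len(times_ms)} runs")
```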
Roelof van Dijk
c604ef4beb symbolic.py: faster Node.sum, faster SumNode.div (#1014)
* refactor: replace isinstance with class check where possible

* refactor: faster partition

* fix: flake8

* feat: rework node.sum, correct list typing

* fix: typo

* feat: refactor sum

* fix: pylint

* refactor: simpler sum and factorize

* feat: clean up SumNode div, all CPU tests pass

* feat: simplify floordiv, cache factorization

* don't factor numnodes at all

* Python 3.8 functools does not yet have @cache (see the sketch after this entry)

* fix: restore assert

* refactor, fix failing tests

* fix: address review comments

* feat: rework, add specialization, remove cache

* fix: remove specialization

* feat: no tuple conversion, faster loop

---------

Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-06-26 09:47:17 -07:00
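
The `@cache` bullet in the entry above refers to `functools.cache` only landing in Python 3.9; a sketch of the usual 3.8-compatible fallback (`divisors` is a hypothetical stand-in for the cached factorization):

```python
import functools
import sys

# functools.cache is new in Python 3.9; on 3.8 an unbounded lru_cache
# is the drop-in equivalent.
cache = functools.cache if sys.version_info >= (3, 9) else functools.lru_cache(maxsize=None)

@cache
def divisors(n: int):  # hypothetical stand-in for the cached factorization
    return tuple(d for d in range(1, n + 1) if n % d == 0)
```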
Casey Primozic
52b7105f87 Dedup params in Optimizer (#1047)
* Dedup params in optimizer

 * Passing the same tensor multiple times in the set of learnable params handed to an optimizer can make a model fail to learn entirely, while producing no errors. This dedups the tensors to avoid the problem (see the sketch after this entry).

* Fix types

* Use new variable to satisfy linter

* Use `helpers.dedup` instead of `set()` to dedup params

* Add test for duped params in optimizers
2023-06-26 00:49:23 -07:00
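
A minimal sketch of the fix described above, assuming an order-preserving dedup in the spirit of tinygrad's `helpers.dedup`; the `Optimizer` class here is illustrative:

```python
def dedup(xs):
    # Order-preserving, unlike set(); dict keys keep insertion order.
    return list(dict.fromkeys(xs))

class Optimizer:  # illustrative, not tinygrad's actual class
    def __init__(self, params):
        # A tensor passed twice would otherwise be stepped twice per
        # update, silently breaking training with no error raised.
        self.params = dedup(params)
```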
Kunwar Raj Singh
5d3310ce56 MaskRCNN Inference (#884)
* MaskRCNN weights loading

* backbone maybe works

* backbone works, but resnet body atol 1e-3

* RPN call, but very wrong output

* fixed topk

* RPN maybe works, not sure about nms

* Fix cursed modules

* add back editorconfig

* Full call, wrong output

* Full call works

* fix mask

* use NMS from retinanet

* Removing extra funcs

* refactor

* readable

* Add example to run model

* remove filter

* Fix split, batched inference is worse

* Fix image sizes

* Matching reference

* merge master

* add filter on top detections

* cuda backend fixed

* add model eval and spec

* convert images to rgb

* fix eval

* simplify examples code

* remove extra code

* meshgrid using tinygrad (see the sketch after this entry)

* removing numpy

* roi align, floor, ceil

* remove numpy from level_mapper

* remove numpy from pooler

* Revert "Merge branch 'master' of github.com:kunwar31/tinygrad into mrcnn-inference"

This reverts commit 4b95a3cb49, reversing
changes made to 98f2b1fa2e.

* roi align gather

* fix master merge

* revert to old floor, ceil as ints present in domain

* use log2 op

* fix indexes

* weird bug with ints and gpu

* weird bug with ints and gpu

* refactors, add env var for gather

* floor with contiguous, where

* refactor topk, sort

* remove staticmethod

* refactor stride

* remove log2 mlop

* realize -> contiguous

* refactor forward

* remove num_classes, stride_in_1x1 from state

* refactor forward

* refactoring

* flake8

* removing numpy in anchor gen, use numpy for gather, nonzero, optimize topk

* keep using tinygrad for smaller gathers

* fix empty tensors

* comments

* move from tensor.py

* resnet test passing

* add coco dataset back

* fix spaces

* add test for log2

* no need to create Tensors

* no need to create Tensors

---------

Co-authored-by: Kunwar Raj Singh <kunwar31@pop-os.localdomain>
2023-06-25 15:37:51 -07:00
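
A sketch of a numpy-free meshgrid like the one the bullet above mentions, built from tinygrad movement ops (reshape + expand, per the Tensor API of the time); the function is illustrative, not the PR's exact code:

```python
from tinygrad.tensor import Tensor

def meshgrid(x: Tensor, y: Tensor):
    # Broadcast a row against a column to form the two coordinate grids.
    gx = x.reshape(1, -1).expand(y.shape[0], x.shape[0])
    gy = y.reshape(-1, 1).expand(y.shape[0], x.shape[0])
    return gx, gy
```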
George Hotz
0f281e7b18 touchups 2023-06-25 15:24:26 -07:00
George Hotz
c8fbdeb48e test speed llama (#1046)
* test speed llama

* oops, put it back

* uses the real device codegen

* just do it on the mac

* pp

* is faster?

* Revert "is faster?"

This reverts commit 42db542010.

* disable docker again for less load on CI
2023-06-25 15:22:56 -07:00
Jacky Lee
5d16cc283f Docker fix (#1039)
* Docker test

* Remove extra installs

* Don't run full test

* No need for testing dependencies
2023-06-25 10:38:58 -07:00
Francesco Castelli
6ff720103e Reduce tensor dot line count and fix 1d tensor dot (#1045)
* fixed tensor.dot

* no 1d dot for image=1

* shorter lines

* add 3d dot tests
2023-06-25 10:32:45 -07:00
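
A sketch of the standard 1-D handling in a dot/matmul — promote to 2-D, multiply, then squeeze the promoted axes back out; illustrative (with numpy as the stand-in array type), not tinygrad's actual `Tensor.dot`:

```python
import numpy as np  # stand-in array type for the sketch

def dot(a, b):
    a2 = a.reshape(1, -1) if a.ndim == 1 else a
    b2 = b.reshape(-1, 1) if b.ndim == 1 else b
    out = a2 @ b2
    if b.ndim == 1: out = out.reshape(out.shape[:-1])  # drop promoted column
    if a.ndim == 1: out = out.reshape(out.shape[1:])   # drop promoted row
    return out  # 1-D . 1-D collapses all the way to a scalar
```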
George Hotz
9c6e507518 move accel into extra 2023-06-23 16:38:15 -07:00
Yair Lifshitz
7f73d6a4da Fix input path in examples/compile_efficientnet.py, examples/efficientnet.py. (#1034) 2023-06-23 16:34:33 -07:00
兰天游
0222ee7bd2 feat: fix shell alias on readme (#1022)
* feat: fix shell alias on readme

* feat: edit the install command
2023-06-23 00:00:34 -07:00
cloud11665
264b1e5f48 cache gpuocelot build in cuda CI (#1032) 2023-06-22 17:42:12 -07:00
cloud11665
2407690d82 add cuda on cpu tests (#1020) 2023-06-22 14:15:50 -07:00
Eli Frigo
e09219df0f fixed division by zero for fast kernels (#1021)
* fixed division by zero for fast operations

* made et closer to 0
2023-06-22 14:02:53 -07:00
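
A minimal sketch of the guard this entry implies: a fast enough kernel can come back with a measured elapsed time (`et`) of exactly 0.0, and the FLOPS computation then divides by zero; names and the epsilon are illustrative:

```python
def gflops(op_count: int, et_s: float) -> float:
    # Clamp et away from zero instead of special-casing it; with a
    # sub-timer-resolution kernel, 0.0 is a legitimate reading.
    return op_count / max(et_s, 1e-12) / 1e9
```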
George Hotz
18892242b0 global -> group (#1007)
* global -> group

* allow None for local_size in custom function

* lil local

* comment on shape

* fix cuda

* smart local cast

* better local heuristic

* fix ptx, and work_dim cleanup

* fix metal

* fix ops test

* fix openpilot jit

* no more optlocal

* might fix metal tests

* try metal now

* see generated metal code

* test free removal. REVERT THIS

* mergable
2023-06-21 11:50:43 -07:00
Casey Primozic
aab9ee0fca Add RDNA3 assembler UOps.CAST partial support + other fixes/improvements (#1012)
* Add support for one case of `UOps.CAST` for RDNA3 assembler

 * Adds support for casting from `bool` -> `float32`, which seems to be a very common operation required in many places.
 * Fix bool register definition for vector operations
   * Use `vcc_lo` instead of `vcc`, which seems to be required since the kernel is configured to use wavefront_size=32
 * Add vector support for some places that were scalar only in register definition and comparison ops
 * Fix some issues in what seems to be defunct `external_test_image.py`
   * Some tests still don't pass for other reasons, but it at least runs now and one broken test is now fixed

* Refactor RDNA3 assembler register definition

 * Unify multi-register code between dtypes and combine with single-register allocation since they're all untyped registers at the end of the day
2023-06-20 11:34:10 -07:00
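
A heavily hedged sketch of the bool → float32 cast the entry above describes, written as the string a hypothetical emitter might produce. The instruction spelling follows the RDNA3 ISA (`v_cndmask_b32` selects per lane from a condition mask), but the surrounding machinery is illustrative:

```python
def cast_bool_to_f32(dst: str = "v0", mask: str = "vcc_lo") -> str:
    # Per lane: 1.0 where the condition bit is set, else 0. Wave32
    # kernels read the 32-bit vcc_lo; the 64-bit vcc pair is a wave64
    # convention, hence the fix noted above.
    return f"v_cndmask_b32 {dst}, 0, 1.0, {mask}"
```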
Diogo
57d3aa76a5 Windows & Ubuntu CLANG CI support (#1011)
* matrix strategy

* push env to GITHUB_ENV

* use printf instead of echo

* use temp helper function for cross os paths

* use path join

* switched to using temp helper function

* skip test on windows due to memory limit

* small fix

* removed semi

* touchups

* clean up

* separate tests

* test changes to test_utils on windows

* small refactor

* more cleanups

* undo helpers change

* only skip if in CI and WINDOWS
2023-06-19 09:33:24 -07:00
George Hotz
0d4c4f4e9e metal ci attempt (#1010)
* metal ci attempt

* skip failing ops tests

* skip in the ops test

* no dtype test
2023-06-19 09:23:55 -07:00
George Hotz
0ac84d5e94 exclude a few more onnx tests 2023-06-19 08:51:29 -07:00
George Hotz
0fd648dff4 exclude more dumb onnx tests 2023-06-19 08:51:29 -07:00
Pasan Perera
b6102ba4ac added CUDA and PTX to env_vars.md (#1009) 2023-06-19 08:47:44 -07:00
Sayantan Das
e829e0e718 Update CONTRIBUTING.md (#1008) 2023-06-18 22:09:03 -07:00
George Hotz
d84c600e5d contributing 2023-06-18 21:48:18 -07:00
Casey Primozic
651d6ea457 Minor improvements + cleanup to ops_gpu.py (#1006)
* Minor improvements + cleanup to `ops_gpu.py`

 * Add some previously undocumented environment variables from `ops_gpu.py` to `env_vars.md`
 * Update debug print for OpenCL to print the devices that will be used post-filtering with `CL_EXCLUDE`
 * Remove a couple unused or superfluous variables and assignments
 * Use `fromimport` shorthand to shave off a couple precious LOC
 * Couple small whitespace changes to clean things up

* Revert change to ordering of OpenCL devices

* Small refactor for OpenCL context creation
2023-06-18 21:26:40 -07:00
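
A hedged sketch of the post-filtering debug print described above; the pyopencl calls are real, but the `CL_EXCLUDE` matching logic and output format here are illustrative rather than tinygrad's exact code:

```python
import os
import pyopencl as cl

exclude = set(filter(None, os.getenv("CL_EXCLUDE", "").split(",")))
devices = [d for p in cl.get_platforms() for d in p.get_devices()
           if d.name not in exclude]
print(f"using devices: {[d.name for d in devices]}")  # printed post-filtering
```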
George Hotz
5428b5d774 good changes from tensor_cores branch (#1005)
* good changes from tensor_cores branch

* touchups

* real_strides fixup

* refactor merge_views
2023-06-18 20:28:06 -07:00
Yann Huynh
ccb51ff5b0 "Fixed argument passing in example yolov8" (#1004)
"Fixed argument passing in example yolov8"
2023-06-18 14:29:39 -07:00
George Hotz
b14b7bc749 don't make HIP the default...it's slower 2023-06-18 19:11:39 +00:00
Alex Wang
3d63c71e27 HIP backend (#750)
* llama works for HIP backend

* Use hipMemcpyAsync; fewer lines of code

* Remove unused code

* Refactor

* Add comments; hipDeviceSynchronize

* HIP over GPU; Remove PyHIP dependency

* Cleanups

* Fix mypy check

* Merge master; Dump assembly code
2023-06-18 11:35:57 -07:00
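
A hedged ctypes sketch of the `hipMemcpyAsync` + `hipDeviceSynchronize` pattern the bullets name. The HIP entry points and the `hipMemcpyHostToDevice` enum value are standard; the binding style is illustrative, and tinygrad's own differs:

```python
import ctypes

hip = ctypes.CDLL("libamdhip64.so")
HIP_MEMCPY_HOST_TO_DEVICE = 1  # hipMemcpyKind value per hip_runtime_api.h

def copy_to_device(dst_ptr: int, src_ptr: int, nbytes: int, stream=None):
    # Queues the copy and returns; completion is ordered on the stream,
    # so synchronize before the host reads anything depending on it.
    hip.hipMemcpyAsync(ctypes.c_void_p(dst_ptr), ctypes.c_void_p(src_ptr),
                       ctypes.c_size_t(nbytes), HIP_MEMCPY_HOST_TO_DEVICE, stream)
    hip.hipDeviceSynchronize()
```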
Casey Primozic
805eef10dd Add tensorflow GEMM benchmark script (#1000)
* Modelled closely after the existing torch benchmark script but just adapted slightly for tensorflow
2023-06-18 10:57:45 -07:00
George Hotz
c690eeaca9 flip mulacc to save a line (#997) 2023-06-17 16:47:55 -07:00
Diogo
d2b837c1d9 Adds floor/ceil (#989)
* floor ceil impl

* control casting in numpy
2023-06-17 10:56:21 -07:00
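
A sketch of the classic floor/ceil construction from a truncating int cast — the usual route when a backend only exposes casts, and the reason the second bullet worries about numpy casting; illustrative, not tinygrad's mlop:

```python
def floor(x: float) -> float:
    t = float(int(x))                 # int() truncates toward zero
    return t if t <= x else t - 1.0   # fix negatives: int(-2.5) == -2

def ceil(x: float) -> float:
    return -floor(-x)                 # ceil by reflection
```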
sehaj
775287ed91 Add yolov8 implementation (#806)
* added SPPF module from yolov8

* added conv_block, bottleneck modules

* cleaned modules

* c2f example

* spf changes

* C2f

* fixed and tested bottleneck

* improved detect class

* tested spf and conv

* checked c2f

* DFL structure

* fixed dfl

* added dist2bbox function

* added dist2bbox function

* added and tested make_anchors function for the head

* keeping functions above

* creating the detection head

* fixing head

* untested blocks a. scale_boxes b. clip_boxes c. xywh2xyxy d. box_iou

* head works

* structure fix

* added darknet (backbone)

* yolov8 neck, and bias initialization function for detection

* fixed spacing

* yolov8 class, init bias, and fixed c2f

* forward pass almost working

* fixed net structure

* init bias not needed, forward pass working

* load weights boilerplate

* load weights done?

* all variants loading!

* post process: clip_boxes, scale_boxes, xywh2xyxy, and box_iou (untested)

* fix scale_boxes

* box_iou fixed and tested

* created the pre nms function

* fix nms

* fixed load weights, apparently the latest commit broke something, excluding num_batches_tracked

* added letterbox and pre_transform for pre_process function

* fixed letterbox, pre_transform and added preprocess function

* custom NMS done, integrated prepare_boxes and nms, improved box_iou

* added postprocess function till parsing

* added draw_bounding_boxes_and_save function

* testing full flow

* using fetch for class names

* fixed make_anchors + all tinygrad now

* added command line arguments, weight downloading

* single image for now only

* made draw boxes more efficient

* made NMS functions efficient

* made compute_transform better

* v8 working now, inference is done

* prints objects detected in console now

* fixed image loading (pre processing)

* batch post processing

* created initial tests

* fixed bounding box thickness and added get_detected_classes_with_frequency function

* cleaning for testing

* two tests

* added url option for image, removed need for specifying arguments

* tests complete, but lots of things are printed on screen by ultralytics

* remove parse arguments

* fixed weight location

* fixed colours of classes, and black font when high brightness

* minor changes

* TODOs for later

* removed use of torch, using .npz weights

* fixed tests

* one path for fetch

* preprocess now in tinygrad, plus test fix for that

* updated tests

* fix tests

* no class labels needed

* Add files via upload

* Update showcase.md

* Update showcase.md

* added safetensors as weights, and test fixes for that

* safe tensors test

* using safe_load

* using tinygrad functions now to load weights

* update tests

---------

Co-authored-by: r3sist-uniq <amanmatreja@gmail.com>
Co-authored-by: r3sist <72573738+r3sist-uniq@users.noreply.github.com>
2023-06-16 18:55:19 -07:00
George Hotz
fe71282ba1 faster RDNA assembly backend (#990)
* fast asm

* torch gemm
2023-06-16 12:06:38 -07:00
George Hotz
ba56ee6020 RDNA assembly backend ($1000 bounty) (#787)
* Revert "Revert "ops rdna""

This reverts commit 0400315078.

* Revert "Revert "writing 2""

This reverts commit 325a3bf2cf.

* no dump

* 2x 2

* simple asm

* local size

* sub

* lil work

* support args != 3

* assembler work

* generate that

* ptx assembler

* begin index renderer

* max

* ptx loops

* gemms work

* valid works

* asm working a bit more

* close

* passing all ops tests

* ptx is a codegen only, not a backend

* ptx

* float16 support

* rdna goes here

* install types

* make amd disassemble

* ansilen for pretty print

* fix ptx log2/exp2

* assemblyinstruction

* new asm

* working gemm

* fix cmp

* more passing

* mod

* ptx works again

* rdna3 add works

* log exp

* sin is sin 2pi

* fix types

* progress

* loops work

* rdna xyz

* better addressing

* cleanups

* handle exception in early process

* div support

* rdna float4

* locals work

* fix neg index

* cast

* smaller diff

* yaml

* import only if selected

* fromimport

* types

* this all needs rewriting

* a few more
2023-06-16 09:33:18 -07:00
George Hotz
dca084f227 minor == to is touchups 2023-06-15 17:11:12 -07:00
blake
041d96083c clang rt for msvc (#986)
* added platform config for the clang runtime and a tempfile dir for cross-platform /tmp

* flake8 lint

* mypy lint

* pythonic?

* python?

* return darwin cflags

* <lines

* lint;
2023-06-15 17:06:44 -07:00
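
A minimal sketch of the cross-platform `/tmp` replacement the first bullet describes, using only the standard library; the helper name is hypothetical:

```python
import os
import tempfile

def temp_path(name: str) -> str:
    # tempfile.gettempdir() is /tmp on Linux/macOS and %TEMP% on
    # Windows, so the same path logic works under MSVC toolchains.
    return os.path.join(tempfile.gettempdir(), name)
```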
George Hotz
039f0d372f delete ltypes (#984)
* delete ltypes

* only upcast float types

* test dtype on mac passes

* ugh, these upcasts
2023-06-15 16:24:45 -07:00
Yahya Lmallas
804c45b5fc FIX: Can't pickle local object (#979)
_early_exec_process is a local function defined within the scope of another function; it should be global
2023-06-14 12:32:17 -07:00
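
A minimal sketch of the failure this commit fixes: pickle serializes functions by qualified name, so a function defined inside another function cannot be resolved at load time. Everything but the `_early_exec_process` name is illustrative:

```python
import pickle

def _early_exec_process(x):  # module level: picklable by qualified name
    return x

def make_worker():
    def worker(x):  # local: qualname is 'make_worker.<locals>.worker'
        return x
    return worker

pickle.dumps(_early_exec_process)  # fine
try:
    pickle.dumps(make_worker())
except AttributeError as e:
    print(e)  # Can't pickle local object 'make_worker.<locals>.worker'
```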
Rayan Hatout
2d567ef688 Optimizations in tensor.py (#974)
* optimizations in tensor.py

* make mypy happy

* revert split of Function class
2023-06-14 08:44:35 -07:00
Diogo
0629791cbd F64 support (#976)
* initial commit

* added osx check for opencl

* added llvm f64 conversions

* typo in llvmir

* more tests and modified unsupported error

* fixed linting error

* added pragma fp64

* simplified exclusion for OSX

* fixed device check and also added it to cast func

* added ifdef check for fp16 in ops_gpu

* Revert "added ifdef check for fp16 in ops_gpu"

This reverts commit 92de754d48.

* f64 prekernel signature match f16

* moved condition to buffer init
2023-06-13 21:31:31 -07:00
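
A sketch of the prekernel line the `pragma fp64` bullet refers to, mirroring the f16 one it was matched against; both pragmas are standard OpenCL extension enables, and the variable names are illustrative:

```python
# Prepended to generated OpenCL source when the dtype appears in a
# kernel; devices without the extension reject it at compile time,
# which is what the device check above guards against.
F16_PREKERNEL = "#pragma OPENCL EXTENSION cl_khr_fp16 : enable\n"
F64_PREKERNEL = "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n"
```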
John Moore
45bc040a63 Fix typo (#978) 2023-06-13 15:15:45 -07:00
George Hotz
80e665bddb a couple new tests 2023-06-13 12:36:05 -07:00
George Hotz
ba4eadb04c PTX assembly support (#977)
* ptx assembly

* all ops tests pass

* fix tests
2023-06-13 12:31:42 -07:00
Rayan Hatout
727416201f Shapetracker optimizations (#966)
* optimizations in shapetracker.py

* revert micro-optimizations in assertions

* make mypy happy

* list comp instead of map in get_unsafe_resize_offset

* list comp instead of map in get_unsafe_resize_offset
2023-06-12 18:13:21 -07:00
cloud11665
5f13e7c3cf cuda: fix fp16, uint8, int64, half4 codegen (#968)
* cuda: add uchar, int64 typedefs

* cuda: fix float16 codegen

* fuck it, half4 stub. llama time!

* inline fp16 half4, revert changes to CStyleLanguage

* add inline just in case

* remove half4 operators

* use dict
2023-06-12 11:15:44 -07:00
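
A hedged sketch of the kind of prelude these bullets add to generated CUDA source. CUDA ships a built-in `half2` but no `half4`, hence the stub; the exact text tinygrad emits may differ:

```python
CUDA_PRELUDE = """
#include <cuda_fp16.h>
typedef unsigned char uchar;
typedef long long int64;
struct half4 { half x, y, z, w; };  // stub: CUDA has half2, not half4
__device__ half4 make_half4(half x, half y, half z, half w) {
  half4 r; r.x = x; r.y = y; r.z = z; r.w = w; return r;
}
"""
```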