Commit Graph

916 Commits

Author SHA1 Message Date
uuuvn
a51c688f39 Cleanup llvm cleanup (and some clang things too) (#8871)
* Cleanup llvm cleanup (and some clang things too)

* Tests

* Tests 2

* forgot mockgpu

* more print some sources
2025-02-05 07:49:05 +08:00
George Hotz
56fa5c1191 dsp simulator (#8869)
* dsp simulator

* progress

* fix

* close on test tiny

* working

* less waste

* line savings

* Device DSP compiler

* mock DSP at the bottom

* DSP tests

* docker caching

* test update

* need load

* skip that test for CI DSP

* last touch

* ugh
2025-02-04 09:45:04 +08:00
chenyu
836cf42c2e fix rand_like for multi (#8880) 2025-02-03 19:00:14 -05:00
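
For context, the case this fixes: `rand_like` on a tensor sharded across devices should return a tensor with the same shape and sharding. A minimal sketch, assuming the CLANG backend and illustrative device names; `shard` and `rand_like` are real Tensor methods, but this is not the commit's test code.

```python
from tinygrad import Tensor

# shard rows across two (illustrative) devices, then draw random values with the
# same shape and device placement as the source tensor
t = Tensor.ones(8, 8).shard(("CLANG:0", "CLANG:1"), axis=0)
r = t.rand_like()
print(r.shape, r.device)   # (8, 8) on the same device tuple as t
```
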
uuuvn
6dadb60c93 LLVM JIT (+autogen llvm instead of llvmlite) (#8486)
* LLVM JIT

* Autogen LLVM

* Update autogen

* Move things around

* even more non-determinism

* windows

* more autogen weirdness

* more windows stuff

* blind windows development try 2

* more blind windows development

* even more blind windows development

* maybe i should just set up a windows vm...

* why can't everyone just use sysv abi?

* cleanup debugging stuff

* unused import

* icache flushing isn't required on x86

* merge jit_nt and jit_unix

* more

* Temporary hack to not segfault

* better error

* bad conflict resolution

* Attempt to simplify support/llvm.py

* More refactoring

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-02-02 19:52:42 +08:00
chenyu
7f606fbde4 remove DEBUG=5 in windows ci test [pr] (#8803)
DEBUG=5 prints a lot of info, which is slow, and the output is not visible if the test passes on CI.
also skip two tests that took 3 minutes in the python backend
2025-01-29 14:18:17 -05:00
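
For reference, DEBUG is an environment-driven setting in tinygrad; a minimal sketch of the kind of run it instruments (the exact per-level output is version-dependent, so the comments are approximate):

```python
# run as, e.g.: DEBUG=2 python debug_demo.py
# roughly: DEBUG=2 prints per-kernel timings, DEBUG=4 the generated source, and
# DEBUG=5 even lower-level detail -- the large, slow output this commit drops from CI
from tinygrad import Tensor

out = (Tensor.rand(64, 64) @ Tensor.rand(64, 64)).relu()
out.realize()   # kernels execute here; output volume depends on the DEBUG level
```
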
FICTURE7
ec120ce6b9 Fix allocator memory alignment (#8800)
* Fix allocator memory alignment

* Run `test_ops.py` using LLVM and CLANG on Windows
2025-01-29 21:03:17 +03:00
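
The usual shape of such an alignment fix, as a hedged sketch in ctypes; this is the generic over-allocate-then-align pattern, not tinygrad's actual allocator code:

```python
import ctypes

def aligned_alloc(size: int, alignment: int) -> ctypes.Array:
    # over-allocate by alignment-1 bytes so an aligned start address must exist,
    # then return a view of the buffer that begins at that address
    backing = (ctypes.c_uint8 * (size + alignment - 1))()
    offset = (-ctypes.addressof(backing)) % alignment
    return (ctypes.c_uint8 * size).from_buffer(backing, offset)

buf = aligned_alloc(1024, alignment=64)
assert ctypes.addressof(buf) % 64 == 0
```
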
b1tg
da464d039f fix windows ci cache (#8787)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-01-28 13:22:15 +02:00
b1tg
5d62aa28dc Support CLANG backend on Windows (#8768)
* Support CLANG on Windows

* Put both backends in a windows ci

* remove coff loader

* use memmove

---------

Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-28 18:19:34 +09:00
b1tg
efc7971090 add windows test to ci (#8761)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-01-27 14:53:21 +09:00
George Hotz
1b4618e257 gradient cleanup (#8750)
* switch backward to use gradient [pr]

* set device correctly, dedup

* why does that fail?

* add noop cast

* simple backward

* fix beautiful_mnist

* touchups

* set in compute_gradient

* uop_count

* uop_count was wrong

* collections

* no note

* skip that test

* update sched kernel counts

* train mnist is 65

* fix metadata and gc

* fixes

* materialize_grads

* no pathlib stuff

* add contiguous_backward, fix bugs

* add some realize

* fix multi

* remove unused backward passes [pr]

* lower line count
2025-01-26 09:30:55 +09:00
chenyu
0c759e1ff6 add bert to benchmark ci (#8741)
with `DISABLE_DROPOUT=1 BERT_LAYERS=2` for now
2025-01-24 14:45:11 -05:00
George Hotz
e82ba1454b MultiLazyBuffer is UOp [pr] (#8662)
* MultiLazyBuffer is UOp [pr]

* this is new mlb

* this is the idea

* progress

* multitensor works

* more movement ops

* this

* MultiLazyBuffer is UOp

* cleanups

* multi axis

* fix more tests

* work

* not that

* add multi grad and move shard to ops

* mops not views

* no double contig

* sweet, all mt tests passing

* port old logic

* remove lbs

* fix realized

* whitespace

* assign tweak

* test_assign_kv_cache_multi passes

* fix is_realized

* fix JIT for multi

* just a few more lines i'll pay them back soon i swear please bro just a few more

* no split reduceop for multi
2025-01-24 13:28:55 +09:00
George Hotz
46a8c5e1e5 delete forced_realize (#8615)
* delete forced_realize

* put that back

* expectedFailures

* cleaner create_subbuffer

* more comments

---------

Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-01-20 09:40:36 -08:00
nimlgen
9d3c40601f am: fast memory manager (#8654)
* start

* progress

* fixes

* smth

* mini fixes

* fix2

* ugh, need this for now

* faster

* cleanups

* tiny linters

* make mypy happier

* test & free pts

* ops

* linter

* cleanup vm

* fix

* remove map_from

* tiny fixes

* add test to ci
2025-01-20 16:58:22 +03:00
ignaciosica
d2234e308a tf32 tc for nv and ptx (#8635)
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-17 17:43:57 -08:00
nimlgen
f671da6755 ci: add AM start time to benchmark (#8637)
* ci: add AM start time to benchmark

* am: unlock it

* add AMD

* revert this
2025-01-16 14:47:36 +03:00
chenyu
4ee3243c93 JITBEAM=2 for LLaMA-3 8B on 4 GPUs [pr] (#8623)
is it fast?
2025-01-14 19:52:38 -05:00
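
For context, JITBEAM applies BEAM kernel search only to kernels captured under TinyJit, so startup stays fast and only the replayed hot path pays the tuning cost. A minimal sketch of the pattern, with the LLaMA run reduced to a toy jitted function:

```python
# run as, e.g.: JITBEAM=2 python jitbeam_demo.py
from tinygrad import Tensor, TinyJit

@TinyJit
def step(x: Tensor) -> Tensor:
    return (x @ x).relu().realize()

for _ in range(4):   # early calls capture the kernels, later calls replay them
    out = step(Tensor.rand(256, 256))
print(out.shape)
```
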
George Hotz
bfbe81df71 remove cast before view (#8613)
* remove cast before view

* greener

* indexing

* that passes too

* openpilot too

* ack

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2025-01-14 15:04:58 -05:00
chenyu
393eec3201 raise RuntimeError for uneven shard [pr] (#8593)
no 7B llama on 6 GPUs

skip 70B
2025-01-14 14:51:48 -05:00
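
The guard itself is a divisibility check; a hedged sketch (illustrative code, not tinygrad's exact implementation):

```python
def check_even_shard(shape: tuple, axis: int, n_devices: int) -> None:
    # refuse an axis that doesn't split evenly across the devices,
    # e.g. llama 7B's 32 attention heads across 6 GPUs
    if shape[axis] % n_devices != 0:
        raise RuntimeError(f"axis {axis} of {shape} does not shard evenly across {n_devices} devices")

check_even_shard((32, 4096), 1, 4)       # ok: 4096 % 4 == 0
try:
    check_even_shard((32, 4096), 0, 6)   # 32 % 6 != 0
except RuntimeError as e:
    print(e)
```
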
ignaciosica
d5a646d492 CUDA Turing TC (#8597)
* init turing tc

* reorder tc

* hotfix: remove some spaces

* revert var name to x

* consistent order of factors

* revert order of terms to match old stuff

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-14 10:35:14 -08:00
nimlgen
1ff6862a3d ci: sleep a bit to let the driver unload the prev pid (#8605) 2025-01-14 15:55:23 +03:00
nimlgen
74b83c4c41 am in ci (#8532)
* try am in ci

* no sudo

* temp

* run more am test

* run half on am

* insert amdgpu

* other machine as well
2025-01-13 19:55:17 +03:00
qazal
2f71a00236 remove PYTHONPATH=. from mypy ci [pr] (#8578) 2025-01-12 09:52:03 -08:00
qazal
98c9e23560 remove global PYTHONPATH setting in CI (test.yml) [pr] (#8568)
* remove global PYTHONPATH setting in CI [pr]

* only run mypy in tinygrad/

* still needed for benchmarks
2025-01-11 12:47:50 -05:00
qazal
60503c8621 use CAPTURE_PROCESS_REPLAY=1 in CI [pr] (#8564) 2025-01-11 06:03:48 -05:00
nimlgen
aa3d612df2 add script to install amd mockgpu on macOS (#8536)
* upload artifact every time

* hm

* sh script

* hm

* hm2

* hm2

* hm2

* no sudo

* def paths

* small comments

* text

* try auth for bigger limits
2025-01-09 01:29:25 +03:00
patrini32
21c7d7c71a MOCKGPU amd test on OSX (#8505)
* add tests

* Refactor

* cache only amd/comgr/build (saves a lot of space)

* fix

* silence warning and add check for cache hit before installing cmake

* run only pytest

* use actions/cache

* lower timeout-minutes and add Device.DEFAULT step

* add nvidia to Device.DEFAULT check

* typo

* fix

* Check only for amd and run only 2 test
2025-01-08 14:27:56 +03:00
chenyu
85a4397f27 fix create_schedule_with_vars usage in allreduce benchmark [pr] (#8522)
* fix create_schedule_with_vars usage in allreduce benchmark [pr]

because i didn't know how to use it...

* increase time limit because tiny17 is slow
2025-01-07 01:30:01 -05:00
chenyu
0061dc7447 fix benchmark allreduce and add to ci [pr] (#8521) 2025-01-07 00:37:59 -05:00
nimlgen
9bc317d5d2 mockcuda (#8503)
* init mockcuda

* run gpu ocelot

* fix

* fixes

* disable broken tests

* linter

* these fails as well

* pylint

* mypy

* this fails on real platforms as well

* mypy please
2025-01-05 01:23:57 +03:00
uuuvn
5ffc50d58c Clang JIT (#8481)
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-03 11:12:55 -05:00
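
A hedged sketch of the general clang-JIT idea: compile a C kernel at runtime and call it from Python. tinygrad's version maps the compiled object code into memory directly; this sketch takes the simpler shared-library route to show the same end result, so treat it as the concept rather than the implementation:

```python
import ctypes, pathlib, subprocess, tempfile

src = "float add(float a, float b) { return a + b; }"
with tempfile.TemporaryDirectory() as d:
    so = pathlib.Path(d) / "kernel.so"
    # compile C source read from stdin straight to a shared object
    subprocess.run(["clang", "-shared", "-fPIC", "-O2", "-x", "c", "-", "-o", str(so)],
                   input=src.encode(), check=True)
    lib = ctypes.CDLL(str(so))
    lib.add.restype = ctypes.c_float
    lib.add.argtypes = [ctypes.c_float, ctypes.c_float]
    print(lib.add(2.0, 3.0))   # 5.0
```
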
George Hotz
803a47494e Revert "Clang JIT (#8312)" (#8452)
This reverts commit b6266c8e41.
2024-12-30 17:49:35 -05:00
uuuvn
b6266c8e41 Clang JIT (#8312)
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-12-30 17:37:53 -05:00
George Hotz
0addbad36d Happy New Year! Let's get AM merged 2024-12-30 13:15:10 -05:00
qazal
9defbc7d54 add symbolic_simple to the scheduler [pr] (#8419) 2024-12-26 20:05:08 +08:00
George Hotz
9f62c80f68 hotfix: this is a loan 2024-12-20 14:47:04 -08:00
qazal
d78e75f710 hotfix: use ubuntu-22.04 ci from 8249 (#8251) 2024-12-15 02:23:00 +02:00
George Hotz
8a50868264 touchup function.py [pr] (#8220)
* touchup function.py [pr]

* remove ALLOWED_READ_IMAGE

* eh, keep it, just change it
2024-12-13 13:07:00 -08:00
ignaciosica
0a00187dce add real AMX tests to benchmark (#8216)
* add real amx to benchmark

* add debug=2 to check tc are triggered
2024-12-13 14:03:41 -05:00
George Hotz
d9a0880d33 delete fuzz uops (not tested) [pr] (#8181) 2024-12-12 01:41:27 -08:00
chenyu
26e049ab40 add ALLOWED_READ_IMAGE=2131 to openpilot (#8166)
added as an exact-number check for now, since it's not clear whether more or fewer than the allowed count is any better
2024-12-11 12:14:17 -08:00
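
The pattern being pinned, as a hedged sketch with a hypothetical helper (not openpilot's CI code): an exact-count assertion flags drift in either direction, where an upper bound would let silent changes through.

```python
import os

def check_read_image_count(actual: int) -> None:
    # hypothetical helper: pin the exact count so a regression OR an unexpected
    # "improvement" both fail CI and get investigated
    allowed = int(os.getenv("ALLOWED_READ_IMAGE", "-1"))
    if allowed != -1 and actual != allowed:
        raise AssertionError(f"read_image count {actual} != ALLOWED_READ_IMAGE {allowed}")

check_read_image_count(2131)   # passes when ALLOWED_READ_IMAGE=2131 (or unset)
```
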
Ahmed Harmouche
a73e3677d0 Test linearizer on webgpu (#8159)
* Test linearizer on wgpu

* Skip tests due to exceeded dims
2024-12-11 17:03:26 +01:00
chenyu
d462f8ace0 use HALF in cifar wino benchmarks (#8153)
more representative as it hits tensor cores on tinyboxes
2024-12-10 20:21:00 -05:00
Ahmed Harmouche
a8cfdc70ed Run more webgpu tests (#8142) 2024-12-10 23:20:04 +01:00
Ahmed Harmouche
ed7318a3f5 Fix puppeteer install (#8148)
Clean npm cache before puppeteer install
2024-12-10 23:06:33 +01:00
Ahmed Harmouche
71dd222f66 Fix setitem on wgpu (#8144) 2024-12-10 19:34:25 +01:00
George Hotz
f83d715f41 move checks into compile3, delete compile2 [pr] (#8127)
* move checks into compile3 [pr]

* test_vs_onnx

* test v torch works

* float16 won't compile on compile3

* actually delete compile2
2024-12-09 14:21:42 -08:00
George Hotz
87c360c4b5 hotfix: add --size 8B to llama3 2024-12-09 07:53:20 -08:00
chenyu
e9692de42b don't FUZZ_ALL_ACTIONS in fuzz_linearizer.py (#8096)
mostly for speed, this is just making sure the script runs
2024-12-06 17:22:17 -05:00
Ahmed Harmouche
ce72fe1411 u32 to f16 in tinygrad (#8074)
* f16 decompression in tinygrad

* Typing and cleanup
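
The bit-level idea, sketched in numpy rather than tinygrad ops (the commit does the decode on-device; this only shows the layout): each u32 packs two IEEE half-precision values.

```python
import numpy as np

packed = np.array([0x3C003800], dtype=np.uint32)             # high half 0x3C00, low half 0x3800
lo = (packed & 0xFFFF).astype(np.uint16).view(np.float16)    # 0x3800 -> 0.5
hi = (packed >> 16).astype(np.uint16).view(np.float16)       # 0x3C00 -> 1.0
print(lo, hi)   # [0.5] [1.]
```
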
2024-12-06 12:00:13 +01:00