uuuvn
a51c688f39
Cleanup llvm cleanup (and some clang things too) (#8871)
* Cleanup llvm cleanup (and some clang things too)
* Tests
* Tests 2
* forgot mockgpu
* more print some sources
2025-02-05 07:49:05 +08:00
George Hotz
56fa5c1191
dsp simulator (#8869)
* dsp simulator
* progress
* fix
* close on test tiny
* working
* less waste
* line savings
* Device DSP compiler
* mock DSP at the bottom
* DSP tests
* docker caching
* test update
* need load
* skip that test for CI DSP
* last touch
* ugh
2025-02-04 09:45:04 +08:00
chenyu
836cf42c2e
fix rand_like for multi (#8880)
2025-02-03 19:00:14 -05:00
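For context, a usage sketch of what this fix covers: `rand_like` on a sharded tensor should come back sharded the same way (device names below are placeholders, not from the PR).
```python
from tinygrad import Tensor

# shard across two devices on axis 0; rand_like should now match
# the input's devices and shard axis instead of erroring
t = Tensor.ones(8).shard(("CPU:0", "CPU:1"), axis=0)
r = t.rand_like()
```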
uuuvn
6dadb60c93
LLVM JIT (+autogen llvm instead of llvmlite) (#8486)
* LLVM JIT
* Autogen LLVM
* Update autogen
* Move things around
* even more non-determinism
* windows
* more autogen weirdness
* more windows stuff
* blind windows development try 2
* more blind windows development
* even more blind windows development
* maybe i should just set up a windows vm...
* why can't everyone just use sysv abi?
* cleanup debugging stuff
* unused import
* icache flushing isn't required on x86
* merge jit_nt and jit_unix
* more
* Temporary hack to not segfault
* better error
* bad conflict resolution
* Attempt to simplify support/llvm.py
* More refactoring
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-02-02 19:52:42 +08:00
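To illustrate what an in-process LLVM JIT flow looks like, here is a minimal sketch using llvmlite (the library this PR replaces with autogenerated bindings): parse IR, MCJIT-compile it in memory, and call the result through ctypes.
```python
import ctypes
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

ir = """
define i32 @add(i32 %a, i32 %b) {
  %s = add i32 %a, %b
  ret i32 %s
}
"""
mod = llvm.parse_assembly(ir)
tm = llvm.Target.from_default_triple().create_target_machine()
engine = llvm.create_mcjit_compiler(mod, tm)
engine.finalize_object()                       # emit machine code in-process
fptr = engine.get_function_address("add")
add = ctypes.CFUNCTYPE(ctypes.c_int32, ctypes.c_int32, ctypes.c_int32)(fptr)
print(add(2, 3))                               # 5
```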
chenyu
7f606fbde4
remove DEBUG=5 in windows ci test [pr] (#8803)
DEBUG=5 prints a lot of output, which is slow, and none of it is visible on CI when the test passes anyway.
also skip two tests that took 3 minutes on the python backend
2025-01-29 14:18:17 -05:00
FICTURE7
ec120ce6b9
Fix allocator memory alignment (#8800)
* Fix allocator memory alignment
* Run `test_ops.py` using LLVM and CLANG on Windows
2025-01-29 21:03:17 +03:00
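A toy sketch of the usual over-allocate-and-round-up trick behind a fix like this (not the PR's actual code):
```python
import ctypes

def aligned_alloc(size: int, align: int = 4096):
  raw = (ctypes.c_uint8 * (size + align))()   # over-allocate; keep a ref alive
  base = ctypes.addressof(raw)
  start = (base + align - 1) & ~(align - 1)   # round up to the boundary
  return raw, start                           # (owner, aligned address)

owner, addr = aligned_alloc(1024, align=64)
assert addr % 64 == 0
```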
b1tg
da464d039f
fix windows ci cache (#8787)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-01-28 13:22:15 +02:00
b1tg
5d62aa28dc
Support CLANG backend on Windows (#8768)
* Support CLANG on Windows
* Put both backends in a windows ci
* remove coff loader
* use memmove
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-28 18:19:34 +09:00
b1tg
efc7971090
add windows test to ci (#8761)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-01-27 14:53:21 +09:00
George Hotz
1b4618e257
gradient cleanup (#8750)
* switch backward to use gradient [pr]
* set device correctly, dedup
* why does that fail?
* add noop cast
* simple backward
* fix beautiful_mnist
* touchups
* set in compute_gradient
* uop_count
* uop_count was wrong
* collections
* no note
* skip that test
* update sched kernel counts
* train mnist is 65
* fix metadata and gc
* fixes
* materialize_grads
* no pathlib stuff
* add contiguous_backward, fix bugs
* add some realize
* fix multi
* remove unused backward passes [pr]
* lower line count
2025-01-26 09:30:55 +09:00
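A small usage sketch of the `Tensor.gradient` API that backward is rebuilt on here; it returns gradients directly instead of populating `.grad`:
```python
from tinygrad import Tensor

x = Tensor([2.0, 3.0])
loss = (x * x).sum()
(gx,) = loss.gradient(x)   # d(loss)/dx = 2x
print(gx.tolist())         # [4.0, 6.0]
```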
chenyu
0c759e1ff6
add bert to benchmark ci (#8741)
with `DISABLE_DROPOUT=1 BERT_LAYERS=2` for now
2025-01-24 14:45:11 -05:00
George Hotz
e82ba1454b
MultiLazyBuffer is UOp [pr] (#8662)
* MultiLazyBuffer is UOp [pr]
* this is new mlb
* this is the idea
* progress
* multitensor works
* more movement ops
* this
* MultiLazyBuffer is UOp
* cleanups
* multi axis
* fix more tests
* work
* not that
* add multi grad and move shard to ops
* mops not views
* no double contig
* sweet, all mt tests passing
* port old logic
* remove lbs
* fix realized
* whitespace
* assign tweak
* test_assign_kv_cache_multi passes
* fix is_realized
* fix JIT for multi
* just a few more lines i'll pay them back soon i swear please bro just a few more
* no split reduceop for multi
2025-01-24 13:28:55 +09:00
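For reference, multi-device tensors enter through `Tensor.shard`; a minimal usage sketch (device names are placeholders):
```python
from tinygrad import Tensor

# split axis 0 across two devices; with this PR the multi tensor is
# represented as plain UOps rather than a separate MultiLazyBuffer class
t = Tensor.arange(8).shard(("CPU:0", "CPU:1"), axis=0)
print(t.device)   # ('CPU:0', 'CPU:1')
```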
George Hotz
46a8c5e1e5
delete forced_realize (#8615)
* delete forced_realize
* put that back
* expectedFailures
* cleaner create_subbuffer
* more comments
---------
Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-01-20 09:40:36 -08:00
nimlgen
9d3c40601f
am: fast memory manager (#8654)
* start
* progress
* fixes
* smth
* mini fixes
* fix2
* ugh, need this for now
* faster
* cleanups
* tiny linters
* make mypy happier
* test & free pts
* ops
* linter
* cleanup vm
* fix
* remove map_from
* tiny fixes
* add test to ci
2025-01-20 16:58:22 +03:00
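As a rough illustration of what a VRAM manager like this does, a toy first-fit free-list allocator (structure hypothetical, not the am driver's code):
```python
class FreeList:
  def __init__(self, size: int):
    self.free = [(0, size)]                  # list of (offset, length) holes
  def alloc(self, size: int) -> int:
    for i, (off, length) in enumerate(self.free):
      if length >= size:                     # first fit
        self.free[i:i+1] = [(off + size, length - size)] if length > size else []
        return off
    raise MemoryError("out of space")
  def release(self, off: int, size: int):
    self.free.append((off, size))            # toy version: no coalescing
```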
ignaciosica
d2234e308a
tf32 tc for nv and ptx (#8635)
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-17 17:43:57 -08:00
nimlgen
f671da6755
ci: add AM start time to benchmark (#8637)
* ci: add AM start time to benchmark
* am: unlock it
* add AMD
* revert this
2025-01-16 14:47:36 +03:00
chenyu
4ee3243c93
JITBEAM=2 for LLaMA-3 8B on 4 GPUs [pr] (#8623)
is it fast?
2025-01-14 19:52:38 -05:00
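A usage sketch of the knob: JITBEAM applies BEAM search only to kernels captured under TinyJit; setting it in the environment before tinygrad reads it is the safe way to turn it on from Python.
```python
import os
os.environ["JITBEAM"] = "2"   # BEAM search, jitted kernels only

from tinygrad import Tensor, TinyJit

@TinyJit
def step(x: Tensor) -> Tensor:
  return (x * x).sum().realize()

for _ in range(3):
  step(Tensor.ones(16))       # captured kernels get BEAM'd
```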
George Hotz
bfbe81df71
remove cast before view (#8613)
* remove cast before view
* greener
* indexing
* that passes too
* openpilot too
* ack
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2025-01-14 15:04:58 -05:00
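Part of what makes a rewrite like this safe is that cast and reshape commute elementwise; a small check of that equivalence:
```python
from tinygrad import Tensor, dtypes

t = Tensor.arange(6, dtype=dtypes.int32)
a = t.cast(dtypes.float32).reshape(2, 3)   # cast before view
b = t.reshape(2, 3).cast(dtypes.float32)   # view before cast
assert a.tolist() == b.tolist()            # same values either way
```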
chenyu
393eec3201
raise RuntimeError for uneven shard [pr] (#8593)
no 7B llama on 6 GPUs
skip 70B
2025-01-14 14:51:48 -05:00
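A sketch of the guard this adds (hypothetical helper, not the PR's exact code): the sharded axis must divide evenly by the device count, hence no 7B llama on 6 GPUs.
```python
def check_even_shard(dim_size: int, n_devices: int) -> None:
  if dim_size % n_devices != 0:
    raise RuntimeError(f"axis of size {dim_size} can't be evenly sharded across {n_devices} devices")
```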
ignaciosica
d5a646d492
CUDA Turing TC (#8597)
* init turing tc
* reorder tc
* hotfix: remove some spaces
* revert var name to x
* consistent order of factors
* revert order of terms to match old stuff
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-14 10:35:14 -08:00
nimlgen
1ff6862a3d
ci: sleep a bit to let the driver unload the prev pid (#8605)
2025-01-14 15:55:23 +03:00
nimlgen
74b83c4c41
am in ci (#8532)
* try am in ci
* no sudo
* temp
* run more am test
* run half on am
* insert amdgpu
* other machine as well
2025-01-13 19:55:17 +03:00
qazal
2f71a00236
remove PYTHONPATH=. from mypy ci [pr] (#8578)
2025-01-12 09:52:03 -08:00
qazal
98c9e23560
remove global PYTHONPATH setting in CI (test.yml) [pr] (#8568)
* remove global PYTHONPATH setting in CI [pr]
* only run mypy in tinygrad/
* still needed for benchmarks
2025-01-11 12:47:50 -05:00
qazal
60503c8621
use CAPTURE_PROCESS_REPLAY=1 in CI [pr] (#8564)
2025-01-11 06:03:48 -05:00
nimlgen
aa3d612df2
add script to install amd mockgpu on macOS (#8536)
* upload artifact every time
* hm
* sh script
* hm
* hm2
* hm2
* hm2
* no sudo
* def paths
* small comments
* text
* try auth for bigger limits
2025-01-09 01:29:25 +03:00
patrini32
21c7d7c71a
MOCKGPU amd test on OSX (#8505)
* add tests
* Refactor
* cache only amd/comgr/build (saves a lot of space)
* fix
* silence warning and add check for cache hit before installing cmake
* run only pytest
* use actions/cache
* lower timeout-minutes and add Device.DEFAULT step
* add nvidia to Device.DEFAULT check
* typo
* fix
* Check only for amd and run only 2 test
2025-01-08 14:27:56 +03:00
chenyu
85a4397f27
fix create_schedule_with_vars usage in allreduce benchmark [pr] (#8522)
* fix create_schedule_with_vars usage in allreduce benchmark [pr]
because i didn't know how to use it...
* increase time limit because tiny17 is slow
2025-01-07 01:30:01 -05:00
chenyu
0061dc7447
fix benchmark allreduce and add to ci [pr] (#8521)
2025-01-07 00:37:59 -05:00
nimlgen
9bc317d5d2
mockcuda (#8503)
* init mockcuda
* run gpu ocelot
* fix
* sfixes
* disable broken tests
* linter
* these fails as well
* pylint
* mypy
* this fails on real platforms as well
* mypy please
2025-01-05 01:23:57 +03:00
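The mock-driver idea in one sketch (names hypothetical, not this PR's code): hand driver-level code a fake libcuda whose entry points succeed without hardware, so the full code path runs in CI.
```python
import ctypes

class MockCUDA:
  def __getattr__(self, name):
    def stub(*args):          # every cuXxx entry point "succeeds"
      return 0                # CUDA_SUCCESS
    return stub

def load_cuda(mock: bool = False):
  return MockCUDA() if mock else ctypes.CDLL("libcuda.so.1")
```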
uuuvn
5ffc50d58c
Clang JIT (#8481)
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-03 11:12:55 -05:00
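For contrast, a sketch of the pre-JIT CLANG flow (assumes clang on PATH): shell out to clang, write a shared object to disk, and dlopen it with ctypes. The JIT path instead maps the compiled code straight into memory, skipping the disk round trip.
```python
import ctypes, pathlib, subprocess, tempfile

src = "float add(float a, float b) { return a + b; }"
with tempfile.TemporaryDirectory() as d:
  so = pathlib.Path(d) / "kernel.so"
  subprocess.run(["clang", "-shared", "-fPIC", "-O2", "-x", "c", "-", "-o", str(so)],
                 input=src.encode(), check=True)   # compile from stdin
  lib = ctypes.CDLL(str(so))                       # dlopen the artifact
  lib.add.restype = ctypes.c_float
  lib.add.argtypes = [ctypes.c_float, ctypes.c_float]
  print(lib.add(1.5, 2.0))                         # 3.5
```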
George Hotz
803a47494e
Revert "Clang JIT ( #8312 )" ( #8452 )
...
This reverts commit b6266c8e41 .
2024-12-30 17:49:35 -05:00
uuuvn
b6266c8e41
Clang JIT (#8312)
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-12-30 17:37:53 -05:00
George Hotz
0addbad36d
Happy New Year! Let's get AM merged
2024-12-30 13:15:10 -05:00
qazal
9defbc7d54
add symbolic_simple to the scheduler [pr] (#8419)
2024-12-26 20:05:08 +08:00
George Hotz
9f62c80f68
hotfix: this is a loan
2024-12-20 14:47:04 -08:00
qazal
d78e75f710
hotfix: use ubuntu-22.04 ci from 8249 (#8251)
2024-12-15 02:23:00 +02:00
George Hotz
8a50868264
touchup function.py [pr] (#8220)
* touchup function.py [pr]
* remove ALLOWED_READ_IMAGE
* eh, keep it, just change it
2024-12-13 13:07:00 -08:00
ignaciosica
0a00187dce
add real AMX tests to benchmark (#8216)
* add real amx to benchmark
* add debug=2 to check tc are triggered
2024-12-13 14:03:41 -05:00
George Hotz
d9a0880d33
delete fuzz uops (not tested) [pr] (#8181)
2024-12-12 01:41:27 -08:00
chenyu
26e049ab40
add ALLOWED_READ_IMAGE=2131 to openpilot (#8166)
added as an exact-count check for now, since it's not clear whether more or fewer than the allowed number is any better
2024-12-11 12:14:17 -08:00
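A sketch of this style of exact-count regression check (helper hypothetical): pin the number so drift in either direction gets flagged.
```python
def check_read_image_count(kernel_sources: list[str], allowed: int = 2131) -> None:
  n = sum(src.count("read_image") for src in kernel_sources)
  assert n == allowed, f"expected exactly {allowed} read_image calls, got {n}"
```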
Ahmed Harmouche
a73e3677d0
Test linearizer on webgpu (#8159)
* Test linearizer on wgpu
* Skip tests due to exceeded dims
2024-12-11 17:03:26 +01:00
chenyu
d462f8ace0
use HALF in cifar wino benchmarks (#8153)
more representative as it hits tensor cores on tinyboxes
2024-12-10 20:21:00 -05:00
Ahmed Harmouche
a8cfdc70ed
Run more webgpu tests (#8142)
2024-12-10 23:20:04 +01:00
Ahmed Harmouche
ed7318a3f5
Fix puppeteer install (#8148)
Clean npm cache before puppeteer install
2024-12-10 23:06:33 +01:00
Ahmed Harmouche
71dd222f66
Fix setitem on wgpu (#8144)
2024-12-10 19:34:25 +01:00
George Hotz
f83d715f41
move checks into compile3, delete compile2 [pr] (#8127)
* move checks into compile3 [pr]
* test_vs_onnx
* test v torch works
* float16 won't compile on compile3
* actually delete compile2
2024-12-09 14:21:42 -08:00
George Hotz
87c360c4b5
hotfix: add --size 8B to llama3
2024-12-09 07:53:20 -08:00
chenyu
e9692de42b
don't FUZZ_ALL_ACTIONS in fuzz_linearizer.py (#8096)
mostly for speed, this is just making sure the script runs
2024-12-06 17:22:17 -05:00
Ahmed Harmouche
ce72fe1411
u32 to f16 in tinygrad (#8074)
* f16 decompression in tinygrad
* Typing and cleanup
2024-12-06 12:00:13 +01:00
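A host-side sketch of the unpacking this PR does on-device with tinygrad ops: each u32 carries two packed f16 values, recoverable with a bitcast-style view (numpy used for clarity; little-endian layout assumed).
```python
import numpy as np

packed = np.array([0x3C003800], dtype=np.uint32)  # 0.5 in the low half, 1.0 in the high half
halves = packed.view(np.float16)                  # reinterpret bits, no conversion
print(halves)                                     # [0.5 1. ]
```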