Elias Wahl
4a114756f6
New BERT dataloader ( #5881 )
...
* One file == One topic
* update test
* new dataloader
* update train script
* get index is faster
2024-08-02 15:12:23 -04:00
nimlgen
34168a64e3
optimize nv profiler ( #5856 )
...
* nv profiler fix
* cleanup hcq a bit
* fixes
* fix
* typo
* all signals put timestamp
* a bit cleaner
* merge fields
* type
* import
* tiny fix
2024-08-01 23:57:45 +03:00
Vyacheslav Pachkov
610e454132
fix opencl_ioctl on comma ( #5814 )
...
- remove unused code
- add CP_REG_TO_MEM opcode
- fixed parse_cmd_buf for more than 1 command object by correcting
an offset
- fixed memory mappings for cases when memory was allocated with
KGSL_MEMFLAGS_USE_CPU_MAP.
KGSL_MEMFLAGS_USE_CPU_MAP: If set on call and return, the returned GPU
address will be 0. Calling mmap() will set the GPU address.
So there are no IOCTL_KGSL_GPUOBJ_INFO ioctls for that type of memory
and it resulted to crash right after get_mem.
2024-07-30 20:44:06 -07:00
David Hou
9a485f36e4
shard kvcache ( #5830 )
2024-07-30 20:29:54 -07:00
George Hotz
4e89d45513
hotfix: put contiguous back in llama
2024-07-30 18:43:48 -07:00
George Hotz
21c5e8e1b7
extreme llama speed, 57.34 tok/s ( #5827 )
...
* extreme llama speed
* mergable
2024-07-30 18:32:09 -07:00
George Hotz
e6879035a0
work to make GEMV fast ( #5824 )
...
* work to make GEMV fast
* half8 cast
* align struct
* fix amd
* float8 is a later problem
2024-07-30 17:41:40 -07:00
Francis Lata
ce61be16f1
clean up how preprocessed folder is defined ( #5813 )
2024-07-30 12:35:26 -04:00
chenyu
471b188d79
fix mypy errors in latest mypy ( #5794 )
...
* fix mypy errors in latest mypy
mypy has stricter partial and api arg checks now
* PYTHONPATH="."
2024-07-29 14:53:30 -04:00
nimlgen
ea27ec4cd0
nv switch classlist_v2 to classlist ( #5763 )
...
* nv switch classlist_v2 to classlist
* support in mockgpu
* fix mockgpu
2024-07-28 20:24:42 +03:00
chenyu
3686b6726a
move GraphException to jit.py ( #5744 )
...
same place where GraphRunner is defined
2024-07-26 19:01:12 -04:00
George Hotz
489a5b99a5
hotfix: triton_nv_matmul touchups
2024-07-24 23:24:29 +00:00
George Hotz
bf24be4c8c
triton gets 163 TFLOPS on 4090
2024-07-24 18:32:29 +00:00
George Hotz
4d47968580
fix acc folding for NV tensor cores ( #5658 )
...
* fix acc folding for NV tensor cores
* fix correctness of reduce_before_expand
2024-07-23 13:03:02 -07:00
nimlgen
08a9c0ae5e
hcq cache invalidation for beam ( #5630 )
...
* nv full cache invalidation
* the same command on amd
* linter
* fix amd
* nv no hardcoded consts
* beam default
2024-07-22 18:13:17 +03:00
George Hotz
6c6d74d922
parallel mcts ( #5626 )
...
* start work on parallel mcts
* compile was linearizing twice
* typing + more early stopping
* fix compiler error
2024-07-21 14:53:23 -07:00
George Hotz
ef179087a4
mcts exit condition wasn't right, also use it with BEAM>=100 ( #5619 )
...
* mcts exit condition wasn't right, also use it with BEAM>=100
* mcts touchups
* clean up sample
2024-07-21 10:16:47 -07:00
George Hotz
0f67ef4674
mcts graph and dedup support ( #5618 )
...
* mcts graph and dedup support
* usable graph
* mcts colors
* C=4 seems better
* C=3 even better
* sample_tree
* backprop is external function
* late expand to match algo
2024-07-20 23:29:14 -07:00
chenyu
eddc5bcfd7
MCTS tweaks ( #5616 )
...
MCTS 500 is competitive with BEAM=8 on resnet on M1 Max.
- increment trial times even with compiled error and runtime error.
- use best time of children as the node value.
2024-07-20 19:45:59 -07:00
George Hotz
1113e47f96
print best in MCTS + light up the winner in hcopt
2024-07-20 09:39:36 -07:00
George Hotz
ac99ecd94e
use statistics.median for timing ( #5606 )
2024-07-20 08:37:32 -07:00
George Hotz
06e336bccb
mcts search ( #5598 )
...
* mcts search
* mcts cleanups
* mcts cleanup
* random shuffle children order
* mcts in handcode_opt
* src and remove_node
* debug 3 to print ast
* print the type
* mcts in extra
2024-07-19 21:38:39 -07:00
Tobias Fischer
72da3fe7e6
added clip vision model ( #5595 )
...
Co-authored-by: chenyu <chenyu@fastmail.com >
2024-07-19 18:35:51 -04:00
George Hotz
fa7e734b49
MetaOps.KERNEL ( #5543 )
2024-07-17 19:41:23 -07:00
Francis Lam
2d53abb04a
test/external/fuzz_linearizer: fix for new AST changes ( #5519 )
...
* test/external/fuzz_linearizer: fix for new AST changes
also add beautiful_mnist failures
* add CLANG and LLVM to test_failure_35 failed_platforms
* fix test_linearizer_failure names
2024-07-17 00:08:07 -04:00
Tobias Fischer
85d4ca7caa
FID Inception Model ( #5516 )
...
* added model impl
* minor cleanups
* extracted weights loading into from_pretrained
* reorganized model for better weight loading
* removed lru cache for state dict loading
2024-07-16 23:12:03 -04:00
chenyu
28972418c4
s/get_linearizer/get_kernel [run_process_replay] ( #5467 )
2024-07-13 20:32:22 -04:00
George Hotz
03c2dc8bd7
lowerer is kernel [run_process_replay] ( #5437 )
2024-07-12 18:50:55 -07:00
chenyu
00813a92a0
update Tensor.eye api to match torch ( #5433 )
...
* update Tensor.eye api to match torch
input is n for nrows and optional m for ncols
* space
* fix onnx
2024-07-12 20:25:12 -04:00
George Hotz
870dc8c350
s/Linearizer/Lowerer [run_process_replay] ( #5428 )
2024-07-12 15:54:07 -07:00
George Hotz
6707c778d0
scheduleitem is not Tuple [run_process_replay] ( #5425 )
...
* scheduleitem is not Tuple [run_process_replay]
* fix tests
* fix op + fuzzers
* fix mop test
2024-07-12 15:13:19 -07:00
George Hotz
94599c0637
fixup ast in kernel to be MetaOps.SINK [run_process_replay] ( #5424 )
...
* fixup ast in kernel to be MetaOps.SINK [run_process_replay]
* fix tests
* fix more tests
2024-07-12 14:01:03 -07:00
uuuvn
3cb94a0a15
Rename tinygrad/runtime/driver to support ( #5413 )
2024-07-12 11:06:42 -07:00
wozeparrot
a02b38c0ac
download openimages by running it ( #5396 )
2024-07-11 16:06:13 -07:00
wozeparrot
fa873df9c1
bring tinychat more inline with tinyos' version ( #5358 )
2024-07-10 13:13:52 -07:00
George Hotz
c13da83f12
tests from lowerer branch ( #5339 )
...
* tests from lowerer branch
* Update test_image_dtype.py
* Update test_image_dtype.py
* Update test_image_dtype.py
2024-07-08 21:23:19 -07:00
nimlgen
51d6f372e4
nv get classes based on device ( #5325 )
...
* nv get classes
* support in mockgpu
* choose sm based on gpu
* fix
* fix
* fix arch
2024-07-08 18:25:05 +03:00
Tobias Fischer
0c3a35e5c2
Stable Diffusion v2 Inference ( #5283 )
...
* model implementation
* clip fix, more qol options
2024-07-03 22:47:10 -04:00
chenyu
b2c3a28a5e
nn.RMSNorm ( #5272 )
...
the norm itself has no significant value to add to Tensor method, but we would want Tensor.normalize
2024-07-02 21:39:01 -04:00
Tobias Fischer
8c9c1cf62f
Pulled CLIP and UNet into Seperate Files ( #5253 )
...
* pulled clip and unet into seperate files
* reference cleanup, lru cache fix
* better pool indexing
2024-07-01 22:33:01 -04:00
nimlgen
57e89645cd
hcq spec test ( #5226 )
...
* start hcq spec test
* more test
* fixes
* run on amd as well
* test amdgpu exec
* fix amd
* amd mockgpu support sdma timestamp
2024-07-01 17:36:37 +03:00
George Hotz
14980f79dd
hotfix: unbreak llama
2024-06-30 15:27:54 -07:00
George Hotz
3df47bc21e
OpenELM + repeat_interleave ( #5234 )
...
* start writing openelm
* progress...hit bug
* repeat_interleave support
* gqa
* add rotary embedding
* spp
* i think it runs correctly
* broken
* output is good now
* cleanups
* no io_uring on android
2024-06-30 15:18:39 -07:00
nimlgen
dd7eef7d71
libc defs to autogen ( #5217 )
...
* libc defs to autogen
* amd import libc
* linter
* better a bit
* remove comment, check this
* not hardcoded path
2024-06-29 14:37:33 +03:00
qazal
3e56c8422c
remu err handling ( #5208 )
...
* add error handling
* use pre release
* minor
* works
2024-06-28 13:15:18 +03:00
reddyn12
f1c7944c44
Fix batchnorm shapes for resnet.load_pretrained ( #5167 )
...
* Fix batchnorm shapes
* make it general reshape
2024-06-26 18:44:10 -04:00
nimlgen
69f116a7e1
nv/amd profiler ( #4718 )
...
* nv/amd profiler
* fix
* fix
* profile copies
* profile logger
* fixes
* more fixes
* less lines and fixes
* fixes
* some linter
* back sync, no related change
* fix gpu2cpu time def
* simpler
* linter
* linter
* docs
* add add_event api
2024-06-23 17:10:12 +03:00
chenyu
e356807696
tinytqdm.set_description and tinytrange ( #5101 )
2024-06-22 14:45:06 -04:00
chenyu
8080298739
s/tinytqdm/tqdm ( #5103 )
...
except in unit test where tqdm is imported
2024-06-22 14:18:26 -04:00
chenyu
e468601226
update llama attention casting ( #5096 )
...
* update llama attention casting
updated scaled_dot_product_attention middle cast and removed hard-coded half in llama attention.
* fix that
2024-06-22 10:57:17 -04:00