Commit Graph

10417 Commits

Author · SHA1 · Message · Date
George Hotz
a6b9733256 GB/s can be higher 2023-04-18 17:51:03 -07:00
George Hotz
9fb3f9ace3 Revert "move t.grad realize on SGD"
This reverts commit ccdc0290d6.
2023-04-18 17:50:08 -07:00
George Hotz
e93e04ed6e Revert "huh...this is faster"
This reverts commit aedd4685fa.
2023-04-18 17:50:07 -07:00
George Hotz
aedd4685fa huh...this is faster 2023-04-18 17:36:31 -07:00
George Hotz
dbc99c243b why did that test break? 2023-04-18 17:08:38 -07:00
George Hotz
ccdc0290d6 move t.grad realize on SGD 2023-04-18 16:47:51 -07:00
George Hotz
8b7ecd63bb Remove Zeroview (#748)
* no zeroview start

* closer

* stride mask

* st tests pass, delete ZeroView

* byebye zv

* close to working

* not contiguous with mask

* subtract, don't add

* mask on view

* ugh, that shouldn't have been in there

* shape merge

* bugfixes

* fuzzer + 4 fuzzer failures

* fuzzer for symbolic

* more fuzzing and nothing

* that fuzzer doesn't hit either

* fixes padding...ugh

* no more offsets

* working

* rewrite load and store

* all checks

* fix idxs

* progress

* bugfix

* float4_axis

* works

* cleanups

* complex valids_okay
2023-04-17 08:21:46 -07:00
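
A rough sketch of the idea behind this PR, written outside tinygrad's actual ShapeTracker/View classes (names and structure here are illustrative assumptions): instead of a separate ZeroView object for padding, each view carries a per-dimension (lo, hi) validity mask, and out-of-mask reads return the pad value.

```python
from typing import List, Tuple

def is_valid(idx: List[int], mask: List[Tuple[int, int]]) -> bool:
  # an index maps to real data only if every coordinate lies inside its mask range
  return all(lo <= i < hi for i, (lo, hi) in zip(idx, mask))

def masked_load(data: List[float], idx: List[int], strides: List[int],
                mask: List[Tuple[int, int]]) -> float:
  # out-of-mask coordinates read as the pad value without touching memory
  if not is_valid(idx, mask): return 0.0
  # inside the mask, shift each coordinate back by its pad amount ("subtract, don't add")
  return data[sum((i - lo) * s for i, (lo, _), s in zip(idx, mask, strides))]
```
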
Jan Henrik Høiland
4e17d27d09 Fix cuda errors when running llama example (#749) 2023-04-16 13:52:10 -07:00
George Hotz
0b5a0b9ba4 winograd comment 2023-04-16 03:36:51 -07:00
George Hotz
8b777af571 metal_conv gets over 10.4 TFLOPS... 2023-04-15 03:31:22 -07:00
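
For context on headline numbers like the 10.4 TFLOPS above: convolution FLOPS are usually counted as two operations (a multiply and an add) per multiply-accumulate. A generic counting helper, not the specific shapes the metal_conv benchmark runs:

```python
# each output element costs cin*kh*kw multiply-accumulates, counted as 2 FLOPs each
def conv2d_flops(n: int, cin: int, cout: int, h_out: int, w_out: int, kh: int, kw: int) -> int:
  return 2 * n * cout * h_out * w_out * cin * kh * kw

# example: a batch-1, 64->64 channel, 3x3 conv over a 256x256 output
print(conv2d_flops(1, 64, 64, 256, 256, 3, 3) / 1e9, "GFLOPs per pass")
```
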
George Hotz
d66e682205 metal matmul from tcores branch 2023-04-14 23:29:29 -07:00
George Hotz
732884653c osx in hlb_cifar10_torch 2023-04-14 13:12:08 -07:00
George Hotz
17e37157b6 fix backward convs (#746)
* fix backward convs

* no pushing in reduce

* late cout

* test_fold_4convs_sgd
2023-04-14 10:42:11 -07:00
George Hotz
f7f416d6f4 back to 6 for test_fold_conv_sgd 2023-04-14 07:34:00 -07:00
George Hotz
133521e730 relu UnaryOp is back 2023-04-14 07:12:53 -07:00
George Hotz
584ee6f616 don't graph consts 2023-04-14 03:32:20 -07:00
George Hotz
9a39ebefde hlb_cifar10_torch gets 80% 2023-04-14 02:47:03 -07:00
worldwalker2000
552a048a33 make maximum split the grad like torch when equal (#738)
* make maximum split grad

* added test for maximum split grad when equal

* minor expr simplification

* (2-eq)/2 only once

* update test bc one more sum output child stays
2023-04-14 00:17:46 -07:00
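
The behavior this PR adopts matches torch: where the two inputs tie, the incoming gradient is split evenly between them; otherwise the larger input gets all of it. A standalone numpy sketch of that rule (not the tinygrad mlops code itself):

```python
import numpy as np

def maximum_backward(a: np.ndarray, b: np.ndarray, grad_out: np.ndarray):
  eq = (a == b).astype(grad_out.dtype)
  # strictly larger input takes the whole gradient, ties split it 50/50
  grad_a = grad_out * ((a > b).astype(grad_out.dtype) + 0.5 * eq)
  grad_b = grad_out * ((b > a).astype(grad_out.dtype) + 0.5 * eq)
  return grad_a, grad_b
```
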
Jacky Lee
06ed958abd Fix train_resnet example (#744)
* Fix ResNet example

* Scientific notation
2023-04-12 13:48:39 +05:30
Sohaib
70b9072663 add Pad onnx operator and rework _padding (#740) 2023-04-06 17:07:36 +05:30
jintuzhang
8e40ff8c8d Do not specify errors when trying to load devices. (#741) 2023-04-06 17:05:36 +05:30
Jacky Lee
7a45b989a1 Device: make GPU default and METAL/CUDA if possible (#732)
* Make GPU the default device

* Compile EfficientNet with CPU

* don't print device

* use METAL and CUDA if possible

* Revert some changes to workflow

* Fix import error when checking device availability

* device lookup is now optional

* hopefully fix linter and tests

* fix workflow

* Skip device if not available

* don't change default if CPU=1

* simplify device selection

* Default to CPU if no GPU

* don't print device name...

* No need to change default in llama

* run github workflow

* Fix logic to select default

* pass if an error occurs

* use separate function for try except
2023-04-04 09:41:52 +05:30
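
A rough sketch of the selection policy described in the bullets above (probe each accelerator, skip on any error, fall back to CPU, and never override an explicit CPU=1). The module names probed here are illustrative assumptions, not tinygrad's actual backend internals:

```python
import importlib, os

def _importable(module: str) -> bool:
  try:
    importlib.import_module(module)
    return True
  except Exception:          # any import or driver error just means "not available"
    return False

def pick_default_device() -> str:
  if os.getenv("CPU") == "1": return "CPU"        # explicit CPU=1 override wins
  for device, module in [("METAL", "Metal"), ("CUDA", "pycuda"), ("GPU", "pyopencl")]:
    if _importable(module): return device
  return "CPU"                                    # default to CPU if no GPU is found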
George Hotz
94e2c49c35 test_cacheline_size that works in both places 2023-03-30 06:47:20 +04:00
George Hotz
b05c2828f7 better cacheline test 2023-03-30 06:08:54 +04:00
George Hotz
76db1af6fc better archprobe 2023-03-30 05:52:00 +04:00
George Hotz
1240c12ac5 download cifar to datasets dir 2023-03-29 12:25:42 +04:00
Jacky Lee
e5f430d8c6 Device: move below LazyBuffer (#733) 2023-03-29 10:35:11 +04:00
George Hotz
b99798f08e acc function not needed 2023-03-29 08:03:46 +04:00
George Hotz
20894991ed good changes from the M1 Tensor Core project (#730)
* good changes

* working except llvm

* llvm types

* nice acc

* archprobe

* lang.float4

* use self.acc for late acc

* fix store bug
2023-03-29 05:11:02 +04:00
Jacky Lee
156640e90d Permute examples (#731)
* examples: use permute instead of transpose

* Use transpose but change args
2023-03-29 05:07:06 +04:00
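
For readers of the examples, the two spellings are equivalent: transpose swaps one pair of axes, permute states the full axis order. A quick check, assuming the tinygrad Tensor API of this era:

```python
from tinygrad.tensor import Tensor

x = Tensor.randn(2, 3, 4)
# swapping axes 1 and 2 is the same reordering as the explicit permutation (0, 2, 1)
assert x.transpose(1, 2).shape == x.permute(0, 2, 1).shape == (2, 4, 3)
```
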
Andre Slavescu
39d6e1525f Added activation ops + tests (#729)
* activation ops

* type hints + more testing

* formatting correction + parameter testing

* fixes to shape testing

* hardtanh to use clip + removed type hints

* assign val fix
2023-03-28 13:17:53 +04:00
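
The "hardtanh to use clip" bullet comes down to a one-liner; a standalone sketch assuming tinygrad's Tensor.clip, with a signature that is illustrative rather than what the PR actually added:

```python
from tinygrad.tensor import Tensor

# hardtanh is just a clip to [min_val, max_val]
def hardtanh(x: Tensor, min_val: float = -1.0, max_val: float = 1.0) -> Tensor:
  return x.clip(min_val, max_val)
```
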
George Hotz
fa5516dda0 fix lint, installed pre-commit on new computer 2023-03-24 11:15:59 -07:00
George Hotz
ebc4ad6223 color the jit nicer 2023-03-24 10:54:20 -07:00
George Hotz
23f88fb026 synchronize for honest speed compare 2023-03-24 10:24:27 -07:00
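
The generic pattern behind "synchronize for honest speed compare": without a device sync you only time kernel launches, not the work itself. `run` and `sync` here are placeholders for whatever backend is being measured:

```python
import time

def timed(run, sync):
  sync()                                  # drain anything already queued
  start = time.perf_counter()
  out = run()
  sync()                                  # wait for this run's kernels to actually finish
  return out, time.perf_counter() - start
```
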
George Hotz
1cb5b2d015 test_enet_se 2023-03-24 10:04:30 -07:00
Jacky Lee
fafe8e9ce2 casting: support all backends and implement half (#726)
* casting: support all backends and implement half

* map torch types in ops_torch

* reuse type map for torch buffer

* inverse dict lookup
2023-03-24 09:58:03 -07:00
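
A sketch of the "map torch types in ops_torch" / "inverse dict lookup" bullets: one forward table from numpy dtypes to torch dtypes, with the reverse map derived from it rather than maintained by hand. The entries listed are illustrative, not the full table:

```python
import numpy as np
import torch

TYPE_MAP = {np.float32: torch.float32, np.float16: torch.float16, np.int32: torch.int32}
INVERSE_TYPE_MAP = {v: k for k, v in TYPE_MAP.items()}

assert INVERSE_TYPE_MAP[torch.float16] is np.float16
```
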
George Hotz
e88b9bfe1e print gflops avg with DEBUG=2 2023-03-23 16:07:08 -07:00
George Hotz
de04208247 hotcast bug fix 2023-03-23 11:49:47 -07:00
Jacky Lee
e009b6f341 Add tests for casting (#724)
* Add tests for casting

* Skip half_matmul_upcast when TORCH=1

* Fix promotion on torch

* Fix spacing
2023-03-23 08:02:52 -07:00
George Hotz
68e45fca18 metal_matmul: bw and torch sync 2023-03-23 08:02:04 -07:00
George Hotz
bd6c3c31a9 compare to torch 2023-03-22 23:58:37 -07:00
George Hotz
c3a3db75c7 fix metal matmul example 2023-03-22 23:42:51 -07:00
George Hotz
f5aea472a3 latest torch and onnx should be fine 2023-03-22 23:33:50 -07:00
George Hotz
51e19ac25c OPTLOCAL=2 makes stable diffusion a usable speed after the cache builds 2023-03-22 19:19:11 -07:00
George Hotz
2e18469fd4 clean up display name 2023-03-22 18:32:05 -07:00
George Hotz
b12b60af20 fix binop, other tests failure (#723)
* fix binop, other tests failure

* that was a bad idea

* better layernorm

* inference kernel count tests

* new style reshape pushing

* fixup replacement

* 199 kernels is okay. fix flops

* push reshape through unaryops only

* GRAPH=2 draws the phantom ops

* found resnet issue

* non working test

* mul is cheaper than div

* OPT inflation

* SHUFFLE_PAD_OPS in OPT=2
2023-03-22 18:15:07 -07:00
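
The "mul is cheaper than div" bullet refers to a standard lowering: division by a constant can be emitted as a multiply by its reciprocal, which is generally cheaper in generated kernels. A trivial standalone illustration; the reciprocal can be folded once at codegen time:

```python
RECIP_255 = 1.0 / 255.0

def normalize(x: float) -> float:
  return x * RECIP_255      # instead of x / 255.0
```
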
George Hotz
d6f4219952 LayerNorm2d for 2 lines 2023-03-20 16:58:43 -07:00
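
A hedged sketch of how a two-line LayerNorm2d can wrap the existing LayerNorm, assuming tinygrad's nn.LayerNorm and not claimed to be the exact code this commit added: move channels last so the norm applies over them, then move them back.

```python
from tinygrad.nn import LayerNorm

class LayerNorm2d(LayerNorm):
  # NCHW -> NHWC, normalize over channels, then back to NCHW
  def __call__(self, x): return super().__call__(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
```
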
George Hotz
128ca160ac lazy: remove required device 2023-03-20 16:31:45 -07:00
George Hotz
120d7072bd indexing merge almost works 2023-03-20 16:17:07 -07:00
George Hotz
06abbbfe7c remove the stupid register class (#721)
* remove the stupid register class

* touchups

* colorful display name
2023-03-20 15:45:12 -07:00