exceptions can be raised by either model conversion or an individual backend failure. openpilot works with torch mps, but not with torch cpu.
separate the exception blocks so that the benchmark can include torch mps for openpilot.
* fix no grad fn for < and ==
* remove 2 line breaks
* Remove deprecated autograd variable
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* remove np from DType
* convert to dataclass
* remove dunder hash, eq, ne overrides from ImageDType
* is dataclass required for PtrDType?
* fix GPU tests
* reduce lines
* revert changes to np
* minor cleanup
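
A rough illustration of why the hand-written dunder overrides can go away once the type is a dataclass: a frozen dataclass generates `__eq__` and `__hash__` from its fields, and `!=` falls back to the negation of `__eq__`. The field names below are assumptions for the sake of the example and are not taken from the real `DType` or `ImageDType`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DType:
    # illustrative fields only, not the real definition
    priority: int
    itemsize: int
    name: str


@dataclass(frozen=True)
class ImageDType(DType):
    # the extra field takes part in the generated __eq__ and __hash__
    shape: Tuple[int, ...] = ()


a = ImageDType(100, 4, "imagef", (8, 8, 4))
b = ImageDType(100, 4, "imagef", (8, 8, 4))
c = ImageDType(100, 4, "imagef", (16, 16, 4))
```

Here `a == b` holds and both hash to the same value, while `a != c` because the shapes differ.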
cast implicitly resets shapetracker and makes it contiguous (for disk tensor), which fails for Interpreted backend if inputs contain non-contiguous st.
* wmma: enable METAL half tensor cores and clean up cstyle
* revert simple_matmul rand changes and break line in tensor
* added metal fp16->fp32 tensor core
* fix broadcasted logic if there's 0 in shapes
dims should always expand into 0, not the other way around. fixed matmul with 0 in input shapes.
forward only for now; backward is more involved and would need changes to the 0-size shortcuts
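
A small standalone sketch of the broadcasting rule described here, written in plain Python rather than against the real tensor code. The function name `broadcast_shape` is hypothetical. The point is that picking the larger of two dims would turn (0, 1) into 1, while the rule needs the size-1 dim to expand into 0.

```python
from typing import Tuple


def broadcast_shape(*shapes: Tuple[int, ...]) -> Tuple[int, ...]:
    """Compute the broadcast shape; a size-1 dim expands into any other size, including 0."""
    ndim = max(len(s) for s in shapes)
    # left-pad shorter shapes with 1s so every shape has the same rank
    padded = [(1,) * (ndim - len(s)) + tuple(s) for s in shapes]
    out = []
    for dims in zip(*padded):
        non_one = {d for d in dims if d != 1}
        if len(non_one) > 1:
            raise ValueError(f"cannot broadcast dims {dims}")
        # take the single non-1 size (possibly 0); taking max(dims) would give 1 for (0, 1)
        out.append(non_one.pop() if non_one else 1)
    return tuple(out)


zero_case = broadcast_shape((0,), (1,))
mixed_case = broadcast_shape((3, 1), (1, 0))
```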
* fix tests
* onehot in Tensor.py
* one_hot tests
* works for all shapes, not just 1
* pylint
* not a static method
* moved around, num_classes mandatory
* pylint
* pylint
* space & moving
* formatting
* moved tests
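
For reference, a plain-Python sketch of one-hot encoding that works on nested index lists of any shape and requires `num_classes`, matching the bullets above. This is only an illustration on lists, not the Tensor method itself.

```python
from typing import List, Union

Nested = Union[int, List["Nested"]]


def one_hot(indices: Nested, num_classes: int) -> list:
    """Replace every integer index with a 0/1 vector of length num_classes."""
    if isinstance(indices, int):
        return [1 if c == indices else 0 for c in range(num_classes)]
    # recurse so that any nesting depth gains one trailing axis
    return [one_hot(i, num_classes) for i in indices]


scalar_case = one_hot(2, 3)
matrix_case = one_hot([[0, 1], [2, 0]], 3)
```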
* add llama attention test for multigpu
* test fails
* kv cache trying to shrink on sharded axis
* mask None works for scaled dot product
* kv cache seems to be working but scaled dot product breaks
* scaled dot product works, but the last linear layer failed
* running into the reshape case where it could be wrong for multigpu
* making sure it was the reshape
* adding contiguous doesn't solve
* need to shard more properly
* remove reshape test
* minor adjustment to scaled dot product attention test
* weights are sharded wrong
* continue fix new weight sharding
* clean up
* fix attention when start_pos is 0
* remove print
* add TODOs for the best multigpu interface
* use device from LinearizerOptions in kernel search
removed all Device.DEFAULT in search.py
* pass device string for parallel pickle
* device for interpreted backends in LinearizerOptions
* mem_estimate is always int, not symbolic
op_estimate can be symbolic, but mem_estimate is always int, thus we don't need to sym_infer it.
fixed some long lines too. update_stats is a very big function
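
A toy sketch of the distinction, under stated assumptions: a symbolic value is modelled as a callable over variable values, and `sym_infer` and `update_stats` below are simplified stand-ins that only share their names with the real functions. Only the op estimate needs resolving; the memory estimate is used as the int it already is.

```python
from typing import Callable, Dict, Union

# hypothetical stand-in for a symbolic expression: a function of the variable values
Symbolic = Callable[[Dict[str, int]], int]


def sym_infer(value: Union[int, Symbolic], var_vals: Dict[str, int]) -> int:
    """Return ints unchanged and evaluate symbolic values with var_vals."""
    return value if isinstance(value, int) else value(var_vals)


def update_stats(op_estimate: Union[int, Symbolic], mem_estimate: int,
                 var_vals: Dict[str, int]) -> Dict[str, int]:
    # op_estimate may depend on a variable, so it is resolved here
    ops = sym_infer(op_estimate, var_vals)
    # mem_estimate is always an int, so no inference step is needed
    return {"ops": ops, "mem": mem_estimate}


stats = update_stats(lambda v: 2 * v["n"] * 64, 4096, {"n": 10})
```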
* operator does not need underscores