* onehot in Tensor.py
* one_hot tests
* works for all shapes, not just 1
* pylint
* not a static method
* moved around, num_classes mandatory
* pylint
* pylint
* space & moving
* formatting
* moved tests
* add llama attention test for multigpu
* test fails
* kv cache trying to shrink on sharded axis
* mask None works for scaled dot product
* kv cache seems to be working but scaled dot product breaks
* scaled dot product works, but the last linear layer failed
* running into the reshape case where it could be wrong for multigpu
* making sure it was the reshape
* adding contiguous doesn't solve
* need to shard more properly
* remove reshape test
* minor adjustment to scaled dot product attention test
* weights are sharded wrong
* continue fix new weight sharding
* clean up
* fix attention when start_pos is 0
* remove print
* add TODOs for the best multigpu interface
* use device from LinearizerOptions in kernel search
removed all Device.DEFAULT in search.py
* pass device string for parallel pickle
* device for interpreted backends in LinearizerOptions
* mem_estimate is always int, not symbolic
op_estimate can be symbolic, but mem_estimate is always int, thus we don't need to sym_infer it.
fixed some long lines too. update_stats is a very big function
* operator does not need underscores
* cached size
* simplify simplify
* 0 doesn't have base
* fix test
* cleaner cache
* hmm, metal is flaky on this...might be real(ish) but useless as test
* short circuit reshape/expand properly
* better reshape bypass
* base doesn't have to be a function
* no double fetch
* pop, don't check
* make the gc happy
* avoid hasattr
* cache canonicalize
* remove assert, faster base
* don't redefine that every time
* make Embedding device aware for multigpu
* split line instead of ignore because that's cheating
* add test incomplete
* add test complete
* remove comment
* fix white space
* remove nn.Embedding
* WebGL WIP
* 84% of ops passing test
* tests passing 100%
* Cleanup, refactor
* Shave off some lines
* Work on dtypes
* TestOps at 100% again
* Efficient net shaders compile in browser webgl2
* Compile all efficientnet shaders in browser
* Create empty textures for tensor buffers
* Run program. Up next weight loading
* Exported WebGL model working
* Add tests, refactor
* Explicit cast alu for GLSL
* Fix CI tests
* WebGL efficientnet demo
* Compile and run yolov8 in browser
* Fix imports
* Simplify yolo compile
* Fix bool*bool and cast cmplt to float
* More tests
* Do std tests pass on CI?
* Skip std tests on CI
* Remove explicit_cast_alu hack, and solve it in code_for_op
* Move to new dtype-less alloc api
* Remove local size hack: optimize local_size only if device has local
* Remove glsl.py, and move content to cstyle
* dont_use_locals in opts
* Fix dtype tests
* type_map in CStyleLanguage
* Make core changes smaller, cleaner, refactor export_model and demo
* Skip pad_slice
* Simplify: render_const, render_conditional
* solve bool alu for other binops, cleaner ops_webgl
* Fix noopt hack
* Remove some skipIfs
* WebGL image hack
* type_names is a better name
* global_max
* Fix dtype import
* Fix type_names -> type_map
* Fix lint
* Remove webgpu, back to 5k lines (#3040)
* remove webgpu
* max 5000 lines
* revert those to master
* retain that cstyle
---------
Co-authored-by: Ahmed Harmouche <ahmedharmouche92@gmail.com>