* pretty multinomial
p, cdf_normalized -> weight, cdf
symmetric unsqueeze / squeeze
check num_samples > 0
TODO: how do we want to handle 0/0 in general?
* no 0-dim input
* single sum
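The multinomial commits above revolve around sampling via a normalized CDF (the `p, cdf_normalized -> weight, cdf` rename). A minimal pure-Python sketch of that technique, assuming nothing beyond the names in the commits (the function body itself is illustrative, not tinygrad's implementation):

```python
import random
from bisect import bisect_right

def multinomial(weight, num_samples, replacement=True):
    # Sample indices proportional to `weight` via the normalized CDF,
    # mirroring the weight -> cdf naming in the commits above.
    assert num_samples > 0, "num_samples must be positive"
    assert replacement or num_samples == 1, "no replacement only supports num_samples=1"
    total = sum(weight)
    cdf, acc = [], 0.0
    for w in weight:
        acc += w
        cdf.append(acc / total)  # cdf normalized to [0, 1]
    # bisect_right finds the first bucket whose cumulative mass exceeds r
    return [bisect_right(cdf, random.random()) for _ in range(num_samples)]
```

For example, `multinomial([1, 2, 7], 5)` returns five indices in `{0, 1, 2}`, with index 2 drawn most often.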
* beautiful mnist
* beautiful mnist example
* from tinygrad import Tensor
* more beautiful
* the jit is super core tinygrad
* globalcounters reset on jit run
* symlinks and exclude
* beautiful_cartpole
* evaluate is its own function
* no symlinks
* more beautiful
* jit reset for double speed
* type hinting for JIT
* beautiful_mnist gets 98%
* beautiful_mnist < 4s with BEAM=2
* better cartpole
* use actor critic
* zero_grad got lost
* delete double relu
* stable cartpole with PPO
* beautiful_cartpole is more beautiful
* REPLAY_BUFFER
* beautiful stuff typechecks
* None support in shape
* hp tuning
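The cartpole commits switch the example to an actor-critic update with PPO. One building block of that setup, the discounted return used as the critic's target, can be sketched in plain Python (the gamma value here is an assumed hyperparameter, not taken from the commits):

```python
def discounted_returns(rewards, gamma=0.99):
    # Walk the episode backwards, accumulating reward discounted by gamma:
    # G_t = r_t + gamma * G_{t+1}
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns
```

For instance, `discounted_returns([1, 1, 1], gamma=0.5)` gives `[1.75, 1.5, 1.0]`.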
* add back as_strided, move rebuilt mops to extra
* negative stride for ops_cpu
* Revert "negative stride for ops_cpu"
This reverts commit a13b6815ac.
* skip that
* style
* force rebuild of ocelot
* SzymonOzog gpuocelot
* delete that
* downgrade that
* non parallel
* force rebuild
* use llvm
* nauto
* less mem maybe
* print test
* helper_test_exception skip CUDACPU
* helper_test_exception
* shippable
* assert adequate memory has been freed
* cleaned up runtime error message
* improved metal buffer alloc error catching and reporting
* decreased lines and altered messages
* removed unnecessary _get_cur_free_space() call
* improved assert message
* added allocate massive buffer test
* added test_lru_allocator_metal_max_buffer_length
* split into two asserts and removed walrus assignment from assert expression
* update assert message and use byte data type for clarity
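The allocator commits above (and the later "very simple cacheallocator" one) concern caching freed buffers and asserting that adequate memory has been freed before a new allocation. A purely illustrative pure-Python sketch of that reuse-or-evict pattern, with all names and the accounting scheme assumed rather than taken from tinygrad:

```python
class LRUAllocatorSketch:
    # Cache freed buffers by size; on alloc, reuse a cached buffer or,
    # if free space is short, evict least-recently-freed buffers first.
    def __init__(self, total):
        self.free_space = total
        self.cache = {}  # size -> list of freed-but-cached buffers

    def alloc(self, size):
        if self.cache.get(size):
            return self.cache[size].pop()  # reuse: no new memory consumed
        if size > self.free_space:
            self._evict(size)
        assert size <= self.free_space, "adequate memory has not been freed"
        self.free_space -= size
        return bytearray(size)

    def free(self, buf):
        # Freed buffers stay allocated but are parked in the cache for reuse.
        self.cache.setdefault(len(buf), []).append(buf)

    def _evict(self, need):
        # Drop cached buffers (oldest first) until the request can fit.
        for size, bufs in list(self.cache.items()):
            while bufs and need > self.free_space:
                bufs.pop(0)
                self.free_space += size
```

The design choice the commits hint at: freeing returns a buffer to the cache instead of the device, so repeated same-size allocations (common in training loops) skip the driver entirely.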
* add Tensor.multinomial only with replacement
* add support for 2D input in Tensor.multinomial
* fix multinomial output shape
* allow passing replacement=False to Tensor.multinomial when num_samples=1
* improve tests for Tensor.multinomial
* fix edge case in Tensor.multinomial
* Tensor.multinomial no more staticmethod
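The shape-fix commit above ("fix multinomial output shape") implies that a 2D input of shape `(rows, classes)` yields output shape `(rows, num_samples)`. A list-based sketch of that per-row behavior, with the helper name `multinomial_2d` purely hypothetical:

```python
import random

def multinomial_2d(weights, num_samples):
    # For each row, build its normalized CDF and draw num_samples indices,
    # so the output shape is (len(weights), num_samples).
    out = []
    for row in weights:
        total = sum(row)
        cdf, acc = [], 0.0
        for w in row:
            acc += w
            cdf.append(acc / total)
        samples = []
        for _ in range(num_samples):
            r = random.random()
            samples.append(next(i for i, c in enumerate(cdf) if r < c))
        out.append(samples)
    return out
```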
* zero in shape start
* no assert for that
* if output size is 0, return without exec
* tweak
* strides
* reduce over non-zero
* shrink and expand
* fix import
* test_elementwise where
* cannot reshape from size 0 to size 1
* compiled backend reduce over 0
* zeros for numpy
* reduce over 0 and keepdim resulted in 1
* reduce empty set default values
* compare with same input
* pad test case
* cat test case
* torch does not support that?
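The zero-size commits above ("reduce over 0", "reduce empty set default values") concern the identity values a reduction must return over an empty set. Python's own builtins follow the same conventions, which this self-contained snippet demonstrates:

```python
import math

# Reducing over an empty set falls back to the operation's identity element.
empty = []
assert sum(empty) == 0                              # identity of add
assert math.prod(empty) == 1                        # identity of mul
assert max(empty, default=-math.inf) == -math.inf   # max needs an explicit default
```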
* metal indirect command buffers
* sub 1ms gpt
* metal batch exec is good
* remove whitespace
* input_replace
* fix ci
* useResources
* very simple cacheallocator
* update_stats
* fix CI
* minor
* remove that from jit
* var_vals are global
* working with global ish
* better
* fix export model
* fix tests
* better kv cache
* does it run?
* use where for kvmask
* fix excessive var_vals
* fix import
* how does multigpu use this?
* llama kinda work
* faster and simpler
* cleanup
* fix conversation mode
* test cleanups
* fix one more test
* test cleanup
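The llama commits above center on a key/value cache for faster generation. A minimal list-based sketch of the append-and-reuse pattern (purely illustrative: the real cache holds Tensors and, per the commits, interacts with global `var_vals`):

```python
class KVCacheSketch:
    # Append each step's new key/value entries and return the full history,
    # so attention only computes projections for the newly generated tokens.
    def __init__(self):
        self.keys, self.values = [], []

    def update(self, new_keys, new_values):
        self.keys.extend(new_keys)
        self.values.extend(new_values)
        return self.keys, self.values
```

Usage: call `update` once per decoding step with just that step's projections, and attend over the returned lists.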
---------
Co-authored-by: George Hotz <geohot@gmail.com>