* work on minrf example
* more
* jit sample
* t is tensor not const
* fixes
* more convs
* fix dropout
* don't print
* 504
* big patch
* onehot
* touch
* use embeddings
* dumb uses final layer
* act
* non fl
* match
* tp
* 3
* of
* ppsz
* normal
* add adln
* no t
* weird transformer
* weird transformer
* contig
* actual speed fix
* dumb
* cb
* 0
* t is 0
* mort-t
* args
* dumb days are over
* readable
* contig
* no more t mask
* mask_t
* init to zero
* clean
* steps
* work
* tt
* t
* solid
* Enhance tensor random functions with dtype support
- Updated `aten.uniform_` and `aten.normal_` to include dtype parameter in backend.py
- Added unit tests for uniform and normal tensor generation with specific dtypes in test.py
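The dtype plumbing described in this commit can be sketched as follows. This is an illustrative stand-in using numpy, not the actual tinygrad torch-backend code; the function names `uniform_` and `normal_` mirror the aten ops mentioned above, but their signatures here are assumptions.

```python
import numpy as np

# Hypothetical sketch of passing a dtype parameter through backend random
# ops, in the spirit of the aten.uniform_/aten.normal_ change above.
def uniform_(shape, low=0.0, high=1.0, dtype=np.float32):
  # generate, then cast so the requested dtype is honored
  return np.random.uniform(low, high, size=shape).astype(dtype)

def normal_(shape, mean=0.0, std=1.0, dtype=np.float32):
  return np.random.normal(mean, std, size=shape).astype(dtype)

t = uniform_((2, 3), dtype=np.float16)
assert t.dtype == np.float16 and t.shape == (2, 3)
```

The unit tests referenced above would then assert on the dtype of the returned tensor, as the last lines do here.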
* Refactor test name for clarity
- Renamed `test_normal_dtype` to `test_normal` in `extra/torch_backend/test.py`
- Aims to improve readability and better reflect the test's purpose
* start gpu
* progress
* fixes
* read correct
* libusb
* libusb works
* support asm24
* hmm
* one access file
* fix extra
* start AMBar
* works on am
* back to usb
* patch fw
* full fast write into a bar
* ugh, minus one gpus, next please
* mute libusb for now
* usb for asm24
* 63
* hmm
* ops
* rescan
* and gpu should be there
* enumerate them?
* usbgpu bus 4, 100% reliable (draft)
* lil
* works
* comments
* add DEBUG
* cleaner
* simplest
* Revert "simplest"
This reverts commit 1d00354c16.
* Revert "cleaner"
This reverts commit c5662de956.
* assert we find gpu
* that's simpler
* this back
* simpler?
* correct
* work
* nonsense
* works with more checks
* this works
* the 6s in the right place
* reliable now
* fix after reboot
* set config
* 1s timeouts
* close to fw loading
* streams
* usbhub works
* endpoints
* fix
* want to test tiny10
* move to tiny 10
* fix gpu
* ugly speed
* smth
* mostly broken, but signals and dmas
* do not reset gpu every time
* changes to run kernels
* ugh, not working
* t10
* pg and sc files
* some prog
* um?
* somehow it works
* patched for 24
* some tries
* minimal
* moving
* back to working
* so sloooooow
* move to controller
* usb.py rewrite
* rework
* cleaner 1
* cleaner 2
* cleaner 3
* new abstractions
* after merge
* init controller
* cleaner 4
* cleaner 5
* patcher + tiny changes
* ignore that
* cleaner 6
* after rebase
* cleaner 7
* bring it back
* start linter war
* linter 2
* autogen was missing
* fix autogen
* typing
* better?
* mypy
* extra/legacy rename and cleaner
* shuffle
* better printing
* tiny changes and tests
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* bug in div range folding
* simpler
* oh, this is right for indexing, but the div mod folding needs to be fixed
* reenable
* Passing test_complexity_w_unroll2 (#10068)
* Passing
* remove non_folded_divs
* Add check for negative term in div folding
* Add test
* bump that limit
* fix casted
---------
Co-authored-by: Sieds Lykles <93992551+S-Lykles@users.noreply.github.com>
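The negative-term check in the div-folding commits above guards a real pitfall, which plain Python integer arithmetic demonstrates (this is an illustration of the general hazard, not tinygrad's actual folding rule):

```python
# If a symbolic index x is known to lie in [0, d), a compiler may fold
# x // d to 0 and x % d to x. That fold holds only while x is non-negative:
d = 4
for x in range(0, d):
  assert x // d == 0 and x % d == x

# A negative term breaks it: floor division rounds toward -inf.
assert -1 // d == -1  # not 0
assert -1 % d == 3    # not -1
```

Hence folding a division needs either a proof the numerator is non-negative or an explicit check for the negative case.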
* remu refactors
* scc is sgpr 253
* remove that
* rename to vcc_lo
* run cargo test in CI
* llvm-mc
* meh
* work
* work_group work 1
* seeded_lanes is dumb
* better than seeded_lanes
* does not need to be address
* 128 sgpr per wave
* scc is sgpr, we don't know which one
* null_src once more
* derive clone, wave init is cleaner
* init comes first
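The remu commits above mention 128 SGPRs per wave and that SCC lives somewhere in the SGPR file ("scc is sgpr, we don't know which one"). A highly speculative sketch of that modeling choice, with an arbitrary slot index chosen purely for illustration:

```python
SCC = 127  # hypothetical slot for SCC inside the scalar register file

class WaveState:
  def __init__(self):
    self.sgpr = [0] * 128  # 128 sgprs per wave, per the commit notes

  def s_add_u32(self, d, a, b):
    # 32-bit scalar add; carry out lands in SCC like on real hardware
    r = self.sgpr[a] + self.sgpr[b]
    self.sgpr[SCC] = 1 if r > 0xFFFFFFFF else 0
    self.sgpr[d] = r & 0xFFFFFFFF

w = WaveState()
w.sgpr[0], w.sgpr[1] = 0xFFFFFFFF, 1
w.s_add_u32(2, 0, 1)
assert w.sgpr[2] == 0 and w.sgpr[SCC] == 1  # wrapped, carry set
```

Keeping SCC inside the scalar file (rather than as a separate field) means instructions that read or write it go through one uniform register path.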
* add kernelize
* remove that
* kernelize returns self
* update abstractions2.py
* kernelize in test_schedule
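The "kernelize returns self" change above is a fluent-API pattern; a minimal sketch of the shape it enables (the class and method bodies here are placeholders, not tinygrad's real Tensor):

```python
class Tensor:
  def __init__(self, data):
    self.data = data
    self.kernelized = False

  def kernelize(self):
    self.kernelized = True
    return self  # returning self lets calls chain

  def realize(self):
    return self.data

t = Tensor([1, 2, 3])
assert t.kernelize() is t                     # usable mid-chain
assert t.kernelize().realize() == [1, 2, 3]   # e.g. in abstractions2.py-style demos
```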
* temp: assert BUFFER_VIEW's existence
* ASSIGN must have a buffer or subbuffer target
* assert and shrink
* fix
* padded setitem
* var
* toposort once
* extra
* base_buffer
* end with BUFFER_VIEW
* setitem for disk
* test_setitem_becomes_subbuffer
* mul slice test
* torch backend fix 1
* non-deterministic
* keep subbuffer
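The idea behind the setitem/subbuffer commits above (e.g. `test_setitem_becomes_subbuffer`) is that writing through a contiguous slice should target a view of the base buffer rather than a copy. numpy slicing exhibits the same aliasing, shown here as an analogy, not tinygrad code:

```python
import numpy as np

a = np.zeros(8, dtype=np.int32)
view = a[2:5]          # a view into a's buffer, not a copy
assert view.base is a  # shares the underlying allocation
view[:] = 7            # setitem through the view...
assert a.tolist() == [0, 0, 7, 7, 7, 0, 0, 0]  # ...mutates the base buffer
```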
* Kernel.apply_opts [pr]
updated all `for opt in` loops. Also updated a few test_linearizer tests to not implicitly depend on hand_coded_optimization
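A hedged sketch of the `Kernel.apply_opts` refactor: callers that previously wrote `for opt in opts: k.apply_opt(opt)` call one helper instead. The class body is an illustrative placeholder, not the real Kernel:

```python
class Kernel:
  def __init__(self):
    self.applied = []

  def apply_opt(self, opt):
    self.applied.append(opt)

  def apply_opts(self, opts):
    # one entry point replacing the scattered `for opt in` loops
    for opt in opts:
      self.apply_opt(opt)
    return self

k = Kernel().apply_opts(["UPCAST", "LOCAL"])
assert k.applied == ["UPCAST", "LOCAL"]
```

Centralizing the loop also gives one place to hook validation or logging for every applied opt.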
* not you yet
* Add amax support to Tensor operations
- Implemented amax function in backend.py for tensor max operations.
- Added unit tests for amax in test.py to ensure correct functionality.
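The amax semantics being added above are a reduce-max over the given axes, optionally keeping reduced dims; numpy's `amax` shows the expected behavior (used here for illustration, not the backend implementation):

```python
import numpy as np

x = np.array([[1, 5, 2],
              [4, 3, 6]])
assert np.amax(x) == 6                          # full reduce
assert np.amax(x, axis=1).tolist() == [5, 6]    # per-row max
assert np.amax(x, axis=0, keepdims=True).shape == (1, 3)
```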
* Fix formatting in amax output function
- Adjusted spacing in the amax output lambda function in backend.py
- Improved code readability for better maintenance