* feat: working voice-to-text using whisper
* feat: added llama generation
* feat: vits init
* feat: more accurate voice conversion
* feat: support for tts and working pipeline for the first pass
* fix: linter checks
* refactored vits initialization and inference, added mmts-tts support
* fixed process sync and now we can have an infinite conversation
* reuse output stream to remove overhead of creating a new one each time
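The commit above keeps one output stream alive instead of opening a new one per utterance. A minimal sketch of that pattern, assuming a sounddevice backend (the repo's actual audio backend, sample rate, and function names are not shown in this log):

```python
# Minimal sketch of stream reuse; sounddevice backend, SAMPLE_RATE and
# play_chunks are illustrative assumptions, not the repo's actual code.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 22050

# Open the output stream once, instead of creating one per utterance.
stream = sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32")
stream.start()

def play_chunks(chunks):
  # Write each synthesized audio chunk to the already-open stream.
  for chunk in chunks:
    stream.write(np.asarray(chunk, dtype=np.float32))
```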
* added pre-prompt configuration with yaml files
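A sketch of what loading a pre-prompt from YAML could look like; the file path and the `pre_prompt`/`examples` keys are assumptions for illustration, not the repo's actual schema:

```python
# Sketch of YAML pre-prompt loading; file name and keys are assumptions.
import yaml

def load_pre_prompt(path="config/pre_prompt.yaml"):
  with open(path) as f:
    cfg = yaml.safe_load(f)
  # Concatenate the system prompt with a few example exchanges.
  lines = [cfg["pre_prompt"]]
  for ex in cfg.get("examples", []):
    lines.append(f"User: {ex['user']}")
    lines.append(f"Assistant: {ex['assistant']}")
  return "\n".join(lines)
```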
* adjusted code to merge PR which changed whisper
* optimized whisper, now it's blazing fast and also reduced number of lines
* added better debug printing
* use jitted encode function for whisper, added timings and removed response delim to save speed on generating those tokens
* fixed hf convert and now it's working with tinyllama
* added tinyllama config
* refactored code and made it work with all llama models
* prettier order
* prettier order
* fixed suffix for tinyllama and refactored convert_from_hf
* added missing parameters
* fixed stream release and added missing params
* jitted dp and encoder
* jitted flow forward
* removed re-init of espeak on each call to save time
* jitted generator forward for blazing fast tts
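The jit commits above (dp, flow, generator, whisper encode) all follow the same pattern: wrap the forward in tinygrad's TinyJit so the captured kernels are replayed instead of rebuilt each call. A minimal sketch of that pattern; the wrapped function is illustrative, not the actual vits/whisper forward, and the import path matches tinygrad at the time of these commits:

```python
# Minimal TinyJit usage sketch; the function body is illustrative only.
from tinygrad.tensor import Tensor
from tinygrad.jit import TinyJit

@TinyJit
def jitted_forward(x: Tensor) -> Tensor:
  # After the first calls capture the kernels, later calls replay them and
  # skip Python-side graph construction.
  return (x @ x.transpose()).relu().realize()

for _ in range(3):
  out = jitted_forward(Tensor.randn(4, 4))
```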
* added contextmanager for displaying a chat log
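A sketch of a chat-log context manager; the speaker labels and formatting are assumptions, not the repo's actual output:

```python
# Sketch of a chat-log context manager; labels/formatting are assumptions.
from contextlib import contextmanager

@contextmanager
def chat_entry(speaker: str):
  print(f"{speaker}: ", end="", flush=True)
  try:
    yield
  finally:
    print()  # finish the line once the streamed tokens are done

# usage: tokens are printed inside the block as they stream in
with chat_entry("assistant"):
  for tok in ["Hello", ",", " world"]:
    print(tok, end="", flush=True)
```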
* removed whitespace for pylint
* updated code to support latest fetch func
* wait for llama eos token and pass params from cli to llama
* listen for a variable amount of time instead of a fixed one
* refactored code a bit
* removed thresholding and now the output streams directly to whisper
* tokenize llama output for vits batch size to work and stream each sentence to a speaker
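A sketch of the sentence-streaming idea above: buffer the generated text, cut it at sentence boundaries, and hand each complete sentence to the TTS as it appears. The splitting regex and the `synthesize`/`play` callables are placeholders, not the repo's actual functions:

```python
# Sketch of streaming generated text to TTS sentence by sentence;
# regex and synthesize/play are illustrative placeholders.
import re

def stream_sentences(text_iter, synthesize, play):
  buf = ""
  for piece in text_iter:
    buf += piece
    # Flush every complete sentence to the speaker as soon as it appears.
    while (m := re.search(r"[.!?]\s", buf)):
      sentence, buf = buf[:m.end()].strip(), buf[m.end():]
      play(synthesize(sentence))
  if buf.strip():
    play(synthesize(buf.strip()))
```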
* changed speaker
* whisper is now printing on the same line
* don't trigger llama on whisper output in parens
* added tinyllama chat model
* adjusted code to work with tinyllama chat model
* removed unused cli arg
* autofetch tokenizer and tinyllama model. add 3 chat tokens to the tokenizer
* fixed issue with long sentences by chunking them
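Beyond the per-sentence split sketched earlier, the fix above breaks a single over-long sentence into smaller pieces. A sketch of word-boundary chunking, where the length limit is an arbitrary example rather than the value used in the repo:

```python
# Sketch of chunking an over-long sentence on word boundaries;
# max_chars is an example value, not the repo's setting.
def chunk_sentence(sentence: str, max_chars: int = 200):
  words, chunk = sentence.split(), []
  for w in words:
    if chunk and len(" ".join(chunk + [w])) > max_chars:
      yield " ".join(chunk)
      chunk = []
    chunk.append(w)
  if chunk:
    yield " ".join(chunk)
```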
* support for multiline llama output
* prettified log output
* adjusted sentence length
* remove quote from response to avoid funny tts
* fixed prompts
* added missing parameter
* dynamically update help if MODEL_PARAMS changes and make the first size the default
* var_vals are global
* working with global-ish var_vals
* better
* fix export model
* fix tests
* better kv cache
* does it run?
* use where for kvmask
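The kv-cache commits above keep a fixed-size cache and select the valid positions with a `where` instead of slicing, so shapes stay stable for the JIT. A NumPy sketch of the idea only, not tinygrad's actual kernels or the repo's attention code:

```python
# NumPy sketch: mask cached keys with `where` instead of slicing them away,
# so the attention shape stays fixed. Idea only, not tinygrad code.
import numpy as np

MAX_CONTEXT, HEAD_DIM = 8, 4
cache_k = np.zeros((MAX_CONTEXT, HEAD_DIM), dtype=np.float32)

def attend(q, new_k, start_pos):
  cache_k[start_pos:start_pos + len(new_k)] = new_k
  scores = cache_k @ q / np.sqrt(HEAD_DIM)                # (MAX_CONTEXT,)
  valid = np.arange(MAX_CONTEXT) < start_pos + len(new_k)
  # where-based kv mask: invalid positions get -inf instead of being sliced
  scores = np.where(valid, scores, -np.inf)
  weights = np.exp(scores - scores.max())
  return weights / weights.sum()
```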
* fix excessive var_vals
* fix import
* how does multigpu use this?
* llama kinda works
* faster and simpler
* cleanup
* fix conversation mode
* test cleanups
* fix one more test
* test cleanup
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* no JIT call in TransformerBlock
* idea
* move 2 reshapes to jitted function
shrink inside jitted too, 6.3ms
remove back reshapes, 5.5ms
isinstance -> __class__ 4.99ms
* think
revert ops_gpu.py
revert symbolic.py too
PYOPENCL_COMPILER_OUTPUT=1
* cleanup
* fix cache shape for conversational model
only reshape if start_pos > 0
* small cleanup
* include var_vals.keys() in st.key
* add comments
* llama small update
* everything jitted again, similar structure to gpt2
* fix typing
* add TODO for in place update cache
* Symbolic Shape JIT
update tests
2-variable symbolic ops, adding more tests
test passing
cleanup
* more test cases
* single flag
* review update
* jit attention one piece
* realize
* symbolic_jit test for cuda
* old artifact
* works with cuda gpu but failed ci
* CUDACPU
* Implement scaled_dot_product_attention and test
* Support attn_mask
* Support is_causal too
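A NumPy sketch of the semantics these commits add (torch-style: additive `attn_mask`, and `is_causal` building a lower-triangular mask); this illustrates the behavior, not the actual tinygrad implementation:

```python
# NumPy sketch of scaled dot-product attention with attn_mask / is_causal;
# behavior illustration only, not the tinygrad implementation.
import numpy as np

def sdpa(q, k, v, attn_mask=None, is_causal=False):
  scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
  if is_causal:
    # lower-triangular mask: position i may only attend to positions <= i
    causal = np.tril(np.ones(scores.shape[-2:], dtype=bool))
    scores = np.where(causal, scores, -np.inf)
  if attn_mask is not None:
    scores = scores + attn_mask          # additive mask (e.g. 0 / -inf)
  weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
  weights /= weights.sum(axis=-1, keepdims=True)
  return weights @ v
```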
* Use in llama
* Don't forget to reshape
* Set requires_grad=False for causal
* Remove staticmethod
* Remove extra spaces
* Fixes + improved test coverage for helpers.py
- added exception handling in `proc`; if an exception was thrown, the thread would hang
- made `_early_exec_process` catch any Exception; before, if an exception was thrown before the process was started, it would hang the thread
* Made `_early_exec_process` catch any Exception
Otherwise, if an exception was thrown before the process was started, it would hang the thread. For example, a type error for an argument passed to `subprocess.check_output`.
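A sketch of the failure mode and the shape of the fix: catch any Exception raised while launching the subprocess and put something back on the response queue so the waiting caller is never left blocked. The queue-based worker names here are illustrative, not the actual extra/helpers code:

```python
# Sketch of the hang and its fix; worker/queue names are illustrative,
# not the actual extra/helpers implementation.
import subprocess

def early_exec_worker(request_q, response_q):
  while True:
    cmd = request_q.get()
    try:
      response_q.put(subprocess.check_output(cmd))
    except Exception as e:
      # Without this, an error (e.g. a bad argument type) would leave the
      # caller blocked forever on response_q.get().
      response_q.put(e)
```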
* Fixed `from tinygrad.helpers import Timing` import
oops, for some reason my IDE cleaned that import from extra/helpers.
* Fixed import in llama.py
Another one that I skipped by accident, my bad.
* Extracted a class for tests of early exec
* Normalize line endings, Windows uses \r\n
* Made `cross_process` not a daemon
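For context on the daemon change, a minimal multiprocessing sketch under assumed names (the target function is illustrative, not `cross_process` itself): a daemonic process is killed when the parent exits and cannot spawn children of its own, so the worker is started with `daemon=False`.

```python
# Sketch of the daemon flag with multiprocessing; worker is illustrative.
import multiprocessing

def worker(q):
  q.put("done")

if __name__ == "__main__":
  q = multiprocessing.Queue()
  p = multiprocessing.Process(target=worker, args=(q,), daemon=False)
  p.start()
  print(q.get())
  p.join()
```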