This makes the program wait for the Tracy profiler to connect before
exiting and flush profiling data after each token.
I don't know how to select the Tracy iree-runtime variant
programmatically -- instead, print an error and exit.
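As a rough illustration (not the actual patch), a startup check along these lines works, assuming the Tracy-instrumented Python runtime is selected via the `IREE_PY_RUNTIME` environment variable as described in IREE's profiling docs:
```python
import os
import sys

# The Tracy runtime variant must be chosen before `iree.runtime` is first
# imported, so all we can do programmatically is detect the mismatch and exit.
if os.environ.get("IREE_PY_RUNTIME") != "tracy":
    sys.exit(
        "error: profiling requested but the Tracy runtime is not active; "
        "re-run with IREE_PY_RUNTIME=tracy"
    )
```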
- Move statistics out of the main loop
- Add 'end-to-end' numbers
- Switch the main display unit from s to ms
- Start measuring time at 0
The new print format looks like this:
```
Number of iterations: 5
Num tokens: 1 (prompt), 512 (generated), 513 (total)
Prefill: avg. 0.01 ms (stdev 0.00), avg. 97.99 tokens/s
Decode: avg. 4840.44 ms (stdev 28.80), avg. 97.99 tokens/s
Decode end-2-end: avg. 85.78 tokens/s (w/o prompt), avg. 95.98 (w/ prompt)
```
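For reference, a minimal sketch of how such averages could be computed with the standard library (the variable names and timing values are placeholders, not the actual implementation):
```python
from statistics import mean, stdev

# Placeholder per-iteration timings, collected outside the main loop.
prefill_ms = [0.011, 0.009, 0.010, 0.012, 0.010]
decode_ms = [4811.2, 4825.7, 4840.1, 4851.9, 4873.3]
num_generated = 512

print(f"Number of iterations: {len(decode_ms)}")
print(f"Prefill: avg. {mean(prefill_ms):.2f} ms (stdev {stdev(prefill_ms):.2f})")
tok_per_s = num_generated / (mean(decode_ms) / 1000.0)
print(
    f"Decode: avg. {mean(decode_ms):.2f} ms "
    f"(stdev {stdev(decode_ms):.2f}), avg. {tok_per_s:.2f} tokens/s"
)
```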
Add a new flag `-Xiree_compile` to forward extra compiler arguments to
`iree-compile`. This flag can be set multiple times to pass more than
one extra argument.
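A minimal sketch of how such a repeatable flag can be declared with argparse (the destination name is an assumption; the example arguments are standard MLIR options):
```python
import argparse

parser = argparse.ArgumentParser()
# `action="append"` collects one extra iree-compile argument per occurrence.
parser.add_argument(
    "-Xiree_compile",
    action="append",
    default=[],
    dest="extra_compile_args",
    metavar="ARG",
)

args = parser.parse_args(
    ["-Xiree_compile=--mlir-print-ir-after-all", "-Xiree_compile=--mlir-timing"]
)
print(args.extra_compile_args)  # forwarded verbatim to iree-compile
```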
* Switch most compile flows to use ireec.compile_file (see the sketch after this list).
* Re-add the input type to the compile_str path.
* Check if mlir_module exists before checking whether it's a path or a Python object.
* Fix some save_dir cases.
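A rough sketch of the dispatch these bullets describe, assuming the iree.compiler Python API (compile_file / compile_str accepting target_backends, input_type, and extra_args keywords); the helper name and defaults are illustrative:
```python
import os
from iree.compiler import compile_file, compile_str

def compile_module(mlir_module, target_backends, input_type="auto", extra_args=()):
    if mlir_module is None:
        raise ValueError("no MLIR module provided")
    # File paths go through compile_file; in-memory modules go through
    # compile_str, which needs the input type passed explicitly again.
    if isinstance(mlir_module, (str, os.PathLike)) and os.path.isfile(mlir_module):
        return compile_file(
            str(mlir_module),
            target_backends=target_backends,
            input_type=input_type,
            extra_args=list(extra_args),
        )
    return compile_str(
        mlir_module,
        target_backends=target_backends,
        input_type=input_type,
        extra_args=list(extra_args),
    )
```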
Print averaged results at the end of all iterations. Increase the
default number of iterations to 5.
Example:
```
Number of iterations: 5
Prefill: avg. 0.03 s, stddev 0.00
Decode: avg. 43.34 tokens/s, stdev 0.13
```
Also remove the -2 in the number of generated tokens -- I did not find
any evidence we need it.
Add flags to enable a non-interactive mode for microbenchmarking Llama
models. In this mode, the system and user prompts are specified with CLI
flags, and the number of generated tokens and iterations is fixed.
Also move the stats below the response and trim any whitespace around
the response.
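A sketch of what the flag surface could look like (all flag names and the run_llama stub are hypothetical, not the actual CLI):
```python
import argparse

def run_llama(system_prompt, prompt, num_tokens, num_iterations):
    # Stub standing in for the real model invocation.
    return "  ...generated text...  ", "Decode: avg. 43.34 tokens/s, stdev 0.13"

parser = argparse.ArgumentParser()
parser.add_argument("--benchmark", action="store_true")
parser.add_argument("--system_prompt", default="")
parser.add_argument("--prompt", default="Hello")
parser.add_argument("--num_tokens", type=int, default=512)
parser.add_argument("--num_iterations", type=int, default=5)
args = parser.parse_args()

if args.benchmark:
    response, stats = run_llama(
        args.system_prompt, args.prompt, args.num_tokens, args.num_iterations
    )
    print(response.strip())  # response first, surrounding whitespace trimmed
    print(stats)             # stats below the response
```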
Update vmfb naming for Vulkan devices in order to resolve naming
conflicts in the presence of multiple Vulkan devices.
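For illustration, a naming helper along these lines avoids the collision (the function and filename format are hypothetical):
```python
def vulkan_vmfb_name(model_name, device_name, device_index):
    # Include both the device name and its enumeration index so two
    # identical GPUs no longer map to the same cached .vmfb file.
    safe = device_name.lower().replace(" ", "_")
    return f"{model_name}_vulkan_{safe}_{device_index}.vmfb"

print(vulkan_vmfb_name("llama2_7b", "AMD Radeon RX 7900 XTX", 0))
# -> llama2_7b_vulkan_amd_radeon_rx_7900_xtx_0.vmfb
```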
Signed-off-by: Gaurav Shukla <gaurav@nod-labs.com>
-- This commit fixes the wrong Vulkan device being selected at
runtime.
-- It also adds a couple of IREE compilation flags to target a specific
Vulkan device (see the sketch below).
-- It also changes the Vulkan device listing to better match the
lowering control flow.
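A sketch of the two sides of this, assuming IREE's Vulkan backend flags (the target triple shown is just an example) and the device-URI form accepted by the Python runtime:
```python
import iree.runtime as ireert

# Compile-time: pin the Vulkan target architecture.
vulkan_flags = [
    "--iree-hal-target-backends=vulkan-spirv",
    "--iree-vulkan-target-triple=rdna2-unknown-linux",  # example triple
]

# Run-time: select the GPU explicitly by index instead of relying on the
# driver's default enumeration order.
device = ireert.get_device("vulkan://0")
```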
Signed-off-by: Abhishek Varma <abhishek@nod-labs.com>
The past key values are only used within the models themselves and can
be kept on device. For Vulkan int4, this gives 44 tok/s (for the first
prompt) and settles at around 26 tok/s on a 7900 XTX.
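A minimal sketch of the idea using iree.runtime device arrays (the cache shape and the forward call are placeholders):
```python
import numpy as np
import iree.runtime as ireert

device = ireert.get_device("vulkan://0")

# Upload the initial (empty) cache once; the DeviceArray results returned
# by the module can be fed straight back in without a host round-trip.
past_kv = ireert.asdevicearray(device, np.zeros((1, 32, 0, 128), np.float16))

# Inside the decode loop (placeholder call):
#   logits, past_kv = module.forward(token, past_kv)  # past_kv stays on device
```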
* WIP: MSVC ROCm support for SHARK Studio
* Make get_iree_rocm_args platform-agnostic (see the sketch after this list).
* Update stable_args.py
* Update ROCm arg handling in SD utils
* Guard quantization imports.
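A rough sketch of what platform-agnostic ROCm argument handling could look like (the target-chip flag follows IREE's ROCm backend of the time; the default arch and the Windows branch are placeholders):
```python
import platform

def get_iree_rocm_args(arch="gfx1100"):
    # Same compile flags on every OS; only genuinely platform-specific
    # handling (e.g. HIP SDK paths on Windows/MSVC) is branched.
    args = [f"--iree-rocm-target-chip={arch}"]
    if platform.system() == "Windows":
        pass  # Windows-specific setup would go here.
    return args
```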
Co-authored-by: jam https://github.com/jammm
* [Llama2] Add a fix for Llama2 13B downloading/crashing
-- This commit fixes Llama2 13B downloading/crashing due to the wrong
.mlir file.
-- It also adds support for downloading the vmfb from shark_tank in the CLI.
Signed-off-by: Abhishek Varma <abhishek@nod-labs.com>
* [llama2] Add a spec file to run the Llama/Vicuna CLI exe
-- This commit adds a spec file to run the Llama/Vicuna CLI exe.
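For context, and assuming this refers to a PyInstaller spec used to build the CLI executable, such a spec file has roughly this shape (a minimal sketch; the entry-point and executable names are placeholders, not the actual spec contents):
```python
# llama2_cli.spec -- evaluated by PyInstaller, which injects Analysis/PYZ/EXE.
a = Analysis(["llama2_cli.py"], datas=[], hiddenimports=[])
pyz = PYZ(a.pure)
exe = EXE(
    pyz,
    a.scripts,
    a.binaries,
    a.datas,
    name="llama2_cli",
    console=True,
)
```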
Signed-off-by: Abhishek Varma <abhishek@nod-labs.com>
---------
Signed-off-by: Abhishek Varma <abhishek@nod-labs.com>