Currently translated at 16.5% (273 of 1645 strings)
translationBot(ui): update translation (Polish)
Currently translated at 15.4% (254 of 1645 strings)
translationBot(ui): update translation (Polish)
Currently translated at 10.8% (178 of 1645 strings)
Co-authored-by: Nik Nikovsky <zejdzztegomaila@gmail.com>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/pl/
Translation: InvokeAI/Web UI
Currently translated at 100.0% (1649 of 1649 strings)
translationBot(ui): update translation (Vietnamese)
Currently translated at 100.0% (1645 of 1645 strings)
translationBot(ui): update translation (Vietnamese)
Currently translated at 100.0% (1645 of 1645 strings)
translationBot(ui): update translation (Vietnamese)
Currently translated at 100.0% (1645 of 1645 strings)
Co-authored-by: Linos <linos.coding@gmail.com>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/vi/
Translation: InvokeAI/Web UI
## Summary
This PR enables RAM/VRAM cache size limits to be determined dynamically
based on availability.
**Config Changes**
This PR modifies the app configs in the following ways:
- A new `device_working_mem_gb` config was added. This is the amount of
non-model working memory to keep available on the execution device (i.e.
GPU) when using dynamic cache limits. It default to 3GB.
- The `ram` and `vram` configs now default to `None`. If these configs
are set, they will take precedence over the dynamic limits. **Note: Some
users may have previously overriden the `ram` and `vram` values in their
`invokeai.yaml`. They will need to remove these configs to enable the
new dynamic limit feature.**
**Working Memory**
In addition to the new `device_working_mem_gb` config described above,
memory-intensive operations can estimate the amount of working memory
that they will need and request it from the model cache. This is
currently applied to the VAE decoding step for all models. In the
future, we may apply this to other operations as we work out which ops
tend to exceed the default working memory reservation.
**Mitigations for https://github.com/invoke-ai/InvokeAI/issues/7513**
This PR includes some mitigations for the issue described in
https://github.com/invoke-ai/InvokeAI/issues/7513. Without these
mitigations, it would occur with higher frequency when dynamic RAM
limits are used and the RAM is close to maxed-out.
## Limitations / Future Work
- Only _models_ can be offloaded to RAM to conserve VRAM. I.e. if VAE
decoding requires more working VRAM than available, the best we can do
is keep the full model on the CPU, but we will still hit an OOM error.
In the future, we could detect this ahead of time and switch to running
inference on the CPU for those ops.
- There is often a non-negligible amount of VRAM 'reserved' by the torch
CUDA allocator, but not used by any allocated tensors. We may be able to
tune the torch CUDA allocator to work better for our use case.
Reference:
https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf
- There may be some ops that require high working memory that haven't
been updated to request extra memory yet. We will update these as we
uncover them.
- If a model is 'locked' in VRAM, it won't be partially unloaded if a
later model load requests extra working memory. This should be uncommon,
but I can think of cases where it would matter.
## Related Issues / Discussions
- #7492
- #7494
- #7500
- #7505
## QA Instructions
Run a variety of models near the cache limits to ensure that model
switching works properly for the following configurations:
- [x] CUDA, `enable_partial_loading=true`, all other configs default
(i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=true`, CPU and CUDA memory reserved
in another process so there is limited RAM/VRAM remaining, all other
configs default (i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=false`, all other configs default
(i.e. dynamic memory limits)
- [x] CUDA, ram/vram limits set (these should take precedence over the
dynamic limits)
- [x] MPS, all other default (i.e. dynamic memory limits)
- [x] CPU, all other default (i.e. dynamic memory limits)
## Merge Plan
- [x] Merge #7505 first and change target branch to main
## Checklist
- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
## Summary
This PR adds support for partial loading of models onto the GPU. This
enables models to run with much lower peak VRAM requirements (e.g. full
FLUX dev with 8GB of VRAM).
The partial loading feature is enabled behind a new config flag:
`enable_partial_loading=True`. This flag defaults to `False`.
**Note about performance:**
The `ram` and `vram` config limits are still applied when
`enable_partial_loading=True` is set. This can result in significant
slowdowns compared to the 'old' behaviour. Consider the case where the
VRAM limit is set to `vram=0.75` (GB) and we are trying to run an 8GB
model. When `enable_partial_loading=False`, we attempt to load the
entire model into VRAM, and if it fits (no OOM error) then it will run
at full speed. When `enable_partial_loading=True`, since we have the
option to partially load the model we will only load 0.75 GB into VRAM
and leave the remaining 7.25 GB in RAM. This will cause inference to be
much slower than before. To workaround this, it is important that your
`ram` and `vram` configs are carefully tuned. In a future PR, we will
add the ability to dynamically set the RAM/VRAM limits based on the
available memory / VRAM.
## Related Issues / Discussions
- #7492
- #7494
- #7500
## QA Instructions
Tests with `enable_partial_loading=True`, `vram=2`, on CUDA device:
For all tests, we expect model memory to stay below 2 GB. Peak working
memory will be higher.
- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests
Tests with `enable_partial_loading=True`, and hack to force all models
to load 10%, on CUDA device:
- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests
Tests with `enable_partial_loading=False`, `vram=30`:
We expect no change in behaviour when `enable_partial_loading=False`.
- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests
Other platforms:
- [x] No change in behavior on MPS, even if
`enable_partial_loading=True`.
- [x] No change in behavior on CPU-only systems, even if
`enable_partial_loading=True`.
## Merge Plan
- [x] Merge #7500 first, and change the target branch to main
## Checklist
- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
## Summary
This is an unplanned fix between PR3 and PR4 in the sequence of partial
loading (i.e. low-VRAM) PRs. This PR restores the 'Current Workaround'
documented in https://github.com/invoke-ai/InvokeAI/issues/7513. In
other words, to work around a flaw in the model cache API, this fix
allows models to be loaded into VRAM _even if_ they have been dropped
from the RAM cache.
This PR also adds an info log each time that this workaround is hit. In
a future PR (#7509), we will eliminate the places in the application
code that are capable of triggering this condition.
## Related Issues / Discussions
- #7492
- #7494
- #7500
- https://github.com/invoke-ai/InvokeAI/issues/7513
## QA Instructions
- Set RAM cache limit to a small value. E.g. `ram: 4`
- Run FLUX text-to-image with the full T5 encoder, which exceeds 4GB.
This will trigger the error condition.
- Before the fix, this test configuration would cause a `KeyError`.
After the fix, we should see an info-level log explaining that the
condition was hit, but that generation should continue successfully.
## Merge Plan
No special instructions.
## Checklist
- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
Previously, we didn't differentiate between model install errors for different types of model install sources, resulting in a buggy UX:
- If a HF model install failed, but it was a HF URL install and not a repo id install, the link to the HF model page was incorrect.
- If a non-HF URL install (e.g. civitai) failed, we treated it as a HF URL install. In this case, if the user's HF token was invalid or unset, we directed the user to set it. If the HF token was valid, we displayed an empty red toast. If it's not a HF URL install, then of course neither of these are correct.
Also, the logic for handling the toasts was a bit complicated.
This change does a few things:
- Consolidate the model install error toasts into one place - the socket.io event handler for the model install error event. There is no more global state for the toasts and there are no hooks managing them.
- Handling the different cases for errors, including all combinations of HF/non-HF and unauthorized/forbidden/unknown.
This is required to fix an issue with the MM UI's error handling.
Previously, we only included the model source as a string. That could be an arbitrary URL, file path or HF repo id, but the frontend has no parsing logic to differentiate between these different model sources.
Without access to the type of model source, it is difficult to determine how the user should proceed. For example, if it's HF URL with an HTTP unauthorized error, we should direct the user to log in to HF. But if it's a civitai URL with the same error, we should not direct the user to HF.
There are a variety of related edge cases.
With this change, the full `ModelSource` object is included in each model install event, including error events.
I had to fix some circular import issues, hence the import changes to files other than `events_common.py`.
## Summary
This PR is the third in a sequence of PRs working towards support for
partial loading of models onto the compute device (for low-VRAM
operation). This PR updates the LoRA patching code so that the following
features can cooperate fully:
- Partial loading of weights onto the GPU
- Quantized layers / weights
- Model patches (e.g. LoRA)
Note that this PR does not yet enable partial loading. It adds support
in the model patching code so that partial loading can be enabled in a
future PR.
## Technical Design Decisions
The layer patching logic has been integrated into the custom layers (via
`CustomModuleMixin`) rather than keeping it in a separate set of wrapper
layers, as before. This has the following advantages:
- It makes it easier to calculate the modified weights on the fly and
then reuse the normal forward() logic.
- In the future, it makes it possible to pass original parameters that
have been cast to the device down to the LoRA calculation without having
to re-cast (but the current implementation hasn't fully taken advantage
of this yet).
## Know Limitations
1. I haven't fully solved device management for patch types that require
the original layer value to calculate the patch. These aren't very
common, and are not compatible with some quantized layers, so leaving
this for future if there's demand.
2. There is a small speed regression for models that have CPU
bottlenecks. This seems to be caused by slightly slower method
resolution on the custom layers sub-classes. The regression does not
show up on larger models, like FLUX, that are almost entirely
GPU-limited. I think this small regression is tolerable, but if we
decide that it's not, then the slowdown can easily be reclaimed by
optimizing other CPU operations (e.g. if we only sent every 2nd progress
image, we'd see a much more significant speedup).
## Related Issues / Discussions
- https://github.com/invoke-ai/InvokeAI/pull/7492
- https://github.com/invoke-ai/InvokeAI/pull/7494
## QA Instructions
Speed tests:
- Vanilla SD1 speed regression
- Before: 3.156s (8.78 it/s)
- After: 3.54s (8.35 it/s)
- Vanilla SDXL speed regression
- Before: 6.23s (4.46 it/s)
- After: 6.45s (4.31 it/s)
- Vanilla FLUX speed regression
- Before: 12.02s (2.27 it/s)
- After: 11.91s (2.29 it/s)
LoRA tests with default configuration:
- [x] SD1: A handful of LoRA variants
- [x] SDXL: A handful of LoRA variants
- [x] flux non-quantized: multiple lora variants
- [x] flux bnb-quantized: multiple lora variants
- [x] flux ggml-quantized: muliple lora variants
- [x] flux non-quantized: FLUX control LoRA
- [x] flux bnb-quantized: FLUX control LoRA
- [x] flux ggml-quantized: FLUX control LoRA
LoRA tests with sidecar patching forced:
- [x] SD1: A handful of LoRA variants
- [x] SDXL: A handful of LoRA variants
- [x] flux non-quantized: multiple lora variants
- [x] flux bnb-quantized: multiple lora variants
- [x] flux ggml-quantized: muliple lora variants
- [x] flux non-quantized: FLUX control LoRA
- [x] flux bnb-quantized: FLUX control LoRA
- [x] flux ggml-quantized: FLUX control LoRA
Other:
- [x] Smoke testing of IP-Adapter, ControlNet
All tests repeated on:
- [x] cuda
- [x] cpu (only test SD1, because larger models are prohibitively slow)
- [x] mps (skipped FLUX tests, because my Mac doesn't have enough memory
to run them in a reasonable amount of time)
## Merge Plan
No special instructions.
## Checklist
- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_