Commit Graph

15250 Commits

Ryan Dick
85eb4f0312 Fix an edge case with model offloading from VRAM to RAM. If a GGML-quantized model is offloaded from VRAM inside a torch.inference_mode() context manager, the following error is raised: 'RuntimeError: Cannot set version_counter for inference tensor'. 2025-01-07 15:59:50 +00:00
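A minimal sketch of this failure class, for context (illustrative only, not the InvokeAI offload code; deep-copying is one commonly reported way to hit this exact message):

```python
import copy

import torch

# Tensors materialized under torch.inference_mode() are "inference tensors".
# Operations that need to touch their autograd version counter fail on them;
# copy.deepcopy is one commonly reported trigger for this exact error.
with torch.inference_mode():
    weight = torch.ones(4)

copy.deepcopy(weight)  # RuntimeError: Cannot set version_counter for inference tensor
```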
psychedelicious
67e948b50d chore: bump version to v5.6.0rc1 v5.6.0rc1 2025-01-07 19:41:56 +11:00
Riccardo Giovanetti
d9a20f319f translationBot(ui): update translation (Italian)
Currently translated at 99.3% (1639 of 1649 strings)

Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/
Translation: InvokeAI/Web UI
2025-01-07 19:32:50 +11:00
Riku
38d4863e09 translationBot(ui): update translation (German)
Currently translated at 71.7% (1181 of 1645 strings)

Co-authored-by: Riku <riku.block@gmail.com>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/de/
Translation: InvokeAI/Web UI
2025-01-07 19:32:50 +11:00
Nik Nikovsky
cd7ba14adc translationBot(ui): update translation (Polish)
Currently translated at 16.5% (273 of 1645 strings)

translationBot(ui): update translation (Polish)

Currently translated at 15.4% (254 of 1645 strings)

translationBot(ui): update translation (Polish)

Currently translated at 10.8% (178 of 1645 strings)

Co-authored-by: Nik Nikovsky <zejdzztegomaila@gmail.com>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/pl/
Translation: InvokeAI/Web UI
2025-01-07 19:32:50 +11:00
Linos
e5b6beb24d translationBot(ui): update translation (Vietnamese)
Currently translated at 100.0% (1649 of 1649 strings)

translationBot(ui): update translation (Vietnamese)

Currently translated at 100.0% (1645 of 1645 strings)

translationBot(ui): update translation (Vietnamese)

Currently translated at 100.0% (1645 of 1645 strings)

translationBot(ui): update translation (Vietnamese)

Currently translated at 100.0% (1645 of 1645 strings)

Co-authored-by: Linos <linos.coding@gmail.com>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/vi/
Translation: InvokeAI/Web UI
2025-01-07 19:32:50 +11:00
Ryan Dick
0258b6a04f Partial Loading PR5: Dynamic cache ram/vram limits (#7509)
## Summary

This PR enables RAM/VRAM cache size limits to be determined dynamically
based on availability.

**Config Changes**

This PR modifies the app configs in the following ways:
- A new `device_working_mem_gb` config was added. This is the amount of
non-model working memory to keep available on the execution device (i.e.
GPU) when using dynamic cache limits. It defaults to 3 GB.
- The `ram` and `vram` configs now default to `None`. If these configs
are set, they will take precedence over the dynamic limits. **Note: Some
users may have previously overridden the `ram` and `vram` values in their
`invokeai.yaml`. They will need to remove these configs to enable the
new dynamic limit feature.**
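
A rough sketch of the dynamic-limit idea follows; the function names and exact formula are illustrative assumptions, not this PR's implementation:

```python
import psutil
import torch

DEVICE_WORKING_MEM_GB = 3.0  # mirrors the new config default described above


def dynamic_vram_limit_gb(device: torch.device) -> float:
    """Illustrative VRAM limit: free device memory minus the working-memory reservation."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    return max(free_bytes / 2**30 - DEVICE_WORKING_MEM_GB, 0.0)


def dynamic_ram_limit_gb() -> float:
    """Illustrative RAM limit based on currently available system memory."""
    return psutil.virtual_memory().available / 2**30
```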

**Working Memory**

In addition to the new `device_working_mem_gb` config described above,
memory-intensive operations can estimate the amount of working memory
that they will need and request it from the model cache. This is
currently applied to the VAE decoding step for all models. In the
future, we may apply this to other operations as we work out which ops
tend to exceed the default working memory reservation.
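
For a concrete picture of what such an estimate might look like (the names and constants below are illustrative assumptions; the actual estimates are tuned per model family in later commits):

```python
def estimate_vae_decode_working_memory_bytes(
    latent_height: int,
    latent_width: int,
    scale_factor: int = 8,  # latent-to-pixel upscale for SD-family VAEs
    dtype_bytes: int = 2,  # fp16/bf16 elements
    activation_multiplier: int = 32,  # assumed rough cost of intermediate activations
) -> int:
    """Illustrative working-memory estimate for a single VAE decode."""
    out_pixels = (latent_height * scale_factor) * (latent_width * scale_factor)
    estimate = out_pixels * 3 * dtype_bytes * activation_multiplier
    return int(estimate * 1.2)  # 20% safety buffer, as in commit cba6528ea7 below
```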

**Mitigations for https://github.com/invoke-ai/InvokeAI/issues/7513**

This PR includes some mitigations for the issue described in
https://github.com/invoke-ai/InvokeAI/issues/7513. Without these
mitigations, it would occur with higher frequency when dynamic RAM
limits are used and the RAM is close to maxed-out.

## Limitations / Future Work

- Only _models_ can be offloaded to RAM to conserve VRAM. I.e. if VAE
decoding requires more working VRAM than available, the best we can do
is keep the full model on the CPU, but we will still hit an OOM error.
In the future, we could detect this ahead of time and switch to running
inference on the CPU for those ops.
- There is often a non-negligible amount of VRAM 'reserved' by the torch
CUDA allocator but not used by any allocated tensors. We may be able to
tune the torch CUDA allocator to work better for our use case (see the
example after this list). Reference:
https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf
- There may be some ops that require high working memory that haven't
been updated to request extra memory yet. We will update these as we
uncover them.
- If a model is 'locked' in VRAM, it won't be partially unloaded if a
later model load requests extra working memory. This should be uncommon,
but I can think of cases where it would matter.
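
As a usage example for the allocator-tuning bullet above: the `PYTORCH_CUDA_ALLOC_CONF` options shown are documented in the linked PyTorch notes, but whether they help this workload is untested here.

```python
import os

# Must be set before CUDA is initialized, i.e. before the first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True,max_split_size_mb:512"

import torch  # noqa: E402

torch.ones(1, device="cuda")  # the allocator now runs with the options above
```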

## Related Issues / Discussions

- #7492 
- #7494 
- #7500 
- #7505 

## QA Instructions

Run a variety of models near the cache limits to ensure that model
switching works properly for the following configurations:
- [x] CUDA, `enable_partial_loading=true`, all other configs default
(i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=true`, CPU and CUDA memory reserved
in another process so there is limited RAM/VRAM remaining, all other
configs default (i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=false`, all other configs default
(i.e. dynamic memory limits)
- [x] CUDA, ram/vram limits set (these should take precedence over the
dynamic limits)
- [x] MPS, all other configs default (i.e. dynamic memory limits)
- [x] CPU, all other configs default (i.e. dynamic memory limits)

## Merge Plan

- [x] Merge #7505 first and change target branch to main

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
2025-01-07 00:35:39 -05:00
Ryan Dick
87fdcb7f6f Partial Loading PR4: Enable partial loading (behind config flag) (#7505)
## Summary

This PR adds support for partial loading of models onto the GPU. This
enables models to run with much lower peak VRAM requirements (e.g. full
FLUX dev with 8GB of VRAM).

The partial loading feature is enabled behind a new config flag:
`enable_partial_loading=True`. This flag defaults to `False`.

**Note about performance:**
The `ram` and `vram` config limits are still applied when
`enable_partial_loading=True` is set. This can result in significant
slowdowns compared to the 'old' behaviour. Consider the case where the
VRAM limit is set to `vram=0.75` (GB) and we are trying to run an 8GB
model. When `enable_partial_loading=False`, we attempt to load the
entire model into VRAM, and if it fits (no OOM error) then it will run
at full speed. When `enable_partial_loading=True`, since we have the
option to partially load the model we will only load 0.75 GB into VRAM
and leave the remaining 7.25 GB in RAM. This will cause inference to be
much slower than before. To work around this, it is important that your
`ram` and `vram` configs are carefully tuned. In a future PR, we will
add the ability to dynamically set the RAM/VRAM limits based on the
available memory / VRAM.
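
To make the arithmetic in this note concrete (a toy sketch using the numbers from the example above):

```python
def plan_partial_load(model_size_gb: float, vram_limit_gb: float) -> tuple[float, float]:
    """Toy split: how much of a model lands in VRAM vs. stays in RAM."""
    in_vram = min(model_size_gb, vram_limit_gb)
    return in_vram, model_size_gb - in_vram


# The example from the note above: an 8 GB model with `vram: 0.75`.
assert plan_partial_load(8.0, 0.75) == (0.75, 7.25)
```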

## Related Issues / Discussions

- #7492 
- #7494 
- #7500

## QA Instructions

Tests with `enable_partial_loading=True`, `vram=2`, on CUDA device:
For all tests, we expect model memory to stay below 2 GB. Peak working
memory will be higher.
- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Tests with `enable_partial_loading=True` and a hack to force all models
to load only 10%, on CUDA device:
- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Tests with `enable_partial_loading=False`, `vram=30`:
We expect no change in behaviour when `enable_partial_loading=False`.
- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Other platforms:
- [x] No change in behavior on MPS, even if
`enable_partial_loading=True`.
- [x] No change in behavior on CPU-only systems, even if
`enable_partial_loading=True`.

## Merge Plan

- [x] Merge #7500 first, and change the target branch to main

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
2025-01-06 23:18:31 -05:00
Ryan Dick
d7ab464176 Offload the current model when locking if it is already partially loaded and we have insufficient VRAM. 2025-01-07 02:53:44 +00:00
Ryan Dick
5eafe1ec7a Fix ModelCache execution device selection in unit tests. 2025-01-07 01:20:15 +00:00
Ryan Dick
548b3eddb8 pnpm typegen 2025-01-07 01:20:15 +00:00
Ryan Dick
5b42b7bd45 Add a utility to help with determining the working memory required for expensive operations. 2025-01-07 01:20:15 +00:00
Ryan Dick
71b97ce7be Reduce the likelihood of encountering https://github.com/invoke-ai/InvokeAI/issues/7513 by eliminating places where the door was left open for this to happen. 2025-01-07 01:20:15 +00:00
Ryan Dick
b343f81644 Use torch.cuda.memory_allocated() rather than torch.cuda.memory_reserved() to be more conservative in setting dynamic VRAM cache limits. 2025-01-07 01:20:15 +00:00
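For reference, the distinction behind this change (both are standard torch APIs; the snippet assumes a CUDA device and the comments are a summary, not project code):

```python
import torch

# memory_allocated() counts bytes occupied by live tensors; memory_reserved()
# additionally counts blocks the caching allocator holds but is not currently
# using, so it reads higher. The commit above sizes limits from the former.
allocated = torch.cuda.memory_allocated()  # bytes in use by tensors
reserved = torch.cuda.memory_reserved()    # bytes reserved by the allocator
assert reserved >= allocated
```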
Ryan Dick
4abfb35321 Tune SD3 VAE decode working memory estimate. 2025-01-07 01:20:15 +00:00
Ryan Dick
cba6528ea7 Add a 20% buffer to all VAE decode working memory estimates. 2025-01-07 01:20:15 +00:00
Ryan Dick
6a5cee61be Tune the working memory estimate for FLUX VAE decoding. 2025-01-07 01:20:15 +00:00
Ryan Dick
bd8017ecd5 Update working memory estimate for VAE decoding when tiling is being applied. 2025-01-07 01:20:15 +00:00
Ryan Dick
299eb94a05 Estimate the working memory required for VAE decoding, since this operation tends to be memory-intensive. 2025-01-07 01:20:15 +00:00
Ryan Dick
fc4a22fe78 Allow expensive operations to request more working memory. 2025-01-07 01:20:13 +00:00
Ryan Dick
a167632f09 Calculate model cache size limits dynamically based on the available RAM / VRAM. 2025-01-07 01:14:20 +00:00
Ryan Dick
1321fac8f2 Remove get_cache_size() and set_cache_size() endpoints. These were unused by the frontend and refer to cache fields that are no longer accessible. 2025-01-07 01:06:20 +00:00
Ryan Dick
6a9de1fcf3 Change definition of VRAM in use for the ModelCache from sum of model weights to the total torch.cuda.memory_allocated(). 2025-01-07 00:31:53 +00:00
Ryan Dick
e5180c4e6b Add get_effective_device(...) utility to aid in determining the effective device of models that are partially loaded. 2025-01-07 00:31:00 +00:00
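A hedged sketch of what such a utility might look like (the body is an assumption; only the name comes from the commit):

```python
import torch


def get_effective_device(model: torch.nn.Module) -> torch.device:
    """Illustrative: pick a representative device for a partially loaded model.

    With partial loading, parameters can span devices, so a single
    `model.device` is no longer meaningful (see commit bcd29c5d74 below).
    """
    for param in model.parameters():
        if param.device.type != "cpu":
            return param.device
    return torch.device("cpu")
```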
Ryan Dick
2619ef53ca Handle device casting in ia2_layer.py. 2025-01-07 00:31:00 +00:00
Ryan Dick
bcd29c5d74 Remove all cases where we check the 'model.device'. This is no longer trustworthy now that partial loading is permitted. 2025-01-07 00:31:00 +00:00
Ryan Dick
1b7bb70bde Improve handling of cases when application code modifies the size of a model after registering it with the model cache. 2025-01-07 00:31:00 +00:00
Ryan Dick
402dd840a1 Add seed to flaky unit test. 2025-01-07 00:31:00 +00:00
Ryan Dick
7127040c3a Remove unused function set_nested_attr(...). 2025-01-07 00:31:00 +00:00
Ryan Dick
ceb2498a67 Add log prefix to model cache logs. 2025-01-07 00:31:00 +00:00
Ryan Dick
d0bfa019be Add 'enable_partial_loading' config flag. 2025-01-07 00:31:00 +00:00
Ryan Dick
535e45cedf First pass at adding partial loading support to the ModelCache. 2025-01-07 00:30:58 +00:00
Ryan Dick
782ee7a0ec Partial Loading PR 3.5: Fix premature model drops from the RAM cache (#7522)
## Summary

This is an unplanned fix between PR3 and PR4 in the sequence of partial
loading (i.e. low-VRAM) PRs. This PR restores the 'Current Workaround'
documented in https://github.com/invoke-ai/InvokeAI/issues/7513. In
other words, to work around a flaw in the model cache API, this fix
allows models to be loaded into VRAM _even if_ they have been dropped
from the RAM cache.

This PR also adds an info log each time that this workaround is hit. In
a future PR (#7509), we will eliminate the places in the application
code that are capable of triggering this condition.
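
A simplified sketch of the workaround's shape (all names are illustrative; the real ModelCache API differs):

```python
import logging

logger = logging.getLogger(__name__)


class ModelCacheSketch:
    """Illustrative only: lock a model into VRAM even if its RAM entry is gone."""

    def __init__(self) -> None:
        self._ram_entries: dict[str, object] = {}

    def lock(self, key: str, fallback_model: object) -> object:
        entry = self._ram_entries.get(key)
        if entry is None:
            # Before this fix, the missing entry surfaced as a KeyError
            # (https://github.com/invoke-ai/InvokeAI/issues/7513).
            logger.info("Model %s was dropped from the RAM cache; loading into VRAM anyway.", key)
            return fallback_model
        return entry
```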

## Related Issues / Discussions

- #7492 
- #7494
- #7500 
- https://github.com/invoke-ai/InvokeAI/issues/7513

## QA Instructions

- Set RAM cache limit to a small value. E.g. `ram: 4`
- Run FLUX text-to-image with the full T5 encoder, which exceeds 4GB.
This will trigger the error condition.
- Before the fix, this test configuration would cause a `KeyError`.
After the fix, we should see an info-level log explaining that the
condition was hit, and generation should continue successfully.

## Merge Plan

No special instructions.

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
2025-01-06 19:05:48 -05:00
Ryan Dick
c579a218ef Allow models to be locked in VRAM, even if they have been dropped from the RAM cache (related: https://github.com/invoke-ai/InvokeAI/issues/7513). 2025-01-06 23:02:52 +00:00
Riku
f4f7415a3b fix(app): remove obsolete DEFAULT_PRECISION variable 2025-01-06 11:14:58 +11:00
Mary Hipp
7d6c443d6f fix(api): limit board_name length to 300 characters 2025-01-06 10:49:49 +11:00
psychedelicious
868e06eb8b tests: fix test_model_install.py 2025-01-03 11:21:23 -05:00
psychedelicious
40e4dbe1fb docs: add blurb about setting a HF token when downloading HF models by URL and not repo id 2025-01-03 11:21:23 -05:00
psychedelicious
4815b4ea80 feat(ui): tweak verbiage for model install errors 2025-01-03 11:21:23 -05:00
psychedelicious
d77a6ccd76 fix(ui): model install error toasts not updating correctly 2025-01-03 11:21:23 -05:00
psychedelicious
3e860c8338 feat(ui): starter models filter works with model base
For example, "flux" now matches any starter model with a model base of "FLUX".
2025-01-03 11:21:23 -05:00
psychedelicious
4f2ef7ce76 refactor(ui): handle hf vs civitai/other url model install errors separately
Previously, we didn't differentiate between model install errors for different types of model install sources, resulting in a buggy UX:
- If an HF model install failed, but it was an HF URL install and not a repo id install, the link to the HF model page was incorrect.
- If a non-HF URL install (e.g. civitai) failed, we treated it as an HF URL install. In this case, if the user's HF token was invalid or unset, we directed the user to set it. If the HF token was valid, we displayed an empty red toast. If it's not an HF URL install, neither of these is correct.

Also, the logic for handling the toasts was a bit complicated.

This change does a few things:
- Consolidate the model install error toasts into one place - the socket.io event handler for the model install error event. There is no more global state for the toasts and there are no hooks managing them.
- Handle the different error cases, including all combinations of HF/non-HF sources and unauthorized/forbidden/unknown errors (see the sketch below).
2025-01-03 11:21:23 -05:00
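An illustrative decision table for these cases, sketched in Python for brevity (the real handler is TypeScript, and the source-type names and messages are placeholders):

```python
def toast_message(source_type: str, http_status: int | None) -> str:
    """Illustrative mapping of install-error cases to user-facing guidance."""
    if source_type in ("hf_repo_id", "hf_url") and http_status in (401, 403):
        return "Hugging Face rejected the request: check or set your HF token."
    if source_type in ("hf_repo_id", "hf_url"):
        return "Install from Hugging Face failed: see the model page for details."
    if source_type == "url":  # e.g. civitai; never direct the user to HF here
        return "Download failed: the source may require login or be unavailable."
    return "Model install failed due to an unknown error."
```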
psychedelicious
d7e9ad52f9 chore(ui): typegen 2025-01-03 11:21:23 -05:00
psychedelicious
b6d7a44004 refactor(events): include full model source in model install events
This is required to fix an issue with the MM UI's error handling.

Previously, we only included the model source as a string. That could be an arbitrary URL, file path or HF repo id, but the frontend has no parsing logic to differentiate between these different model sources.

Without access to the type of model source, it is difficult to determine how the user should proceed. For example, if it's an HF URL with an HTTP unauthorized error, we should direct the user to log in to HF. But if it's a civitai URL with the same error, we should not direct the user to HF.

There are a variety of related edge cases.

With this change, the full `ModelSource` object is included in each model install event, including error events.

I had to fix some circular import issues, hence the import changes to files other than `events_common.py`.
2025-01-03 11:21:23 -05:00
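A minimal sketch of the idea (type and field names are assumptions, not the project's exact schema): carry a typed source object in the event, instead of a bare string, so the frontend can branch on its kind.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass
class HFModelSource:
    repo_id: str
    type: Literal["hf"] = "hf"


@dataclass
class URLModelSource:
    url: str
    type: Literal["url"] = "url"


ModelSource = Union[HFModelSource, URLModelSource]


@dataclass
class ModelInstallErrorEvent:
    source: ModelSource  # the full object, not str(source)
    error_type: str
    error: str
```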
psychedelicious
e18100ae7e refactor(ui): move model install error event handling to own file
No logic change.
2025-01-03 11:21:23 -05:00
psychedelicious
ad0aa0e6b2 feat(ui): reset canvas layers only resets the layers 2025-01-03 11:02:04 -05:00
psychedelicious
157b92e0fd docs: no need to specify version for dev env setup 2025-01-03 10:59:39 -05:00
psychedelicious
fd838ad9d4 docs: update dev env docs to mirror the launcher's install method 2025-01-03 14:27:45 +11:00
psychedelicious
5e9227c052 docs: update manual install docs to mirror the launcher's install method 2025-01-03 14:27:45 +11:00
Kent Keirsey
94785231ce Update href to correct link 2025-01-02 09:39:41 +11:00