Commit Graph

15270 Commits

Author SHA1 Message Date
psychedelicious
e09cf64779 feat: more updates to first run view 2025-01-09 11:20:05 +11:00
psychedelicious
fc8cf224ca docs: typo 2025-01-09 11:20:05 +11:00
psychedelicious
3e1ed18a1f Update docs/features/low-vram.md
Co-authored-by: Ryan Dick <ryanjdick3@gmail.com>
2025-01-09 11:20:05 +11:00
psychedelicious
9a84c85486 docs: add section about disabling the sysmem fallback 2025-01-09 11:20:05 +11:00
psychedelicious
e6deaa2d2f feat(ui): minor layout tweaks for first run screen 2025-01-09 11:20:05 +11:00
psychedelicious
5246b31347 feat(ui): add low vram link to first run page 2025-01-09 11:20:05 +11:00
psychedelicious
b15dd00840 docs: add docs for low vram mode 2025-01-09 11:20:05 +11:00
psychedelicious
8808c36028 docs: update example yaml file 2025-01-09 11:20:05 +11:00
psychedelicious
89b576f10d fix(ui): prevent canvas & main panel content from scrolling
Hopefully fixes issues where, when run via the launcher, the main panel kinda just scrolls out of bounds.
2025-01-09 09:14:22 +11:00
psychedelicious
d7893a52c3 tweak(ui): whats new copy 2025-01-08 15:26:26 +11:00
Mary Hipp
b9c45c3232 Whats new update 2025-01-08 15:26:26 +11:00
David Burnett
afc9d3b98f more ruff formatting 2025-01-07 20:18:19 -05:00
David Burnett
7ddc757bdb ruff format changes 2025-01-07 20:18:19 -05:00
David Burnett
d8da9b45cc Fix for DEIS / DPM clash 2025-01-07 20:18:19 -05:00
Ryan Dick
607d19f4dd We should not trust the model's reported device, since the model could be partially loaded. 2025-01-07 19:22:31 -05:00
psychedelicious
32286f321c docs: note that version is not req for editable install 2025-01-07 17:17:40 -05:00
psychedelicious
03f7bdc9f9 docs: fix manual install rocm pypi indices 2025-01-07 17:17:40 -05:00
Ryan Dick
4df3d0861b Deprecate ram/vram configs for smoother migration path to dynamic limits (#7526)
## Summary

Changes:
- Deprecate `ram` and `vram` configs. If these are set in invokeai.yaml,
they will be ignored.
- Create new `max_cache_ram_gb` and `max_cache_vram_gb` configs with the
same definitions as the old configs.

The main motivation of this change is to make the migration path
smoother for users who had previously added `ram`/`vram` to their
config files. Now, these users will be automatically migrated into the
new dynamic limit behavior (which is better in most cases). These users
will have to manually re-add `max_cache_ram_gb` and `max_cache_vram_gb`
to their configs if they wish to go back to specifying manual limits.
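
Below is a minimal Python sketch of the migration behavior described above. It is illustrative only, not the actual InvokeAI config code; only the key names (`ram`, `vram`, `max_cache_ram_gb`, `max_cache_vram_gb`) come from this PR.

```python
import warnings

# Key names are from this PR; the helper itself is hypothetical.
_DEPRECATED_CACHE_KEYS = {"ram": "max_cache_ram_gb", "vram": "max_cache_vram_gb"}


def migrate_cache_config(config: dict) -> dict:
    """Drop deprecated cache keys so the dynamic limits take effect."""
    migrated = dict(config)
    for old_key, new_key in _DEPRECATED_CACHE_KEYS.items():
        if old_key in migrated:
            warnings.warn(
                f"'{old_key}' is deprecated and ignored; set '{new_key}' "
                "to restore a manual cache limit."
            )
            migrated.pop(old_key)  # ignored -> dynamic limits apply
    return migrated
```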

## Related Issues / Discussions

See the release notes for RC v5.6.0rc1 for the old migration behavior
that we are trying to improve:
https://github.com/invoke-ai/InvokeAI/releases/tag/v5.6.0rc1

## QA Instructions

- [x] Test that if `ram` or `vram` are present in a user's
`invokeai.yaml`, these values are ignored.
- [x] Test that `max_cache_ram_gb` and `max_cache_vram_gb` are applied,
if set.

## Merge Plan

- Don't forget to update the RC release notes accordingly.

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
2025-01-07 17:03:11 -05:00
Ryan Dick
974b4671b1 Deprecate the ram and vram configs to make the migration to dynamic
memory limits smoother for users who had previously overridden these
values.
2025-01-07 16:45:29 +00:00
Ryan Dick
6b18f270dd Bugfix: Offload of GGML-quantized model in torch.inference_mode() cm (#7525)
## Summary

This PR contains a bugfix for an edge case with model unloading (from
VRAM to RAM). Thanks to @JPPhoto for finding it.

The bug was triggered under the following conditions:
- A GGML-quantized model is loaded in VRAM
- We run a Spandrel image-to-image invocation (which is wrapped in a
`torch.inference_mode()` context manager).
- The model cache attempts to unload the GGML-quantized model from VRAM
to RAM.
- Doing this inside the `torch.inference_mode()` context manager results in
the following error:
```
 [2025-01-07 15:48:17,744]::[InvokeAI]::ERROR --> Error while invoking session 98a07259-0c03-4111-a8d8-107041cb86f9, invocation d8daa90b-7e4c-4fc4-807c-50ba9be1a4ed (spandrel_image_to_image): Cannot set version_counter for inference tensor
[2025-01-07 15:48:17,744]::[InvokeAI]::ERROR --> Traceback (most recent call last):
  File "/home/ryan/src/InvokeAI/invokeai/app/services/session_processor/session_processor_default.py", line 129, in run_node
    output = invocation.invoke_internal(context=context, services=self._services)
  File "/home/ryan/src/InvokeAI/invokeai/app/invocations/baseinvocation.py", line 300, in invoke_internal
    output = self.invoke(context)
  File "/home/ryan/.pyenv/versions/3.10.14/envs/InvokeAI_3.10.14/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/ryan/src/InvokeAI/invokeai/app/invocations/spandrel_image_to_image.py", line 167, in invoke
    with context.models.load(self.image_to_image_model) as spandrel_model:
  File "/home/ryan/src/InvokeAI/invokeai/backend/model_manager/load/load_base.py", line 60, in __enter__
    self._cache.lock(self._cache_record, None)
  File "/home/ryan/src/InvokeAI/invokeai/backend/model_manager/load/model_cache/model_cache.py", line 224, in lock
    self._load_locked_model(cache_entry, working_mem_bytes)
  File "/home/ryan/src/InvokeAI/invokeai/backend/model_manager/load/model_cache/model_cache.py", line 272, in _load_locked_model
    vram_bytes_freed = self._offload_unlocked_models(model_vram_needed, working_mem_bytes)
  File "/home/ryan/src/InvokeAI/invokeai/backend/model_manager/load/model_cache/model_cache.py", line 458, in _offload_unlocked_models
    cache_entry_bytes_freed = self._move_model_to_ram(cache_entry, vram_bytes_to_free)
  File "/home/ryan/src/InvokeAI/invokeai/backend/model_manager/load/model_cache/model_cache.py", line 330, in _move_model_to_ram
    return cache_entry.cached_model.partial_unload_from_vram(
  File "/home/ryan/.pyenv/versions/3.10.14/envs/InvokeAI_3.10.14/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/ryan/src/InvokeAI/invokeai/backend/model_manager/load/model_cache/cached_model/cached_model_with_partial_load.py", line 182, in partial_unload_from_vram
    cur_state_dict = self._model.state_dict()
  File "/home/ryan/.pyenv/versions/3.10.14/envs/InvokeAI_3.10.14/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1939, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/home/ryan/.pyenv/versions/3.10.14/envs/InvokeAI_3.10.14/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1936, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/home/ryan/.pyenv/versions/3.10.14/envs/InvokeAI_3.10.14/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1843, in _save_to_state_dict
    destination[prefix + name] = param if keep_vars else param.detach()
RuntimeError: Cannot set version_counter for inference tensor
```

### Explanation

From the `torch.inference_mode()` docs:
> Code run under this mode gets better performance by disabling view
tracking and version counter bumps.

Disabling version counter bumps results in the aforementioned error when
saving `GGMLTensor`s to a state_dict.

This incompatibility between `GGMLTensor`s and `torch.inference_mode()`
is likely caused by the custom tensor type implementation. There may
very well be a way to get these to cooperate, but for now it is much
simpler to remove the `torch.inference_mode()` contexts.

Note that there are several other uses of `torch.inference_mode()` in
the Invoke codebase, but they are all tight wrappers around the
inference forward pass and do not contain the model load/unload process.
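
As a minimal sketch of the fix described above (illustrative only; the actual change edits the Invoke code paths named in the traceback), the context manager around load-sensitive work is swapped from `torch.inference_mode()` to `torch.no_grad()`:

```python
import torch


def run_without_autograd(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    # torch.no_grad() disables gradient tracking like torch.inference_mode(),
    # but it does not mark outputs as "inference tensors", so a later
    # state_dict()/detach() on a custom tensor subclass (e.g. a GGML-quantized
    # weight) during model offload is not hit by the version-counter error above.
    with torch.no_grad():
        return model(x)
```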

## Related Issues / Discussions

Original discussion:
https://discord.com/channels/1020123559063990373/1149506274971631688/1326180753159094303

## QA Instructions

Find a sequence of operations that triggers the condition. For me, this
was:
- Reserve VRAM in a separate process so that there was ~12GB left.
- Fresh start of Invoke
- Run FLUX inference with a GGML 8K model
- Run Spandrel upscaling

Tests:
- [x] Confirmed that I can reproduce the error and that it is no longer
hit after the change
- [x] Confirm that there is no speed regression from switching from
`torch.inference_mode()` to `torch.no_grad()`.
    - Before: `50.354s`, After: `51.536s`


## Checklist

- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
2025-01-07 11:31:20 -05:00
Ryan Dick
85eb4f0312 Fix an edge case with model offloading from VRAM to RAM. If a GGML-quantized model is offloaded from VRAM inside of a torch.inference_mode() context manager, this will cause the following error: 'RuntimeError: Cannot set version_counter for inference tensor'. 2025-01-07 15:59:50 +00:00
psychedelicious
67e948b50d chore: bump version to v5.6.0rc1 v5.6.0rc1 2025-01-07 19:41:56 +11:00
Riccardo Giovanetti
d9a20f319f translationBot(ui): update translation (Italian)
Currently translated at 99.3% (1639 of 1649 strings)

Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/
Translation: InvokeAI/Web UI
2025-01-07 19:32:50 +11:00
Riku
38d4863e09 translationBot(ui): update translation (German)
Currently translated at 71.7% (1181 of 1645 strings)

Co-authored-by: Riku <riku.block@gmail.com>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/de/
Translation: InvokeAI/Web UI
2025-01-07 19:32:50 +11:00
Nik Nikovsky
cd7ba14adc translationBot(ui): update translation (Polish)
Currently translated at 16.5% (273 of 1645 strings)

translationBot(ui): update translation (Polish)

Currently translated at 15.4% (254 of 1645 strings)

translationBot(ui): update translation (Polish)

Currently translated at 10.8% (178 of 1645 strings)

Co-authored-by: Nik Nikovsky <zejdzztegomaila@gmail.com>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/pl/
Translation: InvokeAI/Web UI
2025-01-07 19:32:50 +11:00
Linos
e5b6beb24d translationBot(ui): update translation (Vietnamese)
Currently translated at 100.0% (1649 of 1649 strings)

translationBot(ui): update translation (Vietnamese)

Currently translated at 100.0% (1645 of 1645 strings)

translationBot(ui): update translation (Vietnamese)

Currently translated at 100.0% (1645 of 1645 strings)

translationBot(ui): update translation (Vietnamese)

Currently translated at 100.0% (1645 of 1645 strings)

Co-authored-by: Linos <linos.coding@gmail.com>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/vi/
Translation: InvokeAI/Web UI
2025-01-07 19:32:50 +11:00
Ryan Dick
0258b6a04f Partial Loading PR5: Dynamic cache ram/vram limits (#7509)
## Summary

This PR enables RAM/VRAM cache size limits to be determined dynamically
based on availability.
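
As a rough illustration of the idea (a sketch only, not the code in this PR), a dynamic VRAM limit can be derived from what CUDA reports as free, minus the working-memory reservation described below; the function name is hypothetical, and the 3 GB default matches `device_working_mem_gb`:

```python
import torch


def dynamic_vram_limit_bytes(device_working_mem_gb: float = 3.0) -> int:
    """Sketch: size the VRAM cache from currently-free device memory,
    holding back `device_working_mem_gb` for non-model working memory."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    working_mem_bytes = int(device_working_mem_gb * 1024**3)
    return max(free_bytes - working_mem_bytes, 0)
```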

**Config Changes**

This PR modifies the app configs in the following ways:
- A new `device_working_mem_gb` config was added. This is the amount of
non-model working memory to keep available on the execution device (i.e.
GPU) when using dynamic cache limits. It defaults to 3GB.
- The `ram` and `vram` configs now default to `None`. If these configs
are set, they will take precedence over the dynamic limits. **Note: Some
users may have previously overridden the `ram` and `vram` values in their
`invokeai.yaml`. They will need to remove these configs to enable the
new dynamic limit feature.**

**Working Memory**

In addition to the new `device_working_mem_gb` config described above,
memory-intensive operations can estimate the amount of working memory
that they will need and request it from the model cache. This is
currently applied to the VAE decoding step for all models. In the
future, we may apply this to other operations as we work out which ops
tend to exceed the default working memory reservation.
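
The sketch below shows the general shape of such a per-operation request for VAE decoding. The helper name and the `activation_scale` constant are hypothetical; the idea of scaling with output size and adding a 20% buffer comes from the commits in this log.

```python
def estimate_vae_decode_working_memory_bytes(
    height: int,
    width: int,
    dtype_bytes: int = 4,
    activation_scale: float = 32.0,  # assumed stand-in for intermediate activations
) -> int:
    """Illustrative sketch only: working memory grows with output resolution.

    The 1.2 factor mirrors the 20% buffer added to these estimates elsewhere
    in this history.
    """
    return int(height * width * dtype_bytes * activation_scale * 1.2)
```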

**Mitigations for https://github.com/invoke-ai/InvokeAI/issues/7513**

This PR includes some mitigations for the issue described in
https://github.com/invoke-ai/InvokeAI/issues/7513. Without these
mitigations, it would occur with higher frequency when dynamic RAM
limits are used and the RAM is close to maxed-out.

## Limitations / Future Work

- Only _models_ can be offloaded to RAM to conserve VRAM. I.e. if VAE
decoding requires more working VRAM than available, the best we can do
is keep the full model on the CPU, but we will still hit an OOM error.
In the future, we could detect this ahead of time and switch to running
inference on the CPU for those ops.
- There is often a non-negligible amount of VRAM 'reserved' by the torch
CUDA allocator, but not used by any allocated tensors. We may be able to
tune the torch CUDA allocator to work better for our use case (see the
allocator example after this list). Reference:
https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf
- There may be some ops that require high working memory that haven't
been updated to request extra memory yet. We will update these as we
uncover them.
- If a model is 'locked' in VRAM, it won't be partially unloaded if a
later model load requests extra working memory. This should be uncommon,
but I can think of cases where it would matter.
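
Regarding the allocator bullet above: PyTorch exposes allocator tuning through the `PYTORCH_CUDA_ALLOC_CONF` environment variable described in the linked docs. A minimal example of one documented knob, shown only as an illustration of the kind of tuning that could be explored:

```python
import os

# Conventionally set before importing torch, so the CUDA caching allocator
# picks it up when it initializes. `expandable_segments` is one of the options
# described in the PyTorch notes linked above; whether it helps Invoke's
# workload is exactly the open question in the bullet above.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```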

## Related Issues / Discussions

- #7492 
- #7494 
- #7500 
- #7505 

## QA Instructions

Run a variety of models near the cache limits to ensure that model
switching works properly for the following configurations:
- [x] CUDA, `enable_partial_loading=true`, all other configs default
(i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=true`, CPU and CUDA memory reserved
in another process so there is limited RAM/VRAM remaining, all other
configs default (i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=false`, all other configs default
(i.e. dynamic memory limits)
- [x] CUDA, ram/vram limits set (these should take precedence over the
dynamic limits)
- [x] MPS, all other configs default (i.e. dynamic memory limits)
- [x] CPU, all other configs default (i.e. dynamic memory limits)

## Merge Plan

- [x] Merge #7505 first and change target branch to main

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
2025-01-07 00:35:39 -05:00
Ryan Dick
87fdcb7f6f Partial Loading PR4: Enable partial loading (behind config flag) (#7505)
## Summary

This PR adds support for partial loading of models onto the GPU. This
enables models to run with much lower peak VRAM requirements (e.g. full
FLUX dev with 8GB of VRAM).

The partial loading feature is enabled behind a new config flag:
`enable_partial_loading=True`. This flag defaults to `False`.

**Note about performance:**
The `ram` and `vram` config limits are still applied when
`enable_partial_loading=True` is set. This can result in significant
slowdowns compared to the 'old' behaviour. Consider the case where the
VRAM limit is set to `vram=0.75` (GB) and we are trying to run an 8GB
model. When `enable_partial_loading=False`, we attempt to load the
entire model into VRAM, and if it fits (no OOM error) then it will run
at full speed. When `enable_partial_loading=True`, since we have the
option to partially load the model we will only load 0.75 GB into VRAM
and leave the remaining 7.25 GB in RAM. This will cause inference to be
much slower than before. To work around this, it is important that your
`ram` and `vram` configs are carefully tuned. In a future PR, we will
add the ability to dynamically set the RAM/VRAM limits based on the
available memory / VRAM.
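
The trade-off described above is simple arithmetic; a minimal sketch (not InvokeAI's actual loader) of how the split falls out:

```python
def plan_partial_load(model_size_gb: float, vram_limit_gb: float) -> tuple[float, float]:
    """Return (gb_on_gpu, gb_left_in_ram) for a partially loaded model."""
    gb_on_gpu = min(model_size_gb, vram_limit_gb)
    return gb_on_gpu, model_size_gb - gb_on_gpu


# The example from the note above: an 8 GB model with vram=0.75
# -> (0.75, 7.25): only 0.75 GB runs from VRAM, the rest stays in RAM.
print(plan_partial_load(8.0, 0.75))
```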

## Related Issues / Discussions

- #7492 
- #7494 
- #7500

## QA Instructions

Tests with `enable_partial_loading=True`, `vram=2`, on CUDA device:
For all tests, we expect model memory to stay below 2 GB. Peak working
memory will be higher.
- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Tests with `enable_partial_loading=True`, and hack to force all models
to load 10%, on CUDA device:
- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Tests with `enable_partial_loading=False`, `vram=30`:
We expect no change in behaviour when  `enable_partial_loading=False`.
- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Other platforms:
- [x] No change in behavior on MPS, even if
`enable_partial_loading=True`.
- [x] No change in behavior on CPU-only systems, even if
`enable_partial_loading=True`.

## Merge Plan

- [x] Merge #7500 first, and change the target branch to main

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
2025-01-06 23:18:31 -05:00
Ryan Dick
d7ab464176 Offload the current model when locking if it is already partially loaded and we have insufficient VRAM. 2025-01-07 02:53:44 +00:00
Ryan Dick
5eafe1ec7a Fix ModelCache execution device selection in unit tests. 2025-01-07 01:20:15 +00:00
Ryan Dick
548b3eddb8 pnpm typegen 2025-01-07 01:20:15 +00:00
Ryan Dick
5b42b7bd45 Add a utility to help with determining the working memory required for expensive operations. 2025-01-07 01:20:15 +00:00
Ryan Dick
71b97ce7be Reduce the likelihood of encountering https://github.com/invoke-ai/InvokeAI/issues/7513 by eliminating places where the door was left open for this to happen. 2025-01-07 01:20:15 +00:00
Ryan Dick
b343f81644 Use torch.cuda.memory_allocated() rather than torch.cuda.memory_reserved() to be more conservative in setting dynamic VRAM cache limits. 2025-01-07 01:20:15 +00:00
Ryan Dick
4abfb35321 Tune SD3 VAE decode working memory estimate. 2025-01-07 01:20:15 +00:00
Ryan Dick
cba6528ea7 Add a 20% buffer to all VAE decode working memory estimates. 2025-01-07 01:20:15 +00:00
Ryan Dick
6a5cee61be Tune the working memory estimate for FLUX VAE decoding. 2025-01-07 01:20:15 +00:00
Ryan Dick
bd8017ecd5 Update working memory estimate for VAE decoding when tiling is being applied. 2025-01-07 01:20:15 +00:00
Ryan Dick
299eb94a05 Estimate the working memory required for VAE decoding, since this operation tends to be memory intensive. 2025-01-07 01:20:15 +00:00
Ryan Dick
fc4a22fe78 Allow expensive operations to request more working memory. 2025-01-07 01:20:13 +00:00
Ryan Dick
a167632f09 Calculate model cache size limits dynamically based on the available RAM / VRAM. 2025-01-07 01:14:20 +00:00
Ryan Dick
1321fac8f2 Remove get_cache_size() and set_cache_size() endpoints. These were unused by the frontend and refer to cache fields that are no longer accessible. 2025-01-07 01:06:20 +00:00
Ryan Dick
6a9de1fcf3 Change definition of VRAM in use for the ModelCache from sum of model weights to the total torch.cuda.memory_allocated(). 2025-01-07 00:31:53 +00:00
Ryan Dick
e5180c4e6b Add get_effective_device(...) utility to aid in determining the effective device of models that are partially loaded. 2025-01-07 00:31:00 +00:00
Ryan Dick
2619ef53ca Handle device casting in ia2_layer.py. 2025-01-07 00:31:00 +00:00
Ryan Dick
bcd29c5d74 Remove all cases where we check the 'model.device'. This is no longer trustworthy now that partial loading is permitted. 2025-01-07 00:31:00 +00:00
Ryan Dick
1b7bb70bde Improve handling of cases when application code modifies the size of a model after registering it with the model cache. 2025-01-07 00:31:00 +00:00
Ryan Dick
402dd840a1 Add seed to flaky unit test. 2025-01-07 00:31:00 +00:00
Ryan Dick
7127040c3a Remove unused function set_nested_attr(...). 2025-01-07 00:31:00 +00:00
Ryan Dick
ceb2498a67 Add log prefix to model cache logs. 2025-01-07 00:31:00 +00:00