InvokeAI

mirror of https://github.com/invoke-ai/InvokeAI.git synced 2026-02-03 01:24:56 -05:00

Author	SHA1	Message	Date
Ryan Dick	6bf5b747ce	Partial Loading PR2: Add utils to support partial loading of models from CPU to GPU (#7494)

## Summary

This PR adds utilities to support partial loading of models from CPU to GPU. The new utilities are not yet being used by the ModelCache, so there should be no functional behavior changes in this PR.

Detailed changes:

- Add autocast modules that are designed to wrap common `torch.nn.Module`s and enable them to run with automatic device casting. E.g. a linear layer on the CPU can be executed with an input tensor on the GPU by streaming the weights to the GPU at runtime.
- Add unit tests for the aforementioned autocast modules to verify that they work for all supported quantization formats (GGUF, BnB NF4, BnB LLM.int8()).
- Add `CachedModelWithPartialLoad` and `CachedModelOnlyFullLoad` classes to manage partial loading at the model level.

## Alternative Implementations

Several options were explored for supporting inference on partially-loaded models. The pros/cons of the explored options are summarized here for reference. In the end, wrapper modules were selected as the best overall solution for our use case.

Option 1: Re-implement the .forward() methods of modules to add support for device conversions

- This is the option implemented in this PR.
- This approach is the most manual of the three, but as a result offers the broadest compatibility with unusual model types. It is manual in that we have to explicitly add support for all module types that we wish to support. Fortunately, the list of foundational module types is relatively small (e.g. the current set of implemented layers covers all but 0.04 MB of the full FLUX model).

Option 2: Implement a custom Tensor type that casts tensors to a `target_device` each time the tensor is used

- This approach has the nice property that it is injected at the tensor level, and the model does not need to be modified in any way.
- One challenge with this approach is handling interactions with other custom tensor types (e.g. GGMLTensor). This problem is solvable, but definitely introduces a layer of complexity. (There are likely to also be some similar issues with interactions with the BnB quantization, but I didn't get as far as testing BnB.)

Option 3: Override the `__torch_function__` dispatch calls globally and cast all params to the execution device.

- This approach is nice and simple: just apply a global context manager and all operations will happen on the compute device regardless of the device of the participating tensors.
- Challenges:
  - Overriding the `__torch_function__` dispatch calls introduces some overhead even if the tensors are already on the correct device.
  - It is difficult to manage the autocasting context manager. E.g. it is tempting to apply it to the model's `.forward(...)` method, but we use some models with non-standard entrypoints. And we don't want to end up with nested autocasting context managers.
  - BnB applies quantization side effects when a param is moved to the GPU - this interacts in unexpected ways with a global context manager.

## QA Instructions

Most of the changes in this PR should not impact active code, and thus should not cause any changes to behavior. The main risks come from bumping the bitsandbytes dependency and some minor modifications to the bitsandbytes quantization code.

- [x] Regression test bitsandbytes NF4 quantization
- [x] Regression test bitsandbytes LLM.int8() quantization
- [x] Regression test on MacOS (to ensure that there are no lingering bitsandbytes import errors)

I also tested the new utilities for inference on full models in another branch to validate that there were not major issues. This functionality will be tested more thoroughly in a future PR.

## Merge Plan

- [x] #7492 should be merged first so that the target branch can be updated to main.

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_	2024-12-27 09:20:24 -05:00
Ryan Dick	7d6ab0ceb2	Add a CustomModuleMixin class with a flag for enabling/disabling autocasting (since it incurs some runtime speed overhead.)	2024-12-26 20:08:30 +00:00
Ryan Dick	9692a36dd6	Use a fixture to parameterize tests in test_all_custom_modules.py so that a fresh instance of the layer under test is initialized for each test.	2024-12-26 19:41:25 +00:00
Ryan Dick	b0b699a01f	Add unit test to test that isinstance(...) behaves as expected with custom module types.	2024-12-26 18:45:56 +00:00
Ryan Dick	a8b2c4c3d2	Add inference tests for all custom module types (i.e. to test autocasting from cpu to device).	2024-12-26 18:33:46 +00:00
Ryan Dick	03944191db	Split test_autocast_modules.py into separate test files to mirror the source file structure.	2024-12-24 22:29:11 +00:00
Ryan Dick	987c9ae076	Move custom autocast modules to separate files in a custom_modules/ directory.	2024-12-24 22:21:31 +00:00
Ryan Dick	6d7314ac0a	Consolidate the LayerPatching patching modes into a single implementation.	2024-12-24 15:57:54 +00:00
Ryan Dick	80db9537ff	Rename model_patcher.py -> layer_patcher.py.	2024-12-24 15:57:54 +00:00
Ryan Dick	6f926f05b0	Update apply_smart_model_patches() so that layer restore matches the behavior of non-smart mode.	2024-12-24 15:57:54 +00:00
Ryan Dick	61253b91f1	Enable LoRAPatcher.apply_smart_lora_patches(...) throughout the stack.	2024-12-24 15:57:54 +00:00
Ryan Dick	0148512038	(minor) Rename num_layers -> num_loras in unit tests.	2024-12-24 15:57:54 +00:00
Ryan Dick	d0f35fceed	Add test_apply_smart_lora_patches_to_partially_loaded_model(...).	2024-12-24 15:57:54 +00:00
Ryan Dick	cefcb340d9	Add LoRAPatcher.smart_apply_lora_patches()	2024-12-24 15:57:54 +00:00
Ryan Dick	0fc538734b	Skip flaky test when running on Github Actions, and further reduce peak unit test memory.	2024-12-24 14:32:11 +00:00
Ryan Dick	7214d4969b	Workaround a weird quirk of QuantState.to() and add a unit test to exercise it.	2024-12-24 14:32:11 +00:00
Ryan Dick	a83a999b79	Reduce peak memory used for unit tests.	2024-12-24 14:32:11 +00:00
Ryan Dick	f8a6accf8a	Fix bitsandbytes imports to avoid ImportErrors on MacOS.	2024-12-24 14:32:11 +00:00
Ryan Dick	f8ab414f99	Add CachedModelOnlyFullLoad to mirror the CachedModelWithPartialLoad for models that cannot or should not be partially loaded.	2024-12-24 14:32:11 +00:00
Ryan Dick	c6795a1b47	Make CachedModelWithPartialLoad work with models that have non-persistent buffers.	2024-12-24 14:32:11 +00:00
Ryan Dick	0a8fc74ae9	Add CachedModelWithPartialLoad to manage partially-loaded models using the new autocast modules.	2024-12-24 14:32:11 +00:00
Ryan Dick	dc54e8763b	Add CustomInvokeLinearNF4 to enable CPU -> GPU streaming for InvokeLinearNF4 layers.	2024-12-24 14:32:11 +00:00
Ryan Dick	1b56020876	Add CustomInvokeLinear8bitLt layer for device streaming with InvokeLinear8bitLt layers.	2024-12-24 14:32:11 +00:00
Ryan Dick	3f990393a1	Simplify the state management in InvokeLinear8bitLt and add unit tests. This is in preparation for wrapping it to support streaming of weights from cpu to gpu.	2024-12-24 14:32:11 +00:00
Ryan Dick	97d56f7dc9	Add torch module autocast unit test for GGUF-quantized models.	2024-12-24 14:32:11 +00:00
Ryan Dick	fe0ef2c27c	Add torch module autocast utilities.	2024-12-24 14:32:11 +00:00
Ryan Dick	65fcbf5f60	Bump bitsandbytes. The new version contains improvements to state_dict loading/saving for LLM.int8 and promises improved speed on some HW.	2024-12-24 14:32:11 +00:00
Ryan Dick	d3916dbdb6	Partial Loading PR1: Tidy ModelCache (#7492)

## Summary

This PR tidies up the model cache code in preparation for further refactoring to support partial loading of models onto the GPU. These code changes should not change the functional behavior in any way.

Changes:

- Remove the `ModelCacheBase` class. `ModelCache` is the only implementation, so there is no benefit to the separate abstract class.
- Split `CacheRecord` and `CacheStats` out into their own files.
- Remove the `ModelLocker` class. This extra layer of indirection was not providing any benefit. Locking is now done directly with the `ModelCache`.
- Tidy up relative imports that were contributing to circular import issues.
- Pull the 'submodel' concern out of the `ModelCache`. The `ModelCache` should not need to be aware of the model manager submodel system.
- Delete unused properties from the `ModelCache` (e.g. `.lazy_offloading`, `.storage_device`, etc.)

## QA Instructions

I ran smoke tests with a variety of SD1, SDXL and FLUX models. No change to behavior is expected.

## Merge Plan

<!--WHEN APPLICABLE: Large PRs, or PRs that touch sensitive things like DB schemas, may need some care when merging. For example, a careful rebase by the change author, timing to not interfere with a pending release, or a message to contributors on discord after merging.-->

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_	2024-12-24 09:30:44 -05:00
Ryan Dick	55b13c1da3	(minor) Add TODO comment regarding the location of get_model_cache_key().	2024-12-24 14:23:19 +00:00
Ryan Dick	7dc3e0fdbe	Get rid of ModelLocker. It was an unnecessary layer of indirection.	2024-12-24 14:23:18 +00:00
Ryan Dick	a39bcf7e85	Move lock(...) and unlock(...) logic from ModelLocker to the ModelCache and make a bunch of ModelCache properties/methods private.	2024-12-24 14:23:18 +00:00
Ryan Dick	a7c72992a6	Pull get_model_cache_key(...) out of ModelCache. The ModelCache should not be concerned with implementation details like the submodel_type.	2024-12-24 14:23:18 +00:00
Ryan Dick	d30a9ced38	Rename model_cache_default.py -> model_cache.py.	2024-12-24 14:23:18 +00:00
Ryan Dick	e0bfa6157b	Remove ModelCacheBase.	2024-12-24 14:23:18 +00:00
Ryan Dick	83ea6420e2	Move CacheStats to its own file.	2024-12-24 14:23:18 +00:00
Ryan Dick	ce11a1952e	Move CacheRecord out to its own file.	2024-12-24 14:23:18 +00:00
Ryan Dick	e48dee4c4a	Rip out ModelLockerBase.	2024-12-24 14:23:18 +00:00
Simon Fuhrmann	712674b6dd	Add Stereogram Nodes to communityNodes.md	2024-12-23 13:51:53 -05:00
psychedelicious	de0043f443	docs: update download links for launcher	2024-12-23 13:23:14 +11:00
Riku	d21506da6f	feat(ci): add typegen check workflow	2024-12-22 06:05:17 +11:00
psychedelicious	a49894901a	docs: fix installation docs home again	2024-12-20 17:35:50 +11:00
psychedelicious	e7e26c8a93	docs: fix installation docs home	2024-12-20 17:12:44 +11:00
psychedelicious	9adcd2cc31	docs: update install-related docs	2024-12-20 17:01:34 +11:00
Kent Keirsey	f9edd009f5	Update README.md	2024-12-20 17:01:34 +11:00
Kent Keirsey	91a4160e36	Update Installation Docs	2024-12-20 17:01:34 +11:00
Kent Keirsey	9c9cec1b43	Update README.md	2024-12-20 17:01:34 +11:00
psychedelicious	948ecf9333	chore: bump version to v5.5.0 v5.5.0	2024-12-20 16:17:23 +11:00
psychedelicious	1038f7bcab	Update invokeai_version.py v5.5.0rc1	2024-12-20 10:17:09 +11:00
Riccardo Giovanetti	c7d9e2d62a	translationBot(ui): update translation (Italian) Currently translated at 99.3% (1635 of 1645 strings) translationBot(ui): update translation (Italian) Currently translated at 99.3% (1634 of 1645 strings) Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com> Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ Translation: InvokeAI/Web UI	2024-12-20 10:07:15 +11:00
Riku	11c3a2e15d	translationBot(ui): update translation (German) Currently translated at 70.8% (1165 of 1645 strings) Co-authored-by: Riku <riku.block@gmail.com> Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/de/ Translation: InvokeAI/Web UI	2024-12-20 10:07:15 +11:00
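
The wrapper-module approach described as Option 1 in the Partial Loading PR2 message (6bf5b747ce) can be illustrated with a small, framework-free sketch. This is not InvokeAI's code: `FakeTensor`, `Linear`, `AutocastLinear`, and `linear` are hypothetical stand-ins for `torch.Tensor`, `torch.nn.Linear`, and the custom autocast modules, and device names are plain strings. The sketch only shows the idea of re-implementing `forward()` so that weights stored on one device are cast to the input's device at call time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeTensor:
    """Hypothetical stand-in for a tensor: values tagged with a device name."""
    values: tuple
    device: str

    def to(self, device: str) -> "FakeTensor":
        # No copy when already on the target device; otherwise "stream" a copy.
        if device == self.device:
            return self
        return FakeTensor(self.values, device)


def linear(x: FakeTensor, weight: FakeTensor, bias: FakeTensor) -> FakeTensor:
    """Compute weight @ x + bias; all operands must share one device."""
    if not (x.device == weight.device == bias.device):
        raise RuntimeError("expected all tensors to be on the same device")
    out = tuple(
        sum(w * v for w, v in zip(row, x.values)) + b
        for row, b in zip(weight.values, bias.values)
    )
    return FakeTensor(out, x.device)


class Linear:
    """Plain layer: fails if the input lives on a different device."""

    def __init__(self, weight: FakeTensor, bias: FakeTensor):
        self.weight = weight
        self.bias = bias

    def forward(self, x: FakeTensor) -> FakeTensor:
        return linear(x, self.weight, self.bias)


class AutocastLinear(Linear):
    """Wrapper layer: casts its parameters to the input's device per call.

    The stored parameters are left where they are, so the layer can stay
    resident on the CPU while computation happens on the input's device.
    """

    def forward(self, x: FakeTensor) -> FakeTensor:
        weight = self.weight.to(x.device)
        bias = self.bias.to(x.device)
        return linear(x, weight, bias)


if __name__ == "__main__":
    w = FakeTensor(((1.0, 2.0), (3.0, 4.0)), "cpu")
    b = FakeTensor((0.5, -0.5), "cpu")
    x = FakeTensor((1.0, 1.0), "cuda")

    layer = AutocastLinear(w, b)
    y = layer.forward(x)
    print(y.device, layer.weight.device)  # output device vs. stored device
    print(isinstance(layer, Linear))      # subclassing keeps isinstance checks working
```

Because the wrapper subclasses the original layer type, `isinstance(...)` checks against the base type keep working, which is the property the b0b699a01f commit says it tests. The per-call cast is also where the runtime overhead mentioned in 7d6ab0ceb2 would come from in a real implementation.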


15281 Commits