## Summary
This PR adds utilities to support partial loading of models from CPU to
GPU. The new utilities are not yet used by the `ModelCache`, so there
should be no functional behavior changes in this PR.
Detailed changes:
- Add autocast modules designed to wrap common `torch.nn.Module`s and
enable them to run with automatic device casting. For example, a linear
layer whose weights live on the CPU can be executed with an input tensor
on the GPU by streaming the weights to the GPU at runtime (a minimal
sketch follows this list).
- Add unit tests for the aforementioned autocast modules to verify that
they work for all supported quantization formats (GGUF, BnB NF4, BnB
LLM.int8()).
- Add `CachedModelWithPartialLoad` and `CachedModelOnlyFullLoad` classes
to manage partial loading at the model level.
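To make the wrapper idea concrete, here is a minimal sketch of an autocast linear layer. The class name and details are illustrative assumptions, not necessarily the PR's actual implementation: the parameters live on the CPU and are cast to the input's device on each call.

```python
import torch

class CustomLinear(torch.nn.Linear):
    """Illustrative autocast wrapper (hypothetical name/details).

    The layer's parameters may live on the CPU; they are cast to the input
    tensor's device each time the layer is called.
    """

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        # Stream the weight (and bias, if present) to the input's device.
        weight = self.weight.to(input.device)
        bias = self.bias.to(input.device) if self.bias is not None else None
        return torch.nn.functional.linear(input, weight, bias)


# Usage: a CPU-resident layer consuming a GPU input.
layer = CustomLinear(8, 4)  # parameters stay on the CPU
if torch.cuda.is_available():
    x = torch.randn(2, 8, device="cuda")
    y = layer(x)  # the weights are copied to the GPU for this call
    assert y.device.type == "cuda"
```

The real modules additionally need format-specific handling for the quantized layer types (GGUF, BnB NF4, BnB LLM.int8()), which is what the unit tests above exercise.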
## Alternative Implementations
Several options were explored for supporting inference on
partially-loaded models. Their pros and cons are summarized here for
reference. In the end, wrapper modules were selected as the best overall
solution for our use case.
**Option 1: Re-implement the `.forward()` methods of modules to add support for device conversions**
- This is the option implemented in this PR.
- This approach is the most manual of the three, but as a result offers
the broadest compatibility with unusual model types. It is manual in
that we have to explicitly add support for every module type that we wish
to support. Fortunately, the list of foundational module types is
relatively small (e.g. the current set of implemented layers covers all
but 0.04 MB of the full FLUX model). A sketch of this module substitution
follows below.
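For illustration, the per-module-type bookkeeping could look roughly like the following sketch (hypothetical names, reusing the `CustomLinear` sketch from the Summary above; not the PR's actual code):

```python
import torch

# Hypothetical registry of autocast wrappers. Each supported module type has
# to be added by hand, which is the "manual" cost of Option 1.
AUTOCAST_MODULE_TYPES: dict[type, type] = {
    torch.nn.Linear: CustomLinear,  # from the earlier sketch
    # torch.nn.Conv2d: CustomConv2d,  # further types would be registered here
}

def wrap_autocast_modules(module: torch.nn.Module) -> None:
    """Recursively swap supported leaf modules for their autocast wrappers."""
    for child in module.children():
        wrapper_cls = AUTOCAST_MODULE_TYPES.get(type(child))
        if wrapper_cls is not None:
            # Re-classing keeps the existing parameters/buffers in place.
            child.__class__ = wrapper_cls
        wrap_autocast_modules(child)
```

Quantized layer types (GGUF, BnB NF4, BnB LLM.int8()) each need their own entries in such a registry, which is the manual cost, but it is also what makes the approach predictable for unusual model types.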
**Option 2: Implement a custom `Tensor` type that casts tensors to a `target_device` each time the tensor is used**
- This approach has the nice property that it is injected at the tensor
level, and the model does not need to be modified in any way (a sketch
follows below).
- One challenge with this approach is handling interactions with other
custom tensor types (e.g. `GGMLTensor`). This problem is solvable, but it
introduces a layer of complexity. (There are likely to be similar issues
with the BnB quantization formats, but I didn't get as far as testing
BnB.)
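For reference, a rough sketch of what Option 2 could look like (hypothetical; not implemented in this PR):

```python
import torch

class AutocastTensor(torch.Tensor):
    """Hypothetical tensor subclass that is cast to a target device whenever
    it participates in a torch op (Option 2 sketch; not implemented here)."""

    target_device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}

        def cast(t):
            return t.to(cls.target_device) if isinstance(t, torch.Tensor) else t

        # Disable subclass dispatch so the .to() calls below don't re-enter
        # this function. Only top-level tensor args are handled in this sketch.
        with torch._C.DisableTorchFunctionSubclass():
            args = tuple(cast(a) for a in args)
            kwargs = {k: cast(v) for k, v in kwargs.items()}
            return func(*args, **kwargs)


# Usage: a CPU-resident weight participating in a matmul on the target device.
w = torch.randn(8, 8).as_subclass(AutocastTensor)
x = torch.randn(2, 8, device=AutocastTensor.target_device)
y = x @ w  # w is moved to the target device inside __torch_function__
```

The complexity mentioned above appears as soon as a type like this has to coexist with other tensor subclasses (e.g. `GGMLTensor`), because `__torch_function__` dispatch between multiple subclasses must then be resolved explicitly.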
**Option 3: Override the `__torch_function__` dispatch calls globally and cast all params to the execution device**
- This approach is nice and simple: just apply a global context manager
and all operations will happen on the compute device regardless of the
device of the participating tensors (a sketch follows below).
- Challenges:
  - Overriding the `__torch_function__` dispatch calls introduces some
overhead even if the tensors are already on the correct device.
  - It is difficult to manage the autocasting context manager. For
example, it is tempting to apply it to the model's `.forward(...)`
method, but we use some models with non-standard entrypoints, and we
don't want to end up with nested autocasting context managers.
  - BnB applies quantization side effects when a param is moved to the
GPU; this interacts in unexpected ways with a global context manager.
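A rough sketch of what Option 3 could look like using a `TorchFunctionMode` (hypothetical; not implemented in this PR):

```python
import torch
from torch.overrides import TorchFunctionMode

class AutocastMode(TorchFunctionMode):
    """Hypothetical global mode that casts every tensor argument to the
    execution device before each op (Option 3 sketch; not implemented here)."""

    def __init__(self, execution_device: torch.device):
        super().__init__()
        self._device = execution_device

    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}

        def cast(t):
            return t.to(self._device) if isinstance(t, torch.Tensor) else t

        # Only top-level tensor args are handled in this sketch.
        args = tuple(cast(a) for a in args)
        kwargs = {k: cast(v) for k, v in kwargs.items()}
        return func(*args, **kwargs)


# Usage: CPU-resident parameters are moved to the execution device on the fly.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = torch.nn.Linear(8, 4)  # parameters stay on the CPU
with AutocastMode(device):
    out = model(torch.randn(2, 8))
```

Note that every op pays the dispatch and cast-check cost even when all tensors are already on the execution device, which is the overhead concern listed above.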
## QA Instructions
Most of the changes in this PR should not impact active code, and thus
should not cause any changes to behavior. The main risks come from
bumping the bitsandbytes dependency and some minor modifications to the
bitsandbytes quantization code.
- [x] Regression test bitsandbytes NF4 quantization
- [x] Regression test bitsandbytes LLM.int8() quantization
- [x] Regression test on macOS (to ensure that there are no lingering
bitsandbytes import errors)
I also tested the new utilities for inference on full models in another
branch to validate that there were no major issues. This functionality
will be tested more thoroughly in a future PR.
## Merge Plan
- [x] #7492 should be merged first so that the target branch can be
updated to main.
## Checklist
- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
## Summary
This PR tidies up the model cache code in preparation for further
refactoring to support partial loading of models onto the GPU. **These
code changes should not change the functional behavior in any way.**
Changes:
- Remove the `ModelCacheBase` class. `ModelCache` is the only
implementation, so there is no benefit to the separate abstract class.
- Split `CacheRecord` and `CacheStats` out into their own files.
- Remove the `ModelLocker` class. This extra layer of indirection was
not providing any benefit. Locking is now done directly with the
`ModelCache`.
- Tidy up relative imports that were contributing to circular import
issues.
- Pull the 'submodel' concern out of the `ModelCache`. The `ModelCache`
should not need to be aware of the model manager submodel system.
- Delete unused properties from the `ModelCache` (e.g.
`.lazy_offloading`, `.storage_device`).
## QA Instructions
I ran smoke tests with a variety of SD1, SDXL and FLUX models. No change
to behavior is expected.
## Merge Plan
## Checklist
- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_