InvokeAI

mirror of https://github.com/invoke-ai/InvokeAI.git synced 2026-02-10 03:55:05 -05:00

Author	SHA1	Message	Date
Kent Keirsey	1f63b60021	Implementing support for Non-Standard LoRA Format (#7985) * integrate loRA * idk anymore tbh * enable fused matrix for quantized models * ruff fix --------- Co-authored-by: Sam <bhaskarmdutt@gmail.com> Co-authored-by: psychedelicious <4822129+psychedelicious@users.noreply.github.com>	2025-05-05 09:40:38 -04:00
psychedelicious	814406d98a	feat(mm): siglip model loading supports partial loading In the previous commit, the LLaVA model was updated to support partial loading. In this commit, the SigLIP model is updated in the same way. This model is used for FLUX Redux. It's <4GB and only ever run in isolation, so it won't benefit from partial loading for the vast majority of users. Regardless, I think it is best if we make _all_ models work with partial loading. PS: I also fixed the initial load dtype issue, described in the prev commit. It's probably a non-issue for this model, but we may as well fix it.	2025-04-18 10:12:03 +10:00
psychedelicious	c054501103	feat(mm): llava model loading supports partial loading; fix OOM crash on initial load The model manager has two types of model cache entries: - `CachedModelOnlyFullLoad`: The model may only ever be loaded and unloaded as a single object. - `CachedModelWithPartialLoad`: The model may be partially loaded and unloaded. Partial loading is enabled by overwriting certain torch layer classes, adding the ability to autocast the layer to a device on-the-fly. See `CustomLinear` for an example. So, to take advantage of partial loading and be cached as a `CachedModelWithPartialLoad`, the model must inherit from `torch.nn.Module`. The LLaVA classes provided by `transformers` do inherit from `torch.nn.Module`, but we wrap those classes in a separate class called `LlavaOnevisionModel`. The wrapper encapsulates both the LLaVA model and its "processor" - a lightweight class that prepares model inputs like text and images. While it is more elegant to encapsulate both model and processor classes in a single entity, this prevents the model cache from enabling partial loading for the chunky vLLM model. Fixing this involved a few changes. - Update the `LlavaOnevisionModelLoader` class to operate on the vLLM model directly, instead of the `LlavaOnevisionModel` wrapper class. - Instantiate the processor directly in the node. The processor is lightweight and does its business on the CPU. We don't need to worry about caching in the model manager. - Remove caching support code from the `LlavaOnevisionModel` wrapper class. It's not needed, because we do not cache this class. The class now only handles running the models provided to it. - Rename `LlavaOnevisionModel` to `LlavaOnevisionPipeline` to better represent its purpose. These changes have a bonus effect of fixing an OOM crash when initially loading the models. This was most apparent when loading LLaVA 7B, which is pretty chunky. The initial load is onto CPU RAM. In the old version of the loaders, we ignored the loader's target dtype for the initial load. Instead, we loaded the model at `transformers`'s "default" dtype of fp32. LLaVA 7B is fp16 and weighs ~17GB. Loading as fp32 means we need double that amount (~34GB) of CPU RAM. Many users only have 32GB RAM, so this causes a _CPU_ OOM - which is a hard crash of the whole process. With the updated loaders, the initial load logic now uses the target dtype for the initial load. LLaVA now needs the expected ~17GB RAM for its initial load. PS: If we didn't make the accompanying partial loading changes, we still could have solved this OOM. We'd just need to pass the initial load dtype to the wrapper class and have it load on that dtype. But we may as well fix both issues. PPS: There are other models whose model classes are wrappers around a torch module class, and thus cannot be partially loaded. However, these models are typically fairly small and/or are run only on their own, so they don't benefit as much from partial loading. It's the really big models (like LLaVA 7B) that benefit most from the partial loading.	2025-04-18 10:12:03 +10:00
Ryan Dick	46316e43f0	typegen	2025-04-10 10:50:13 +10:00
Ryan Dick	321c2d358c	Add CogView4 model loader. And various other fixes to get a CogView4 workflow running (though quality is still below expectations).	2025-04-10 10:50:13 +10:00
psychedelicious	8294e2cdea	feat(mm): support size calculation for onnx models	2025-04-07 11:37:55 +10:00
psychedelicious	7004fde41b	fix(mm): vllm model calculates its own size	2025-03-27 09:36:14 +11:00
Billy	182580ff69	Imports	2025-03-26 12:55:10 +11:00
Ryan Dick	2ef1ecf381	Fix copy-paste errors.	2025-03-18 11:53:06 +11:00
Ryan Dick	e9714fe476	Add LLaVA Onevision model loading and inference support.	2025-03-18 11:53:06 +11:00
Ryan Dick	8e28888bc4	Fix SigLipPipeline model size calculation.	2025-03-06 10:31:17 +11:00
Ryan Dick	f1fde792ee	Get FLUX Redux working: model loading and inference.	2025-03-06 10:31:17 +11:00
Billy	f2689598c0	Formatting	2025-03-06 09:11:00 +11:00
Ryan Dick	cc9d215a9b	Add endpoint for emptying the model cache. Also, adds a threading lock to the ModelCache to make it thread-safe.	2025-01-30 09:18:28 -05:00
Ryan Dick	f7315f0432	Make the default max RAM cache size more conservative.	2025-01-30 08:46:59 -05:00
Ryan Dick	229834a5e8	Performance optimizations for LoRAs applied on top of GGML-quantized tensors.	2025-01-28 14:51:35 +00:00
Ryan Dick	5d472ac1b8	Move quantized weight handling for patch layers up from ConcatenatedLoRALayer to CustomModuleMixin.	2025-01-28 14:51:35 +00:00
Ryan Dick	28514ba59a	Update ConcatenatedLoRALayer to work with all sub-layer types.	2025-01-28 14:51:35 +00:00
Ryan Dick	0db6639b4b	Add FLUX OneTrainer model probing.	2025-01-28 14:51:35 +00:00
Ryan Dick	0cf51cefe8	Revise the logic for calculating the RAM model cache limit.	2025-01-16 23:46:07 +00:00
Ryan Dick	da589b3f1f	Memory optimization to load state dicts one module at a time in CachedModelWithPartialLoad when we are not storing a CPU copy of the state dict (i.e. when keep_ram_copy_of_weights=False).	2025-01-16 17:00:33 +00:00
Ryan Dick	36a3869af0	Add keep_ram_copy_of_weights config option.	2025-01-16 15:35:25 +00:00
Ryan Dick	c76d08d1fd	Add keep_ram_copy option to CachedModelOnlyFullLoad.	2025-01-16 15:08:23 +00:00
Ryan Dick	04087c38ce	Add keep_ram_copy option to CachedModelWithPartialLoad.	2025-01-16 14:51:44 +00:00
Ryan Dick	b2bb359d47	Update the model loading logic for several of the large FLUX-related models to ensure that the model is initialized on the meta device prior to loading the state dict into it. This helps to keep peak memory down.	2025-01-16 02:30:28 +00:00
Ryan Dick	d7ab464176	Offload the current model when locking if it is already partially loaded and we have insufficient VRAM.	2025-01-07 02:53:44 +00:00
Ryan Dick	5b42b7bd45	Add a utility to help with determining the working memory required for expensive operations.	2025-01-07 01:20:15 +00:00
Ryan Dick	b343f81644	Use torch.cuda.memory_allocated() rather than torch.cuda.memory_reserved() to be more conservative in setting dynamic VRAM cache limits.	2025-01-07 01:20:15 +00:00
Ryan Dick	fc4a22fe78	Allow expensive operations to request more working memory.	2025-01-07 01:20:13 +00:00
Ryan Dick	a167632f09	Calculate model cache size limits dynamically based on the available RAM / VRAM.	2025-01-07 01:14:20 +00:00
Ryan Dick	6a9de1fcf3	Change definition of VRAM in use for the ModelCache from sum of model weights to the total torch.cuda.memory_allocated().	2025-01-07 00:31:53 +00:00
Ryan Dick	e5180c4e6b	Add get_effective_device(...) utility to aid in determining the effective device of models that are partially loaded.	2025-01-07 00:31:00 +00:00
Ryan Dick	1b7bb70bde	Improve handling of cases when application code modifies the size of a model after registering it with the model cache.	2025-01-07 00:31:00 +00:00
Ryan Dick	7127040c3a	Remove unused function set_nested_attr(...).	2025-01-07 00:31:00 +00:00
Ryan Dick	ceb2498a67	Add log prefix to model cache logs.	2025-01-07 00:31:00 +00:00
Ryan Dick	d0bfa019be	Add 'enable_partial_loading' config flag.	2025-01-07 00:31:00 +00:00
Ryan Dick	535e45cedf	First pass at adding partial loading support to the ModelCache.	2025-01-07 00:30:58 +00:00
Ryan Dick	c579a218ef	Allow models to be locked in VRAM, even if they have been dropped from the RAM cache (related: https://github.com/invoke-ai/InvokeAI/issues/7513).	2025-01-06 23:02:52 +00:00
Ryan Dick	8b4b0ff0cf	Fix bug in CustomConv1d and CustomConv2d patch calculations.	2024-12-29 19:10:19 +00:00
Ryan Dick	a8bef59699	First pass at making custom layer patches work with weights streamed from the CPU to the GPU.	2024-12-29 17:01:37 +00:00
Ryan Dick	6d49ee839c	Switch the LayerPatcher to use 'custom modules' to manage layer patching.	2024-12-29 01:18:30 +00:00
Ryan Dick	0525f967c2	Fix the _autocast_forward_with_patches() function for CustomConv1d and CustomConv2d.	2024-12-29 00:22:37 +00:00
Ryan Dick	2855bb6b41	Update BaseLayerPatch.get_parameters(...) to accept a dict of orig_parameters rather than orig_module. This will enable compatibility between patching and cpu->gpu streaming.	2024-12-28 21:12:53 +00:00
Ryan Dick	20acfc9a00	Raise in CustomEmbedding and CustomGroupNorm if a patch is applied.	2024-12-28 20:49:17 +00:00
Ryan Dick	918f541af8	Add unit test for a SetParameterLayer patch applied to a CustomFluxRMSNorm layer.	2024-12-28 20:44:48 +00:00
Ryan Dick	93e76b61d6	Add CustomFluxRMSNorm layer.	2024-12-28 20:33:38 +00:00
Ryan Dick	f692e217ea	Add patch support to CustomConv1d and CustomConv2d (no unit tests yet).	2024-12-27 22:23:17 +00:00
Ryan Dick	f2981979f9	Get custom layer patches working with all quantized linear layer types.	2024-12-27 22:00:22 +00:00
Ryan Dick	ef970a1cdc	Add support for FluxControlLoRALayer in CustomLinear layers and add a unit test for it.	2024-12-27 21:00:47 +00:00
Ryan Dick	e24e386a27	Add support for patches to CustomModuleMixin and add a single unit test (more to come).	2024-12-27 18:57:13 +00:00
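
The memory arithmetic described in commit c054501103 above can be sketched as follows. This is only an illustration, not code from the repository: the function names and the byte-width table are hypothetical, and the only figures taken from the commit message are the ~17GB fp16 size of LLaVA 7B and the 32GB RAM that many users have. The check is deliberately rough and ignores memory used by the OS and other processes.

```python
# Illustrative sketch only; not code from the InvokeAI repository.
# All names here are hypothetical. Sizes come from the commit message.

BYTES_PER_PARAM = {"fp16": 2, "fp32": 4}


def size_at_dtype(size_gb: float, stored_dtype: str, load_dtype: str) -> float:
    """Scale a model's weight size from its stored dtype to the dtype used at load time."""
    return size_gb * BYTES_PER_PARAM[load_dtype] / BYTES_PER_PARAM[stored_dtype]


def fits_in_ram(required_gb: float, total_ram_gb: float) -> bool:
    """Rough check that ignores the OS and every other process."""
    return required_gb <= total_ram_gb


llava_fp16_gb = 17.0  # "~17GB" per the commit message

# Old loader behaviour: initial load at fp32, regardless of the target dtype.
as_fp32 = size_at_dtype(llava_fp16_gb, "fp16", "fp32")
# Updated loader behaviour: initial load at the target dtype (fp16 here).
as_fp16 = size_at_dtype(llava_fp16_gb, "fp16", "fp16")

print(as_fp32, fits_in_ram(as_fp32, 32.0))
print(as_fp16, fits_in_ram(as_fp16, 32.0))
```

Under these assumptions, the fp32 load needs about 34GB and does not fit in 32GB of RAM, while the fp16 load needs about 17GB and does, which matches the behaviour the commit message reports.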

249 Commits