Commit Graph

275 Commits

Author SHA1 Message Date
skunkworxdark
604763d20f Update flux.py
Replace T5Tokenizer with T5TokenizerFast
2025-07-03 08:04:08 +10:00
Billy
3cd4306eec Update import path 2025-06-26 19:47:06 +10:00
Billy
2832ca300f Formatting 2025-06-24 07:26:42 +10:00
Billy
de5f413440 Filter bundle_emb for all LoRAs 2025-06-24 07:12:11 +10:00
Billy
150a876c73 Formatting 2025-06-23 13:52:19 +10:00
Billy
62c3b01e4f Merge branch 'main' into OMI 2025-06-23 13:52:07 +10:00
Billy
e1157f343b Support for Flux and SDXL 2025-06-23 13:51:16 +10:00
Billy
4ee54eac1d Another attempt 2025-06-20 14:10:06 +10:00
Billy
1fd83f5e68 Import 2025-06-19 11:01:50 +10:00
Billy
637487c573 Convert FROM OMI to diffusers 2025-06-19 11:00:27 +10:00
Billy
4e98e7d0a2 Typo: dot should be comma 2025-06-19 10:47:24 +10:00
Billy
12f65d800d Formatting 2025-06-19 09:40:58 +10:00
Billy
45d09f8f51 Use OMI conversion utils 2025-06-19 09:40:49 +10:00
Billy
9b4fdb493e Loader 2025-06-18 10:53:54 +10:00
Billy
47e21d6e04 Formatting 2025-06-17 13:56:38 +10:00
Billy
84ab4a1c30 Convert from OMI to default LoRA state dict 2025-06-17 13:56:22 +10:00
Kevin Turner
2981591c36 test: add some aitoolkit lora tests 2025-06-16 19:08:11 +10:00
Kevin Turner
ab8c739cd8 fix(LoRA): add ai-toolkit to lora loader 2025-06-16 19:08:11 +10:00
David Burnett
6c0bd7d150 fix import ordering, remove code I reverted that the resync added back 2025-05-19 11:16:23 +10:00
David Burnett
8abcc99ced add check for state_dict, required to load TI's 2025-05-19 11:16:23 +10:00
David Burnett
73ab4b8895 fix offload device 2025-05-19 11:16:23 +10:00
David Burnett
86719f2065 revert to overload due to failing tests, use Torch futures instead 2025-05-19 11:16:23 +10:00
psychedelicious
5f12b9185f feat(mm): add cache_snapshot to model cache clear callback 2025-05-15 16:06:47 +10:00
psychedelicious
d958d2e5a0 feat(mm): iterate on cache callbacks API 2025-05-15 14:37:22 +10:00
psychedelicious
823ca214e6 feat(mm): iterate on cache callbacks API 2025-05-15 13:28:51 +10:00
psychedelicious
a33da450fd feat(mm): support cache callbacks 2025-05-15 11:23:58 +10:00
Kent Keirsey
1f63b60021 Implementing support for Non-Standard LoRA Format (#7985)
* integrate loRA

* idk anymore tbh

* enable fused matrix for quantized models

* integrate loRA

* idk anymore tbh

* enable fused matrix for quantized models

* ruff fix

---------

Co-authored-by: Sam <bhaskarmdutt@gmail.com>
Co-authored-by: psychedelicious <4822129+psychedelicious@users.noreply.github.com>
2025-05-05 09:40:38 -04:00
psychedelicious
814406d98a feat(mm): siglip model loading supports partial loading
In the previous commit, the LLaVA model was updated to support partial loading.

In this commit, the SigLIP model is updated in the same way.

This model is used for FLUX Redux. It's <4GB and only ever run in isolation, so it won't benefit from partial loading for the vast majority of users. Regardless, I think it is best if we make _all_ models work with partial loading.

PS: I also fixed the initial load dtype issue, described in the prev commit. It's probably a non-issue for this model, but we may as well fix it.
2025-04-18 10:12:03 +10:00
psychedelicious
c054501103 feat(mm): llava model loading supports partial loading; fix OOM crash on initial load
The model manager has two types of model cache entries:
- `CachedModelOnlyFullLoad`: The model may only ever be loaded and unloaded as a single object.
- `CachedModelWithPartialLoad`: The model may be partially loaded and unloaded.

Partial loaded is enabled by overwriting certain torch layer classes, adding the ability to autocast the layer to a device on-the-fly. See `CustomLinear` for an example.

So, to take advantage of partial loading and be cached as a `CachedModelWithPartialLoad`, the model must inherit from `torch.nn.Module`.

The LLaVA classes provided by `transformers` do inherit from `torch.nn.Module`, but we wrap those classes in a separate class called `LlavaOnevisionModel`. The wrapper encapsulate both the LLaVA model and its "processor" - a lightweight class that prepares model inputs like text and images.

While it is more elegant to encapsulate both model and processor classes in a single entity, this prevents the model cache from enabling partial loading for the chunky vLLM model.

Fixing this involved a few changes.
- Update the `LlavaOnevisionModelLoader` class to operate on the vLLM model directly, instead the `LlavaOnevisionModel` wrapper class.
- Instantiate the processor directly in the node. The processor is lightweight and does its business on the CPU. We don't need to worry about caching in the model manager.
- Remove caching support code from the `LlavaOnevisionModel` wrapper class. It's not needed, because we do not cache this class. The class now only handles running the models provided to it.
- Rename `LlavaOnevisionModel` to `LlavaOnevisionPipeline` to better represent its purpose.

These changes have a bonus effect of fixing an OOM crash when initially loading the models. This was most apparent when loading LLaVA 7B, which is pretty chunky.

The initial load is onto CPU RAM. In the old version of the loaders, we ignored the loader's target dtype for the initial load. Instead, we loaded the model at `transformers`'s "default" dtype of fp32.

LLaVA 7B is fp16 and weighs ~17GB. Loading as fp32 means we need double that amount (~34GB) of CPU RAM. Many users only have 32GB RAM, so this causes a _CPU_ OOM - which is a hard crash of the whole process.

With the updated loaders, the initial load logic now uses the target dtype for the initial load. LLaVA now needs the expected ~17GB RAM for its initial load.

PS: If we didn't make the accompanying partial loading changes, we still could have solved this OOM. We'd just need to pass the initial load dtype to the wrapper class and have it load on that dtype. But we may as well fix both issues.

PPS: There are other models whose model classes are wrappers around a torch module class, and thus cannot be partially loaded. However, these models are typically fairly small and/or are run only on their own, so they don't benefit as much from partial loading. It's the really big models (like LLaVA 7B) that benefit most from the partial loading.
2025-04-18 10:12:03 +10:00
Ryan Dick
46316e43f0 typegen 2025-04-10 10:50:13 +10:00
Ryan Dick
321c2d358c Add CogView4 model loader. And various other fixes to get a CogView4 workflow running (though quality is still below expectations). 2025-04-10 10:50:13 +10:00
psychedelicious
8294e2cdea feat(mm): support size calculation for onnx models 2025-04-07 11:37:55 +10:00
psychedelicious
7004fde41b fix(mm): vllm model calculates its own size 2025-03-27 09:36:14 +11:00
Billy
182580ff69 Imports 2025-03-26 12:55:10 +11:00
Ryan Dick
2ef1ecf381 Fix copy-paste errors. 2025-03-18 11:53:06 +11:00
Ryan Dick
e9714fe476 Add LLaVA Onevision model loading and inference support. 2025-03-18 11:53:06 +11:00
Ryan Dick
8e28888bc4 Fix SigLipPipeline model size calculation. 2025-03-06 10:31:17 +11:00
Ryan Dick
f1fde792ee Get FLUX Redux working: model loading and inference. 2025-03-06 10:31:17 +11:00
Billy
f2689598c0 Formatting 2025-03-06 09:11:00 +11:00
Ryan Dick
cc9d215a9b Add endpoint for emptying the model cache. Also, adds a threading lock to the ModelCache to make it thread-safe. 2025-01-30 09:18:28 -05:00
Ryan Dick
f7315f0432 Make the default max RAM cache size more conservative. 2025-01-30 08:46:59 -05:00
Ryan Dick
229834a5e8 Performance optimizations for LoRAs applied on top of GGML-quantized tensors. 2025-01-28 14:51:35 +00:00
Ryan Dick
5d472ac1b8 Move quantized weight handling for patch layers up from ConcatenatedLoRALayer to CustomModuleMixin. 2025-01-28 14:51:35 +00:00
Ryan Dick
28514ba59a Update ConcatenatedLoRALayer to work with all sub-layer types. 2025-01-28 14:51:35 +00:00
Ryan Dick
0db6639b4b Add FLUX OneTrainer model probing. 2025-01-28 14:51:35 +00:00
Ryan Dick
0cf51cefe8 Revise the logic for calculating the RAM model cache limit. 2025-01-16 23:46:07 +00:00
Ryan Dick
da589b3f1f Memory optimization to load state dicts one module at a time in CachedModelWithPartialLoad when we are not storing a CPU copy of the state dict (i.e. when keep_ram_copy_of_weights=False). 2025-01-16 17:00:33 +00:00
Ryan Dick
36a3869af0 Add keep_ram_copy_of_weights config option. 2025-01-16 15:35:25 +00:00
Ryan Dick
c76d08d1fd Add keep_ram_copy option to CachedModelOnlyFullLoad. 2025-01-16 15:08:23 +00:00
Ryan Dick
04087c38ce Add keep_ram_copy option to CachedModelWithPartialLoad. 2025-01-16 14:51:44 +00:00