Commit Graph

15166 Commits

Author SHA1 Message Date
Ryan Dick  f01e41ceaf  First pass at dynamically calculating the working memory requirements for the VAE decoding operation. Still need to tune SD3 and FLUX.  2024-12-19 15:26:16 -05:00
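
A back-of-the-envelope sketch of what "working memory for VAE decoding" could look like. The function name, scale factor, and activation multiplier below are illustrative assumptions, not the formula this commit implements:

    import torch

    def estimate_vae_decode_working_memory(latent_h: int, latent_w: int,
                                           dtype: torch.dtype = torch.float16,
                                           scale_factor: int = 8,
                                           activation_multiplier: int = 16) -> int:
        # The decoder upsamples latents by `scale_factor` in each dimension.
        out_h, out_w = latent_h * scale_factor, latent_w * scale_factor
        element_size = torch.empty((), dtype=dtype).element_size()
        # 3 output channels, inflated by a guessed factor for the intermediate
        # activations held alive during decoding.
        return out_h * out_w * 3 * element_size * activation_multiplier
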
Ryan Dick  609ed06265  Add AutoencoderKL to the list of models that opt-out of partial loading.  2024-12-19 15:25:23 -05:00
Ryan Dick  f9e899a6ba  Make pinned pytorch version slightly more specific. We need at least 2.4 for access to torch.nn.functional.rms_norm(...).  2024-12-19 14:03:01 -05:00
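
For context on the pin: torch.nn.functional.rms_norm() first shipped in PyTorch 2.4, so any earlier version fails as soon as it is called. A minimal usage example:

    import torch
    import torch.nn.functional as F

    x = torch.randn(2, 8)
    weight = torch.ones(8)
    # Present from PyTorch 2.4 onwards; older versions raise AttributeError here.
    y = F.rms_norm(x, normalized_shape=[8], weight=weight, eps=1e-6)
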
Ryan Dick  9262c0ec53  Do not raise if a cache entry is deleted twice and ensure that OOM errors propagate up the stack.  2024-12-19 18:32:01 +00:00
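
A minimal sketch of the two behaviors this commit describes; the class and method names are invented for illustration and are not InvokeAI's actual ModelCache API:

    import torch

    class CacheSketch:
        """Not InvokeAI's ModelCache; a toy illustrating the two behaviors."""

        def __init__(self) -> None:
            self._entries: dict[str, object] = {}

        def delete(self, key: str) -> None:
            # pop() with a default makes a double delete a harmless no-op
            # instead of a KeyError.
            self._entries.pop(key, None)

        def load(self, key: str) -> object:
            try:
                return self._load_to_vram(key)
            except torch.cuda.OutOfMemoryError:
                raise  # never swallow OOM; callers need to see it and react
            except Exception:
                self.delete(key)  # drop entries that fail to load (cf. a8f3471fc7)
                raise

        def _load_to_vram(self, key: str) -> object:
            return self._entries[key]  # stand-in for the real GPU load
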
Ryan Dick  7fddb06dc4  Add a list of models that opt-out of partial loading.  2024-12-19 16:00:56 +00:00
Ryan Dick  239297caf6  Tidy the API for overriding the working_mem_bytes for a particular operation.  2024-12-19 05:05:04 +00:00
Ryan Dick  20f0b2f4fa  Update app config docstring.  2024-12-19 04:33:26 +00:00
Ryan Dick  cfb8815355  Remove unused and outdated get_cache_size and set_cache_size endpoints.  2024-12-19 04:06:08 +00:00
Ryan Dick  c866b5a799  Allow legacy ram/vram configs to override default behavior if set.  2024-12-19 04:06:08 +00:00
Ryan Dick  3b76812d43  Only support partial model loading on CUDA.  2024-12-18 19:13:15 -05:00
Ryan Dick  a8f3471fc7  Drop models from the cache if we fail loading/unloading them.  2024-12-18 23:53:25 +00:00
Ryan Dick  6d8dee05a9  Use the cpu state dict strategy for managing CachedModelOnlyFullLoad memory.  2024-12-18 22:52:57 +00:00
Ryan Dick  e684e49299  Do not apply the autocast context when models are fully loaded onto the GPU - it adds some overhead.  2024-12-18 21:51:39 +00:00
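
The idea behind this change, sketched with invented names; device_autocast below is a stand-in for the project's custom autocast context, not a real API:

    from contextlib import contextmanager, nullcontext

    import torch

    @contextmanager
    def device_autocast(model: torch.nn.Module):
        # Stand-in for the custom context that remaps each op's tensors to the
        # compute device; its per-op hook is pure overhead once the whole model
        # already lives on the GPU.
        yield

    def run(model: torch.nn.Module, x: torch.Tensor, fully_loaded: bool) -> torch.Tensor:
        ctx = nullcontext() if fully_loaded else device_autocast(model)
        with ctx:
            return model(x)
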
Ryan Dick  4ce2042d65  Add remove_autocast_from_module_forward(...) utility.  2024-12-18 20:28:32 +00:00
Ryan Dick  05a50b557a  Update logic to enforce max size of RAM cache to avoid overfilling.  2024-12-18 20:21:38 +00:00
Ryan Dick  85e1e9587e  Add info logs each time a model is loaded.  2024-12-18 19:52:54 +00:00
Ryan Dick  8e763e87bb  Allow invocations to request more working VRAM when loading a model via the ModelCache.  2024-12-18 19:52:34 +00:00
Ryan Dick  4a4360a40c  Add enable_partial_loading config.  2024-12-18 17:17:08 +00:00
Ryan Dick  612d6b00e3  In FluxTextEncoderInvocation, make sure model is locked before loading next model.  2024-12-18 17:12:12 +00:00
Ryan Dick  7a5dd084ad  Update MPS cache limit logic.  2024-12-17 23:44:17 -05:00
Ryan Dick  79a4d0890f  WIP - add device_working_mem_gb config  2024-12-18 03:31:37 +00:00
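
Taken together with enable_partial_loading above, the cache appears to be steered by settings along these lines. The keys come from the commit messages; the values shown are placeholder guesses, not documented defaults:

    # Illustrative invokeai.yaml-style settings, expressed as a Python dict.
    model_cache_settings = {
        "enable_partial_loading": True,  # let large models straddle VRAM and RAM
        "device_working_mem_gb": 3.0,    # VRAM head-room reserved for activations
    }
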
Ryan Dick  e0c899104b  Consolidate the LayerPatching patching modes into a single implementation.  2024-12-17 18:33:36 +00:00
Ryan Dick  c37bb6375c  Rename model_patcher.py -> layer_patcher.py.  2024-12-17 17:19:12 +00:00
Ryan Dick  4716170988  Use torch.device('cpu') instead of 'cpu' when calling .to(), because some custom models don't support the latter.  2024-12-17 17:14:42 +00:00
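
A concrete example of the distinction this commit works around; stock torch modules accept either form, but a custom .to() override may only handle a torch.device:

    import torch

    model = torch.nn.Linear(4, 4)
    model.to(torch.device("cpu"))  # robust: always pass a torch.device instance
    model.to("cpu")                # equivalent for stock modules, but a custom
                                   # .to() override may only accept a torch.device
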
Ryan Dick  463196d781  Update apply_smart_model_patches() so that layer restore matches the behavior of non-smart mode.  2024-12-17 17:13:45 +00:00
Ryan Dick  e1e756800d  Enable LoRAPatcher.apply_smart_lora_patches(...) throughout the stack.  2024-12-17 15:50:51 +00:00
Ryan Dick  ab337594b8  (minor) Rename num_layers -> num_loras in unit tests.  2024-12-17 15:39:01 +00:00
Ryan Dick  699e4e5995  Add test_apply_smart_lora_patches_to_partially_loaded_model(...).  2024-12-17 15:32:51 +00:00
Ryan Dick  33f17520ca  Add LoRAPatcher.smart_apply_lora_patches()  2024-12-17 15:29:04 +00:00
Ryan Dick  46d061212c  Update CachedModelWithPartialLoad to operate on state_dicts rather than moving torch.nn.Modules around.  2024-12-17 15:18:55 +00:00
Ryan Dick  829dddefc8  Bump bitsandbytes. The new version contains improvements to state_dict loading/saving for LLM.int8 and promises improved speed on some hardware.  2024-12-17 15:18:55 +00:00
Ryan Dick  b6c159cfdb  Fix bug with partial offload of model buffers.  2024-12-17 15:18:55 +00:00
Ryan Dick  5a31c467a3  Fix bug in ModelCache that was causing it to offload more models from VRAM than necessary.  2024-12-17 15:18:55 +00:00
Ryan Dick  13dbde2429  Fix handling of torch.nn.Module buffers in CachedModelWithPartialLoad.  2024-12-17 15:18:55 +00:00
Ryan Dick  a8ee72d7fb  Maintain a read-only CPU state dict copy in CachedModelWithPartialLoad.  2024-12-17 15:18:55 +00:00
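
A compressed sketch of the read-only CPU copy idea (invented class name; CachedModelWithPartialLoad is considerably more involved): keeping a pristine CPU state dict means offloading is a tensor swap rather than a GPU-to-CPU copy.

    import torch

    class PartialLoadSketch:
        """Invented name, illustrating the read-only CPU state dict idea."""

        def __init__(self, model: torch.nn.Module) -> None:
            self._model = model
            # Read-only reference copy of the weights, kept on the CPU and
            # never mutated while parts of the model are moved to the GPU.
            self._cpu_state_dict = {
                k: v.detach().cpu() for k, v in model.state_dict().items()
            }

        def offload_all(self) -> None:
            # assign=True re-points the module at the pristine CPU tensors
            # rather than copying data back from the GPU (PyTorch >= 2.1).
            self._model.load_state_dict(self._cpu_state_dict, assign=True)
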
Ryan Dick  7a002e1b05  Memoize frequently accessed values in CachedModelWithPartialLoad.  2024-12-17 15:18:55 +00:00
Ryan Dick  b50dd8502f  More ModelCache logging improvements.  2024-12-17 15:18:55 +00:00
Ryan Dick  f4c13b057d  Cleanup of ModelCache and added a bunch of debug logging.  2024-12-17 15:18:55 +00:00
Ryan Dick  cb884ee567  Fix a couple of bugs to get basic vanilla partial model load working with the model cache.  2024-12-17 15:18:55 +00:00
Ryan Dick  050d4465e6  WIP - first pass at overhauling ModelCache to work with partial loads.  2024-12-17 15:18:55 +00:00
Ryan Dick  e48bb844b9  Delete experimental torch device autocasting solutions and clean up TorchFunctionAutocastDeviceContext.  2024-12-17 15:18:55 +00:00
Ryan Dick  57eb05983b  Create CachedModelOnlyFullLoad class.  2024-12-17 15:18:55 +00:00
Ryan Dick  dc3be08653  Move CachedModelWithPartialLoad into the main model_cache/ directory.  2024-12-17 15:18:55 +00:00
Ryan Dick  ae1041286f  Get rid of ModelLocker. It was an unnecessary layer of indirection.  2024-12-17 15:18:55 +00:00
Ryan Dick  6e270cc5bf  Move lock(...) and unlock(...) logic from ModelLocker to the ModelCache and make a bunch of ModelCache properties/methods private.  2024-12-17 15:18:55 +00:00
Ryan Dick  6dc447aba8  Pull get_model_cache_key(...) out of ModelCache. The ModelCache should not be concerned with implementation details like the submodel_type.  2024-12-17 15:18:55 +00:00
Ryan Dick  a4c0fcb6c8  Rename model_cache_default.py -> model_cache.py.  2024-12-17 15:18:55 +00:00
Ryan Dick  1f3580716c  Remove ModelCacheBase.  2024-12-17 15:18:55 +00:00
Ryan Dick  405e53f80a  Move CacheStats to its own file.  2024-12-17 15:18:55 +00:00
Ryan Dick  be120ff587  Move CacheRecord out to its own file.  2024-12-17 15:18:55 +00:00