tidy(ui): remove extraneous cursor sync

fix(ui): minor canvas overflow
feat(ui): rp hotkeys
2026-01-17 14:28:03 -05:00 · 2024-04-23 12:11:47 +10:00
272 changed files with 8239 additions and 2759 deletions
--- a/docs/assets/gallery/board_settings.png
+++ b/docs/assets/gallery/board_settings.png
--- a/docs/assets/gallery/board_tabs.png
+++ b/docs/assets/gallery/board_tabs.png
--- a/docs/assets/gallery/board_thumbnails.png
+++ b/docs/assets/gallery/board_thumbnails.png
--- a/docs/assets/gallery/gallery.png
+++ b/docs/assets/gallery/gallery.png
--- a/docs/assets/gallery/image_menu.png
+++ b/docs/assets/gallery/image_menu.png
--- a/docs/assets/gallery/info_button.png
+++ b/docs/assets/gallery/info_button.png
--- a/docs/assets/gallery/thumbnail_menu.png
+++ b/docs/assets/gallery/thumbnail_menu.png
--- a/docs/assets/gallery/top_controls.png
+++ b/docs/assets/gallery/top_controls.png
--- a/docs/features/GALLERY.md
+++ b/docs/features/GALLERY.md
@@ -0,0 +1,92 @@
+---
+title: InvokeAI Gallery Panel
+---
+
+# :material-web: InvokeAI Gallery Panel
+
+## Quick guided walkthrough of the Gallery Panel's features
+
+The Gallery Panel is a fast way to review, find, and make use of images you've
+generated and loaded. The Gallery is divided into Boards. The Uncategorized board is always
+present, but you can create your own for better organization.
+
+![image](../assets/gallery/gallery.png)
+
+### Board Display and Settings
+
+At the very top of the Gallery Panel are the boards disclosure and settings buttons.
+
+![image](../assets/gallery/top_controls.png)
+
+The disclosure button shows the name of the currently selected board and allows you to show and hide the board thumbnails (shown in the image below).
+
+![image](../assets/gallery/board_thumbnails.png)
+
+The settings button opens a list of options.
+
+![image](../assets/gallery/board_settings.png)
+
+- ***Image Size*** this slider lets you control the size of the image previews (images of three different sizes).
+- ***Auto-Switch to New Images*** if you turn this on, whenever a new image is generated, it will automatically be loaded into the current image panel on the Text to Image tab and into the result panel on the [Image to Image](IMG2IMG.md) tab. This will happen invisibly if you are on any other tab when the image is generated.
+- ***Auto-Assign Board on Click*** whenever an image is generated or saved, it always gets put in a board. The board it gets put into is marked with AUTO (image of board marked). Turning on Auto-Assign Board on Click will make whichever board you last selected the destination when you click Invoke. That means you can click Invoke, select a different board, and then click Invoke again, and the two images will be put in two different boards. **It's the board selected when Invoke is clicked that's used, not the board that's selected when the image is finished generating.** Turning this off enables the Auto-Add Board drop-down, which lets you set one specific board to always put generated images into. This also enables and disables the Auto-add to this Board menu item described below.
+- ***Always Show Image Size Badge*** this toggles whether to show image sizes for each image preview (show two images, one with sizes shown, one without)
+
+Below these two buttons, you'll see the Search Boards text entry area. You use this to search for specific boards by the name of the board.
+Next to it is the Add Board (+) button which lets you add new boards. Boards can be renamed by clicking on the name of the board under its thumbnail and typing in the new name.
+
+### Board Thumbnail Menu
+
+Each board has a context menu (ctrl+click / right-click).
+
+![image](../assets/gallery/thumbnail_menu.png)
+
+- ***Auto-add to this Board*** if you've disabled Auto-Assign Board on Click in the board settings, you can use this option to set this board to be where new images are put.
+- ***Download Board*** this will add all the images in the board into a zip file and provide a link to it in a notification (image of notification)
+- ***Delete Board*** this will delete the board
+> [!CAUTION]
+> This will delete all the images in the board and the board itself.
+
+### Board Contents
+
+Every board is organized into two tabs, Images and Assets.
+
+![image](../assets/gallery/board_tabs.png)
+
+Images are the Invoke-generated images that are placed into the board. Assets are images that you upload into Invoke to be used as an [Image Prompt](https://support.invoke.ai/support/solutions/articles/151000159340-using-the-image-prompt-adapter-ip-adapter-) or in the [Image to Image](IMG2IMG.md) tab.
+
+### Image Thumbnail Menu
+
+Every image generated by Invoke has its generation information stored as text inside the image file itself. This can be read directly by selecting the image and clicking on the Info button ![image](../assets/gallery/info_button.png) in any of the image result panels. 
+
+Each image also has a context menu (ctrl+click / right-click).
+
+![image](../assets/gallery/image_menu.png)
+
+The options are (items marked with an * will not work with images that lack generation information):
+- ***Open in New Tab*** this will open the image alone in a new browser tab, separate from the Invoke interface.
+- ***Download Image*** this will trigger your browser to download the image.
+- ***Load Workflow*** \* this will load any workflow settings into the Workflow tab and automatically open it.
+- ***Remix Image*** \* this will load all of the image's generation information, **excluding its Seed**, into the left-hand control panel.
+- ***Use Prompt*** \* this will load only the image's text prompts into the left-hand control panel.
+- ***Use Seed*** \* this will load only the image's Seed into the left-hand control panel.
+- ***Use All*** \* this will load all of the image's generation information into the left-hand control panel.
+- ***Send to Image to Image*** this will put the image into the left-hand panel in the Image to Image tab and automatically open it.
+- ***Send to Unified Canvas*** this will **replace whatever is already present** in the Unified Canvas tab with the image and automatically open the tab.
+- ***Change Board*** this will open a small window that will let you move the image to a different board. This is the same as dragging the image to that board's thumbnail.
+- ***Star Image*** this will add the image to the board's list of starred images that are always kept at the top of the gallery. This is the same as clicking on the star on the top right-hand side of the image that appears when you hover over the image with the mouse.
+- ***Delete Image*** this will delete the image from the board
+> [!CAUTION] 
+> This will delete the image entirely from Invoke.
+
+## Summary
+
+This walkthrough only covers the Gallery interface and Boards. Actually generating images is handled by [Prompts](PROMPTS.md), the [Image to Image](IMG2IMG.md) tab, and the [Unified Canvas](UNIFIED_CANVAS.md).
+
+## Acknowledgements
+
+A huge shout-out to the core team working to make the Web GUI a reality,
+including [psychedelicious](https://github.com/psychedelicious),
+[Kyle0654](https://github.com/Kyle0654) and
+[blessedcoolant](https://github.com/blessedcoolant).
+[hipsterusername](https://github.com/hipsterusername) was the team's unofficial
+cheerleader and added tooltips/docs.
--- a/docs/features/PROMPTS.md
+++ b/docs/features/PROMPTS.md
@@ -108,40 +108,6 @@ Can be used with .and():
 Each will give you different results - try them out and see what you prefer!


-
-### Cross-Attention Control ('prompt2prompt')
-
-Sometimes an image you generate is almost right, and you just want to change one
-detail without affecting the rest. You could use a photo editor and inpainting
-to overpaint the area, but that's a pain. Here's where `prompt2prompt` comes in
-handy.
-
-Generate an image with a given prompt, record the seed of the image, and then
-use the `prompt2prompt` syntax to substitute words in the original prompt for
-words in a new prompt. This works for `img2img` as well.
-
-For example, consider the prompt `a cat.swap(dog) playing with a ball in the forest`. Normally, because the words interact with each other when doing a stable diffusion image generation, these two prompts would generate different compositions:
-  - `a cat playing with a ball in the forest`
-  - `a dog playing with a ball in the forest`
-
-| `a cat playing with a ball in the forest` | `a dog playing with a ball in the forest` |
-| --- | --- |
-| img | img |
-
-
-      - For multiple word swaps, use parentheses: `a (fluffy cat).swap(barking dog) playing with a ball in the forest`.
-      - To swap a comma, use quotes: `a ("fluffy, grey cat").swap("big, barking dog") playing with a ball in the forest`.
- Supports options `t_start` and `t_end` (each 0-1) loosely corresponding to (bloc97's)[(https://github.com/bloc97/CrossAttentionControl)] `prompt_edit_tokens_start/_end` but with the math swapped to make it easier to
-  intuitively understand. `t_start` and `t_end` are used to control on which steps cross-attention control should run. With the default values `t_start=0` and `t_end=1`, cross-attention control is active on every step of image generation. Other values can be used to turn cross-attention control off for part of the image generation process.
-    - For example, if doing a diffusion with 10 steps for the prompt is `a cat.swap(dog, t_start=0.3, t_end=1.0) playing with a ball in the forest`, the first 3 steps will be run as `a cat playing with a ball in the forest`, while the last 7 steps will run as `a dog playing with a ball in the forest`, but the pixels that represent `dog` will be locked to the pixels that would have represented `cat` if the `cat` prompt had been used instead.
-    - Conversely, for `a cat.swap(dog, t_start=0, t_end=0.7) playing with a ball in the forest`, the first 7 steps will run as `a dog playing with a ball in the forest` with the pixels that represent `dog` locked to the same pixels that would have represented `cat` if the `cat` prompt was being used instead. The final 3 steps will just run `a cat playing with a ball in the forest`.
-    > For img2img, the step sequence does not start at 0 but instead at `(1.0-strength)` - so if the img2img `strength` is `0.7`, `t_start` and `t_end` must both be greater than `0.3` (`1.0-0.7`) to have any effect.
-
-Prompt2prompt `.swap()` is not compatible with xformers, which will be temporarily disabled when doing a `.swap()` - so you should expect to use more VRAM and run slower that with xformers enabled.
-
-The `prompt2prompt` code is based off
-[bloc97's colab](https://github.com/bloc97/CrossAttentionControl).
-
 ### Escaping parentheses and speech marks 

 If the model you are using has parentheses () or speech marks "" as part of its
--- a/docs/features/WEB.md
+++ b/docs/features/WEB.md
@@ -54,7 +54,7 @@ main sections:
   of buttons at the top lets you modify and manipulate the image in
   various ways.

-3. A **gallery** section on the left that contains a history of the images you
+3. A **gallery** section on the right that contains a history of the images you
   have generated. These images are read and written to the directory specified
   in the `INVOKEAIROOT/invokeai.yaml` initialization file, usually a directory
   named `outputs` in `INVOKEAIROOT`.
--- a/docs/help/FAQ.md
+++ b/docs/help/FAQ.md
@@ -40,6 +40,25 @@ Follow the same steps to scan and import the missing models.
 - Check the `ram` setting in `invokeai.yaml`. This setting tells Invoke how much of your system RAM can be used to cache models. Having this too high or too low can slow things down. That said, it's generally safest to not set this at all and instead let Invoke manage it.
 - Check the `vram` setting in `invokeai.yaml`. This setting tells Invoke how much of your GPU VRAM can be used to cache models. Counter-intuitively, if this setting is too high, Invoke will need to do a lot of shuffling of models as it juggles the VRAM cache and the currently-loaded model. The default value of 0.25 is generally works well for GPUs without 16GB or more VRAM. Even on a 24GB card, the default works well.
 - Check that your generations are happening on your GPU (if you have one). InvokeAI will log what is being used for generation upon startup. If your GPU isn't used, re-install to ensure the correct versions of torch get installed.
+- If you are on Windows, you may have exceeded your GPU's VRAM capacity and are using slower [shared GPU memory](#shared-gpu-memory-windows). There's a guide to opt out of this behaviour in the linked FAQ entry.
+
+## Shared GPU Memory (Windows)
+
+!!! tip "Nvidia GPUs with driver 536.40"
+
+    This only applies to current Nvidia cards with driver 536.40 or later, released in June 2023.
+
+When the GPU doesn't have enough VRAM for a task, Windows is able to allocate some of its CPU RAM to the GPU. This is much slower than VRAM, but it does allow the system to generate when it otherwise might not have enough VRAM.
+
+When shared GPU memory is used, generation slows down dramatically - but at least it doesn't crash.
+
+If you'd like to opt out of this behavior and instead get an error when you exceed your GPU's VRAM, follow [this guide from Nvidia](https://nvidia.custhelp.com/app/answers/detail/a_id/5490).
+
+Here's how to get the python path required in the linked guide:
+
+- Run `invoke.bat`.
+- Select option 2 for developer console.
+- At least one python path will be printed. Copy the path that includes your invoke installation directory (typically the first).

 ## Installer cannot find python (Windows)

--- a/docs/installation/INSTALL_DEVELOPMENT.md
+++ b/docs/installation/INSTALL_DEVELOPMENT.md
@@ -23,6 +23,7 @@ If you have an interest in how InvokeAI works, or you would like to add features

 1. [Fork and clone] the [InvokeAI repo].
 1. Follow the [manual installation] docs to create a new virtual environment for the development install.
+   - Create a new folder outside the repo root for the installation and create the venv inside that folder.
   - When installing the InvokeAI package, add `-e` to the command so you get an [editable install].
 1. Install the [frontend dev toolchain] and do a production build of the UI as described.
 1. You can now run the app as described in the [manual installation] docs.
--- a/invokeai/app/api/routers/app_info.py
+++ b/invokeai/app/api/routers/app_info.py
@@ -12,7 +12,7 @@ from pydantic import BaseModel, Field

 from invokeai.app.invocations.upscale import ESRGAN_MODELS
 from invokeai.app.services.invocation_cache.invocation_cache_common import InvocationCacheStatus
-from invokeai.backend.image_util.patchmatch import PatchMatch
+from invokeai.backend.image_util.infill_methods.patchmatch import PatchMatch
 from invokeai.backend.image_util.safety_checker import SafetyChecker
 from invokeai.backend.util.logging import logging
 from invokeai.version import __version__
@@ -100,7 +100,7 @@ async def get_app_deps() -> AppDependencyVersions:

@app_router.get("/config", operation_id="get_config", status_code=200, response_model=AppConfig)
 async def get_config() -> AppConfig:
-    infill_methods = ["tile", "lama", "cv2"]
+    infill_methods = ["tile", "lama", "cv2", "color"]  # TODO: add mosaic back
    if PatchMatch.patchmatch_available():
        infill_methods.append("patchmatch")

--- a/invokeai/app/api_app.py
+++ b/invokeai/app/api_app.py
@@ -28,7 +28,7 @@ from invokeai.app.api.no_cache_staticfiles import NoCacheStaticFiles
 from invokeai.app.invocations.model import ModelIdentifierField
 from invokeai.app.services.config.config_default import get_config
 from invokeai.app.services.session_processor.session_processor_common import ProgressImage
-from invokeai.backend.util.devices import get_torch_device_name
+from invokeai.backend.util.devices import TorchDevice

 from ..backend.util.logging import InvokeAILogger
 from .api.dependencies import ApiDependencies
@@ -63,7 +63,7 @@ logger = InvokeAILogger.get_logger(config=app_config)
 mimetypes.add_type("application/javascript", ".js")
 mimetypes.add_type("text/css", ".css")

-torch_device_name = get_torch_device_name()
+torch_device_name = TorchDevice.get_torch_device_name()
 logger.info(f"Using torch device: {torch_device_name}")


--- a/invokeai/app/invocations/compel.py
+++ b/invokeai/app/invocations/compel.py
@@ -5,20 +5,26 @@ from compel import Compel, ReturnedEmbeddingsType
 from compel.prompt_parser import Blend, Conjunction, CrossAttentionControlSubstitute, FlattenedPrompt, Fragment
 from transformers import CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer

-from invokeai.app.invocations.fields import FieldDescriptions, Input, InputField, OutputField, UIComponent
+from invokeai.app.invocations.fields import (
+    ConditioningField,
+    FieldDescriptions,
+    Input,
+    InputField,
+    OutputField,
+    TensorField,
+    UIComponent,
+)
 from invokeai.app.invocations.primitives import ConditioningOutput
 from invokeai.app.services.shared.invocation_context import InvocationContext
 from invokeai.app.util.ti_utils import generate_ti_list
-from invokeai.backend.lora.lora_model import LoRAModelRaw
-from invokeai.backend.lora.lora_model_patcher import LoraModelPatcher
+from invokeai.backend.lora import LoRAModelRaw
 from invokeai.backend.model_patcher import ModelPatcher
 from invokeai.backend.stable_diffusion.diffusion.conditioning_data import (
    BasicConditioningInfo,
    ConditioningFieldData,
-    ExtraConditioningInfo,
    SDXLConditioningInfo,
 )
-from invokeai.backend.util.devices import torch_dtype
+from invokeai.backend.util.devices import TorchDevice

 from .baseinvocation import BaseInvocation, BaseInvocationOutput, invocation, invocation_output
 from .model import CLIPField
@@ -37,7 +43,7 @@ from .model import CLIPField
    title="Prompt",
    tags=["prompt", "compel"],
    category="conditioning",
-    version="1.1.1",
+    version="1.2.0",
 )
 class CompelInvocation(BaseInvocation):
    """Parse prompt using compel package to conditioning."""
@@ -52,6 +58,9 @@ class CompelInvocation(BaseInvocation):
        description=FieldDescriptions.clip,
        input=Input.Connection,
    )
+    mask: Optional[TensorField] = InputField(
+        default=None, description="A mask defining the region that this conditioning prompt applies to."
+    )

    @torch.no_grad()
    def invoke(self, context: InvocationContext) -> ConditioningOutput:
@@ -81,7 +90,7 @@ class CompelInvocation(BaseInvocation):
            ),
            text_encoder_info as text_encoder,
            # Apply the LoRA after text_encoder has been moved to its target device for faster patching.
-            LoraModelPatcher.apply_lora_text_encoder(text_encoder, _lora_loader()),
+            ModelPatcher.apply_lora_text_encoder(text_encoder, _lora_loader()),
            # Apply CLIP Skip after LoRA to prevent LoRA application from failing on skipped layers.
            ModelPatcher.apply_clip_skip(text_encoder_model, self.clip.skipped_layers),
        ):
@@ -90,7 +99,7 @@ class CompelInvocation(BaseInvocation):
                tokenizer=tokenizer,
                text_encoder=text_encoder,
                textual_inversion_manager=ti_manager,
-                dtype_for_device_getter=torch_dtype,
+                dtype_for_device_getter=TorchDevice.choose_torch_dtype,
                truncate_long_prompts=False,
            )

@@ -99,27 +108,19 @@ class CompelInvocation(BaseInvocation):
            if context.config.get().log_tokenization:
                log_tokenization_for_conjunction(conjunction, tokenizer)

-            c, options = compel.build_conditioning_tensor_for_conjunction(conjunction)
-
-            ec = ExtraConditioningInfo(
-                tokens_count_including_eos_bos=get_max_token_count(tokenizer, conjunction),
-                cross_attention_control_args=options.get("cross_attention_control", None),
-            )
+            c, _options = compel.build_conditioning_tensor_for_conjunction(conjunction)

        c = c.detach().to("cpu")

-        conditioning_data = ConditioningFieldData(
-            conditionings=[
-                BasicConditioningInfo(
-                    embeds=c,
-                    extra_conditioning=ec,
-                )
-            ]
-        )
+        conditioning_data = ConditioningFieldData(conditionings=[BasicConditioningInfo(embeds=c)])

        conditioning_name = context.conditioning.save(conditioning_data)
-
-        return ConditioningOutput.build(conditioning_name)
+        return ConditioningOutput(
+            conditioning=ConditioningField(
+                conditioning_name=conditioning_name,
+                mask=self.mask,
+            )
+        )


 class SDXLPromptInvocationBase:
@@ -133,7 +134,7 @@ class SDXLPromptInvocationBase:
        get_pooled: bool,
        lora_prefix: str,
        zero_on_empty: bool,
-    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[ExtraConditioningInfo]]:
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
        tokenizer_info = context.models.load(clip_field.tokenizer)
        tokenizer_model = tokenizer_info.model
        assert isinstance(tokenizer_model, CLIPTokenizer)
@@ -160,7 +161,7 @@ class SDXLPromptInvocationBase:
                )
            else:
                c_pooled = None
-            return c, c_pooled, None
+            return c, c_pooled

        def _lora_loader() -> Iterator[Tuple[LoRAModelRaw, float]]:
            for lora in clip_field.loras:
@@ -182,7 +183,7 @@ class SDXLPromptInvocationBase:
            ),
            text_encoder_info as text_encoder,
            # Apply the LoRA after text_encoder has been moved to its target device for faster patching.
-            LoraModelPatcher.apply_lora(text_encoder, _lora_loader(), lora_prefix),
+            ModelPatcher.apply_lora(text_encoder, _lora_loader(), lora_prefix),
            # Apply CLIP Skip after LoRA to prevent LoRA application from failing on skipped layers.
            ModelPatcher.apply_clip_skip(text_encoder_model, clip_field.skipped_layers),
        ):
@@ -192,7 +193,7 @@ class SDXLPromptInvocationBase:
                tokenizer=tokenizer,
                text_encoder=text_encoder,
                textual_inversion_manager=ti_manager,
-                dtype_for_device_getter=torch_dtype,
+                dtype_for_device_getter=TorchDevice.choose_torch_dtype,
                truncate_long_prompts=False,  # TODO:
                returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED,  # TODO: clip skip
                requires_pooled=get_pooled,
@@ -205,17 +206,12 @@ class SDXLPromptInvocationBase:
                log_tokenization_for_conjunction(conjunction, tokenizer)

            # TODO: ask for optimizations? to not run text_encoder twice
-            c, options = compel.build_conditioning_tensor_for_conjunction(conjunction)
+            c, _options = compel.build_conditioning_tensor_for_conjunction(conjunction)
            if get_pooled:
                c_pooled = compel.conditioning_provider.get_pooled_embeddings([prompt])
            else:
                c_pooled = None

-            ec = ExtraConditioningInfo(
-                tokens_count_including_eos_bos=get_max_token_count(tokenizer, conjunction),
-                cross_attention_control_args=options.get("cross_attention_control", None),
-            )
-
        del tokenizer
        del text_encoder
        del tokenizer_info
@@ -225,7 +221,7 @@ class SDXLPromptInvocationBase:
        if c_pooled is not None:
            c_pooled = c_pooled.detach().to("cpu")

-        return c, c_pooled, ec
+        return c, c_pooled


@invocation(
@@ -233,7 +229,7 @@ class SDXLPromptInvocationBase:
    title="SDXL Prompt",
    tags=["sdxl", "compel", "prompt"],
    category="conditioning",
-    version="1.1.1",
+    version="1.2.0",
 )
 class SDXLCompelPromptInvocation(BaseInvocation, SDXLPromptInvocationBase):
    """Parse prompt using compel package to conditioning."""
@@ -256,20 +252,19 @@ class SDXLCompelPromptInvocation(BaseInvocation, SDXLPromptInvocationBase):
    target_height: int = InputField(default=1024, description="")
    clip: CLIPField = InputField(description=FieldDescriptions.clip, input=Input.Connection, title="CLIP 1")
    clip2: CLIPField = InputField(description=FieldDescriptions.clip, input=Input.Connection, title="CLIP 2")
+    mask: Optional[TensorField] = InputField(
+        default=None, description="A mask defining the region that this conditioning prompt applies to."
+    )

    @torch.no_grad()
    def invoke(self, context: InvocationContext) -> ConditioningOutput:
-        c1, c1_pooled, ec1 = self.run_clip_compel(
-            context, self.clip, self.prompt, False, "lora_te1_", zero_on_empty=True
-        )
+        c1, c1_pooled = self.run_clip_compel(context, self.clip, self.prompt, False, "lora_te1_", zero_on_empty=True)
        if self.style.strip() == "":
-            c2, c2_pooled, ec2 = self.run_clip_compel(
+            c2, c2_pooled = self.run_clip_compel(
                context, self.clip2, self.prompt, True, "lora_te2_", zero_on_empty=True
            )
        else:
-            c2, c2_pooled, ec2 = self.run_clip_compel(
-                context, self.clip2, self.style, True, "lora_te2_", zero_on_empty=True
-            )
+            c2, c2_pooled = self.run_clip_compel(context, self.clip2, self.style, True, "lora_te2_", zero_on_empty=True)

        original_size = (self.original_height, self.original_width)
        crop_coords = (self.crop_top, self.crop_left)
@@ -308,17 +303,19 @@ class SDXLCompelPromptInvocation(BaseInvocation, SDXLPromptInvocationBase):
        conditioning_data = ConditioningFieldData(
            conditionings=[
                SDXLConditioningInfo(
-                    embeds=torch.cat([c1, c2], dim=-1),
-                    pooled_embeds=c2_pooled,
-                    add_time_ids=add_time_ids,
-                    extra_conditioning=ec1,
+                    embeds=torch.cat([c1, c2], dim=-1), pooled_embeds=c2_pooled, add_time_ids=add_time_ids
                )
            ]
        )

        conditioning_name = context.conditioning.save(conditioning_data)

-        return ConditioningOutput.build(conditioning_name)
+        return ConditioningOutput(
+            conditioning=ConditioningField(
+                conditioning_name=conditioning_name,
+                mask=self.mask,
+            )
+        )


@invocation(
@@ -346,7 +343,7 @@ class SDXLRefinerCompelPromptInvocation(BaseInvocation, SDXLPromptInvocationBase
    @torch.no_grad()
    def invoke(self, context: InvocationContext) -> ConditioningOutput:
        # TODO: if there will appear lora for refiner - write proper prefix
-        c2, c2_pooled, ec2 = self.run_clip_compel(context, self.clip2, self.style, True, "<NONE>", zero_on_empty=False)
+        c2, c2_pooled = self.run_clip_compel(context, self.clip2, self.style, True, "<NONE>", zero_on_empty=False)

        original_size = (self.original_height, self.original_width)
        crop_coords = (self.crop_top, self.crop_left)
@@ -355,14 +352,7 @@ class SDXLRefinerCompelPromptInvocation(BaseInvocation, SDXLPromptInvocationBase

        assert c2_pooled is not None
        conditioning_data = ConditioningFieldData(
-            conditionings=[
-                SDXLConditioningInfo(
-                    embeds=c2,
-                    pooled_embeds=c2_pooled,
-                    add_time_ids=add_time_ids,
-                    extra_conditioning=ec2,  # or None
-                )
-            ]
+            conditionings=[SDXLConditioningInfo(embeds=c2, pooled_embeds=c2_pooled, add_time_ids=add_time_ids)]
        )

        conditioning_name = context.conditioning.save(conditioning_data)
--- a/invokeai/app/invocations/fields.py
+++ b/invokeai/app/invocations/fields.py
@@ -203,6 +203,12 @@ class DenoiseMaskField(BaseModel):
    gradient: bool = Field(default=False, description="Used for gradient inpainting")


+class TensorField(BaseModel):
+    """A tensor primitive field."""
+
+    tensor_name: str = Field(description="The name of a tensor.")
+
+
 class LatentsField(BaseModel):
    """A latents tensor primitive field"""

@@ -226,7 +232,11 @@ class ConditioningField(BaseModel):
    """A conditioning tensor primitive value"""

    conditioning_name: str = Field(description="The name of conditioning tensor")
-    # endregion
+    mask: Optional[TensorField] = Field(
+        default=None,
+        description="The mask associated with this conditioning tensor. Excluded regions should be set to False, "
+        "included regions should be set to True.",
+    )


 class MetadataField(RootModel[dict[str, Any]]):
--- a/invokeai/app/invocations/infill.py
+++ b/invokeai/app/invocations/infill.py
@@ -1,154 +1,91 @@
-# Copyright (c) 2022 Kyle Schouviller (https://github.com/kyle0654) and the InvokeAI Team
+from abc import abstractmethod
+from typing import Literal, get_args

-import math
-from typing import Literal, Optional, get_args
-
-import numpy as np
-from PIL import Image, ImageOps
+from PIL import Image

 from invokeai.app.invocations.fields import ColorField, ImageField
 from invokeai.app.invocations.primitives import ImageOutput
 from invokeai.app.services.shared.invocation_context import InvocationContext
-from invokeai.app.util.download_with_progress import download_with_progress_bar
 from invokeai.app.util.misc import SEED_MAX
-from invokeai.backend.image_util.cv2_inpaint import cv2_inpaint
-from invokeai.backend.image_util.lama import LaMA
-from invokeai.backend.image_util.patchmatch import PatchMatch
+from invokeai.backend.image_util.infill_methods.cv2_inpaint import cv2_inpaint
+from invokeai.backend.image_util.infill_methods.lama import LaMA
+from invokeai.backend.image_util.infill_methods.mosaic import infill_mosaic
+from invokeai.backend.image_util.infill_methods.patchmatch import PatchMatch, infill_patchmatch
+from invokeai.backend.image_util.infill_methods.tile import infill_tile
+from invokeai.backend.util.logging import InvokeAILogger

 from .baseinvocation import BaseInvocation, invocation
 from .fields import InputField, WithBoard, WithMetadata
 from .image import PIL_RESAMPLING_MAP, PIL_RESAMPLING_MODES

+logger = InvokeAILogger.get_logger()

-def infill_methods() -> list[str]:
-    methods = ["tile", "solid", "lama", "cv2"]
+
+def get_infill_methods():
+    methods = Literal["tile", "color", "lama", "cv2"]  # TODO: add mosaic back
    if PatchMatch.patchmatch_available():
-        methods.insert(0, "patchmatch")
+        methods = Literal["patchmatch", "tile", "color", "lama", "cv2"]  # TODO: add mosaic back
    return methods


-INFILL_METHODS = Literal[tuple(infill_methods())]
+INFILL_METHODS = get_infill_methods()
 DEFAULT_INFILL_METHOD = "patchmatch" if "patchmatch" in get_args(INFILL_METHODS) else "tile"


-def infill_lama(im: Image.Image) -> Image.Image:
-    lama = LaMA()
-    return lama(im)
+class InfillImageProcessorInvocation(BaseInvocation, WithMetadata, WithBoard):
+    """Base class for invocations that preprocess images for Infilling"""

+    image: ImageField = InputField(description="The image to process")

-def infill_patchmatch(im: Image.Image) -> Image.Image:
-    if im.mode != "RGBA":
-        return im
+    @abstractmethod
+    def infill(self, image: Image.Image) -> Image.Image:
+        """Infill the image with the specified method"""
+        pass

-    # Skip patchmatch if patchmatch isn't available
-    if not PatchMatch.patchmatch_available():
-        return im
+    def load_image(self, context: InvocationContext) -> tuple[Image.Image, bool]:
+        """Load the input image and report whether it has an alpha channel (RGBA mode)"""
+        image = context.images.get_pil(self.image.image_name)
+        has_alpha = image.mode == "RGBA"
+        return image, has_alpha

-    # Patchmatch (note, we may want to expose patch_size? Increasing it significantly impacts performance though)
-    im_patched_np = PatchMatch.inpaint(im.convert("RGB"), ImageOps.invert(im.split()[-1]), patch_size=3)
-    im_patched = Image.fromarray(im_patched_np, mode="RGB")
-    return im_patched
+    def invoke(self, context: InvocationContext) -> ImageOutput:
+        # Retrieve and process image to be infilled
+        input_image, has_alpha = self.load_image(context)

+        # Without an alpha channel there is nothing to infill, so return the original image unchanged
+        if not has_alpha:
+            return ImageOutput.build(context.images.get_dto(self.image.image_name))

-def infill_cv2(im: Image.Image) -> Image.Image:
-    return cv2_inpaint(im)
+        # Perform Infill action
+        infilled_image = self.infill(input_image)

+        # Create ImageDTO for Infilled Image
+        infilled_image_dto = context.images.save(image=infilled_image)

-def get_tile_images(image: np.ndarray, width=8, height=8):
-    _nrows, _ncols, depth = image.shape
-    _strides = image.strides
-
-    nrows, _m = divmod(_nrows, height)
-    ncols, _n = divmod(_ncols, width)
-    if _m != 0 or _n != 0:
-        return None
-
-    return np.lib.stride_tricks.as_strided(
-        np.ravel(image),
-        shape=(nrows, ncols, height, width, depth),
-        strides=(height * _strides[0], width * _strides[1], *_strides),
-        writeable=False,
-    )
-
-
-def tile_fill_missing(im: Image.Image, tile_size: int = 16, seed: Optional[int] = None) -> Image.Image:
-    # Only fill if there's an alpha layer
-    if im.mode != "RGBA":
-        return im
-
-    a = np.asarray(im, dtype=np.uint8)
-
-    tile_size_tuple = (tile_size, tile_size)
-
-    # Get the image as tiles of a specified size
-    tiles = get_tile_images(a, *tile_size_tuple).copy()
-
-    # Get the mask as tiles
-    tiles_mask = tiles[:, :, :, :, 3]
-
-    # Find any mask tiles with any fully transparent pixels (we will be replacing these later)
-    tmask_shape = tiles_mask.shape
-    tiles_mask = tiles_mask.reshape(math.prod(tiles_mask.shape))
-    n, ny = (math.prod(tmask_shape[0:2])), math.prod(tmask_shape[2:])
-    tiles_mask = tiles_mask > 0
-    tiles_mask = tiles_mask.reshape((n, ny)).all(axis=1)
-
-    # Get RGB tiles in single array and filter by the mask
-    tshape = tiles.shape
-    tiles_all = tiles.reshape((math.prod(tiles.shape[0:2]), *tiles.shape[2:]))
-    filtered_tiles = tiles_all[tiles_mask]
-
-    if len(filtered_tiles) == 0:
-        return im
-
-    # Find all invalid tiles and replace with a random valid tile
-    replace_count = (tiles_mask == False).sum()  # noqa: E712
-    rng = np.random.default_rng(seed=seed)
-    tiles_all[np.logical_not(tiles_mask)] = filtered_tiles[rng.choice(filtered_tiles.shape[0], replace_count), :, :, :]
-
-    # Convert back to an image
-    tiles_all = tiles_all.reshape(tshape)
-    tiles_all = tiles_all.swapaxes(1, 2)
-    st = tiles_all.reshape(
-        (
-            math.prod(tiles_all.shape[0:2]),
-            math.prod(tiles_all.shape[2:4]),
-            tiles_all.shape[4],
-        )
-    )
-    si = Image.fromarray(st, mode="RGBA")
-
-    return si
+        # Return Infilled Image
+        return ImageOutput.build(infilled_image_dto)


@invocation("infill_rgba", title="Solid Color Infill", tags=["image", "inpaint"], category="inpaint", version="1.2.2")
-class InfillColorInvocation(BaseInvocation, WithMetadata, WithBoard):
+class InfillColorInvocation(InfillImageProcessorInvocation):
    """Infills transparent areas of an image with a solid color"""

-    image: ImageField = InputField(description="The image to infill")
    color: ColorField = InputField(
        default=ColorField(r=127, g=127, b=127, a=255),
        description="The color to use to infill",
    )

-    def invoke(self, context: InvocationContext) -> ImageOutput:
-        image = context.images.get_pil(self.image.image_name)
-
+    def infill(self, image: Image.Image):
        solid_bg = Image.new("RGBA", image.size, self.color.tuple())
        infilled = Image.alpha_composite(solid_bg, image.convert("RGBA"))
-
        infilled.paste(image, (0, 0), image.split()[-1])
-
-        image_dto = context.images.save(image=infilled)
-
-        return ImageOutput.build(image_dto)
+        return infilled


@invocation("infill_tile", title="Tile Infill", tags=["image", "inpaint"], category="inpaint", version="1.2.3")
-class InfillTileInvocation(BaseInvocation, WithMetadata, WithBoard):
+class InfillTileInvocation(InfillImageProcessorInvocation):
    """Infills transparent areas of an image with tiles of the image"""

-    image: ImageField = InputField(description="The image to infill")
    tile_size: int = InputField(default=32, ge=1, description="The tile size (px)")
    seed: int = InputField(
        default=0,
@@ -157,92 +94,74 @@ class InfillTileInvocation(BaseInvocation, WithMetadata, WithBoard):
        description="The seed to use for tile generation (omit for random)",
    )

-    def invoke(self, context: InvocationContext) -> ImageOutput:
-        image = context.images.get_pil(self.image.image_name)
-
-        infilled = tile_fill_missing(image.copy(), seed=self.seed, tile_size=self.tile_size)
-        infilled.paste(image, (0, 0), image.split()[-1])
-
-        image_dto = context.images.save(image=infilled)
-
-        return ImageOutput.build(image_dto)
+    def infill(self, image: Image.Image):
+        output = infill_tile(image, seed=self.seed, tile_size=self.tile_size)
+        return output.infilled


@invocation(
    "infill_patchmatch", title="PatchMatch Infill", tags=["image", "inpaint"], category="inpaint", version="1.2.2"
 )
-class InfillPatchMatchInvocation(BaseInvocation, WithMetadata, WithBoard):
+class InfillPatchMatchInvocation(InfillImageProcessorInvocation):
    """Infills transparent areas of an image using the PatchMatch algorithm"""

-    image: ImageField = InputField(description="The image to infill")
    downscale: float = InputField(default=2.0, gt=0, description="Run patchmatch on downscaled image to speedup infill")
    resample_mode: PIL_RESAMPLING_MODES = InputField(default="bicubic", description="The resampling mode")

-    def invoke(self, context: InvocationContext) -> ImageOutput:
-        image = context.images.get_pil(self.image.image_name).convert("RGBA")
-
+    def infill(self, image: Image.Image):
        resample_mode = PIL_RESAMPLING_MAP[self.resample_mode]

-        infill_image = image.copy()
        width = int(image.width / self.downscale)
        height = int(image.height / self.downscale)
-        infill_image = infill_image.resize(
+
+        infilled = image.resize(
            (width, height),
            resample=resample_mode,
        )
-
-        if PatchMatch.patchmatch_available():
-            infilled = infill_patchmatch(infill_image)
-        else:
-            raise ValueError("PatchMatch is not available on this system")
-
+        infilled = infill_patchmatch(infilled)  # run PatchMatch on the downscaled copy
        infilled = infilled.resize(
            (image.width, image.height),
            resample=resample_mode,
        )
-
        infilled.paste(image, (0, 0), mask=image.split()[-1])
-        # image.paste(infilled, (0, 0), mask=image.split()[-1])

-        image_dto = context.images.save(image=infilled)
-
-        return ImageOutput.build(image_dto)
+        return infilled


@invocation("infill_lama", title="LaMa Infill", tags=["image", "inpaint"], category="inpaint", version="1.2.2")
-class LaMaInfillInvocation(BaseInvocation, WithMetadata, WithBoard):
+class LaMaInfillInvocation(InfillImageProcessorInvocation):
    """Infills transparent areas of an image using the LaMa model"""

-    image: ImageField = InputField(description="The image to infill")
-
-    def invoke(self, context: InvocationContext) -> ImageOutput:
-        image = context.images.get_pil(self.image.image_name)
-
-        # Downloads the LaMa model if it doesn't already exist
-        download_with_progress_bar(
-            name="LaMa Inpainting Model",
-            url="https://github.com/Sanster/models/releases/download/add_big_lama/big-lama.pt",
-            dest_path=context.config.get().models_path / "core/misc/lama/lama.pt",
-        )
-
-        infilled = infill_lama(image.copy())
-
-        image_dto = context.images.save(image=infilled)
-
-        return ImageOutput.build(image_dto)
+    def infill(self, image: Image.Image):
+        lama = LaMA()
+        return lama(image)


@invocation("infill_cv2", title="CV2 Infill", tags=["image", "inpaint"], category="inpaint", version="1.2.2")
-class CV2InfillInvocation(BaseInvocation, WithMetadata, WithBoard):
+class CV2InfillInvocation(InfillImageProcessorInvocation):
    """Infills transparent areas of an image using OpenCV Inpainting"""

+    def infill(self, image: Image.Image):
+        return cv2_inpaint(image)
+
+
+# @invocation(
+#     "infill_mosaic", title="Mosaic Infill", tags=["image", "inpaint", "outpaint"], category="inpaint", version="1.0.0"
+# )
+class MosaicInfillInvocation(InfillImageProcessorInvocation):
+    """Infills transparent areas of an image with a mosaic pattern drawing colors from the rest of the image"""
+
    image: ImageField = InputField(description="The image to infill")
+    tile_width: int = InputField(default=64, description="Width of the tile")
+    tile_height: int = InputField(default=64, description="Height of the tile")
+    min_color: ColorField = InputField(
+        default=ColorField(r=0, g=0, b=0, a=255),
+        description="The min threshold for color",
+    )
+    max_color: ColorField = InputField(
+        default=ColorField(r=255, g=255, b=255, a=255),
+        description="The max threshold for color",
+    )

-    def invoke(self, context: InvocationContext) -> ImageOutput:
-        image = context.images.get_pil(self.image.image_name)
-
-        infilled = infill_cv2(image.copy())
-
-        image_dto = context.images.save(image=infilled)
-
-        return ImageOutput.build(image_dto)
+    def infill(self, image: Image.Image):
+        return infill_mosaic(image, (self.tile_width, self.tile_height), self.min_color.tuple(), self.max_color.tuple())
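
The refactor above moves the shared load / alpha-check / save flow into `InfillImageProcessorInvocation.invoke` and leaves each subclass with only an `infill` method. A minimal, dependency-free sketch of that template-method shape follows. `TinyImage`, `InfillProcessor`, `SolidColorInfill`, and `run` are hypothetical stand-ins chosen for illustration; they are not part of InvokeAI or PIL, and the sketch only mirrors the control flow, not the real image handling.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class TinyImage:
    """Hypothetical stand-in for PIL.Image.Image: a mode string plus a flat pixel list."""

    mode: str
    pixels: list


class InfillProcessor(ABC):
    """Template method: run() owns the shared flow, subclasses supply infill()."""

    @abstractmethod
    def infill(self, image: TinyImage) -> TinyImage:
        """Return a copy of the image with transparent areas filled."""

    def run(self, image: TinyImage) -> TinyImage:
        # Images without an alpha channel have nothing to infill; pass them through unchanged.
        if image.mode != "RGBA":
            return image
        return self.infill(image)


class SolidColorInfill(InfillProcessor):
    """Fills fully transparent pixels with one RGBA color."""

    def __init__(self, color: tuple):
        self.color = color

    def infill(self, image: TinyImage) -> TinyImage:
        # Replace pixels whose alpha is 0 with the fill color; keep all others.
        filled = [self.color if p[3] == 0 else p for p in image.pixels]
        return TinyImage("RGBA", filled)


if __name__ == "__main__":
    src = TinyImage("RGBA", [(10, 20, 30, 255), (0, 0, 0, 0)])
    print(SolidColorInfill((127, 127, 127, 255)).run(src).pixels)
```

With this shape, adding a new infill method means writing one subclass with one method, which is what the mosaic class above does.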
--- a/invokeai/app/invocations/ip_adapter.py
+++ b/invokeai/app/invocations/ip_adapter.py
@@ -1,11 +1,11 @@
 from builtins import float
-from typing import List, Literal, Union
+from typing import List, Literal, Optional, Union

 from pydantic import BaseModel, Field, field_validator, model_validator
 from typing_extensions import Self

 from invokeai.app.invocations.baseinvocation import BaseInvocation, BaseInvocationOutput, invocation, invocation_output
-from invokeai.app.invocations.fields import FieldDescriptions, Input, InputField, OutputField, UIType
+from invokeai.app.invocations.fields import FieldDescriptions, Input, InputField, OutputField, TensorField, UIType
 from invokeai.app.invocations.model import ModelIdentifierField
 from invokeai.app.invocations.primitives import ImageField
 from invokeai.app.invocations.util import validate_begin_end_step, validate_weights
@@ -23,13 +23,19 @@ class IPAdapterField(BaseModel):
    image: Union[ImageField, List[ImageField]] = Field(description="The IP-Adapter image prompt(s).")
    ip_adapter_model: ModelIdentifierField = Field(description="The IP-Adapter model to use.")
    image_encoder_model: ModelIdentifierField = Field(description="The name of the CLIP image encoder model.")
-    weight: Union[float, List[float]] = Field(default=1, description="The weight given to the ControlNet")
+    weight: Union[float, List[float]] = Field(default=1, description="The weight given to the IP-Adapter.")
+    target_blocks: List[str] = Field(default=[], description="The IP-Adapter blocks to apply.")
    begin_step_percent: float = Field(
        default=0, ge=0, le=1, description="When the IP-Adapter is first applied (% of total steps)"
    )
    end_step_percent: float = Field(
        default=1, ge=0, le=1, description="When the IP-Adapter is last applied (% of total steps)"
    )
+    mask: Optional[TensorField] = Field(
+        default=None,
+        description="The bool mask associated with this IP-Adapter. Excluded regions should be set to False, included "
+        "regions should be set to True.",
+    )

    @field_validator("weight")
    @classmethod
@@ -52,7 +58,7 @@ class IPAdapterOutput(BaseInvocationOutput):
 CLIP_VISION_MODEL_MAP = {"ViT-H": "ip_adapter_sd_image_encoder", "ViT-G": "ip_adapter_sdxl_image_encoder"}


-@invocation("ip_adapter", title="IP-Adapter", tags=["ip_adapter", "control"], category="ip_adapter", version="1.2.2")
+@invocation("ip_adapter", title="IP-Adapter", tags=["ip_adapter", "control"], category="ip_adapter", version="1.4.0")
 class IPAdapterInvocation(BaseInvocation):
    """Collects IP-Adapter info to pass to other nodes."""

@@ -65,20 +71,26 @@ class IPAdapterInvocation(BaseInvocation):
        ui_order=-1,
        ui_type=UIType.IPAdapterModel,
    )
-    clip_vision_model: Literal["auto", "ViT-H", "ViT-G"] = InputField(
+    clip_vision_model: Literal["ViT-H", "ViT-G"] = InputField(
        description="CLIP Vision model to use. Overrides model settings. Mandatory for checkpoint models.",
-        default="auto",
+        default="ViT-H",
        ui_order=2,
    )
    weight: Union[float, List[float]] = InputField(
        default=1, description="The weight given to the IP-Adapter", title="Weight"
    )
+    method: Literal["full", "style", "composition"] = InputField(
+        default="full", description="The method to apply the IP-Adapter"
+    )
    begin_step_percent: float = InputField(
        default=0, ge=0, le=1, description="When the IP-Adapter is first applied (% of total steps)"
    )
    end_step_percent: float = InputField(
        default=1, ge=0, le=1, description="When the IP-Adapter is last applied (% of total steps)"
    )
+    mask: Optional[TensorField] = InputField(
+        default=None, description="A mask defining the region that this IP-Adapter applies to."
+    )

    @field_validator("weight")
    @classmethod
@@ -96,27 +108,43 @@ class IPAdapterInvocation(BaseInvocation):
        ip_adapter_info = context.models.get_config(self.ip_adapter_model.key)
        assert isinstance(ip_adapter_info, (IPAdapterInvokeAIConfig, IPAdapterCheckpointConfig))

-        if self.clip_vision_model == "auto":
-            if isinstance(ip_adapter_info, IPAdapterInvokeAIConfig):
-                image_encoder_model_id = ip_adapter_info.image_encoder_model_id
-                image_encoder_model_name = image_encoder_model_id.split("/")[-1].strip()
-            else:
-                raise RuntimeError(
-                    "You need to set the appropriate CLIP Vision model for checkpoint IP Adapter models."
-                )
+        if isinstance(ip_adapter_info, IPAdapterInvokeAIConfig):
+            image_encoder_model_id = ip_adapter_info.image_encoder_model_id
+            image_encoder_model_name = image_encoder_model_id.split("/")[-1].strip()
        else:
            image_encoder_model_name = CLIP_VISION_MODEL_MAP[self.clip_vision_model]

        image_encoder_model = self._get_image_encoder(context, image_encoder_model_name)

+        if self.method == "style":
+            if ip_adapter_info.base == "sd-1":
+                target_blocks = ["up_blocks.1"]
+            elif ip_adapter_info.base == "sdxl":
+                target_blocks = ["up_blocks.0.attentions.1"]
+            else:
+                raise ValueError(f"Unsupported IP-Adapter base type: '{ip_adapter_info.base}'.")
+        elif self.method == "composition":
+            if ip_adapter_info.base == "sd-1":
+                target_blocks = ["down_blocks.2", "mid_block"]
+            elif ip_adapter_info.base == "sdxl":
+                target_blocks = ["down_blocks.2.attentions.1"]
+            else:
+                raise ValueError(f"Unsupported IP-Adapter base type: '{ip_adapter_info.base}'.")
+        elif self.method == "full":
+            target_blocks = ["block"]
+        else:
+            raise ValueError(f"Unexpected IP-Adapter method: '{self.method}'.")
+
        return IPAdapterOutput(
            ip_adapter=IPAdapterField(
                image=self.image,
                ip_adapter_model=self.ip_adapter_model,
                image_encoder_model=ModelIdentifierField.from_config(image_encoder_model),
                weight=self.weight,
+                target_blocks=target_blocks,
                begin_step_percent=self.begin_step_percent,
                end_step_percent=self.end_step_percent,
+                mask=self.mask,
            ),
        )

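
The `method` branch in `IPAdapterInvocation.invoke` above maps a (method, base model) pair to a list of UNet block names. The same mapping can be sketched as a lookup table, which keeps the block names in one place. The block names and error messages are copied from the diff; `TARGET_BLOCKS` and `resolve_target_blocks` are hypothetical names used only for this illustration and do not exist in InvokeAI.

```python
# (method, base) -> UNet block names, as listed in the diff above.
TARGET_BLOCKS = {
    ("style", "sd-1"): ["up_blocks.1"],
    ("style", "sdxl"): ["up_blocks.0.attentions.1"],
    ("composition", "sd-1"): ["down_blocks.2", "mid_block"],
    ("composition", "sdxl"): ["down_blocks.2.attentions.1"],
}


def resolve_target_blocks(method: str, base: str) -> list:
    """Return the block names for an IP-Adapter method and base model type."""
    if method == "full":
        # "full" applies the adapter everywhere, regardless of the base model.
        return ["block"]
    if method not in ("style", "composition"):
        raise ValueError(f"Unexpected IP-Adapter method: '{method}'.")
    try:
        # Copy the list so callers cannot mutate the shared table.
        return list(TARGET_BLOCKS[(method, base)])
    except KeyError:
        raise ValueError(f"Unsupported IP-Adapter base type: '{base}'.") from None


if __name__ == "__main__":
    print(resolve_target_blocks("composition", "sd-1"))
```

The if/elif chain in the diff and this table are behaviorally equivalent for the listed cases; the table form is just one way to make the supported combinations easier to scan.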
--- a/invokeai/app/invocations/latent.py
+++ b/invokeai/app/invocations/latent.py
@@ -1,5 +1,5 @@
 # Copyright (c) 2023 Kyle Schouviller (https://github.com/kyle0654)
-
+import inspect
 import math
 from contextlib import ExitStack
 from functools import singledispatchmethod
@@ -9,6 +9,7 @@ import einops
 import numpy as np
 import numpy.typing as npt
 import torch
+import torchvision
 import torchvision.transforms as T
 from diffusers import AutoencoderKL, AutoencoderTiny
 from diffusers.configuration_utils import ConfigMixin
@@ -48,31 +49,35 @@ from invokeai.app.invocations.t2i_adapter import T2IAdapterField
 from invokeai.app.services.shared.invocation_context import InvocationContext
 from invokeai.app.util.controlnet_utils import prepare_control_image
 from invokeai.backend.ip_adapter.ip_adapter import IPAdapter, IPAdapterPlus
-from invokeai.backend.lora.lora_model import LoRAModelRaw
-from invokeai.backend.lora.lora_model_patcher import LoraModelPatcher
+from invokeai.backend.lora import LoRAModelRaw
 from invokeai.backend.model_manager import BaseModelType, LoadedModel
 from invokeai.backend.model_patcher import ModelPatcher
 from invokeai.backend.stable_diffusion import PipelineIntermediateState, set_seamless
-from invokeai.backend.stable_diffusion.diffusion.conditioning_data import ConditioningData, IPAdapterConditioningInfo
+from invokeai.backend.stable_diffusion.diffusion.conditioning_data import (
+    BasicConditioningInfo,
+    IPAdapterConditioningInfo,
+    IPAdapterData,
+    Range,
+    SDXLConditioningInfo,
+    TextConditioningData,
+    TextConditioningRegions,
+)
+from invokeai.backend.util.mask import to_standard_float_mask
 from invokeai.backend.util.silence_warnings import SilenceWarnings

 from ...backend.stable_diffusion.diffusers_pipeline import (
    ControlNetData,
-    IPAdapterData,
    StableDiffusionGeneratorPipeline,
    T2IAdapterData,
    image_resized_to_grid_as_tensor,
 )
 from ...backend.stable_diffusion.schedulers import SCHEDULER_MAP
-from ...backend.util.devices import choose_precision, choose_torch_device
+from ...backend.util.devices import TorchDevice
 from .baseinvocation import BaseInvocation, BaseInvocationOutput, invocation, invocation_output
 from .controlnet_image_processors import ControlField
 from .model import ModelIdentifierField, UNetField, VAEField

-if choose_torch_device() == torch.device("mps"):
-    from torch import mps
-
-DEFAULT_PRECISION = choose_precision(choose_torch_device())
+DEFAULT_PRECISION = TorchDevice.choose_torch_dtype()


@invocation_output("scheduler_output")
@@ -276,10 +281,10 @@ def get_scheduler(
 class DenoiseLatentsInvocation(BaseInvocation):
    """Denoises noisy latents to decodable images"""

-    positive_conditioning: ConditioningField = InputField(
+    positive_conditioning: Union[ConditioningField, list[ConditioningField]] = InputField(
        description=FieldDescriptions.positive_cond, input=Input.Connection, ui_order=0
    )
-    negative_conditioning: ConditioningField = InputField(
+    negative_conditioning: Union[ConditioningField, list[ConditioningField]] = InputField(
        description=FieldDescriptions.negative_cond, input=Input.Connection, ui_order=1
    )
    noise: Optional[LatentsField] = InputField(
@@ -357,33 +362,168 @@ class DenoiseLatentsInvocation(BaseInvocation):
                raise ValueError("cfg_scale must be greater than 1")
        return v

+    def _get_text_embeddings_and_masks(
+        self,
+        cond_list: list[ConditioningField],
+        context: InvocationContext,
+        device: torch.device,
+        dtype: torch.dtype,
+    ) -> tuple[Union[list[BasicConditioningInfo], list[SDXLConditioningInfo]], list[Optional[torch.Tensor]]]:
+        """Get the text embeddings and masks from the input conditioning fields."""
+        text_embeddings: Union[list[BasicConditioningInfo], list[SDXLConditioningInfo]] = []
+        text_embeddings_masks: list[Optional[torch.Tensor]] = []
+        for cond in cond_list:
+            cond_data = context.conditioning.load(cond.conditioning_name)
+            text_embeddings.append(cond_data.conditionings[0].to(device=device, dtype=dtype))
+
+            mask = cond.mask
+            if mask is not None:
+                mask = context.tensors.load(mask.tensor_name)
+            text_embeddings_masks.append(mask)
+
+        return text_embeddings, text_embeddings_masks
+
+    def _preprocess_regional_prompt_mask(
+        self, mask: Optional[torch.Tensor], target_height: int, target_width: int, dtype: torch.dtype
+    ) -> torch.Tensor:
+        """Preprocess a regional prompt mask to match the target height and width.
+        If mask is None, returns a mask of all ones with the target height and width.
+        If mask is not None, resizes the mask to the target height and width using 'nearest' interpolation.
+
+        Returns:
+            torch.Tensor: The processed mask. shape: (1, 1, target_height, target_width).
+        """
+
+        if mask is None:
+            return torch.ones((1, 1, target_height, target_width), dtype=dtype)
+
+        mask = to_standard_float_mask(mask, out_dtype=dtype)
+
+        tf = torchvision.transforms.Resize(
+            (target_height, target_width), interpolation=torchvision.transforms.InterpolationMode.NEAREST
+        )
+
+        # Add a batch dimension to the mask, because torchvision expects shape (batch, channels, h, w).
+        mask = mask.unsqueeze(0)  # Shape: (1, h, w) -> (1, 1, h, w)
+        resized_mask = tf(mask)
+        return resized_mask
+
+    def _concat_regional_text_embeddings(
+        self,
+        text_conditionings: Union[list[BasicConditioningInfo], list[SDXLConditioningInfo]],
+        masks: Optional[list[Optional[torch.Tensor]]],
+        latent_height: int,
+        latent_width: int,
+        dtype: torch.dtype,
+    ) -> tuple[Union[BasicConditioningInfo, SDXLConditioningInfo], Optional[TextConditioningRegions]]:
+        """Concatenate regional text embeddings into a single embedding and track the region masks accordingly."""
+        if masks is None:
+            masks = [None] * len(text_conditionings)
+        assert len(text_conditionings) == len(masks)
+
+        is_sdxl = type(text_conditionings[0]) is SDXLConditioningInfo
+
+        all_masks_are_none = all(mask is None for mask in masks)
+
+        text_embedding = []
+        pooled_embedding = None
+        add_time_ids = None
+        cur_text_embedding_len = 0
+        processed_masks = []
+        embedding_ranges = []
+
+        for prompt_idx, text_embedding_info in enumerate(text_conditionings):
+            mask = masks[prompt_idx]
+
+            if is_sdxl:
+                # We pick a single SDXLConditioningInfo's pooled_embeds and add_time_ids here, with a preference for
+                # prompts without a mask (the last such prompt wins), because they are more likely to contain
+                # global prompt information. In an ideal case, there should be exactly one global prompt without a
+                # mask, but we don't enforce this.
+
+                # HACK(ryand): The fact that we have to choose a single pooled_embedding and add_time_ids here is a
+                # fundamental interface issue. The SDXL Compel nodes are not designed to be used in the way that we use
+                # them for regional prompting. Ideally, the DenoiseLatents invocation should accept a single
+                # pooled_embeds tensor and a list of standard text embeds with region masks. This change would be a
+                # pretty major breaking change to a popular node, so for now we use this hack.
+                if pooled_embedding is None or mask is None:
+                    pooled_embedding = text_embedding_info.pooled_embeds
+                if add_time_ids is None or mask is None:
+                    add_time_ids = text_embedding_info.add_time_ids
+
+            text_embedding.append(text_embedding_info.embeds)
+            if not all_masks_are_none:
+                embedding_ranges.append(
+                    Range(
+                        start=cur_text_embedding_len, end=cur_text_embedding_len + text_embedding_info.embeds.shape[1]
+                    )
+                )
+                processed_masks.append(
+                    self._preprocess_regional_prompt_mask(mask, latent_height, latent_width, dtype=dtype)
+                )
+
+            cur_text_embedding_len += text_embedding_info.embeds.shape[1]
+
+        text_embedding = torch.cat(text_embedding, dim=1)
+        assert len(text_embedding.shape) == 3  # batch_size, seq_len, token_len
+
+        regions = None
+        if not all_masks_are_none:
+            regions = TextConditioningRegions(
+                masks=torch.cat(processed_masks, dim=1),
+                ranges=embedding_ranges,
+            )
+
+        if is_sdxl:
+            return SDXLConditioningInfo(
+                embeds=text_embedding, pooled_embeds=pooled_embedding, add_time_ids=add_time_ids
+            ), regions
+        return BasicConditioningInfo(embeds=text_embedding), regions
+
    def get_conditioning_data(
        self,
        context: InvocationContext,
-        scheduler: Scheduler,
        unet: UNet2DConditionModel,
-        seed: int,
-    ) -> ConditioningData:
-        positive_cond_data = context.conditioning.load(self.positive_conditioning.conditioning_name)
-        c = positive_cond_data.conditionings[0].to(device=unet.device, dtype=unet.dtype)
+        latent_height: int,
+        latent_width: int,
+    ) -> TextConditioningData:
+        # Normalize self.positive_conditioning and self.negative_conditioning to lists.
+        cond_list = self.positive_conditioning
+        if not isinstance(cond_list, list):
+            cond_list = [cond_list]
+        uncond_list = self.negative_conditioning
+        if not isinstance(uncond_list, list):
+            uncond_list = [uncond_list]

-        negative_cond_data = context.conditioning.load(self.negative_conditioning.conditioning_name)
-        uc = negative_cond_data.conditionings[0].to(device=unet.device, dtype=unet.dtype)
-
-        conditioning_data = ConditioningData(
-            unconditioned_embeddings=uc,
-            text_embeddings=c,
-            guidance_scale=self.cfg_scale,
-            guidance_rescale_multiplier=self.cfg_rescale_multiplier,
+        cond_text_embeddings, cond_text_embedding_masks = self._get_text_embeddings_and_masks(
+            cond_list, context, unet.device, unet.dtype
+        )
+        uncond_text_embeddings, uncond_text_embedding_masks = self._get_text_embeddings_and_masks(
+            uncond_list, context, unet.device, unet.dtype
        )

-        conditioning_data = conditioning_data.add_scheduler_args_if_applicable(  # FIXME
-            scheduler,
-            # for ddim scheduler
-            eta=0.0,  # ddim_eta
-            # for ancestral and sde schedulers
-            # flip all bits to have noise different from initial
-            generator=torch.Generator(device=unet.device).manual_seed(seed ^ 0xFFFFFFFF),
+        cond_text_embedding, cond_regions = self._concat_regional_text_embeddings(
+            text_conditionings=cond_text_embeddings,
+            masks=cond_text_embedding_masks,
+            latent_height=latent_height,
+            latent_width=latent_width,
+            dtype=unet.dtype,
+        )
+        uncond_text_embedding, uncond_regions = self._concat_regional_text_embeddings(
+            text_conditionings=uncond_text_embeddings,
+            masks=uncond_text_embedding_masks,
+            latent_height=latent_height,
+            latent_width=latent_width,
+            dtype=unet.dtype,
+        )
+
+        conditioning_data = TextConditioningData(
+            uncond_text=uncond_text_embedding,
+            cond_text=cond_text_embedding,
+            uncond_regions=uncond_regions,
+            cond_regions=cond_regions,
+            guidance_scale=self.cfg_scale,
+            guidance_rescale_multiplier=self.cfg_rescale_multiplier,
        )
        return conditioning_data

@@ -489,8 +629,10 @@ class DenoiseLatentsInvocation(BaseInvocation):
        self,
        context: InvocationContext,
        ip_adapter: Optional[Union[IPAdapterField, list[IPAdapterField]]],
-        conditioning_data: ConditioningData,
        exit_stack: ExitStack,
+        latent_height: int,
+        latent_width: int,
+        dtype: torch.dtype,
    ) -> Optional[list[IPAdapterData]]:
        """If IP-Adapter is enabled, then this function loads the requisite models, and adds the image prompt embeddings
        to the `conditioning_data` (in-place).
@@ -506,7 +648,6 @@ class DenoiseLatentsInvocation(BaseInvocation):
            return None

        ip_adapter_data_list = []
-        conditioning_data.ip_adapter_conditioning = []
        for single_ip_adapter in ip_adapter:
            ip_adapter_model: Union[IPAdapter, IPAdapterPlus] = exit_stack.enter_context(
                context.models.load(single_ip_adapter.ip_adapter_model)
@@ -529,16 +670,20 @@ class DenoiseLatentsInvocation(BaseInvocation):
                    single_ipa_images, image_encoder_model
                )

-                conditioning_data.ip_adapter_conditioning.append(
-                    IPAdapterConditioningInfo(image_prompt_embeds, uncond_image_prompt_embeds)
-                )
+            mask = single_ip_adapter.mask
+            if mask is not None:
+                mask = context.tensors.load(mask.tensor_name)
+            mask = self._preprocess_regional_prompt_mask(mask, latent_height, latent_width, dtype=dtype)

            ip_adapter_data_list.append(
                IPAdapterData(
                    ip_adapter_model=ip_adapter_model,
                    weight=single_ip_adapter.weight,
+                    target_blocks=single_ip_adapter.target_blocks,
                    begin_step_percent=single_ip_adapter.begin_step_percent,
                    end_step_percent=single_ip_adapter.end_step_percent,
+                    ip_adapter_conditioning=IPAdapterConditioningInfo(image_prompt_embeds, uncond_image_prompt_embeds),
+                    mask=mask,
                )
            )

@@ -628,6 +773,7 @@ class DenoiseLatentsInvocation(BaseInvocation):
        steps: int,
        denoising_start: float,
        denoising_end: float,
+        seed: int,
    ) -> Tuple[int, List[int], int]:
        assert isinstance(scheduler, ConfigMixin)
        if scheduler.config.get("cpu_only", False):
@@ -656,7 +802,15 @@ class DenoiseLatentsInvocation(BaseInvocation):
        timesteps = timesteps[t_start_idx : t_start_idx + t_end_idx]
        num_inference_steps = len(timesteps) // scheduler.order

-        return num_inference_steps, timesteps, init_timestep
+        scheduler_step_kwargs = {}
+        scheduler_step_signature = inspect.signature(scheduler.step)
+        if "generator" in scheduler_step_signature.parameters:
+            # At some point, someone decided that schedulers that accept a generator should use the original seed with
+            # all bits flipped. I don't know the original rationale for this, but now we must keep it like this for
+            # reproducibility.
+            scheduler_step_kwargs = {"generator": torch.Generator(device=device).manual_seed(seed ^ 0xFFFFFFFF)}
+
+        return num_inference_steps, timesteps, init_timestep, scheduler_step_kwargs

    def prep_inpaint_mask(
        self, context: InvocationContext, latents: torch.Tensor
@@ -731,7 +885,7 @@ class DenoiseLatentsInvocation(BaseInvocation):
                set_seamless(unet_info.model, self.unet.seamless_axes),  # FIXME
                unet_info as unet,
                # Apply the LoRA after unet has been moved to its target device for faster patching.
-                LoraModelPatcher.apply_lora_unet(unet, _lora_loader()),
+                ModelPatcher.apply_lora_unet(unet, _lora_loader()),
            ):
                assert isinstance(unet, UNet2DConditionModel)
                latents = latents.to(device=unet.device, dtype=unet.dtype)
@@ -750,7 +904,11 @@ class DenoiseLatentsInvocation(BaseInvocation):
                )

                pipeline = self.create_pipeline(unet, scheduler)
-                conditioning_data = self.get_conditioning_data(context, scheduler, unet, seed)
+
+                _, _, latent_height, latent_width = latents.shape
+                conditioning_data = self.get_conditioning_data(
+                    context=context, unet=unet, latent_height=latent_height, latent_width=latent_width
+                )

                controlnet_data = self.prep_control_data(
                    context=context,
@@ -764,16 +922,19 @@ class DenoiseLatentsInvocation(BaseInvocation):
                ip_adapter_data = self.prep_ip_adapter_data(
                    context=context,
                    ip_adapter=self.ip_adapter,
-                    conditioning_data=conditioning_data,
                    exit_stack=exit_stack,
+                    latent_height=latent_height,
+                    latent_width=latent_width,
+                    dtype=unet.dtype,
                )

-                num_inference_steps, timesteps, init_timestep = self.init_scheduler(
+                num_inference_steps, timesteps, init_timestep, scheduler_step_kwargs = self.init_scheduler(
                    scheduler,
                    device=unet.device,
                    steps=self.steps,
                    denoising_start=self.denoising_start,
                    denoising_end=self.denoising_end,
+                    seed=seed,
                )

                result_latents = pipeline.latents_from_embeddings(
@@ -786,6 +947,7 @@ class DenoiseLatentsInvocation(BaseInvocation):
                    masked_latents=masked_latents,
                    gradient_mask=gradient_mask,
                    num_inference_steps=num_inference_steps,
+                    scheduler_step_kwargs=scheduler_step_kwargs,
                    conditioning_data=conditioning_data,
                    control_data=controlnet_data,
                    ip_adapter_data=ip_adapter_data,
@@ -795,12 +957,10 @@ class DenoiseLatentsInvocation(BaseInvocation):

            # https://discuss.huggingface.co/t/memory-usage-by-later-pipeline-stages/23699
            result_latents = result_latents.to("cpu")
-            torch.cuda.empty_cache()
-            if choose_torch_device() == torch.device("mps"):
-                mps.empty_cache()
+            TorchDevice.empty_cache()

            name = context.tensors.save(tensor=result_latents)
-        return LatentsOutput.build(latents_name=name, latents=result_latents, seed=seed)
+        return LatentsOutput.build(latents_name=name, latents=result_latents, seed=None)


 @invocation(
@@ -864,9 +1024,7 @@ class LatentsToImageInvocation(BaseInvocation, WithMetadata, WithBoard):
                vae.disable_tiling()

            # clear memory as vae decode can request a lot
-            torch.cuda.empty_cache()
-            if choose_torch_device() == torch.device("mps"):
-                mps.empty_cache()
+            TorchDevice.empty_cache()

            with torch.inference_mode():
                # copied from diffusers pipeline
@@ -878,9 +1036,7 @@ class LatentsToImageInvocation(BaseInvocation, WithMetadata, WithBoard):

                image = VaeImageProcessor.numpy_to_pil(np_image)[0]

-        torch.cuda.empty_cache()
-        if choose_torch_device() == torch.device("mps"):
-            mps.empty_cache()
+        TorchDevice.empty_cache()

        image_dto = context.images.save(image=image)

@@ -919,9 +1075,7 @@ class ResizeLatentsInvocation(BaseInvocation):

    def invoke(self, context: InvocationContext) -> LatentsOutput:
        latents = context.tensors.load(self.latents.latents_name)
-
-        # TODO:
-        device = choose_torch_device()
+        device = TorchDevice.choose_torch_device()

        resized_latents = torch.nn.functional.interpolate(
            latents.to(device),
@@ -932,9 +1086,8 @@ class ResizeLatentsInvocation(BaseInvocation):

        # https://discuss.huggingface.co/t/memory-usage-by-later-pipeline-stages/23699
        resized_latents = resized_latents.to("cpu")
-        torch.cuda.empty_cache()
-        if device == torch.device("mps"):
-            mps.empty_cache()
+
+        TorchDevice.empty_cache()

        name = context.tensors.save(tensor=resized_latents)
        return LatentsOutput.build(latents_name=name, latents=resized_latents, seed=self.latents.seed)
@@ -961,8 +1114,7 @@ class ScaleLatentsInvocation(BaseInvocation):
    def invoke(self, context: InvocationContext) -> LatentsOutput:
        latents = context.tensors.load(self.latents.latents_name)

-        # TODO:
-        device = choose_torch_device()
+        device = TorchDevice.choose_torch_device()

        # resizing
        resized_latents = torch.nn.functional.interpolate(
@@ -974,9 +1126,7 @@ class ScaleLatentsInvocation(BaseInvocation):

        # https://discuss.huggingface.co/t/memory-usage-by-later-pipeline-stages/23699
        resized_latents = resized_latents.to("cpu")
-        torch.cuda.empty_cache()
-        if device == torch.device("mps"):
-            mps.empty_cache()
+        TorchDevice.empty_cache()

        name = context.tensors.save(tensor=resized_latents)
        return LatentsOutput.build(latents_name=name, latents=resized_latents, seed=self.latents.seed)
@@ -1108,8 +1258,7 @@ class BlendLatentsInvocation(BaseInvocation):
        if latents_a.shape != latents_b.shape:
            raise Exception("Latents to blend must be the same size.")

-        # TODO:
-        device = choose_torch_device()
+        device = TorchDevice.choose_torch_device()

        def slerp(
            t: Union[float, npt.NDArray[Any]],  # FIXME: maybe use np.float32 here?
@@ -1162,9 +1311,8 @@ class BlendLatentsInvocation(BaseInvocation):

        # https://discuss.huggingface.co/t/memory-usage-by-later-pipeline-stages/23699
        blended_latents = blended_latents.to("cpu")
-        torch.cuda.empty_cache()
-        if device == torch.device("mps"):
-            mps.empty_cache()
+
+        TorchDevice.empty_cache()

        name = context.tensors.save(tensor=blended_latents)
        return LatentsOutput.build(latents_name=name, latents=blended_latents)
@@ -1255,7 +1403,7 @@ class IdealSizeInvocation(BaseInvocation):
        return tuple((x - x % multiple_of) for x in args)

    def invoke(self, context: InvocationContext) -> IdealSizeOutput:
-        unet_config = context.models.get_config(**self.unet.unet.model_dump())
+        unet_config = context.models.get_config(self.unet.unet.key)
        aspect = self.width / self.height
        dimension: float = 512
        if unet_config.base == BaseModelType.StableDiffusion2:
--- a/invokeai/app/invocations/mask.py
+++ b/invokeai/app/invocations/mask.py
@@ -0,0 +1,120 @@
+import numpy as np
+import torch
+
+from invokeai.app.invocations.baseinvocation import BaseInvocation, Classification, InvocationContext, invocation
+from invokeai.app.invocations.fields import ImageField, InputField, TensorField, WithMetadata
+from invokeai.app.invocations.primitives import MaskOutput
+
+
+@invocation(
+    "rectangle_mask",
+    title="Create Rectangle Mask",
+    tags=["conditioning"],
+    category="conditioning",
+    version="1.0.1",
+)
+class RectangleMaskInvocation(BaseInvocation, WithMetadata):
+    """Create a rectangular mask."""
+
+    width: int = InputField(description="The width of the entire mask.")
+    height: int = InputField(description="The height of the entire mask.")
+    x_left: int = InputField(description="The left x-coordinate of the rectangular masked region (inclusive).")
+    y_top: int = InputField(description="The top y-coordinate of the rectangular masked region (inclusive).")
+    rectangle_width: int = InputField(description="The width of the rectangular masked region.")
+    rectangle_height: int = InputField(description="The height of the rectangular masked region.")
+
+    def invoke(self, context: InvocationContext) -> MaskOutput:
+        mask = torch.zeros((1, self.height, self.width), dtype=torch.bool)
+        mask[:, self.y_top : self.y_top + self.rectangle_height, self.x_left : self.x_left + self.rectangle_width] = (
+            True
+        )
+
+        mask_tensor_name = context.tensors.save(mask)
+        return MaskOutput(
+            mask=TensorField(tensor_name=mask_tensor_name),
+            width=self.width,
+            height=self.height,
+        )
+
+
+@invocation(
+    "alpha_mask_to_tensor",
+    title="Alpha Mask to Tensor",
+    tags=["conditioning"],
+    category="conditioning",
+    version="1.0.0",
+    classification=Classification.Beta,
+)
+class AlphaMaskToTensorInvocation(BaseInvocation):
+    """Convert a mask image to a tensor. Opaque regions are 1 and transparent regions are 0."""
+
+    image: ImageField = InputField(description="The mask image to convert.")
+    invert: bool = InputField(default=False, description="Whether to invert the mask.")
+
+    def invoke(self, context: InvocationContext) -> MaskOutput:
+        image = context.images.get_pil(self.image.image_name)
+        mask = torch.zeros((1, image.height, image.width), dtype=torch.bool)
+        if self.invert:
+            mask[0] = torch.tensor(np.array(image)[:, :, 3] == 0, dtype=torch.bool)
+        else:
+            mask[0] = torch.tensor(np.array(image)[:, :, 3] > 0, dtype=torch.bool)
+
+        return MaskOutput(
+            mask=TensorField(tensor_name=context.tensors.save(mask)),
+            height=mask.shape[1],
+            width=mask.shape[2],
+        )
+
+
+@invocation(
+    "invert_tensor_mask",
+    title="Invert Tensor Mask",
+    tags=["conditioning"],
+    category="conditioning",
+    version="1.0.0",
+    classification=Classification.Beta,
+)
+class InvertTensorMaskInvocation(BaseInvocation):
+    """Inverts a tensor mask."""
+
+    mask: TensorField = InputField(description="The tensor mask to invert.")
+
+    def invoke(self, context: InvocationContext) -> MaskOutput:
+        mask = context.tensors.load(self.mask.tensor_name)
+        inverted = ~mask
+
+        return MaskOutput(
+            mask=TensorField(tensor_name=context.tensors.save(inverted)),
+            height=inverted.shape[1],
+            width=inverted.shape[2],
+        )
+
+
+@invocation(
+    "image_mask_to_tensor",
+    title="Image Mask to Tensor",
+    tags=["conditioning"],
+    category="conditioning",
+    version="1.0.0",
+)
+class ImageMaskToTensorInvocation(BaseInvocation, WithMetadata):
+    """Convert a mask image to a tensor. Converts the image to grayscale and uses thresholding at the specified value."""
+
+    image: ImageField = InputField(description="The mask image to convert.")
+    cutoff: int = InputField(ge=0, le=255, description="Cutoff (<)", default=128)
+    invert: bool = InputField(default=False, description="Whether to invert the mask.")
+
+    def invoke(self, context: InvocationContext) -> MaskOutput:
+        image = context.images.get_pil(self.image.image_name, mode="L")
+
+        mask = torch.zeros((1, image.height, image.width), dtype=torch.bool)
+        if self.invert:
+            mask[0] = torch.tensor(np.array(image)[:, :] >= self.cutoff, dtype=torch.bool)
+        else:
+            mask[0] = torch.tensor(np.array(image)[:, :] < self.cutoff, dtype=torch.bool)
+
+        return MaskOutput(
+            mask=TensorField(tensor_name=context.tensors.save(mask)),
+            height=mask.shape[1],
+            width=mask.shape[2],
+        )
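Reviewer note on `RectangleMaskInvocation`: the masked region is the half-open range `[y_top, y_top + rectangle_height)` by `[x_left, x_left + rectangle_width)`, and slicing silently clips a rectangle that runs past the mask edge. A minimal sketch of that behavior, using nested Python lists as a stand-in for the `torch.bool` tensor (the function name `rectangle_mask` is hypothetical and not part of this PR):

```python
def rectangle_mask(
    width: int,
    height: int,
    x_left: int,
    y_top: int,
    rectangle_width: int,
    rectangle_height: int,
) -> list[list[bool]]:
    """Build a height x width boolean mask with a filled rectangle.

    Plain-Python stand-in for the tensor slicing in the diff; assumes
    non-negative coordinates (negative indices would wrap around).
    """
    mask = [[False] * width for _ in range(height)]
    # Slicing clips at the mask bounds, like tensor slicing does.
    for row in mask[y_top : y_top + rectangle_height]:
        span = len(row[x_left : x_left + rectangle_width])
        row[x_left : x_left + rectangle_width] = [True] * span
    return mask


mask = rectangle_mask(width=6, height=4, x_left=1, y_top=1, rectangle_width=3, rectangle_height=2)
for row in mask:
    print("".join("#" if value else "." for value in row))
# ......
# .###..
# .###..
# ......
```

Because the upper bounds are exclusive, column `x_left + rectangle_width` and row `y_top + rectangle_height` stay unmasked.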
--- a/invokeai/app/invocations/metadata.py
+++ b/invokeai/app/invocations/metadata.py
@@ -36,6 +36,7 @@ class IPAdapterMetadataField(BaseModel):
    image: ImageField = Field(description="The IP-Adapter image prompt.")
    ip_adapter_model: ModelIdentifierField = Field(description="The IP-Adapter model.")
    clip_vision_model: Literal["ViT-H", "ViT-G"] = Field(description="The CLIP Vision model")
+    method: Literal["full", "style", "composition"] = Field(description="Method to apply IP Weights with")
    weight: Union[float, list[float]] = Field(description="The weight given to the IP-Adapter")
    begin_step_percent: float = Field(description="When the IP-Adapter is first applied (% of total steps)")
    end_step_percent: float = Field(description="When the IP-Adapter is last applied (% of total steps)")
--- a/invokeai/app/invocations/noise.py
+++ b/invokeai/app/invocations/noise.py
@@ -9,7 +9,7 @@ from invokeai.app.invocations.fields import FieldDescriptions, InputField, Laten
 from invokeai.app.services.shared.invocation_context import InvocationContext
 from invokeai.app.util.misc import SEED_MAX

-from ...backend.util.devices import choose_torch_device, torch_dtype
+from ...backend.util.devices import TorchDevice
 from .baseinvocation import (
    BaseInvocation,
    BaseInvocationOutput,
@@ -46,7 +46,7 @@ def get_noise(
            height // downsampling_factor,
            width // downsampling_factor,
        ],
-        dtype=torch_dtype(device),
+        dtype=TorchDevice.choose_torch_dtype(device=device),
        device=noise_device_type,
        generator=generator,
    ).to("cpu")
@@ -111,14 +111,14 @@ class NoiseInvocation(BaseInvocation):

    @field_validator("seed", mode="before")
    def modulo_seed(cls, v):
-        """Returns the seed modulo (SEED_MAX + 1) to ensure it is within the valid range."""
+        """Return the seed modulo (SEED_MAX + 1) to ensure it is within the valid range."""
        return v % (SEED_MAX + 1)

    def invoke(self, context: InvocationContext) -> NoiseOutput:
        noise = get_noise(
            width=self.width,
            height=self.height,
-            device=choose_torch_device(),
+            device=TorchDevice.choose_torch_device(),
            seed=self.seed,
            use_cpu=self.use_cpu,
        )
--- a/invokeai/app/invocations/primitives.py
+++ b/invokeai/app/invocations/primitives.py
@@ -15,6 +15,7 @@ from invokeai.app.invocations.fields import (
    InputField,
    LatentsField,
    OutputField,
+    TensorField,
    UIComponent,
 )
 from invokeai.app.services.images.images_common import ImageDTO
@@ -405,9 +406,19 @@ class ColorInvocation(BaseInvocation):

 # endregion

+
 # region Conditioning


+@invocation_output("mask_output")
+class MaskOutput(BaseInvocationOutput):
+    """A torch mask tensor."""
+
+    mask: TensorField = OutputField(description="The mask.")
+    width: int = OutputField(description="The width of the mask in pixels.")
+    height: int = OutputField(description="The height of the mask in pixels.")
+
+
 @invocation_output("conditioning_output")
 class ConditioningOutput(BaseInvocationOutput):
    """Base class for nodes that output a single conditioning tensor"""
--- a/invokeai/app/invocations/upscale.py
+++ b/invokeai/app/invocations/upscale.py
@@ -4,7 +4,6 @@ from typing import Literal

 import cv2
 import numpy as np
-import torch
 from PIL import Image
 from pydantic import ConfigDict

@@ -14,7 +13,7 @@ from invokeai.app.services.shared.invocation_context import InvocationContext
 from invokeai.app.util.download_with_progress import download_with_progress_bar
 from invokeai.backend.image_util.basicsr.rrdbnet_arch import RRDBNet
 from invokeai.backend.image_util.realesrgan.realesrgan import RealESRGAN
-from invokeai.backend.util.devices import choose_torch_device
+from invokeai.backend.util.devices import TorchDevice

 from .baseinvocation import BaseInvocation, invocation
 from .fields import InputField, WithBoard, WithMetadata
@@ -35,9 +34,6 @@ ESRGAN_MODEL_URLS: dict[str, str] = {
    "RealESRGAN_x2plus.pth": "https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.1/RealESRGAN_x2plus.pth",
 }

-if choose_torch_device() == torch.device("mps"):
-    from torch import mps
-

 @invocation("esrgan", title="Upscale (RealESRGAN)", tags=["esrgan", "upscale"], category="esrgan", version="1.3.2")
 class ESRGANInvocation(BaseInvocation, WithMetadata, WithBoard):
@@ -120,9 +116,7 @@ class ESRGANInvocation(BaseInvocation, WithMetadata, WithBoard):
        upscaled_image = upscaler.upscale(cv2_image)
        pil_image = Image.fromarray(cv2.cvtColor(upscaled_image, cv2.COLOR_BGR2RGB)).convert("RGBA")

-        torch.cuda.empty_cache()
-        if choose_torch_device() == torch.device("mps"):
-            mps.empty_cache()
+        TorchDevice.empty_cache()

        image_dto = context.images.save(image=pil_image)

--- a/invokeai/app/services/config/config_default.py
+++ b/invokeai/app/services/config/config_default.py
@@ -27,12 +27,12 @@ DEFAULT_RAM_CACHE = 10.0
 DEFAULT_VRAM_CACHE = 0.25
 DEFAULT_CONVERT_CACHE = 20.0
 DEVICE = Literal["auto", "cpu", "cuda", "cuda:1", "mps"]
-PRECISION = Literal["auto", "float16", "bfloat16", "float32", "autocast"]
+PRECISION = Literal["auto", "float16", "bfloat16", "float32"]
 ATTENTION_TYPE = Literal["auto", "normal", "xformers", "sliced", "torch-sdp"]
 ATTENTION_SLICE_SIZE = Literal["auto", "balanced", "max", 1, 2, 3, 4, 5, 6, 7, 8]
 LOG_FORMAT = Literal["plain", "color", "syslog", "legacy"]
 LOG_LEVEL = Literal["debug", "info", "warning", "error", "critical"]
-CONFIG_SCHEMA_VERSION = "4.0.0"
+CONFIG_SCHEMA_VERSION = "4.0.1"


 def get_default_ram_cache_size() -> float:
@@ -105,7 +105,7 @@ class InvokeAIAppConfig(BaseSettings):
        lazy_offload: Keep models in VRAM until their space is needed.
        log_memory_usage: If True, a memory snapshot will be captured before and after every model cache operation, and the result will be logged (at debug level). There is a time cost to capturing the memory snapshots, so it is recommended to only enable this feature if you are actively inspecting the model cache's behaviour.
        device: Preferred execution device. `auto` will choose the device depending on the hardware platform and the installed torch capabilities.<br>Valid values: `auto`, `cpu`, `cuda`, `cuda:1`, `mps`
-        precision: Floating point precision. `float16` will consume half the memory of `float32` but produce slightly lower-quality images. The `auto` setting will guess the proper precision based on your video card and operating system.<br>Valid values: `auto`, `float16`, `bfloat16`, `float32`, `autocast`
+        precision: Floating point precision. `float16` will consume half the memory of `float32` but produce slightly lower-quality images. The `auto` setting will guess the proper precision based on your video card and operating system.<br>Valid values: `auto`, `float16`, `bfloat16`, `float32`
        sequential_guidance: Whether to calculate guidance in serial instead of in parallel, lowering memory requirements.
        attention_type: Attention type.<br>Valid values: `auto`, `normal`, `xformers`, `sliced`, `torch-sdp`
        attention_slice_size: Slice size, valid when attention_type=="sliced".<br>Valid values: `auto`, `balanced`, `max`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`
@@ -370,6 +370,9 @@ def migrate_v3_config_dict(config_dict: dict[str, Any]) -> InvokeAIAppConfig:
            # `max_vram_cache_size` was renamed to `vram` some time in v3, but both names were used
            if k == "max_vram_cache_size" and "vram" not in category_dict:
                parsed_config_dict["vram"] = v
+            # autocast was removed in v4.0.1
+            if k == "precision" and v == "autocast":
+                parsed_config_dict["precision"] = "auto"
            if k == "conf_path":
                parsed_config_dict["legacy_models_yaml_path"] = v
            if k == "legacy_conf_dir":
@@ -392,6 +395,28 @@ def migrate_v3_config_dict(config_dict: dict[str, Any]) -> InvokeAIAppConfig:
    return config


+def migrate_v4_0_0_config_dict(config_dict: dict[str, Any]) -> InvokeAIAppConfig:
+    """Migrate v4.0.0 config dictionary to a current config object.
+
+    Args:
+        config_dict: A dictionary of settings from a v4.0.0 config file.
+
+    Returns:
+        An instance of `InvokeAIAppConfig` with the migrated settings.
+    """
+    parsed_config_dict: dict[str, Any] = {}
+    for k, v in config_dict.items():
+        # autocast was removed from precision in v4.0.1
+        if k == "precision" and v == "autocast":
+            parsed_config_dict["precision"] = "auto"
+        else:
+            parsed_config_dict[k] = v
+        if k == "schema_version":
+            parsed_config_dict[k] = CONFIG_SCHEMA_VERSION
+    config = DefaultInvokeAIAppConfig.model_validate(parsed_config_dict)
+    return config
+
+
 def load_and_migrate_config(config_path: Path) -> InvokeAIAppConfig:
    """Load and migrate a config file to the latest version.

@@ -418,17 +443,21 @@ def load_and_migrate_config(config_path: Path) -> InvokeAIAppConfig:
            raise RuntimeError(f"Failed to load and migrate v3 config file {config_path}: {e}") from e
        migrated_config.write_file(config_path)
        return migrated_config
-    else:
-        # Attempt to load as a v4 config file
-        try:
-            # Meta is not included in the model fields, so we need to validate it separately
-            config = InvokeAIAppConfig.model_validate(loaded_config_dict)
-            assert (
-                config.schema_version == CONFIG_SCHEMA_VERSION
-            ), f"Invalid schema version, expected {CONFIG_SCHEMA_VERSION}: {config.schema_version}"
-            return config
-        except Exception as e:
-            raise RuntimeError(f"Failed to load config file {config_path}: {e}") from e
+
+    if loaded_config_dict["schema_version"] == "4.0.0":
+        loaded_config_dict = migrate_v4_0_0_config_dict(loaded_config_dict)
+        loaded_config_dict.write_file(config_path)
+
+    # Attempt to load as a v4 config file
+    try:
+        # Meta is not included in the model fields, so we need to validate it separately
+        config = InvokeAIAppConfig.model_validate(loaded_config_dict)
+        assert (
+            config.schema_version == CONFIG_SCHEMA_VERSION
+        ), f"Invalid schema version, expected {CONFIG_SCHEMA_VERSION}: {config.schema_version}"
+        return config
+    except Exception as e:
+        raise RuntimeError(f"Failed to load config file {config_path}: {e}") from e


 @lru_cache(maxsize=1)
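Reviewer note on the 4.0.0 to 4.0.1 migration in `config_default.py`: the key mapping can be checked in isolation from the pydantic model. A minimal sketch, assuming a plain settings dict; the names `migrate_v4_0_0_settings` and `CURRENT_SCHEMA_VERSION` are hypothetical stand-ins for `migrate_v4_0_0_config_dict` and `CONFIG_SCHEMA_VERSION`, and the model validation step is left out:

```python
from typing import Any

# Stand-in for CONFIG_SCHEMA_VERSION in the diff.
CURRENT_SCHEMA_VERSION = "4.0.1"


def migrate_v4_0_0_settings(settings: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a v4.0.0 settings dict upgraded to the current schema."""
    migrated: dict[str, Any] = {}
    for key, value in settings.items():
        if key == "precision" and value == "autocast":
            # "autocast" is no longer a valid precision value; fall back to "auto".
            migrated[key] = "auto"
        elif key == "schema_version":
            # Stamp the migrated settings with the current schema version.
            migrated[key] = CURRENT_SCHEMA_VERSION
        else:
            migrated[key] = value
    return migrated


old = {"schema_version": "4.0.0", "precision": "autocast", "ram": 10.0}
new = migrate_v4_0_0_settings(old)
print(new)
```

The sketch copies into a new dict rather than mutating its input, so the loaded file contents remain available for error reporting if validation fails later.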
--- a/invokeai/app/services/model_install/model_install_default.py
+++ b/invokeai/app/services/model_install/model_install_default.py
@@ -3,7 +3,6 @@
 import locale
 import os
 import re
-import signal
 import threading
 import time
 from hashlib import sha256
@@ -13,6 +12,7 @@ from shutil import copyfile, copytree, move, rmtree
 from tempfile import mkdtemp
 from typing import Any, Dict, List, Optional, Union

+import torch
 import yaml
 from huggingface_hub import HfFolder
 from pydantic.networks import AnyHttpUrl
@@ -42,7 +42,8 @@ from invokeai.backend.model_manager.metadata.metadata_base import HuggingFaceMet
 from invokeai.backend.model_manager.probe import ModelProbe
 from invokeai.backend.model_manager.search import ModelSearch
 from invokeai.backend.util import InvokeAILogger
-from invokeai.backend.util.devices import choose_precision, choose_torch_device
+from invokeai.backend.util.catch_sigint import catch_sigint
+from invokeai.backend.util.devices import TorchDevice

 from .model_install_base import (
    MODEL_SOURCE_TO_TYPE_MAP,
@@ -111,17 +112,6 @@ class ModelInstallService(ModelInstallServiceBase):
    def start(self, invoker: Optional[Invoker] = None) -> None:
        """Start the installer thread."""

-        # Yes, this is weird. When the installer thread is running, the
-        # thread masks the ^C signal. When we receive a
-        # sigINT, we stop the thread, reset sigINT, and send a new
-        # sigINT to the parent process.
-        def sigint_handler(signum, frame):
-            self.stop()
-            signal.signal(signal.SIGINT, signal.SIG_DFL)
-            signal.raise_signal(signal.SIGINT)
-
-        signal.signal(signal.SIGINT, sigint_handler)
-
        with self._lock:
            if self._running:
                raise Exception("Attempt to start the installer service twice")
@@ -131,7 +121,8 @@ class ModelInstallService(ModelInstallServiceBase):
            # In normal use, we do not want to scan the models directory - it should never have orphaned models.
            # We should only do the scan when the flag is set (which should only be set when testing).
            if self.app_config.scan_models_on_startup:
-                self._register_orphaned_models()
+                with catch_sigint():
+                    self._register_orphaned_models()

            # Check all models' paths and confirm they exist. A model could be missing if it was installed on a volume
            # that isn't currently mounted. In this case, we don't want to delete the model from the database, but we do
@@ -634,11 +625,10 @@ class ModelInstallService(ModelInstallServiceBase):
            self._next_job_id += 1
        return id

-    @staticmethod
-    def _guess_variant() -> Optional[ModelRepoVariant]:
+    def _guess_variant(self) -> Optional[ModelRepoVariant]:
        """Guess the best HuggingFace variant type to download."""
-        precision = choose_precision(choose_torch_device())
-        return ModelRepoVariant.FP16 if precision == "float16" else None
+        precision = TorchDevice.choose_torch_dtype()
+        return ModelRepoVariant.FP16 if precision == torch.float16 else None

    def _import_local_model(self, source: LocalModelSource, config: Optional[Dict[str, Any]]) -> ModelInstallJob:
        return ModelInstallJob(
@@ -754,6 +744,8 @@ class ModelInstallService(ModelInstallServiceBase):
            self._download_cache[download_job.source] = install_job  # matches a download job to an install job
            install_job.download_parts.add(download_job)

+        # only start the jobs once install_job.download_parts is fully populated
+        for download_job in install_job.download_parts:
            self._download_queue.submit_download_job(
                download_job,
                on_start=self._download_started_callback,
@@ -762,6 +754,7 @@ class ModelInstallService(ModelInstallServiceBase):
                on_error=self._download_error_callback,
                on_cancelled=self._download_cancelled_callback,
            )
+
        return install_job

    def _stat_size(self, path: Path) -> int:
--- a/invokeai/app/services/model_load/model_load_base.py
+++ b/invokeai/app/services/model_load/model_load_base.py
@@ -5,8 +5,7 @@ from abc import ABC, abstractmethod
 from typing import Optional

 from invokeai.app.services.shared.invocation_context import InvocationContextData
-from invokeai.backend.model_manager import AnyModelConfig, SubModelType
-from invokeai.backend.model_manager.any_model_type import AnyModel
+from invokeai.backend.model_manager import AnyModel, AnyModelConfig, SubModelType
 from invokeai.backend.model_manager.load import LoadedModel
 from invokeai.backend.model_manager.load.convert_cache import ModelConvertCacheBase
 from invokeai.backend.model_manager.load.model_cache.model_cache_base import ModelCacheBase
--- a/invokeai/app/services/model_load/model_load_default.py
+++ b/invokeai/app/services/model_load/model_load_default.py
@@ -6,8 +6,7 @@ from typing import Optional, Type
 from invokeai.app.services.config import InvokeAIAppConfig
 from invokeai.app.services.invoker import Invoker
 from invokeai.app.services.shared.invocation_context import InvocationContextData
-from invokeai.backend.model_manager import AnyModelConfig, SubModelType
-from invokeai.backend.model_manager.any_model_type import AnyModel
+from invokeai.backend.model_manager import AnyModel, AnyModelConfig, SubModelType
 from invokeai.backend.model_manager.load import (
    LoadedModel,
    ModelLoaderRegistry,
--- a/invokeai/app/services/model_manager/__init__.py
+++ b/invokeai/app/services/model_manager/__init__.py
@@ -1,6 +1,6 @@
 """Initialization file for model manager service."""

-from invokeai.backend.model_manager import AnyModelConfig, BaseModelType, ModelType, SubModelType
+from invokeai.backend.model_manager import AnyModel, AnyModelConfig, BaseModelType, ModelType, SubModelType
 from invokeai.backend.model_manager.load import LoadedModel

 from .model_manager_default import ModelManagerService, ModelManagerServiceBase
@@ -8,6 +8,7 @@ from .model_manager_default import ModelManagerService, ModelManagerServiceBase
 __all__ = [
    "ModelManagerServiceBase",
    "ModelManagerService",
+    "AnyModel",
    "AnyModelConfig",
    "BaseModelType",
    "ModelType",
--- a/invokeai/app/services/model_manager/model_manager_default.py
+++ b/invokeai/app/services/model_manager/model_manager_default.py
@@ -1,12 +1,14 @@
 # Copyright (c) 2023 Lincoln D. Stein and the InvokeAI Team
 """Implementation of ModelManagerServiceBase."""

+from typing import Optional
+
 import torch
 from typing_extensions import Self

 from invokeai.app.services.invoker import Invoker
 from invokeai.backend.model_manager.load import ModelCache, ModelConvertCache, ModelLoaderRegistry
-from invokeai.backend.util.devices import choose_torch_device
+from invokeai.backend.util.devices import TorchDevice
 from invokeai.backend.util.logging import InvokeAILogger

 from ..config import InvokeAIAppConfig
@@ -67,7 +69,7 @@ class ModelManagerService(ModelManagerServiceBase):
        model_record_service: ModelRecordServiceBase,
        download_queue: DownloadQueueServiceBase,
        events: EventServiceBase,
-        execution_device: torch.device = choose_torch_device(),
+        execution_device: Optional[torch.device] = None,
    ) -> Self:
        """
        Construct the model manager service instance.
@@ -80,8 +82,9 @@ class ModelManagerService(ModelManagerServiceBase):
        ram_cache = ModelCache(
            max_cache_size=app_config.ram,
            max_vram_cache_size=app_config.vram,
+            lazy_offloading=app_config.lazy_offload,
            logger=logger,
-            execution_device=execution_device,
+            execution_device=execution_device or TorchDevice.choose_torch_device(),
        )
        convert_cache = ModelConvertCache(cache_path=app_config.convert_cache_path, max_size=app_config.convert_cache)
        loader = ModelLoadService(
--- a/invokeai/app/services/session_processor/session_processor_default.py
+++ b/invokeai/app/services/session_processor/session_processor_default.py
@@ -86,6 +86,12 @@ class DefaultSessionProcessor(SessionProcessorBase):
            self._poll_now()
        elif event_name == "batch_enqueued":
            self._poll_now()
+        elif event_name == "queue_item_status_changed" and event[1]["data"]["queue_item"]["status"] in [
+            "completed",
+            "failed",
+            "canceled",
+        ]:
+            self._poll_now()

    def resume(self) -> SessionProcessorStatus:
        if not self._resume_event.is_set():
--- a/invokeai/app/services/shared/invocation_context.py
+++ b/invokeai/app/services/shared/invocation_context.py
@@ -245,6 +245,18 @@ class ImagesInterface(InvocationContextInterface):
        """
        return self._services.images.get_dto(image_name)

+    def get_path(self, image_name: str, thumbnail: bool = False) -> Path:
+        """Gets the internal path to an image or thumbnail.
+
+        Args:
+            image_name: The name of the image to get the path of.
+            thumbnail: Get the path of the thumbnail instead of the full image
+
+        Returns:
+            The local path of the image or thumbnail.
+        """
+        return self._services.images.get_path(image_name, thumbnail)
+

 class TensorsInterface(InvocationContextInterface):
    def save(self, tensor: Tensor) -> str:
--- a/invokeai/backend/image_util/__init__.py
+++ b/invokeai/backend/image_util/__init__.py
@@ -2,7 +2,7 @@
 Initialization file for invokeai.backend.image_util methods.
 """

-from .patchmatch import PatchMatch  # noqa: F401
+from .infill_methods.patchmatch import PatchMatch  # noqa: F401
 from .pngwriter import PngWriter, PromptFormatter, retrieve_metadata, write_metadata  # noqa: F401
 from .seamless import configure_model_padding  # noqa: F401
 from .util import InitImageResizer, make_grid  # noqa: F401
--- a/invokeai/backend/image_util/depth_anything/__init__.py
+++ b/invokeai/backend/image_util/depth_anything/__init__.py
@@ -13,7 +13,7 @@ from invokeai.app.services.config.config_default import get_config
 from invokeai.app.util.download_with_progress import download_with_progress_bar
 from invokeai.backend.image_util.depth_anything.model.dpt import DPT_DINOv2
 from invokeai.backend.image_util.depth_anything.utilities.util import NormalizeImage, PrepareForNet, Resize
-from invokeai.backend.util.devices import choose_torch_device
+from invokeai.backend.util.devices import TorchDevice
 from invokeai.backend.util.logging import InvokeAILogger

 config = get_config()
@@ -56,7 +56,7 @@ class DepthAnythingDetector:
    def __init__(self) -> None:
        self.model = None
        self.model_size: Union[Literal["large", "base", "small"], None] = None
-        self.device = choose_torch_device()
+        self.device = TorchDevice.choose_torch_device()

    def load_model(self, model_size: Literal["large", "base", "small"] = "small"):
        DEPTH_ANYTHING_MODEL_PATH = config.models_path / DEPTH_ANYTHING_MODELS[model_size]["local"]
@@ -81,7 +81,7 @@ class DepthAnythingDetector:
            self.model.load_state_dict(torch.load(DEPTH_ANYTHING_MODEL_PATH.as_posix(), map_location="cpu"))
            self.model.eval()

-        self.model.to(choose_torch_device())
+        self.model.to(self.device)
        return self.model

    def __call__(self, image: Image.Image, resolution: int = 512) -> Image.Image:
@@ -94,7 +94,7 @@ class DepthAnythingDetector:

        image_height, image_width = np_image.shape[:2]
        np_image = transform({"image": np_image})["image"]
-        tensor_image = torch.from_numpy(np_image).unsqueeze(0).to(choose_torch_device())
+        tensor_image = torch.from_numpy(np_image).unsqueeze(0).to(self.device)

        with torch.no_grad():
            depth = self.model(tensor_image)
--- a/invokeai/backend/image_util/dw_openpose/wholebody.py
+++ b/invokeai/backend/image_util/dw_openpose/wholebody.py
@@ -7,7 +7,7 @@ import onnxruntime as ort

 from invokeai.app.services.config.config_default import get_config
 from invokeai.app.util.download_with_progress import download_with_progress_bar
-from invokeai.backend.util.devices import choose_torch_device
+from invokeai.backend.util.devices import TorchDevice

 from .onnxdet import inference_detector
 from .onnxpose import inference_pose
@@ -28,9 +28,9 @@ config = get_config()

 class Wholebody:
    def __init__(self):
-        device = choose_torch_device()
+        device = TorchDevice.choose_torch_device()

-        providers = ["CUDAExecutionProvider"] if device == "cuda" else ["CPUExecutionProvider"]
+        providers = ["CUDAExecutionProvider"] if device.type == "cuda" else ["CPUExecutionProvider"]

        DET_MODEL_PATH = config.models_path / DWPOSE_MODELS["yolox_l.onnx"]["local"]
        download_with_progress_bar("yolox_l.onnx", DWPOSE_MODELS["yolox_l.onnx"]["url"], DET_MODEL_PATH)
--- a/invokeai/backend/image_util/infill_methods/cv2_inpaint.py
+++ b/invokeai/backend/image_util/infill_methods/cv2_inpaint.py
--- a/invokeai/backend/image_util/infill_methods/lama.py
+++ b/invokeai/backend/image_util/infill_methods/lama.py
@@ -7,7 +7,8 @@ from PIL import Image

 import invokeai.backend.util.logging as logger
 from invokeai.app.services.config.config_default import get_config
-from invokeai.backend.util.devices import choose_torch_device
+from invokeai.app.util.download_with_progress import download_with_progress_bar
+from invokeai.backend.util.devices import TorchDevice


 def norm_img(np_img):
@@ -28,8 +29,16 @@ def load_jit_model(url_or_path, device):

 class LaMA:
    def __call__(self, input_image: Image.Image, *args: Any, **kwds: Any) -> Any:
-        device = choose_torch_device()
+        device = TorchDevice.choose_torch_device()
        model_location = get_config().models_path / "core/misc/lama/lama.pt"
+
+        if not model_location.exists():
+            download_with_progress_bar(
+                name="LaMa Inpainting Model",
+                url="https://github.com/Sanster/models/releases/download/add_big_lama/big-lama.pt",
+                dest_path=model_location,
+            )
+
        model = load_jit_model(model_location, device)

        image = np.asarray(input_image.convert("RGB"))
--- a/invokeai/backend/image_util/infill_methods/mosaic.py
+++ b/invokeai/backend/image_util/infill_methods/mosaic.py
@@ -0,0 +1,60 @@
+from typing import Tuple
+
+import numpy as np
+from PIL import Image
+
+
+def infill_mosaic(
+    image: Image.Image,
+    tile_shape: Tuple[int, int] = (64, 64),
+    min_color: Tuple[int, int, int, int] = (0, 0, 0, 0),
+    max_color: Tuple[int, int, int, int] = (255, 255, 255, 0),
+) -> Image.Image:
+    """
+    image: Image.Image - An RGBA PIL image; its alpha channel is used as the infill mask
+    tile_shape: Tuple[int, int] - Tile width and tile height
+    min_color: Tuple[int, int, int, int] - RGBA lower bound for tile colors (0-255); alpha is ignored
+    max_color: Tuple[int, int, int, int] - RGBA upper bound for tile colors (0-255); alpha is ignored
+    """
+
+    np_image = np.array(image)  # Convert image to np array
+    alpha = np_image[:, :, 3]  # Get the mask from the alpha channel of the image
+    non_transparent_pixels = np_image[alpha != 0, :3]  # List of non-transparent pixels
+
+    # Create color tiles to paste in the empty areas of the image
+    tile_width, tile_height = tile_shape
+
+    # Clip the range of colors in the image to a particular spectrum only
+    r_min, g_min, b_min, _ = min_color
+    r_max, g_max, b_max, _ = max_color
+    non_transparent_pixels[:, 0] = np.clip(non_transparent_pixels[:, 0], r_min, r_max)
+    non_transparent_pixels[:, 1] = np.clip(non_transparent_pixels[:, 1], g_min, g_max)
+    non_transparent_pixels[:, 2] = np.clip(non_transparent_pixels[:, 2], b_min, b_max)
+
+    tiles = []
+    for _ in range(256):
+        color = non_transparent_pixels[np.random.randint(len(non_transparent_pixels))]
+        tile = np.zeros((tile_height, tile_width, 3), dtype=np.uint8)
+        tile[:, :] = color
+        tiles.append(tile)
+
+    # Fill the transparent area with tiles
+    filled_image = np.zeros((image.height, image.width, 3), dtype=np.uint8)
+
+    for x in range(image.width):
+        for y in range(image.height):
+            tile = tiles[np.random.randint(len(tiles))]
+            try:
+                filled_image[
+                    y - (y % tile_height) : y - (y % tile_height) + tile_height,
+                    x - (x % tile_width) : x - (x % tile_width) + tile_width,
+                ] = tile
+            except ValueError:
+                # Need to handle edge cases - literally
+                pass
+
+    filled_image = Image.fromarray(filled_image)  # Convert the filled tiles image to PIL
+    image = Image.composite(
+        image, filled_image, image.split()[-1]
+    )  # Composite the original image on top of the filled tiles
+    return image
--- a/invokeai/backend/image_util/infill_methods/patchmatch.py
+++ b/invokeai/backend/image_util/infill_methods/patchmatch.py
@@ -0,0 +1,67 @@
+"""
+This module defines a singleton object, "patchmatch" that
+wraps the actual patchmatch object. It respects the global
+"try_patchmatch" attribute, so that patchmatch loading can
+be suppressed or deferred
+"""
+
+import numpy as np
+from PIL import Image
+
+import invokeai.backend.util.logging as logger
+from invokeai.app.services.config.config_default import get_config
+
+
+class PatchMatch:
+    """
+    Thin class wrapper around the patchmatch function.
+    """
+
+    patch_match = None
+    tried_load: bool = False
+
+    def __init__(self):
+        super().__init__()
+
+    @classmethod
+    def _load_patch_match(cls):
+        if cls.tried_load:
+            return
+        if get_config().patchmatch:
+            from patchmatch import patch_match as pm
+
+            if pm.patchmatch_available:
+                logger.info("Patchmatch initialized")
+                cls.patch_match = pm
+            else:
+                logger.info("Patchmatch not loaded (nonfatal)")
+        else:
+            logger.info("Patchmatch loading disabled")
+        cls.tried_load = True
+
+    @classmethod
+    def patchmatch_available(cls) -> bool:
+        cls._load_patch_match()
+        if not cls.patch_match:
+            return False
+        return cls.patch_match.patchmatch_available
+
+    @classmethod
+    def inpaint(cls, image: Image.Image) -> Image.Image:
+        if not cls.patchmatch_available() or cls.patch_match is None:
+            return image
+
+        np_image = np.array(image)
+        mask = 255 - np_image[:, :, 3]
+        infilled = cls.patch_match.inpaint(np_image[:, :, :3], mask, patch_size=3)
+        return Image.fromarray(infilled, mode="RGB")
+
+
+def infill_patchmatch(image: Image.Image) -> Image.Image:
+    IS_PATCHMATCH_AVAILABLE = PatchMatch.patchmatch_available()
+
+    if not IS_PATCHMATCH_AVAILABLE:
+        logger.warning("PatchMatch is not available on this system")
+        return image
+
+    return PatchMatch.inpaint(image)
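
Review note: `PatchMatch` above follows a try-once, class-level loading pattern for an optional dependency. The following is a minimal sketch of that pattern, not the module's actual code. `LazyBackend`, `load_attempts`, and the deliberately missing module name are all hypothetical, and catching `ImportError` is an assumption added for the illustration (the diff above does not do this).

```python
from typing import Callable, Optional


class LazyBackend:
    """Hypothetical illustration of a try-once, class-level optional dependency."""

    backend: Optional[Callable[[str], str]] = None
    tried_load: bool = False
    load_attempts: int = 0  # only here to make the behaviour observable

    @classmethod
    def _load(cls) -> None:
        if cls.tried_load:
            return
        cls.load_attempts += 1
        try:
            # Stand-in for an optional import that may fail on some systems.
            import nonexistent_optional_module_xyz  # type: ignore

            cls.backend = nonexistent_optional_module_xyz.run
        except ImportError:
            cls.backend = None
        cls.tried_load = True

    @classmethod
    def available(cls) -> bool:
        cls._load()
        return cls.backend is not None

    @classmethod
    def run(cls, value: str) -> str:
        # Fall back to returning the input unchanged when the backend is missing.
        if not cls.available():
            return value
        return cls.backend(value)
```

Calling `available()` before touching the backend guarantees the load is attempted exactly once, no matter which public method is called first.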
--- a/invokeai/backend/image_util/infill_methods/test_images/source1.webp
+++ b/invokeai/backend/image_util/infill_methods/test_images/source1.webp
--- a/invokeai/backend/image_util/infill_methods/test_images/source10.webp
+++ b/invokeai/backend/image_util/infill_methods/test_images/source10.webp
--- a/invokeai/backend/image_util/infill_methods/test_images/source2.webp
+++ b/invokeai/backend/image_util/infill_methods/test_images/source2.webp
--- a/invokeai/backend/image_util/infill_methods/test_images/source3.webp
+++ b/invokeai/backend/image_util/infill_methods/test_images/source3.webp
--- a/invokeai/backend/image_util/infill_methods/test_images/source4.webp
+++ b/invokeai/backend/image_util/infill_methods/test_images/source4.webp
--- a/invokeai/backend/image_util/infill_methods/test_images/source5.webp
+++ b/invokeai/backend/image_util/infill_methods/test_images/source5.webp
--- a/invokeai/backend/image_util/infill_methods/test_images/source6.webp
+++ b/invokeai/backend/image_util/infill_methods/test_images/source6.webp
--- a/invokeai/backend/image_util/infill_methods/test_images/source7.webp
+++ b/invokeai/backend/image_util/infill_methods/test_images/source7.webp
--- a/invokeai/backend/image_util/infill_methods/test_images/source8.webp
+++ b/invokeai/backend/image_util/infill_methods/test_images/source8.webp
--- a/invokeai/backend/image_util/infill_methods/test_images/source9.webp
+++ b/invokeai/backend/image_util/infill_methods/test_images/source9.webp
--- a/invokeai/backend/image_util/infill_methods/tile.ipynb
+++ b/invokeai/backend/image_util/infill_methods/tile.ipynb
@@ -0,0 +1,95 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\"\"\"Smoke test for the tile infill\"\"\"\n",
+    "\n",
+    "from pathlib import Path\n",
+    "from typing import Optional\n",
+    "from PIL import Image\n",
+    "from invokeai.backend.image_util.infill_methods.tile import infill_tile\n",
+    "\n",
+    "images: list[tuple[str, Image.Image]] = []\n",
+    "\n",
+    "for i in sorted(Path(\"./test_images/\").glob(\"*.webp\")):\n",
+    "    images.append((i.name, Image.open(i)))\n",
+    "    images.append((i.name, Image.open(i).transpose(Image.FLIP_LEFT_RIGHT)))\n",
+    "    images.append((i.name, Image.open(i).transpose(Image.FLIP_TOP_BOTTOM)))\n",
+    "    images.append((i.name, Image.open(i).resize((512, 512))))\n",
+    "    images.append((i.name, Image.open(i).resize((1234, 461))))\n",
+    "\n",
+    "outputs: list[tuple[str, Image.Image, Image.Image, Optional[Image.Image]]] = []\n",
+    "\n",
+    "for name, image in images:\n",
+    "    try:\n",
+    "        output = infill_tile(image, seed=0, tile_size=32)\n",
+    "        outputs.append((name, image, output.infilled, output.tile_image))\n",
+    "    except ValueError as e:\n",
+    "        print(f\"Skipping image {name}: {e}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Display the images in jupyter notebook\n",
+    "import matplotlib.pyplot as plt\n",
+    "from PIL import ImageOps\n",
+    "\n",
+    "fig, axes = plt.subplots(len(outputs), 3, figsize=(10, 3 * len(outputs)))\n",
+    "plt.subplots_adjust(hspace=0)\n",
+    "\n",
+    "for i, (name, original, infilled, tile_image) in enumerate(outputs):\n",
+    "    # Add a border to each image, helps to see the edges\n",
+    "    size = original.size\n",
+    "    original = ImageOps.expand(original, border=5, fill=\"red\")\n",
+    "    filled = ImageOps.expand(infilled, border=5, fill=\"red\")\n",
+    "    if tile_image:\n",
+    "        tile_image = ImageOps.expand(tile_image, border=5, fill=\"red\")\n",
+    "\n",
+    "    axes[i, 0].imshow(original)\n",
+    "    axes[i, 0].axis(\"off\")\n",
+    "    axes[i, 0].set_title(f\"Original ({name} - {size})\")\n",
+    "\n",
+    "    if tile_image:\n",
+    "        axes[i, 1].imshow(tile_image)\n",
+    "        axes[i, 1].axis(\"off\")\n",
+    "        axes[i, 1].set_title(\"Tile Image\")\n",
+    "    else:\n",
+    "        axes[i, 1].axis(\"off\")\n",
+    "        axes[i, 1].set_title(\"NO TILES GENERATED (NO TRANSPARENCY)\")\n",
+    "\n",
+    "    axes[i, 2].imshow(filled)\n",
+    "    axes[i, 2].axis(\"off\")\n",
+    "    axes[i, 2].set_title(\"Filled\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".invokeai",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
--- a/invokeai/backend/image_util/infill_methods/tile.py
+++ b/invokeai/backend/image_util/infill_methods/tile.py
@@ -0,0 +1,122 @@
+from dataclasses import dataclass
+from typing import Optional
+
+import numpy as np
+from PIL import Image
+
+
+def create_tile_pool(img_array: np.ndarray, tile_size: tuple[int, int]) -> list[np.ndarray]:
+    """
+    Create a pool of tiles from non-transparent areas of the image by systematically walking through the image.
+
+    Args:
+        img_array: numpy array of the image.
+        tile_size: tuple (tile_width, tile_height) specifying the size of each tile.
+
+    Returns:
+        A list of numpy arrays, each representing a tile.
+    """
+    tiles: list[np.ndarray] = []
+    rows, cols = img_array.shape[:2]
+    tile_width, tile_height = tile_size
+
+    for y in range(0, rows - tile_height + 1, tile_height):
+        for x in range(0, cols - tile_width + 1, tile_width):
+            tile = img_array[y : y + tile_height, x : x + tile_width]
+            # Check if the image has an alpha channel and the tile is completely opaque
+            if img_array.shape[2] == 4 and np.all(tile[:, :, 3] == 255):
+                tiles.append(tile)
+            elif img_array.shape[2] == 3:  # If no alpha channel, append the tile
+                tiles.append(tile)
+
+    if not tiles:
+        raise ValueError(
+            "Not enough opaque pixels to generate any tiles. Use a smaller tile size or a different image."
+        )
+
+    return tiles
+
+
+def create_filled_image(
+    img_array: np.ndarray, tile_pool: list[np.ndarray], tile_size: tuple[int, int], seed: int
+) -> np.ndarray:
+    """
+    Create an image of the same dimensions as the original, filled entirely with tiles from the pool.
+
+    Args:
+        img_array: numpy array of the original image.
+        tile_pool: A list of numpy arrays, each representing a tile.
+        tile_size: tuple (tile_width, tile_height) specifying the size of each tile.
+
+    Returns:
+        A numpy array representing the filled image.
+    """
+
+    rows, cols, _ = img_array.shape
+    tile_width, tile_height = tile_size
+
+    # Prep an empty RGB image
+    filled_img_array = np.zeros((rows, cols, 3), dtype=img_array.dtype)
+
+    # Make the random tile selection reproducible
+    rng = np.random.default_rng(seed)
+
+    for y in range(0, rows, tile_height):
+        for x in range(0, cols, tile_width):
+            # Pick a random tile from the pool
+            tile = tile_pool[rng.integers(len(tile_pool))]
+
+            # Calculate the space available (may be less than tile size near the edges)
+            space_y = min(tile_height, rows - y)
+            space_x = min(tile_width, cols - x)
+
+            # Crop the tile if necessary to fit into the available space
+            cropped_tile = tile[:space_y, :space_x, :3]
+
+            # Fill the available space with the (possibly cropped) tile
+            filled_img_array[y : y + space_y, x : x + space_x, :3] = cropped_tile
+
+    return filled_img_array
+
+
+@dataclass
+class InfillTileOutput:
+    infilled: Image.Image
+    tile_image: Optional[Image.Image] = None
+
+
+def infill_tile(image_to_infill: Image.Image, seed: int, tile_size: int) -> InfillTileOutput:
+    """Infills an image with random tiles from the image itself.
+
+    If the image is not an RGBA image, it is returned untouched.
+
+    Args:
+        image_to_infill: The image to infill.
+        tile_size: The size of the tiles to use for infilling.
+
+    Raises:
+        ValueError: If there are not enough opaque pixels to generate any tiles.
+    """
+
+    if image_to_infill.mode != "RGBA":
+        return InfillTileOutput(infilled=image_to_infill)
+
+    # Internally, we want a tuple of (tile_width, tile_height). In the future, the tile size can be any rectangle.
+    _tile_size = (tile_size, tile_size)
+    np_image = np.array(image_to_infill, dtype=np.uint8)
+
+    # Create the pool of tiles that we will use to infill
+    tile_pool = create_tile_pool(np_image, _tile_size)
+
+    # Create an image from the tiles, same size as the original
+    tile_np_image = create_filled_image(np_image, tile_pool, _tile_size, seed)
+
+    # Paste the OG image over the tile image, effectively infilling the area
+    tile_image = Image.fromarray(tile_np_image, "RGB")
+    infilled = tile_image.copy()
+    infilled.paste(image_to_infill, (0, 0), image_to_infill.split()[-1])
+
+    # Image.convert() returns a new image, so the result must be reassigned
+    infilled = infilled.convert("RGBA")
+
+    return InfillTileOutput(infilled=infilled, tile_image=tile_image)
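
Review note: the core of `create_filled_image` above is a seeded random tile choice plus cropping at the right and bottom edges. Below is a simplified sketch of just that logic, assuming nested lists of integers in place of NumPy arrays and `random.Random` in place of `np.random.default_rng`. `fill_with_tiles` and its parameters are hypothetical names.

```python
import random

Grid = list[list[int]]


def fill_with_tiles(rows: int, cols: int, tiles: list[Grid], tile_h: int, tile_w: int, seed: int) -> Grid:
    """Fill a rows x cols grid with randomly chosen tiles, cropping at the edges."""
    rng = random.Random(seed)  # seeded, so the layout is reproducible
    out = [[0] * cols for _ in range(rows)]
    for y in range(0, rows, tile_h):
        for x in range(0, cols, tile_w):
            tile = tiles[rng.randrange(len(tiles))]
            # Near the bottom/right edges less than a full tile may fit.
            space_y = min(tile_h, rows - y)
            space_x = min(tile_w, cols - x)
            for dy in range(space_y):
                out[y + dy][x : x + space_x] = tile[dy][:space_x]
    return out


tiles = [[[1, 1], [1, 1]], [[2, 2], [2, 2]]]
a = fill_with_tiles(3, 5, tiles, 2, 2, seed=0)
b = fill_with_tiles(3, 5, tiles, 2, 2, seed=0)
```

Because the grid dimensions (3 x 5) are not multiples of the tile size (2 x 2), the last row and column are filled with cropped tiles, and reusing the same seed yields the same layout.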
--- a/invokeai/backend/image_util/patchmatch.py
+++ b/invokeai/backend/image_util/patchmatch.py
@@ -1,49 +0,0 @@
-"""
-This module defines a singleton object, "patchmatch" that
-wraps the actual patchmatch object. It respects the global
-"try_patchmatch" attribute, so that patchmatch loading can
-be suppressed or deferred
-"""
-
-import numpy as np
-
-import invokeai.backend.util.logging as logger
-from invokeai.app.services.config.config_default import get_config
-
-
-class PatchMatch:
-    """
-    Thin class wrapper around the patchmatch function.
-    """
-
-    patch_match = None
-    tried_load: bool = False
-
-    def __init__(self):
-        super().__init__()
-
-    @classmethod
-    def _load_patch_match(self):
-        if self.tried_load:
-            return
-        if get_config().patchmatch:
-            from patchmatch import patch_match as pm
-
-            if pm.patchmatch_available:
-                logger.info("Patchmatch initialized")
-            else:
-                logger.info("Patchmatch not loaded (nonfatal)")
-            self.patch_match = pm
-        else:
-            logger.info("Patchmatch loading disabled")
-        self.tried_load = True
-
-    @classmethod
-    def patchmatch_available(self) -> bool:
-        self._load_patch_match()
-        return self.patch_match and self.patch_match.patchmatch_available
-
-    @classmethod
-    def inpaint(self, *args, **kwargs) -> np.ndarray:
-        if self.patchmatch_available():
-            return self.patch_match.inpaint(*args, **kwargs)
--- a/invokeai/backend/image_util/realesrgan/realesrgan.py
+++ b/invokeai/backend/image_util/realesrgan/realesrgan.py
@@ -11,7 +11,7 @@ from cv2.typing import MatLike
 from tqdm import tqdm

 from invokeai.backend.image_util.basicsr.rrdbnet_arch import RRDBNet
-from invokeai.backend.util.devices import choose_torch_device
+from invokeai.backend.util.devices import TorchDevice

 """
 Adapted from https://github.com/xinntao/Real-ESRGAN/blob/master/realesrgan/utils.py
@@ -65,7 +65,7 @@ class RealESRGAN:
        self.pre_pad = pre_pad
        self.mod_scale: Optional[int] = None
        self.half = half
-        self.device = choose_torch_device()
+        self.device = TorchDevice.choose_torch_device()

        loadnet = torch.load(model_path, map_location=torch.device("cpu"))

--- a/invokeai/backend/image_util/safety_checker.py
+++ b/invokeai/backend/image_util/safety_checker.py
@@ -13,7 +13,7 @@ from transformers import AutoFeatureExtractor

 import invokeai.backend.util.logging as logger
 from invokeai.app.services.config.config_default import get_config
-from invokeai.backend.util.devices import choose_torch_device
+from invokeai.backend.util.devices import TorchDevice
 from invokeai.backend.util.silence_warnings import SilenceWarnings

 CHECKER_PATH = "core/convert/stable-diffusion-safety-checker"
@@ -51,7 +51,7 @@ class SafetyChecker:
        cls._load_safety_checker()
        if cls.safety_checker is None or cls.feature_extractor is None:
            return False
-        device = choose_torch_device()
+        device = TorchDevice.choose_torch_device()
        features = cls.feature_extractor([image], return_tensors="pt")
        features.to(device)
        cls.safety_checker.to(device)
--- a/invokeai/backend/ip_adapter/attention_processor.py
+++ b/invokeai/backend/ip_adapter/attention_processor.py
@@ -1,182 +0,0 @@
-# copied from https://github.com/tencent-ailab/IP-Adapter (Apache License 2.0)
-#   and modified as needed
-
-# tencent-ailab comment:
-# modified from https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from diffusers.models.attention_processor import AttnProcessor2_0 as DiffusersAttnProcessor2_0
-
-from invokeai.backend.ip_adapter.ip_attention_weights import IPAttentionProcessorWeights
-
-
-# Create a version of AttnProcessor2_0 that is a sub-class of nn.Module. This is required for IP-Adapter state_dict
-# loading.
-class AttnProcessor2_0(DiffusersAttnProcessor2_0, nn.Module):
-    def __init__(self):
-        DiffusersAttnProcessor2_0.__init__(self)
-        nn.Module.__init__(self)
-
-    def __call__(
-        self,
-        attn,
-        hidden_states,
-        encoder_hidden_states=None,
-        attention_mask=None,
-        temb=None,
-        ip_adapter_image_prompt_embeds=None,
-    ):
-        """Re-definition of DiffusersAttnProcessor2_0.__call__(...) that accepts and ignores the
-        ip_adapter_image_prompt_embeds parameter.
-        """
-        return DiffusersAttnProcessor2_0.__call__(
-            self, attn, hidden_states, encoder_hidden_states, attention_mask, temb
-        )
-
-
-class IPAttnProcessor2_0(torch.nn.Module):
-    r"""
-    Attention processor for IP-Adapater for PyTorch 2.0.
-    Args:
-        hidden_size (`int`):
-            The hidden size of the attention layer.
-        cross_attention_dim (`int`):
-            The number of channels in the `encoder_hidden_states`.
-        scale (`float`, defaults to 1.0):
-            the weight scale of image prompt.
-    """
-
-    def __init__(self, weights: list[IPAttentionProcessorWeights], scales: list[float]):
-        super().__init__()
-
-        if not hasattr(F, "scaled_dot_product_attention"):
-            raise ImportError("AttnProcessor2_0 requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0.")
-
-        assert len(weights) == len(scales)
-
-        self._weights = weights
-        self._scales = scales
-
-    def __call__(
-        self,
-        attn,
-        hidden_states,
-        encoder_hidden_states=None,
-        attention_mask=None,
-        temb=None,
-        ip_adapter_image_prompt_embeds=None,
-    ):
-        """Apply IP-Adapter attention.
-
-        Args:
-            ip_adapter_image_prompt_embeds (torch.Tensor): The image prompt embeddings.
-                Shape: (batch_size, num_ip_images, seq_len, ip_embedding_len).
-        """
-        residual = hidden_states
-
-        if attn.spatial_norm is not None:
-            hidden_states = attn.spatial_norm(hidden_states, temb)
-
-        input_ndim = hidden_states.ndim
-
-        if input_ndim == 4:
-            batch_size, channel, height, width = hidden_states.shape
-            hidden_states = hidden_states.view(batch_size, channel, height * width).transpose(1, 2)
-
-        batch_size, sequence_length, _ = (
-            hidden_states.shape if encoder_hidden_states is None else encoder_hidden_states.shape
-        )
-
-        if attention_mask is not None:
-            attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length, batch_size)
-            # scaled_dot_product_attention expects attention_mask shape to be
-            # (batch, heads, source_length, target_length)
-            attention_mask = attention_mask.view(batch_size, attn.heads, -1, attention_mask.shape[-1])
-
-        if attn.group_norm is not None:
-            hidden_states = attn.group_norm(hidden_states.transpose(1, 2)).transpose(1, 2)
-
-        query = attn.to_q(hidden_states)
-
-        if encoder_hidden_states is None:
-            encoder_hidden_states = hidden_states
-        elif attn.norm_cross:
-            encoder_hidden_states = attn.norm_encoder_hidden_states(encoder_hidden_states)
-
-        key = attn.to_k(encoder_hidden_states)
-        value = attn.to_v(encoder_hidden_states)
-
-        inner_dim = key.shape[-1]
-        head_dim = inner_dim // attn.heads
-
-        query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-
-        key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-        value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-
-        # the output of sdp = (batch, num_heads, seq_len, head_dim)
-        # TODO: add support for attn.scale when we move to Torch 2.1
-        hidden_states = F.scaled_dot_product_attention(
-            query, key, value, attn_mask=attention_mask, dropout_p=0.0, is_causal=False
-        )
-
-        hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
-        hidden_states = hidden_states.to(query.dtype)
-
-        if encoder_hidden_states is not None:
-            # If encoder_hidden_states is not None, then we are doing cross-attention, not self-attention. In this case,
-            # we will apply IP-Adapter conditioning. We validate the inputs for IP-Adapter conditioning here.
-            assert ip_adapter_image_prompt_embeds is not None
-            assert len(ip_adapter_image_prompt_embeds) == len(self._weights)
-
-            for ipa_embed, ipa_weights, scale in zip(
-                ip_adapter_image_prompt_embeds, self._weights, self._scales, strict=True
-            ):
-                # The batch dimensions should match.
-                assert ipa_embed.shape[0] == encoder_hidden_states.shape[0]
-                # The token_len dimensions should match.
-                assert ipa_embed.shape[-1] == encoder_hidden_states.shape[-1]
-
-                ip_hidden_states = ipa_embed
-
-                # Expected ip_hidden_state shape: (batch_size, num_ip_images, ip_seq_len, ip_image_embedding)
-
-                ip_key = ipa_weights.to_k_ip(ip_hidden_states)
-                ip_value = ipa_weights.to_v_ip(ip_hidden_states)
-
-                # Expected ip_key and ip_value shape: (batch_size, num_ip_images, ip_seq_len, head_dim * num_heads)
-
-                ip_key = ip_key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-                ip_value = ip_value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-
-                # Expected ip_key and ip_value shape: (batch_size, num_heads, num_ip_images * ip_seq_len, head_dim)
-
-                # TODO: add support for attn.scale when we move to Torch 2.1
-                ip_hidden_states = F.scaled_dot_product_attention(
-                    query, ip_key, ip_value, attn_mask=None, dropout_p=0.0, is_causal=False
-                )
-
-                # Expected ip_hidden_states shape: (batch_size, num_heads, query_seq_len, head_dim)
-
-                ip_hidden_states = ip_hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
-                ip_hidden_states = ip_hidden_states.to(query.dtype)
-
-                # Expected ip_hidden_states shape: (batch_size, query_seq_len, num_heads * head_dim)
-
-                hidden_states = hidden_states + scale * ip_hidden_states
-
-        # linear proj
-        hidden_states = attn.to_out[0](hidden_states)
-        # dropout
-        hidden_states = attn.to_out[1](hidden_states)
-
-        if input_ndim == 4:
-            hidden_states = hidden_states.transpose(-1, -2).reshape(batch_size, channel, height, width)
-
-        if attn.residual_connection:
-            hidden_states = hidden_states + residual
-
-        hidden_states = hidden_states / attn.rescale_output_factor
-
-        return hidden_states
--- a/invokeai/backend/ip_adapter/ip_adapter.py
+++ b/invokeai/backend/ip_adapter/ip_adapter.py
@@ -12,6 +12,7 @@ from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

 from invokeai.backend.ip_adapter.ip_attention_weights import IPAttentionWeights

+from ..raw_model import RawModel
 from .resampler import Resampler


@@ -101,7 +102,7 @@ class MLPProjModel(torch.nn.Module):
        return clip_extra_context_tokens


-class IPAdapter(torch.nn.Module):
+class IPAdapter(RawModel):
    """IP-Adapter: https://arxiv.org/pdf/2308.06721.pdf"""

    def __init__(
@@ -111,7 +112,6 @@ class IPAdapter(torch.nn.Module):
        dtype: torch.dtype = torch.float16,
        num_tokens: int = 4,
    ):
-        super().__init__()
        self.device = device
        self.dtype = dtype

--- a/invokeai/backend/ip_adapter/unet_patcher.py
+++ b/invokeai/backend/ip_adapter/unet_patcher.py
@@ -1,53 +0,0 @@
-from contextlib import contextmanager
-
-from diffusers.models import UNet2DConditionModel
-
-from invokeai.backend.ip_adapter.attention_processor import AttnProcessor2_0, IPAttnProcessor2_0
-from invokeai.backend.ip_adapter.ip_adapter import IPAdapter
-
-
-class UNetPatcher:
-    """A class that contains multiple IP-Adapters and can apply them to a UNet."""
-
-    def __init__(self, ip_adapters: list[IPAdapter]):
-        self._ip_adapters = ip_adapters
-        self._scales = [1.0] * len(self._ip_adapters)
-
-    def set_scale(self, idx: int, value: float):
-        self._scales[idx] = value
-
-    def _prepare_attention_processors(self, unet: UNet2DConditionModel):
-        """Prepare a dict of attention processors that can be injected into a unet, and load the IP-Adapter attention
-        weights into them.
-
-        Note that the `unet` param is only used to determine attention block dimensions and naming.
-        """
-        # Construct a dict of attention processors based on the UNet's architecture.
-        attn_procs = {}
-        for idx, name in enumerate(unet.attn_processors.keys()):
-            if name.endswith("attn1.processor"):
-                attn_procs[name] = AttnProcessor2_0()
-            else:
-                # Collect the weights from each IP Adapter for the idx'th attention processor.
-                attn_procs[name] = IPAttnProcessor2_0(
-                    [ip_adapter.attn_weights.get_attention_processor_weights(idx) for ip_adapter in self._ip_adapters],
-                    self._scales,
-                )
-        return attn_procs
-
-    @contextmanager
-    def apply_ip_adapter_attention(self, unet: UNet2DConditionModel):
-        """A context manager that patches `unet` with IP-Adapter attention processors."""
-
-        attn_procs = self._prepare_attention_processors(unet)
-
-        orig_attn_processors = unet.attn_processors
-
-        try:
-            # Note to future devs: set_attn_processor(...) does something slightly unexpected - it pops elements from the
-            # passed dict. So, if you wanted to keep the dict for future use, you'd have to make a moderately-shallow copy
-            # of it. E.g. `attn_procs_copy = {k: v for k, v in attn_procs.items()}`.
-            unet.set_attn_processor(attn_procs)
-            yield None
-        finally:
-            unet.set_attn_processor(orig_attn_processors)
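Note on the removed `UNetPatcher`: the patch-and-restore pattern it used can be sketched in isolation as below. This is a minimal sketch, not the project's code. `FakeUNet` and `patched_attention` are hypothetical names, and `FakeUNet` only imitates the behaviour the removed comment describes (the setter pops entries from the dict it is given).

```python
from contextlib import contextmanager


class FakeUNet:
    """Hypothetical stand-in for a UNet; it only mimics getting and setting processors."""

    def __init__(self, processors):
        self._processors = dict(processors)

    @property
    def attn_processors(self):
        # Return a fresh dict each time, so callers receive a snapshot.
        return dict(self._processors)

    def set_attn_processor(self, processors):
        # Imitate the pop-while-assigning behaviour noted in the removed comment.
        for name in list(self._processors):
            self._processors[name] = processors.pop(name)


@contextmanager
def patched_attention(unet, new_processors):
    original = unet.attn_processors  # snapshot taken before patching
    try:
        # Pass a shallow copy so the caller's dict survives the pops.
        unet.set_attn_processor(dict(new_processors))
        yield
    finally:
        # Always restore, even if the body of the `with` block raised.
        unet.set_attn_processor(original)


unet = FakeUNet({"attn1.processor": "default", "attn2.processor": "default"})
replacement = {"attn1.processor": "default", "attn2.processor": "ip_adapter"}

with patched_attention(unet, replacement):
    inside = unet.attn_processors
after = unet.attn_processors
```

Copying the dict before handing it to the setter is what keeps `replacement` reusable across several `with` blocks.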
--- a/invokeai/backend/lora.py
+++ b/invokeai/backend/lora.py
@@ -0,0 +1,624 @@
+# Copyright (c) 2024 The InvokeAI Development team
+"""LoRA model support."""
+
+import bisect
+from pathlib import Path
+from typing import Dict, List, Optional, Tuple, Union
+
+import torch
+from safetensors.torch import load_file
+from typing_extensions import Self
+
+from invokeai.backend.model_manager import BaseModelType
+
+from .raw_model import RawModel
+
+
+class LoRALayerBase:
+    # rank: Optional[int]
+    # alpha: Optional[float]
+    # bias: Optional[torch.Tensor]
+    # layer_key: str
+
+    # @property
+    # def scale(self):
+    #    return self.alpha / self.rank if (self.alpha and self.rank) else 1.0
+
+    def __init__(
+        self,
+        layer_key: str,
+        values: Dict[str, torch.Tensor],
+    ):
+        if "alpha" in values:
+            self.alpha = values["alpha"].item()
+        else:
+            self.alpha = None
+
+        if "bias_indices" in values and "bias_values" in values and "bias_size" in values:
+            self.bias: Optional[torch.Tensor] = torch.sparse_coo_tensor(
+                values["bias_indices"],
+                values["bias_values"],
+                tuple(values["bias_size"]),
+            )
+
+        else:
+            self.bias = None
+
+        self.rank = None  # set in layer implementation
+        self.layer_key = layer_key
+
+    def get_weight(self, orig_weight: Optional[torch.Tensor]) -> torch.Tensor:
+        raise NotImplementedError()
+
+    def calc_size(self) -> int:
+        model_size = 0
+        for val in [self.bias]:
+            if val is not None:
+                model_size += val.nelement() * val.element_size()
+        return model_size
+
+    def to(
+        self,
+        device: Optional[torch.device] = None,
+        dtype: Optional[torch.dtype] = None,
+    ) -> None:
+        if self.bias is not None:
+            self.bias = self.bias.to(device=device, dtype=dtype)
+
+
+# TODO: find and debug lora/locon with bias
+class LoRALayer(LoRALayerBase):
+    # up: torch.Tensor
+    # mid: Optional[torch.Tensor]
+    # down: torch.Tensor
+
+    def __init__(
+        self,
+        layer_key: str,
+        values: Dict[str, torch.Tensor],
+    ):
+        super().__init__(layer_key, values)
+
+        self.up = values["lora_up.weight"]
+        self.down = values["lora_down.weight"]
+        if "lora_mid.weight" in values:
+            self.mid: Optional[torch.Tensor] = values["lora_mid.weight"]
+        else:
+            self.mid = None
+
+        self.rank = self.down.shape[0]
+
+    def get_weight(self, orig_weight: Optional[torch.Tensor]) -> torch.Tensor:
+        if self.mid is not None:
+            up = self.up.reshape(self.up.shape[0], self.up.shape[1])
+            down = self.down.reshape(self.down.shape[0], self.down.shape[1])
+            weight = torch.einsum("m n w h, i m, n j -> i j w h", self.mid, up, down)
+        else:
+            weight = self.up.reshape(self.up.shape[0], -1) @ self.down.reshape(self.down.shape[0], -1)
+
+        return weight
+
+    def calc_size(self) -> int:
+        model_size = super().calc_size()
+        for val in [self.up, self.mid, self.down]:
+            if val is not None:
+                model_size += val.nelement() * val.element_size()
+        return model_size
+
+    def to(
+        self,
+        device: Optional[torch.device] = None,
+        dtype: Optional[torch.dtype] = None,
+    ) -> None:
+        super().to(device=device, dtype=dtype)
+
+        self.up = self.up.to(device=device, dtype=dtype)
+        self.down = self.down.to(device=device, dtype=dtype)
+
+        if self.mid is not None:
+            self.mid = self.mid.to(device=device, dtype=dtype)
+
+
+class LoHALayer(LoRALayerBase):
+    # w1_a: torch.Tensor
+    # w1_b: torch.Tensor
+    # w2_a: torch.Tensor
+    # w2_b: torch.Tensor
+    # t1: Optional[torch.Tensor] = None
+    # t2: Optional[torch.Tensor] = None
+
+    def __init__(self, layer_key: str, values: Dict[str, torch.Tensor]):
+        super().__init__(layer_key, values)
+
+        self.w1_a = values["hada_w1_a"]
+        self.w1_b = values["hada_w1_b"]
+        self.w2_a = values["hada_w2_a"]
+        self.w2_b = values["hada_w2_b"]
+
+        if "hada_t1" in values:
+            self.t1: Optional[torch.Tensor] = values["hada_t1"]
+        else:
+            self.t1 = None
+
+        if "hada_t2" in values:
+            self.t2: Optional[torch.Tensor] = values["hada_t2"]
+        else:
+            self.t2 = None
+
+        self.rank = self.w1_b.shape[0]
+
+    def get_weight(self, orig_weight: Optional[torch.Tensor]) -> torch.Tensor:
+        if self.t1 is None:
+            weight: torch.Tensor = (self.w1_a @ self.w1_b) * (self.w2_a @ self.w2_b)
+
+        else:
+            rebuild1 = torch.einsum("i j k l, j r, i p -> p r k l", self.t1, self.w1_b, self.w1_a)
+            rebuild2 = torch.einsum("i j k l, j r, i p -> p r k l", self.t2, self.w2_b, self.w2_a)
+            weight = rebuild1 * rebuild2
+
+        return weight
+
+    def calc_size(self) -> int:
+        model_size = super().calc_size()
+        for val in [self.w1_a, self.w1_b, self.w2_a, self.w2_b, self.t1, self.t2]:
+            if val is not None:
+                model_size += val.nelement() * val.element_size()
+        return model_size
+
+    def to(
+        self,
+        device: Optional[torch.device] = None,
+        dtype: Optional[torch.dtype] = None,
+    ) -> None:
+        super().to(device=device, dtype=dtype)
+
+        self.w1_a = self.w1_a.to(device=device, dtype=dtype)
+        self.w1_b = self.w1_b.to(device=device, dtype=dtype)
+        if self.t1 is not None:
+            self.t1 = self.t1.to(device=device, dtype=dtype)
+
+        self.w2_a = self.w2_a.to(device=device, dtype=dtype)
+        self.w2_b = self.w2_b.to(device=device, dtype=dtype)
+        if self.t2 is not None:
+            self.t2 = self.t2.to(device=device, dtype=dtype)
+
+
+class LoKRLayer(LoRALayerBase):
+    # w1: Optional[torch.Tensor] = None
+    # w1_a: Optional[torch.Tensor] = None
+    # w1_b: Optional[torch.Tensor] = None
+    # w2: Optional[torch.Tensor] = None
+    # w2_a: Optional[torch.Tensor] = None
+    # w2_b: Optional[torch.Tensor] = None
+    # t2: Optional[torch.Tensor] = None
+
+    def __init__(
+        self,
+        layer_key: str,
+        values: Dict[str, torch.Tensor],
+    ):
+        super().__init__(layer_key, values)
+
+        if "lokr_w1" in values:
+            self.w1: Optional[torch.Tensor] = values["lokr_w1"]
+            self.w1_a = None
+            self.w1_b = None
+        else:
+            self.w1 = None
+            self.w1_a = values["lokr_w1_a"]
+            self.w1_b = values["lokr_w1_b"]
+
+        if "lokr_w2" in values:
+            self.w2: Optional[torch.Tensor] = values["lokr_w2"]
+            self.w2_a = None
+            self.w2_b = None
+        else:
+            self.w2 = None
+            self.w2_a = values["lokr_w2_a"]
+            self.w2_b = values["lokr_w2_b"]
+
+        if "lokr_t2" in values:
+            self.t2: Optional[torch.Tensor] = values["lokr_t2"]
+        else:
+            self.t2 = None
+
+        if "lokr_w1_b" in values:
+            self.rank = values["lokr_w1_b"].shape[0]
+        elif "lokr_w2_b" in values:
+            self.rank = values["lokr_w2_b"].shape[0]
+        else:
+            self.rank = None  # unscaled
+
+    def get_weight(self, orig_weight: Optional[torch.Tensor]) -> torch.Tensor:
+        w1: Optional[torch.Tensor] = self.w1
+        if w1 is None:
+            assert self.w1_a is not None
+            assert self.w1_b is not None
+            w1 = self.w1_a @ self.w1_b
+
+        w2 = self.w2
+        if w2 is None:
+            if self.t2 is None:
+                assert self.w2_a is not None
+                assert self.w2_b is not None
+                w2 = self.w2_a @ self.w2_b
+            else:
+                w2 = torch.einsum("i j k l, i p, j r -> p r k l", self.t2, self.w2_a, self.w2_b)
+
+        if len(w2.shape) == 4:
+            w1 = w1.unsqueeze(2).unsqueeze(2)
+        w2 = w2.contiguous()
+        assert w1 is not None
+        assert w2 is not None
+        weight = torch.kron(w1, w2)
+
+        return weight
+
+    def calc_size(self) -> int:
+        model_size = super().calc_size()
+        for val in [self.w1, self.w1_a, self.w1_b, self.w2, self.w2_a, self.w2_b, self.t2]:
+            if val is not None:
+                model_size += val.nelement() * val.element_size()
+        return model_size
+
+    def to(
+        self,
+        device: Optional[torch.device] = None,
+        dtype: Optional[torch.dtype] = None,
+    ) -> None:
+        super().to(device=device, dtype=dtype)
+
+        if self.w1 is not None:
+            self.w1 = self.w1.to(device=device, dtype=dtype)
+        else:
+            assert self.w1_a is not None
+            assert self.w1_b is not None
+            self.w1_a = self.w1_a.to(device=device, dtype=dtype)
+            self.w1_b = self.w1_b.to(device=device, dtype=dtype)
+
+        if self.w2 is not None:
+            self.w2 = self.w2.to(device=device, dtype=dtype)
+        else:
+            assert self.w2_a is not None
+            assert self.w2_b is not None
+            self.w2_a = self.w2_a.to(device=device, dtype=dtype)
+            self.w2_b = self.w2_b.to(device=device, dtype=dtype)
+
+        if self.t2 is not None:
+            self.t2 = self.t2.to(device=device, dtype=dtype)
+
+
+class FullLayer(LoRALayerBase):
+    # weight: torch.Tensor
+
+    def __init__(
+        self,
+        layer_key: str,
+        values: Dict[str, torch.Tensor],
+    ):
+        super().__init__(layer_key, values)
+
+        self.weight = values["diff"]
+
+        if len(values.keys()) > 1:
+            _keys = list(values.keys())
+            _keys.remove("diff")
+            raise NotImplementedError(f"Unexpected keys in lora diff layer: {_keys}")
+
+        self.rank = None  # unscaled
+
+    def get_weight(self, orig_weight: Optional[torch.Tensor]) -> torch.Tensor:
+        return self.weight
+
+    def calc_size(self) -> int:
+        model_size = super().calc_size()
+        model_size += self.weight.nelement() * self.weight.element_size()
+        return model_size
+
+    def to(
+        self,
+        device: Optional[torch.device] = None,
+        dtype: Optional[torch.dtype] = None,
+    ) -> None:
+        super().to(device=device, dtype=dtype)
+
+        self.weight = self.weight.to(device=device, dtype=dtype)
+
+
+class IA3Layer(LoRALayerBase):
+    # weight: torch.Tensor
+    # on_input: torch.Tensor
+
+    def __init__(
+        self,
+        layer_key: str,
+        values: Dict[str, torch.Tensor],
+    ):
+        super().__init__(layer_key, values)
+
+        self.weight = values["weight"]
+        self.on_input = values["on_input"]
+
+        self.rank = None  # unscaled
+
+    def get_weight(self, orig_weight: Optional[torch.Tensor]) -> torch.Tensor:
+        weight = self.weight
+        if not self.on_input:
+            weight = weight.reshape(-1, 1)
+        assert orig_weight is not None
+        return orig_weight * weight
+
+    def calc_size(self) -> int:
+        model_size = super().calc_size()
+        model_size += self.weight.nelement() * self.weight.element_size()
+        model_size += self.on_input.nelement() * self.on_input.element_size()
+        return model_size
+
+    def to(
+        self,
+        device: Optional[torch.device] = None,
+        dtype: Optional[torch.dtype] = None,
+    ):
+        super().to(device=device, dtype=dtype)
+
+        self.weight = self.weight.to(device=device, dtype=dtype)
+        self.on_input = self.on_input.to(device=device, dtype=dtype)
+
+
+AnyLoRALayer = Union[LoRALayer, LoHALayer, LoKRLayer, FullLayer, IA3Layer]
+
+
+class LoRAModelRaw(RawModel):  # (torch.nn.Module):
+    _name: str
+    layers: Dict[str, AnyLoRALayer]
+
+    def __init__(
+        self,
+        name: str,
+        layers: Dict[str, AnyLoRALayer],
+    ):
+        self._name = name
+        self.layers = layers
+
+    @property
+    def name(self) -> str:
+        return self._name
+
+    def to(
+        self,
+        device: Optional[torch.device] = None,
+        dtype: Optional[torch.dtype] = None,
+    ) -> None:
+        # TODO: try revert if exception?
+        for _key, layer in self.layers.items():
+            layer.to(device=device, dtype=dtype)
+
+    def calc_size(self) -> int:
+        model_size = 0
+        for _, layer in self.layers.items():
+            model_size += layer.calc_size()
+        return model_size
+
+    @classmethod
+    def _convert_sdxl_keys_to_diffusers_format(cls, state_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
+        """Convert the keys of an SDXL LoRA state_dict to diffusers format.
+
+        The input state_dict can be in either Stability AI format or diffusers format. If the state_dict is already in
+        diffusers format, then this function will have no effect.
+
+        This function is adapted from:
+        https://github.com/bmaltais/kohya_ss/blob/2accb1305979ba62f5077a23aabac23b4c37e935/networks/lora_diffusers.py#L385-L409
+
+        Args:
+            state_dict (Dict[str, Tensor]): The SDXL LoRA state_dict.
+
+        Raises:
+            ValueError: If state_dict contains an unrecognized key, or not all keys could be converted.
+
+        Returns:
+            Dict[str, Tensor]: The diffusers-format state_dict.
+        """
+        converted_count = 0  # The number of Stability AI keys converted to diffusers format.
+        not_converted_count = 0  # The number of keys that were not converted.
+
+        # Get a sorted list of Stability AI UNet keys so that we can efficiently search for keys with matching prefixes.
+        # For example, we want to efficiently find `input_blocks_4_1` in the list when searching for
+        # `input_blocks_4_1_proj_in`.
+        stability_unet_keys = list(SDXL_UNET_STABILITY_TO_DIFFUSERS_MAP)
+        stability_unet_keys.sort()
+
+        new_state_dict = {}
+        for full_key, value in state_dict.items():
+            if full_key.startswith("lora_unet_"):
+                search_key = full_key.replace("lora_unet_", "")
+                # Use bisect to find the key in stability_unet_keys that *may* match the search_key's prefix.
+                position = bisect.bisect_right(stability_unet_keys, search_key)
+                map_key = stability_unet_keys[position - 1]
+                # Now, check if the map_key *actually* matches the search_key.
+                if search_key.startswith(map_key):
+                    new_key = full_key.replace(map_key, SDXL_UNET_STABILITY_TO_DIFFUSERS_MAP[map_key])
+                    new_state_dict[new_key] = value
+                    converted_count += 1
+                else:
+                    new_state_dict[full_key] = value
+                    not_converted_count += 1
+            elif full_key.startswith("lora_te1_") or full_key.startswith("lora_te2_"):
+                # The CLIP text encoders have the same keys in both Stability AI and diffusers formats.
+                new_state_dict[full_key] = value
+                continue
+            else:
+                raise ValueError(f"Unrecognized SDXL LoRA key prefix: '{full_key}'.")
+
+        if converted_count > 0 and not_converted_count > 0:
+            raise ValueError(
+                f"The SDXL LoRA could only be partially converted to diffusers format. converted={converted_count},"
+                f" not_converted={not_converted_count}"
+            )
+
+        return new_state_dict
+
+    @classmethod
+    def from_checkpoint(
+        cls,
+        file_path: Union[str, Path],
+        device: Optional[torch.device] = None,
+        dtype: Optional[torch.dtype] = None,
+        base_model: Optional[BaseModelType] = None,
+    ) -> Self:
+        device = device or torch.device("cpu")
+        dtype = dtype or torch.float32
+
+        if isinstance(file_path, str):
+            file_path = Path(file_path)
+
+        model = cls(
+            name=file_path.stem,
+            layers={},
+        )
+
+        if file_path.suffix == ".safetensors":
+            sd = load_file(file_path.absolute().as_posix(), device="cpu")
+        else:
+            sd = torch.load(file_path, map_location="cpu")
+
+        state_dict = cls._group_state(sd)
+
+        if base_model == BaseModelType.StableDiffusionXL:
+            state_dict = cls._convert_sdxl_keys_to_diffusers_format(state_dict)
+
+        for layer_key, values in state_dict.items():
+            # lora and locon
+            if "lora_down.weight" in values:
+                layer: AnyLoRALayer = LoRALayer(layer_key, values)
+
+            # loha
+            elif "hada_w1_b" in values:
+                layer = LoHALayer(layer_key, values)
+
+            # lokr
+            elif "lokr_w1_b" in values or "lokr_w1" in values:
+                layer = LoKRLayer(layer_key, values)
+
+            # diff
+            elif "diff" in values:
+                layer = FullLayer(layer_key, values)
+
+            # ia3
+            elif "weight" in values and "on_input" in values:
+                layer = IA3Layer(layer_key, values)
+
+            else:
+                print(f">> Encountered unknown lora layer module in {model.name}: {layer_key} - {list(values.keys())}")
+                raise Exception("Unknown lora format!")
+
+            # lower memory consumption by removing already parsed layer values
+            state_dict[layer_key].clear()
+
+            layer.to(device=device, dtype=dtype)
+            model.layers[layer_key] = layer
+
+        return model
+
+    @staticmethod
+    def _group_state(state_dict: Dict[str, torch.Tensor]) -> Dict[str, Dict[str, torch.Tensor]]:
+        state_dict_grouped: Dict[str, Dict[str, torch.Tensor]] = {}
+
+        for key, value in state_dict.items():
+            stem, leaf = key.split(".", 1)
+            if stem not in state_dict_grouped:
+                state_dict_grouped[stem] = {}
+            state_dict_grouped[stem][leaf] = value
+
+        return state_dict_grouped
+
+
+# code from
+# https://github.com/bmaltais/kohya_ss/blob/2accb1305979ba62f5077a23aabac23b4c37e935/networks/lora_diffusers.py#L15C1-L97C32
+def make_sdxl_unet_conversion_map() -> List[Tuple[str, str]]:
+    """Create a list of (Stability AI prefix, diffusers prefix) pairs for SDXL UNet state_dict keys."""
+    unet_conversion_map_layer = []
+
+    for i in range(3):  # num_blocks is 3 in sdxl
+        # loop over downblocks/upblocks
+        for j in range(2):
+            # loop over resnets/attentions for downblocks
+            hf_down_res_prefix = f"down_blocks.{i}.resnets.{j}."
+            sd_down_res_prefix = f"input_blocks.{3*i + j + 1}.0."
+            unet_conversion_map_layer.append((sd_down_res_prefix, hf_down_res_prefix))
+
+            if i < 3:
+                # no attention layers in down_blocks.3
+                hf_down_atn_prefix = f"down_blocks.{i}.attentions.{j}."
+                sd_down_atn_prefix = f"input_blocks.{3*i + j + 1}.1."
+                unet_conversion_map_layer.append((sd_down_atn_prefix, hf_down_atn_prefix))
+
+        for j in range(3):
+            # loop over resnets/attentions for upblocks
+            hf_up_res_prefix = f"up_blocks.{i}.resnets.{j}."
+            sd_up_res_prefix = f"output_blocks.{3*i + j}.0."
+            unet_conversion_map_layer.append((sd_up_res_prefix, hf_up_res_prefix))
+
+            # `if i > 0:` is commented out for SDXL. The original check existed because there are
+            # no attention layers in up_blocks.0 in the non-SDXL layout.
+            hf_up_atn_prefix = f"up_blocks.{i}.attentions.{j}."
+            sd_up_atn_prefix = f"output_blocks.{3*i + j}.1."
+            unet_conversion_map_layer.append((sd_up_atn_prefix, hf_up_atn_prefix))
+
+        if i < 3:
+            # no downsample in down_blocks.3
+            hf_downsample_prefix = f"down_blocks.{i}.downsamplers.0.conv."
+            sd_downsample_prefix = f"input_blocks.{3*(i+1)}.0.op."
+            unet_conversion_map_layer.append((sd_downsample_prefix, hf_downsample_prefix))
+
+            # no upsample in up_blocks.3
+            hf_upsample_prefix = f"up_blocks.{i}.upsamplers.0."
+            sd_upsample_prefix = f"output_blocks.{3*i + 2}.{2}."  # change for sdxl
+            unet_conversion_map_layer.append((sd_upsample_prefix, hf_upsample_prefix))
+
+    hf_mid_atn_prefix = "mid_block.attentions.0."
+    sd_mid_atn_prefix = "middle_block.1."
+    unet_conversion_map_layer.append((sd_mid_atn_prefix, hf_mid_atn_prefix))
+
+    for j in range(2):
+        hf_mid_res_prefix = f"mid_block.resnets.{j}."
+        sd_mid_res_prefix = f"middle_block.{2*j}."
+        unet_conversion_map_layer.append((sd_mid_res_prefix, hf_mid_res_prefix))
+
+    unet_conversion_map_resnet = [
+        # (stable-diffusion, HF Diffusers)
+        ("in_layers.0.", "norm1."),
+        ("in_layers.2.", "conv1."),
+        ("out_layers.0.", "norm2."),
+        ("out_layers.3.", "conv2."),
+        ("emb_layers.1.", "time_emb_proj."),
+        ("skip_connection.", "conv_shortcut."),
+    ]
+
+    unet_conversion_map = []
+    for sd, hf in unet_conversion_map_layer:
+        if "resnets" in hf:
+            for sd_res, hf_res in unet_conversion_map_resnet:
+                unet_conversion_map.append((sd + sd_res, hf + hf_res))
+        else:
+            unet_conversion_map.append((sd, hf))
+
+    for j in range(2):
+        hf_time_embed_prefix = f"time_embedding.linear_{j+1}."
+        sd_time_embed_prefix = f"time_embed.{j*2}."
+        unet_conversion_map.append((sd_time_embed_prefix, hf_time_embed_prefix))
+
+    for j in range(2):
+        hf_label_embed_prefix = f"add_embedding.linear_{j+1}."
+        sd_label_embed_prefix = f"label_emb.0.{j*2}."
+        unet_conversion_map.append((sd_label_embed_prefix, hf_label_embed_prefix))
+
+    unet_conversion_map.append(("input_blocks.0.0.", "conv_in."))
+    unet_conversion_map.append(("out.0.", "conv_norm_out."))
+    unet_conversion_map.append(("out.2.", "conv_out."))
+
+    return unet_conversion_map
+
+
+SDXL_UNET_STABILITY_TO_DIFFUSERS_MAP = {
+    sd.rstrip(".").replace(".", "_"): hf.rstrip(".").replace(".", "_") for sd, hf in make_sdxl_unet_conversion_map()
+}
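Note on `_convert_sdxl_keys_to_diffusers_format`: the sorted-list-plus-bisect prefix lookup can be tried on its own with the sketch below. It is an illustration under assumptions, not the project's code. `PREFIX_MAP` is a three-entry excerpt of the kind of mapping that `make_sdxl_unet_conversion_map()` produces, and `convert_key` is a hypothetical helper name.

```python
import bisect

# Shortened excerpt for illustration; the real map is far larger.
PREFIX_MAP = {
    "input_blocks_4_1": "down_blocks_1_attentions_0",
    "input_blocks_5_1": "down_blocks_1_attentions_1",
    "middle_block_1": "mid_block_attentions_0",
}
SORTED_PREFIXES = sorted(PREFIX_MAP)


def convert_key(full_key: str) -> str:
    """Swap a Stability-style prefix for its diffusers form, or return the key unchanged."""
    search_key = full_key.replace("lora_unet_", "")
    # bisect_right gives the insertion point; the entry just before it is the
    # largest prefix that sorts <= search_key, so it is a candidate that *may* match.
    position = bisect.bisect_right(SORTED_PREFIXES, search_key)
    candidate = SORTED_PREFIXES[position - 1]
    # The candidate still has to be verified with an explicit prefix check.
    if search_key.startswith(candidate):
        return full_key.replace(candidate, PREFIX_MAP[candidate])
    return full_key


converted = convert_key("lora_unet_input_blocks_4_1_proj_in")
unmatched = convert_key("lora_unet_output_blocks_0_0")
```

The lookup costs one binary search per key instead of a scan over every prefix. When the insertion point is 0, `position - 1` wraps to the last entry, which then fails the `startswith` check, so the key is left unchanged.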
--- a/invokeai/backend/lora/__init__.py
+++ b/invokeai/backend/lora/__init__.py
--- a/invokeai/backend/lora/full_layer.py
+++ b/invokeai/backend/lora/full_layer.py
@@ -1,42 +0,0 @@
-from typing import Dict, Optional
-
-import torch
-
-from invokeai.backend.lora.lora_layer_base import LoRALayerBase
-
-
-class FullLayer(LoRALayerBase):
-    # weight: torch.Tensor
-
-    def __init__(
-        self,
-        layer_key: str,
-        values: Dict[str, torch.Tensor],
-    ):
-        super().__init__(layer_key, values)
-
-        self.weight = values["diff"]
-
-        if len(values.keys()) > 1:
-            _keys = list(values.keys())
-            _keys.remove("diff")
-            raise NotImplementedError(f"Unexpected keys in lora diff layer: {_keys}")
-
-        self.rank = None  # unscaled
-
-    def get_weight(self, orig_weight: Optional[torch.Tensor]) -> torch.Tensor:
-        return self.weight
-
-    def calc_size(self) -> int:
-        model_size = super().calc_size()
-        model_size += self.weight.nelement() * self.weight.element_size()
-        return model_size
-
-    def to(
-        self,
-        device: Optional[torch.device] = None,
-        dtype: Optional[torch.dtype] = None,
-    ) -> None:
-        super().to(device=device, dtype=dtype)
-
-        self.weight = self.weight.to(device=device, dtype=dtype)
--- a/invokeai/backend/lora/ia3_layer.py
+++ b/invokeai/backend/lora/ia3_layer.py
@@ -1,45 +0,0 @@
-from typing import Dict, Optional
-
-import torch
-
-from invokeai.backend.lora.lora_layer_base import LoRALayerBase
-
-
-class IA3Layer(LoRALayerBase):
-    # weight: torch.Tensor
-    # on_input: torch.Tensor
-
-    def __init__(
-        self,
-        layer_key: str,
-        values: Dict[str, torch.Tensor],
-    ):
-        super().__init__(layer_key, values)
-
-        self.weight = values["weight"]
-        self.on_input = values["on_input"]
-
-        self.rank = None  # unscaled
-
-    def get_weight(self, orig_weight: Optional[torch.Tensor]) -> torch.Tensor:
-        weight = self.weight
-        if not self.on_input:
-            weight = weight.reshape(-1, 1)
-        assert orig_weight is not None
-        return orig_weight * weight
-
-    def calc_size(self) -> int:
-        model_size = super().calc_size()
-        model_size += self.weight.nelement() * self.weight.element_size()
-        model_size += self.on_input.nelement() * self.on_input.element_size()
-        return model_size
-
-    def to(
-        self,
-        device: Optional[torch.device] = None,
-        dtype: Optional[torch.dtype] = None,
-    ):
-        super().to(device=device, dtype=dtype)
-
-        self.weight = self.weight.to(device=device, dtype=dtype)
-        self.on_input = self.on_input.to(device=device, dtype=dtype)
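Note on `IA3Layer.get_weight`: the effect of the `reshape(-1, 1)` branch is easier to see on a tiny matrix. The sketch below uses plain Python lists instead of tensors and assumes the original weight is laid out as (output features, input features). `ia3_scale` is a hypothetical name.

```python
from typing import List

Matrix = List[List[float]]


def ia3_scale(orig_weight: Matrix, scale: List[float], on_input: bool) -> Matrix:
    """Scale columns (on_input=True) or rows (on_input=False) of orig_weight by `scale`."""
    if on_input:
        # One factor per input feature: each column is multiplied by its own factor.
        return [[w * s for w, s in zip(row, scale)] for row in orig_weight]
    # One factor per output feature: this is the reshape(-1, 1) case, one factor per row.
    return [[w * s for w in row] for row, s in zip(orig_weight, scale)]


orig = [[1.0, 2.0], [3.0, 4.0]]
scaled_inputs = ia3_scale(orig, [10.0, 100.0], on_input=True)
scaled_outputs = ia3_scale(orig, [10.0, 100.0], on_input=False)
```

Unlike the other layer types, the result depends on the original weight, which is why `get_weight` asserts that `orig_weight` is not `None`.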
--- a/invokeai/backend/lora/loha_layer.py
+++ b/invokeai/backend/lora/loha_layer.py
@@ -1,69 +0,0 @@
-from typing import Dict, Optional
-
-import torch
-
-from invokeai.backend.lora.lora_layer_base import LoRALayerBase
-
-
-class LoHALayer(LoRALayerBase):
-    # w1_a: torch.Tensor
-    # w1_b: torch.Tensor
-    # w2_a: torch.Tensor
-    # w2_b: torch.Tensor
-    # t1: Optional[torch.Tensor] = None
-    # t2: Optional[torch.Tensor] = None
-
-    def __init__(self, layer_key: str, values: Dict[str, torch.Tensor]):
-        super().__init__(layer_key, values)
-
-        self.w1_a = values["hada_w1_a"]
-        self.w1_b = values["hada_w1_b"]
-        self.w2_a = values["hada_w2_a"]
-        self.w2_b = values["hada_w2_b"]
-
-        if "hada_t1" in values:
-            self.t1: Optional[torch.Tensor] = values["hada_t1"]
-        else:
-            self.t1 = None
-
-        if "hada_t2" in values:
-            self.t2: Optional[torch.Tensor] = values["hada_t2"]
-        else:
-            self.t2 = None
-
-        self.rank = self.w1_b.shape[0]
-
-    def get_weight(self, orig_weight: Optional[torch.Tensor]) -> torch.Tensor:
-        if self.t1 is None:
-            weight: torch.Tensor = (self.w1_a @ self.w1_b) * (self.w2_a @ self.w2_b)
-
-        else:
-            rebuild1 = torch.einsum("i j k l, j r, i p -> p r k l", self.t1, self.w1_b, self.w1_a)
-            rebuild2 = torch.einsum("i j k l, j r, i p -> p r k l", self.t2, self.w2_b, self.w2_a)
-            weight = rebuild1 * rebuild2
-
-        return weight
-
-    def calc_size(self) -> int:
-        model_size = super().calc_size()
-        for val in [self.w1_a, self.w1_b, self.w2_a, self.w2_b, self.t1, self.t2]:
-            if val is not None:
-                model_size += val.nelement() * val.element_size()
-        return model_size
-
-    def to(
-        self,
-        device: Optional[torch.device] = None,
-        dtype: Optional[torch.dtype] = None,
-    ) -> None:
-        super().to(device=device, dtype=dtype)
-
-        self.w1_a = self.w1_a.to(device=device, dtype=dtype)
-        self.w1_b = self.w1_b.to(device=device, dtype=dtype)
-        if self.t1 is not None:
-            self.t1 = self.t1.to(device=device, dtype=dtype)
-
-        self.w2_a = self.w2_a.to(device=device, dtype=dtype)
-        self.w2_b = self.w2_b.to(device=device, dtype=dtype)
-        if self.t2 is not None:
-            self.t2 = self.t2.to(device=device, dtype=dtype)
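Note on `LoHALayer.get_weight`: in the branch without `t1`, the weight delta is the element-wise (Hadamard) product of two low-rank products. The sketch below reproduces that with plain Python lists. The helper names and the rank-1 inputs are illustrative assumptions only.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hadamard(a, b):
    """Element-wise product of two equally shaped matrices."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def loha_weight(w1_a, w1_b, w2_a, w2_b):
    # (w1_a @ w1_b) * (w2_a @ w2_b), mirroring the branch where t1 is None.
    return hadamard(matmul(w1_a, w1_b), matmul(w2_a, w2_b))


# Rank-1 factors for a 2x2 weight delta.
w1_a = [[1.0], [2.0]]
w1_b = [[1.0, 3.0]]
w2_a = [[2.0], [1.0]]
w2_b = [[1.0, 1.0]]

delta = loha_weight(w1_a, w1_b, w2_a, w2_b)
```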
--- a/invokeai/backend/lora/lokr_layer.py
+++ b/invokeai/backend/lora/lokr_layer.py
@@ -1,110 +0,0 @@
-from typing import Dict, Optional
-
-import torch
-
-from invokeai.backend.lora.lora_layer_base import LoRALayerBase
-
-
-class LoKRLayer(LoRALayerBase):
-    # w1: Optional[torch.Tensor] = None
-    # w1_a: Optional[torch.Tensor] = None
-    # w1_b: Optional[torch.Tensor] = None
-    # w2: Optional[torch.Tensor] = None
-    # w2_a: Optional[torch.Tensor] = None
-    # w2_b: Optional[torch.Tensor] = None
-    # t2: Optional[torch.Tensor] = None
-
-    def __init__(
-        self,
-        layer_key: str,
-        values: Dict[str, torch.Tensor],
-    ):
-        super().__init__(layer_key, values)
-
-        if "lokr_w1" in values:
-            self.w1: Optional[torch.Tensor] = values["lokr_w1"]
-            self.w1_a = None
-            self.w1_b = None
-        else:
-            self.w1 = None
-            self.w1_a = values["lokr_w1_a"]
-            self.w1_b = values["lokr_w1_b"]
-
-        if "lokr_w2" in values:
-            self.w2: Optional[torch.Tensor] = values["lokr_w2"]
-            self.w2_a = None
-            self.w2_b = None
-        else:
-            self.w2 = None
-            self.w2_a = values["lokr_w2_a"]
-            self.w2_b = values["lokr_w2_b"]
-
-        if "lokr_t2" in values:
-            self.t2: Optional[torch.Tensor] = values["lokr_t2"]
-        else:
-            self.t2 = None
-
-        if "lokr_w1_b" in values:
-            self.rank = values["lokr_w1_b"].shape[0]
-        elif "lokr_w2_b" in values:
-            self.rank = values["lokr_w2_b"].shape[0]
-        else:
-            self.rank = None  # unscaled
-
-    def get_weight(self, orig_weight: Optional[torch.Tensor]) -> torch.Tensor:
-        w1: Optional[torch.Tensor] = self.w1
-        if w1 is None:
-            assert self.w1_a is not None
-            assert self.w1_b is not None
-            w1 = self.w1_a @ self.w1_b
-
-        w2 = self.w2
-        if w2 is None:
-            if self.t2 is None:
-                assert self.w2_a is not None
-                assert self.w2_b is not None
-                w2 = self.w2_a @ self.w2_b
-            else:
-                w2 = torch.einsum("i j k l, i p, j r -> p r k l", self.t2, self.w2_a, self.w2_b)
-
-        if len(w2.shape) == 4:
-            w1 = w1.unsqueeze(2).unsqueeze(2)
-        w2 = w2.contiguous()
-        assert w1 is not None
-        assert w2 is not None
-        weight = torch.kron(w1, w2)
-
-        return weight
-
-    def calc_size(self) -> int:
-        model_size = super().calc_size()
-        for val in [self.w1, self.w1_a, self.w1_b, self.w2, self.w2_a, self.w2_b, self.t2]:
-            if val is not None:
-                model_size += val.nelement() * val.element_size()
-        return model_size
-
-    def to(
-        self,
-        device: Optional[torch.device] = None,
-        dtype: Optional[torch.dtype] = None,
-    ) -> None:
-        super().to(device=device, dtype=dtype)
-
-        if self.w1 is not None:
-            self.w1 = self.w1.to(device=device, dtype=dtype)
-        else:
-            assert self.w1_a is not None
-            assert self.w1_b is not None
-            self.w1_a = self.w1_a.to(device=device, dtype=dtype)
-            self.w1_b = self.w1_b.to(device=device, dtype=dtype)
-
-        if self.w2 is not None:
-            self.w2 = self.w2.to(device=device, dtype=dtype)
-        else:
-            assert self.w2_a is not None
-            assert self.w2_b is not None
-            self.w2_a = self.w2_a.to(device=device, dtype=dtype)
-            self.w2_b = self.w2_b.to(device=device, dtype=dtype)
-
-        if self.t2 is not None:
-            self.t2 = self.t2.to(device=device, dtype=dtype)
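Review note: the removed `LoKRLayer.get_weight()` assembles its weight with `torch.kron(w1, w2)`. As a reminder of what that product does to shapes, here is a minimal pure-Python sketch of a 2-D Kronecker product. The helper name `kron` and the toy matrices are illustrative only and are not part of this change.

```python
def kron(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Kronecker product of two 2-D matrices given as nested lists."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        # One output row per (row of a, row of b) pair.
        [a_val * b[i][j] for a_val in a_row for j in range(cols_b)]
        for a_row in a
        for i in range(rows_b)
    ]


# Toy factors: a 2x2 "w1" and a 2x2 "w2" expand to a 4x4 weight.
w1 = [[1, 2], [3, 4]]
w2 = [[0, 1], [1, 0]]
weight = kron(w1, w2)
```

An (m x n) factor and a (p x q) factor yield an (m*p) x (n*q) result, which is how LoKR stores a large weight update in two small matrices.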
--- a/invokeai/backend/lora/lora_layer.py
+++ b/invokeai/backend/lora/lora_layer.py
@@ -1,81 +0,0 @@
-from typing import Optional
-
-import torch
-
-from invokeai.backend.lora.lora_layer_base import LoRALayerBase
-
-
-class LoRALayer(LoRALayerBase):
-    def __init__(
-        self,
-        layer_key: str,
-        values: dict[str, torch.Tensor],
-    ):
-        super().__init__(layer_key, values)
-
-        self.up = values["lora_up.weight"]
-        self.down = values["lora_down.weight"]
-
-        self.mid: Optional[torch.Tensor] = values.get("lora_mid.weight", None)
-        self.dora_scale: Optional[torch.Tensor] = values.get("dora_scale", None)
-        self.rank = self.down.shape[0]
-
-    def _apply_dora(self, orig_weight: torch.Tensor, lora_weight: torch.Tensor) -> torch.Tensor:
-        """Apply DoRA to the weight matrix.
-
-        This function is based roughly on the reference implementation in PEFT, but handles scaling in a slightly
-        different way:
-        https://github.com/huggingface/peft/blob/26726bf1ddee6ca75ed4e1bfd292094526707a78/src/peft/tuners/lora/layer.py#L421-L433
-
-        """
-        # Merge the original weight with the LoRA weight.
-        merged_weight = orig_weight + lora_weight
-
-        # Calculate the vector-wise L2 norm of the weight matrix across each column vector.
-        weight_norm: torch.Tensor = torch.linalg.norm(merged_weight, dim=1)
-
-        dora_factor = self.dora_scale / weight_norm
-        new_weight = dora_factor * merged_weight
-
-        # TODO(ryand): This is wasteful. We already have the final weight, but we calculate the diff, because that is
-        # what the `get_weight()` API is expected to return. If we do refactor this, we'll have to give some thought to
-        # how lora weight scaling should be applied - having the full weight diff makes this easy.
-        weight_diff = new_weight - orig_weight
-        return weight_diff
-
-    def get_weight(self, orig_weight: Optional[torch.Tensor]) -> torch.Tensor:
-        if self.mid is not None:
-            up = self.up.reshape(self.up.shape[0], self.up.shape[1])
-            down = self.down.reshape(self.down.shape[0], self.down.shape[1])
-            weight = torch.einsum("m n w h, i m, n j -> i j w h", self.mid, up, down)
-        else:
-            weight = self.up.reshape(self.up.shape[0], -1) @ self.down.reshape(self.down.shape[0], -1)
-
-        if self.dora_scale is not None:
-            assert orig_weight is not None
-            weight = self._apply_dora(orig_weight, weight)
-
-        return weight
-
-    def calc_size(self) -> int:
-        model_size = super().calc_size()
-        for val in [self.up, self.mid, self.down]:
-            if val is not None:
-                model_size += val.nelement() * val.element_size()
-        return model_size
-
-    def to(
-        self,
-        device: Optional[torch.device] = None,
-        dtype: Optional[torch.dtype] = None,
-    ) -> None:
-        super().to(device=device, dtype=dtype)
-
-        self.up = self.up.to(device=device, dtype=dtype)
-        self.down = self.down.to(device=device, dtype=dtype)
-
-        if self.mid is not None:
-            self.mid = self.mid.to(device=device, dtype=dtype)
-
-        if self.dora_scale is not None:
-            self.dora_scale = self.dora_scale.to(device=device, dtype=dtype)
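Review note: for the plain (non-DoRA, no `lora_mid`) path, the removed `LoRALayer.get_weight()` returns `up @ down`, and the patcher later multiplies that by the user weight and by `alpha / rank`. The sketch below reproduces that arithmetic in pure Python. `matmul`, `lora_delta`, and `strength` are hypothetical names chosen for illustration; the real code operates on `torch.Tensor` objects.

```python
from typing import Optional


def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Naive matrix product for nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(
    up: list[list[float]],
    down: list[list[float]],
    alpha: Optional[float],
    strength: float = 1.0,
) -> list[list[float]]:
    """Scaled low-rank weight update: (up @ down) * strength * (alpha / rank)."""
    rank = len(down)  # mirrors `self.rank = self.down.shape[0]`
    scale = alpha / rank if (alpha and rank) else 1.0
    return [[v * strength * scale for v in row] for row in matmul(up, down)]


up = [[1], [2], [3]]   # shape (out=3, rank=1)
down = [[4, 5]]        # shape (rank=1, in=2)
delta = lora_delta(up, down, alpha=0.5)
unscaled = lora_delta(up, down, alpha=None)
```

When `alpha` is missing the scale falls back to 1.0, matching the `else 1.0` branch in the patcher.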
--- a/invokeai/backend/lora/lora_layer_base.py
+++ b/invokeai/backend/lora/lora_layer_base.py
@@ -1,55 +0,0 @@
-from typing import Dict, Optional
-
-import torch
-
-
-class LoRALayerBase:
-    # rank: Optional[int]
-    # alpha: Optional[float]
-    # bias: Optional[torch.Tensor]
-    # layer_key: str
-
-    # @property
-    # def scale(self):
-    #    return self.alpha / self.rank if (self.alpha and self.rank) else 1.0
-
-    def __init__(
-        self,
-        layer_key: str,
-        values: Dict[str, torch.Tensor],
-    ):
-        if "alpha" in values:
-            self.alpha = values["alpha"].item()
-        else:
-            self.alpha = None
-
-        if "bias_indices" in values and "bias_values" in values and "bias_size" in values:
-            self.bias: Optional[torch.Tensor] = torch.sparse_coo_tensor(
-                values["bias_indices"],
-                values["bias_values"],
-                tuple(values["bias_size"]),
-            )
-
-        else:
-            self.bias = None
-
-        self.rank = None  # set in layer implementation
-        self.layer_key = layer_key
-
-    def get_weight(self, orig_weight: Optional[torch.Tensor]) -> torch.Tensor:
-        raise NotImplementedError()
-
-    def calc_size(self) -> int:
-        model_size = 0
-        for val in [self.bias]:
-            if val is not None:
-                model_size += val.nelement() * val.element_size()
-        return model_size
-
-    def to(
-        self,
-        device: Optional[torch.device] = None,
-        dtype: Optional[torch.dtype] = None,
-    ) -> None:
-        if self.bias is not None:
-            self.bias = self.bias.to(device=device, dtype=dtype)
--- a/invokeai/backend/lora/lora_model.py
+++ b/invokeai/backend/lora/lora_model.py
@@ -1,111 +0,0 @@
-from pathlib import Path
-from typing import Optional, Union
-
-import torch
-
-from invokeai.backend.lora.full_layer import FullLayer
-from invokeai.backend.lora.ia3_layer import IA3Layer
-from invokeai.backend.lora.loha_layer import LoHALayer
-from invokeai.backend.lora.lokr_layer import LoKRLayer
-from invokeai.backend.lora.lora_layer import LoRALayer
-from invokeai.backend.lora.sdxl_state_dict_utils import convert_sdxl_keys_to_diffusers_format
-from invokeai.backend.model_manager import BaseModelType
-from invokeai.backend.util.serialization import load_state_dict
-
-AnyLoRALayer = Union[LoRALayer, LoHALayer, LoKRLayer, FullLayer, IA3Layer]
-
-
-class LoRAModelRaw(torch.nn.Module):
-    def __init__(
-        self,
-        name: str,
-        layers: dict[str, AnyLoRALayer],
-    ):
-        super().__init__()
-        self._name = name
-        self.layers = layers
-
-    @property
-    def name(self) -> str:
-        return self._name
-
-    def to(
-        self,
-        device: Optional[torch.device] = None,
-        dtype: Optional[torch.dtype] = None,
-    ) -> None:
-        # TODO: try revert if exception?
-        for _key, layer in self.layers.items():
-            layer.to(device=device, dtype=dtype)
-
-    def calc_size(self) -> int:
-        model_size = 0
-        for _, layer in self.layers.items():
-            model_size += layer.calc_size()
-        return model_size
-
-    @classmethod
-    def from_checkpoint(
-        cls,
-        file_path: Union[str, Path],
-        device: Optional[torch.device] = None,
-        dtype: Optional[torch.dtype] = None,
-        base_model: Optional[BaseModelType] = None,
-    ):
-        device = device or torch.device("cpu")
-        dtype = dtype or torch.float32
-
-        file_path = Path(file_path)
-
-        model_name = file_path.stem
-
-        sd = load_state_dict(file_path, device=str(device))
-        state_dict = cls._group_state(sd)
-
-        if base_model == BaseModelType.StableDiffusionXL:
-            state_dict = convert_sdxl_keys_to_diffusers_format(state_dict)
-
-        layers: dict[str, AnyLoRALayer] = {}
-        for layer_key, values in state_dict.items():
-            # lora and locon
-            if "lora_down.weight" in values:
-                layer: AnyLoRALayer = LoRALayer(layer_key, values)
-
-            # loha
-            elif "hada_w1_b" in values:
-                layer = LoHALayer(layer_key, values)
-
-            # lokr
-            elif "lokr_w1_b" in values or "lokr_w1" in values:
-                layer = LoKRLayer(layer_key, values)
-
-            # diff
-            elif "diff" in values:
-                layer = FullLayer(layer_key, values)
-
-            # ia3
-            elif "weight" in values and "on_input" in values:
-                layer = IA3Layer(layer_key, values)
-
-            else:
-                raise ValueError(f"Unknown lora layer module in {model_name}: {layer_key}: {list(values.keys())}")
-
-            # lower memory consumption by removing already parsed layer values
-            state_dict[layer_key].clear()
-
-            layer.to(device=device, dtype=dtype)
-            layers[layer_key] = layer
-
-        return cls(name=model_name, layers=layers)
-
-    @staticmethod
-    def _group_state(state_dict: dict[str, torch.Tensor]) -> dict[str, dict[str, torch.Tensor]]:
-        state_dict_groupped: dict[str, dict[str, torch.Tensor]] = {}
-
-        for key, value in state_dict.items():
-            stem, leaf = key.split(".", 1)
-            if stem not in state_dict_groupped:
-                state_dict_groupped[stem] = {}
-            state_dict_groupped[stem][leaf] = value
-
-        return state_dict_groupped
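Review note: `from_checkpoint()` relies on `_group_state()` to turn a flat state dict into one dict per layer, and then picks the layer class from the keys present. A minimal sketch of that flow with plain Python values standing in for tensors follows. `group_state` and `classify` are illustrative names, and `classify` returns a label rather than constructing a layer.

```python
def group_state(state_dict: dict[str, object]) -> dict[str, dict[str, object]]:
    """Group 'stem.leaf' keys by stem, splitting only on the first dot."""
    grouped: dict[str, dict[str, object]] = {}
    for key, value in state_dict.items():
        stem, leaf = key.split(".", 1)  # a key without '.' raises ValueError
        grouped.setdefault(stem, {})[leaf] = value
    return grouped


def classify(values: dict[str, object]) -> str:
    """Same key checks, in the same order, as the removed from_checkpoint()."""
    if "lora_down.weight" in values:
        return "lora"
    if "hada_w1_b" in values:
        return "loha"
    if "lokr_w1_b" in values or "lokr_w1" in values:
        return "lokr"
    if "diff" in values:
        return "full"
    if "weight" in values and "on_input" in values:
        return "ia3"
    raise ValueError(f"Unknown lora layer module: {list(values.keys())}")


flat = {
    "lora_unet_a.lora_up.weight": 1,
    "lora_unet_a.lora_down.weight": 2,
    "lora_unet_a.alpha": 3,
    "lora_te_b.hada_w1_b": 4,
}
grouped = group_state(flat)
kinds = {stem: classify(values) for stem, values in grouped.items()}
```

Splitting with `maxsplit=1` keeps dotted leaves such as `lora_up.weight` intact.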
--- a/invokeai/backend/lora/lora_model_patcher.py
+++ b/invokeai/backend/lora/lora_model_patcher.py
@@ -1,137 +0,0 @@
-from contextlib import contextmanager
-from typing import Iterator, Tuple
-
-import torch
-from diffusers.models.unets.unet_2d_condition import UNet2DConditionModel
-from transformers import CLIPTextModel
-
-from invokeai.backend.lora.lora_model import LoRAModelRaw
-from invokeai.backend.model_manager.any_model_type import AnyModel
-
-
-class LoraModelPatcher:
-    @staticmethod
-    def _resolve_lora_key(model: torch.nn.Module, lora_key: str, prefix: str) -> Tuple[str, torch.nn.Module]:
-        assert "." not in lora_key
-
-        if not lora_key.startswith(prefix):
-            raise Exception(f"lora_key with invalid prefix: {lora_key}, {prefix}")
-
-        module = model
-        module_key = ""
-        key_parts = lora_key[len(prefix) :].split("_")
-
-        submodule_name = key_parts.pop(0)
-
-        while len(key_parts) > 0:
-            try:
-                module = module.get_submodule(submodule_name)
-                module_key += "." + submodule_name
-                submodule_name = key_parts.pop(0)
-            except Exception:
-                submodule_name += "_" + key_parts.pop(0)
-
-        module = module.get_submodule(submodule_name)
-        module_key = (module_key + "." + submodule_name).lstrip(".")
-
-        return (module_key, module)
-
-    @classmethod
-    @contextmanager
-    def apply_lora_unet(
-        cls,
-        unet: UNet2DConditionModel,
-        loras: Iterator[Tuple[LoRAModelRaw, float]],
-    ):
-        with cls.apply_lora(unet, loras, "lora_unet_"):
-            yield
-
-    @classmethod
-    @contextmanager
-    def apply_lora_text_encoder(
-        cls,
-        text_encoder: CLIPTextModel,
-        loras: Iterator[Tuple[LoRAModelRaw, float]],
-    ):
-        with cls.apply_lora(text_encoder, loras, "lora_te_"):
-            yield
-
-    @classmethod
-    @contextmanager
-    def apply_sdxl_lora_text_encoder(
-        cls,
-        text_encoder: CLIPTextModel,
-        loras: Iterator[Tuple[LoRAModelRaw, float]],
-    ):
-        with cls.apply_lora(text_encoder, loras, "lora_te1_"):
-            yield
-
-    @classmethod
-    @contextmanager
-    def apply_sdxl_lora_text_encoder2(
-        cls,
-        text_encoder: CLIPTextModel,
-        loras: Iterator[Tuple[LoRAModelRaw, float]],
-    ):
-        with cls.apply_lora(text_encoder, loras, "lora_te2_"):
-            yield
-
-    @classmethod
-    @contextmanager
-    def apply_lora(
-        cls,
-        model: AnyModel,
-        loras: Iterator[Tuple[LoRAModelRaw, float]],
-        prefix: str,
-    ):
-        original_weights = {}
-        try:
-            with torch.no_grad():
-                for lora, lora_weight in loras:
-                    # assert lora.device.type == "cpu"
-                    for layer_key, layer in lora.layers.items():
-                        if not layer_key.startswith(prefix):
-                            continue
-
-                        # TODO(ryand): A non-negligible amount of time is currently spent resolving LoRA keys. This
-                        # should be improved in the following ways:
-                        # 1. The key mapping could be more-efficiently pre-computed. This would save time every time a
-                        #    LoRA model is applied.
-                        # 2. From an API perspective, there's no reason that the `LoraModelPatcher` should be aware of
-                        #    the intricacies of Stable Diffusion key resolution. It should just expect the input LoRA
-                        #    weights to have valid keys.
-                        assert isinstance(model, torch.nn.Module)
-                        module_key, module = cls._resolve_lora_key(model, layer_key, prefix)
-
-                        # All of the LoRA weight calculations will be done on the same device as the module weight.
-                        # (Performance will be best if this is a CUDA device.)
-                        device = module.weight.device
-                        dtype = module.weight.dtype
-
-                        if module_key not in original_weights:
-                            original_weights[module_key] = module.weight.detach().to(device="cpu", copy=True)
-
-                        layer_scale = layer.alpha / layer.rank if (layer.alpha and layer.rank) else 1.0
-
-                        # We intentionally move to the target device first, then cast. Experimentally, this was found to
-                        # be significantly faster for 16-bit CPU tensors being moved to a CUDA device than doing the
-                        # same thing in a single call to '.to(...)'.
-                        layer.to(device=device)
-                        layer.to(dtype=torch.float32)
-                        # TODO(ryand): Using torch.autocast(...) over explicit casting may offer a speed benefit on CUDA
-                        # devices here. Experimentally, it was found to be very slow on CPU. More investigation needed.
-                        layer_weight = layer.get_weight(module.weight) * (lora_weight * layer_scale)
-                        layer.to(device=torch.device("cpu"))
-
-                        if module.weight.shape != layer_weight.shape:
-                            layer_weight = layer_weight.reshape(module.weight.shape)
-
-                        module.weight += layer_weight.to(dtype=dtype)
-
-            yield  # wait for context manager exit
-
-        finally:
-            assert hasattr(model, "get_submodule")  # mypy not picking up fact that torch.nn.Module has get_submodule()
-            with torch.no_grad():
-                for module_key, weight in original_weights.items():
-                    model.get_submodule(module_key).weight.copy_(weight)
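Review note: `_resolve_lora_key()` has to recover dotted module paths from keys where every separator has been flattened to `_`, even though module names themselves may contain underscores. It does this greedily: try the shortest candidate name, and on failure glue on the next fragment. The sketch below replays that loop against a stand-in `Node` class (hypothetical, used here instead of `torch.nn.Module`) and catches `AttributeError` where the original catches any `Exception`.

```python
from typing import Optional


class Node:
    """Stand-in for torch.nn.Module with only get_submodule()."""

    def __init__(self, children: Optional[dict] = None):
        self.children = children or {}

    def get_submodule(self, name: str) -> "Node":
        if name not in self.children:
            raise AttributeError(name)
        return self.children[name]


def resolve_lora_key(model: Node, lora_key: str, prefix: str) -> tuple[str, Node]:
    if not lora_key.startswith(prefix):
        raise ValueError(f"lora_key with invalid prefix: {lora_key}, {prefix}")
    module, module_key = model, ""
    key_parts = lora_key[len(prefix):].split("_")
    submodule_name = key_parts.pop(0)
    while key_parts:
        try:
            # Candidate name exists: descend and start a new candidate.
            module = module.get_submodule(submodule_name)
            module_key += "." + submodule_name
            submodule_name = key_parts.pop(0)
        except AttributeError:
            # Candidate does not exist: extend it with the next fragment.
            submodule_name += "_" + key_parts.pop(0)
    module = module.get_submodule(submodule_name)
    return (module_key + "." + submodule_name).lstrip("."), module


leaf = Node()
root = Node({"down_blocks": Node({"0": Node({"to_q": leaf})})})
path, found = resolve_lora_key(root, "lora_unet_down_blocks_0_to_q", "lora_unet_")
```

Each failed lookup costs an exception, which is consistent with the TODO above noting that key resolution takes a non-negligible amount of time.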
--- a/invokeai/backend/lora/sdxl_state_dict_utils.py
+++ b/invokeai/backend/lora/sdxl_state_dict_utils.py
@@ -1,157 +0,0 @@
-import bisect
-from typing import TypeVar
-
-
-def make_sdxl_unet_conversion_map() -> list[tuple[str, str]]:
-    """Create a dict mapping state_dict keys from Stability AI SDXL format to diffusers SDXL format.
-
-    Ported from:
-    https://github.com/bmaltais/kohya_ss/blob/2accb1305979ba62f5077a23aabac23b4c37e935/networks/lora_diffusers.py#L15C1-L97C32
-    """
-    unet_conversion_map_layer: list[tuple[str, str]] = []
-
-    for i in range(3):  # num_blocks is 3 in sdxl
-        # loop over downblocks/upblocks
-        for j in range(2):
-            # loop over resnets/attentions for downblocks
-            hf_down_res_prefix = f"down_blocks.{i}.resnets.{j}."
-            sd_down_res_prefix = f"input_blocks.{3*i + j + 1}.0."
-            unet_conversion_map_layer.append((sd_down_res_prefix, hf_down_res_prefix))
-
-            if i < 3:
-                # no attention layers in down_blocks.3
-                hf_down_atn_prefix = f"down_blocks.{i}.attentions.{j}."
-                sd_down_atn_prefix = f"input_blocks.{3*i + j + 1}.1."
-                unet_conversion_map_layer.append((sd_down_atn_prefix, hf_down_atn_prefix))
-
-        for j in range(3):
-            # loop over resnets/attentions for upblocks
-            hf_up_res_prefix = f"up_blocks.{i}.resnets.{j}."
-            sd_up_res_prefix = f"output_blocks.{3*i + j}.0."
-            unet_conversion_map_layer.append((sd_up_res_prefix, hf_up_res_prefix))
-
-            # if i > 0: commentout for sdxl
-            # no attention layers in up_blocks.0
-            hf_up_atn_prefix = f"up_blocks.{i}.attentions.{j}."
-            sd_up_atn_prefix = f"output_blocks.{3*i + j}.1."
-            unet_conversion_map_layer.append((sd_up_atn_prefix, hf_up_atn_prefix))
-
-        if i < 3:
-            # no downsample in down_blocks.3
-            hf_downsample_prefix = f"down_blocks.{i}.downsamplers.0.conv."
-            sd_downsample_prefix = f"input_blocks.{3*(i+1)}.0.op."
-            unet_conversion_map_layer.append((sd_downsample_prefix, hf_downsample_prefix))
-
-            # no upsample in up_blocks.3
-            hf_upsample_prefix = f"up_blocks.{i}.upsamplers.0."
-            sd_upsample_prefix = f"output_blocks.{3*i + 2}.{2}."  # change for sdxl
-            unet_conversion_map_layer.append((sd_upsample_prefix, hf_upsample_prefix))
-
-    hf_mid_atn_prefix = "mid_block.attentions.0."
-    sd_mid_atn_prefix = "middle_block.1."
-    unet_conversion_map_layer.append((sd_mid_atn_prefix, hf_mid_atn_prefix))
-
-    for j in range(2):
-        hf_mid_res_prefix = f"mid_block.resnets.{j}."
-        sd_mid_res_prefix = f"middle_block.{2*j}."
-        unet_conversion_map_layer.append((sd_mid_res_prefix, hf_mid_res_prefix))
-
-    unet_conversion_map_resnet = [
-        # (stable-diffusion, HF Diffusers)
-        ("in_layers.0.", "norm1."),
-        ("in_layers.2.", "conv1."),
-        ("out_layers.0.", "norm2."),
-        ("out_layers.3.", "conv2."),
-        ("emb_layers.1.", "time_emb_proj."),
-        ("skip_connection.", "conv_shortcut."),
-    ]
-
-    unet_conversion_map: list[tuple[str, str]] = []
-    for sd, hf in unet_conversion_map_layer:
-        if "resnets" in hf:
-            for sd_res, hf_res in unet_conversion_map_resnet:
-                unet_conversion_map.append((sd + sd_res, hf + hf_res))
-        else:
-            unet_conversion_map.append((sd, hf))
-
-    for j in range(2):
-        hf_time_embed_prefix = f"time_embedding.linear_{j+1}."
-        sd_time_embed_prefix = f"time_embed.{j*2}."
-        unet_conversion_map.append((sd_time_embed_prefix, hf_time_embed_prefix))
-
-    for j in range(2):
-        hf_label_embed_prefix = f"add_embedding.linear_{j+1}."
-        sd_label_embed_prefix = f"label_emb.0.{j*2}."
-        unet_conversion_map.append((sd_label_embed_prefix, hf_label_embed_prefix))
-
-    unet_conversion_map.append(("input_blocks.0.0.", "conv_in."))
-    unet_conversion_map.append(("out.0.", "conv_norm_out."))
-    unet_conversion_map.append(("out.2.", "conv_out."))
-
-    return unet_conversion_map
-
-
-SDXL_UNET_STABILITY_TO_DIFFUSERS_MAP = {
-    sd.rstrip(".").replace(".", "_"): hf.rstrip(".").replace(".", "_") for sd, hf in make_sdxl_unet_conversion_map()
-}
-
-
-T = TypeVar("T")
-
-
-def convert_sdxl_keys_to_diffusers_format(state_dict: dict[str, T]) -> dict[str, T]:
-    """Convert the keys of an SDXL LoRA state_dict to diffusers format.
-
-    The input state_dict can be in either Stability AI format or diffusers format. If the state_dict is already in
-    diffusers format, then this function will have no effect.
-
-    This function is adapted from:
-    https://github.com/bmaltais/kohya_ss/blob/2accb1305979ba62f5077a23aabac23b4c37e935/networks/lora_diffusers.py#L385-L409
-
-    Args:
-        state_dict (dict[str, Tensor]): The SDXL LoRA state_dict.
-
-    Raises:
-        ValueError: If state_dict contains an unrecognized key, or not all keys could be converted.
-
-    Returns:
-        dict[str, Tensor]: The diffusers-format state_dict.
-    """
-    converted_count = 0  # The number of Stability AI keys converted to diffusers format.
-    not_converted_count = 0  # The number of keys that were not converted.
-
-    # Get a sorted list of Stability AI UNet keys so that we can efficiently search for keys with matching prefixes.
-    # For example, we want to efficiently find `input_blocks_4_1` in the list when searching for
-    # `input_blocks_4_1_proj_in`.
-    stability_unet_keys = list(SDXL_UNET_STABILITY_TO_DIFFUSERS_MAP)
-    stability_unet_keys.sort()
-
-    new_state_dict: dict[str, T] = {}
-    for full_key, value in state_dict.items():
-        if full_key.startswith("lora_unet_"):
-            search_key = full_key.replace("lora_unet_", "")
-            # Use bisect to find the key in stability_unet_keys that *may* match the search_key's prefix.
-            position = bisect.bisect_right(stability_unet_keys, search_key)
-            map_key = stability_unet_keys[position - 1]
-            # Now, check if the map_key *actually* matches the search_key.
-            if search_key.startswith(map_key):
-                new_key = full_key.replace(map_key, SDXL_UNET_STABILITY_TO_DIFFUSERS_MAP[map_key])
-                new_state_dict[new_key] = value
-                converted_count += 1
-            else:
-                new_state_dict[full_key] = value
-                not_converted_count += 1
-        elif full_key.startswith("lora_te1_") or full_key.startswith("lora_te2_"):
-            # The CLIP text encoders have the same keys in both Stability AI and diffusers formats.
-            new_state_dict[full_key] = value
-            continue
-        else:
-            raise ValueError(f"Unrecognized SDXL LoRA key prefix: '{full_key}'.")
-
-    if converted_count > 0 and not_converted_count > 0:
-        raise ValueError(
-            f"The SDXL LoRA could only be partially converted to diffusers format. converted={converted_count},"
-            f" not_converted={not_converted_count}"
-        )
-
-    return new_state_dict
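Review note: the prefix lookup in `convert_sdxl_keys_to_diffusers_format()` uses `bisect.bisect_right` on a sorted key list, then confirms the candidate with `startswith`. A minimal sketch of that lookup follows. `find_prefix` is a hypothetical helper, and unlike the removed code it explicitly returns `None` when the search key sorts before every map key.

```python
import bisect
from typing import Optional


def find_prefix(sorted_keys: list[str], search_key: str) -> Optional[str]:
    """Return the key just before search_key's slot if it is a prefix of it."""
    position = bisect.bisect_right(sorted_keys, search_key)
    if position == 0:
        return None  # nothing sorts at or before search_key
    candidate = sorted_keys[position - 1]
    return candidate if search_key.startswith(candidate) else None


keys = sorted(["input_blocks_4_0", "input_blocks_4_1", "middle_block_1"])
hit = find_prefix(keys, "input_blocks_4_1_proj_in")
miss_before = find_prefix(keys, "down_blocks_0_attentions_0")
miss_between = find_prefix(keys, "input_blocks_4_2_x")
```

Only the single nearest candidate is checked, so the approach assumes no other map key sorts between a prefix and the keys it should match.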
--- a/invokeai/backend/model_manager/__init__.py
+++ b/invokeai/backend/model_manager/__init__.py
@@ -1,6 +1,7 @@
 """Re-export frequently-used symbols from the Model Manager backend."""

 from .config import (
+    AnyModel,
    AnyModelConfig,
    BaseModelType,
    InvalidModelConfigException,
@@ -17,6 +18,7 @@ from .probe import ModelProbe
 from .search import ModelSearch

 __all__ = [
+    "AnyModel",
    "AnyModelConfig",
    "BaseModelType",
    "ModelRepoVariant",
--- a/invokeai/backend/model_manager/any_model_type.py
+++ b/invokeai/backend/model_manager/any_model_type.py
@@ -1,12 +0,0 @@
-from typing import Union
-
-import torch
-from diffusers.models.modeling_utils import ModelMixin
-
-from invokeai.backend.ip_adapter.ip_adapter import IPAdapter
-from invokeai.backend.lora.lora_model import LoRAModelRaw
-from invokeai.backend.onnx.onnx_runtime import IAIOnnxRuntimeModel
-from invokeai.backend.textual_inversion import TextualInversionModelRaw
-
-# ModelMixin is the base class for all diffusers and transformers models
-AnyModel = Union[ModelMixin, torch.nn.Module, IPAdapter, LoRAModelRaw, TextualInversionModelRaw, IAIOnnxRuntimeModel]
--- a/invokeai/backend/model_manager/config.py
+++ b/invokeai/backend/model_manager/config.py
@@ -24,12 +24,20 @@ import time
 from enum import Enum
 from typing import Literal, Optional, Type, TypeAlias, Union

+import torch
+from diffusers.models.modeling_utils import ModelMixin
 from pydantic import BaseModel, ConfigDict, Discriminator, Field, Tag, TypeAdapter
 from typing_extensions import Annotated, Any, Dict

 from invokeai.app.invocations.constants import SCHEDULER_NAME_VALUES
 from invokeai.app.util.misc import uuid_string

+from ..raw_model import RawModel
+
+# ModelMixin is the base class for all diffusers and transformers models
+# RawModel is the InvokeAI wrapper class for ip_adapters, loras, textual_inversion and onnx runtime
+AnyModel = Union[ModelMixin, RawModel, torch.nn.Module]
+

 class InvalidModelConfigException(Exception):
    """Exception for when config parser doesn't recognized this combination of model type and format."""
--- a/invokeai/backend/model_manager/convert_ckpt_to_diffusers.py
+++ b/invokeai/backend/model_manager/convert_ckpt_to_diffusers.py
@@ -15,7 +15,7 @@ from diffusers.pipelines.stable_diffusion.convert_from_ckpt import (
 )
 from omegaconf import DictConfig

-from invokeai.backend.model_manager.any_model_type import AnyModel
+from . import AnyModel


 def convert_ldm_vae_to_diffusers(
--- a/invokeai/backend/model_manager/load/load_base.py
+++ b/invokeai/backend/model_manager/load/load_base.py
@@ -10,8 +10,8 @@ from pathlib import Path
 from typing import Any, Optional

 from invokeai.app.services.config import InvokeAIAppConfig
-from invokeai.backend.model_manager.any_model_type import AnyModel
 from invokeai.backend.model_manager.config import (
+    AnyModel,
    AnyModelConfig,
    SubModelType,
 )
--- a/invokeai/backend/model_manager/load/load_default.py
+++ b/invokeai/backend/model_manager/load/load_default.py
@@ -7,18 +7,18 @@ from typing import Optional

 from invokeai.app.services.config import InvokeAIAppConfig
 from invokeai.backend.model_manager import (
+    AnyModel,
    AnyModelConfig,
    InvalidModelConfigException,
    SubModelType,
 )
-from invokeai.backend.model_manager.any_model_type import AnyModel
 from invokeai.backend.model_manager.config import DiffusersConfigBase, ModelType
 from invokeai.backend.model_manager.load.convert_cache import ModelConvertCacheBase
 from invokeai.backend.model_manager.load.load_base import LoadedModel, ModelLoaderBase
 from invokeai.backend.model_manager.load.model_cache.model_cache_base import ModelCacheBase, ModelLockerBase
 from invokeai.backend.model_manager.load.model_util import calc_model_size_by_data, calc_model_size_by_fs
 from invokeai.backend.model_manager.load.optimizations import skip_torch_weight_init
-from invokeai.backend.util.devices import choose_torch_device, torch_dtype
+from invokeai.backend.util.devices import TorchDevice


 # TO DO: The loader is not thread safe!
@@ -37,7 +37,7 @@ class ModelLoader(ModelLoaderBase):
        self._logger = logger
        self._ram_cache = ram_cache
        self._convert_cache = convert_cache
-        self._torch_dtype = torch_dtype(choose_torch_device(), app_config)
+        self._torch_dtype = TorchDevice.choose_torch_dtype()

    def load_model(self, model_config: AnyModelConfig, submodel_type: Optional[SubModelType] = None) -> LoadedModel:
        """
--- a/invokeai/backend/model_manager/load/model_cache/model_cache_base.py
+++ b/invokeai/backend/model_manager/load/model_cache/model_cache_base.py
@@ -14,8 +14,7 @@ from typing import Dict, Generic, Optional, TypeVar

 import torch

-from invokeai.backend.model_manager.any_model_type import AnyModel
-from invokeai.backend.model_manager.config import SubModelType
+from invokeai.backend.model_manager.config import AnyModel, SubModelType


 class ModelLockerBase(ABC):
@@ -118,7 +117,7 @@ class ModelCacheBase(ABC, Generic[T]):

    @property
    @abstractmethod
-    def stats(self) -> CacheStats:
+    def stats(self) -> Optional[CacheStats]:
        """Return collected CacheStats object."""
        pass

--- a/invokeai/backend/model_manager/load/model_cache/model_cache_default.py
+++ b/invokeai/backend/model_manager/load/model_cache/model_cache_default.py
@@ -28,18 +28,14 @@ from typing import Dict, List, Optional

 import torch

-from invokeai.backend.model_manager import SubModelType
-from invokeai.backend.model_manager.any_model_type import AnyModel
+from invokeai.backend.model_manager import AnyModel, SubModelType
 from invokeai.backend.model_manager.load.memory_snapshot import MemorySnapshot, get_pretty_snapshot_diff
-from invokeai.backend.util.devices import choose_torch_device
+from invokeai.backend.util.devices import TorchDevice
 from invokeai.backend.util.logging import InvokeAILogger

 from .model_cache_base import CacheRecord, CacheStats, ModelCacheBase, ModelLockerBase
 from .model_locker import ModelLocker

-if choose_torch_device() == torch.device("mps"):
-    from torch import mps
-
 # Maximum size of the cache, in gigs
 # Default is roughly enough to hold three fp16 diffusers models in RAM simultaneously
 DEFAULT_MAX_CACHE_SIZE = 6.0
@@ -245,9 +241,7 @@ class ModelCache(ModelCacheBase[AnyModel]):
                    f"Removing {cache_entry.key} from VRAM to free {(cache_entry.size/GIG):.2f}GB; vram free = {(torch.cuda.memory_allocated()/GIG):.2f}GB"
                )

-        torch.cuda.empty_cache()
-        if choose_torch_device() == torch.device("mps"):
-            mps.empty_cache()
+        TorchDevice.empty_cache()

    def move_model_to_device(self, cache_entry: CacheRecord[AnyModel], target_device: torch.device) -> None:
        """Move model into the indicated device.
@@ -270,12 +264,14 @@ class ModelCache(ModelCacheBase[AnyModel]):
        if torch.device(source_device).type == torch.device(target_device).type:
            return

-        # may raise an exception here if insufficient GPU VRAM
-        self._check_free_vram(target_device, cache_entry.size)
-
        start_model_to_time = time.time()
        snapshot_before = self._capture_memory_snapshot()
-        cache_entry.model.to(target_device)
+        try:
+            cache_entry.model.to(target_device)
+        except Exception as e:  # blow away cache entry
+            self._delete_cache_entry(cache_entry)
+            raise e
+
        snapshot_after = self._capture_memory_snapshot()
        end_model_to_time = time.time()
        self.logger.debug(
@@ -330,11 +326,11 @@ class ModelCache(ModelCacheBase[AnyModel]):
                    f" {in_ram_models}/{in_vram_models}({locked_in_vram_models})"
                )

-    def make_room(self, model_size: int) -> None:
+    def make_room(self, size: int) -> None:
        """Make enough room in the cache to accommodate a new model of indicated size."""
        # calculate how much memory this model will require
        # multiplier = 2 if self.precision==torch.float32 else 1
-        bytes_needed = model_size
+        bytes_needed = size
        maximum_size = self.max_cache_size * GIG  # stored in GB, convert to bytes
        current_size = self.cache_size()

@@ -389,12 +385,11 @@ class ModelCache(ModelCacheBase[AnyModel]):
            # 1 from onnx runtime object
            if not cache_entry.locked and refs <= (3 if "onnx" in model_key else 2):
                self.logger.debug(
-                    f"Removing {model_key} from RAM cache to free at least {(model_size/GIG):.2f} GB (-{(cache_entry.size/GIG):.2f} GB)"
+                    f"Removing {model_key} from RAM cache to free at least {(size/GIG):.2f} GB (-{(cache_entry.size/GIG):.2f} GB)"
                )
                current_size -= cache_entry.size
                models_cleared += 1
-                del self._cache_stack[pos]
-                del self._cached_models[model_key]
+                self._delete_cache_entry(cache_entry)
                del cache_entry

            else:
@@ -416,22 +411,9 @@ class ModelCache(ModelCacheBase[AnyModel]):
                self.stats.cleared = models_cleared
            gc.collect()

-        torch.cuda.empty_cache()
-        if choose_torch_device() == torch.device("mps"):
-            mps.empty_cache()
-
+        TorchDevice.empty_cache()
        self.logger.debug(f"After making room: cached_models={len(self._cached_models)}")

-    def _check_free_vram(self, target_device: torch.device, needed_size: int) -> None:
-        if target_device.type != "cuda":
-            return
-        vram_device = (  # mem_get_info() needs an indexed device
-            target_device if target_device.index is not None else torch.device(str(target_device), index=0)
-        )
-        free_mem, _ = torch.cuda.mem_get_info(torch.device(vram_device))
-        if needed_size > free_mem:
-            needed_gb = round(needed_size / GIG, 2)
-            free_gb = round(free_mem / GIG, 2)
-            raise torch.cuda.OutOfMemoryError(
-                f"Insufficient VRAM to load model, requested {needed_gb}GB but only had {free_gb}GB free"
-            )
+    def _delete_cache_entry(self, cache_entry: CacheRecord[AnyModel]) -> None:
+        self._cache_stack.remove(cache_entry.key)
+        del self._cached_models[cache_entry.key]
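
Review note on the model cache hunks above: the up-front `_check_free_vram` test is dropped in favour of attempting the move and evicting the cache entry if the move raises. A minimal, torch-free sketch of that pattern follows. `FakeModel`, `TinyCache`, `put`, and `move_to_device` are hypothetical stand-ins for illustration, not InvokeAI classes or methods; only `_delete_cache_entry`, `_cache_stack`, and `_cached_models` mirror names from the diff.

```python
from dataclasses import dataclass
from typing import Dict, List


class FakeModel:
    """Hypothetical stand-in for a model whose .to() may fail (e.g. out of memory)."""

    def __init__(self, fail: bool = False) -> None:
        self.fail = fail
        self.device = "cpu"

    def to(self, device: str) -> "FakeModel":
        if self.fail:
            raise MemoryError(f"cannot move model to {device}")
        self.device = device
        return self


@dataclass
class CacheRecord:
    key: str
    model: FakeModel


class TinyCache:
    """Hypothetical cache that tracks entries in a dict plus an ordered key stack."""

    def __init__(self) -> None:
        self._cached_models: Dict[str, CacheRecord] = {}
        self._cache_stack: List[str] = []

    def put(self, key: str, model: FakeModel) -> None:
        self._cached_models[key] = CacheRecord(key, model)
        self._cache_stack.append(key)

    def _delete_cache_entry(self, entry: CacheRecord) -> None:
        # Keep both bookkeeping structures in sync by deleting in one place.
        self._cache_stack.remove(entry.key)
        del self._cached_models[entry.key]

    def move_to_device(self, key: str, device: str) -> None:
        entry = self._cached_models[key]
        try:
            entry.model.to(device)
        except Exception:
            # Mirror the diff: drop the entry, then re-raise to the caller.
            self._delete_cache_entry(entry)
            raise


cache = TinyCache()
cache.put("good", FakeModel())
cache.put("bad", FakeModel(fail=True))
cache.move_to_device("good", "cuda")
try:
    cache.move_to_device("bad", "cuda")
except MemoryError:
    pass  # the failing entry has already been evicted
remaining = sorted(cache._cached_models)
```

Routing both the eviction loop in `make_room` and the failure path through one helper means the stack and the dict cannot drift apart, which appears to be the motivation for the refactor.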
--- a/invokeai/backend/model_manager/load/model_cache/model_locker.py
+++ b/invokeai/backend/model_manager/load/model_cache/model_locker.py
@@ -4,7 +4,7 @@ Base class and implementation of a class that moves models in and out of VRAM.

 import torch

-from invokeai.backend.model_manager.any_model_type import AnyModel
+from invokeai.backend.model_manager import AnyModel

 from .model_cache_base import CacheRecord, ModelCacheBase, ModelLockerBase

@@ -34,7 +34,6 @@ class ModelLocker(ModelLockerBase):

        # NOTE that the model has to have the to() method in order for this code to move it into GPU!
        self._cache_entry.lock()
-
        try:
            if self._cache.lazy_offloading:
                self._cache.offload_unlocked_models(self._cache_entry.size)
@@ -51,6 +50,7 @@ class ModelLocker(ModelLockerBase):
        except Exception:
            self._cache_entry.unlock()
            raise
+
        return self.model

    def unlock(self) -> None:
--- a/invokeai/backend/model_manager/load/model_loaders/controlnet.py
+++ b/invokeai/backend/model_manager/load/model_loaders/controlnet.py
@@ -5,12 +5,12 @@ from pathlib import Path
 from typing import Optional

 from invokeai.backend.model_manager import (
+    AnyModel,
    AnyModelConfig,
    BaseModelType,
    ModelFormat,
    ModelType,
 )
-from invokeai.backend.model_manager.any_model_type import AnyModel
 from invokeai.backend.model_manager.config import CheckpointConfigBase
 from invokeai.backend.model_manager.convert_ckpt_to_diffusers import convert_controlnet_to_diffusers

--- a/invokeai/backend/model_manager/load/model_loaders/generic_diffusers.py
+++ b/invokeai/backend/model_manager/load/model_loaders/generic_diffusers.py
@@ -9,6 +9,7 @@ from diffusers.configuration_utils import ConfigMixin
 from diffusers.models.modeling_utils import ModelMixin

 from invokeai.backend.model_manager import (
+    AnyModel,
    AnyModelConfig,
    BaseModelType,
    InvalidModelConfigException,
@@ -16,7 +17,6 @@ from invokeai.backend.model_manager import (
    ModelType,
    SubModelType,
 )
-from invokeai.backend.model_manager.any_model_type import AnyModel
 from invokeai.backend.model_manager.config import DiffusersConfigBase

 from .. import ModelLoader, ModelLoaderRegistry
--- a/invokeai/backend/model_manager/load/model_loaders/ip_adapter.py
+++ b/invokeai/backend/model_manager/load/model_loaders/ip_adapter.py
@@ -7,9 +7,9 @@ from typing import Optional
 import torch

 from invokeai.backend.ip_adapter.ip_adapter import build_ip_adapter
-from invokeai.backend.model_manager import AnyModelConfig, BaseModelType, ModelFormat, ModelType, SubModelType
-from invokeai.backend.model_manager.any_model_type import AnyModel
+from invokeai.backend.model_manager import AnyModel, AnyModelConfig, BaseModelType, ModelFormat, ModelType, SubModelType
 from invokeai.backend.model_manager.load import ModelLoader, ModelLoaderRegistry
+from invokeai.backend.raw_model import RawModel


@ModelLoaderRegistry.register(base=BaseModelType.Any, type=ModelType.IPAdapter, format=ModelFormat.InvokeAI)
@@ -25,7 +25,7 @@ class IPAdapterInvokeAILoader(ModelLoader):
        if submodel_type is not None:
            raise ValueError("There are no submodels in an IP-Adapter model.")
        model_path = Path(config.path)
-        model = build_ip_adapter(
+        model: RawModel = build_ip_adapter(
            ip_adapter_ckpt_path=model_path,
            device=torch.device("cpu"),
            dtype=self._torch_dtype,
--- a/invokeai/backend/model_manager/load/model_loaders/lora.py
+++ b/invokeai/backend/model_manager/load/model_loaders/lora.py
@@ -6,15 +6,15 @@ from pathlib import Path
 from typing import Optional

 from invokeai.app.services.config import InvokeAIAppConfig
-from invokeai.backend.lora.lora_model import LoRAModelRaw
+from invokeai.backend.lora import LoRAModelRaw
 from invokeai.backend.model_manager import (
+    AnyModel,
    AnyModelConfig,
    BaseModelType,
    ModelFormat,
    ModelType,
    SubModelType,
 )
-from invokeai.backend.model_manager.any_model_type import AnyModel
 from invokeai.backend.model_manager.load.convert_cache import ModelConvertCacheBase
 from invokeai.backend.model_manager.load.model_cache.model_cache_base import ModelCacheBase

--- a/invokeai/backend/model_manager/load/model_loaders/onnx.py
+++ b/invokeai/backend/model_manager/load/model_loaders/onnx.py
@@ -6,13 +6,13 @@ from pathlib import Path
 from typing import Optional

 from invokeai.backend.model_manager import (
+    AnyModel,
    AnyModelConfig,
    BaseModelType,
    ModelFormat,
    ModelType,
    SubModelType,
 )
-from invokeai.backend.model_manager.any_model_type import AnyModel

 from .. import ModelLoaderRegistry
 from .generic_diffusers import GenericDiffusersLoader
--- a/invokeai/backend/model_manager/load/model_loaders/stable_diffusion.py
+++ b/invokeai/backend/model_manager/load/model_loaders/stable_diffusion.py
@@ -5,6 +5,7 @@ from pathlib import Path
 from typing import Optional

 from invokeai.backend.model_manager import (
+    AnyModel,
    AnyModelConfig,
    BaseModelType,
    ModelFormat,
@@ -12,7 +13,6 @@ from invokeai.backend.model_manager import (
    SchedulerPredictionType,
    SubModelType,
 )
-from invokeai.backend.model_manager.any_model_type import AnyModel
 from invokeai.backend.model_manager.config import (
    CheckpointConfigBase,
    DiffusersConfigBase,
--- a/invokeai/backend/model_manager/load/model_loaders/textual_inversion.py
+++ b/invokeai/backend/model_manager/load/model_loaders/textual_inversion.py
@@ -5,13 +5,13 @@ from pathlib import Path
 from typing import Optional

 from invokeai.backend.model_manager import (
+    AnyModel,
    AnyModelConfig,
    BaseModelType,
    ModelFormat,
    ModelType,
    SubModelType,
 )
-from invokeai.backend.model_manager.any_model_type import AnyModel
 from invokeai.backend.textual_inversion import TextualInversionModelRaw

 from .. import ModelLoader, ModelLoaderRegistry
--- a/invokeai/backend/model_manager/load/model_loaders/vae.py
+++ b/invokeai/backend/model_manager/load/model_loaders/vae.py
@@ -14,8 +14,7 @@ from invokeai.backend.model_manager import (
    ModelFormat,
    ModelType,
 )
-from invokeai.backend.model_manager.any_model_type import AnyModel
-from invokeai.backend.model_manager.config import CheckpointConfigBase
+from invokeai.backend.model_manager.config import AnyModel, CheckpointConfigBase
 from invokeai.backend.model_manager.convert_ckpt_to_diffusers import convert_ldm_vae_to_diffusers

 from .. import ModelLoaderRegistry
--- a/invokeai/backend/model_manager/load/model_util.py
+++ b/invokeai/backend/model_manager/load/model_util.py
@@ -8,7 +8,7 @@ from typing import Optional
 import torch
 from diffusers import DiffusionPipeline

-from invokeai.backend.model_manager.any_model_type import AnyModel
+from invokeai.backend.model_manager.config import AnyModel
 from invokeai.backend.onnx.onnx_runtime import IAIOnnxRuntimeModel


--- a/invokeai/backend/model_manager/load/optimizations.py
+++ b/invokeai/backend/model_manager/load/optimizations.py
@@ -17,7 +17,7 @@ def skip_torch_weight_init() -> Generator[None, None, None]:
    completely unnecessary if the intent is to load checkpoint weights from disk for the layer. This context manager
    monkey-patches common torch layers to skip the weight initialization step.
    """
-    torch_modules = [torch.nn.Linear, torch.nn.modules.conv._ConvNd, torch.nn.Embedding, torch.nn.LayerNorm]
+    torch_modules = [torch.nn.Linear, torch.nn.modules.conv._ConvNd, torch.nn.Embedding]
    saved_functions = [hasattr(m, "reset_parameters") and m.reset_parameters for m in torch_modules]

    try:
--- a/invokeai/backend/model_manager/merge.py
+++ b/invokeai/backend/model_manager/merge.py
@@ -17,7 +17,7 @@ from diffusers.utils import logging as dlogging

 from invokeai.app.services.model_install import ModelInstallServiceBase
 from invokeai.app.services.model_records.model_records_base import ModelRecordChanges
-from invokeai.backend.util.devices import choose_torch_device, torch_dtype
+from invokeai.backend.util.devices import TorchDevice

 from . import (
    AnyModelConfig,
@@ -43,6 +43,7 @@ class ModelMerger(object):
        Initialize a ModelMerger object with the model installer.
        """
        self._installer = installer
+        self._dtype = TorchDevice.choose_torch_dtype()

    def merge_diffusion_models(
        self,
@@ -68,7 +69,7 @@ class ModelMerger(object):
            warnings.simplefilter("ignore")
            verbosity = dlogging.get_verbosity()
            dlogging.set_verbosity_error()
-            dtype = torch.float16 if variant == "fp16" else torch_dtype(choose_torch_device())
+            dtype = torch.float16 if variant == "fp16" else self._dtype

            # Note that checkpoint_merger will not work with downloaded HuggingFace fp16 models
            # until upstream https://github.com/huggingface/diffusers/pull/6670 is merged and released.
@@ -151,7 +152,7 @@ class ModelMerger(object):
        dump_path.mkdir(parents=True, exist_ok=True)
        dump_path = dump_path / merged_model_name

-        dtype = torch.float16 if variant == "fp16" else torch_dtype(choose_torch_device())
+        dtype = torch.float16 if variant == "fp16" else self._dtype
        merged_pipe.save_pretrained(dump_path.as_posix(), safe_serialization=True, torch_dtype=dtype, variant=variant)

        # register model and get its unique key
--- a/invokeai/backend/model_patcher.py
+++ b/invokeai/backend/model_patcher.py
@@ -13,14 +13,157 @@ from diffusers import OnnxRuntimeModel, UNet2DConditionModel
 from transformers import CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer

 from invokeai.app.shared.models import FreeUConfig
-from invokeai.backend.lora.lora_model import LoRAModelRaw
+from invokeai.backend.model_manager import AnyModel
 from invokeai.backend.model_manager.load.optimizations import skip_torch_weight_init
 from invokeai.backend.onnx.onnx_runtime import IAIOnnxRuntimeModel

+from .lora import LoRAModelRaw
 from .textual_inversion import TextualInversionManager, TextualInversionModelRaw

+"""
+loras = [
+    (lora_model1, 0.7),
+    (lora_model2, 0.4),
+]
+with ModelPatcher.apply_lora_unet(unet, loras):
+    # unet with applied loras
+# unmodified unet

+"""
+
+
+# TODO: rename to something like ModelPatcher and add TI method?
 class ModelPatcher:
+    @staticmethod
+    def _resolve_lora_key(model: torch.nn.Module, lora_key: str, prefix: str) -> Tuple[str, torch.nn.Module]:
+        assert "." not in lora_key
+
+        if not lora_key.startswith(prefix):
+            raise Exception(f"lora_key with invalid prefix: {lora_key}, {prefix}")
+
+        module = model
+        module_key = ""
+        key_parts = lora_key[len(prefix) :].split("_")
+
+        submodule_name = key_parts.pop(0)
+
+        while len(key_parts) > 0:
+            try:
+                module = module.get_submodule(submodule_name)
+                module_key += "." + submodule_name
+                submodule_name = key_parts.pop(0)
+            except Exception:
+                submodule_name += "_" + key_parts.pop(0)
+
+        module = module.get_submodule(submodule_name)
+        module_key = (module_key + "." + submodule_name).lstrip(".")
+
+        return (module_key, module)
+
+    @classmethod
+    @contextmanager
+    def apply_lora_unet(
+        cls,
+        unet: UNet2DConditionModel,
+        loras: Iterator[Tuple[LoRAModelRaw, float]],
+    ) -> None:
+        with cls.apply_lora(unet, loras, "lora_unet_"):
+            yield
+
+    @classmethod
+    @contextmanager
+    def apply_lora_text_encoder(
+        cls,
+        text_encoder: CLIPTextModel,
+        loras: Iterator[Tuple[LoRAModelRaw, float]],
+    ) -> None:
+        with cls.apply_lora(text_encoder, loras, "lora_te_"):
+            yield
+
+    @classmethod
+    @contextmanager
+    def apply_sdxl_lora_text_encoder(
+        cls,
+        text_encoder: CLIPTextModel,
+        loras: List[Tuple[LoRAModelRaw, float]],
+    ) -> None:
+        with cls.apply_lora(text_encoder, loras, "lora_te1_"):
+            yield
+
+    @classmethod
+    @contextmanager
+    def apply_sdxl_lora_text_encoder2(
+        cls,
+        text_encoder: CLIPTextModel,
+        loras: List[Tuple[LoRAModelRaw, float]],
+    ) -> None:
+        with cls.apply_lora(text_encoder, loras, "lora_te2_"):
+            yield
+
+    @classmethod
+    @contextmanager
+    def apply_lora(
+        cls,
+        model: AnyModel,
+        loras: Iterator[Tuple[LoRAModelRaw, float]],
+        prefix: str,
+    ) -> None:
+        original_weights = {}
+        try:
+            with torch.no_grad():
+                for lora, lora_weight in loras:
+                    # assert lora.device.type == "cpu"
+                    for layer_key, layer in lora.layers.items():
+                        if not layer_key.startswith(prefix):
+                            continue
+
+                        # TODO(ryand): A non-negligible amount of time is currently spent resolving LoRA keys. This
+                        # should be improved in the following ways:
+                        # 1. The key mapping could be more-efficiently pre-computed. This would save time every time a
+                        #    LoRA model is applied.
+                        # 2. From an API perspective, there's no reason that the `ModelPatcher` should be aware of the
+                        #    intricacies of Stable Diffusion key resolution. It should just expect the input LoRA
+                        #    weights to have valid keys.
+                        assert isinstance(model, torch.nn.Module)
+                        module_key, module = cls._resolve_lora_key(model, layer_key, prefix)
+
+                        # All of the LoRA weight calculations will be done on the same device as the module weight.
+                        # (Performance will be best if this is a CUDA device.)
+                        device = module.weight.device
+                        dtype = module.weight.dtype
+
+                        if module_key not in original_weights:
+                            original_weights[module_key] = module.weight.detach().to(device="cpu", copy=True)
+
+                        layer_scale = layer.alpha / layer.rank if (layer.alpha and layer.rank) else 1.0
+
+                        # We intentionally move to the target device first, then cast. Experimentally, this was found to
+                        # be significantly faster for 16-bit CPU tensors being moved to a CUDA device than doing the
+                        # same thing in a single call to '.to(...)'.
+                        layer.to(device=device)
+                        layer.to(dtype=torch.float32)
+                        # TODO(ryand): Using torch.autocast(...) over explicit casting may offer a speed benefit on CUDA
+                        # devices here. Experimentally, it was found to be very slow on CPU. More investigation needed.
+                        layer_weight = layer.get_weight(module.weight) * (lora_weight * layer_scale)
+                        layer.to(device=torch.device("cpu"))
+
+                        assert isinstance(layer_weight, torch.Tensor)  # mypy thinks layer_weight is a float|Any ??!
+                        if module.weight.shape != layer_weight.shape:
+                            # TODO: debug on lycoris
+                            assert hasattr(layer_weight, "reshape")
+                            layer_weight = layer_weight.reshape(module.weight.shape)
+
+                        assert isinstance(layer_weight, torch.Tensor)  # mypy thinks layer_weight is a float|Any ??!
+                        module.weight += layer_weight.to(dtype=dtype)
+
+            yield  # wait for context manager exit
+
+        finally:
+            assert hasattr(model, "get_submodule")  # mypy not picking up fact that torch.nn.Module has get_submodule()
+            with torch.no_grad():
+                for module_key, weight in original_weights.items():
+                    model.get_submodule(module_key).weight.copy_(weight)
+
    @classmethod
    @contextmanager
    def apply_ti(
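
Review note on `apply_lora` above: the context manager saves each touched weight once, adds the scaled LoRA deltas in place, and restores the saved copies in `finally`, so the model is unmodified after the `with` block even if the body raises. The sketch below shows the same save/patch/restore shape under simplifying assumptions: plain Python lists stand in for tensors, and `apply_deltas` and the `Weights` alias are hypothetical names, not part of `ModelPatcher`.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List, Tuple

# Assumption: a "model" is just a mapping from module key to a flat weight list.
Weights = Dict[str, List[float]]


@contextmanager
def apply_deltas(model: Weights, deltas: List[Tuple[Weights, float]]) -> Iterator[None]:
    original: Weights = {}
    try:
        for delta, strength in deltas:
            for key, values in delta.items():
                if key not in model:
                    continue  # skip keys that do not belong to this model
                if key not in original:
                    # Save the pristine weight only the first time a key is touched.
                    original[key] = list(model[key])
                model[key] = [w + strength * d for w, d in zip(model[key], values)]
        yield  # the caller runs with patched weights
    finally:
        # Restore every touched weight, whether or not the body raised.
        for key, saved in original.items():
            model[key] = saved


model = {"attn.to_q": [1.0, 2.0], "attn.to_k": [3.0, 4.0]}
deltas = [({"attn.to_q": [1.0, 1.0]}, 0.5), ({"attn.to_q": [2.0, 0.0]}, 0.25)]
with apply_deltas(model, deltas):
    patched = list(model["attn.to_q"])
restored = list(model["attn.to_q"])
```

Saving only on first touch matters when several LoRAs hit the same module: the backup must be the unpatched weight, not the result of an earlier LoRA in the list.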
--- a/invokeai/backend/onnx/onnx_runtime.py
+++ b/invokeai/backend/onnx/onnx_runtime.py
@@ -6,16 +6,17 @@ from typing import Any, List, Optional, Tuple, Union

 import numpy as np
 import onnx
-import torch
 from onnx import numpy_helper
 from onnxruntime import InferenceSession, SessionOptions, get_available_providers

+from ..raw_model import RawModel
+
 ONNX_WEIGHTS_NAME = "model.onnx"


 # NOTE FROM LS: This was copied from Stalker's original implementation.
 # I have not yet gone through and fixed all the type hints
-class IAIOnnxRuntimeModel(torch.nn.Module):
+class IAIOnnxRuntimeModel(RawModel):
    class _tensor_access:
        def __init__(self, model):  # type: ignore
            self.model = model
@@ -102,7 +103,7 @@ class IAIOnnxRuntimeModel(torch.nn.Module):

        self.proto = onnx.load(model_path, load_external_data=False)
        """
-        super().__init__()
+
        self.proto = onnx.load(model_path, load_external_data=True)
        # self.data = dict()
        # for tensor in self.proto.graph.initializer:
--- a/invokeai/backend/raw_model.py
+++ b/invokeai/backend/raw_model.py
@@ -0,0 +1,15 @@
+"""Base class for 'Raw' models.
+
+The RawModel class is the base class of LoRAModelRaw and TextualInversionModelRaw,
+and is used for type checking of calls to the model patcher. Its main purpose
+is to avoid circular import issues when lora.py tries to import BaseModelType
+from invokeai.backend.model_manager.config, and the latter tries to import LoRAModelRaw
+from lora.py.
+
+The term 'raw' was introduced to describe a wrapper around a torch.nn.Module
+that adds additional methods and attributes.
+"""
+
+
+class RawModel:
+    """Base class for 'Raw' model wrappers."""
--- a/invokeai/backend/stable_diffusion/diffusers_pipeline.py
+++ b/invokeai/backend/stable_diffusion/diffusers_pipeline.py
@@ -21,12 +21,11 @@ from pydantic import Field
 from transformers import CLIPFeatureExtractor, CLIPTextModel, CLIPTokenizer

 from invokeai.app.services.config.config_default import get_config
-from invokeai.backend.ip_adapter.ip_adapter import IPAdapter
-from invokeai.backend.ip_adapter.unet_patcher import UNetPatcher
-from invokeai.backend.stable_diffusion.diffusion.conditioning_data import ConditioningData
+from invokeai.backend.stable_diffusion.diffusion.conditioning_data import IPAdapterData, TextConditioningData
 from invokeai.backend.stable_diffusion.diffusion.shared_invokeai_diffusion import InvokeAIDiffuserComponent
+from invokeai.backend.stable_diffusion.diffusion.unet_attention_patcher import UNetAttentionPatcher, UNetIPAdapterData
 from invokeai.backend.util.attention import auto_detect_slice_size
-from invokeai.backend.util.devices import normalize_device
+from invokeai.backend.util.devices import TorchDevice


@dataclass
@@ -149,16 +148,6 @@ class ControlNetData:
    resize_mode: str = Field(default="just_resize")


-@dataclass
-class IPAdapterData:
-    ip_adapter_model: IPAdapter = Field(default=None)
-    # TODO: change to polymorphic so can do different weights per step (once implemented...)
-    weight: Union[float, List[float]] = Field(default=1.0)
-    # weight: float = Field(default=1.0)
-    begin_step_percent: float = Field(default=0.0)
-    end_step_percent: float = Field(default=1.0)
-
-
@dataclass
 class T2IAdapterData:
    """A structure containing the information required to apply conditioning from a single T2I-Adapter model."""
@@ -266,7 +255,7 @@ class StableDiffusionGeneratorPipeline(StableDiffusionPipeline):
        if self.unet.device.type == "cpu" or self.unet.device.type == "mps":
            mem_free = psutil.virtual_memory().free
        elif self.unet.device.type == "cuda":
-            mem_free, _ = torch.cuda.mem_get_info(normalize_device(self.unet.device))
+            mem_free, _ = torch.cuda.mem_get_info(TorchDevice.normalize(self.unet.device))
        else:
            raise ValueError(f"unrecognized device {self.unet.device}")
        # input tensor of [1, 4, h/8, w/8]
@@ -295,7 +284,8 @@ class StableDiffusionGeneratorPipeline(StableDiffusionPipeline):
        self,
        latents: torch.Tensor,
        num_inference_steps: int,
-        conditioning_data: ConditioningData,
+        scheduler_step_kwargs: dict[str, Any],
+        conditioning_data: TextConditioningData,
        *,
        noise: Optional[torch.Tensor],
        timesteps: torch.Tensor,
@@ -308,7 +298,7 @@ class StableDiffusionGeneratorPipeline(StableDiffusionPipeline):
        mask: Optional[torch.Tensor] = None,
        masked_latents: Optional[torch.Tensor] = None,
        gradient_mask: Optional[bool] = False,
-        seed: Optional[int] = None,
+        seed: int,
    ) -> torch.Tensor:
        if init_timestep.shape[0] == 0:
            return latents
@@ -326,20 +316,6 @@ class StableDiffusionGeneratorPipeline(StableDiffusionPipeline):
            latents = self.scheduler.add_noise(latents, noise, batched_t)

        if mask is not None:
-            # if no noise provided, noisify unmasked area based on seed(or 0 as fallback)
-            if noise is None:
-                noise = torch.randn(
-                    orig_latents.shape,
-                    dtype=torch.float32,
-                    device="cpu",
-                    generator=torch.Generator(device="cpu").manual_seed(seed or 0),
-                ).to(device=orig_latents.device, dtype=orig_latents.dtype)
-
-                latents = self.scheduler.add_noise(latents, noise, batched_t)
-                latents = torch.lerp(
-                    orig_latents, latents.to(dtype=orig_latents.dtype), mask.to(dtype=orig_latents.dtype)
-                )
-
            if is_inpainting_model(self.unet):
                if masked_latents is None:
                    raise Exception("Source image required for inpaint mask when inpaint model used!")
@@ -348,6 +324,15 @@ class StableDiffusionGeneratorPipeline(StableDiffusionPipeline):
                    self._unet_forward, mask, masked_latents
                )
            else:
+                # if no noise provided, noisify unmasked area based on seed
+                if noise is None:
+                    noise = torch.randn(
+                        orig_latents.shape,
+                        dtype=torch.float32,
+                        device="cpu",
+                        generator=torch.Generator(device="cpu").manual_seed(seed),
+                    ).to(device=orig_latents.device, dtype=orig_latents.dtype)
+
                additional_guidance.append(AddsMaskGuidance(mask, orig_latents, self.scheduler, noise, gradient_mask))

        try:
@@ -355,6 +340,7 @@ class StableDiffusionGeneratorPipeline(StableDiffusionPipeline):
                latents,
                timesteps,
                conditioning_data,
+                scheduler_step_kwargs=scheduler_step_kwargs,
                additional_guidance=additional_guidance,
                control_data=control_data,
                ip_adapter_data=ip_adapter_data,
@@ -380,7 +366,8 @@ class StableDiffusionGeneratorPipeline(StableDiffusionPipeline):
        self,
        latents: torch.Tensor,
        timesteps,
-        conditioning_data: ConditioningData,
+        conditioning_data: TextConditioningData,
+        scheduler_step_kwargs: dict[str, Any],
        *,
        additional_guidance: List[Callable] = None,
        control_data: List[ControlNetData] = None,
@@ -397,22 +384,22 @@ class StableDiffusionGeneratorPipeline(StableDiffusionPipeline):
        if timesteps.shape[0] == 0:
            return latents

-        ip_adapter_unet_patcher = None
-        extra_conditioning_info = conditioning_data.text_embeddings.extra_conditioning
-        if extra_conditioning_info is not None and extra_conditioning_info.wants_cross_attention_control:
-            attn_ctx = self.invokeai_diffuser.custom_attention_context(
-                self.invokeai_diffuser.model,
-                extra_conditioning_info=extra_conditioning_info,
+        use_ip_adapter = ip_adapter_data is not None
+        use_regional_prompting = (
+            conditioning_data.cond_regions is not None or conditioning_data.uncond_regions is not None
+        )
+        unet_attention_patcher = None
+        self.use_ip_adapter = use_ip_adapter
+        attn_ctx = nullcontext()
+
+        if use_ip_adapter or use_regional_prompting:
+            ip_adapters: Optional[List[UNetIPAdapterData]] = (
+                [{"ip_adapter": ipa.ip_adapter_model, "target_blocks": ipa.target_blocks} for ipa in ip_adapter_data]
+                if use_ip_adapter
+                else None
            )
-            self.use_ip_adapter = False
-        elif ip_adapter_data is not None:
-            # TODO(ryand): Should we raise an exception if both custom attention and IP-Adapter attention are active?
-            # As it is now, the IP-Adapter will silently be skipped.
-            ip_adapter_unet_patcher = UNetPatcher([ipa.ip_adapter_model for ipa in ip_adapter_data])
-            attn_ctx = ip_adapter_unet_patcher.apply_ip_adapter_attention(self.invokeai_diffuser.model)
-            self.use_ip_adapter = True
-        else:
-            attn_ctx = nullcontext()
+            unet_attention_patcher = UNetAttentionPatcher(ip_adapters)
+            attn_ctx = unet_attention_patcher.apply_ip_adapter_attention(self.invokeai_diffuser.model)

        with attn_ctx:
            if callback is not None:
@@ -435,11 +422,11 @@ class StableDiffusionGeneratorPipeline(StableDiffusionPipeline):
                    conditioning_data,
                    step_index=i,
                    total_step_count=len(timesteps),
+                    scheduler_step_kwargs=scheduler_step_kwargs,
                    additional_guidance=additional_guidance,
                    control_data=control_data,
                    ip_adapter_data=ip_adapter_data,
                    t2i_adapter_data=t2i_adapter_data,
-                    ip_adapter_unet_patcher=ip_adapter_unet_patcher,
                )
                latents = step_output.prev_sample
                predicted_original = getattr(step_output, "pred_original_sample", None)
@@ -463,14 +450,14 @@ class StableDiffusionGeneratorPipeline(StableDiffusionPipeline):
        self,
        t: torch.Tensor,
        latents: torch.Tensor,
-        conditioning_data: ConditioningData,
+        conditioning_data: TextConditioningData,
        step_index: int,
        total_step_count: int,
+        scheduler_step_kwargs: dict[str, Any],
        additional_guidance: List[Callable] = None,
        control_data: List[ControlNetData] = None,
        ip_adapter_data: Optional[list[IPAdapterData]] = None,
        t2i_adapter_data: Optional[list[T2IAdapterData]] = None,
-        ip_adapter_unet_patcher: Optional[UNetPatcher] = None,
    ):
        # invokeai_diffuser has batched timesteps, but diffusers schedulers expect a single value
        timestep = t[0]
@@ -485,23 +472,6 @@ class StableDiffusionGeneratorPipeline(StableDiffusionPipeline):
        #     i.e. before or after passing it to InvokeAIDiffuserComponent
        latent_model_input = self.scheduler.scale_model_input(latents, timestep)

-        # handle IP-Adapter
-        if self.use_ip_adapter and ip_adapter_data is not None:  # somewhat redundant but logic is clearer
-            for i, single_ip_adapter_data in enumerate(ip_adapter_data):
-                first_adapter_step = math.floor(single_ip_adapter_data.begin_step_percent * total_step_count)
-                last_adapter_step = math.ceil(single_ip_adapter_data.end_step_percent * total_step_count)
-                weight = (
-                    single_ip_adapter_data.weight[step_index]
-                    if isinstance(single_ip_adapter_data.weight, List)
-                    else single_ip_adapter_data.weight
-                )
-                if step_index >= first_adapter_step and step_index <= last_adapter_step:
-                    # Only apply this IP-Adapter if the current step is within the IP-Adapter's begin/end step range.
-                    ip_adapter_unet_patcher.set_scale(i, weight)
-                else:
-                    # Otherwise, set the IP-Adapter's scale to 0, so it has no effect.
-                    ip_adapter_unet_patcher.set_scale(i, 0.0)
-
        # Handle ControlNet(s)
        down_block_additional_residuals = None
        mid_block_additional_residual = None
@@ -550,6 +520,7 @@ class StableDiffusionGeneratorPipeline(StableDiffusionPipeline):
            step_index=step_index,
            total_step_count=total_step_count,
            conditioning_data=conditioning_data,
+            ip_adapter_data=ip_adapter_data,
            down_block_additional_residuals=down_block_additional_residuals,  # for ControlNet
            mid_block_additional_residual=mid_block_additional_residual,  # for ControlNet
            down_intrablock_additional_residuals=down_intrablock_additional_residuals,  # for T2I-Adapter
@@ -569,7 +540,7 @@ class StableDiffusionGeneratorPipeline(StableDiffusionPipeline):
            )

        # compute the previous noisy sample x_t -> x_t-1
-        step_output = self.scheduler.step(noise_pred, timestep, latents, **conditioning_data.scheduler_args)
+        step_output = self.scheduler.step(noise_pred, timestep, latents, **scheduler_step_kwargs)

        # TODO: discuss injection point options. For now this is a patch to get progress images working with inpainting again.
        for guidance in additional_guidance:
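
Review note on the removed "handle IP-Adapter" block above: the per-step logic picks a weight (a single float, or one entry of a per-step list) and zeroes it outside the adapter's begin/end step window. The sketch below restates that removed logic as a standalone function. The name `ip_adapter_scale` and its signature are hypothetical; it is based on the deleted lines in this hunk, not on the new `scale_for_step` method, whose body is not fully shown in this diff.

```python
import math
from typing import List, Union


def ip_adapter_scale(
    weight: Union[float, List[float]],
    begin_step_percent: float,
    end_step_percent: float,
    step_index: int,
    total_steps: int,
) -> float:
    """Return the adapter scale for one denoising step (0.0 outside the window)."""
    first_adapter_step = math.floor(begin_step_percent * total_steps)
    last_adapter_step = math.ceil(end_step_percent * total_steps)
    # A list supplies one weight per step; a float applies to every step.
    step_weight = weight[step_index] if isinstance(weight, list) else weight
    if first_adapter_step <= step_index <= last_adapter_step:
        return step_weight
    return 0.0


# A constant weight of 0.8, active from 25% to 50% of an 8-step schedule.
scales = [ip_adapter_scale(0.8, 0.25, 0.5, i, 8) for i in range(8)]
# A per-step weight list, active for the whole schedule.
per_step = ip_adapter_scale([0.1, 0.2, 0.3], 0.0, 1.0, 1, 3)
```

Note that the window is inclusive at both ends, so with the values above steps 2, 3, and 4 are active.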
--- a/invokeai/backend/stable_diffusion/diffusion/conditioning_data.py
+++ b/invokeai/backend/stable_diffusion/diffusion/conditioning_data.py
@@ -1,27 +1,17 @@
-import dataclasses
-import inspect
-from dataclasses import dataclass, field
-from typing import Any, List, Optional, Union
+import math
+from dataclasses import dataclass
+from typing import List, Optional, Union

 import torch

-from .cross_attention_control import Arguments
-
-
-@dataclass
-class ExtraConditioningInfo:
-    tokens_count_including_eos_bos: int
-    cross_attention_control_args: Optional[Arguments] = None
-
-    @property
-    def wants_cross_attention_control(self):
-        return self.cross_attention_control_args is not None
+from invokeai.backend.ip_adapter.ip_adapter import IPAdapter


@dataclass
 class BasicConditioningInfo:
+    """SD 1/2 text conditioning information produced by Compel."""
+
    embeds: torch.Tensor
-    extra_conditioning: Optional[ExtraConditioningInfo]

    def to(self, device, dtype=None):
        self.embeds = self.embeds.to(device=device, dtype=dtype)
@@ -35,6 +25,8 @@ class ConditioningFieldData:

@dataclass
 class SDXLConditioningInfo(BasicConditioningInfo):
+    """SDXL text conditioning information produced by Compel."""
+
    pooled_embeds: torch.Tensor
    add_time_ids: torch.Tensor

@@ -57,37 +49,75 @@ class IPAdapterConditioningInfo:


@dataclass
-class ConditioningData:
-    unconditioned_embeddings: BasicConditioningInfo
-    text_embeddings: BasicConditioningInfo
-    """
-    Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-    `guidance_scale` is defined as `w` of equation 2. of [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf).
-    Guidance scale is enabled by setting `guidance_scale > 1`. Higher guidance scale encourages to generate
-    images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
-    """
-    guidance_scale: Union[float, List[float]]
-    """ for models trained using zero-terminal SNR ("ztsnr"), it's suggested to use guidance_rescale_multiplier of 0.7 .
-     ref [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/pdf/2305.08891.pdf)
-    """
-    guidance_rescale_multiplier: float = 0
-    scheduler_args: dict[str, Any] = field(default_factory=dict)
+class IPAdapterData:
+    ip_adapter_model: IPAdapter
+    ip_adapter_conditioning: IPAdapterConditioningInfo
+    mask: torch.Tensor
+    target_blocks: List[str]

-    ip_adapter_conditioning: Optional[list[IPAdapterConditioningInfo]] = None
+    # Either a single weight applied to all steps, or a list of weights for each step.
+    weight: Union[float, List[float]] = 1.0
+    begin_step_percent: float = 0.0
+    end_step_percent: float = 1.0

-    @property
-    def dtype(self):
-        return self.text_embeddings.dtype
+    def scale_for_step(self, step_index: int, total_steps: int) -> float:
+        first_adapter_step = math.floor(self.begin_step_percent * total_steps)
+        last_adapter_step = math.ceil(self.end_step_percent * total_steps)
+        weight = self.weight[step_index] if isinstance(self.weight, List) else self.weight
+        if step_index >= first_adapter_step and step_index <= last_adapter_step:
+            # Only apply this IP-Adapter if the current step is within the IP-Adapter's begin/end step range.
+            return weight
+        # Otherwise, set the IP-Adapter's scale to 0, so it has no effect.
+        return 0.0

-    def add_scheduler_args_if_applicable(self, scheduler, **kwargs):
-        scheduler_args = dict(self.scheduler_args)
-        step_method = inspect.signature(scheduler.step)
-        for name, value in kwargs.items():
-            try:
-                step_method.bind_partial(**{name: value})
-            except TypeError:
-                # FIXME: don't silently discard arguments
-                pass  # debug("%s does not accept argument named %r", scheduler, name)
-            else:
-                scheduler_args[name] = value
-        return dataclasses.replace(self, scheduler_args=scheduler_args)
+
+@dataclass
+class Range:
+    start: int
+    end: int
+
+
+class TextConditioningRegions:
+    def __init__(
+        self,
+        masks: torch.Tensor,
+        ranges: list[Range],
+    ):
+        # A binary mask indicating the regions of the image that the prompt should be applied to.
+        # Shape: (1, num_prompts, height, width)
+        # Dtype: torch.bool
+        self.masks = masks
+
+        # A list of ranges indicating the start and end indices of the embeddings that the corresponding mask applies to.
+        # ranges[i] contains the embedding range for the i'th prompt / mask.
+        self.ranges = ranges
+
+        assert self.masks.shape[1] == len(self.ranges)
+
+
+class TextConditioningData:
+    def __init__(
+        self,
+        uncond_text: Union[BasicConditioningInfo, SDXLConditioningInfo],
+        cond_text: Union[BasicConditioningInfo, SDXLConditioningInfo],
+        uncond_regions: Optional[TextConditioningRegions],
+        cond_regions: Optional[TextConditioningRegions],
+        guidance_scale: Union[float, List[float]],
+        guidance_rescale_multiplier: float = 0,
+    ):
+        self.uncond_text = uncond_text
+        self.cond_text = cond_text
+        self.uncond_regions = uncond_regions
+        self.cond_regions = cond_regions
+        # Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
+        # `guidance_scale` is defined as `w` of equation 2. of [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf).
+        # Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages the model to
+        # generate images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
+        self.guidance_scale = guidance_scale
+        # For models trained using zero-terminal SNR ("ztsnr"), it's suggested to use guidance_rescale_multiplier of 0.7.
+        # See [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/pdf/2305.08891.pdf).
+        self.guidance_rescale_multiplier = guidance_rescale_multiplier
+
+    def is_sdxl(self):
+        assert isinstance(self.uncond_text, SDXLConditioningInfo) == isinstance(self.cond_text, SDXLConditioningInfo)
+        return isinstance(self.cond_text, SDXLConditioningInfo)
--- a/invokeai/backend/stable_diffusion/diffusion/cross_attention_control.py
+++ b/invokeai/backend/stable_diffusion/diffusion/cross_attention_control.py
@@ -1,218 +0,0 @@
-# adapted from bloc97's CrossAttentionControl colab
-# https://github.com/bloc97/CrossAttentionControl
-
-
-import enum
-from dataclasses import dataclass, field
-from typing import Optional
-
-import torch
-from compel.cross_attention_control import Arguments
-from diffusers.models.attention_processor import Attention, SlicedAttnProcessor
-from diffusers.models.unets.unet_2d_condition import UNet2DConditionModel
-
-from invokeai.backend.util.devices import torch_dtype
-
-
-class CrossAttentionType(enum.Enum):
-    SELF = 1
-    TOKENS = 2
-
-
-class CrossAttnControlContext:
-    def __init__(self, arguments: Arguments):
-        """
-        :param arguments: Arguments for the cross-attention control process
-        """
-        self.cross_attention_mask: Optional[torch.Tensor] = None
-        self.cross_attention_index_map: Optional[torch.Tensor] = None
-        self.arguments = arguments
-
-    def get_active_cross_attention_control_types_for_step(
-        self, percent_through: float = None
-    ) -> list[CrossAttentionType]:
-        """
-        Should cross-attention control be applied on the given step?
-        :param percent_through: How far through the step sequence are we (0.0=pure noise, 1.0=completely denoised image). Expected range 0.0..<1.0.
-        :return: A list of attention types that cross-attention control should be performed for on the given step. May be [].
-        """
-        if percent_through is None:
-            return [CrossAttentionType.SELF, CrossAttentionType.TOKENS]
-
-        opts = self.arguments.edit_options
-        to_control = []
-        if opts["s_start"] <= percent_through < opts["s_end"]:
-            to_control.append(CrossAttentionType.SELF)
-        if opts["t_start"] <= percent_through < opts["t_end"]:
-            to_control.append(CrossAttentionType.TOKENS)
-        return to_control
-
-
-def setup_cross_attention_control_attention_processors(unet: UNet2DConditionModel, context: CrossAttnControlContext):
-    """
-    Inject attention parameters and functions into the passed in model to enable cross attention editing.
-
-    :param model: The unet model to inject into.
-    :return: None
-    """
-
-    # adapted from init_attention_edit
-    device = context.arguments.edited_conditioning.device
-
-    # urgh. should this be hardcoded?
-    max_length = 77
-    # mask=1 means use base prompt attention, mask=0 means use edited prompt attention
-    mask = torch.zeros(max_length, dtype=torch_dtype(device))
-    indices_target = torch.arange(max_length, dtype=torch.long)
-    indices = torch.arange(max_length, dtype=torch.long)
-    for name, a0, a1, b0, b1 in context.arguments.edit_opcodes:
-        if b0 < max_length:
-            if name == "equal":  # or (name == "replace" and a1 - a0 == b1 - b0):
-                # these tokens have not been edited
-                indices[b0:b1] = indices_target[a0:a1]
-                mask[b0:b1] = 1
-
-    context.cross_attention_mask = mask.to(device)
-    context.cross_attention_index_map = indices.to(device)
-    old_attn_processors = unet.attn_processors
-    if torch.backends.mps.is_available():
-        # see note in StableDiffusionGeneratorPipeline.__init__ about borked slicing on MPS
-        unet.set_attn_processor(SwapCrossAttnProcessor())
-    else:
-        # try to re-use an existing slice size
-        default_slice_size = 4
-        slice_size = next(
-            (p.slice_size for p in old_attn_processors.values() if type(p) is SlicedAttnProcessor), default_slice_size
-        )
-        unet.set_attn_processor(SlicedSwapCrossAttnProcesser(slice_size=slice_size))
-
-
-@dataclass
-class SwapCrossAttnContext:
-    modified_text_embeddings: torch.Tensor
-    index_map: torch.Tensor  # maps from original prompt token indices to the equivalent tokens in the modified prompt
-    mask: torch.Tensor  # in the target space of the index_map
-    cross_attention_types_to_do: list[CrossAttentionType] = field(default_factory=list)
-
-    def wants_cross_attention_control(self, attn_type: CrossAttentionType) -> bool:
-        return attn_type in self.cross_attention_types_to_do
-
-    @classmethod
-    def make_mask_and_index_map(
-        cls, edit_opcodes: list[tuple[str, int, int, int, int]], max_length: int
-    ) -> tuple[torch.Tensor, torch.Tensor]:
-        # mask=1 means use original prompt attention, mask=0 means use modified prompt attention
-        mask = torch.zeros(max_length)
-        indices_target = torch.arange(max_length, dtype=torch.long)
-        indices = torch.arange(max_length, dtype=torch.long)
-        for name, a0, a1, b0, b1 in edit_opcodes:
-            if b0 < max_length:
-                if name == "equal":
-                    # these tokens remain the same as in the original prompt
-                    indices[b0:b1] = indices_target[a0:a1]
-                    mask[b0:b1] = 1
-
-        return mask, indices
-
-
-class SlicedSwapCrossAttnProcesser(SlicedAttnProcessor):
-    # TODO: dynamically pick slice size based on memory conditions
-
-    def __call__(
-        self,
-        attn: Attention,
-        hidden_states,
-        encoder_hidden_states=None,
-        attention_mask=None,
-        # kwargs
-        swap_cross_attn_context: SwapCrossAttnContext = None,
-        **kwargs,
-    ):
-        attention_type = CrossAttentionType.SELF if encoder_hidden_states is None else CrossAttentionType.TOKENS
-
-        # if cross-attention control is not in play, just call through to the base implementation.
-        if (
-            attention_type is CrossAttentionType.SELF
-            or swap_cross_attn_context is None
-            or not swap_cross_attn_context.wants_cross_attention_control(attention_type)
-        ):
-            # print(f"SwapCrossAttnContext for {attention_type} not active - passing request to superclass")
-            return super().__call__(attn, hidden_states, encoder_hidden_states, attention_mask)
-        # else:
-        #    print(f"SwapCrossAttnContext for {attention_type} active")
-
-        batch_size, sequence_length, _ = hidden_states.shape
-        attention_mask = attn.prepare_attention_mask(
-            attention_mask=attention_mask,
-            target_length=sequence_length,
-            batch_size=batch_size,
-        )
-
-        query = attn.to_q(hidden_states)
-        dim = query.shape[-1]
-        query = attn.head_to_batch_dim(query)
-
-        original_text_embeddings = encoder_hidden_states
-        modified_text_embeddings = swap_cross_attn_context.modified_text_embeddings
-        original_text_key = attn.to_k(original_text_embeddings)
-        modified_text_key = attn.to_k(modified_text_embeddings)
-        original_value = attn.to_v(original_text_embeddings)
-        modified_value = attn.to_v(modified_text_embeddings)
-
-        original_text_key = attn.head_to_batch_dim(original_text_key)
-        modified_text_key = attn.head_to_batch_dim(modified_text_key)
-        original_value = attn.head_to_batch_dim(original_value)
-        modified_value = attn.head_to_batch_dim(modified_value)
-
-        # compute slices and prepare output tensor
-        batch_size_attention = query.shape[0]
-        hidden_states = torch.zeros(
-            (batch_size_attention, sequence_length, dim // attn.heads),
-            device=query.device,
-            dtype=query.dtype,
-        )
-
-        # do slices
-        for i in range(max(1, hidden_states.shape[0] // self.slice_size)):
-            start_idx = i * self.slice_size
-            end_idx = (i + 1) * self.slice_size
-
-            query_slice = query[start_idx:end_idx]
-            original_key_slice = original_text_key[start_idx:end_idx]
-            modified_key_slice = modified_text_key[start_idx:end_idx]
-            attn_mask_slice = attention_mask[start_idx:end_idx] if attention_mask is not None else None
-
-            original_attn_slice = attn.get_attention_scores(query_slice, original_key_slice, attn_mask_slice)
-            modified_attn_slice = attn.get_attention_scores(query_slice, modified_key_slice, attn_mask_slice)
-
-            # because the prompt modifications may result in token sequences shifted forwards or backwards,
-            # the original attention probabilities must be remapped to account for token index changes in the
-            # modified prompt
-            remapped_original_attn_slice = torch.index_select(
-                original_attn_slice, -1, swap_cross_attn_context.index_map
-            )
-
-            # only some tokens taken from the original attention probabilities. this is controlled by the mask.
-            mask = swap_cross_attn_context.mask
-            inverse_mask = 1 - mask
-            attn_slice = remapped_original_attn_slice * mask + modified_attn_slice * inverse_mask
-
-            del remapped_original_attn_slice, modified_attn_slice
-
-            attn_slice = torch.bmm(attn_slice, modified_value[start_idx:end_idx])
-            hidden_states[start_idx:end_idx] = attn_slice
-
-        # done
-        hidden_states = attn.batch_to_head_dim(hidden_states)
-
-        # linear proj
-        hidden_states = attn.to_out[0](hidden_states)
-        # dropout
-        hidden_states = attn.to_out[1](hidden_states)
-
-        return hidden_states
-
-
-class SwapCrossAttnProcessor(SlicedSwapCrossAttnProcesser):
-    def __init__(self):
-        super(SwapCrossAttnProcessor, self).__init__(slice_size=int(1e9))  # massive slice size = don't slice
--- a/invokeai/backend/stable_diffusion/diffusion/custom_atttention.py
+++ b/invokeai/backend/stable_diffusion/diffusion/custom_atttention.py
@@ -0,0 +1,214 @@
+from dataclasses import dataclass
+from typing import List, Optional, cast
+
+import torch
+import torch.nn.functional as F
+from diffusers.models.attention_processor import Attention, AttnProcessor2_0
+
+from invokeai.backend.ip_adapter.ip_attention_weights import IPAttentionProcessorWeights
+from invokeai.backend.stable_diffusion.diffusion.regional_ip_data import RegionalIPData
+from invokeai.backend.stable_diffusion.diffusion.regional_prompt_data import RegionalPromptData
+
+
+@dataclass
+class IPAdapterAttentionWeights:
+    ip_adapter_weights: IPAttentionProcessorWeights
+    skip: bool
+
+
+class CustomAttnProcessor2_0(AttnProcessor2_0):
+    """A custom implementation of AttnProcessor2_0 that supports additional Invoke features.
+    This implementation is based on
+    https://github.com/huggingface/diffusers/blame/fcfa270fbd1dc294e2f3a505bae6bcb791d721c3/src/diffusers/models/attention_processor.py#L1204
+    Supported custom features:
+    - IP-Adapter
+    - Regional prompt attention
+    """
+
+    def __init__(
+        self,
+        ip_adapter_attention_weights: Optional[List[IPAdapterAttentionWeights]] = None,
+    ):
+        """Initialize a CustomAttnProcessor2_0.
+        Note: Arguments that are the same for all attention layers are passed to __call__(). Arguments that are
+        layer-specific are passed to __init__().
+        Args:
+            ip_adapter_attention_weights: The IP-Adapter attention weights. ip_adapter_attention_weights[i] contains the
+                attention weights for the i'th IP-Adapter.
+        """
+        super().__init__()
+        self._ip_adapter_attention_weights = ip_adapter_attention_weights
+
+    def __call__(
+        self,
+        attn: Attention,
+        hidden_states: torch.Tensor,
+        encoder_hidden_states: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        temb: Optional[torch.Tensor] = None,
+        # For Regional Prompting:
+        regional_prompt_data: Optional[RegionalPromptData] = None,
+        percent_through: Optional[torch.Tensor] = None,
+        # For IP-Adapter:
+        regional_ip_data: Optional[RegionalIPData] = None,
+        *args,
+        **kwargs,
+    ) -> torch.FloatTensor:
+        """Apply attention.
+        Args:
+            regional_prompt_data: The regional prompt data for the current batch. If not None, this will be used to
+                apply regional prompt masking.
+            regional_ip_data: The IP-Adapter data for the current batch.
+        """
+        # If true, we are doing cross-attention, if false we are doing self-attention.
+        is_cross_attention = encoder_hidden_states is not None
+
+        # Start unmodified block from AttnProcessor2_0.
+        # vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
+        residual = hidden_states
+        if attn.spatial_norm is not None:
+            hidden_states = attn.spatial_norm(hidden_states, temb)
+
+        input_ndim = hidden_states.ndim
+
+        if input_ndim == 4:
+            batch_size, channel, height, width = hidden_states.shape
+            hidden_states = hidden_states.view(batch_size, channel, height * width).transpose(1, 2)
+
+        batch_size, sequence_length, _ = (
+            hidden_states.shape if encoder_hidden_states is None else encoder_hidden_states.shape
+        )
+        # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+        # End unmodified block from AttnProcessor2_0.
+
+        _, query_seq_len, _ = hidden_states.shape
+        # Handle regional prompt attention masks.
+        if regional_prompt_data is not None and is_cross_attention:
+            assert percent_through is not None
+            prompt_region_attention_mask = regional_prompt_data.get_cross_attn_mask(
+                query_seq_len=query_seq_len, key_seq_len=sequence_length
+            )
+
+            if attention_mask is None:
+                attention_mask = prompt_region_attention_mask
+            else:
+                attention_mask = prompt_region_attention_mask + attention_mask
+
+        # Start unmodified block from AttnProcessor2_0.
+        # vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
+        if attention_mask is not None:
+            attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length, batch_size)
+            # scaled_dot_product_attention expects attention_mask shape to be
+            # (batch, heads, source_length, target_length)
+            attention_mask = attention_mask.view(batch_size, attn.heads, -1, attention_mask.shape[-1])
+
+        if attn.group_norm is not None:
+            hidden_states = attn.group_norm(hidden_states.transpose(1, 2)).transpose(1, 2)
+
+        query = attn.to_q(hidden_states)
+
+        if encoder_hidden_states is None:
+            encoder_hidden_states = hidden_states
+        elif attn.norm_cross:
+            encoder_hidden_states = attn.norm_encoder_hidden_states(encoder_hidden_states)
+
+        key = attn.to_k(encoder_hidden_states)
+        value = attn.to_v(encoder_hidden_states)
+
+        inner_dim = key.shape[-1]
+        head_dim = inner_dim // attn.heads
+
+        query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
+
+        key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
+        value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
+
+        # the output of sdp = (batch, num_heads, seq_len, head_dim)
+        # TODO: add support for attn.scale when we move to Torch 2.1
+        hidden_states = F.scaled_dot_product_attention(
+            query, key, value, attn_mask=attention_mask, dropout_p=0.0, is_causal=False
+        )
+
+        hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
+        hidden_states = hidden_states.to(query.dtype)
+        # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+        # End unmodified block from AttnProcessor2_0.
+
+        # Apply IP-Adapter conditioning.
+        if is_cross_attention:
+            if self._ip_adapter_attention_weights:
+                assert regional_ip_data is not None
+                ip_masks = regional_ip_data.get_masks(query_seq_len=query_seq_len)
+
+                assert (
+                    len(regional_ip_data.image_prompt_embeds)
+                    == len(self._ip_adapter_attention_weights)
+                    == len(regional_ip_data.scales)
+                    == ip_masks.shape[1]
+                )
+
+                for ipa_index, ipa_embed in enumerate(regional_ip_data.image_prompt_embeds):
+                    ipa_weights = self._ip_adapter_attention_weights[ipa_index].ip_adapter_weights
+                    ipa_scale = regional_ip_data.scales[ipa_index]
+                    ip_mask = ip_masks[0, ipa_index, ...]
+
+                    # The batch dimensions should match.
+                    assert ipa_embed.shape[0] == encoder_hidden_states.shape[0]
+                    # The token_len dimensions should match.
+                    assert ipa_embed.shape[-1] == encoder_hidden_states.shape[-1]
+
+                    ip_hidden_states = ipa_embed
+
+                    # Expected ip_hidden_states shape: (batch_size, num_ip_images, ip_seq_len, ip_image_embedding)
+
+                    if not self._ip_adapter_attention_weights[ipa_index].skip:
+                        ip_key = ipa_weights.to_k_ip(ip_hidden_states)
+                        ip_value = ipa_weights.to_v_ip(ip_hidden_states)
+
+                        # Expected ip_key and ip_value shape:
+                        # (batch_size, num_ip_images, ip_seq_len, head_dim * num_heads)
+
+                        ip_key = ip_key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
+                        ip_value = ip_value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
+
+                        # Expected ip_key and ip_value shape:
+                        # (batch_size, num_heads, num_ip_images * ip_seq_len, head_dim)
+
+                        # TODO: add support for attn.scale when we move to Torch 2.1
+                        ip_hidden_states = F.scaled_dot_product_attention(
+                            query, ip_key, ip_value, attn_mask=None, dropout_p=0.0, is_causal=False
+                        )
+
+                        # Expected ip_hidden_states shape: (batch_size, num_heads, query_seq_len, head_dim)
+                        ip_hidden_states = ip_hidden_states.transpose(1, 2).reshape(
+                            batch_size, -1, attn.heads * head_dim
+                        )
+
+                        ip_hidden_states = ip_hidden_states.to(query.dtype)
+
+                        # Expected ip_hidden_states shape: (batch_size, query_seq_len, num_heads * head_dim)
+                        hidden_states = hidden_states + ipa_scale * ip_hidden_states * ip_mask
+            else:
+                # If IP-Adapter is not enabled, then regional_ip_data should not be passed in.
+                assert regional_ip_data is None
+
+        # Start unmodified block from AttnProcessor2_0.
+        # vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
+        # linear proj
+        hidden_states = attn.to_out[0](hidden_states)
+        # dropout
+        hidden_states = attn.to_out[1](hidden_states)
+
+        if input_ndim == 4:
+            batch_size, channel, height, width = hidden_states.shape
+            hidden_states = hidden_states.transpose(-1, -2).reshape(batch_size, channel, height, width)
+
+        if attn.residual_connection:
+            hidden_states = hidden_states + residual
+
+        hidden_states = hidden_states / attn.rescale_output_factor
+        # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+        # End unmodified block from AttnProcessor2_0.
+
+        # casting torch.Tensor to torch.FloatTensor to avoid type issues
+        return cast(torch.FloatTensor, hidden_states)
--- a/invokeai/backend/stable_diffusion/diffusion/regional_ip_data.py
+++ b/invokeai/backend/stable_diffusion/diffusion/regional_ip_data.py
@@ -0,0 +1,72 @@
+import torch
+
+
+class RegionalIPData:
+    """A class to manage the data for regional IP-Adapter conditioning."""
+
+    def __init__(
+        self,
+        image_prompt_embeds: list[torch.Tensor],
+        scales: list[float],
+        masks: list[torch.Tensor],
+        dtype: torch.dtype,
+        device: torch.device,
+        max_downscale_factor: int = 8,
+    ):
+        """Initialize a `RegionalIPData` object."""
+        assert len(image_prompt_embeds) == len(scales) == len(masks)
+
+        # The image prompt embeddings.
+        # image_prompt_embeds[i] contains the image prompt embeddings for the i'th IP-Adapter. Each tensor
+        # has shape (batch_size, num_ip_images, seq_len, ip_embedding_len).
+        self.image_prompt_embeds = image_prompt_embeds
+
+        # The scales for the IP-Adapter attention.
+        # scales[i] contains the attention scale for the i'th IP-Adapter.
+        self.scales = scales
+
+        # The IP-Adapter masks.
+        # self._masks_by_seq_len[s] contains the spatial masks for the downsampling level with query sequence length of
+        # s. It has shape (batch_size, num_ip_adapters, query_seq_len, 1). The masks have values of 1.0 for included
+        # regions and 0.0 for excluded regions.
+        self._masks_by_seq_len = self._prepare_masks(masks, max_downscale_factor, device, dtype)
+
+    def _prepare_masks(
+        self, masks: list[torch.Tensor], max_downscale_factor: int, device: torch.device, dtype: torch.dtype
+    ) -> dict[int, torch.Tensor]:
+        """Prepare the masks for the IP-Adapter attention."""
+        # Concatenate the masks so that they can be processed more efficiently.
+        mask_tensor = torch.cat(masks, dim=1)
+
+        mask_tensor = mask_tensor.to(device=device, dtype=dtype)
+
+        masks_by_seq_len: dict[int, torch.Tensor] = {}
+
+        # Downsample the spatial dimensions by factors of 2 until max_downscale_factor is reached.
+        downscale_factor = 1
+        while downscale_factor <= max_downscale_factor:
+            b, num_ip_adapters, h, w = mask_tensor.shape
+            # Assert that the batch size is 1, because I haven't thought through batch handling for this feature yet.
+            assert b == 1
+
+            # The IP-Adapters are applied in the cross-attention layers, where the query sequence length is the h * w of
+            # the spatial features.
+            query_seq_len = h * w
+
+            masks_by_seq_len[query_seq_len] = mask_tensor.view((b, num_ip_adapters, -1, 1))
+
+            downscale_factor *= 2
+            if downscale_factor <= max_downscale_factor:
+                # We use max pooling because we downscale to a pretty low resolution, so we don't want small mask
+                # regions to be lost entirely.
+                #
+                # ceil_mode=True is set to mirror the downsampling behavior of SD and SDXL.
+                #
+                # TODO(ryand): In the future, we may want to experiment with other downsampling methods.
+                mask_tensor = torch.nn.functional.max_pool2d(mask_tensor, kernel_size=2, stride=2, ceil_mode=True)
+
+        return masks_by_seq_len
+
+    def get_masks(self, query_seq_len: int) -> torch.Tensor:
+        """Get the mask for the given query sequence length."""
+        return self._masks_by_seq_len[query_seq_len]
--- a/invokeai/backend/stable_diffusion/diffusion/regional_prompt_data.py
+++ b/invokeai/backend/stable_diffusion/diffusion/regional_prompt_data.py
@@ -0,0 +1,105 @@
+import torch
+import torch.nn.functional as F
+
+from invokeai.backend.stable_diffusion.diffusion.conditioning_data import (
+    TextConditioningRegions,
+)
+
+
+class RegionalPromptData:
+    """A class to manage the prompt data for regional conditioning."""
+
+    def __init__(
+        self,
+        regions: list[TextConditioningRegions],
+        device: torch.device,
+        dtype: torch.dtype,
+        max_downscale_factor: int = 8,
+    ):
+        """Initialize a `RegionalPromptData` object.
+        Args:
+            regions (list[TextConditioningRegions]): regions[i] contains the prompt regions for the i'th sample in the
+                batch.
+            device (torch.device): The device to use for the attention masks.
+            dtype (torch.dtype): The data type to use for the attention masks.
+            max_downscale_factor: Spatial masks will be prepared for downscale factors from 1 to max_downscale_factor
+                in steps of 2x.
+        """
+        self._regions = regions
+        self._device = device
+        self._dtype = dtype
+        # self._spatial_masks_by_seq_len[b][s] contains the spatial masks for the b'th batch sample with a query
+        # sequence length of s.
+        self._spatial_masks_by_seq_len: list[dict[int, torch.Tensor]] = self._prepare_spatial_masks(
+            regions, max_downscale_factor
+        )
+        self._negative_cross_attn_mask_score = -10000.0
+
+    def _prepare_spatial_masks(
+        self, regions: list[TextConditioningRegions], max_downscale_factor: int = 8
+    ) -> list[dict[int, torch.Tensor]]:
+        """Prepare the spatial masks for all downscaling factors."""
+        # batch_sample_masks_by_seq_len[b][s] contains the spatial masks for the b'th batch sample with a query sequence
+        # length of s.
+        batch_sample_masks_by_seq_len: list[dict[int, torch.Tensor]] = []
+
+        for batch_sample_regions in regions:
+            batch_sample_masks_by_seq_len.append({})
+
+            batch_sample_masks = batch_sample_regions.masks.to(device=self._device, dtype=self._dtype)
+
+            # Downsample the spatial dimensions by factors of 2 until max_downscale_factor is reached.
+            downscale_factor = 1
+            while downscale_factor <= max_downscale_factor:
+                b, _num_prompts, h, w = batch_sample_masks.shape
+                assert b == 1
+                query_seq_len = h * w
+
+                batch_sample_masks_by_seq_len[-1][query_seq_len] = batch_sample_masks
+
+                downscale_factor *= 2
+                if downscale_factor <= max_downscale_factor:
+                    # We use max pooling because we downscale to a pretty low resolution, so we don't want small prompt
+                    # regions to be lost entirely.
+                    #
+                    # ceil_mode=True is set to mirror the downsampling behavior of SD and SDXL.
+                    #
+                    # TODO(ryand): In the future, we may want to experiment with other downsampling methods (e.g.
+                    # nearest interpolation), and could potentially use a weighted mask rather than a binary mask.
+                    batch_sample_masks = F.max_pool2d(batch_sample_masks, kernel_size=2, stride=2, ceil_mode=True)
+
+        return batch_sample_masks_by_seq_len
+
+    def get_cross_attn_mask(self, query_seq_len: int, key_seq_len: int) -> torch.Tensor:
+        """Get the cross-attention mask for the given query sequence length.
+        Args:
+            query_seq_len: The length of the flattened spatial features at the current downscaling level.
+            key_seq_len: The sequence length of the prompt embeddings (which act as the key in the cross-attention
+                layers). This is most likely equal to the max embedding range end, but we pass it explicitly to be sure.
+        Returns:
+            torch.Tensor: The cross-attention score mask.
+                shape: (batch_size, query_seq_len, key_seq_len).
+                dtype: float
+        """
+        batch_size = len(self._spatial_masks_by_seq_len)
+        batch_spatial_masks = [self._spatial_masks_by_seq_len[b][query_seq_len] for b in range(batch_size)]
+
+        # Create an empty attention mask with the correct shape.
+        attn_mask = torch.zeros((batch_size, query_seq_len, key_seq_len), dtype=self._dtype, device=self._device)
+
+        for batch_idx in range(batch_size):
+            batch_sample_spatial_masks = batch_spatial_masks[batch_idx]
+            batch_sample_regions = self._regions[batch_idx]
+
+            # Flatten the spatial dimensions of the mask by reshaping to (1, num_prompts, query_seq_len, 1).
+            _, num_prompts, _, _ = batch_sample_spatial_masks.shape
+            batch_sample_query_masks = batch_sample_spatial_masks.view((1, num_prompts, query_seq_len, 1))
+
+            for prompt_idx, embedding_range in enumerate(batch_sample_regions.ranges):
+                batch_sample_query_scores = batch_sample_query_masks[0, prompt_idx, :, :].clone()
+                batch_sample_query_mask = batch_sample_query_scores > 0.5
+                batch_sample_query_scores[batch_sample_query_mask] = 0.0
+                batch_sample_query_scores[~batch_sample_query_mask] = self._negative_cross_attn_mask_score
+                attn_mask[batch_idx, :, embedding_range.start : embedding_range.end] = batch_sample_query_scores
+
+        return attn_mask
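Reviewer note on `_prepare_spatial_masks` and `get_cross_attn_mask`: the sketch below restates the two ideas in plain Python, without torch, as a reading aid. It is not the code from this commit. The helper names `pyramid_seq_lens` and `additive_mask` are hypothetical, nested lists stand in for tensors, and the `ceil(n / 2)` halving rule is an assumption about how a 2x2, stride-2 max pool with `ceil_mode=True` sizes its output.

```python
import math

NEGATIVE_SCORE = -10000.0  # same value as _negative_cross_attn_mask_score in the diff


def pyramid_seq_lens(h: int, w: int, max_downscale_factor: int = 8) -> dict[int, tuple[int, int]]:
    """Map each query sequence length (h * w) to the mask resolution that produces it."""
    shapes: dict[int, tuple[int, int]] = {}
    factor = 1
    while factor <= max_downscale_factor:
        shapes[h * w] = (h, w)
        factor *= 2
        # Assumed equivalent of a 2x2, stride-2 pool with ceil_mode=True: odd sizes round up.
        h, w = math.ceil(h / 2), math.ceil(w / 2)
    return shapes


def additive_mask(
    flat_masks: list[list[float]], ranges: list[tuple[int, int]], key_seq_len: int
) -> list[list[float]]:
    """Build a (query_seq_len, key_seq_len) score mask for one batch sample.

    flat_masks[p][q] is the flattened spatial mask value of prompt p at query position q.
    ranges[p] is the (start, end) token range of prompt p in the concatenated embeddings.
    """
    query_seq_len = len(flat_masks[0])
    # Start from zeros, so tokens outside every range stay unmasked.
    scores = [[0.0] * key_seq_len for _ in range(query_seq_len)]
    for mask, (start, end) in zip(flat_masks, ranges):
        for q, value in enumerate(mask):
            # Inside the region: no penalty. Outside: a large negative score.
            score = 0.0 if value > 0.5 else NEGATIVE_SCORE
            for k in range(start, end):
                scores[q][k] = score
    return scores
```

For a 100x60 latent, `pyramid_seq_lens(100, 60)` yields the keys 6000, 1500, 375 and 104, where the last level is 13x8 because 25 and 15 round up when halved. This is why the masks can be looked up by `query_seq_len` alone at each UNet resolution.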
--- a/invokeai/backend/stable_diffusion/diffusion/shared_invokeai_diffusion.py
+++ b/invokeai/backend/stable_diffusion/diffusion/shared_invokeai_diffusion.py
@@ -1,26 +1,20 @@
 from __future__ import annotations

 import math
-from contextlib import contextmanager
 from typing import Any, Callable, Optional, Union

 import torch
-from diffusers import UNet2DConditionModel
 from typing_extensions import TypeAlias

 from invokeai.app.services.config.config_default import get_config
 from invokeai.backend.stable_diffusion.diffusion.conditioning_data import (
-    ConditioningData,
-    ExtraConditioningInfo,
-    SDXLConditioningInfo,
-)
-
-from .cross_attention_control import (
-    CrossAttentionType,
-    CrossAttnControlContext,
-    SwapCrossAttnContext,
-    setup_cross_attention_control_attention_processors,
+    IPAdapterData,
+    Range,
+    TextConditioningData,
+    TextConditioningRegions,
 )
+from invokeai.backend.stable_diffusion.diffusion.regional_ip_data import RegionalIPData
+from invokeai.backend.stable_diffusion.diffusion.regional_prompt_data import RegionalPromptData

 ModelForwardCallback: TypeAlias = Union[
    # x, t, conditioning, Optional[cross-attention kwargs]
@@ -58,31 +52,8 @@ class InvokeAIDiffuserComponent:
        self.conditioning = None
        self.model = model
        self.model_forward_callback = model_forward_callback
-        self.cross_attention_control_context = None
        self.sequential_guidance = config.sequential_guidance

-    @contextmanager
-    def custom_attention_context(
-        self,
-        unet: UNet2DConditionModel,
-        extra_conditioning_info: Optional[ExtraConditioningInfo],
-    ):
-        old_attn_processors = unet.attn_processors
-
-        try:
-            self.cross_attention_control_context = CrossAttnControlContext(
-                arguments=extra_conditioning_info.cross_attention_control_args,
-            )
-            setup_cross_attention_control_attention_processors(
-                unet,
-                self.cross_attention_control_context,
-            )
-
-            yield None
-        finally:
-            self.cross_attention_control_context = None
-            unet.set_attn_processor(old_attn_processors)
-
    def do_controlnet_step(
        self,
        control_data,
@@ -90,7 +61,7 @@ class InvokeAIDiffuserComponent:
        timestep: torch.Tensor,
        step_index: int,
        total_step_count: int,
-        conditioning_data,
+        conditioning_data: TextConditioningData,
    ):
        down_block_res_samples, mid_block_res_sample = None, None

@@ -123,28 +94,28 @@ class InvokeAIDiffuserComponent:
                added_cond_kwargs = None

                if cfg_injection:  # only applying ControlNet to conditional instead of in unconditioned
-                    if type(conditioning_data.text_embeddings) is SDXLConditioningInfo:
+                    if conditioning_data.is_sdxl():
                        added_cond_kwargs = {
-                            "text_embeds": conditioning_data.text_embeddings.pooled_embeds,
-                            "time_ids": conditioning_data.text_embeddings.add_time_ids,
+                            "text_embeds": conditioning_data.cond_text.pooled_embeds,
+                            "time_ids": conditioning_data.cond_text.add_time_ids,
                        }
-                    encoder_hidden_states = conditioning_data.text_embeddings.embeds
+                    encoder_hidden_states = conditioning_data.cond_text.embeds
                    encoder_attention_mask = None
                else:
-                    if type(conditioning_data.text_embeddings) is SDXLConditioningInfo:
+                    if conditioning_data.is_sdxl():
                        added_cond_kwargs = {
                            "text_embeds": torch.cat(
                                [
                                    # TODO: how to pad? just by zeros? or even truncate?
-                                    conditioning_data.unconditioned_embeddings.pooled_embeds,
-                                    conditioning_data.text_embeddings.pooled_embeds,
+                                    conditioning_data.uncond_text.pooled_embeds,
+                                    conditioning_data.cond_text.pooled_embeds,
                                ],
                                dim=0,
                            ),
                            "time_ids": torch.cat(
                                [
-                                    conditioning_data.unconditioned_embeddings.add_time_ids,
-                                    conditioning_data.text_embeddings.add_time_ids,
+                                    conditioning_data.uncond_text.add_time_ids,
+                                    conditioning_data.cond_text.add_time_ids,
                                ],
                                dim=0,
                            ),
@@ -153,8 +124,8 @@ class InvokeAIDiffuserComponent:
                        encoder_hidden_states,
                        encoder_attention_mask,
                    ) = self._concat_conditionings_for_batch(
-                        conditioning_data.unconditioned_embeddings.embeds,
-                        conditioning_data.text_embeddings.embeds,
+                        conditioning_data.uncond_text.embeds,
+                        conditioning_data.cond_text.embeds,
                    )
                if isinstance(control_datum.weight, list):
                    # if controlnet has multiple weights, use the weight for the current step
@@ -198,24 +169,15 @@ class InvokeAIDiffuserComponent:
        self,
        sample: torch.Tensor,
        timestep: torch.Tensor,
-        conditioning_data: ConditioningData,
+        conditioning_data: TextConditioningData,
+        ip_adapter_data: Optional[list[IPAdapterData]],
        step_index: int,
        total_step_count: int,
        down_block_additional_residuals: Optional[torch.Tensor] = None,  # for ControlNet
        mid_block_additional_residual: Optional[torch.Tensor] = None,  # for ControlNet
        down_intrablock_additional_residuals: Optional[torch.Tensor] = None,  # for T2I-Adapter
    ):
-        cross_attention_control_types_to_do = []
-        if self.cross_attention_control_context is not None:
-            percent_through = step_index / total_step_count
-            cross_attention_control_types_to_do = (
-                self.cross_attention_control_context.get_active_cross_attention_control_types_for_step(percent_through)
-            )
-        wants_cross_attention_control = len(cross_attention_control_types_to_do) > 0
-
-        if wants_cross_attention_control or self.sequential_guidance:
-            # If wants_cross_attention_control is True, we force the sequential mode to be used, because cross-attention
-            # control is currently only supported in sequential mode.
+        if self.sequential_guidance:
            (
                unconditioned_next_x,
                conditioned_next_x,
@@ -223,7 +185,9 @@ class InvokeAIDiffuserComponent:
                x=sample,
                sigma=timestep,
                conditioning_data=conditioning_data,
-                cross_attention_control_types_to_do=cross_attention_control_types_to_do,
+                ip_adapter_data=ip_adapter_data,
+                step_index=step_index,
+                total_step_count=total_step_count,
                down_block_additional_residuals=down_block_additional_residuals,
                mid_block_additional_residual=mid_block_additional_residual,
                down_intrablock_additional_residuals=down_intrablock_additional_residuals,
@@ -236,6 +200,9 @@ class InvokeAIDiffuserComponent:
                x=sample,
                sigma=timestep,
                conditioning_data=conditioning_data,
+                ip_adapter_data=ip_adapter_data,
+                step_index=step_index,
+                total_step_count=total_step_count,
                down_block_additional_residuals=down_block_additional_residuals,
                mid_block_additional_residual=mid_block_additional_residual,
                down_intrablock_additional_residuals=down_intrablock_additional_residuals,
@@ -294,53 +261,84 @@ class InvokeAIDiffuserComponent:

    def _apply_standard_conditioning(
        self,
-        x,
-        sigma,
-        conditioning_data: ConditioningData,
+        x: torch.Tensor,
+        sigma: torch.Tensor,
+        conditioning_data: TextConditioningData,
+        ip_adapter_data: Optional[list[IPAdapterData]],
+        step_index: int,
+        total_step_count: int,
        down_block_additional_residuals: Optional[torch.Tensor] = None,  # for ControlNet
        mid_block_additional_residual: Optional[torch.Tensor] = None,  # for ControlNet
        down_intrablock_additional_residuals: Optional[torch.Tensor] = None,  # for T2I-Adapter
-    ):
+    ) -> tuple[torch.Tensor, torch.Tensor]:
        """Runs the conditioned and unconditioned UNet forward passes in a single batch for faster inference speed at
        the cost of higher memory usage.
        """
        x_twice = torch.cat([x] * 2)
        sigma_twice = torch.cat([sigma] * 2)

-        cross_attention_kwargs = None
-        if conditioning_data.ip_adapter_conditioning is not None:
+        cross_attention_kwargs = {}
+        if ip_adapter_data is not None:
+            ip_adapter_conditioning = [ipa.ip_adapter_conditioning for ipa in ip_adapter_data]
            # Note that we 'stack' to produce tensors of shape (batch_size, num_ip_images, seq_len, token_len).
-            cross_attention_kwargs = {
-                "ip_adapter_image_prompt_embeds": [
-                    torch.stack(
-                        [ipa_conditioning.uncond_image_prompt_embeds, ipa_conditioning.cond_image_prompt_embeds]
-                    )
-                    for ipa_conditioning in conditioning_data.ip_adapter_conditioning
-                ]
-            }
+            image_prompt_embeds = [
+                torch.stack([ipa_conditioning.uncond_image_prompt_embeds, ipa_conditioning.cond_image_prompt_embeds])
+                for ipa_conditioning in ip_adapter_conditioning
+            ]
+            scales = [ipa.scale_for_step(step_index, total_step_count) for ipa in ip_adapter_data]
+            ip_masks = [ipa.mask for ipa in ip_adapter_data]
+            regional_ip_data = RegionalIPData(
+                image_prompt_embeds=image_prompt_embeds, scales=scales, masks=ip_masks, dtype=x.dtype, device=x.device
+            )
+            cross_attention_kwargs["regional_ip_data"] = regional_ip_data

        added_cond_kwargs = None
-        if type(conditioning_data.text_embeddings) is SDXLConditioningInfo:
+        if conditioning_data.is_sdxl():
            added_cond_kwargs = {
                "text_embeds": torch.cat(
                    [
                        # TODO: how to pad? just by zeros? or even truncate?
-                        conditioning_data.unconditioned_embeddings.pooled_embeds,
-                        conditioning_data.text_embeddings.pooled_embeds,
+                        conditioning_data.uncond_text.pooled_embeds,
+                        conditioning_data.cond_text.pooled_embeds,
                    ],
                    dim=0,
                ),
                "time_ids": torch.cat(
                    [
-                        conditioning_data.unconditioned_embeddings.add_time_ids,
-                        conditioning_data.text_embeddings.add_time_ids,
+                        conditioning_data.uncond_text.add_time_ids,
+                        conditioning_data.cond_text.add_time_ids,
                    ],
                    dim=0,
                ),
            }

+        if conditioning_data.cond_regions is not None or conditioning_data.uncond_regions is not None:
+            # TODO(ryand): We currently initialize RegionalPromptData for every denoising step. The text conditionings
+            # and masks are not changing from step-to-step, so this really only needs to be done once. While this seems
+            # painfully inefficient, the time spent is typically negligible compared to the forward inference pass of
+            # the UNet. The main reason that this hasn't been moved up to eliminate redundancy is that it is slightly
+            # awkward to handle both standard conditioning and sequential conditioning further up the stack.
+            regions = []
+            for c, r in [
+                (conditioning_data.uncond_text, conditioning_data.uncond_regions),
+                (conditioning_data.cond_text, conditioning_data.cond_regions),
+            ]:
+                if r is None:
+                    # Create a dummy mask and range for text conditioning that doesn't have region masks.
+                    _, _, h, w = x.shape
+                    r = TextConditioningRegions(
+                        masks=torch.ones((1, 1, h, w), dtype=x.dtype),
+                        ranges=[Range(start=0, end=c.embeds.shape[1])],
+                    )
+                regions.append(r)
+
+            cross_attention_kwargs["regional_prompt_data"] = RegionalPromptData(
+                regions=regions, device=x.device, dtype=x.dtype
+            )
+            cross_attention_kwargs["percent_through"] = step_index / total_step_count
+
        both_conditionings, encoder_attention_mask = self._concat_conditionings_for_batch(
-            conditioning_data.unconditioned_embeddings.embeds, conditioning_data.text_embeddings.embeds
+            conditioning_data.uncond_text.embeds, conditioning_data.cond_text.embeds
        )
        both_results = self.model_forward_callback(
            x_twice,
@@ -360,8 +358,10 @@ class InvokeAIDiffuserComponent:
        self,
        x: torch.Tensor,
        sigma,
-        conditioning_data: ConditioningData,
-        cross_attention_control_types_to_do: list[CrossAttentionType],
+        conditioning_data: TextConditioningData,
+        ip_adapter_data: Optional[list[IPAdapterData]],
+        step_index: int,
+        total_step_count: int,
        down_block_additional_residuals: Optional[torch.Tensor] = None,  # for ControlNet
        mid_block_additional_residual: Optional[torch.Tensor] = None,  # for ControlNet
        down_intrablock_additional_residuals: Optional[torch.Tensor] = None,  # for T2I-Adapter
@@ -391,53 +391,48 @@ class InvokeAIDiffuserComponent:
        if mid_block_additional_residual is not None:
            uncond_mid_block, cond_mid_block = mid_block_additional_residual.chunk(2)

-        # If cross-attention control is enabled, prepare the SwapCrossAttnContext.
-        cross_attn_processor_context = None
-        if self.cross_attention_control_context is not None:
-            # Note that the SwapCrossAttnContext is initialized with an empty list of cross_attention_types_to_do.
-            # This list is empty because cross-attention control is not applied in the unconditioned pass. This field
-            # will be populated before the conditioned pass.
-            cross_attn_processor_context = SwapCrossAttnContext(
-                modified_text_embeddings=self.cross_attention_control_context.arguments.edited_conditioning,
-                index_map=self.cross_attention_control_context.cross_attention_index_map,
-                mask=self.cross_attention_control_context.cross_attention_mask,
-                cross_attention_types_to_do=[],
-            )
-
        #####################
        # Unconditioned pass
        #####################

-        cross_attention_kwargs = None
+        cross_attention_kwargs = {}

        # Prepare IP-Adapter cross-attention kwargs for the unconditioned pass.
-        if conditioning_data.ip_adapter_conditioning is not None:
+        if ip_adapter_data is not None:
+            ip_adapter_conditioning = [ipa.ip_adapter_conditioning for ipa in ip_adapter_data]
            # Note that we 'unsqueeze' to produce tensors of shape (batch_size=1, num_ip_images, seq_len, token_len).
-            cross_attention_kwargs = {
-                "ip_adapter_image_prompt_embeds": [
-                    torch.unsqueeze(ipa_conditioning.uncond_image_prompt_embeds, dim=0)
-                    for ipa_conditioning in conditioning_data.ip_adapter_conditioning
-                ]
-            }
+            image_prompt_embeds = [
+                torch.unsqueeze(ipa_conditioning.uncond_image_prompt_embeds, dim=0)
+                for ipa_conditioning in ip_adapter_conditioning
+            ]

-        # Prepare cross-attention control kwargs for the unconditioned pass.
-        if cross_attn_processor_context is not None:
-            cross_attention_kwargs = {"swap_cross_attn_context": cross_attn_processor_context}
+            scales = [ipa.scale_for_step(step_index, total_step_count) for ipa in ip_adapter_data]
+            ip_masks = [ipa.mask for ipa in ip_adapter_data]
+            regional_ip_data = RegionalIPData(
+                image_prompt_embeds=image_prompt_embeds, scales=scales, masks=ip_masks, dtype=x.dtype, device=x.device
+            )
+            cross_attention_kwargs["regional_ip_data"] = regional_ip_data

        # Prepare SDXL conditioning kwargs for the unconditioned pass.
        added_cond_kwargs = None
-        is_sdxl = type(conditioning_data.text_embeddings) is SDXLConditioningInfo
-        if is_sdxl:
+        if conditioning_data.is_sdxl():
            added_cond_kwargs = {
-                "text_embeds": conditioning_data.unconditioned_embeddings.pooled_embeds,
-                "time_ids": conditioning_data.unconditioned_embeddings.add_time_ids,
+                "text_embeds": conditioning_data.uncond_text.pooled_embeds,
+                "time_ids": conditioning_data.uncond_text.add_time_ids,
            }

+        # Prepare prompt regions for the unconditioned pass.
+        if conditioning_data.uncond_regions is not None:
+            cross_attention_kwargs["regional_prompt_data"] = RegionalPromptData(
+                regions=[conditioning_data.uncond_regions], device=x.device, dtype=x.dtype
+            )
+            cross_attention_kwargs["percent_through"] = step_index / total_step_count
+
        # Run unconditioned UNet denoising (i.e. negative prompt).
        unconditioned_next_x = self.model_forward_callback(
            x,
            sigma,
-            conditioning_data.unconditioned_embeddings.embeds,
+            conditioning_data.uncond_text.embeds,
            cross_attention_kwargs=cross_attention_kwargs,
            down_block_additional_residuals=uncond_down_block,
            mid_block_additional_residual=uncond_mid_block,
@@ -449,36 +444,43 @@ class InvokeAIDiffuserComponent:
        # Conditioned pass
        ###################

-        cross_attention_kwargs = None
+        cross_attention_kwargs = {}

-        # Prepare IP-Adapter cross-attention kwargs for the conditioned pass.
-        if conditioning_data.ip_adapter_conditioning is not None:
+        if ip_adapter_data is not None:
+            ip_adapter_conditioning = [ipa.ip_adapter_conditioning for ipa in ip_adapter_data]
            # Note that we 'unsqueeze' to produce tensors of shape (batch_size=1, num_ip_images, seq_len, token_len).
-            cross_attention_kwargs = {
-                "ip_adapter_image_prompt_embeds": [
-                    torch.unsqueeze(ipa_conditioning.cond_image_prompt_embeds, dim=0)
-                    for ipa_conditioning in conditioning_data.ip_adapter_conditioning
-                ]
-            }
+            image_prompt_embeds = [
+                torch.unsqueeze(ipa_conditioning.cond_image_prompt_embeds, dim=0)
+                for ipa_conditioning in ip_adapter_conditioning
+            ]

-        # Prepare cross-attention control kwargs for the conditioned pass.
-        if cross_attn_processor_context is not None:
-            cross_attn_processor_context.cross_attention_types_to_do = cross_attention_control_types_to_do
-            cross_attention_kwargs = {"swap_cross_attn_context": cross_attn_processor_context}
+            scales = [ipa.scale_for_step(step_index, total_step_count) for ipa in ip_adapter_data]
+            ip_masks = [ipa.mask for ipa in ip_adapter_data]
+            regional_ip_data = RegionalIPData(
+                image_prompt_embeds=image_prompt_embeds, scales=scales, masks=ip_masks, dtype=x.dtype, device=x.device
+            )
+            cross_attention_kwargs["regional_ip_data"] = regional_ip_data

        # Prepare SDXL conditioning kwargs for the conditioned pass.
        added_cond_kwargs = None
-        if is_sdxl:
+        if conditioning_data.is_sdxl():
            added_cond_kwargs = {
-                "text_embeds": conditioning_data.text_embeddings.pooled_embeds,
-                "time_ids": conditioning_data.text_embeddings.add_time_ids,
+                "text_embeds": conditioning_data.cond_text.pooled_embeds,
+                "time_ids": conditioning_data.cond_text.add_time_ids,
            }

+        # Prepare prompt regions for the conditioned pass.
+        if conditioning_data.cond_regions is not None:
+            cross_attention_kwargs["regional_prompt_data"] = RegionalPromptData(
+                regions=[conditioning_data.cond_regions], device=x.device, dtype=x.dtype
+            )
+            cross_attention_kwargs["percent_through"] = step_index / total_step_count
+
        # Run conditioned UNet denoising (i.e. positive prompt).
        conditioned_next_x = self.model_forward_callback(
            x,
            sigma,
-            conditioning_data.text_embeddings.embeds,
+            conditioning_data.cond_text.embeds,
            cross_attention_kwargs=cross_attention_kwargs,
            down_block_additional_residuals=cond_down_block,
            mid_block_additional_residual=cond_mid_block,
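Reviewer note on the batched path in `_apply_standard_conditioning`: it needs one region entry for each half of the batch (unconditioned, then conditioned), so a missing entry is replaced by an all-ones mask whose range spans every token. A rough sketch of that fallback follows. `Range` and `Regions` here are simplified hypothetical dataclasses rather than the real `Range` and `TextConditioningRegions`, nested lists replace tensors, and `fill_missing_regions` is a name invented for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Range:
    start: int
    end: int


@dataclass
class Regions:
    masks: list  # nested lists standing in for a (1, num_prompts, h, w) tensor
    ranges: list  # one Range per prompt


def fill_missing_regions(pairs: list[tuple[int, Optional[Regions]]], h: int, w: int) -> list[Regions]:
    """pairs holds (num_tokens, regions_or_None), unconditioned first, then conditioned."""
    filled = []
    for num_tokens, regions in pairs:
        if regions is None:
            # Dummy region: a single mask of ones with shape (1, 1, h, w) covering all tokens.
            ones = [[[[1.0] * w for _ in range(h)]]]
            regions = Regions(masks=ones, ranges=[Range(start=0, end=num_tokens)])
        filled.append(regions)
    return filled


def percent_through(step_index: int, total_step_count: int) -> float:
    """Fraction of the denoising schedule completed, as passed in cross_attention_kwargs."""
    return step_index / total_step_count
```

The sequential path does not need this fallback because each pass builds its own `RegionalPromptData` only when its own regions are present.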
--- a/invokeai/backend/stable_diffusion/diffusion/unet_attention_patcher.py
+++ b/invokeai/backend/stable_diffusion/diffusion/unet_attention_patcher.py
@@ -0,0 +1,68 @@
+from contextlib import contextmanager
+from typing import List, Optional, TypedDict
+
+from diffusers.models import UNet2DConditionModel
+
+from invokeai.backend.ip_adapter.ip_adapter import IPAdapter
+from invokeai.backend.stable_diffusion.diffusion.custom_atttention import (
+    CustomAttnProcessor2_0,
+    IPAdapterAttentionWeights,
+)
+
+
+class UNetIPAdapterData(TypedDict):
+    ip_adapter: IPAdapter
+    target_blocks: List[str]
+
+
+class UNetAttentionPatcher:
+    """A class for patching a UNet with CustomAttnProcessor2_0 attention layers."""
+
+    def __init__(self, ip_adapter_data: Optional[List[UNetIPAdapterData]]):
+        self._ip_adapters = ip_adapter_data
+
+    def _prepare_attention_processors(self, unet: UNet2DConditionModel):
+        """Prepare a dict of attention processors that can be injected into a unet, and load the IP-Adapter attention
+        weights into them (if IP-Adapters are being applied).
+        Note that the `unet` param is only used to determine attention block dimensions and naming.
+        """
+        # Construct a dict of attention processors based on the UNet's architecture.
+        attn_procs = {}
+        for idx, name in enumerate(unet.attn_processors.keys()):
+            if name.endswith("attn1.processor") or self._ip_adapters is None:
+                # "attn1" processors do not use IP-Adapters.
+                attn_procs[name] = CustomAttnProcessor2_0()
+            else:
+                # Collect the weights from each IP-Adapter for the idx'th attention processor.
+                ip_adapter_attention_weights_collection: list[IPAdapterAttentionWeights] = []
+
+                for ip_adapter in self._ip_adapters:
+                    ip_adapter_weights = ip_adapter["ip_adapter"].attn_weights.get_attention_processor_weights(idx)
+                    skip = True
+                    for block in ip_adapter["target_blocks"]:
+                        if block in name:
+                            skip = False
+                            break
+                    ip_adapter_attention_weights: IPAdapterAttentionWeights = IPAdapterAttentionWeights(
+                        ip_adapter_weights=ip_adapter_weights, skip=skip
+                    )
+                    ip_adapter_attention_weights_collection.append(ip_adapter_attention_weights)
+
+                attn_procs[name] = CustomAttnProcessor2_0(ip_adapter_attention_weights_collection)
+
+        return attn_procs
+
+    @contextmanager
+    def apply_ip_adapter_attention(self, unet: UNet2DConditionModel):
+        """A context manager that patches `unet` with CustomAttnProcessor2_0 attention layers."""
+        attn_procs = self._prepare_attention_processors(unet)
+        orig_attn_processors = unet.attn_processors
+
+        try:
+            # Note to future devs: set_attn_processor(...) does something slightly unexpected - it pops elements from
+            # the passed dict. So, if you wanted to keep the dict for future use, you'd have to make a
+            # moderately-shallow copy of it. E.g. `attn_procs_copy = {k: v for k, v in attn_procs.items()}`.
+            unet.set_attn_processor(attn_procs)
+            yield None
+        finally:
+            unet.set_attn_processor(orig_attn_processors)
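Reviewer note on `apply_ip_adapter_attention`: the patch-and-restore pattern, and the dict-popping caveat mentioned in the code comment, can be tried in isolation with the sketch below. `FakeUNet` is a hypothetical stand-in that only mimics the getter and setter behaviour described in the diff. It is not the diffusers class, and `patched_attention` is a name invented for this illustration.

```python
from contextlib import contextmanager


class FakeUNet:
    """Hypothetical stand-in for a UNet; only mimics the processor get/set behaviour."""

    def __init__(self, processors: dict):
        self._processors = dict(processors)

    @property
    def attn_processors(self) -> dict:
        # Return a fresh dict each time, so callers hold a snapshot.
        return dict(self._processors)

    def set_attn_processor(self, processors: dict) -> None:
        # Mimic the behaviour noted in the diff's comment: entries are popped from the passed dict.
        for name in list(self._processors):
            self._processors[name] = processors.pop(name)


@contextmanager
def patched_attention(unet: FakeUNet, new_processors: dict):
    """Temporarily swap in new processors, restoring the originals on exit."""
    original = unet.attn_processors
    try:
        unet.set_attn_processor(new_processors)
        yield None
    finally:
        # Runs even if the body raises, so the UNet is always restored.
        unet.set_attn_processor(original)
```

After the `with` block exits, the dict passed as `new_processors` is empty in this sketch, so a caller who wants to reuse it would copy it first, as the comment in the diff suggests.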