When installing a model, the previous, graceful logic would increment a
suffix on the destination path until it found a free path for the model.
But because model file installation and record creation are not wrapped
in a transaction, we could end up moving the file successfully and then
failing to create the record:
- User attempts to install an already-installed model
- Attempt to move the downloaded model from download tempdir to
destination path
- The path already exists
- Add `_1` or similar to the path until we find a path that is free
- Move the model
- Create the model record
- FK constraint violation because we already have a model with that
  name, but the model file has already been moved into the invokeai dir.
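For illustration, a minimal sketch of the suffix-increment behaviour described above (the helper name is hypothetical, not the actual installer code):

```python
from pathlib import Path

def _find_free_path(dest: Path) -> Path:
    """Append _1, _2, ... to the stem until the path does not exist."""
    candidate = dest
    counter = 1
    while candidate.exists():
        candidate = dest.with_name(f"{dest.stem}_{counter}{dest.suffix}")
        counter += 1
    return candidate

# The file is then moved to the free path - but the record insert that
# follows is not part of the same transaction, so a constraint violation
# there leaves an orphaned model file inside the invokeai dir.
```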
Closes #8416
When unsafe_disable_picklescan is enabled, a warning is logged instead
of erroring on detections or scan failures.
A warning is also logged on app startup when this setting is enabled.
The setting is disabled by default and there is no change in behaviour
when disabled.
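A rough sketch of the intended behaviour (the function name and flags are illustrative, not the exact implementation):

```python
import logging

logger = logging.getLogger(__name__)

def check_scan_result(scan_failed: bool, threats_found: bool, unsafe_disable_picklescan: bool) -> None:
    """Raise on scan failures/detections unless the unsafe setting is enabled."""
    if not (scan_failed or threats_found):
        return
    message = "Picklescan failed" if scan_failed else "Potentially malicious pickle detected"
    if unsafe_disable_picklescan:
        # Setting enabled: log a warning instead of refusing the model
        logger.warning("%s (unsafe_disable_picklescan is enabled)", message)
    else:
        raise RuntimeError(message)
```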
It's not clear why we were copying downloaded models to the destination
dir instead of moving them. I cannot find a reason for it, and I am able
to install single-file and diffusers models just fine with the change.
This fixes an issue where model installation required 2x the model's
size in disk space (because we were copying the model over).
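A sketch of the difference (paths are placeholders):

```python
import shutil

tmp_path = "/tmp/downloads/model.safetensors"           # placeholder paths
dest_path = "/data/invokeai/models/model.safetensors"

# Before: copying leaves the original in the download tempdir, so peak
# disk usage during install is roughly 2x the model size.
# shutil.copy2(tmp_path, dest_path)

# After: moving means only one copy of the file exists at any point.
shutil.move(tmp_path, dest_path)
```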
Previously, we used pathlib's `with_suffix()` method to add a suffix
(e.g. ".safetensors") to a model when installing it.
The intention was to add a suffix to the model's name - but that method
actually replaces everything after the last period in the name.
This can cause different models to be installed under the same name!
For example, the FLUX models all end up with the same name:
- "FLUX.1 schnell.safetensors" -> "FLUX.safetensors"
- "FLUX.1 dev.safetensors" -> "FLUX.safetensors"
The fix is easy - append the suffix using string formatting instead of
using pathlib.
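Concretely, the difference is easy to see in a Python shell:

```python
from pathlib import Path

name = "FLUX.1 schnell"

# Old behaviour: with_suffix() treats ".1 schnell" as the suffix and
# replaces it, collapsing distinct model names into the same path.
print(Path(name).with_suffix(".safetensors"))  # FLUX.safetensors

# Fix: append the suffix with string formatting instead.
print(Path(f"{name}.safetensors"))  # FLUX.1 schnell.safetensors
```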
This issue has existed for a long time, but was exacerbated in
075345bffd in which I updated the names of
our starter models, adding ".1" to the FLUX model names. Whoops!
- Add a context manager to the SqliteDatabase class which abstracts away
creating a transaction, committing it on success and rolling back on
error.
- Use it everywhere. The context manager is exited before results are
returned, and there are no business logic changes.
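A minimal sketch of the idea, assuming a raw sqlite3 connection (not the actual SqliteDatabase code):

```python
import sqlite3
from collections.abc import Iterator
from contextlib import contextmanager

@contextmanager
def transaction(conn: sqlite3.Connection) -> Iterator[sqlite3.Cursor]:
    """Yield a cursor; commit on success, roll back on any exception."""
    cursor = conn.cursor()
    try:
        yield cursor
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cursor.close()

# Usage: fetch results and exit the context before returning them.
def get_rows(conn: sqlite3.Connection) -> list[tuple]:
    with transaction(conn) as cursor:
        cursor.execute("SELECT 1")
        rows = cursor.fetchall()
    return rows

conn = sqlite3.connect(":memory:")
print(get_rows(conn))  # [(1,)]
```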
The previous logic had a subtle python bug related to scoping and nested
generators.
Python generators are lazily evaluated - the expressions are stored and
only evaluated when needed (e.g. when calling next() or list() on them).
The old logic used a variable `s`, which was continually overwritten as
the generator expressions were created. As a result, the final mappings
all use the _final_ value for `s`.
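A minimal reproduction of the scoping issue (simplified; not the actual graph-mapping code):

```python
sources = ["node_a", "node_b"]
gens = []
for s in sources:
    # The generator stores the expression; `s` is looked up only when the
    # generator is actually consumed, not when it is created.
    gen = ((s, x) for x in range(2))
    gens.append(gen)

# By the time we consume the generators, the loop has finished and `s`
# holds its final value for all of them.
print([list(g) for g in gens])
# [[('node_b', 0), ('node_b', 1)], [('node_b', 0), ('node_b', 1)]]
```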
Following the consequences of this down the line, we find that collect
nodes can end up with multiple edges from exactly one of their ancestor
nodes, instead of one edge from each ancestor. Notably, it's only the
source _node_id_ that is affected - the source _fields_ have the correct
values.
So the invalid edges will point to a real node and a real field, but the
field exists on a different node.
---
This can result in a number of cryptic problems - including an error
about incompatible field types:
```
InvalidEdgeError: Field types are incompatible
(31758fd5-14a8-4de7-a840-b73ec1a1b94f.value ->
3459c793-41a2-4d82-9204-7df2d6d099ba.item)
```
Here are the conditions that lead to this error:
- The collect node has at least two incoming connections.
- The two incoming connections come from nodes of different types.
- The nodes both output a value of the same type, but the name of the
output field differs between them.
---
This commit uses non-generator logic to build up the mappings, avoiding
the issue entirely. As a bonus, it is much easier to read.
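Continuing the simplified example above, the equivalent eager version evaluates `s` at each iteration, so every entry gets the right source:

```python
sources = ["node_a", "node_b"]
mappings: list[tuple[str, int]] = []
for s in sources:
    for x in range(2):
        # Appending concrete tuples evaluates `s` immediately.
        mappings.append((s, x))

print(mappings)
# [('node_a', 0), ('node_a', 1), ('node_b', 0), ('node_b', 1)]
```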
Previously, we used python's own type introspection utilities to determine
input and output field types. We can use pydantic to get the field types
in a clearer, more direct way.
This improvement also exposed an awkward behaviour in this utility,
where it would return None when a field doesn't exist. I've added a
comment in the code describing the issue, but changing it would require
some significant changes and I don't want to risk breaking anything.
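A sketch of the idea, using a stand-in pydantic model (the class, field, and function names are hypothetical, not the actual invocation code):

```python
from pydantic import BaseModel

class ExampleInvocation(BaseModel):
    value: int
    prompt: str

def get_field_type(model_cls: type[BaseModel], field_name: str) -> type | None:
    """Look the field up on the pydantic model and return its annotation.

    Returns None when the field does not exist - the awkward behaviour
    mentioned above, preserved to avoid changing callers.
    """
    field = model_cls.model_fields.get(field_name)
    return field.annotation if field is not None else None

print(get_field_type(ExampleInvocation, "value"))    # <class 'int'>
print(get_field_type(ExampleInvocation, "missing"))  # None
```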
When we delete images, boards, or do any other board mutation, we need
to invalidate numerous query caches and related internal frontend state.
This gets complicated very quickly.
We can drastically reduce the complexity by having the backend return
some more information when we make these mutations.
For example, when deleting a list of images by name, we can return a
list of deleted image names and affected boards. The frontend can use
this information to determine which queries to invalidate with far less
tedium.
This will also enable the more efficient storage of images (e.g. in the
gallery selection). Previously, we had to store the entire image DTO
object, else we wouldn't be able to figure out which queries to
invalidate. But now that the backend tells us exactly what images/boards
have changed, we can just store image names in frontend state. This
amounts to a substantial improvement in DX and reduction in frontend
complexity.
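A sketch of the shape of such a response, using pydantic (the class and field names are illustrative, not the exact API):

```python
from pydantic import BaseModel

class DeleteImagesResult(BaseModel):
    """Hypothetical response shape for a bulk image-delete mutation."""
    deleted_images: list[str]   # names of the images that were actually deleted
    affected_boards: list[str]  # IDs of boards whose contents/counts changed

result = DeleteImagesResult(
    deleted_images=["img_001.png", "img_002.png"],
    affected_boards=["board_abc"],
)
# The frontend only needs to invalidate queries keyed on these names/IDs,
# rather than storing full image DTOs to work out what changed.
```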
In #7724 we made a number of perf optimisations related to enqueuing. One of these optimisations included moving the enqueue logic - including expensive prep work and db writes - to a separate thread.
At the same time, manual DB locking was abandoned in favor of WAL mode.
Finally, we set `check_same_thread=False` to allow multiple threads to access the connection concurrently.
I think this may be the cause of #7950:
- We start an enqueue in a thread (running in the background)
- We dequeue
- Dequeue pulls a partially-written queue item from the DB and we get the errors in the linked issue
To be honest, I don't understand enough about SQLite to confidently say that this kind of race condition is actually possible. But:
- The error started popping up around the time we made this change.
- I have reviewed the logic from enqueue to dequeue very carefully _many_ times over the past month or so, and I am confident that the error is only possible if we are getting unexpectedly `NULL` values from the DB.
- The DB schema includes `NOT NULL` constraints for the column that is apparently returning `NULL`.
- Therefore, without some kind of race condition or schema issue, the error should not be possible.
- The `enqueue_batch` call is the only place I can find where we have the possibility of a race condition due to async logic. Everywhere else, all DB interaction for the queue is synchronous, as far as I can tell.
This change retains the perf benefits by running the heavy enqueue prep logic in a separate thread, but moves back to the main thread for the DB write. It also uses an explicit transaction for the write.
Will just have to wait and see if this fixes the issue.
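A rough sketch of the shape of this change (the table, column, and function names are hypothetical):

```python
import asyncio
import sqlite3

def prepare_queue_items(batch: dict) -> list[dict]:
    """Expensive prep work (hypothetical): validate and expand the batch."""
    return []  # placeholder

async def enqueue_batch(conn: sqlite3.Connection, batch: dict) -> None:
    loop = asyncio.get_running_loop()

    # Heavy prep still runs in a worker thread so it does not block the
    # event loop...
    queue_items = await loop.run_in_executor(None, prepare_queue_items, batch)

    # ...but the DB write happens back on the main thread, wrapped in a
    # transaction (the sqlite3 connection context manager commits on
    # success and rolls back on error), so a concurrent dequeue never
    # sees a partially written queue item.
    with conn:
        conn.executemany(
            "INSERT INTO session_queue (item_id, session) VALUES (?, ?)",
            [(item["item_id"], item["session"]) for item in queue_items],
        )
```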
This reduces peak memory usage at a negligible cost. Queue items typically take on the order of seconds, making the time cost of a GC essentially free.
Not a great idea on a hotter code path though.
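For context, a sketch of what this looks like in a worker loop, assuming a collection runs after each queue item completes (all names are stand-ins):

```python
import gc

def process(item) -> None:
    """Stand-in for running a queue item (takes on the order of seconds)."""
    ...

def run_queue_worker(queue) -> None:
    """Hypothetical worker loop; `queue.get_next()` is a stand-in."""
    while True:
        item = queue.get_next()
        if item is None:
            break
        process(item)
        # Queue items run for seconds, so the few milliseconds of a full
        # collection per item are negligible - and memory is released
        # promptly between items.
        gc.collect()
```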