- Add a context manager to the SqliteDatabase class which abstracts away
creating a transaction, committing it on success and rolling back on
error.
- Use it everywhere. The context manager should be exited before
returning results. No business logic changes should be present.
In #7724 we made a number of perf optimisations related to enqueuing. One of these optimisations included moving the enqueue logic - including expensive prep work and db writes - to a separate thread.
At the same time manual DB locking was abandoned in favor of WAL mode.
Finally, we set `check_same_thread=False` to allow multiple threads to access the connection at a given time.
I think this may be the cause of #7950:
- We start an enqueue in a thread (running in bg)
- We dequeue
- Dequeue pulls a partially-written queue item from DB and we get the errors in the linked issue
To be honest, I don't understand enough about SQLite to confidently say that this kind of race condition is actually possible. But:
- The error started popping up around the time we made this change.
- I have reviewed the logic from enqueue to dequeue very carefully _many_ times over the past month or so, and I am confident that the error is only possible if we are getting unexpectedly `NULL` values from the DB.
- The DB schema includes `NOT NULL` constraints for the column that is apparently returning `NULL`.
- Therefore, without some kind of race condition or schema issue, the error should not be possible.
- The `enqueue_batch` call is the only place I can find where we have the possibility of a race condition due to async logic. Everywhere else, all DB interaction for the queue is synchronous, as far as I can tell.
This change retains the perf benefits by running the heavy enqueue prep logic in a separate thread, but moves back to the main thread for the DB write. It also uses an explicit transaction for the write.
Will just have to wait and see if this fixes the issue.
In #7688 we optimized queuing preparation logic. This inadvertently broke retrying queue items.
Previously, a `NamedTuple` was used to store the values to insert in the DB when enqueuing. This handy class provides an API similar to a dataclass, where you can instantiate it with kwargs in any order. The resultant tuple re-orders the kwargs to match the order in the class definition.
For example, consider this `NamedTuple`:
```py
class SessionQueueValueToInsert(NamedTuple):
foo: str
bar: str
```
When instantiating it, no matter the order of the kwargs, if you make a normal tuple out of it, the tuple values are in the same order as in the class definition:
```
t1 = SessionQueueValueToInsert(foo="foo", bar="bar")
print(tuple(t1)) # -> ('foo', 'bar')
t2 = SessionQueueValueToInsert(bar="bar", foo="foo")
print(tuple(t2)) # -> ('foo', 'bar')
```
So, in the old code, when we used the `NamedTuple`, it implicitly normalized the order of the values we insert into the DB.
In the retry logic, the values of the tuple were not ordered correctly, but the use of `NamedTuple` had secretly fixed the order for us.
In the linked PR, `NamedTuple` was dropped for a normal tuple, after profiling showed `NamedTuple` to be meaningfully slower than a normal tuple.
The implicit order normalization behaviour wasn't understood, and the order wasn't fixed when changin the retry logic to use a normal tuple instead of `NamedTuple`. This results in a bug where we incorrectly create queue items in the DB. For example, we stored the `destination` in the `field_values` column.
When such an incorrectly-created queue item is dequeued, it fails pydantic validation and causes what appears to be an endless loop of errors.
The only user-facing solution is to add this line to `invokeai.yaml` and restart the app:
```yaml
clear_queue_on_startup: true
```
On next startup, the queue is forcibly cleared before the error loop is triggered. Then the user should remove this line so their queue is persisted across app launches per usual.
The solution is simple - fix the ordering of the tuple. I also added a type annotation and comment to the tuple type alias definition.
Note: The endless error loop, as a general problem, will take some thinking to fix. The queue service methods to cancel and fail a queue item still retrieve it and parse it. And the list queue items methods parse the queue items. Bit of a catch 22, maybe the solution is to simply delete totally borked queue items and log an error.
SQLite cursors are meant to be lightweight and not reused. For whatever reason, we reuse one per service for the entire app lifecycle.
This can cause issues where a cursor is used twice at the same time in different transactions.
This experiment makes the session queue use a fresh cursor for each method, hopefully fixing the issue.
Rely on WAL mode and the busy timeout.
Also changed:
- Remove extraneous rollbacks when we were only doing a `SELECT`
- Remove try/catch blocks that were made extraneous when removing the extraneous rollbacks
- Avoid pydantic models when dict manipulation works
- Avoid extraneous deep copies when we can safely mutate
- Avoid NamedTuple construct and its overhead
- Fix tests to use altered function signatures
- Remove extraneous populate_graph function
Retrying a queue item means cloning it, resetting all execution-related state. Retried queue items reference the item they were retried from by id. This relationship is not enforced by any DB constraints.
- Add `retried_from_item_id` to `session_queue` table in DB in a migration.
- Add `retry_items_by_id` method to session queue service. Accepts a list of queue item IDs and clones them (minus execution state). Returns a list of retried items. Items that are not in a canceled or failed state are skipped.
- Add `retry_items_by_id` HTTP endpoint that maps 1-to-1 to the queue service method.
- Add `queue_items_retried` event, which includes the list of retried items.
When resetting the canvas or staging area, we don't want to cancel generations that are going to the gallery - only those going to the canvas.
Thus the method should not cancel by origin, but instead cancel by destination.
Update the queue method and route.
The frontend needs to know where queue items came from (i.e. which tab), and where results are going to (i.e. send images to gallery or canvas). The `origin` column is not quite enough to represent this cleanly.
A `destination` column provides the frontend what it needs to handle incoming generations.
The origin is an optional field indicating the queue item's origin. For example, "canvas" when the queue item originated from the canvas or "workflows" when the queue item originated from the workflows tab. If omitted, we assume the queue item originated from the API directly.
- Add migration to add the nullable column to the `session_queue` table.
- Update relevant event payloads with the new field.
- Add `cancel_by_origin` method to `session_queue` service and corresponding route. This is required for the canvas to bail out early when staging images.
- Add `origin` to both `SessionQueueItem` and `Batch` - it needs to be provided initially via the batch and then passed onto the queue item.
-
There's no longer any need for session-scoped events now that we have the session queue. Session started/completed/canceled map 1-to-1 to queue item status events, but queue item status events also have an event for failed state.
We can simplify queue and processor handling substantially by removing session events and instead using queue item events.
- Remove the session-scoped events entirely.
- Remove all event handling from session queue. The processor still needs to respond to some events from the queue: `QueueClearedEvent`, `BatchEnqueuedEvent` and `QueueItemStatusChangedEvent`.
- Pass an `is_canceled` callback to the invocation context instead of the cancel event
- Update processor logic to ensure the local instance of the current queue item is synced with the instance in the database. This prevents race conditions and ensures lifecycle callback do not get stale callbacks.
- Update docstrings and comments
- Add `complete_queue_item` method to session queue service as an explicit way to mark a queue item as successfully completed. Previously, the queue listened for session complete events to do this.
Closes#6442
Our events handling and implementation has a couple pain points:
- Adding or removing data from event payloads requires changes wherever the events are dispatched from.
- We have no type safety for events and need to rely on string matching and dict access when interacting with events.
- Frontend types for socket events must be manually typed. This has caused several bugs.
`fastapi-events` has a neat feature where you can create a pydantic model as an event payload, give it an `__event_name__` attr, and then dispatch the model directly.
This allows us to eliminate a layer of indirection and some unpleasant complexity:
- Event handler callbacks get type hints for their event payloads, and can use `isinstance` on them if needed.
- Event payload construction is now the responsibility of the event itself (a pydantic model), not the service. Every event model has a `build` class method, encapsulating this logic. The build methods are provided as few args as possible. For example, `InvocationStartedEvent.build()` gets the invocation instance and queue item, and can choose the data it wants to include in the event payload.
- Frontend event types may be autogenerated from the OpenAPI schema. We use the payload registry feature of `fastapi-events` to collect all payload models into one place, making it trivial to keep our schema and frontend types in sync.
This commit moves the backend over to this improved event handling setup.
- Add handling for new error columns `error_type`, `error_message`, `error_traceback`.
- Update queue item model to include the new data. The `error_traceback` field has an alias of `error` for backwards compatibility.
- Add `fail_queue_item` method. This was previously handled by `cancel_queue_item`. Splitting this functionality makes failing a queue item a bit more explicit. We also don't need to handle multiple optional error args.
-
The session is never updated in the queue after it is first enqueued. As a result, the queue detail view in the frontend never never updates and the session itself doesn't show outputs, execution graph, etc.
We need a new method on the queue service to update a queue item's session, then call it before updating the queue item's status.
Queue item status may be updated via a session-type event _or_ queue-type event. Adding the updated session to all these events is a hairy - simpler to just update the session before we do anything that could trigger a queue item status change event:
- Before calling `emit_session_complete` in the processor (handles session error, completed and cancel events and the corresponding queue events)
- Before calling `cancel_queue_item` in the processor (handles another way queue items can be canceled, outside the session execution loop)
When serializing the session, both in the new service method and the `get_queue_item` endpoint, we need to use `exclude_none=True` to prevent unexpected validation errors.
Consolidate graph processing logic into session processor.
With graphs as the unit of work, and the session queue distributing graphs, we no longer need the invocation queue or processor.
Instead, the session processor dequeues the next session and processes it in a simple loop, greatly simplifying the app.
- Remove `graph_execution_manager` service.
- Remove `queue` (invocation queue) service.
- Remove `processor` (invocation processor) service.
- Remove queue-related logic from `Invoker`. It now only starts and stops the services, providing them with access to other services.
- Remove unused `invocation_retrieval_error` and `session_retrieval_error` events, these are no longer needed.
- Clean up stats service now that it is less coupled to the rest of the app.
- Refactor cancellation logic - cancellations now originate from session queue (i.e. HTTP cancel endpoint) and are emitted as events. Processor gets the events and sets the canceled event. Access to this event is provided to the invocation context for e.g. the step callback.
- Remove `sessions` router; it provided access to `graph_executions` but that no longer exists.
* chore: bump pydantic to 2.5.2
This release fixespydantic/pydantic#8175 and allows us to use `JsonValue`
* fix(ui): exclude public/en.json from prettier config
* fix(workflow_records): fix SQLite workflow insertion to ignore duplicates
* feat(backend): update workflows handling
Update workflows handling for Workflow Library.
**Updated Workflow Storage**
"Embedded Workflows" are workflows associated with images, and are now only stored in the image files. "Library Workflows" are not associated with images, and are stored only in DB.
This works out nicely. We have always saved workflows to files, but recently began saving them to the DB in addition to in image files. When that happened, we stopped reading workflows from files, so all the workflows that only existed in images were inaccessible. With this change, access to those workflows is restored, and no workflows are lost.
**Updated Workflow Handling in Nodes**
Prior to this change, workflows were embedded in images by passing the whole workflow JSON to a special workflow field on a node. In the node's `invoke()` function, the node was able to access this workflow and save it with the image. This (inaccurately) models workflows as a property of an image and is rather awkward technically.
A workflow is now a property of a batch/session queue item. It is available in the InvocationContext and therefore available to all nodes during `invoke()`.
**Database Migrations**
Added a `SQLiteMigrator` class to handle database migrations. Migrations were needed to accomodate the DB-related changes in this PR. See the code for details.
The `images`, `workflows` and `session_queue` tables required migrations for this PR, and are using the new migrator. Other tables/services are still creating tables themselves. A followup PR will adapt them to use the migrator.
**Other/Support Changes**
- Add a `has_workflow` column to `images` table to indicate that the image has an embedded workflow.
- Add handling for retrieving the workflow from an image in python. The image file must be fetched, the workflow extracted, and then sent to client, avoiding needing the browser to parse the image file. With the `has_workflow` column, the UI knows if there is a workflow to be fetched, and only fetches when the user requests to load the workflow.
- Add route to get the workflow from an image
- Add CRUD service/routes for the library workflows
- `workflow_images` table and services removed (no longer needed now that embedded workflows are not in the DB)
* feat(ui): updated workflow handling (WIP)
Clientside updates for the backend workflow changes.
Includes roughed-out workflow library UI.
* feat: revert SQLiteMigrator class
Will pursue this in a separate PR.
* feat(nodes): do not overwrite custom node module names
Use a different, simpler method to detect if a node is custom.
* feat(nodes): restore WithWorkflow as no-op class
This class is deprecated and no longer needed. Set its workflow attr value to None (meaning it is now a no-op), and issue a warning when an invocation subclasses it.
* fix(nodes): fix get_workflow from queue item dict func
* feat(backend): add WorkflowRecordListItemDTO
This is the id, name, description, created at and updated at workflow columns/attrs. Used to display lists of workflowsl
* chore(ui): typegen
* feat(ui): add workflow loading, deleting to workflow library UI
* feat(ui): workflow library pagination button styles
* wip
* feat: workflow library WIP
- Save to library
- Duplicate
- Filter/sort
- UI/queries
* feat: workflow library - system graphs - wip
* feat(backend): sync system workflows to db
* fix: merge conflicts
* feat: simplify default workflows
- Rename "system" -> "default"
- Simplify syncing logic
- Update UI to match
* feat(workflows): update default workflows
- Update TextToImage_SD15
- Add TextToImage_SDXL
- Add README
* feat(ui): refine workflow list UI
* fix(workflow_records): typo
* fix(tests): fix tests
* feat(ui): clean up workflow library hooks
* fix(db): fix mis-ordered db cleanup step
It was happening before pruning queue items - should happen afterwards, else you have to restart the app again to free disk space made available by the pruning.
* feat(ui): tweak reset workflow editor translations
* feat(ui): split out workflow redux state
The `nodes` slice is a rather complicated slice. Removing `workflow` makes it a bit more reasonable.
Also helps to flatten state out a bit.
* docs: update default workflows README
* fix: tidy up unused files, unrelated changes
* fix(backend): revert unrelated service organisational changes
* feat(backend): workflow_records.get_many arg "filter_text" -> "query"
* feat(ui): use custom hook in current image buttons
Already in use elsewhere, forgot to use it here.
* fix(ui): remove commented out property
* fix(ui): fix workflow loading
- Different handling for loading from library vs external
- Fix bug where only nodes and edges loaded
* fix(ui): fix save/save-as workflow naming
* fix(ui): fix circular dependency
* fix(db): fix bug with releasing without lock in db.clean()
* fix(db): remove extraneous lock
* chore: bump ruff
* fix(workflow_records): default `category` to `WorkflowCategory.User`
This allows old workflows to validate when reading them from the db or image files.
* hide workflow library buttons if feature is disabled
---------
Co-authored-by: Mary Hipp <maryhipp@Marys-MacBook-Air.local>