feat(app): non blocking enqueue_batch #7724

psychedelicious · 2025-03-03T03:31:45Z

Summary

Most service methods in the application are blocking. For example, enqueue_batch blocks the main thread - preventing HTTP and socket traffic. When enqueuing very large or complicated batches, this can cause unresponsiveness.

To make enqueue_batch non-blocking, we can make it async and run it in a thread. However, we have another source of "blocking" - a global mutex on database operations.

This global mutex is required because SQLite, by default, does not support concurrent reads and writes. We painstakingly wrap all DB operations in a re-entrant lock to prevent issues with interacting with a locked database.

SQLite has an alternate mode called WAL (write-ahead logging) mode, which supports concurrent reads and allow us to treat writes as concurrent.

(Technically, in WAL mode, SQLite will queue writes using the WAL log as a buffer, so it's not true concurrency. But we can interact with the DB as it if was concurrent.)

We can also enable the SQLite busy timeout at 5s. If we are waiting on an operation for more than 5s, SQLite will error. This is plenty of time for any SQLite operation we may need to do. If something takes longer than that, we have problems and should fail loudly.

The busy timeout ends up being a better failure mode than we had before w/ the global mutex, which had no timeout and could lock up the app indefinitely if something went awry.

So, to make enqueue_batch non-blocking, we need to:

Run that function in a thread and await it in the router.
Enable WAL mode and set a busy timeout.
Remove the global mutex.
- As a bonus, this makes every method that touches the DB simpler to read.

I also cleaned up many methods that touched the DB. Many read operations caught errors and did a connection rollback, but this is not necessary. Reads do not have anything to roll back and calling rollback is a noop, so we can remove this nonfunctional logic.

There's one draw-back to WAL mode. You cannot open the DB file over the network. For example, if your host machine is different from your client, and you share the DB via SMB, you cannot connect to the DB over SMB from the client.

Long ago, this drawback was the reason to not use WAL mode. But now, I think it is a fine tradeoff. You can still SSH to the host and connect to the DB directly.

Finally, I made a change to our RTKQ implementation fixing an issue where, during a long-running mutation (e.g. enqueuing a big batch), invalidated queries are not immediately re-fetched (e.g. changing a board's name would wait until the enqueue completes before refreshing).

Related Issues / Discussions

n/a

QA Instructions

Try the app. Everything should work. When enqueuing huge, complicated batches, the app should remain responsive.

Thanks to the massive perf improvements to batch prep in #7688, the benefits from making this change will be less impactful as we should never be blocked for long. Nevertheless, this really cleans up our DB code and reduces footgun potential from mishandled mutexes.

Merge Plan

n/a

Checklist

[z] The PR has a short but descriptive title, suitable for a changelog
Tests added / updated (if applicable)
Documentation added / updated (if applicable)
Updated What's New copy (if doing a release after this PR)

This allows for read and write concurrency without using a global mutex. Operations may still fail they take longer than the busy timeout (5s). If we get a database lock error after waiting 5s for an operation, we have a problem. So, I think it's actually better to use a busy timeout instead of a global mutex. Alternatively, we could add a timeout to the global mutex.

Rely on WAL mode and the busy timeout. Also changed: - Remove extraneous rollbacks when we were only doing a `SELECT` - Remove try/catch blocks that were made extraneous when removing the extraneous rollbacks

This allows tags to be invalidated while mutations are executing, resolving an issue in this situation: - A long-running mutation starts. - A tag is invalidated; for example, user edits a board name, and the boards list query tag is invalidated. - The boards list query isn't fired, and the board name isn't updated. - The long-running mutation finishes. - Finally, the boards list query fires and the board name is updated. This is the "delayed" behaviour. The "immediately" behaviour has the fires requests from tag invalidation immediately, without waiting for all mutations to finish. It may cause extra network requests and stale data if we are mutating a lot of things very quickly. I don't think it will be an issue in practice and the improved responsiveness will be a net benefit.

This reverts commit ddf00bf.

This reverts commit eb6a323.

psychedelicious · 2025-03-03T03:34:50Z

Note for reviewers - there is very little code added in this PR. The diff is large because of changed indentation. Many extraneous try/except blocks were removed, shifting code a bit. git isn't representing this very well.

psychedelicious added 10 commits March 3, 2025 12:57

experiment(app): simulate very long enqueue operations (15s)

d309579

experiment(app): make socketio server ping every 1s

cfd38eb

experiment(app): run enqueue_batch async in a thread

49dd810

experiment(app): remove db locking logic

9849bf8

Rely on WAL mode and the busy timeout. Also changed: - Remove extraneous rollbacks when we were only doing a `SELECT` - Remove try/catch blocks that were made extraneous when removing the extraneous rollbacks

Revert "experiment(app): make socketio server ping every 1s"

e1c322f

This reverts commit ddf00bf.

Revert "experiment(app): simulate very long enqueue operations (15s)"

9c08e73

This reverts commit eb6a323.

tidy(app): remove extraneous class attr type annotations

c62d82c

feat(app): use simpler syntax for enqueue_batch threaded execution

0645304

psychedelicious requested review from blessedcoolant, maryhipp, hipsterusername and brandonrising as code owners March 3, 2025 03:31

github-actions bot added api python PRs that change python files services PRs that change app services frontend PRs that change frontend files labels Mar 3, 2025

hipsterusername approved these changes Mar 3, 2025

View reviewed changes

psychedelicious merged commit 7399909 into main Mar 3, 2025
15 checks passed

psychedelicious deleted the psyche/experiment/app/non-blocking-enqueue branch March 3, 2025 03:40

psychedelicious mentioned this pull request Mar 3, 2025

fix(app): recursive cursor errors #7727

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(app): non blocking enqueue_batch #7724

feat(app): non blocking enqueue_batch #7724

psychedelicious commented Mar 3, 2025 •

edited

Loading

psychedelicious commented Mar 3, 2025

feat(app): non blocking enqueue_batch #7724

feat(app): non blocking enqueue_batch #7724

Conversation

psychedelicious commented Mar 3, 2025 • edited Loading

Summary

Related Issues / Discussions

QA Instructions

Merge Plan

Checklist

psychedelicious commented Mar 3, 2025

psychedelicious commented Mar 3, 2025 •

edited

Loading