Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(app): non blocking enqueue_batch #7724

Merged
merged 10 commits into from
Mar 3, 2025

Conversation

psychedelicious
Copy link
Collaborator

@psychedelicious psychedelicious commented Mar 3, 2025

Summary

Most service methods in the application are blocking. For example, enqueue_batch blocks the main thread - preventing HTTP and socket traffic. When enqueuing very large or complicated batches, this can cause unresponsiveness.

To make enqueue_batch non-blocking, we can make it async and run it in a thread. However, we have another source of "blocking" - a global mutex on database operations.

This global mutex is required because SQLite, by default, does not support concurrent reads and writes. We painstakingly wrap all DB operations in a re-entrant lock to prevent issues with interacting with a locked database.

SQLite has an alternate mode called WAL (write-ahead logging) mode, which supports concurrent reads and allow us to treat writes as concurrent.

(Technically, in WAL mode, SQLite will queue writes using the WAL log as a buffer, so it's not true concurrency. But we can interact with the DB as it if was concurrent.)

We can also enable the SQLite busy timeout at 5s. If we are waiting on an operation for more than 5s, SQLite will error. This is plenty of time for any SQLite operation we may need to do. If something takes longer than that, we have problems and should fail loudly.

The busy timeout ends up being a better failure mode than we had before w/ the global mutex, which had no timeout and could lock up the app indefinitely if something went awry.

So, to make enqueue_batch non-blocking, we need to:

  • Run that function in a thread and await it in the router.
  • Enable WAL mode and set a busy timeout.
  • Remove the global mutex.
    • As a bonus, this makes every method that touches the DB simpler to read.

I also cleaned up many methods that touched the DB. Many read operations caught errors and did a connection rollback, but this is not necessary. Reads do not have anything to roll back and calling rollback is a noop, so we can remove this nonfunctional logic.

There's one draw-back to WAL mode. You cannot open the DB file over the network. For example, if your host machine is different from your client, and you share the DB via SMB, you cannot connect to the DB over SMB from the client.

Long ago, this drawback was the reason to not use WAL mode. But now, I think it is a fine tradeoff. You can still SSH to the host and connect to the DB directly.

Finally, I made a change to our RTKQ implementation fixing an issue where, during a long-running mutation (e.g. enqueuing a big batch), invalidated queries are not immediately re-fetched (e.g. changing a board's name would wait until the enqueue completes before refreshing).

Related Issues / Discussions

n/a

QA Instructions

Try the app. Everything should work. When enqueuing huge, complicated batches, the app should remain responsive.

Thanks to the massive perf improvements to batch prep in #7688, the benefits from making this change will be less impactful as we should never be blocked for long. Nevertheless, this really cleans up our DB code and reduces footgun potential from mishandled mutexes.

Merge Plan

n/a

Checklist

  • [z] The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

This allows for read and write concurrency without using a global mutex. Operations may still fail they take longer than the busy timeout (5s).

If we get a database lock error after waiting 5s for an operation, we have a problem. So, I think it's actually better to use a busy timeout instead of a global mutex.

Alternatively, we could add a timeout to the global mutex.
Rely on WAL mode and the busy timeout.

Also changed:
- Remove extraneous rollbacks when we were only doing a `SELECT`
- Remove try/catch blocks that were made extraneous when removing the extraneous rollbacks
This allows tags to be invalidated while mutations are executing, resolving an issue in this situation:
- A long-running mutation starts.
- A tag is invalidated; for example, user edits a board name, and the boards list query tag is invalidated.
- The boards list query isn't fired, and the board name isn't updated.
- The long-running mutation finishes.
- Finally, the boards list query fires and the board name is updated.

This is the "delayed" behaviour. The "immediately" behaviour has the fires requests from tag invalidation immediately, without waiting for all mutations to finish.

It may cause extra network requests and stale data if we are mutating a lot of things very quickly. I don't think it will be an issue in practice and the improved responsiveness will be a net benefit.
@github-actions github-actions bot added api python PRs that change python files services PRs that change app services frontend PRs that change frontend files labels Mar 3, 2025
@psychedelicious
Copy link
Collaborator Author

Note for reviewers - there is very little code added in this PR. The diff is large because of changed indentation. Many extraneous try/except blocks were removed, shifting code a bit. git isn't representing this very well.

@psychedelicious psychedelicious merged commit 7399909 into main Mar 3, 2025
15 checks passed
@psychedelicious psychedelicious deleted the psyche/experiment/app/non-blocking-enqueue branch March 3, 2025 03:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api frontend PRs that change frontend files python PRs that change python files services PRs that change app services
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants