fix(services): recover perpetual services after Redis disruption without restart#22415
Conversation
…out restart A perpetual service mid-execution when Redis is disrupted is left with its runs-hash in state=running. The docket worker reschedules automatic perpetual tasks on every (re)connection, but via docket.add, which no-ops on a task that is already known -- including one stuck in state=running. The Foreman therefore never runs again and stale workers stay ONLINE indefinitely. Force-(re)schedule automatic perpetual tasks with docket.replace, which clears the stale running state: - register_and_schedule_perpetual_services uses replace (server-restart path) - new PerpetualServiceRecovery worker dependency reschedules via replace on each worker-loop start, recovering after a mid-runtime Redis reconnection Co-authored-by: Brendan Dalpe <brendan.dalpe@prefect.io> Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
Co-authored-by: Brendan Dalpe <brendan.dalpe@prefect.io> Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Brendan Dalpe <brendan.dalpe@prefect.io> Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Merging this PR will not alter performance
Comparing Footnotes
|
|
I think that there's a better fix for the underlying issue in Docket. I am working on a fix in chrisguidry/docket#439 |
|
Makes sense — fixing |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 267b345bd3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| ) -> AsyncGenerator[None, None]: | ||
| worker_task: asyncio.Task[None] | None = None | ||
| async with Worker(docket) as worker: | ||
| async with Worker(docket, dependencies=[PerpetualServiceRecovery()]) as worker: |
There was a problem hiding this comment.
Bump pydocket lower bound for worker dependencies
This new dependencies= argument relies on newer Docket worker dependency/lifecycle support, but the package metadata still allows pydocket>=0.19.0. In environments that upgrade Prefect while keeping an allowed older pydocket, 0.19.x raises TypeError: Worker.__init__() got an unexpected keyword argument 'dependencies', and 0.20.x accepts the argument but never invokes worker_lifecycle, so the no-restart recovery path silently does not run. Please raise the minimum pydocket version to the first release that supports this lifecycle hook.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — worker_lifecycle was introduced in pydocket 0.20.2; versions 0.19.x–0.20.1 accept the dependencies kwarg but never invoke the hook, so PerpetualServiceRecovery would silently not run. Bumped the lower bound to >=0.20.2 in 60a3263.
worker_lifecycle, used by PerpetualServiceRecovery, was introduced in pydocket 0.20.2. Older versions accept the dependencies kwarg but never invoke the hook, so mid-runtime recovery would silently not run. Co-authored-by: Brendan Dalpe <brendan.dalpe@prefect.io> Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
closes #22212
After a transient Redis outage, perpetual server services (Foreman, scheduler, etc.) can stop running permanently — so stale workers keep showing as ONLINE in the UI/Postgres long after their last heartbeat, until the server is restarted.
This PR force-(re)schedules automatic perpetual tasks with
docket.replaceinstead ofdocket.add, which recovers a service whose run was interrupted by the outage — including mid-runtime, without a server restart.Root cause
When the docket worker claims a perpetual task it sets its runs-hash to
state=runningand deletes theknownfield. If Redis is disrupted while the task is executing, that state persists. On every (re)connection the worker reschedules automatic perpetual tasks via its built-in_schedule_all_automatic_perpetual_tasks, but that usesdocket.add, whose scheduling script treats astate=runningtask as already-known and returnsEXISTS— a permanent no-op. So the Foreman never runs again and stale workers are never marked OFFLINE.#22213already setautomatic=Trueon all perpetual services so the worker attempts to reschedule them on reconnect; the remaining gap is that the attempt usesadd, which can't clear the stuckrunningstate.Reproduced in
repros/22212.py:Fix
docket.replaceunconditionally overwrites the schedule, clearing the stalerunningstate. Two paths use it:register_and_schedule_perpetual_servicesnow usesreplace— recovers stuck services on server restart.PerpetualServiceRecoveryworker dependency hooks docket'sworker_lifecycle, which the worker enters at the start of every loop (including the loop after a mid-runtime Redis reconnection), and force-reschedules every automatic perpetual task via a sharedreschedule_automatic_perpetual_tasks(docket)helper. This recovers without requiring a restart — the scenario in the issue.The helper iterates
docket.tasks(mirroring docket's own automatic-perpetual scheduling), so it only touches registered, enabled, automatic perpetual services.Notes
_schedule_all_automatic_perpetual_tasksusingadd); this is a Prefect-side recovery so customers don't have to wait on a docket release. A follow-up upstream fix inpydocketwould let us drop the worker dependency.Checklist
mint.json.Link to Devin session: https://app.devin.ai/sessions/08d8e50048124c0c856e30be495d491b
Requested by: @bdalpe