fix(services): recover perpetual services after Redis disruption without restart by devin-ai-integration[bot] · Pull Request #22415 · PrefectHQ/prefect

devin-ai-integration · 2026-06-30T18:50:01Z

After a transient Redis outage, perpetual server services (Foreman, scheduler, etc.) can stop running permanently — so stale workers keep showing as ONLINE in the UI/Postgres long after their last heartbeat, until the server is restarted.

This PR force-(re)schedules automatic perpetual tasks with docket.replace instead of docket.add, which recovers a service whose run was interrupted by the outage — including mid-runtime, without a server restart.

Root cause

When the docket worker claims a perpetual task it sets its runs-hash to state=running and deletes the known field. If Redis is disrupted while the task is executing, that state persists. On every (re)connection the worker reschedules automatic perpetual tasks via its built-in _schedule_all_automatic_perpetual_tasks, but that uses docket.add, whose scheduling script treats a state=running task as already-known and returns EXISTS — a permanent no-op. So the Foreman never runs again and stale workers are never marked OFFLINE.

#22213 already set automatic=True on all perpetual services so the worker attempts to reschedule them on reconnect; the remaining gap is that the attempt uses add, which can't clear the stuck running state.

Reproduced in repros/22212.py:

seeded runs:monitor_worker_health -> {state: running}
docket.add  -> Disposition.ALREADY_SCHEDULED   (state still 'running' — NO-OP, BUG)
docket.replace -> Disposition.SCHEDULED         (state -> 'queued' — RECOVERED)

Fix

docket.replace unconditionally overwrites the schedule, clearing the stale running state. Two paths use it:

register_and_schedule_perpetual_services now uses replace — recovers stuck services on server restart.
New PerpetualServiceRecovery worker dependency hooks docket's worker_lifecycle, which the worker enters at the start of every loop (including the loop after a mid-runtime Redis reconnection), and force-reschedules every automatic perpetual task via a shared reschedule_automatic_perpetual_tasks(docket) helper. This recovers without requiring a restart — the scenario in the issue.

class PerpetualServiceRecovery(Dependency["PerpetualServiceRecovery"]):
    @classmethod
    @asynccontextmanager
    async def worker_lifecycle(cls, docket, worker):
        await reschedule_automatic_perpetual_tasks(docket)  # docket.replace per automatic perpetual task
        yield

The helper iterates docket.tasks (mirroring docket's own automatic-perpetual scheduling), so it only touches registered, enabled, automatic perpetual services.

Notes

The underlying limitation is in docket (_schedule_all_automatic_perpetual_tasks using add); this is a Prefect-side recovery so customers don't have to wait on a docket release. A follow-up upstream fix in pydocket would let us drop the worker dependency.
Supersedes the stale draft fix(services): force-reschedule perpetual services after Redis disruption #22303 (same root cause; this is a cleaner, stateless reimplementation).

Checklist

This pull request references any related issue by including "closes Worker status not updated to OFFLINE after Redis disruption #22212"
If this pull request adds new functionality, it includes unit tests that cover the changes
If this pull request removes docs files, it includes redirect settings in mint.json.
If this pull request adds functions or classes, it includes helpful docstrings.

Link to Devin session: https://app.devin.ai/sessions/08d8e50048124c0c856e30be495d491b
Requested by: @bdalpe

…out restart A perpetual service mid-execution when Redis is disrupted is left with its runs-hash in state=running. The docket worker reschedules automatic perpetual tasks on every (re)connection, but via docket.add, which no-ops on a task that is already known -- including one stuck in state=running. The Foreman therefore never runs again and stale workers stay ONLINE indefinitely. Force-(re)schedule automatic perpetual tasks with docket.replace, which clears the stale running state: - register_and_schedule_perpetual_services uses replace (server-restart path) - new PerpetualServiceRecovery worker dependency reschedules via replace on each worker-loop start, recovering after a mid-runtime Redis reconnection Co-authored-by: Brendan Dalpe <brendan.dalpe@prefect.io> Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

devin-ai-integration · 2026-06-30T18:50:05Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment, CI, and merge conflict monitoring

Co-authored-by: Brendan Dalpe <brendan.dalpe@prefect.io> Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

codspeed-hq · 2026-06-30T19:08:03Z

Merging this PR will not alter performance

✅ 8 untouched benchmarks
⏩ 1 skipped benchmark¹

_{Comparing devin1/oss-8028-worker-status-not-updated-to-offline-after-redis-disruption (60a3263) with main (e5b9dbc)}

1 benchmark was skipped, so the baseline result was used instead. If it was deleted from the codebase, click here and archive it to remove it from the performance reports. ↩

desertaxle · 2026-07-01T13:40:25Z

I think that there's a better fix for the underlying issue in Docket. I am working on a fix in chrisguidry/docket#439

devin-ai-integration · 2026-07-01T13:41:53Z

Makes sense — fixing _schedule_all_automatic_perpetual_tasks to use replace (or equivalent) in docket itself would eliminate the need for the PerpetualServiceRecovery dependency on the Prefect side. Happy to close this in favor of that upstream fix if you'd prefer, or keep it as a stopgap until the docket release lands. Up to you.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 267b345bd3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-07-01T13:44:54Z

 ) -> AsyncGenerator[None, None]:
    worker_task: asyncio.Task[None] | None = None
-    async with Worker(docket) as worker:
+    async with Worker(docket, dependencies=[PerpetualServiceRecovery()]) as worker:


Bump pydocket lower bound for worker dependencies

This new dependencies= argument relies on newer Docket worker dependency/lifecycle support, but the package metadata still allows pydocket>=0.19.0. In environments that upgrade Prefect while keeping an allowed older pydocket, 0.19.x raises TypeError: Worker.__init__() got an unexpected keyword argument 'dependencies', and 0.20.x accepts the argument but never invokes worker_lifecycle, so the no-restart recovery path silently does not run. Please raise the minimum pydocket version to the first release that supports this lifecycle hook.

Useful? React with 👍 / 👎.

Good catch — worker_lifecycle was introduced in pydocket 0.20.2; versions 0.19.x–0.20.1 accept the dependencies kwarg but never invoke the hook, so PerpetualServiceRecovery would silently not run. Bumped the lower bound to >=0.20.2 in 60a3263.

worker_lifecycle, used by PerpetualServiceRecovery, was introduced in pydocket 0.20.2. Older versions accept the dependencies kwarg but never invoke the hook, so mid-runtime recovery would silently not run. Co-authored-by: Brendan Dalpe <brendan.dalpe@prefect.io> Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

bdalpe self-assigned this Jun 30, 2026

github-actions Bot added the bug Something isn't working label Jun 30, 2026

bdalpe and others added 2 commits June 30, 2026 18:54

fix double backticks; serialize recovery across HA workers with a lock

4902ca9

Co-authored-by: Brendan Dalpe <brendan.dalpe@prefect.io> Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

test: recovery is skipped when the recovery lock is held

267b345

Co-authored-by: Brendan Dalpe <brendan.dalpe@prefect.io> Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

bdalpe marked this pull request as ready for review July 1, 2026 13:39

bdalpe requested review from chrisguidry, cicdw, desertaxle and zzstoatzz as code owners July 1, 2026 13:39

chatgpt-codex-connector Bot reviewed Jul 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(services): recover perpetual services after Redis disruption without restart#22415

fix(services): recover perpetual services after Redis disruption without restart#22415
devin-ai-integration[bot] wants to merge 4 commits into
mainfrom
devin1/oss-8028-worker-status-not-updated-to-offline-after-redis-disruption

devin-ai-integration Bot commented Jun 30, 2026

devin-ai-integration Bot commented Jun 30, 2026

codspeed-hq Bot commented Jun 30, 2026 •

edited

Loading

desertaxle commented Jul 1, 2026

devin-ai-integration Bot commented Jul 1, 2026

chatgpt-codex-connector Bot left a comment

chatgpt-codex-connector Bot Jul 1, 2026

devin-ai-integration Bot Jul 1, 2026

Labels

2 participants

Uh oh!

Conversation

devin-ai-integration Bot commented Jun 30, 2026

Fix

Notes

Checklist

devin-ai-integration Bot commented Jun 30, 2026

🤖 Devin AI Engineer

codspeed-hq Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Footnotes

desertaxle commented Jul 1, 2026

devin-ai-integration Bot commented Jul 1, 2026

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

chatgpt-codex-connector Bot Jul 1, 2026

Choose a reason for hiding this comment

devin-ai-integration Bot Jul 1, 2026

Choose a reason for hiding this comment

Labels

2 participants

codspeed-hq Bot commented Jun 30, 2026 •

edited

Loading