Skip to content

fix(services): recover perpetual services after Redis disruption without restart#22415

Open
devin-ai-integration[bot] wants to merge 4 commits into
mainfrom
devin1/oss-8028-worker-status-not-updated-to-offline-after-redis-disruption
Open

fix(services): recover perpetual services after Redis disruption without restart#22415
devin-ai-integration[bot] wants to merge 4 commits into
mainfrom
devin1/oss-8028-worker-status-not-updated-to-offline-after-redis-disruption

Conversation

@devin-ai-integration

Copy link
Copy Markdown
Contributor

closes #22212

After a transient Redis outage, perpetual server services (Foreman, scheduler, etc.) can stop running permanently — so stale workers keep showing as ONLINE in the UI/Postgres long after their last heartbeat, until the server is restarted.

This PR force-(re)schedules automatic perpetual tasks with docket.replace instead of docket.add, which recovers a service whose run was interrupted by the outage — including mid-runtime, without a server restart.

Root cause

When the docket worker claims a perpetual task it sets its runs-hash to state=running and deletes the known field. If Redis is disrupted while the task is executing, that state persists. On every (re)connection the worker reschedules automatic perpetual tasks via its built-in _schedule_all_automatic_perpetual_tasks, but that uses docket.add, whose scheduling script treats a state=running task as already-known and returns EXISTS — a permanent no-op. So the Foreman never runs again and stale workers are never marked OFFLINE.

#22213 already set automatic=True on all perpetual services so the worker attempts to reschedule them on reconnect; the remaining gap is that the attempt uses add, which can't clear the stuck running state.

Reproduced in repros/22212.py:

seeded runs:monitor_worker_health -> {state: running}
docket.add  -> Disposition.ALREADY_SCHEDULED   (state still 'running' — NO-OP, BUG)
docket.replace -> Disposition.SCHEDULED         (state -> 'queued' — RECOVERED)

Fix

docket.replace unconditionally overwrites the schedule, clearing the stale running state. Two paths use it:

  1. register_and_schedule_perpetual_services now uses replace — recovers stuck services on server restart.
  2. New PerpetualServiceRecovery worker dependency hooks docket's worker_lifecycle, which the worker enters at the start of every loop (including the loop after a mid-runtime Redis reconnection), and force-reschedules every automatic perpetual task via a shared reschedule_automatic_perpetual_tasks(docket) helper. This recovers without requiring a restart — the scenario in the issue.
class PerpetualServiceRecovery(Dependency["PerpetualServiceRecovery"]):
    @classmethod
    @asynccontextmanager
    async def worker_lifecycle(cls, docket, worker):
        await reschedule_automatic_perpetual_tasks(docket)  # docket.replace per automatic perpetual task
        yield

The helper iterates docket.tasks (mirroring docket's own automatic-perpetual scheduling), so it only touches registered, enabled, automatic perpetual services.

Notes

  • The underlying limitation is in docket (_schedule_all_automatic_perpetual_tasks using add); this is a Prefect-side recovery so customers don't have to wait on a docket release. A follow-up upstream fix in pydocket would let us drop the worker dependency.
  • Supersedes the stale draft fix(services): force-reschedule perpetual services after Redis disruption #22303 (same root cause; this is a cleaner, stateless reimplementation).

Checklist

  • This pull request references any related issue by including "closes Worker status not updated to OFFLINE after Redis disruption #22212"
  • If this pull request adds new functionality, it includes unit tests that cover the changes
  • If this pull request removes docs files, it includes redirect settings in mint.json.
  • If this pull request adds functions or classes, it includes helpful docstrings.

Link to Devin session: https://app.devin.ai/sessions/08d8e50048124c0c856e30be495d491b
Requested by: @bdalpe

…out restart

A perpetual service mid-execution when Redis is disrupted is left with its
runs-hash in state=running. The docket worker reschedules automatic perpetual
tasks on every (re)connection, but via docket.add, which no-ops on a task that
is already known -- including one stuck in state=running. The Foreman therefore
never runs again and stale workers stay ONLINE indefinitely.

Force-(re)schedule automatic perpetual tasks with docket.replace, which clears
the stale running state:
- register_and_schedule_perpetual_services uses replace (server-restart path)
- new PerpetualServiceRecovery worker dependency reschedules via replace on each
  worker-loop start, recovering after a mid-runtime Redis reconnection

Co-authored-by: Brendan Dalpe <brendan.dalpe@prefect.io>
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@bdalpe bdalpe self-assigned this Jun 30, 2026
@devin-ai-integration

Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment, CI, and merge conflict monitoring
@github-actions github-actions Bot added the bug Something isn't working label Jun 30, 2026
bdalpe and others added 2 commits June 30, 2026 18:54
Co-authored-by: Brendan Dalpe <brendan.dalpe@prefect.io>
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Brendan Dalpe <brendan.dalpe@prefect.io>
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@codspeed-hq

codspeed-hq Bot commented Jun 30, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

✅ 8 untouched benchmarks
⏩ 1 skipped benchmark1


Comparing devin1/oss-8028-worker-status-not-updated-to-offline-after-redis-disruption (60a3263) with main (e5b9dbc)

Open in CodSpeed

Footnotes

  1. 1 benchmark was skipped, so the baseline result was used instead. If it was deleted from the codebase, click here and archive it to remove it from the performance reports.

@bdalpe bdalpe marked this pull request as ready for review July 1, 2026 13:39
@desertaxle

Copy link
Copy Markdown
Member

I think that there's a better fix for the underlying issue in Docket. I am working on a fix in chrisguidry/docket#439

@devin-ai-integration

Copy link
Copy Markdown
Contributor Author

Makes sense — fixing _schedule_all_automatic_perpetual_tasks to use replace (or equivalent) in docket itself would eliminate the need for the PerpetualServiceRecovery dependency on the Prefect side. Happy to close this in favor of that upstream fix if you'd prefer, or keep it as a stopgap until the docket release lands. Up to you.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 267b345bd3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

) -> AsyncGenerator[None, None]:
worker_task: asyncio.Task[None] | None = None
async with Worker(docket) as worker:
async with Worker(docket, dependencies=[PerpetualServiceRecovery()]) as worker:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Bump pydocket lower bound for worker dependencies

This new dependencies= argument relies on newer Docket worker dependency/lifecycle support, but the package metadata still allows pydocket>=0.19.0. In environments that upgrade Prefect while keeping an allowed older pydocket, 0.19.x raises TypeError: Worker.__init__() got an unexpected keyword argument 'dependencies', and 0.20.x accepts the argument but never invokes worker_lifecycle, so the no-restart recovery path silently does not run. Please raise the minimum pydocket version to the first release that supports this lifecycle hook.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — worker_lifecycle was introduced in pydocket 0.20.2; versions 0.19.x–0.20.1 accept the dependencies kwarg but never invoke the hook, so PerpetualServiceRecovery would silently not run. Bumped the lower bound to >=0.20.2 in 60a3263.

worker_lifecycle, used by PerpetualServiceRecovery, was introduced in
pydocket 0.20.2. Older versions accept the dependencies kwarg but never
invoke the hook, so mid-runtime recovery would silently not run.

Co-authored-by: Brendan Dalpe <brendan.dalpe@prefect.io>
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

2 participants