fix: gracefully handle JIT config failures and terminate unconfigured instance #4990

Merged
npalm merged 6 commits into main from fix/createJitConfig-error-handling on Mar 9, 2026

@Brend-Smits Brend-Smits commented Jan 7, 2026

This pull request enhances the robustness and reliability of the GitHub Actions runner scaling logic by improving error handling and retry mechanisms for GitHub API calls. It introduces the @octokit/plugin-retry plugin to automatically retry failed API requests, adds detailed logging for retry attempts, and ensures that failures in creating JIT configs for individual runner instances do not halt the entire scaling process. Additionally, new tests are added to verify handling of various API failure scenarios.

GitHub API client improvements:

  • Added @octokit/plugin-retry to dependencies (package.json) and integrated it into the Octokit client initialization to enable automatic retries for failed GitHub API requests. [1] [2] [3]
  • Configured the retry plugin to log detailed warnings on each retry attempt, including the HTTP method, URL, error message, and status code.
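To illustrate the retry-with-logging behavior described above, here is a minimal, library-free sketch. It does not use the @octokit/plugin-retry API; `withRetry`, `RequestDescriptor`, `statusOf`, and `isRetryable` are hypothetical names, the "retry only on 5xx" rule is an assumption made for illustration, and backoff delays are omitted for brevity.

```typescript
// Hypothetical request descriptor; not a type from Octokit or the retry plugin.
interface RequestDescriptor {
  method: string;
  url: string;
}

// Assumption for this sketch: only 5xx responses are treated as transient.
function isRetryable(status: number | undefined): boolean {
  return status !== undefined && status >= 500 && status <= 599;
}

// Read a numeric `status` field from an unknown thrown value, if present.
function statusOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const status = (err as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

// Run `call`, retrying up to `maxRetries` times on retryable failures and
// logging a warning with method, URL, error message, and status per retry.
async function withRetry<T>(
  request: RequestDescriptor,
  call: () => Promise<T>,
  maxRetries: number = 3,
  warn: (message: string) => void = console.warn,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const status = statusOf(err);
      if (!isRetryable(status) || attempt >= maxRetries) {
        throw err; // non-retryable error, or retries exhausted
      }
      const message = err instanceof Error ? err.message : String(err);
      warn(
        `Retry ${attempt + 1}/${maxRetries}: ${request.method} ${request.url} ` +
          `failed: ${message} (status ${status})`,
      );
    }
  }
}
```

In this sketch a 4xx error is rethrown immediately, while a 5xx error is retried until the retry budget is used up.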

Error handling and resilience in JIT config creation:

  • Updated createJitConfig in scale-up.ts to catch and log errors for individual runner instances when creating JIT configs, allowing the process to continue for remaining instances and logging a summary of failed attempts at the end. [1] [2]
  • Instances that fail to generate a configuration are now terminated to avoid wasting resources.
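The per-instance error handling described above can be sketched roughly as follows. This is not the code from `scale-up.ts`: `createJitConfigs`, `createConfig`, and `terminate` are hypothetical names, and the GitHub and EC2 calls are injected as plain functions so the sketch stays self-contained.

```typescript
interface JitConfigResult {
  succeeded: Map<string, string>; // instance id -> encoded JIT config
  failed: string[]; // instance ids for which no config could be created
}

// Create a JIT config per instance. A failure for one instance is logged and
// recorded, and the loop continues with the remaining instances.
async function createJitConfigs(
  instanceIds: string[],
  createConfig: (instanceId: string) => Promise<string>, // hypothetical GitHub call
  terminate: (instanceIds: string[]) => Promise<void>, // hypothetical EC2 call
  log: (message: string) => void = console.warn,
): Promise<JitConfigResult> {
  const succeeded = new Map<string, string>();
  const failed: string[] = [];

  for (const instanceId of instanceIds) {
    try {
      succeeded.set(instanceId, await createConfig(instanceId));
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      log(`JIT config creation failed for ${instanceId}: ${message}`);
      failed.push(instanceId);
    }
  }

  if (failed.length > 0) {
    // Summary at the end, then terminate instances that can never register.
    log(`${failed.length} of ${instanceIds.length} instances failed: ${failed.join(", ")}`);
    await terminate(failed);
  }
  return { succeeded, failed };
}
```

Only the entries in `succeeded` would be stored for the runners to pick up; the instances in `failed` are terminated instead of idling until a later scale-down.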

Testing improvements:

  • Added comprehensive tests to scale-up.test.ts to verify correct behavior when GitHub API calls fail for some instances, including retryable errors (e.g., 5xx), non-retryable errors (e.g., 4xx), and partial failures, ensuring only successful JIT configs are stored.
This ensures that if JIT config creation fails for one instance, scale-up proceeds with the remaining instances instead of skipping the entire batch, and reports the failed instances at the end.
@Brend-Smits Brend-Smits requested a review from a team as a code owner January 7, 2026 12:22

github-actions Bot commented Jan 7, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

OpenSSF Scorecard

| Package | Version | Score |
| --- | --- | --- |
| npm/@octokit/plugin-retry | 8.0.3 | 🟢 6.9 |

Details

| Check | Score | Reason |
| --- | --- | --- |
| Security-Policy | 🟢 9 | security policy file detected |
| Code-Review | 🟢 10 | all changesets reviewed |
| Dangerous-Workflow | 🟢 10 | no dangerous workflow patterns detected |
| Maintained | 🟢 5 | 6 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 5 |
| Binary-Artifacts | 🟢 10 | no binaries found in the repo |
| Pinned-Dependencies | 🟢 5 | dependency not pinned by hash detected -- score normalized to 5 |
| Token-Permissions | ⚠️ 0 | detected GitHub workflow tokens with excessive permissions |
| CII-Best-Practices | ⚠️ 0 | no effort to earn an OpenSSF best practices badge detected |
| License | 🟢 10 | license file detected |
| Fuzzing | ⚠️ 0 | project is not fuzzed |
| Signed-Releases | ⚠️ -1 | no releases found |
| Branch-Protection | ⚠️ -1 | internal error: error during branchesHandler.setup: internal error: some github tokens can't read classic branch protection rules: https://github.com/ossf/scorecard-action/blob/main/docs/authentication/fine-grained-auth-token.md |
| Packaging | 🟢 10 | packaging workflow detected |
| SAST | 🟢 10 | SAST tool is run on all commits |

Scanned Files

  • lambdas/functions/control-plane/package.json
  • lambdas/yarn.lock
@Brend-Smits Brend-Smits force-pushed the fix/createJitConfig-error-handling branch from cd7ca8f to 9f37a04 Compare January 7, 2026 13:07
Previously, instances that failed to start up because of incorrect configuration were never terminated by scale-up; we relied on scale-down to clean them up. Failed instances are now terminated right away.
@Brend-Smits Brend-Smits changed the title fix: ensure scale up creates instances only for runners that have jit configs Mar 9, 2026

@npalm npalm left a comment

@npalm npalm merged commit c171550 into main Mar 9, 2026
10 checks passed
@npalm npalm deleted the fix/createJitConfig-error-handling branch March 9, 2026 20:07
npalm pushed a commit that referenced this pull request Mar 9, 2026
🤖 I have created a release *beep* *boop*
---


##
[7.4.1](v7.4.0...v7.4.1)
(2026-03-09)


### Bug Fixes

* gracefully handle JIT config failures and terminate unconfigured
instance
([#4990](#4990))
([c171550](c171550))
* **install-runner.sh:** support Debian
([#5027](#5027))
([7755b7f](7755b7f))
* **lambda:** add jti claim to GitHub App JWTs to prevent concurrent
collisions
([#5056](#5056))
([07bd193](07bd193)),
closes
[#5025](#5025)
* **lambda:** bump @octokit/auth-app from 8.1.2 to 8.2.0 in /lambdas in
the octokit group
([#5035](#5035))
([1c8083e](1c8083e))
* **lambda:** bump axios from 1.13.2 to 1.13.5 in /lambdas
([#5028](#5028))
([0335e3a](0335e3a))
* **lambda:** bump qs from 6.14.1 to 6.14.2 in /lambdas
([#5032](#5032))
([6dc97d5](6dc97d5))
* **lambda:** bump rollup from 4.46.2 to 4.59.0 in /lambdas
([#5052](#5052))
([1e798b1](1e798b1))
* **lambda:** bump the aws group in /lambdas with 7 updates
([#5021](#5021))
([c3c158d](c3c158d))
* **lambda:** bump the aws-powertools group in /lambdas with 4 updates
([#5022](#5022))
([e8369cf](e8369cf))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: runners-releaser[bot] <194412594+runners-releaser[bot]@users.noreply.github.com>
Brend-Smits pushed a commit that referenced this pull request Mar 11, 2026
shivdesh added a commit to shivdesh/terraform-aws-github-runner that referenced this pull request Mar 11, 2026
When the scale-down Lambda fails to de-register a runner from GitHub
(even after automatic retries via @octokit/plugin-retry), the EC2
instance should NOT be terminated. This prevents stale runner entries
in GitHub org settings.

This change complements PR github-aws-runners#4990 which added @octokit/plugin-retry for
automatic retries. While that handles transient failures, this ensures
that if de-registration ultimately fails, we don't leave orphaned
GitHub runner entries by terminating the EC2 instance prematurely.

Key changes:
- Extract deleteGitHubRunner() helper that catches errors per-runner
- Only terminate EC2 instance if ALL GitHub de-registrations succeed
- If any de-registration fails, leave instance running for next cycle

The @octokit/plugin-retry (added in github-aws-runners#4990) handles automatic retries
at the client level, so no custom retry logic is needed here.

Tests:
- Add test verifying EC2 is NOT terminated when de-registration fails
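A rough sketch of the "terminate only if every de-registration succeeded" rule from this commit message. The helper name `deleteGitHubRunner` comes from the message, but its body here is a guess; `removeRunner`, `deregister`, and `terminate` are hypothetical, and the GitHub and EC2 calls are injected as plain functions.

```typescript
type LogFn = (message: string) => void;

// Catch errors per runner and report success as a boolean instead of throwing.
async function deleteGitHubRunner(
  runnerId: number,
  deregister: (runnerId: number) => Promise<void>, // hypothetical GitHub call
  log: LogFn,
): Promise<boolean> {
  try {
    await deregister(runnerId);
    return true;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    log(`De-registration of runner ${runnerId} failed: ${message}`);
    return false;
  }
}

// Terminate the instance only if all of its runners were de-registered.
// Note: with an empty runner list, `every` is true and the instance is terminated.
async function removeRunner(
  instanceId: string,
  runnerIds: number[],
  deregister: (runnerId: number) => Promise<void>,
  terminate: (instanceId: string) => Promise<void>, // hypothetical EC2 call
  log: LogFn = console.warn,
): Promise<boolean> {
  const results = await Promise.all(
    runnerIds.map((runnerId) => deleteGitHubRunner(runnerId, deregister, log)),
  );
  if (results.every((ok) => ok)) {
    await terminate(instanceId);
    return true;
  }
  log(`Leaving ${instanceId} running; de-registration is retried in the next cycle`);
  return false;
}
```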
shivdesh added a commit to shivdesh/terraform-aws-github-runner that referenced this pull request Mar 11, 2026
shivdesh added a commit to shivdesh/terraform-aws-github-runner that referenced this pull request Mar 12, 2026
shivdesh added a commit to shivdesh/terraform-aws-github-runner that referenced this pull request Mar 13, 2026
shivdesh added a commit to shivdesh/terraform-aws-github-runner that referenced this pull request Mar 13, 2026
When the scale-down Lambda fails to de-register a runner from GitHub
(even after automatic retries via @octokit/plugin-retry), the EC2
instance should NOT be terminated. This prevents stale runner entries
in GitHub org settings.

This change complements PR github-aws-runners#4990 which added @octokit/plugin-retry for
automatic retries. While that handles transient failures, this ensures
that if de-registration ultimately fails, we don't leave orphaned
GitHub runner entries by terminating the EC2 instance prematurely.

Key changes:
- Extract deleteGitHubRunner() helper that catches errors per-runner
- Only terminate EC2 instance if ALL GitHub de-registrations succeed
- If any de-registration fails, leave instance running for next cycle
- Rename githubAppClient to githubInstallationClient for clarity
- Refactor to split owner/repo once instead of multiple times
- Fix error logging to handle non-Error objects properly

The @octokit/plugin-retry (added in github-aws-runners#4990) handles automatic retries
at the client level, so no custom retry logic is needed here.

Tests:
- Add test verifying EC2 is NOT terminated when de-registration fails
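Two of the smaller changes listed above (logging non-Error values, and splitting owner/repo once) might look like the sketch below. Both `describeError` and `splitOwnerRepo` are hypothetical helpers written for illustration; the commit's actual implementation is not shown in this conversation.

```typescript
// Turn any thrown value into a loggable string, not only Error instances.
function describeError(err: unknown): string {
  if (err instanceof Error) return err.message;
  if (typeof err === "string") return err;
  try {
    // JSON.stringify returns undefined for values such as `undefined`.
    return JSON.stringify(err) ?? String(err);
  } catch {
    return String(err); // e.g. circular structures
  }
}

// Split "owner/repo" once and reuse both parts instead of re-splitting.
function splitOwnerRepo(fullName: string): { owner: string; repo: string } {
  const separator = fullName.indexOf("/");
  if (separator <= 0 || separator === fullName.length - 1) {
    throw new Error(`Expected "owner/repo", got "${fullName}"`);
  }
  return {
    owner: fullName.slice(0, separator),
    repo: fullName.slice(separator + 1),
  };
}
```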
shivdesh added a commit to shivdesh/terraform-aws-github-runner that referenced this pull request Mar 27, 2026
Brend-Smits added a commit that referenced this pull request Apr 1, 2026
… instance (#4990)

This pull request enhances the robustness and reliability of the GitHub
Actions runner scaling logic by improving error handling and retry
mechanisms for GitHub API calls. It introduces the
`@octokit/plugin-retry` plugin to automatically retry failed API
requests, adds detailed logging for retry attempts, and ensures that
failures in creating JIT configs for individual runner instances do not
halt the entire scaling process. Additionally, new tests are added to
verify handling of various API failure scenarios.

**GitHub API client improvements:**

* Added `@octokit/plugin-retry` to dependencies (`package.json`) and
integrated it into the Octokit client initialization to enable automatic
retries for failed GitHub API requests.
[[1]](diffhunk://#diff-37d09418dae74ded5678eabfa3b60993ee491e2fd9e49e11142f639b078ac9b2R41)
[[2]](diffhunk://#diff-cf7cdd79fe0ed0e3a2e8928c0c7667a096c47c47abdb2354ddadee67e80a226dR21)
[[3]](diffhunk://#diff-cf7cdd79fe0ed0e3a2e8928c0c7667a096c47c47abdb2354ddadee67e80a226dL29-R30)
* Configured the retry plugin to log detailed warnings on each retry
attempt, including the HTTP method, URL, error message, and status code.

**Error handling and resilience in JIT config creation:**

* Updated `createJitConfig` in `scale-up.ts` to catch and log errors for
individual runner instances when creating JIT configs, allowing the
process to continue for remaining instances and logging a summary of
failed attempts at the end.
[[1]](diffhunk://#diff-fbc68af2a40bf14ad13a80b13958c0b52d1d0fde5f0009416a693fb4b691ceaeR537-R542)
[[2]](diffhunk://#diff-fbc68af2a40bf14ad13a80b13958c0b52d1d0fde5f0009416a693fb4b691ceaeR582-R596)
* Instances that fail to generate a configuration are now terminated to
avoid wasting resources.

**Testing improvements:**

* Added comprehensive tests to `scale-up.test.ts` to verify correct
behavior when GitHub API calls fail for some instances, including
retryable errors (e.g., 5xx), non-retryable errors (e.g., 4xx), and
partial failures, ensuring only successful JIT configs are stored.
Brend-Smits pushed a commit that referenced this pull request Apr 1, 2026
LudovicTOURMAN pushed a commit to doctolib-lab/terraform-aws-github-runner that referenced this pull request Apr 7, 2026
… instance (github-aws-runners#4990)
LudovicTOURMAN pushed a commit to doctolib-lab/terraform-aws-github-runner that referenced this pull request Apr 7, 2026