fix: gracefully handle JIT config failures and terminate unconfigured instance #4990
Merged
If JIT config creation fails for one instance, the scale-up now continues with the remaining instances instead of skipping the entire batch. The failed instances are reported at the end.
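The per-instance handling described above can be sketched roughly as follows. This is a minimal illustration, not the code from `scale-up.ts`: the names `createJitConfigsForBatch`, `CreateOne`, and `JitBatchResult` are hypothetical, and the GitHub API call is injected as a plain async function so the sketch stays self-contained.

```typescript
// Hypothetical types and names for illustration; the real scale-up.ts may differ.
type JitConfig = { instanceId: string; encodedConfig: string };
type CreateOne = (instanceId: string) => Promise<string>;

interface JitBatchResult {
  configs: JitConfig[];
  failedInstances: string[];
}

async function createJitConfigsForBatch(
  instanceIds: string[],
  createOne: CreateOne,
): Promise<JitBatchResult> {
  const configs: JitConfig[] = [];
  const failedInstances: string[] = [];

  for (const instanceId of instanceIds) {
    try {
      // A failure here affects only this instance, not the rest of the batch.
      const encodedConfig = await createOne(instanceId);
      configs.push({ instanceId, encodedConfig });
    } catch (err: unknown) {
      const message = err instanceof Error ? err.message : String(err);
      console.warn(`JIT config creation failed for ${instanceId}: ${message}`);
      failedInstances.push(instanceId);
    }
  }

  // Report all failures once, after the whole batch has been attempted.
  if (failedInstances.length > 0) {
    console.error(
      `Failed to create JIT config for ${failedInstances.length} instance(s): ${failedInstances.join(', ')}`,
    );
  }
  return { configs, failedInstances };
}
```

Returning the failed instance IDs alongside the successful configs lets the caller decide what to do with them, for example terminating them.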
Contributor
Dependency Review ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.
cd7ca8f to
9f37a04
Compare
Instances that failed to start because of incorrect configuration were never terminated directly; we relied on a later scale-down to clean them up. With this update, failed instances are terminated right away.
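A minimal sketch of that cleanup step, assuming the caller already knows which instance IDs failed to get a JIT config. The function name `terminateUnconfiguredInstances` and the injected `TerminateInstances` callback are hypothetical; the actual change calls into the module's own EC2 helpers.

```typescript
// Hypothetical callback: terminates the given EC2 instance IDs in one call.
type TerminateInstances = (instanceIds: string[]) => Promise<void>;

// Returns true when a termination request was issued and succeeded.
async function terminateUnconfiguredInstances(
  failedInstanceIds: string[],
  terminate: TerminateInstances,
): Promise<boolean> {
  if (failedInstanceIds.length === 0) {
    return false; // nothing to clean up
  }
  try {
    await terminate(failedInstanceIds);
    console.info(`Terminated unconfigured instance(s): ${failedInstanceIds.join(', ')}`);
    return true;
  } catch (err: unknown) {
    // Assumption: if termination fails, a later scale-down run can still
    // remove these instances, as it did before this change.
    const message = err instanceof Error ? err.message : String(err);
    console.error(`Failed to terminate unconfigured instance(s): ${message}`);
    return false;
  }
}
```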
5803005 to
097ccc6
Compare
edersonbrilhante
approved these changes
Mar 6, 2026
npalm
pushed a commit
that referenced
this pull request
Mar 9, 2026
🤖 I have created a release *beep* *boop*

---

## [7.4.1](v7.4.0...v7.4.1) (2026-03-09)

### Bug Fixes

* gracefully handle JIT config failures and terminate unconfigured instance ([#4990](#4990)) ([c171550](c171550))
* **install-runner.sh:** support Debian ([#5027](#5027)) ([7755b7f](7755b7f))
* **lambda:** add jti claim to GitHub App JWTs to prevent concurrent collisions ([#5056](#5056)) ([07bd193](07bd193)), closes [#5025](#5025)
* **lambda:** bump @octokit/auth-app from 8.1.2 to 8.2.0 in /lambdas in the octokit group ([#5035](#5035)) ([1c8083e](1c8083e))
* **lambda:** bump axios from 1.13.2 to 1.13.5 in /lambdas ([#5028](#5028)) ([0335e3a](0335e3a))
* **lambda:** bump qs from 6.14.1 to 6.14.2 in /lambdas ([#5032](#5032)) ([6dc97d5](6dc97d5))
* **lambda:** bump rollup from 4.46.2 to 4.59.0 in /lambdas ([#5052](#5052)) ([1e798b1](1e798b1))
* **lambda:** bump the aws group in /lambdas with 7 updates ([#5021](#5021)) ([c3c158d](c3c158d))
* **lambda:** bump the aws-powertools group in /lambdas with 4 updates ([#5022](#5022)) ([e8369cf](e8369cf))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: runners-releaser[bot] <194412594+runners-releaser[bot]@users.noreply.github.com>
Brend-Smits
pushed a commit
that referenced
this pull request
Mar 11, 2026
shivdesh
added a commit
to shivdesh/terraform-aws-github-runner
that referenced
this pull request
Mar 11, 2026
When the scale-down Lambda fails to de-register a runner from GitHub (even after automatic retries via @octokit/plugin-retry), the EC2 instance should NOT be terminated. This prevents stale runner entries in GitHub org settings.

This change complements PR github-aws-runners#4990 which added @octokit/plugin-retry for automatic retries. While that handles transient failures, this ensures that if de-registration ultimately fails, we don't leave orphaned GitHub runner entries by terminating the EC2 instance prematurely.

Key changes:
- Extract deleteGitHubRunner() helper that catches errors per-runner
- Only terminate EC2 instance if ALL GitHub de-registrations succeed
- If any de-registration fails, leave instance running for next cycle

The @octokit/plugin-retry (added in github-aws-runners#4990) handles automatic retries at the client level, so no custom retry logic is needed here.

Tests:
- Add test verifying EC2 is NOT terminated when de-registration fails
shivdesh
added a commit
to shivdesh/terraform-aws-github-runner
that referenced
this pull request
Mar 11, 2026
shivdesh
added a commit
to shivdesh/terraform-aws-github-runner
that referenced
this pull request
Mar 12, 2026
shivdesh
added a commit
to shivdesh/terraform-aws-github-runner
that referenced
this pull request
Mar 13, 2026
shivdesh
added a commit
to shivdesh/terraform-aws-github-runner
that referenced
this pull request
Mar 13, 2026
When the scale-down Lambda fails to de-register a runner from GitHub (even after automatic retries via @octokit/plugin-retry), the EC2 instance should NOT be terminated. This prevents stale runner entries in GitHub org settings.

This change complements PR github-aws-runners#4990 which added @octokit/plugin-retry for automatic retries. While that handles transient failures, this ensures that if de-registration ultimately fails, we don't leave orphaned GitHub runner entries by terminating the EC2 instance prematurely.

Key changes:
- Extract deleteGitHubRunner() helper that catches errors per-runner
- Only terminate EC2 instance if ALL GitHub de-registrations succeed
- If any de-registration fails, leave instance running for next cycle
- Rename githubAppClient to githubInstallationClient for clarity
- Refactor to split owner/repo once instead of multiple times
- Fix error logging to handle non-Error objects properly

The @octokit/plugin-retry (added in github-aws-runners#4990) handles automatic retries at the client level, so no custom retry logic is needed here.

Tests:
- Add test verifying EC2 is NOT terminated when de-registration fails
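The guard described in this commit message could look roughly like the sketch below. Only the helper name `deleteGitHubRunner` comes from the commit message; its signature, the `removeRunner` wrapper, and the injected `DeleteRunner` and `TerminateInstance` callbacks are assumptions made for illustration.

```typescript
// Hypothetical callbacks standing in for the GitHub and EC2 clients.
type DeleteRunner = (runnerId: number) => Promise<void>;
type TerminateInstance = (instanceId: string) => Promise<void>;

// Catches errors per runner and reports success as a boolean.
async function deleteGitHubRunner(runnerId: number, deleteRunner: DeleteRunner): Promise<boolean> {
  try {
    await deleteRunner(runnerId);
    return true;
  } catch (err: unknown) {
    // Handle non-Error throwables as well as Error instances.
    const message = err instanceof Error ? err.message : String(err);
    console.warn(`Failed to de-register runner ${runnerId}: ${message}`);
    return false;
  }
}

// Terminates the instance only if every de-registration succeeded.
// Note: with an empty runnerIds list there is nothing to de-register,
// so this sketch proceeds to termination.
async function removeRunner(
  instanceId: string,
  runnerIds: number[],
  deleteRunner: DeleteRunner,
  terminate: TerminateInstance,
): Promise<boolean> {
  const results = await Promise.all(runnerIds.map((id) => deleteGitHubRunner(id, deleteRunner)));
  if (results.every((ok) => ok)) {
    await terminate(instanceId);
    return true;
  }
  console.warn(`Leaving ${instanceId} running; de-registration will be retried next cycle`);
  return false;
}
```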
shivdesh
added a commit
to shivdesh/terraform-aws-github-runner
that referenced
this pull request
Mar 27, 2026
Brend-Smits
added a commit
that referenced
this pull request
Apr 1, 2026
… instance (#4990)

This pull request enhances the robustness and reliability of the GitHub Actions runner scaling logic by improving error handling and retry mechanisms for GitHub API calls. It introduces the `@octokit/plugin-retry` plugin to automatically retry failed API requests, adds detailed logging for retry attempts, and ensures that failures in creating JIT configs for individual runner instances do not halt the entire scaling process. Additionally, new tests are added to verify handling of various API failure scenarios.

**GitHub API client improvements:**

* Added `@octokit/plugin-retry` to dependencies (`package.json`) and integrated it into the Octokit client initialization to enable automatic retries for failed GitHub API requests. [[1]](diffhunk://#diff-37d09418dae74ded5678eabfa3b60993ee491e2fd9e49e11142f639b078ac9b2R41) [[2]](diffhunk://#diff-cf7cdd79fe0ed0e3a2e8928c0c7667a096c47c47abdb2354ddadee67e80a226dR21) [[3]](diffhunk://#diff-cf7cdd79fe0ed0e3a2e8928c0c7667a096c47c47abdb2354ddadee67e80a226dL29-R30)
* Configured the retry plugin to log detailed warnings on each retry attempt, including the HTTP method, URL, error message, and status code.

**Error handling and resilience in JIT config creation:**

* Updated `createJitConfig` in `scale-up.ts` to catch and log errors for individual runner instances when creating JIT configs, allowing the process to continue for remaining instances and logging a summary of failed attempts at the end. [[1]](diffhunk://#diff-fbc68af2a40bf14ad13a80b13958c0b52d1d0fde5f0009416a693fb4b691ceaeR537-R542) [[2]](diffhunk://#diff-fbc68af2a40bf14ad13a80b13958c0b52d1d0fde5f0009416a693fb4b691ceaeR582-R596)
* Instances that failed to generate a configuration will now be terminated to avoid generating waste.

**Testing improvements:**

* Added comprehensive tests to `scale-up.test.ts` to verify correct behavior when GitHub API calls fail for some instances, including retryable errors (e.g., 5xx), non-retryable errors (e.g., 4xx), and partial failures, ensuring only successful JIT configs are stored.
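To make the retryable versus non-retryable distinction concrete, here is a small hand-rolled retry loop with a warning per attempt. It is not the `@octokit/plugin-retry` API and not the code from this PR; `HttpError`, `isRetryable`, and `withRetry` are hypothetical names, and the sketch retries immediately where a real client would normally wait between attempts.

```typescript
// Hypothetical error type carrying an HTTP status code.
class HttpError extends Error {
  readonly status: number;
  constructor(message: string, status: number) {
    super(message);
    this.status = status;
  }
}

// Assumption for this sketch: only 5xx responses are worth retrying.
function isRetryable(status: number): boolean {
  return status >= 500;
}

async function withRetry<T>(
  label: string,
  request: () => Promise<T>,
  maxRetries = 3,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await request();
    } catch (err: unknown) {
      const status = err instanceof HttpError ? err.status : undefined;
      const retryable = status !== undefined && isRetryable(status);
      // Give up on non-retryable errors or once the retry budget is spent.
      if (!retryable || attempt >= maxRetries) {
        throw err;
      }
      const message = err instanceof Error ? err.message : String(err);
      console.warn(`Retry ${attempt + 1}/${maxRetries} for ${label}: ${message} (status ${status})`);
    }
  }
}
```

With `maxRetries = 3`, a request is attempted at most four times: the initial call plus three retries.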
Brend-Smits
pushed a commit
that referenced
this pull request
Apr 1, 2026
LudovicTOURMAN
pushed a commit
to doctolib-lab/terraform-aws-github-runner
that referenced
this pull request
Apr 7, 2026
LudovicTOURMAN
pushed a commit
to doctolib-lab/terraform-aws-github-runner
that referenced
this pull request
Apr 7, 2026
This pull request enhances the robustness and reliability of the GitHub Actions runner scaling logic by improving error handling and retry mechanisms for GitHub API calls. It introduces the `@octokit/plugin-retry` plugin to automatically retry failed API requests, adds detailed logging for retry attempts, and ensures that failures in creating JIT configs for individual runner instances do not halt the entire scaling process. Additionally, new tests are added to verify handling of various API failure scenarios.

**GitHub API client improvements:**

* Added `@octokit/plugin-retry` to dependencies (`package.json`) and integrated it into the Octokit client initialization to enable automatic retries for failed GitHub API requests. [1] [2] [3]

**Error handling and resilience in JIT config creation:**

* Updated `createJitConfig` in `scale-up.ts` to catch and log errors for individual runner instances when creating JIT configs, allowing the process to continue for remaining instances and logging a summary of failed attempts at the end. [1] [2]

**Testing improvements:**

* Added comprehensive tests to `scale-up.test.ts` to verify correct behavior when GitHub API calls fail for some instances, including retryable errors (e.g., 5xx), non-retryable errors (e.g., 4xx), and partial failures, ensuring only successful JIT configs are stored.