Replies: 3 comments 2 replies
Hi there, I've run into this exact "context deadline exceeded" wall before with ARC on EKS. It is incredibly frustrating because it fails silently.

The root cause: Karpenter isn't provisioning nodes because the Pods aren't actually being created. The error happens during the API admission phase. The Kubernetes API server receives the request to create the runner pod, tries to call a MutatingWebhook (likely ARC's own webhook, which injects the container hooks), and that connection times out. Because the webhook fails, the API server rejects the Pod creation entirely. The Pod never reaches Pending, so Karpenter never sees it. Since you mentioned migrating/upgrading versions, this is almost certainly one of two things:

1. Stale webhook configurations. Check: run `kubectl get mutatingwebhookconfigurations`. Fix: delete any webhooks that look old or relate to the legacy "summerwind" controller. Only the one matching your current active deployment should remain.

2. Security groups. If your ARC controller is running on a private node, the EKS control plane security group might not be allowed to talk to your node security group on port 9443. Fix: ensure your node security group allows inbound traffic from the EKS cluster security group on port 9443.

One quick test: disable the volumes and volumeMounts in your runner spec for a single run. Sometimes (rarely) it's the storage driver's webhook that times out, but 90% of the time it's the ARC webhook connection failing.

Don't revert to the legacy controller yet! This is just a network/webhook configuration block. Once you clear the webhook timeout, the pod will go Pending and Karpenter will snap it up immediately.
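A quick diagnostic sequence for the webhook theory (the webhook name is a placeholder; the `arc-systems` namespace matches the install commands below):

```shell
# List all mutating webhooks; anything referencing the old
# "summerwind"/legacy actions-runner-controller release is a leftover candidate.
kubectl get mutatingwebhookconfigurations

# Inspect the webhook your current release registered: check
# .webhooks[].clientConfig.service (namespace/name/port) and
# .webhooks[].timeoutSeconds.
kubectl get mutatingwebhookconfiguration <webhook-name> -o yaml

# Verify the webhook's backing service has ready endpoints; an empty
# endpoints list means the API server has nothing to connect to and
# the admission call will time out.
kubectl get endpoints -n arc-systems
```

If the webhook points at a service or namespace that no longer exists, deleting that MutatingWebhookConfiguration is the fix described above.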
This looks like an ARC 0.13.x regression with ephemeral runners: the controller times out creating the runner pod (context deadline exceeded) before Karpenter or volume provisioning can complete. It's not your token, Karpenter, or cluster version. Current workarounds are reverting to the legacy controller / an older ARC version, or removing ephemeral volumes and advanced pod specs. This should be reported upstream so ARC waits properly for pod + PVC creation on Karpenter-backed clusters 🔥
I've had this happen when a node was acting weird; draining the node resolved it. I run ARC 0.13.1 on EKS 1.32.
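For anyone trying the same workaround, the standard drain sequence is (node name is a placeholder):

```shell
# Stop new pods from landing on the suspect node.
kubectl cordon <node-name>

# Evict existing pods; DaemonSet pods and emptyDir-backed pods need
# explicit flags to be skipped/evicted.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Once the node is healthy (or replaced), allow scheduling again.
kubectl uncordon <node-name>
```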
Why are you starting this discussion?
Question
What GitHub Actions topic or product is this about?
ARC (Actions Runner Controller)
Discussion Details
TL;DR: The ARC 0.13.x controller fails to create ephemeral runner pods on Karpenter-managed EKS with "context deadline exceeded" errors. It works with bare-minimal values, but we can't add a nodeSelector or ephemeral volumes. Pod creation times out before Karpenter can provision nodes.
What's failing: Pod creation at ephemeralrunner_controller.go:687
When: During reconciliation loop
Impact: Runners never come online despite valid GitHub token
Already Verified (Not the Issue)
Questions for Community
We are migrating from public GitHub runners to private runners, mainly because GitHub Copilot Coding Agent needs private connectivity in our environment.
Steps to reproduce the error "...Timeout: request did not complete within requested timeout - context deadline exceeded..." (see full log at the bottom):
Install controller (same error seen with latest version 0.13.1):

```shell
helm install arc --namespace "arc-systems" --version 0.13.0 \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller
```

Install arc-runner-set (same error seen with latest version 0.13.1):

```shell
helm install "arc-runner-set" --namespace "github-runners" --version 0.13.0 \
  -f values-minimal.yml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set
```

values-minimal.yml looks like:
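(The original values file didn't survive in this post. For context, a bare-minimal values file for the gha-runner-scale-set chart typically looks like the sketch below; the URL and secret name are placeholders, not the author's actual values.)

```yaml
# Placeholder values - substitute your own org/repo URL and secret name.
githubConfigUrl: "https://github.com/<your-org>"
githubConfigSecret: pre-defined-secret   # Secret containing github_token
minRunners: 0
maxRunners: 5
```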
When running the above we get the main error logs on the controller as shown at the bottom, but we also never see any activity in the Karpenter logs or nodeclaims to bring up nodes. However, a test pod with its node selector set to Karpenter-managed nodes (just as above) comes up fine, so I know Karpenter itself is working.
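The Karpenter sanity check can be reproduced with a one-off pod; the nodeSelector label below is a placeholder for whatever label your NodePool advertises:

```shell
# Launch a plain pod pinned to Karpenter-managed capacity.
# "karpenter.sh/nodepool: default" is an assumed label/pool name.
kubectl run karpenter-test \
  --image=public.ecr.aws/docker/library/busybox:latest \
  --restart=Never \
  --overrides='{"spec":{"nodeSelector":{"karpenter.sh/nodepool":"default"}}}' \
  -- sleep 300

# Watch the pod go Pending, a NodeClaim appear, and the pod schedule.
kubectl get pods -w
kubectl get nodeclaims
```

The key observation in the original report: this plain pod triggers Karpenter, while the ARC runner pod never even reaches Pending.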
Even with the above Helm values and the nodeSelector commented out (taking Karpenter out of play so the runner pods run on managed nodes), it still fails with the same error below.
At this point the only thing I can get running consistently is the extremely bare values below, with no Karpenter nodes; it just runs on the managed node group that Karpenter itself runs on (not what we want):
Error log
Background of goal:
We run Karpenter on AWS EKS clusters quite a bit to utilize EC2 Spot nodes for our workloads. Now we are slowly migrating from Azure DevOps agent runners on AWS ECS to GitHub runners on AWS EKS, mainly to run GitHub Copilot Coding Agent on private/self-hosted runners. We have been using the Coding Agent a lot on the public runner but need private connectivity.
Ideally we would like to install all the tools and settings from our usual Docker container into the actions/runner image and launch it according to the ARC runner Helm values examples. We had semi-successfully done this with Karpenter and GitHub runners, but when upgrading the controller & scale sets to 0.13.1 there were some problems (possibly conflicting CRDs?) and we had to start over; ever since then we can't reproduce even that semi-successful state due to the error at the bottom of this post. So we've reduced the runner-set values to the bare minimum to see what works, but we're hitting this wall. Any help would be very much appreciated!
It seems like the ARC controller isn't waiting long enough for Karpenter to bring up a node and then allocate the runner to it, but we aren't sure about that. Should we try the legacy actions-runner-controller instead, with syncPeriod increased to something greater than 1 minute? The runner-listener logs all look normal; we just can't figure out what's going on with this context deadline exceeded error.
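If we do fall back to the legacy controller, something like this is what we'd try; the chart repo URL and the `syncPeriod` value name are taken from the legacy actions-runner-controller chart and should be verified against the chart version installed:

```shell
# Legacy (summerwind-based) controller; syncPeriod raises the
# reconciliation interval so Karpenter has time to provision a node.
helm repo add actions-runner-controller \
  https://actions-runner-controller.github.io/actions-runner-controller
helm upgrade --install arc-legacy \
  actions-runner-controller/actions-runner-controller \
  --namespace arc-systems \
  --set syncPeriod=10m
```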