[CI] Rerun failing tests for periodic build pipelines by breskeby · Pull Request #139200 · elastic/elasticsearch

breskeby · 2025-12-08T15:19:11Z

When a Buildkite job is retried, the system fetches failed test information from the Develocity API and reruns only the tests that failed in the previous attempt, significantly reducing retry time and CI resource usage.
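A minimal sketch of the retry gate described above, based on the condition in the `pre-command` hook hunk quoted in this thread. The function name `should_smart_retry` and the echoed messages are hypothetical; only the `SMART_RETRIES` and `BUILDKITE_RETRY_COUNT` variables come from the PR.

```shell
#!/usr/bin/env bash

# Hypothetical helper (the name is an assumption, not from the PR).
# Succeeds only when smart retries are enabled and this job is a retry.
should_smart_retry() {
  [[ "${SMART_RETRIES:-}" == "true" && "${BUILDKITE_RETRY_COUNT:-0}" -gt 0 ]]
}

if should_smart_retry; then
  echo "retry detected: rerun only previously failed tests"
else
  echo "first attempt or smart retries disabled: run the full task set"
fi
```

Defaulting both variables with `:-` keeps the check safe when the hook runs with unset variables, for example outside Buildkite.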

elasticsearchmachine · 2025-12-08T15:19:48Z

Pinging @elastic/es-delivery (Team:Delivery)

breskeby · 2025-12-08T15:37:41Z

build-tools-internal/src/main/groovy/elasticsearch.build-scan.gradle

@@ -123,7 +123,8 @@ develocity {

            // Add a build annotation


This updates the Gradle build scan annotation in Buildkite to also show the retry count, so that build scans of original and retry jobs can be told apart.

breskeby · 2025-12-08T15:38:10Z

.buildkite/hooks/pre-command

 EOF
 fi

+if [[ "${SMART_RETRIES:-}" == "true" && "${BUILDKITE_RETRY_COUNT:-0}" -gt 0 ]]; then


I just hate windows

jozala

I can't wait to have this running on CI.
Some minor comments left.
I'd like to discuss two things in pre-command: the `--compressed` flag in the curl command and the `BUILD_SCAN_ID` validation.

jozala · 2025-12-10T09:39:08Z

.buildkite/hooks/pre-command

 EOF
 fi

+if [[ "${SMART_RETRIES:-}" == "true" && "${BUILDKITE_RETRY_COUNT:-0}" -gt 0 ]]; then


jozala · 2025-12-10T09:59:17Z

.buildkite/hooks/pre-command

+  if BUILD_JSON=$(curl --max-time 30 -H "Authorization: Bearer $BUILDKITE_API_TOKEN" -X GET "https://api.buildkite.com/v2/organizations/elastic/pipelines/${BUILDKITE_PIPELINE_SLUG}/builds/${BUILDKITE_BUILD_NUMBER}?include_retried_jobs=true" 2>/dev/null); then
+    if ORIGIN_JOB_ID=$(printf '%s\n' "$BUILD_JSON" | jq -r --arg jobId "$BUILDKITE_JOB_ID" ' .jobs[] | select(.id == $jobId) | .retry_source.job_id' 2>/dev/null) && [ "$ORIGIN_JOB_ID" != "null" ] && [ -n "$ORIGIN_JOB_ID" ]; then
+      if BUILD_SCAN_URL=$(printf '%s\n' "$BUILD_JSON" | jq -r --arg job_id "$ORIGIN_JOB_ID" '.meta_data["build-scan-" + $job_id]' 2>/dev/null) && [ "$BUILD_SCAN_URL" != "null" ] && [ -n "$BUILD_SCAN_URL" ]; then
+        BUILD_SCAN_ID=$(echo "$BUILD_SCAN_URL" | sed 's|.*/s/||')


I think we should start writing some scripts in a language other than Bash. That's just a general thought. Nothing to change here, I think. I'm just curious about your thoughts after writing this.

jozala · 2025-12-10T10:17:17Z

.buildkite/hooks/pre-command

+      if BUILD_SCAN_URL=$(printf '%s\n' "$BUILD_JSON" | jq -r --arg job_id "$ORIGIN_JOB_ID" '.meta_data["build-scan-" + $job_id]' 2>/dev/null) && [ "$BUILD_SCAN_URL" != "null" ] && [ -n "$BUILD_SCAN_URL" ]; then
+        BUILD_SCAN_ID=$(echo "$BUILD_SCAN_URL" | sed 's|.*/s/||')
+
+        # Validate BUILD_SCAN_ID format to prevent injection attacks


What kind of injection attacks are possible here?
As far as I can see we are reading the BUILD_SCAN_ID from BUILD_JSON which we control here. Could we get rid of that to simplify the logic or sanitize this early?

I reworked this whole thing to pass the build scan ID directly via Buildkite meta-data. That makes this brittle IMO
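To make the question concrete, here is a minimal sketch of the kind of extraction and format check under discussion. The helper names, the sample URL, and the assumption that a build scan ID consists only of lowercase letters and digits are illustrative and not taken from the PR.

```shell
#!/usr/bin/env bash

# Hypothetical helper: strip everything up to and including the last "/s/"
# from a build scan URL, mirroring the sed expression in the hunk above.
extract_build_scan_id() {
  local url="$1"
  printf '%s\n' "${url##*/s/}"
}

# Hypothetical check. Assumption: IDs contain only lowercase letters and digits.
is_valid_build_scan_id() {
  [[ "$1" =~ ^[a-z0-9]+$ ]]
}

# The URL below is a made-up example.
id=$(extract_build_scan_id "https://develocity.example.com/s/abc123def4567")
if is_valid_build_scan_id "$id"; then
  echo "build scan id: $id"
else
  echo "unexpected build scan id format" >&2
fi
```

A check like this only matters if the ID is later interpolated into another command or URL. Reading the ID from a trusted source removes the parsing step, which is the simplification the review asks about.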

jozala · 2025-12-10T10:21:17Z

.buildkite/hooks/pre-command

+          if curl --request GET \
+            --url "$DEVELOCITY_FAILED_TEST_API_URL" \
+            --max-filesize 10485760 \
+            --max-time 30 \
+            --header 'accept: application/json' \
+            --header "authorization: Bearer $DEVELOCITY_API_ACCESS_KEY" \
+            --header 'content-type: application/json' 2>/dev/null | gunzip | jq '.' &> .failed-test-history.json; then


It would be more resilient to use `--compressed` in the curl command instead of `| gunzip`. That way curl takes care of the compression method in case it changes.

Yeah, indeed. I already do that for Windows 🤦. Fixed.
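A minimal sketch of the suggested change, assuming the same flags as the hunk above with `--compressed` replacing the `| gunzip` stage. The function name `fetch_failed_tests` and its arguments are hypothetical, and `--silent` and `--fail` are additions that do not appear in the quoted hunk. With `--compressed`, curl requests a compressed response and decodes it transparently.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper (name and arguments are assumptions, not from the PR).
# $1: API URL, $2: access key. Prints the decoded response body on stdout.
fetch_failed_tests() {
  local url="$1" access_key="$2"
  curl --request GET \
    --url "$url" \
    --compressed \
    --silent \
    --fail \
    --max-filesize 10485760 \
    --max-time 30 \
    --header 'accept: application/json' \
    --header "authorization: Bearer $access_key"
}

# Usage, mirroring the hook: pretty-print the response into the history file.
# fetch_failed_tests "$DEVELOCITY_FAILED_TEST_API_URL" "$DEVELOCITY_API_ACCESS_KEY" \
#   | jq '.' > .failed-test-history.json
```

Because curl handles decoding itself, the pipeline no longer depends on the server always answering with gzip, and an uncompressed response does not break the `gunzip` stage.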

jozala · 2025-12-10T10:26:43Z

.buildkite/hooks/pre-command

+            # Create Buildkite annotation for visibility
+            # Use unique context per job to support multiple retries
+            cat << EOF | buildkite-agent annotate --style info --context "smart-retry-$BUILDKITE_JOB_ID"
+Rerunning failed build job [$ORIGIN_JOB_NAME]($BUILD_SCAN_URL)
+
+**Gradle Tasks with Failures:** $FILTERED_WORK_UNITS
+
+This retry will skip test tasks that had no failures in the previous run.
+EOF


I really like this. It makes it easy to understand what happened in the build.

jozala · 2025-12-10T10:33:35Z

.buildkite/hooks/pre-command.bat

+
+REM Smart retries implementation
+if "%SMART_RETRIES%"=="true" (
+  if defined BUILDKITE_RETRY_COUNT (
+    if %BUILDKITE_RETRY_COUNT% GTR 0 (
+      echo --- Resolving previously failed tests
+      set SMART_RETRY_STATUS=disabled
+      set SMART_RETRY_DETAILS=
+
+      REM Fetch build information from Buildkite API


Ohhh... That's another reason to write these scripts in another language - probably something as OS-agnostic as it can be.
That's just another general note. Nothing to change here.

...t/groovy/org/elasticsearch/gradle/internal/test/rerun/InternalTestRerunPluginFuncTest.groovy

jozala · 2025-12-10T11:48:21Z

...rnal/src/main/java/org/elasticsearch/gradle/internal/test/rerun/model/FailedTestsReport.java

+public class FailedTestsReport {
+    private List<WorkUnit> workUnits;
+
+    public List<WorkUnit> getWorkUnits() {
+        return workUnits != null ? workUnits : java.util.Collections.emptyList();
+    }
+
+    public void setWorkUnits(List<WorkUnit> workUnits) {
+        this.workUnits = workUnits;
+    }
+}


NIT: We could probably use records for the model classes, but it may not be worth changing now. I'll let you decide whether it's worth it.

jozala · 2025-12-10T12:05:02Z

...ools/src/testFixtures/groovy/org/elasticsearch/gradle/fixtures/AbstractGradleFuncTest.groovy

+        File createFailedTest(String clazzName) {
+            createTest(clazzName, testMethodContent(false, true, 1))
+        }


Seems to be unused.

breskeby · 2025-12-10T19:48:03Z

I triggered a new test run after applying the review feedback and some minor refactorings: https://buildkite.com/elastic/elasticsearch-periodic-platform-support/builds/11538

jozala

LGTM

elasticsearchmachine · 2025-12-11T11:37:28Z

💔 Backport failed

Status	Branch	Result
❌	9.1	Commit could not be cherrypicked due to conflicts
❌	9.2	Commit could not be cherrypicked due to conflicts
❌	8.19	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 139200

* upstream/main: (79 commits)
  Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/140_pre_filter_search_shards/prefilter on non-indexed date fields} elastic#139381
  Adjust error bounds for bfloat16 value checks (elastic#139371)
  Unmute some vector CSS tests (elastic#139370)
  Do not allow `project_routing` as a query param (elastic#139206)
  Unmute HalfFloat...Tests#testSynthesizeArrayRandom (elastic#139341)
  Remove leniency in LinkedProjectConfig builder methods (elastic#139012)
  EQL: fix project_routing (elastic#139366)
  Add patch version for 9.2 index version constant (elastic#139362)
  Mute org.elasticsearch.test.rest.yaml.RcsCcsCommonYamlTestSuiteIT test {p0=search.vectors/200_dense_vector_docvalue_fields/dense_vector docvalues with bfloat16} elastic#139368
  ES|QL: Enable CCS tests for FORK (elastic#139302)
  Restructuring the semantic_text field type page (elastic#138571)
  AggregateMetricDouble fields should not build BKD indexes (elastic#138724)
  Feature/count by trunc with filter (elastic#138765)
  ESQL: Convert TS 500 error to 400 (elastic#139360)
  [CI] Rerun failing tests for periodic build pipelines (elastic#139200)
  revert muting saml test (elastic#139327)
  Add TDigest histogram as metric (elastic#139247)
  Links solved bugs to class cast exception changelog and unmutes errors (elastic#139340)
  Ensure integer sorts are rewritten to long sorts for BWC indexes (elastic#139293)
  Integrate stored fields format bloom filter with synthetic _id (elastic#138515)
  ...

* Rerun only tests failed in previous build job iteration
* Rerun more jobs in period builds
* Update periodic platform tests
* Run platform periodic tests with --continue
* Store build-scan-id directly in Buildkite metadata from Gradle.

(cherry picked from commit 94454a3)

# Conflicts:
#	.buildkite/pipelines/periodic.yml

* Rerun only tests failed in previous build job iteration
* Rerun more jobs in period builds
* Update periodic platform tests
* Run platform periodic tests with --continue
* Store build-scan-id directly in Buildkite metadata from Gradle.

(cherry picked from commit 94454a3)

# Conflicts:
#	.buildkite/hooks/pre-command
#	.buildkite/pipelines/periodic-platform-support.yml
#	.buildkite/pipelines/periodic.yml
#	build-tools-internal/src/main/groovy/elasticsearch.build-scan.gradle
#	build-tools-internal/src/main/java/org/elasticsearch/gradle/internal/test/rerun/TestRerunTaskExtension.java

breskeby · 2025-12-12T11:35:25Z

💚 All backports created successfully

Status	Branch	Result
✅	9.2
✅	9.1
✅	8.19

Questions ?

Please refer to the Backport tool documentation

breskeby requested a review from a team as a code owner December 8, 2025 15:19

breskeby added the >non-issue, :Delivery/Build (Build or test infrastructure), Team:Delivery (Meta label for Delivery team), auto-backport (Automatically create backport pull requests when merged), v9.3.0, v9.1.9, v8.19.9, and v9.2.3 labels Dec 8, 2025

breskeby self-assigned this Dec 8, 2025

breskeby changed the title ~~WIP TestRetry~~ Dec 8, 2025

breskeby force-pushed the support-rerun-failed-tests-only branch from 58d2ddd to b7a21b9 Compare December 8, 2025 15:35

breskeby commented Dec 8, 2025

jozala requested changes Dec 10, 2025

breskeby force-pushed the support-rerun-failed-tests-only branch from bba2995 to f0f3944 Compare December 10, 2025 19:01

breskeby added 13 commits December 10, 2025 20:09

WIP TestRetry

69a7fde

Allow jobs in periodic build to retry

5e6a1c1

Rerun only tests failed in previous build job iteration

be2f421

fix spotless

4db6c78

Fix typo

e6ff9d8

Make pre command hook more error prone

e1483f8

Fix handling of non existing test history

365f62c

Rerun more jobs in period builds

ad60c9c

Cleanup TestRerun logic

b1c4b1d

Cleanup

32a1748

Fix spotless

c6a97bb

Apply some further cleanup and polishing

7833833

Update periodic platform tests

03dbbf3

Fix buildkite annotations
Respect retries in buildkite build UIs
Run platform periodic tests with --continue
Add clever retries for windows
Remove dockeravailability

breskeby added 4 commits December 10, 2025 20:09

Apply review feedback

04a6433

Simplify BUILD_SCAN_ID resolution in smart retries

c1c4e65

- Store build-scan-id directly in Buildkite metadata from Gradle.
- Update pre-command hooks (Bash and Batch) to prefer reading build-scan-id from metadata.
- Remove logic to extract build scan ID from URL in pre-command hooks.

Apply review feedback using curl

84f374a

avoid relying on gunzip and reuse --compressed in curl call

More review feedback

ebe481c

- remove unused testfixture method
- use records where possible

breskeby force-pushed the support-rerun-failed-tests-only branch from e123c79 to ebe481c Compare December 10, 2025 19:09

elasticsearchmachine added v9.1.10 v8.19.10 v9.2.4 and removed v9.1.9 v8.19.9 v9.2.3 labels Dec 11, 2025

jozala approved these changes Dec 11, 2025

breskeby merged commit 94454a3 into elastic:main Dec 11, 2025
33 of 36 checks passed

elasticsearchmachine added the backport pending label Dec 11, 2025

This was referenced Dec 12, 2025

[9.2] [CI] Rerun failing tests for periodic build pipelines (#139200) #139430

Merged

[9.1] [CI] Rerun failing tests for periodic build pipelines (#139200) #139431

Merged

breskeby mentioned this pull request Dec 12, 2025

[8.19] [CI] Rerun failing tests for periodic build pipelines (#139200) #139433

Merged

breskeby merged 17 commits into elastic:main from breskeby:support-rerun-failed-tests-only
