Skip to content

[CI] Rerun failing tests for periodic build pipelines#139200

Merged
breskeby merged 17 commits intoelastic:mainfrom
breskeby:support-rerun-failed-tests-only
Dec 11, 2025
Merged

[CI] Rerun failing tests for periodic build pipelines#139200
breskeby merged 17 commits intoelastic:mainfrom
breskeby:support-rerun-failed-tests-only

Conversation

@breskeby
Copy link
Contributor

@breskeby breskeby commented Dec 8, 2025

When a Buildkite job is retried, the system fetches failed test information from the Develocity API and reruns only the tests that failed in the previous attempt, significantly reducing retry time and CI resource usage.

@breskeby breskeby requested a review from a team as a code owner December 8, 2025 15:19
@breskeby breskeby added >non-issue :Delivery/Build Build or test infrastructure Team:Delivery Meta label for Delivery team auto-backport Automatically create backport pull requests when merged v9.3.0 v9.1.9 v8.19.9 v9.2.3 labels Dec 8, 2025
@breskeby breskeby self-assigned this Dec 8, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@breskeby breskeby changed the title WIP TestRetry Dec 8, 2025
@breskeby breskeby force-pushed the support-rerun-failed-tests-only branch from 58d2ddd to b7a21b9 Compare December 8, 2025 15:35
@@ -123,7 +123,8 @@ develocity {

// Add a build annotation
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This updates the gradle build scan annotation in buildkite to also show the retry count to allow differentiating build scans of original and retry jobs

EOF
fi

if [[ "${SMART_RETRIES:-}" == "true" && "${BUILDKITE_RETRY_COUNT:-0}" -gt 0 ]]; then
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just hate windows

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOL

Copy link
Contributor

@jozala jozala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't wait to have this running on CI.
Some minor comments left.
I'd like to discuss two things in pre-command about the --compressed in curl command and BUILD_SCAN_ID` validation.

EOF
fi

if [[ "${SMART_RETRIES:-}" == "true" && "${BUILDKITE_RETRY_COUNT:-0}" -gt 0 ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOL

Comment on lines 153 to 156
if BUILD_JSON=$(curl --max-time 30 -H "Authorization: Bearer $BUILDKITE_API_TOKEN" -X GET "https://api.buildkite.com/v2/organizations/elastic/pipelines/${BUILDKITE_PIPELINE_SLUG}/builds/${BUILDKITE_BUILD_NUMBER}?include_retried_jobs=true" 2>/dev/null); then
if ORIGIN_JOB_ID=$(printf '%s\n' "$BUILD_JSON" | jq -r --arg jobId "$BUILDKITE_JOB_ID" ' .jobs[] | select(.id == $jobId) | .retry_source.job_id' 2>/dev/null) && [ "$ORIGIN_JOB_ID" != "null" ] && [ -n "$ORIGIN_JOB_ID" ]; then
if BUILD_SCAN_URL=$(printf '%s\n' "$BUILD_JSON" | jq -r --arg job_id "$ORIGIN_JOB_ID" '.meta_data["build-scan-" + $job_id]' 2>/dev/null) && [ "$BUILD_SCAN_URL" != "null" ] && [ -n "$BUILD_SCAN_URL" ]; then
BUILD_SCAN_ID=$(echo "$BUILD_SCAN_URL" | sed 's|.*/s/||')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should start writing some scripts in other language than Bash. That's just a general thought. Nothing to change here I think. I'm just curious about your thoughts after writing this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree

if BUILD_SCAN_URL=$(printf '%s\n' "$BUILD_JSON" | jq -r --arg job_id "$ORIGIN_JOB_ID" '.meta_data["build-scan-" + $job_id]' 2>/dev/null) && [ "$BUILD_SCAN_URL" != "null" ] && [ -n "$BUILD_SCAN_URL" ]; then
BUILD_SCAN_ID=$(echo "$BUILD_SCAN_URL" | sed 's|.*/s/||')

# Validate BUILD_SCAN_ID format to prevent injection attacks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of injection attacks are possible here?
As far as I can see we are reading the BUILD_SCAN_ID from BUILD_JSON which we control here. Could we get rid of that to simplify the logic or sanitize this early?

Copy link
Contributor Author

@breskeby breskeby Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworked this whole thing to directly pass the buildscanId via bk meta-data. that makes this brittle IMO

Comment on lines 172 to 178
if curl --request GET \
--url "$DEVELOCITY_FAILED_TEST_API_URL" \
--max-filesize 10485760 \
--max-time 30 \
--header 'accept: application/json' \
--header "authorization: Bearer $DEVELOCITY_API_ACCESS_KEY" \
--header 'content-type: application/json' 2>/dev/null | gunzip | jq '.' &> .failed-test-history.json; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be more resilient if used --compressed in the curl command instead of | gunzip. This way cURL should take care of the compression method in the case it changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah indeed. I do already for windows 🤦 . fixed

Comment on lines +196 to +204
# Create Buildkite annotation for visibility
# Use unique context per job to support multiple retries
cat << EOF | buildkite-agent annotate --style info --context "smart-retry-$BUILDKITE_JOB_ID"
Rerunning failed build job [$ORIGIN_JOB_NAME]($BUILD_SCAN_URL)

**Gradle Tasks with Failures:** $FILTERED_WORK_UNITS

This retry will skip test tasks that had no failures in the previous run.
EOF
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this. It makes easy to understand what happened in the build.

Comment on lines +30 to +39

REM Smart retries implementation
if "%SMART_RETRIES%"=="true" (
if defined BUILDKITE_RETRY_COUNT (
if %BUILDKITE_RETRY_COUNT% GTR 0 (
echo --- Resolving previously failed tests
set SMART_RETRY_STATUS=disabled
set SMART_RETRY_DETAILS=

REM Fetch build information from Buildkite API
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohhh... That's another reason to write these scripts in another language - probably something as OS-agnostic as it can be.
That's just another general note. Nothing to change here.

Comment on lines 21 to 31
public class FailedTestsReport {
private List<WorkUnit> workUnits;

public List<WorkUnit> getWorkUnits() {
return workUnits != null ? workUnits : java.util.Collections.emptyList();
}

public void setWorkUnits(List<WorkUnit> workUnits) {
this.workUnits = workUnits;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: We could probably use records for the model classes, but may not be worth to change it now. I'll let you to decide if it's worth.

Comment on lines 331 to 333
File createFailedTest(String clazzName) {
createTest(clazzName, testMethodContent(false, true, 1))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be unused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@breskeby breskeby force-pushed the support-rerun-failed-tests-only branch from bba2995 to f0f3944 Compare December 10, 2025 19:01
- Store build-scan-id directly in Buildkite metadata from Gradle.
- Update pre-command hooks (Bash and Batch) to prefer reading build-scan-id from metadata.
- Remove logic to extract build scan ID from URL in pre-command hooks.
avoid relying on gunzip and reuse --compressed in curl call
- remove unused testfixture method
- use records where possible
@breskeby breskeby force-pushed the support-rerun-failed-tests-only branch from e123c79 to ebe481c Compare December 10, 2025 19:09
@breskeby
Copy link
Contributor Author

I triggered a new test run after apply review feedback / minor refactorings https://buildkite.com/elastic/elasticsearch-periodic-platform-support/builds/11538

Copy link
Contributor

@jozala jozala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@breskeby breskeby merged commit 94454a3 into elastic:main Dec 11, 2025
33 of 36 checks passed
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
9.1 Commit could not be cherrypicked due to conflicts
9.2 Commit could not be cherrypicked due to conflicts
8.19 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 139200

szybia added a commit to szybia/elasticsearch that referenced this pull request Dec 11, 2025
* upstream/main: (79 commits)
  Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/140_pre_filter_search_shards/prefilter on non-indexed date fields} elastic#139381
  Adjust error bounds for bfloat16 value checks (elastic#139371)
  Unmute some vector CSS tests (elastic#139370)
  Do not allow `project_routing` as a query param (elastic#139206)
  Unmute HalfFloat...Tests#testSynthesizeArrayRandom (elastic#139341)
  Remove leniency in LinkedProjectConfig builder methods (elastic#139012)
  EQL: fix project_routing (elastic#139366)
  Add patch version for 9.2 index version constant (elastic#139362)
  Mute org.elasticsearch.test.rest.yaml.RcsCcsCommonYamlTestSuiteIT test {p0=search.vectors/200_dense_vector_docvalue_fields/dense_vector docvalues with bfloat16} elastic#139368
  ES|QL: Enable CCS tests for FORK (elastic#139302)
  Restructuring the semantic_text field type page  (elastic#138571)
  AggregateMetricDouble fields should not build BKD indexes (elastic#138724)
  Feature/count by trunc with filter (elastic#138765)
  ESQL: Convert TS 500 error to 400 (elastic#139360)
  [CI] Rerun failing tests for periodic build pipelines (elastic#139200)
  revert muting saml test (elastic#139327)
  Add TDigest histogram as metric (elastic#139247)
  Links solved bugs to class cast exception changelog and unmutes errors (elastic#139340)
  Ensure integer sorts are rewritten to long sorts for BWC indexes (elastic#139293)
  Integrate stored fields format bloom filter with synthetic _id (elastic#138515)
  ...
breskeby added a commit to breskeby/elasticsearch that referenced this pull request Dec 12, 2025
* Rerun only tests failed in previous build job iteration
* Rerun more jobs in period builds
* Update periodic platform tests
* Run platform periodic tests with --continue
* Store build-scan-id directly in Buildkite metadata from Gradle.

(cherry picked from commit 94454a3)

# Conflicts:
#	.buildkite/pipelines/periodic.yml
breskeby added a commit to breskeby/elasticsearch that referenced this pull request Dec 12, 2025
* Rerun only tests failed in previous build job iteration
* Rerun more jobs in period builds
* Update periodic platform tests
* Run platform periodic tests with --continue
* Store build-scan-id directly in Buildkite metadata from Gradle.

(cherry picked from commit 94454a3)

# Conflicts:
#	.buildkite/pipelines/periodic.yml
breskeby added a commit to breskeby/elasticsearch that referenced this pull request Dec 12, 2025
* Rerun only tests failed in previous build job iteration
* Rerun more jobs in period builds
* Update periodic platform tests
* Run platform periodic tests with --continue
* Store build-scan-id directly in Buildkite metadata from Gradle.

(cherry picked from commit 94454a3)

# Conflicts:
#	.buildkite/hooks/pre-command
#	.buildkite/pipelines/periodic-platform-support.yml
#	.buildkite/pipelines/periodic.yml
#	build-tools-internal/src/main/groovy/elasticsearch.build-scan.gradle
#	build-tools-internal/src/main/java/org/elasticsearch/gradle/internal/test/rerun/TestRerunTaskExtension.java
@breskeby
Copy link
Contributor Author

💚 All backports created successfully

Status Branch Result
9.2
9.1
8.19

Questions ?

Please refer to the Backport tool documentation

breskeby added a commit that referenced this pull request Dec 12, 2025
)

* Rerun only tests failed in previous build job iteration
* Rerun more jobs in period builds
* Update periodic platform tests
* Run platform periodic tests with --continue
* Store build-scan-id directly in Buildkite metadata from Gradle.

(cherry picked from commit 94454a3)

# Conflicts:
#	.buildkite/pipelines/periodic.yml
breskeby added a commit that referenced this pull request Dec 12, 2025
)

* Rerun only tests failed in previous build job iteration
* Rerun more jobs in period builds
* Update periodic platform tests
* Run platform periodic tests with --continue
* Store build-scan-id directly in Buildkite metadata from Gradle.

(cherry picked from commit 94454a3)

# Conflicts:
#	.buildkite/pipelines/periodic.yml
breskeby added a commit that referenced this pull request Dec 12, 2025
)

* Rerun only tests failed in previous build job iteration
* Rerun more jobs in period builds
* Update periodic platform tests
* Run platform periodic tests with --continue
* Store build-scan-id directly in Buildkite metadata from Gradle.

(cherry picked from commit 94454a3)

# Conflicts:
#	.buildkite/hooks/pre-command
#	.buildkite/pipelines/periodic-platform-support.yml
#	.buildkite/pipelines/periodic.yml
#	build-tools-internal/src/main/groovy/elasticsearch.build-scan.gradle
#	build-tools-internal/src/main/java/org/elasticsearch/gradle/internal/test/rerun/TestRerunTaskExtension.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged backport pending :Delivery/Build Build or test infrastructure >non-issue Team:Delivery Meta label for Delivery team v8.19.10 v9.1.10 v9.2.4 v9.3.0

3 participants