[CI] Rerun failing tests for periodic build pipelines#139200
[CI] Rerun failing tests for periodic build pipelines#139200breskeby merged 17 commits intoelastic:mainfrom
Conversation
|
Pinging @elastic/es-delivery (Team:Delivery) |
58d2ddd to
b7a21b9
Compare
| @@ -123,7 +123,8 @@ develocity { | |||
|
|
|||
| // Add a build annotation | |||
There was a problem hiding this comment.
This updates the gradle build scan annotation in buildkite to also show the retry count to allow differentiating build scans of original and retry jobs
| EOF | ||
| fi | ||
|
|
||
| if [[ "${SMART_RETRIES:-}" == "true" && "${BUILDKITE_RETRY_COUNT:-0}" -gt 0 ]]; then |
There was a problem hiding this comment.
I just hate windows
jozala
left a comment
There was a problem hiding this comment.
I can't wait to have this running on CI.
Some minor comments left.
I'd like to discuss two things in pre-command about the --compressed in curl command and BUILD_SCAN_ID` validation.
| EOF | ||
| fi | ||
|
|
||
| if [[ "${SMART_RETRIES:-}" == "true" && "${BUILDKITE_RETRY_COUNT:-0}" -gt 0 ]]; then |
.buildkite/hooks/pre-command
Outdated
| if BUILD_JSON=$(curl --max-time 30 -H "Authorization: Bearer $BUILDKITE_API_TOKEN" -X GET "https://api.buildkite.com/v2/organizations/elastic/pipelines/${BUILDKITE_PIPELINE_SLUG}/builds/${BUILDKITE_BUILD_NUMBER}?include_retried_jobs=true" 2>/dev/null); then | ||
| if ORIGIN_JOB_ID=$(printf '%s\n' "$BUILD_JSON" | jq -r --arg jobId "$BUILDKITE_JOB_ID" ' .jobs[] | select(.id == $jobId) | .retry_source.job_id' 2>/dev/null) && [ "$ORIGIN_JOB_ID" != "null" ] && [ -n "$ORIGIN_JOB_ID" ]; then | ||
| if BUILD_SCAN_URL=$(printf '%s\n' "$BUILD_JSON" | jq -r --arg job_id "$ORIGIN_JOB_ID" '.meta_data["build-scan-" + $job_id]' 2>/dev/null) && [ "$BUILD_SCAN_URL" != "null" ] && [ -n "$BUILD_SCAN_URL" ]; then | ||
| BUILD_SCAN_ID=$(echo "$BUILD_SCAN_URL" | sed 's|.*/s/||') |
There was a problem hiding this comment.
I think we should start writing some scripts in other language than Bash. That's just a general thought. Nothing to change here I think. I'm just curious about your thoughts after writing this.
| if BUILD_SCAN_URL=$(printf '%s\n' "$BUILD_JSON" | jq -r --arg job_id "$ORIGIN_JOB_ID" '.meta_data["build-scan-" + $job_id]' 2>/dev/null) && [ "$BUILD_SCAN_URL" != "null" ] && [ -n "$BUILD_SCAN_URL" ]; then | ||
| BUILD_SCAN_ID=$(echo "$BUILD_SCAN_URL" | sed 's|.*/s/||') | ||
|
|
||
| # Validate BUILD_SCAN_ID format to prevent injection attacks |
There was a problem hiding this comment.
What kind of injection attacks are possible here?
As far as I can see we are reading the BUILD_SCAN_ID from BUILD_JSON which we control here. Could we get rid of that to simplify the logic or sanitize this early?
There was a problem hiding this comment.
I reworked this whole thing to directly pass the buildscanId via bk meta-data. that makes this brittle IMO
.buildkite/hooks/pre-command
Outdated
| if curl --request GET \ | ||
| --url "$DEVELOCITY_FAILED_TEST_API_URL" \ | ||
| --max-filesize 10485760 \ | ||
| --max-time 30 \ | ||
| --header 'accept: application/json' \ | ||
| --header "authorization: Bearer $DEVELOCITY_API_ACCESS_KEY" \ | ||
| --header 'content-type: application/json' 2>/dev/null | gunzip | jq '.' &> .failed-test-history.json; then |
There was a problem hiding this comment.
It should be more resilient if used --compressed in the curl command instead of | gunzip. This way cURL should take care of the compression method in the case it changes.
There was a problem hiding this comment.
yeah indeed. I do already for windows 🤦 . fixed
| # Create Buildkite annotation for visibility | ||
| # Use unique context per job to support multiple retries | ||
| cat << EOF | buildkite-agent annotate --style info --context "smart-retry-$BUILDKITE_JOB_ID" | ||
| Rerunning failed build job [$ORIGIN_JOB_NAME]($BUILD_SCAN_URL) | ||
|
|
||
| **Gradle Tasks with Failures:** $FILTERED_WORK_UNITS | ||
|
|
||
| This retry will skip test tasks that had no failures in the previous run. | ||
| EOF |
There was a problem hiding this comment.
I really like this. It makes easy to understand what happened in the build.
|
|
||
| REM Smart retries implementation | ||
| if "%SMART_RETRIES%"=="true" ( | ||
| if defined BUILDKITE_RETRY_COUNT ( | ||
| if %BUILDKITE_RETRY_COUNT% GTR 0 ( | ||
| echo --- Resolving previously failed tests | ||
| set SMART_RETRY_STATUS=disabled | ||
| set SMART_RETRY_DETAILS= | ||
|
|
||
| REM Fetch build information from Buildkite API |
There was a problem hiding this comment.
Ohhh... That's another reason to write these scripts in another language - probably something as OS-agnostic as it can be.
That's just another general note. Nothing to change here.
...t/groovy/org/elasticsearch/gradle/internal/test/rerun/InternalTestRerunPluginFuncTest.groovy
Show resolved
Hide resolved
| public class FailedTestsReport { | ||
| private List<WorkUnit> workUnits; | ||
|
|
||
| public List<WorkUnit> getWorkUnits() { | ||
| return workUnits != null ? workUnits : java.util.Collections.emptyList(); | ||
| } | ||
|
|
||
| public void setWorkUnits(List<WorkUnit> workUnits) { | ||
| this.workUnits = workUnits; | ||
| } | ||
| } |
There was a problem hiding this comment.
NIT: We could probably use records for the model classes, but may not be worth to change it now. I'll let you to decide if it's worth.
| File createFailedTest(String clazzName) { | ||
| createTest(clazzName, testMethodContent(false, true, 1)) | ||
| } |
bba2995 to
f0f3944
Compare
Fix buildkite annotations Respect retries in buildkite build UIs Run platform periodic tests with --continue Add clever retries for windows Remove dockeravailability
- Store build-scan-id directly in Buildkite metadata from Gradle. - Update pre-command hooks (Bash and Batch) to prefer reading build-scan-id from metadata. - Remove logic to extract build scan ID from URL in pre-command hooks.
avoid relying on gunzip and reuse --compressed in curl call
- remove unused testfixture method - use records where possible
e123c79 to
ebe481c
Compare
|
I triggered a new test run after apply review feedback / minor refactorings https://buildkite.com/elastic/elasticsearch-periodic-platform-support/builds/11538 |
💔 Backport failed
You can use sqren/backport to manually backport by running |
* upstream/main: (79 commits)
Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/140_pre_filter_search_shards/prefilter on non-indexed date fields} elastic#139381
Adjust error bounds for bfloat16 value checks (elastic#139371)
Unmute some vector CSS tests (elastic#139370)
Do not allow `project_routing` as a query param (elastic#139206)
Unmute HalfFloat...Tests#testSynthesizeArrayRandom (elastic#139341)
Remove leniency in LinkedProjectConfig builder methods (elastic#139012)
EQL: fix project_routing (elastic#139366)
Add patch version for 9.2 index version constant (elastic#139362)
Mute org.elasticsearch.test.rest.yaml.RcsCcsCommonYamlTestSuiteIT test {p0=search.vectors/200_dense_vector_docvalue_fields/dense_vector docvalues with bfloat16} elastic#139368
ES|QL: Enable CCS tests for FORK (elastic#139302)
Restructuring the semantic_text field type page (elastic#138571)
AggregateMetricDouble fields should not build BKD indexes (elastic#138724)
Feature/count by trunc with filter (elastic#138765)
ESQL: Convert TS 500 error to 400 (elastic#139360)
[CI] Rerun failing tests for periodic build pipelines (elastic#139200)
revert muting saml test (elastic#139327)
Add TDigest histogram as metric (elastic#139247)
Links solved bugs to class cast exception changelog and unmutes errors (elastic#139340)
Ensure integer sorts are rewritten to long sorts for BWC indexes (elastic#139293)
Integrate stored fields format bloom filter with synthetic _id (elastic#138515)
...
* Rerun only tests failed in previous build job iteration * Rerun more jobs in period builds * Update periodic platform tests * Run platform periodic tests with --continue * Store build-scan-id directly in Buildkite metadata from Gradle. (cherry picked from commit 94454a3) # Conflicts: # .buildkite/pipelines/periodic.yml
* Rerun only tests failed in previous build job iteration * Rerun more jobs in period builds * Update periodic platform tests * Run platform periodic tests with --continue * Store build-scan-id directly in Buildkite metadata from Gradle. (cherry picked from commit 94454a3) # Conflicts: # .buildkite/pipelines/periodic.yml
* Rerun only tests failed in previous build job iteration * Rerun more jobs in period builds * Update periodic platform tests * Run platform periodic tests with --continue * Store build-scan-id directly in Buildkite metadata from Gradle. (cherry picked from commit 94454a3) # Conflicts: # .buildkite/hooks/pre-command # .buildkite/pipelines/periodic-platform-support.yml # .buildkite/pipelines/periodic.yml # build-tools-internal/src/main/groovy/elasticsearch.build-scan.gradle # build-tools-internal/src/main/java/org/elasticsearch/gradle/internal/test/rerun/TestRerunTaskExtension.java
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation |
) * Rerun only tests failed in previous build job iteration * Rerun more jobs in period builds * Update periodic platform tests * Run platform periodic tests with --continue * Store build-scan-id directly in Buildkite metadata from Gradle. (cherry picked from commit 94454a3) # Conflicts: # .buildkite/pipelines/periodic.yml
) * Rerun only tests failed in previous build job iteration * Rerun more jobs in period builds * Update periodic platform tests * Run platform periodic tests with --continue * Store build-scan-id directly in Buildkite metadata from Gradle. (cherry picked from commit 94454a3) # Conflicts: # .buildkite/pipelines/periodic.yml
) * Rerun only tests failed in previous build job iteration * Rerun more jobs in period builds * Update periodic platform tests * Run platform periodic tests with --continue * Store build-scan-id directly in Buildkite metadata from Gradle. (cherry picked from commit 94454a3) # Conflicts: # .buildkite/hooks/pre-command # .buildkite/pipelines/periodic-platform-support.yml # .buildkite/pipelines/periodic.yml # build-tools-internal/src/main/groovy/elasticsearch.build-scan.gradle # build-tools-internal/src/main/java/org/elasticsearch/gradle/internal/test/rerun/TestRerunTaskExtension.java
When a Buildkite job is retried, the system fetches failed test information from the Develocity API and reruns only the tests that failed in the previous attempt, significantly reducing retry time and CI resource usage.