I have a multi-stage YAML pipeline, and sometimes a stage fails due to transient issues (e.g., network timeouts or flaky tests). Instead of manually re-running the stage, I want to configure an automatic retry mechanism for failed stages in the same build.
Approach tried: I attempted to implement a monitoring stage that tracks and retries failed stages dynamically. This stage consists of two tasks:
- Check the status of each stage using an API call and store the results in a variable.
- Rerun failed stages based on the stored statuses.
However, when calling the API to get stage statuses, I encountered the following issue:
"Stage 'Build' is still running or undetermined. Skipping..."
Additionally, I have tried using the YAML property retryCountOnTaskFailure, but it is not a viable solution in our case. Our pipeline provisions a Kubernetes cluster, onboards it to Azure Arc, and runs Sonobuoy tests for various extension plugins.
Using retryCountOnTaskFailure causes resource leaks because retries leave many resources undeleted, leading to inconsistencies.
Questions:
- How can I reliably fetch the status of completed or failed stages in a YAML pipeline?
- What is the best way to dynamically rerun only the failed stages while ensuring proper resource cleanup?
Asked Mar 28 at 11:49 by user30090146; edited Mar 31 at 10:08 by Mark Rotteveel.
- Can you share more details on how you're obtaining the status of the stages? – bryanbcook Commented Mar 28 at 15:34
- Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. – Community Bot Commented Mar 31 at 19:00
1 Answer
Currently, Azure DevOps does not provide an out-of-the-box feature to automatically retry failed stages in a YAML pipeline. If this is a critical need for your workflow, you can submit a feature request to the Azure DevOps product group.
The workaround as of now is to use a separate monitoring pipeline. Since the REST API to rerun stages requires the build to be completed, we cannot rerun failed stages within the same pipeline execution. Instead, we can use another pipeline to monitor failed build events and trigger reruns. Here’s how:
Create a Service Hook
- Configure an Azure DevOps Service Hook to send a payload when a build fails.
- Use a WebHook URL like: https://dev.azure.com/<YourADOOrgName>/_apis/public/distributedtask/webhooks/<WebHookName>?api-version=6.0-preview
Set up an Incoming WebHook Service Connection
- Use the <WebHookName> in the URL to create an Incoming WebHook service connection.
- Use the
Create a YAML Pipeline to Listen for Failures
- This pipeline is triggered by the WebHook, extracts the failed build URL from the payload, fetches the failed stages, and reruns them via the REST API.
trigger: none

resources:
  webhooks:
    - webhook: adowebhookrerun # WebHook name
      connection: WebHookSvcCnnRerun # Incoming WebHook service connection name

variables:
  payload: ${{convertToJson(parameters.adowebhookrerun)}}
  failedBuildURL: ${{parameters.adowebhookrerun.resource.url}}

pool:
  vmImage: windows-latest

steps:
- pwsh: |
    Write-Host "================ Failed build payload: ================"
    Write-Host '$(payload)'
    Write-Host "================ Failed build URL: ================"
    Write-Host '$(failedBuildURL)'

    # Set $(System.AccessToken) of pipeline service account for API authentication
    $headers = @{
      'Authorization' = 'Bearer ' + '$(System.AccessToken)'
      'Content-Type' = 'application/json'
    }

    $timeline = Invoke-RestMethod -Uri "$(failedBuildURL)/timeline?api-version=7.1" -Headers $headers -Method Get
    $failedStages = $timeline.records | Where-Object { $_.result -eq "failed" -and $_.type -eq "Stage" }

    foreach ($stage in $failedStages) {
      $body = @{
        "state" = "retry"
        "forceRetryAllJobs" = $true
        "retryDependencies" = $true
      } | ConvertTo-Json -Depth 10
      Write-Host "================ Rerun failed stage: $($stage.name) ($($stage.identifier)) ================"
      Invoke-RestMethod -Method Patch -Uri "$(failedBuildURL)/stages/$($stage.identifier)?api-version=7.1" -Headers $headers -Body $body
    }
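One thing the pipeline above does not guard against: if a retried stage fails again, the build fails again and the Service Hook can fire once more, so reruns could repeat without limit. Here is a minimal sketch of a cap. It assumes each stage record in the timeline response carries an `attempt` counter that increases on retry. `maxAttempts` is a hypothetical variable that is not part of the original answer, and `failedBuildURL` is the variable defined in the pipeline above.

```yaml
variables:
  # Hypothetical cap on attempts per stage; add it alongside the
  # existing payload and failedBuildURL variables.
  maxAttempts: 3

steps:
- pwsh: |
    $headers = @{
      'Authorization' = 'Bearer ' + '$(System.AccessToken)'
      'Content-Type' = 'application/json'
    }
    $timeline = Invoke-RestMethod -Uri "$(failedBuildURL)/timeline?api-version=7.1" -Headers $headers -Method Get

    # Keep only failed stages that have not yet reached the cap.
    # Assumption: the record's 'attempt' field starts at 1 and increments per retry.
    $retryable = $timeline.records | Where-Object {
      $_.type -eq "Stage" -and $_.result -eq "failed" -and $_.attempt -lt $(maxAttempts)
    }

    # Same retry request body as in the pipeline above.
    $body = @{
      "state" = "retry"
      "forceRetryAllJobs" = $true
      "retryDependencies" = $true
    } | ConvertTo-Json -Depth 10

    foreach ($stage in $retryable) {
      Write-Host "Stage $($stage.name) failed on attempt $($stage.attempt); retrying"
      Invoke-RestMethod -Method Patch -Uri "$(failedBuildURL)/stages/$($stage.identifier)?api-version=7.1" -Headers $headers -Body $body
    }
```

Stages that have already reached the cap are skipped, so a persistently failing stage stops being retried and can be investigated manually.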
Since the script authenticates with $(System.AccessToken), ensure that either or both of the following pipeline service accounts have permission to Queue builds:
- Project Collection Build Service (<YourADOOrgName>)
- <TheProjectName> Build Service (<YourADOOrgName>)
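For the cleanup part of the question, one option is to make each stage responsible for its own teardown, so that a rerun with forceRetryAllJobs starts from a clean state. Here is a minimal sketch. The stage name, job names, and scripts are hypothetical placeholders. `condition: always()` makes the cleanup job run even when the job it depends on has failed.

```yaml
stages:
- stage: ProvisionAndTest # hypothetical stage name
  jobs:
  - job: Test
    steps:
    # Placeholder for the real provisioning, Azure Arc onboarding, and Sonobuoy steps
    - script: echo "provision cluster, onboard to Azure Arc, run Sonobuoy tests"

  - job: Cleanup
    dependsOn: Test
    # Run whether Test succeeded, failed, or was canceled
    condition: always()
    steps:
    # Placeholder for the real teardown logic
    - script: echo "delete the cluster and related resources"
```

With this layout, the stage result still reflects the failed Test job, so the monitoring pipeline can detect the failure, while the resources from the failed attempt are removed before any retry.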