How to tell whether my pipeline job failed because the worker VM was preempted
In many FinnGen pipeline jobs, such as Regenie GWAS, the tasks are computed on the cheapest VMs available in Google Cloud Platform (GCP). These are called SPOT VMs. The noteworthy characteristic of a SPOT VM is that it can be preempted: another Google Cloud user can reserve the computing capacity, taking it away from your job. Preemption can occur at any point during the job and terminates all processes on the SPOT instance. The task then fails with a preemption error and produces no output or results.
In the pipeline runtime environment, the user can configure how many times a task is retried on a SPOT VM. Removing the "preemptible" parameter means that the job runs on a Standard VM, which is roughly 5 times more expensive than a SPOT VM.
Cromwell with the Google Cloud Batch API back-end did not previously support automatically retrying a preempted task (one whose SPOT attempts all failed) on a Standard VM (status on 22.11.2024). Solved since 28.11.2024: there is now an automatic retry on a Standard VM.
You can see from the pipeline job metadata whether the job failed because of preemption. The metadata can be downloaded from the Pipelines GUI. See this link.
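Once the metadata file is downloaded, it can be scanned with a short script instead of read by hand. The following is a minimal sketch, not an official FinnGen tool. It assumes the file follows the usual Cromwell metadata layout (a top-level `calls` object in which each call has a list of attempts, each with an optional nested `failures` list) and that a preemption failure message contains the text "preempt". The exact wording of the message is an assumption and may differ between back-ends, so adjust the `keyword` argument to match what you see in your own metadata. The function names and the file name `metadata.json` are illustrative only.

```python
import json


def iter_failure_messages(failures):
    """Yield every message in a nested Cromwell-style failures list."""
    for failure in failures or []:
        message = failure.get("message")
        if message:
            yield message
        # Failures can be nested under "causedBy"; walk them recursively.
        yield from iter_failure_messages(failure.get("causedBy"))


def find_preempted_attempts(metadata, keyword="preempt"):
    """Return call attempts whose failure messages mention the keyword.

    The keyword is an assumption about the message text, not a fixed
    Cromwell constant.
    """
    hits = []
    for call_name, attempts in metadata.get("calls", {}).items():
        for attempt in attempts:
            for message in iter_failure_messages(attempt.get("failures")):
                if keyword in message.lower():
                    hits.append({
                        "call": call_name,
                        "shard": attempt.get("shardIndex"),
                        "attempt": attempt.get("attempt"),
                        "status": attempt.get("executionStatus"),
                        "message": message,
                    })
                    break  # one hit per attempt is enough
    return hits


if __name__ == "__main__":
    # "metadata.json" is a placeholder for the file downloaded from the GUI.
    with open("metadata.json") as handle:
        metadata = json.load(handle)
    for hit in find_preempted_attempts(metadata):
        print(hit["call"], hit["shard"], hit["attempt"], hit["status"])
        print("   ", hit["message"])
```

If the script prints nothing, either no attempt was preempted or the failure message uses different wording than the assumed keyword. In that case, inspect the `failures` entries of the failed call directly.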
The easiest way to fix a preemption error is to clone and rerun the pipeline job. This adds 3 more preemption attempts. Note that Cromwell uses call caching, so it does not repeat tasks that have already succeeded. A pipeline can fail even two consecutive times if there is high demand for computing capacity in GCP.