Retries and timeouts
Retries and timeouts are the two primary controls for handling failure on a Flyte task. Long-running tasks fail and stall in many ways — a transient network blip, a flaky third-party API, a slow scheduler, a runaway loop, a hung container. Retries decide whether to try again; timeouts decide how long to wait before giving up. Used together they keep your workflows reliable in the face of transient failures while preventing a single stuck attempt from burning resources indefinitely.
Both can be configured in the @env.task decorator or supplied per-call with override.
Neither can be set on the TaskEnvironment definition itself.
The action lifecycle
Every attempt of a task moves through a sequence of phases. Knowing the phases makes the timeout controls obvious, because each timeout bounds a specific stretch of this timeline:
| Phase | What’s happening |
|---|---|
| Queued | The action has been accepted and is waiting to be scheduled onto the cluster. |
| Waiting for resources | Scheduled, but waiting for compute (pods, GPUs, quota) to become available. |
| Initializing | Resources are in hand; the pod is starting (image pull, init containers, sidecars). |
| Running | Your code is actively executing. |
| Succeeded / Failed / Timed out / Aborted | Terminal phases. |
The diagram below shows one attempt, plus a second attempt after a retry, and which control governs each part of the timeline:
gantt
title Which control covers which part of the timeline
dateFormat X
axisFormat %Ss
section Attempt 1
Queued :q1, 0, 2
Waiting for resources :w1, 2, 5
Initializing :i1, 5, 6
Running (your code) :r1, 6, 11
section Per-attempt bounds
max_queued_time :crit, 0, 5
max_runtime :active, 6, 11
section Attempt 2 (after a retry + backoff)
Queued -> Running :q2, 14, 25
section Across all attempts
deadline :done, 0, 27
In short:
max_queued_timebounds the time spent waiting to run (Queued + Waiting for resources). It resets on every attempt.max_runtimebounds the time spent running your code. It resets on every attempt.deadlinebounds the total wall-clock from the first time the action was queued until it reaches a terminal phase — across every attempt, user retries and platform retries alike.
(The brief Initializing phase is charged to neither per-attempt bound.)
Retries
A retry is a fresh attempt at executing a failed action. Each retry runs in a brand-new pod, so nothing from the failed attempt — local files, in-memory state — carries over.
The code for the retry examples below can be found on GitHub.
First we import the required modules and set up a task environment:
from datetime import timedelta
import flyte
import flyte.errors
env = flyte.TaskEnvironment(name="retries", resources=flyte.Resources(cpu=1, memory="250Mi"))
Retry count
The simplest form passes an integer. A “retry” is any attempt after the first, so retries=3
means up to 4 attempts in total (1 original + 3 retries):
@env.task(retries=3)
async def call_service() -> str:
# retries=3 -> up to 4 attempts (1 original + 3 retries).
# Each retry runs in a fresh pod, so nothing from the failed attempt carries over.
return await fetch_from_flaky_upstream()
This is the right default for genuinely transient failures — a dropped connection, a brief
503 from a dependency — where simply trying again is likely to succeed.
Retries with exponential backoff
Retrying immediately is exactly the wrong thing to do against a downstream that is already
struggling: back-to-back retries pile load onto a service that needs room to recover.
A flyte.RetryStrategy with a flyte.Backoff policy inserts a growing delay between attempts
so a recovering dependency gets breathing room:
@env.task(
retries=flyte.RetryStrategy(
count=5,
backoff=flyte.Backoff(
base=timedelta(seconds=10), # first retry waits 10s
factor=2.0, # then 20s, 40s, 80s, ...
cap=timedelta(minutes=5), # never wait longer than 5m between retries
),
),
)
async def call_flaky_api() -> str:
# The delay before the n-th retry (0-indexed) is min(base * factor**n, cap).
# Backoff gives a recovering downstream room to breathe instead of hammering it.
return await fetch_from_flaky_upstream()
The delay before the n-th retry (0-indexed) is min(base * factor**n, cap). With the values
above the delays are 10s, 20s, 40s, 80s, then capped at 5m. The cap is what keeps an
aggressive factor from growing into hours; it is required whenever factor > 1.
Skip retries for failures that can’t be fixed
Some failures will never succeed no matter how many times you try — an invalid input, a
malformed config, a permission that was never granted. Retrying them just wastes the budget
(and the wall-clock) before the inevitable failure. Raise
flyte.errors.NonRecoverableError to signal that a failure is terminal: the action fails on
the spot, on attempt #1, with no retries consumed even when retries is set:
@env.task(retries=3)
async def validate_and_process(x: int) -> str:
if x < 0:
# A negative input will never succeed, so don't waste the retry budget on it.
# NonRecoverableError fails the action on attempt #1 — no retries are consumed.
raise flyte.errors.NonRecoverableError(
f"Input x={x} is negative — retrying will not help."
)
return f"processed({x})"
Finally, configure Flyte and run:
if __name__ == "__main__":
flyte.init_from_config()
run = flyte.run(validate_and_process, x=-5)
print(run.name)
print(run.url)
run.wait()
System retries
retries=N covers failures of your task — exceptions, non-zero exits, timeouts you’ve configured.
Failures caused by the underlying infrastructure (a node disappearing, ephemeral storage running out, a spot instance getting preempted, and so on) are handled separately. The platform retries these on its own and they do not consume your retries=N budget. They do, however, count against the deadline (see below), because the deadline is an absolute bound on total wall-clock regardless of who triggered the retry.
The platform does eventually give up, but only after many retries — a persistent infrastructure problem can churn for a while before the run is terminated. If you spot a task repeatedly failing on the same infrastructure error, abort it manually rather than waiting for the system to give up on its own. You don’t configure the system retry budget from Python; it’s a platform-level concern.
For workloads on spot/preemptible compute, see also Spot to on-demand fallback, which describes how interruptible tasks transition to on-demand on their final attempt.
Timeouts
A timeout is a wall-clock bound that, when exceeded, terminates the action with phase
Timed out. Without one, a stuck attempt has no natural end: a task waiting on a hung socket
or starved for a GPU that never frees up can sit there for hours. The timeout parameter takes
a flyte.Timeout value carrying any combination of three independent bounds, each optional — an
unspecified bound is unlimited.
The code for the timeout examples below can be found on GitHub.
First, the imports and environment:
import asyncio
from datetime import timedelta
import flyte
from flyte import Timeout
env = flyte.TaskEnvironment(name="timeouts", resources=flyte.Resources(cpu=1, memory="250Mi"))
max_runtime — bound a single attempt’s execution
max_runtime caps the time an attempt spends in the Running phase. It’s the answer to
“how long should one run of this code take?” — use it to reap a hung container so a retry can
take over instead of letting a wedged process run forever. It is per-attempt and resets on
each retry.
@env.task(timeout=Timeout(max_runtime=timedelta(minutes=30)))
async def train_model() -> str:
# max_runtime bounds the RUNNING phase of a single attempt. If the task is
# still running after 30 minutes, this attempt is reaped as TIMED_OUT.
# The budget is per-attempt: it resets fresh on every retry.
...
return "model trained"
max_queued_time — fail fast when capacity isn’t available
max_queued_time caps the time an attempt spends waiting to run — the Queued and
Waiting for resources phases — before execution begins. It answers “how long am I willing
to wait for this to even start?” When a task asks for a scarce resource (a specific GPU, a
large node) that the cluster can’t currently supply, this bound makes it fail fast instead of
stalling indefinitely. It is per-attempt and resets on each retry.
@env.task(timeout=Timeout(max_queued_time=timedelta(minutes=15)))
async def needs_scarce_gpu() -> str:
# max_queued_time bounds the time spent waiting to run (QUEUED +
# WAITING_FOR_RESOURCES). If the cluster can't find capacity within 15
# minutes, fail fast instead of stalling indefinitely. Per-attempt.
...
return "done"
deadline — bound the total wall-clock
deadline is the strongest of the three: an absolute budget on total wall-clock, measured from
the first time the action was queued to the moment it reaches a terminal phase — across all
attempts, including platform-driven system retries. It answers “what is the total time budget
for this work, no matter what?” Use it when a downstream consumer needs a definite outcome by a
certain time: with retries=5 and max_runtime=1h alone, an action could legally consume six
hours plus queue time before giving up. A deadline puts a hard ceiling on that.
@env.task(timeout=Timeout(deadline=timedelta(hours=2)))
async def must_finish_by() -> str:
# deadline is an absolute wall-clock budget across ALL attempts, measured
# from the first time the action was enqueued. Once 2 hours elapse, the
# action is reaped no matter which phase it is in or how many retries remain.
...
return "done"
When the deadline fires mid-attempt, the action terminates immediately as Timed out,
regardless of remaining retry budget or per-attempt timer state.
Combining the bounds
The three bounds are orthogonal and can be set together. A common shape is a per-attempt runtime cap, a queue-wait cap, and an absolute ceiling on the whole thing:
@env.task(
timeout=Timeout(
max_runtime=timedelta(minutes=30), # per attempt, RUNNING only
max_queued_time=timedelta(minutes=15), # per attempt, waiting to run
deadline=timedelta(hours=2), # absolute, across all attempts
),
)
async def fully_bounded() -> str:
...
return "done"
For backward compatibility, a bare int (seconds) or timedelta passed to timeout is
interpreted as max_runtime:
# A bare int (seconds) or timedelta is shorthand for Timeout(max_runtime=...).
@env.task(timeout=timedelta(minutes=30))
async def runtime_only() -> str:
...
return "done"
Combining retries and timeouts
Retries and timeouts compose, and the per-attempt versus absolute distinction is what makes the
combination expressive. Because max_runtime and max_queued_time are per-attempt, they retry
normally — each timed-out attempt counts as a failure and the next attempt gets a fresh budget.
Because deadline is absolute, it overrides retries entirely: once it fires, no further
attempts run.
| Timeout | With retries set |
Without retries |
|---|---|---|
max_runtime |
Each attempt is reaped at the budget and retried until retries are exhausted. | First timeout is final. |
max_queued_time |
Each attempt is reaped pre-Running and retried until retries are exhausted. | First timeout is final. |
deadline |
Retries continue until the budget is exhausted or the deadline fires — whichever comes first. | Action terminates at the deadline. |
The most useful pattern combines backoff-paced retries with an absolute deadline: keep retrying a flaky dependency, but never spend more than a fixed total budget on it.
@env.task(
retries=flyte.RetryStrategy(
count=5,
backoff=flyte.Backoff(base=timedelta(seconds=30), factor=2.0, cap=timedelta(minutes=5)),
),
timeout=Timeout(
max_runtime=timedelta(minutes=10), # cap any single attempt
deadline=timedelta(hours=1), # but never spend more than 1h total
),
)
async def resilient_work() -> str:
# Retries continue until either the retry budget is exhausted OR the 1h
# deadline fires — whichever comes first. The deadline wins ties.
...
return "done"
Finally, configure Flyte and run:
if __name__ == "__main__":
flyte.init_from_config()
run = flyte.run(fully_bounded)
print(run.name)
print(run.url)
run.wait()
Together, these controls let your workflows absorb transient failures gracefully while guaranteeing that broken or starved work is reaped on a schedule you choose rather than left to run unbounded.