2.0.0b53
Elastic
Package: flyteplugins.pytorch
Elastic defines the configuration for running a PyTorch elastic job using torch.distributed.
```python
class Elastic(
    nnodes: typing.Union[int, str],
    nproc_per_node: int,
    rdzv_backend: typing.Literal['c10d', 'etcd', 'etcd-v2'],
    run_policy: typing.Optional[flyteplugins.pytorch.task.RunPolicy],
    monitor_interval: int,
    max_restarts: int,
    rdzv_configs: typing.Dict[str, typing.Any],
)
```

| Parameter | Type | Description |
|---|---|---|
| `nnodes` | `typing.Union[int, str]` | Number of nodes to use. Either a fixed int or a range string (e.g., `"2:4"` for elastic training). |
| `nproc_per_node` | `int` | Number of processes to launch per node. |
| `rdzv_backend` | `typing.Literal['c10d', 'etcd', 'etcd-v2']` | Rendezvous backend to use. Typically `"c10d"`. Defaults to `"c10d"`. |
| `run_policy` | `typing.Optional[flyteplugins.pytorch.task.RunPolicy]` | Run policy applied to the job execution. Defaults to `None`. |
| `monitor_interval` | `int` | Interval (in seconds) at which to monitor the job's state. Defaults to `3`. |
| `max_restarts` | `int` | Maximum number of worker group restarts before failing the job. Defaults to `3`. |
| `rdzv_configs` | `typing.Dict[str, typing.Any]` | Rendezvous configuration key-value pairs. Defaults to `{"timeout": 900, "join_timeout": 900}`. |
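To illustrate how these fields fit together, the sketch below mirrors the documented signature with a stand-in dataclass rather than importing `flyteplugins.pytorch` (so it runs anywhere); `ElasticSketch` and the `parse_nnodes` helper are hypothetical names used only to show how a fixed int versus a `"2:4"` range string for `nnodes` would be interpreted.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple, Union

# Hypothetical stand-in mirroring the documented Elastic fields
# (run_policy omitted); the real class lives in flyteplugins.pytorch.
@dataclass
class ElasticSketch:
    nnodes: Union[int, str] = 1
    nproc_per_node: int = 1
    rdzv_backend: str = "c10d"
    monitor_interval: int = 3
    max_restarts: int = 3
    rdzv_configs: Dict[str, Any] = field(
        default_factory=lambda: {"timeout": 900, "join_timeout": 900}
    )

def parse_nnodes(nnodes: Union[int, str]) -> Tuple[int, int]:
    """Hypothetical helper: read nnodes as a (min, max) node range.

    A fixed int means min == max; a "2:4" string means the job may
    scale between 2 and 4 nodes while training stays alive.
    """
    if isinstance(nnodes, int):
        return nnodes, nnodes
    lo, sep, hi = str(nnodes).partition(":")
    return (int(lo), int(hi)) if sep else (int(lo), int(lo))

cfg = ElasticSketch(nnodes="2:4", nproc_per_node=8)
print(parse_nnodes(cfg.nnodes))          # (2, 4): elastic node range
print(cfg.rdzv_configs["join_timeout"])  # 900: documented default
```

A fixed `nnodes=4` would yield the degenerate range `(4, 4)`, i.e., a non-elastic job with exactly four nodes.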