In AWS, the terms interruptible instance and spot instance are used. In GCP, the equivalent term is preemptible instance. Here we use the term interruptible instance generically for both providers.
An interruptible instance is a machine instance that your cloud provider makes available to your cluster but does not guarantee to keep available. As a result, interruptible instances are cheaper than regular instances. To use an interruptible instance for a compute workload, you must be prepared for an attempt to run the workload to fail due to a lack of available resources, in which case the attempt will need to be retried.
When onboarding your organization onto Union Cloud, you specify the configuration of your cluster. Among the options is the choice of whether to use interruptible instances.
For each interruptible node group that you specify, an additional on-demand node group, identical to the interruptible one in every other respect, is also configured. This on-demand node group is used as a fallback when attempts to complete a task on interruptible instances have failed.
Configuring tasks to use interruptible instances
To schedule a task on interruptible instances and retry it if it fails, set the `interruptible` and `retries` parameters in the `@task` decorator, for example `@task(interruptible=True, retries=3)`. These parameters behave as follows:
- A task will only be scheduled on an interruptible instance if it has the parameter `interruptible=True`.
- An interruptible task, like any other task, can have a `retries` parameter.
- If an interruptible task does not have an explicitly set `retries` parameter, then the `retries` value defaults to
- An interruptible task with `retries=n` will be attempted `n-1` times on an interruptible instance. If it still fails after `n-1` attempts, the final retry will be done on the fallback on-demand instance.
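The retry rule in the list above can be modelled as a plain function that returns, for a given `retries` value, the kind of instance each attempt runs on in the worst case, where every interruptible attempt is interrupted. This is a hypothetical sketch of the documented behaviour, not Union or Flytekit code; `attempt_plan` and its constants are illustrative names.

```python
from typing import List

# Hypothetical labels for the two kinds of node group described above.
INTERRUPTIBLE = "interruptible"
ON_DEMAND = "on-demand"


def attempt_plan(retries: int) -> List[str]:
    """Return the instance kind used by each attempt of an interruptible task.

    Models the documented rule for the worst case (every interruptible
    attempt is interrupted): with retries=n, the first n-1 attempts run on
    interruptible instances and the final attempt runs on the fallback
    on-demand node group.
    """
    if retries < 1:
        # The documented rule is stated in terms of retries=n with n >= 1.
        raise ValueError("retries must be at least 1 for this model")
    return [INTERRUPTIBLE] * (retries - 1) + [ON_DEMAND]
```

For instance, `attempt_plan(3)` returns `['interruptible', 'interruptible', 'on-demand']`. In practice the sequence stops early as soon as an attempt succeeds.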
Advantages and disadvantages of interruptible instances
The advantage of using an interruptible instance for a task is simply that it is less costly than using an on-demand instance (all other parameters being equal). However, there are two main disadvantages:
- The task is successfully scheduled on an interruptible instance but is then interrupted. In the worst case, with `retries=n`, the task may be interrupted `n-1` times before the fallback on-demand instance is finally used. This can be a problem for time-critical tasks.
- Interruptible instances of the selected node type may simply be unavailable on the initial scheduling attempt. When this happens, the task may hang indefinitely until an interruptible instance becomes available. Note that this failure mode is distinct from the previous one, in which the task is successfully scheduled on an interruptible node but is then interrupted.
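To gauge the first disadvantage for a time-critical task, the worst case can be estimated with simple arithmetic: each of the `n-1` interrupted attempts wastes the time it ran before being interrupted, and the final on-demand attempt then runs to completion. The sketch below assumes a fixed amount of lost time per interruption and ignores scheduling delays, including the indefinite wait described in the second disadvantage; `worst_case_minutes` is a hypothetical helper.

```python
def worst_case_minutes(
    retries: int, run_minutes: float, lost_minutes_per_interruption: float
) -> float:
    """Estimate worst-case wall-clock time for an interruptible task.

    Assumptions (hypothetical): with retries=n, each of the n-1 interruptible
    attempts runs for lost_minutes_per_interruption before being interrupted,
    and the final on-demand attempt completes in run_minutes. Time spent
    waiting for capacity to be scheduled is not modelled.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1 for this model")
    interrupted_attempts = retries - 1
    return interrupted_attempts * lost_minutes_per_interruption + run_minutes
```

For example, a 60-minute task with `retries=3` that is interrupted 50 minutes into each interruptible attempt takes 2 × 50 + 60 = 160 minutes in the worst case, compared with 60 minutes on an on-demand instance.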
In general, Union recommends that you use interruptible instances whenever they are available, but only for tasks that are not time-critical.