🎯 Spot Instance Strategy Guide

How to save 30-90% on GPU costs with preemptible instances

Average Savings: -73%
Lowest A100/hr (spot): $0.51
Typical Runtime: 6-24h
Termination Notice: 10s-5min

⚠️ Critical: Spot Instances Can Terminate Anytime

Spot instances are surplus capacity that can be reclaimed on short notice, anywhere from roughly 10 seconds to 5 minutes depending on the provider. Only use them for fault-tolerant workloads with checkpointing enabled.

📊 Spot Pricing Comparison (January 2024)

| Provider | GPU Model | On-Demand | Spot Price | Savings | Termination Notice | Availability |
| --- | --- | --- | --- | --- | --- | --- |
| AWS EC2 | A100 40GB | $3.22/hr | $0.97/hr | -70% | 2 minutes | 65% |
| Google Cloud | A100 40GB | $2.77/hr | $0.83/hr | -70% | 30 seconds | 80% |
| Azure | A100 80GB | $3.67/hr | $1.47/hr | -60% | 30 seconds | 70% |
| Lambda Labs | A100 40GB | $2.40/hr | $1.20/hr | -50% | 5 minutes | 85% |
| Vast.ai | A100 40GB | $1.79/hr | $0.51/hr | -71% | No guarantee | Variable |
| RunPod | A100 80GB | $1.89/hr | $0.76/hr | -60% | 10 seconds | 75% |
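
The table's savings are headline numbers; an interruption also throws away any work done since the last checkpoint. A rough way to estimate real savings follows, where the interruption rate and checkpoint interval are assumptions you should replace with your own measurements:

```python
def effective_spot_cost(spot_price, on_demand_price,
                        interruptions_per_day, checkpoint_interval_min):
    """Estimate effective $/useful-hour on spot.

    On average an interruption loses half a checkpoint interval of
    work, so that fraction of paid hours produces no progress.
    """
    lost_hours = interruptions_per_day * (checkpoint_interval_min / 60) / 2
    useful_fraction = (24 - lost_hours) / 24
    effective = spot_price / useful_fraction
    real_savings = 1 - effective / on_demand_price
    return effective, real_savings

# Example: the AWS A100 row above, assuming 2 interruptions/day
# and 30-minute checkpoints (both numbers are illustrative).
cost, savings = effective_spot_cost(0.97, 3.22, 2, 30)
print(f"${cost:.2f}/useful hr, {savings:.0%} real savings")
```

With frequent checkpoints the overhead is small; the advertised -70% barely drops. The penalty grows quickly if you checkpoint hourly or less.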

🚀 The 5-Step Spot Instance Strategy

Step 1: Enable Aggressive Checkpointing

Save model weights every 10-30 minutes. This is non-negotiable for spot instances.

# PyTorch Lightning automatic checkpointing
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath='./checkpoints',
    filename='model-{epoch:02d}-{val_loss:.2f}',
    save_top_k=3,
    monitor='val_loss',
    every_n_epochs=1,
    save_on_train_epoch_end=True,  # Critical for spot
    save_last=True,                # always keep last.ckpt for easy resume
    auto_insert_metric_name=False
)
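
Lightning writes these out as `.ckpt` files; after an interruption the job has to find the newest one to resume from. A minimal stdlib sketch, where the directory and filename pattern assume the callback config above:

```python
from pathlib import Path

def latest_checkpoint(ckpt_dir="./checkpoints", pattern="*.ckpt"):
    """Return the most recently modified checkpoint file, or None.

    Assumes checkpoints land in ckpt_dir as *.ckpt files, matching
    the ModelCheckpoint config above; adjust both if yours differ.
    """
    ckpts = sorted(Path(ckpt_dir).glob(pattern),
                   key=lambda p: p.stat().st_mtime)
    return ckpts[-1] if ckpts else None

# With Lightning (>=1.5) the path goes straight to fit():
# trainer.fit(model, ckpt_path=latest_checkpoint())
```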

Step 2: Implement Termination Handlers

Detect termination signals and save state immediately.

# AWS Spot Instance termination handler
import signal
import requests
import torch

def check_spot_termination():
    """Check AWS instance metadata for a spot termination notice.

    The endpoint returns 200 only once termination is scheduled,
    404 otherwise. Assumes IMDSv1 is enabled; with IMDSv2 enforced
    you must fetch a session token first.
    """
    try:
        r = requests.get(
            'http://169.254.169.254/latest/meta-data/spot/instance-action',
            timeout=1
        )
        return r.status_code == 200
    except requests.exceptions.RequestException:
        return False

def emergency_checkpoint(model, optimizer, epoch, path):
    """Emergency save on termination"""
    torch.save({
        'epoch': epoch,
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
    }, f"{path}/emergency_checkpoint.pt")
    print("🚨 Emergency checkpoint saved!")

# Register a SIGTERM handler. model, optimizer and epoch must be in
# scope where this runs (e.g. register it inside your training setup).
signal.signal(
    signal.SIGTERM,
    lambda s, f: emergency_checkpoint(model, optimizer, epoch, './checkpoints')
)

Step 3: Use Bid Strategies

Set maximum prices and use multiple regions for better availability.

# Terraform multi-region spot configuration
resource "aws_spot_instance_request" "gpu_spot" {
  count         = 3  # required for count.index below; one request per AZ
  ami           = "ami-gpu-ubuntu"
  instance_type = "p3.2xlarge"  # V100
  spot_price    = "1.00"  # Max bid price (omit to cap at the on-demand rate)

  # Spread across availability zones
  availability_zone = element(
    data.aws_availability_zones.available.names,
    count.index
  )

  tags = {
    Name = "GPU-Spot-${count.index}"
  }
}

Step 4: Implement Auto-Recovery

Automatically resume training when instances are terminated.

# Kubernetes Job with spot node selector
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
spec:
  backoffLimit: 100  # Retry on termination
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        node.kubernetes.io/lifecycle: spot
      tolerations:
      - key: "spot"
        operator: "Equal"
        value: "true"
        effect: NoSchedule
      containers:
      - name: trainer
        image: my-training:latest
        command: ["python", "train.py", "--resume-from-checkpoint"]
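
backoffLimit and restartPolicy only pay off if the checkpoints outlive the pod. A sketch of mounting persistent storage into the trainer, where the claim name and mount path are placeholders:

```yaml
# Checkpoints must live on storage that survives pod restarts.
# checkpoint-pvc and /mnt/checkpoints are illustrative names.
      containers:
      - name: trainer
        image: my-training:latest
        command: ["python", "train.py", "--resume-from-checkpoint"]
        volumeMounts:
        - name: checkpoints
          mountPath: /mnt/checkpoints
      volumes:
      - name: checkpoints
        persistentVolumeClaim:
          claimName: checkpoint-pvc
```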

Step 5: Mix Spot with On-Demand

Use hybrid strategy: critical components on-demand, workers on spot.

# Ray cluster with mixed instance types.
# Illustrative sketch only: the real Ray autoscaler takes a YAML file
# with provider/available_node_types sections, but the trade-off is the
# same: a small on-demand head node, spot workers that can churn.
cluster_config = {
    "head_node": {
        "instance_type": "g4dn.xlarge",  # on-demand: holds cluster state
        "spot": False
    },
    "worker_nodes": {
        "instance_type": "g4dn.12xlarge",  # spot: stateless workers
        "spot": True,
        "min_workers": 2,
        "max_workers": 10,
        "spot_price": 3.00  # max bid, $/hr
    }
}
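
To sanity-check what a hybrid cluster like this costs per run, a back-of-envelope helper (all prices are illustrative, not quotes):

```python
def hybrid_cluster_cost(head_od_price, worker_spot_price,
                        n_workers, hours):
    """Total cost of one on-demand head plus n spot workers.

    Plug in current rates for your region; the numbers below are
    placeholders for g4dn.xlarge / g4dn.12xlarge.
    """
    return (head_od_price + n_workers * worker_spot_price) * hours

# 1 on-demand head at $0.526/hr + 4 spot workers at $1.20/hr, for 24h:
print(f"${hybrid_cluster_cost(0.526, 1.20, 4, 24):.2f}")
```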

⚡ Best Practices by Workload Type

✅ PERFECT for Spot Instances:

- Training jobs with frequent checkpointing (a restart costs minutes, not hours)
- Batch inference and offline data processing
- Hyperparameter sweeps, where each trial is small and restartable

⚠️ USE WITH CAUTION:

- Jobs without checkpointing: any interruption loses all progress
- Latency-sensitive or interactive serving workloads
- Tightly coupled multi-node training, where one lost node stalls the whole group

🎯 Provider-Specific Strategies

AWS EC2 Spot

Strategy: Use Spot Fleet with diversified instance types
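
A diversified fleet request might look like the following sketch (the role ARN, AMI and instance types are placeholders). `capacityOptimized` tells AWS to favor the capacity pools least likely to be interrupted:

```json
{
  "AllocationStrategy": "capacityOptimized",
  "TargetCapacity": 4,
  "IamFleetRole": "arn:aws:iam::123456789012:role/fleet-role",
  "LaunchSpecifications": [
    { "InstanceType": "p3.2xlarge",   "ImageId": "ami-gpu-ubuntu" },
    { "InstanceType": "g5.2xlarge",   "ImageId": "ami-gpu-ubuntu" },
    { "InstanceType": "g4dn.2xlarge", "ImageId": "ami-gpu-ubuntu" }
  ]
}
```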

Google Cloud Preemptible

Strategy: Preemptible/Spot VMs are billed at a flat discounted rate (sustained use discounts don't apply to them), so focus on zone diversification instead

Azure Spot VMs

Strategy: Use eviction policies and max price caps

📈 Real-World Savings Examples

✅ Case Study: 70B LLM Fine-tuning

Setup: 8x A100 80GB spot instances on AWS

Strategy: Checkpoint every 500 steps, auto-resume on termination

🛠️ Monitoring & Automation Tools

Recommended Tools:

- aws-node-termination-handler: cordons and drains Kubernetes nodes when AWS issues a spot interruption notice
- AWS Spot Instance Advisor: historical interruption rates by instance type and region
- SkyPilot: launches managed spot jobs across clouds and recovers them automatically after preemption

🔄 Recovery Script Template

#!/bin/bash
# Auto-recovery script for spot instances

CHECKPOINT_DIR="/mnt/checkpoints"
TRAINING_SCRIPT="train.py"
MAX_RETRIES=50
RETRY_COUNT=0

while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
    echo "Starting training attempt $((RETRY_COUNT + 1))..."

    # Check for existing checkpoint
    if [ -f "$CHECKPOINT_DIR/latest.pt" ]; then
        echo "Resuming from checkpoint..."
        python $TRAINING_SCRIPT --resume "$CHECKPOINT_DIR/latest.pt"
    else
        echo "Starting fresh training..."
        python $TRAINING_SCRIPT --checkpoint-dir "$CHECKPOINT_DIR"
    fi

    EXIT_CODE=$?

    if [ $EXIT_CODE -eq 0 ]; then
        echo "Training completed successfully!"
        break
    elif [ $EXIT_CODE -eq 143 ]; then  # SIGTERM
        echo "Spot instance terminated. Waiting 30s before retry..."
        sleep 30
        RETRY_COUNT=$((RETRY_COUNT + 1))
    else
        echo "Training failed with code $EXIT_CODE"
        exit $EXIT_CODE
    fi
done

if [ $RETRY_COUNT -eq $MAX_RETRIES ]; then
    echo "Maximum retries reached. Exiting."
    exit 1
fi
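
The loop above assumes train.py understands --resume and --checkpoint-dir. A minimal argparse sketch of that contract (the flag names mirror the script; everything else is up to your trainer):

```python
import argparse

def parse_args(argv=None):
    """CLI matching the recovery script above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--resume", default=None,
                        help="checkpoint file to resume from")
    parser.add_argument("--checkpoint-dir", default="./checkpoints",
                        help="where new checkpoints are written")
    return parser.parse_args(argv)

# args = parse_args()
# if args.resume:
#     state = torch.load(args.resume)  # restore model/optimizer/epoch
```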

🎯 Golden Rule

Time Value Equation: If your time is worth more than the savings, use on-demand. If you can tolerate interruptions and have good checkpointing, spot instances are free money.
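
The rule can be made concrete. A sketch of the break-even check, with every number below illustrative:

```python
def spot_worth_it(hourly_savings, extra_ops_hours, your_hourly_rate,
                  run_hours):
    """Spot pays off when dollars saved exceed the value of the extra
    engineering time (handlers, retries, babysitting) it costs you."""
    return hourly_savings * run_hours > extra_ops_hours * your_hourly_rate

# Saving $2.25/hr over a 100-hour run vs. 2 hours of setup at $150/hr:
# $225 saved < $300 of your time, so stay on-demand for this one run.
print(spot_worth_it(2.25, 2, 150, 100))
```

The setup cost is paid once, so the math flips as soon as you amortize it over several runs.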