AWS Spot instances

From Deep Learning Course Wiki

Short of buying a GPU, AWS P2 instances are a great way to develop ML models. However, they can get quite expensive, starting at $0.90 per hour. Luckily, Amazon lets us bid on whatever spare computing capacity it has at any given moment. The way to do this is to run a Spot instance.

Spot Instances

How much cheaper Spot instances are varies, but P2 Spot instances regularly come at savings of about 70–80%. They are much cheaper but come with a few caveats:

  • Spot instances cannot be stopped and started again; they can only be terminated. This is a big one. Terminating an instance destroys it, so when you start a new instance to resume work, you lose any changes you’ve made. You can keep the volume/disk that the instance was running from, but you can’t easily start a new instance from it.
  • You might be outbid at any time. When you are outbid, your Spot instance is terminated automatically. However, in several months of using Spot instances, I only had this happen once.
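
To see how large the discount is at a given moment, we can compare the current Spot price with the on-demand price. The following is a minimal sketch, assuming the AWS CLI is installed and configured (see Tools Needed below). The helper names savings_percent and current_spot_price are made up for illustration, and the price returned is for just one availability zone.

```shell
# Hypothetical helper: percentage saved when paying <spot> instead of <on-demand>.
# Arguments: <on-demand price> <spot price>
savings_percent() {
    awk -v od="$1" -v spot="$2" 'BEGIN { printf "%.0f\n", (1 - spot / od) * 100 }'
}

# Hypothetical helper: most recent Spot price for an instance type
# (default p2.xlarge) in one availability zone of the configured region.
current_spot_price() {
    aws ec2 describe-spot-price-history \
        --instance-types "${1:-p2.xlarge}" \
        --product-descriptions "Linux/UNIX" \
        --start-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        --query 'SpotPriceHistory[0].SpotPrice' \
        --output text
}

# Usage (needs AWS credentials):
#   savings_percent 0.90 "$(current_spot_price)"
```

For example, savings_percent 0.90 0.25 prints 72, which is in the range quoted above.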

Run a Spot Instance

For developing and running ML models on a Spot instance, we want to use P2 instances. They come with one or more powerful NVIDIA K80 GPUs with lots of memory (11 GB) to test and train your models on.

Tools Needed

1. Install the AWS CLI, a command-line utility that can be used instead of the web-based AWS Console to manage AWS services.

2. Then run aws configure to set your key, secret and region. The regions in which AWS supports P2 instances are N. Virginia (us-east-1), Oregon (us-west-2) and Ireland (eu-west-1). Usually you want to choose the region that’s geographically closest to you.

3. Finally, we download the helper scripts that will assist us in the setup:

git clone --depth=1
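
The region from step 2 can also be set without the interactive prompt. Here is a small sketch that first checks the region against the three listed above. The helper names is_p2_region and set_p2_region are hypothetical; aws configure set and aws configure get are regular AWS CLI subcommands.

```shell
# Regions listed above as supporting P2 instances.
P2_REGIONS="us-east-1 us-west-2 eu-west-1"

# Hypothetical helper: succeeds only if the given region is in the list above.
is_p2_region() {
    case " $P2_REGIONS " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: validate the region, then store it as the default.
set_p2_region() {
    if is_p2_region "$1"; then
        aws configure set region "$1"
    else
        echo "P2 instances are not listed for region: $1" >&2
        return 1
    fi
}

# Usage: set_p2_region us-west-2 && aws configure get region
```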

Virtual Private Cloud (VPC)

Before we can start P2 instances, we need to set up a Virtual Private Cloud (VPC), which is just a fancy virtual network to launch your virtual machine in. Setting up a VPC can be a little intimidating. The good news is that it has to be done only once.

Let's use a part of Jeremy and Rachel's setup script. If you got the helper scripts from Tools Needed above, simply run the following:

. ec2-spotter/fast_ai/

This will create a VPC, Internet Gateway, Subnet, Route Table, Security Group and, most importantly, a Key Pair. We will use the newly created key (located at ~/.ssh/aws-key-fast-ai.pem) to connect to the instance we are about to create. The script will also print the IDs of our newly created Subnet and Security Group. We’ll need these for the next step.
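
If you are curious what a setup like this involves, here is a rough sketch that creates the same kinds of resources directly with the AWS CLI. It is not the helper script itself. The function name, the 10.0.0.0/28 CIDR block, the resource name and the key file path are all illustrative assumptions, and the security group here only opens SSH (port 22).

```shell
# Sketch only: create a VPC, Internet Gateway, Subnet, Route Table,
# Security Group (SSH open to any address) and Key Pair.
# Arguments: <name> <key-file>. The CIDR block is an arbitrary small range.
create_vpc_resources() {
    local name="$1" key_file="$2" cidr="10.0.0.0/28"
    local vpc igw subnet rtb sg

    vpc=$(aws ec2 create-vpc --cidr-block "$cidr" --query 'Vpc.VpcId' --output text)
    igw=$(aws ec2 create-internet-gateway --query 'InternetGateway.InternetGatewayId' --output text)
    aws ec2 attach-internet-gateway --internet-gateway-id "$igw" --vpc-id "$vpc"

    subnet=$(aws ec2 create-subnet --vpc-id "$vpc" --cidr-block "$cidr" --query 'Subnet.SubnetId' --output text)
    rtb=$(aws ec2 create-route-table --vpc-id "$vpc" --query 'RouteTable.RouteTableId' --output text)
    aws ec2 associate-route-table --route-table-id "$rtb" --subnet-id "$subnet" >/dev/null
    # Send all outbound traffic through the Internet Gateway.
    aws ec2 create-route --route-table-id "$rtb" --destination-cidr-block 0.0.0.0/0 --gateway-id "$igw" >/dev/null

    sg=$(aws ec2 create-security-group --group-name "$name" --description "SSH access" --vpc-id "$vpc" --query 'GroupId' --output text)
    aws ec2 authorize-security-group-ingress --group-id "$sg" --protocol tcp --port 22 --cidr 0.0.0.0/0 >/dev/null

    # Save the private key and make it readable by the owner only, as ssh requires.
    aws ec2 create-key-pair --key-name "$name" --query 'KeyMaterial' --output text > "$key_file"
    chmod 400 "$key_file"

    echo "subnetId=$subnet securityGroupId=$sg"
}

# Usage (hypothetical names): create_vpc_resources my-vpc ~/.ssh/my-vpc.pem
```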

Create the Instance

We have a little helper script to launch the instance. We need to pass it the following arguments:

  • ami: Depending on which region we have picked and whether we want to use the fast.ai image or the Amazon one, we need to select an image (the Amazon images below are updated to version 1.3 from April 2017):

    Region      fast.ai        Amazon
    us-east-1   ami-31ecfb26   ami-fb8e19ed
    us-west-2   ami-bc508adc   ami-638c1e03
    eu-west-1   ami-b43d1ec7   ami-c5afaaa3

  • subnetId: Use the subnet ID printed by the VPC setup step above.
  • securityGroupId: Use the security group ID printed by the VPC setup step above.

For example:

. ec2-spotter/fast_ai/ --ami ami-53b23433 --subnetId subnet-9f69c3d6 --securityGroupId sg-a62f2ede

The script will then print the IP of our new Spot instance.

We might also pass the following:

  • volume_size: size of the root volume in GB. Default: 128.
  • key_name: name of the key file we’ll use to log into the instance. Default: aws-key-fast-ai.
  • ec2spotter_instance_type: type of instance to launch. Default: p2.xlarge.
  • bid_price: the maximum price we are willing to pay, in USD. Default: 0.5.
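
For reference, a Spot request with these arguments can also be placed directly through the AWS CLI. The sketch below is not how the helper script is implemented, just one way to do the same thing. The function names are hypothetical, the instance type, key name, volume size and bid price mirror the defaults listed above, and /dev/sda1 as the root device name is an assumption that should be checked against the chosen AMI.

```shell
# Hypothetical helper: print a launch specification for request-spot-instances.
# Arguments: <ami> <subnet-id> <security-group-id>
launch_spec() {
    cat <<EOF
{
  "ImageId": "$1",
  "InstanceType": "p2.xlarge",
  "KeyName": "aws-key-fast-ai",
  "BlockDeviceMappings": [
    {"DeviceName": "/dev/sda1", "Ebs": {"VolumeSize": 128, "VolumeType": "gp2"}}
  ],
  "NetworkInterfaces": [
    {"DeviceIndex": 0, "SubnetId": "$2", "Groups": ["$3"], "AssociatePublicIpAddress": true}
  ]
}
EOF
}

# Hypothetical helper: place a one-time Spot request, wait until it is
# fulfilled and print the ID of the resulting instance.
# Arguments: <ami> <subnet-id> <security-group-id> [bid price, default 0.5]
request_spot() {
    local request_id
    request_id=$(aws ec2 request-spot-instances \
        --spot-price "${4:-0.5}" --instance-count 1 --type one-time \
        --launch-specification "$(launch_spec "$1" "$2" "$3")" \
        --query 'SpotInstanceRequests[0].SpotInstanceRequestId' --output text)
    aws ec2 wait spot-instance-request-fulfilled --spot-instance-request-ids "$request_id"
    aws ec2 describe-spot-instance-requests --spot-instance-request-ids "$request_id" \
        --query 'SpotInstanceRequests[0].InstanceId' --output text
}

# Usage: request_spot ami-bc508adc subnet-9f69c3d6 sg-a62f2ede
```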


Using the IP of the Spot instance from the previous step, we can connect via ssh:

ssh -i ~/.ssh/aws-key-fast-ai.pem ubuntu@$instance_ip

All seems good!

We have a Virtual Private Cloud and a Spot instance running in it for a fraction of the price. But all is not roses!

If we shut down our instance for the night, all of our changes would be lost. We need to find a way to persist the data on our Spot instances.

Persistence for Spot Instances: Approach 1 — Attached volume

The first approach uses a separate volume where your models and data are kept. This volume is attached to your Spot instance after you start it up. Instructions on how to do this follow:

First off, start a Spot instance (see “Run a Spot Instance” above).

Create a volume

(Only do this once.) We can do it via the AWS CLI or the web-based AWS Console.

Create a volume with AWS Console

Open the AWS Console, select EC2, then:

Step 1. Open “Volumes”

Step 2. Click the “Create Volume” button

Step 3. Make sure that the Volume Type is “General Purpose SSD (GP2)”. A fast SSD drive helps when we constantly need to access the disk, e.g. when our data is too big to fit in memory.

Step 4. Choose an appropriate size for the volume. I usually use 128 GB.

Step 5. Make sure the Availability Zone is the same as the one the instance is in.


Create a volume with aws cli

In the bash command below, change the size of the volume as needed and set the availability zone to the one the instance is in.

aws ec2 create-volume --size $volume_size --availability-zone $availability_zone --volume-type gp2 --output text --query 'VolumeId'
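
If you don't remember which availability zone the instance is in, it can be looked up first. A minimal sketch, assuming you already know the instance ID; the function name create_volume_for_instance is made up for illustration.

```shell
# Hypothetical helper: create a gp2 volume in the same availability zone
# as an instance. Arguments: <instance-id> [size in GB, default 128].
# Prints the new volume ID.
create_volume_for_instance() {
    local zone volume_id
    zone=$(aws ec2 describe-instances --instance-ids "$1" \
        --query 'Reservations[0].Instances[0].Placement.AvailabilityZone' --output text)
    volume_id=$(aws ec2 create-volume --size "${2:-128}" --availability-zone "$zone" \
        --volume-type gp2 --query 'VolumeId' --output text)
    # Block until the volume is ready to be attached.
    aws ec2 wait volume-available --volume-ids "$volume_id"
    echo "$volume_id"
}

# Usage: volume_id=$(create_volume_for_instance i-0fd47cabf6ce1d534 128)
```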

Attach the volume to the instance

Again we can use the cli or the web console.

Attach the volume with aws cli

The volume ID was printed by the create-volume command in the previous step. If we don’t know our instance ID yet, we can review our instances.

aws ec2 attach-volume --volume-id <value> --instance-id <value> --device /dev/sdh
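
The lookup and the attach step can be wrapped into two small functions. This is only a sketch: both function names are hypothetical, and the device name /dev/sdh is the one used in the command above.

```shell
# Hypothetical helper: list running instances, one per line,
# as tab-separated "<instance-id> <availability-zone>".
list_running_instances() {
    aws ec2 describe-instances \
        --filters "Name=instance-state-name,Values=running" \
        --query 'Reservations[].Instances[].[InstanceId,Placement.AvailabilityZone]' \
        --output text
}

# Hypothetical helper: attach a volume as /dev/sdh and wait until it is in use.
# Arguments: <volume-id> <instance-id>
attach_volume() {
    aws ec2 attach-volume --volume-id "$1" --instance-id "$2" --device /dev/sdh >/dev/null
    aws ec2 wait volume-in-use --volume-ids "$1"
}

# Usage: list_running_instances, then attach_volume <volume-id> <instance-id>
```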

Attach the volume with the AWS Console

Open the AWS Console, select EC2, then:

Step 1. Select the Volume we just created (see the creation date if unsure).

Step 2. Open “Actions”

Step 3. Click “Attach Volume”


Step 4. Click the Instance field. A list of instances will pop up.

Step 5. Select the instance to attach the volume to by clicking it. We’ll leave the Device field at the default value.

Step 6. Confirm with the “Attach” button.


Mount the volume

The steps below are from Amazon’s own tutorial on “Making an Amazon EBS Volume Available for Use”:

Step 1. SSH into your instance. 

ssh -i ~/.ssh/aws-key-fast-ai.pem ubuntu@$instance_ip

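Step 2. Run lsblk to see under what name the volume was attached. On Ubuntu the name usually starts with “xvd” and is the last entry on the list: a volume attached as /dev/sdf (the console default) typically shows up as xvdf, and one attached as /dev/sdh (as in the aws cli command above) shows up as xvdh.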

Step 3. If we just created our new volume, we’ll need to format it with a file system. 

CAUTION: Only do this on a newly created volume, otherwise you will erase all data on it. 

Run: sudo mkfs -t ext4 device_name where device_name is “/dev/” plus the device name from step 2. For example: /dev/xvdh

Step 4. Create a directory to mount the volume: sudo mkdir mount_point

Step 5. Finally, mount the volume: sudo mount device_name mount_point

From now on anything you put in the mount_point dir will be stored on the attached volume. This means that you can terminate the Spot instance, then start a new one later, attach and mount the volume as described above and continue where you left off.
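
Steps 3 to 5 can be put into a small script that runs on the instance. The sketch below formats the volume only when file -s reports plain "data" (no filesystem), which guards against the data loss mentioned in the caution above. The function names are hypothetical, and the example device and mount point are placeholders.

```shell
# Hypothetical helper: succeeds only if the device holds no filesystem,
# in which case `file -s` describes its contents as plain "data".
is_blank_device() {
    [ "$(sudo file -sb "$1")" = "data" ]
}

# Hypothetical helper: format the device if (and only if) it is blank,
# then mount it. Arguments: <device> <mount-point>
mount_data_volume() {
    local device="$1" mount_point="$2"
    if is_blank_device "$device"; then
        sudo mkfs -t ext4 "$device"
    fi
    sudo mkdir -p "$mount_point"
    sudo mount "$device" "$mount_point"
}

# Usage: mount_data_volume /dev/xvdh /data
```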

This approach works fine, but it has a big drawback: anything you do outside of the persistent volume is lost when you terminate the instance. The next approach doesn't have this issue.

Persistence for Spot Instances: Approach 2 — Swap root volume

This approach finally manages to make Spot instances behave similarly to on-demand ones. It works by swapping the root volume (where the operating system runs) for another volume right after booting up.

Use this script with a new instance, or an existing one.

Step 0.

  • Make sure you have jq installed.
  • Make sure you have downloaded the helper scripts.

If not, run:

git clone --depth=1
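
A quick way to check that the required tools are available is to ask the shell whether each command is on the PATH. This is a small sketch; missing_tools is a hypothetical helper name.

```shell
# Hypothetical helper: print the names of any given commands that are not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage: prints nothing if everything is installed.
#   missing_tools aws jq git
# On Ubuntu, jq can be installed with: sudo apt-get install -y jq
```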

With a new instance

Step 1. Start a Spot instance (see “Run a Spot Instance” above).

Step 2. Run sh ec2-spotter/fast_ai/ as shown below. It creates a config file for launching a Spot instance from an existing Spot or on-demand instance named fast-ai-gpu-machine (this is what our new instance from Step 1 is named). Instead of finding the instance by name, you can pass the script an instance ID with the --instance_id option, for example --instance_id i-0fd47cabf6ce1d534.

CAUTION: The script will also terminate the instance from Step 1. If you have other instances launched named fast-ai-gpu-machine, the script might terminate them instead, so rename them before running it.

sh ec2-spotter/fast_ai/

NOTE: The config script above will work right off the bat if you used a fast.ai AMI. However, if you used an Amazon AMI, you need to update the parameter ec2spotter_preboot_image_id in ec2-spotter/my.conf: ami-5e63d13e (Oregon), ami-a192bad2 (Ireland) or ami-49c9295f (N. Virginia).
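
Before running the config script, the caution above can be checked by listing every instance that carries the name fast-ai-gpu-machine. A minimal sketch with hypothetical helper names; it only reads instance data and changes nothing.

```shell
# Hypothetical helper: print the IDs of all non-terminated instances whose
# Name tag matches the argument (default: fast-ai-gpu-machine).
instances_named() {
    aws ec2 describe-instances \
        --filters "Name=tag:Name,Values=${1:-fast-ai-gpu-machine}" \
                  "Name=instance-state-name,Values=pending,running,stopping,stopped" \
        --query 'Reservations[].Instances[].InstanceId' \
        --output text
}

# Hypothetical helper: succeeds if at most one instance carries the name.
name_is_unambiguous() {
    # Word-split the ID list into positional parameters and count them.
    set -- $(instances_named "$1")
    [ "$#" -le 1 ]
}

# Usage:
#   name_is_unambiguous fast-ai-gpu-machine || echo "Rename the extra instances first"
```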

Step 3. Then every time you need a P2 Spot instance, just run: sh fast_ai/

It will start a new Spot instance, then at boot time it will swap its root volume with the volume of the Step 1 instance. This might take a few (2 to 5) minutes to finish.

Now when you terminate the instance and start it again later, any changes to the filesystem will persist!

Using an existing instance

Step 1. Stop your existing instance and detach its root volume.


Step 2. Give the newly detached volume any name.
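
Steps 1 and 2 can also be done from the command line. The following is a sketch under the assumption that the instance is an on-demand one you are allowed to stop; the function name detach_root_volume is hypothetical.

```shell
# Hypothetical helper: stop an instance, detach its root volume and give
# the volume a Name tag. Arguments: <instance-id> <volume-name>.
# Prints the ID of the detached volume.
detach_root_volume() {
    local instance_id="$1" volume_name="$2" root_device volume_id
    aws ec2 stop-instances --instance-ids "$instance_id" >/dev/null
    aws ec2 wait instance-stopped --instance-ids "$instance_id"

    # Find the volume attached at the instance's root device name.
    root_device=$(aws ec2 describe-instances --instance-ids "$instance_id" \
        --query 'Reservations[0].Instances[0].RootDeviceName' --output text)
    volume_id=$(aws ec2 describe-volumes \
        --filters "Name=attachment.instance-id,Values=$instance_id" \
                  "Name=attachment.device,Values=$root_device" \
        --query 'Volumes[0].VolumeId' --output text)

    aws ec2 detach-volume --volume-id "$volume_id" >/dev/null
    aws ec2 wait volume-available --volume-ids "$volume_id"
    aws ec2 create-tags --resources "$volume_id" --tags "Key=Name,Value=$volume_name"
    echo "$volume_id"
}

# Usage (hypothetical volume name): detach_root_volume i-0fd47cabf6ce1d534 my-root-volume
```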


Step 3. Create a copy of example.conf named my.conf:

cp ec2-spotter/example.conf ec2-spotter/my.conf

Step 4. Modify the settings inside my.conf. Especially:

ec2spotter_volume_name : the name you gave the volume in step 2.
ec2spotter_launch_zone : the availability zone where you want to launch your instance.
ec2spotter_subnet : ID of the subnet to use.
ec2spotter_security_group : ID of the security group to attach to the instance.
ec2spotter_preboot_image_id : The image to preboot the instance with. fast.ai uses Ubuntu 16.04 as the base for its ML images, so for a fast.ai image we need to supply the Ubuntu 16.04 AMI here: ami-7c803d1c (Oregon), ami-d8f4deab (Ireland) or ami-6edd3078 (N. Virginia). Amazon uses Ubuntu 14.04, so for an Amazon image we can use the following AMIs: ami-5e63d13e (Oregon), ami-a192bad2 (Ireland) or ami-49c9295f (N. Virginia).

If you don’t know the subnet ID yet, you can get it from your list of subnets. The same goes for the security group.
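
The same IDs can be listed with the AWS CLI. A small sketch with hypothetical function names; subnet_zone is handy for filling in ec2spotter_launch_zone, since an instance launches in the availability zone of its subnet.

```shell
# Hypothetical helper: list subnets as tab-separated
# "<subnet-id> <availability-zone> <vpc-id>".
list_subnets() {
    aws ec2 describe-subnets \
        --query 'Subnets[].[SubnetId,AvailabilityZone,VpcId]' --output text
}

# Hypothetical helper: list security groups as tab-separated
# "<group-id> <group-name> <vpc-id>".
list_security_groups() {
    aws ec2 describe-security-groups \
        --query 'SecurityGroups[].[GroupId,GroupName,VpcId]' --output text
}

# Hypothetical helper: print the availability zone of one subnet.
subnet_zone() {
    aws ec2 describe-subnets --subnet-ids "$1" \
        --query 'Subnets[0].AvailabilityZone' --output text
}

# Usage: list_subnets; list_security_groups; subnet_zone subnet-9f69c3d6
```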

Step 5. Then every time you need a P2 Spot instance just run sh fast_ai/

It will start a new Spot instance, then at boot time it will swap its root volume with the volume of the Step 1 instance. This might take a few (2 to 5) minutes to finish.

Now when you terminate the instance and start it again later, any changes to the filesystem will persist!

Note: If you are having issues with nvidia-smi not working in your spot instance, Gaiar Baimuratov posted a great solution.

This approach was adapted from a Medium article.