Recently we moved our whole production infrastructure from EC2 Classic to EC2 VPC.
We are very happy with the move, and this post is an attempt to outline our motivation for it and our experience doing it.
Cloud 66 Production Infrastructure - before the move
We have servers in all the other cloud providers we support but the main bulk of our servers run on EC2 and Rackspace. EC2 runs our production servers and Rackspace is used for our supporting systems: admin systems, monitoring tools, dashboards and more.
Our EC2 servers run Ubuntu 12.04 backed by EBS volumes and in case of some servers (like our Redis, file servers and metrics servers) we use high IOPS volumes.
We use Elastic Load Balancer to handle our frontend traffic and Multi AZ AWS RDS for our DB instances with read replicas for our reporting services.
As well as that we have Redis, Collectd, ElasticSearch and multi-terabytes of S3 storage. I’m going to write more in future about each one of these components.
We use a group of gateway servers to connect to deploying (client) servers. Those servers need fixed IP addresses so that clients' firewalls can be configured to allow only Cloud 66 IP addresses in. We achieve this by using Elastic IP addresses for our gateway servers. However this doesn't work when deploying to servers running in the same data centre, as those servers are reached via the internal IP address of our gateway servers, and with EC2 Classic you cannot fix the internal IP address of a server: every time a server is restarted it gets a new internal IP address.
This made our maintenance jobs and upgrades very tricky.
Amazon VPC, or Virtual Private Cloud, works by allocating a private network subnet to your servers. You can separate your servers into different subnets inside your own VPC, just as you would in your own data centre. This gives you a lot of flexibility and control over security and traffic routing.
Around a month ago, we started our move from EC2 Classic to EC2 VPC. This was a major move as we built our whole production site from scratch and changed the way we arrange our servers and route our internal and external traffic.
While we use Cloud 66 production to deploy Cloud 66 Staging, Development and Test stacks as well as all of our support systems, there is one stack we can't deploy with Cloud 66 production: Cloud 66 production itself. It's like pulling oneself over a fence by one's bootstraps!
The move from EC2 Classic to EC2 VPC
The most important thing about the move is to know exactly what you want to achieve by the end of it. Once you have a clear picture of the endgame, it becomes a relatively easy task.
Currently you can't move an EC2 instance into a VPC: you have to build a new server from an image inside the VPC. You also cannot use Elastic IPs (EIPs) from EC2 Classic in EC2 VPC. Those two were the biggest challenges we had during our move.
Step 1: Design your VPC
Our new VPC has five subnets: one public subnet, two private subnets and two RDS Multi-AZ subnets.
The public subnet includes our Internet Gateway and our load balancers.
Each private subnet is protected with a different set of firewalls and access keys, and between them they house all of our servers. A NAT server connects private subnet 1 to the public subnet. This is an unusual setup for a VPC: usually web servers and other public-facing servers live in the public subnet while the backend servers are located in the private subnets. However, our requirement for all of our servers to reach the outside world from a fixed address meant we needed a single EIP to act as a one-way gateway from our servers to the outside world. This is achieved with a NAT node.
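The skeleton of such a VPC can be sketched with the AWS CLI. This is a rough outline, not our exact setup: the CIDR blocks and resource IDs below are placeholders, and in practice you would capture each returned ID before using it in the next command.

```shell
# Create the VPC and its subnets (CIDRs are illustrative)
aws ec2 create-vpc --cidr-block 10.0.0.0/16
aws ec2 create-subnet --vpc-id vpc-xxxx --cidr-block 10.0.0.0/24   # public
aws ec2 create-subnet --vpc-id vpc-xxxx --cidr-block 10.0.1.0/24   # private 1
aws ec2 create-subnet --vpc-id vpc-xxxx --cidr-block 10.0.2.0/24   # private 2

# Attach an Internet Gateway and route the public subnet through it
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway --internet-gateway-id igw-xxxx --vpc-id vpc-xxxx
aws ec2 create-route --route-table-id rtb-public \
    --destination-cidr-block 0.0.0.0/0 --gateway-id igw-xxxx

# Route the private subnet's outbound traffic through the NAT instance,
# which holds the single EIP that clients whitelist
aws ec2 create-route --route-table-id rtb-private \
    --destination-cidr-block 0.0.0.0/0 --instance-id i-nat
```

The last route is what makes the design work: every backend server's outbound traffic leaves through the NAT instance, so clients only ever see one fixed address.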
Step 2: Rebuild the servers
We started the move by taking the latest images from our EC2 instances and building them again within our new VPC. We also built a new RDS cluster in the VPC.
The easiest way to rebuild the servers is to shut them down, take an AMI image (if you don’t have one already) and use that image to fire up the new EC2 VPC instance.
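With the AWS CLI, the image-and-relaunch step looks roughly like the following (instance, AMI and subnet IDs are placeholders):

```shell
# Stop the Classic instance so the image is taken from a consistent disk
aws ec2 stop-instances --instance-ids i-classic

# Take an AMI of the stopped instance
aws ec2 create-image --instance-id i-classic --name "web-1-pre-vpc"

# Launch a new instance from that image inside a VPC subnet.
# A fixed private IP is one of the things Classic couldn't give us.
aws ec2 run-instances --image-id ami-xxxx --instance-type m1.large \
    --subnet-id subnet-xxxx --private-ip-address 10.0.1.10
```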
Step 3: Warm up whatever you can
We use Redis quite heavily at Cloud 66. From Sidekiq to storing transient logs, Redis has been a great asset to have. After building our VPC Redis instances, we set up the new VPC servers as read-only slaves of our Classic Redis master. This kept the new instances warm and ready to switch over.
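The warm-up itself is a one-liner per instance: point the new VPC Redis at the Classic master with SLAVEOF. Hosts and ports below are illustrative:

```shell
# Against each new VPC Redis instance: replicate from the Classic master
redis-cli -h 10.0.1.30 -p 6379 SLAVEOF classic-redis.example.com 6379

# Refuse writes on the slave while it is warming up
redis-cli -h 10.0.1.30 -p 6379 CONFIG SET slave-read-only yes

# Check the slave has caught up before relying on it
redis-cli -h 10.0.1.30 -p 6379 INFO replication
```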
Currently it is not possible to set up RDS replication between EC2 Classic and EC2 VPC, so we couldn't warm up our databases the same way.
Step 4: Reduce your TTL on DNS
This is always an issue when switching traffic between two servers: DNS records need to be updated, and any resolver along the way can cache the old records, which delays the switchover for some visitors. The most common way to deal with this is to reduce your DNS record TTLs to a low value, like 5 minutes, at least 24 hours before the move. This way all the caches have to check for new values every 5 minutes. Even so, a smooth switchover is not guaranteed, as some resolvers might not honour the TTL.
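You can check the TTL that resolvers are actually handing out with dig (the domain here is a placeholder):

```shell
# The second field of each answer line is the remaining TTL in seconds;
# once the change has propagated it should count down from 300 (5 minutes)
dig +noall +answer www.example.com A
```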
Step 5: Tunneling
With everything in place we started the switchover. To minimise downtime, we used SSH tunnels between EC2 Classic and EC2 VPC instances to switch over the different services gradually: a quick shutdown of the service allowed us to use our latest DB backup to build our VPC database. Once the database was in place, we used an SSH tunnel from the Classic services to connect to the new VPC databases (remember, the Classic instances are on the public internet as far as the VPC is concerned, so you can't just have the old servers connect to the new DB servers directly).
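Such a tunnel is a single ssh invocation run on each Classic server; the application then talks to localhost and the traffic lands on the VPC database. Host names, addresses and ports below are examples:

```shell
# On the Classic app server: forward local port 3306 to the VPC database
# (10.0.1.20) via a publicly reachable VPC host, such as the NAT node's EIP.
# -N: no remote command, -f: go to background after authentication
ssh -f -N -L 3306:10.0.1.20:3306 ubuntu@nat-eip.example.com

# The Classic app now reaches the VPC DB as if it were local:
mysql --host 127.0.0.1 --port 3306 -u app -p
```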
We did the same thing for Redis: promoting the VPC Redis servers to master and tunnelling Redis traffic from the Classic servers to the VPC Redis instances allowed us to replace the old Redis servers as well.
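Promoting a warmed-up slave is the mirror image of the warm-up step (host and port again illustrative):

```shell
# Stop replicating and start accepting writes: the VPC instance becomes master
redis-cli -h 10.0.1.30 -p 6379 SLAVEOF NO ONE
```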
Step 6: The final move
With all data being served from the new VPC instances, we shut the site down again briefly to move our data volumes. This involved shutting down our file servers, unmounting the EBS volumes from them and mounting them back on the new VPC file servers.
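The volume move maps onto two EC2 API calls per volume. IDs and device names below are placeholders, and the filesystem must be cleanly unmounted on the old server first:

```shell
# On the old file server: unmount the filesystem before detaching
umount /dev/xvdf

# Detach from the Classic instance, then attach to the VPC instance.
# Note: the volume and the new instance must be in the same availability zone.
aws ec2 detach-volume --volume-id vol-xxxx
aws ec2 attach-volume --volume-id vol-xxxx --instance-id i-vpcfile --device /dev/sdf
```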
With the last piece of backend service being moved, we then switched our DNS records to the new load balancers and restored the production traffic fully on the new VPC stack.
The move to VPC has given us great flexibility and control over our security, traffic routing and availability. It wasn't easy and we had to plan for it carefully. The most important thing was making sure we understood the limits of VPC and designed the new stack around them. Things like NAT servers, routing tables and internet gateways had to be configured to allow our unusual traffic requirements through.
VPC is not necessarily the best solution for every workload as it can add to the complexities of running a production stack, but if your workload requires private VPN connections, split subnets and tight control over internet access of your backend servers, it is a very good option to achieve those goals.