This week we all witnessed another major cloud outage. This time it was Microsoft Azure. According to Microsoft, the outage was caused by an undetected bug that was rolled out to production affecting the block storage used by almost all Azure services.
Cloud outages happen. There have been too many of them to keep track of. Not all outages are major, but every day we see many small incidents affecting a small number of services.
I think the best way to mitigate the risk of cloud outages is to first accept that no matter how hard cloud providers try, there are always going to be issues affecting their customers’ availability and performance. Once we accept that as a fact of life, we can try to find a way to protect ourselves against it.
Hybrid Cloud, the hype and the reality
Hybrid cloud, where you weave multiple cloud providers into one infrastructure, has been talked about for a long time now. Frankly, I have not seen any successful production stacks deployed on a hybrid cloud, nor have I seen much desire from customers to do so.
I believe this is because most hybrid cloud solutions focus on merging multiple clouds into one infrastructure rather than on building an immutable infrastructure setup that can be replicated quickly and easily on any cloud provider. I think the latter is the best way to use multiple cloud providers to achieve high availability and performance.
An immutable stack is one that is not modified but rebuilt every time there is a need for a change. If you can build your entire stack quickly enough on any cloud provider, you can switch your users away from the one suffering from outages in an “acceptable” length of time, depending on the nature of your business.
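To make the idea concrete, here is a minimal sketch in Python of the rebuild-and-switch flow. It is an in-memory toy, not a real provisioning tool: `Stack`, `Router`, `build_stack`, `deploy` and the provider labels are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: a built stack can never be modified in place
class Stack:
    provider: str      # hypothetical provider label, e.g. "cloud-a"
    app_version: str
    build_id: int


class Router:
    """Hypothetical traffic switch: points users at exactly one live stack."""

    def __init__(self) -> None:
        self.live: Optional[Stack] = None

    def switch_to(self, stack: Stack) -> None:
        self.live = stack


_build_counter = 0


def build_stack(provider: str, app_version: str) -> Stack:
    """Stand-in for provisioning a complete stack from scratch."""
    global _build_counter
    _build_counter += 1
    return Stack(provider, app_version, _build_counter)


def deploy(router: Router, provider: str, app_version: str) -> Stack:
    # Every change, whether a new version or a new provider, is a fresh
    # build followed by a traffic switch; the old stack is left untouched.
    new_stack = build_stack(provider, app_version)
    router.switch_to(new_stack)
    return new_stack


router = Router()
first = deploy(router, "cloud-a", "v1")
# cloud-a suffers an outage: rebuild the same version elsewhere and switch.
second = deploy(router, "cloud-b", "v1")
```

The point of the sketch is that a code change and a provider failover go through the same path. If that path is fast, the failover is fast too.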
Discipline requires tools
Building an immutable infrastructure requires both discipline and tools. Good tools make it easier for people to follow good practices.
Instead of policing your devs so they do not jump on servers and run shell commands directly, you can make it very easy to script out what they want done on a server. Instead of constantly asking everyone to write rollback scripts for their migrations, make it easy to redeploy the whole stack from scratch as quickly as possible.
Not every solution is applicable to every situation. There will always be situations when you need to write migration or rollback scripts. We will always need to put out fires or debug issues by having direct access to servers. But let’s focus on the majority of cases. Let’s build our tools, and therefore our disciplines, for the 80% of cases instead of getting hung up on the other 20%.
What makes an immutable stack?
A few principles can help us achieve immutability in our infrastructure:
Limit the “sources of truth” to as few as possible.
If you have application code that runs on servers, Chef scripts to build and modify those servers, VM images built by your build system, and database migration scripts, then you have 4 sources of truth. They can, and will, go out of sync over time. You can police this and put safeguards in place to minimise the risk, but it’s always better to have fewer “sources of truth”.
Moreover, you can be sure that migrations will not run cleanly from start to finish. Dependencies change and components get updated so often that the migration from state A to state B for your servers 6 months ago is almost guaranteed not to run on fresh new servers now.
Strive to make your data store as reproducible as possible.
This sounds easier than it is and can be the most expensive part of your immutable infrastructure. Many Ops teams consider data outside of the remit of building infrastructure. We simply don’t have good tools for data + schema version control.
Taking pre-deployment backups is one strategy. It requires good data store design and backup policies that can be carried out quickly enough. Beyond databases, other data stores (like S3 or other cloud-based storage systems) can offer version control and rollbacks.
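As a rough illustration of the pre-deployment backup idea, the Python sketch below snapshots a data store before applying a change and restores the snapshot if the change fails. A plain dictionary stands in for the data store and a copy of it stands in for the backup. `deploy_with_backup` and the two change functions are hypothetical names, and a real backup would of course be far slower than copying a dict.

```python
from typing import Callable, Dict

DataStore = Dict[str, str]


def deploy_with_backup(store: DataStore, change: Callable[[DataStore], None]) -> bool:
    """Apply `change` to `store`; restore the pre-deployment snapshot on failure."""
    snapshot = dict(store)  # stand-in for taking a real backup
    try:
        change(store)
        return True
    except Exception:
        # Roll back by restoring the snapshot instead of running a
        # hand-written rollback script.
        store.clear()
        store.update(snapshot)
        return False


def good_change(store: DataStore) -> None:
    store["schema"] = "v2"


def bad_change(store: DataStore) -> None:
    store["schema"] = "v3"  # partial write before the failure
    raise RuntimeError("migration failed half-way")


store: DataStore = {"schema": "v1"}
ok = deploy_with_backup(store, good_change)
failed = deploy_with_backup(store, bad_change)
```

After both calls the store still holds the state left by the successful change, because the failed one was undone from the snapshot.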
When it comes to resilience against cloud outages, however, multi-cloud data availability is the most important part of the solution. Setting up database replication across cloud providers and data centres can help with that aim. In this scenario you constantly keep your data “warm” in multiple locations, prepared for a failover.
Make sure you can redirect traffic as quickly as possible.
Using fast-response DNS services with low record TTLs can help in many cases, although not everything between your servers and your users is going to honour TTL values. Still, a combination of low-TTL DNS records, traffic proxy services like Cloudflare and maximum use of a CDN for your static assets can cover the vast majority of cases.
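To see why the TTL matters, here is a small Python sketch of a DNS-style failover decision and a back-of-the-envelope bound on how long the switch takes. Everything in it is an assumption for illustration: the function names, the `.example` hostnames, the health-check numbers, and the simplification that the DNS record itself is updated instantly once an outage is detected. As noted above, clients and resolvers that ignore TTLs can take longer than this bound.

```python
from typing import Callable, Optional, Sequence


def pick_target(endpoints: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


def worst_case_switch_seconds(check_interval: int,
                              failures_needed: int,
                              ttl: int) -> int:
    """Upper bound on the time until TTL-honouring resolvers see the new record.

    Detection takes up to `check_interval * failures_needed` seconds, after
    which cached answers can live for up to `ttl` more seconds.
    """
    return check_interval * failures_needed + ttl


# Hypothetical endpoints on two providers, in priority order.
endpoints = ["app.cloud-a.example", "app.cloud-b.example"]
health = {"app.cloud-a.example": False, "app.cloud-b.example": True}

target = pick_target(endpoints, lambda e: health[e])
low_ttl = worst_case_switch_seconds(check_interval=10, failures_needed=3, ttl=60)
high_ttl = worst_case_switch_seconds(check_interval=10, failures_needed=3, ttl=3600)
```

With these made-up numbers, a 60 second TTL bounds the switch at a minute and a half, while a one hour TTL stretches it to just over an hour, even though detection time is identical.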
Combining these three principles can get you a long way in your quest for high availability and in protecting yourself against cloud outages. At Cloud 66 we build tools to help you with building immutable infrastructure and would love to hear about your experiences with building high availability applications.