As cloud infrastructure and code grow more complex, disaster recovery quickly becomes expensive for developers and businesses alike. Time is the single most important variable in these situations, be it the time your application is offline or the time you spend solving an issue.
With this in mind, we're working hard to reduce mean recovery time, and we recently saw the effect of that work firsthand. Peter Berkenbosch, a freelancer and developer at Spree Commerce, and a Cloud 66 customer, recently had to recover from a serious issue.
His web application stopped responding to visitors, and a joint investigation traced the problem to an unstable database server. Rather than spending time on finding the root cause, we cloned his stack to new hardware and used our database import feature to restore his data on the new stack.
Once the new stack came up, the application was back to serving requests smoothly, and the DNS was simply switched over. This is a good example of a problem that could stem from any one of many components, which makes the exact cause hard to determine. Instead of spending time (and thus money) understanding the issue, we were able to swiftly move his application from one stack to another so that he could continue serving content to his visitors.