Site Availability

If you are surrounded by nerds and geeks, you have probably heard about websites being "always" available. Once you actually get involved with a project, though, that luxury starts to look more and more like a utopia. Still, there is hope.

So how did those websites achieve that? Well, it comes at a price. You may remember the time when Facebook was down for half an hour each day, compared to now, when it barely goes down at all; now compare the brand's worth in those days with its worth today. But that is Facebook, and people desperately need it every single second, so they are going to have to pay the price! The good news is that, thanks to technology getting cheaper and cheaper, achieving high availability now looks more promising than ever.

High Availability

High availability is a characteristic of a system, which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.

Modernization has resulted in an increased reliance on these systems. For example, hospitals and data centers require high availability of their systems to perform routine daily activities. Availability refers to the ability of the user community to obtain a service or access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is, from the user's point of view, unavailable.

There are three principles of systems design in reliability engineering which can help achieve high availability.

Elimination of single points of failure:
This means adding redundancy to the system so that failure of a component does not mean failure of the entire system.
Reliable crossover:
In redundant systems, the crossover point itself tends to become a single point of failure. Reliable systems must provide for reliable crossover.
Detection of failures as they occur:
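If the two principles above are observed, then a user may never see a failure (note the usage of "may"), which is exactly why failures have to be detected by the system itself rather than reported by users.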

Of the three principles above, Cloud 66 mainly tries to cover the second one, reliable crossover. What else does it offer?

Redundancy
    Load Balancer
    Failover Groups
Monitoring

Redundancy

This is about eliminating single points of failure. At Cloud 66 we offer redundancy for most parts of a stack, including database replication, multiple web servers and a secondary stack.
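
To get a feel for why redundancy helps, here is a rough back-of-the-envelope sketch in Python. It assumes that servers fail independently of each other, which is an optimistic simplification (a shared database or network outage breaks it), and the function name is made up for this illustration.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of a group of redundant servers.

    Assumes failures are independent: the group is down only when
    every replica is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    # Compare one, two and three servers that are each up 99% of the time.
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.99, n):.6f}")
```

Under that assumption, two servers that are each up 99% of the time give a combined availability of about 99.99%.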

Load Balancer

Load balancers distribute traffic across your web servers and offer benefits such as maximizing throughput, minimizing response times and avoiding overload on any single server. Ultimately, load balancing increases the reliability of your stack.

Load balancers have a feature called a health check: the load balancer hits an endpoint on each of the servers behind it to check whether that server is healthy. Traffic is distributed only to the servers that pass the check.
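
The idea can be sketched in a few lines of Python. This is only an illustration, not how any particular load balancer is implemented: the server names and the `is_healthy` callback are hypothetical, and a real load balancer repeats the health check periodically instead of once.

```python
from itertools import cycle
from typing import Callable, Iterable, List


def healthy_servers(servers: Iterable[str],
                    is_healthy: Callable[[str], bool]) -> List[str]:
    """Return only the servers whose health check passes."""
    return [server for server in servers if is_healthy(server)]


def round_robin(servers: Iterable[str],
                is_healthy: Callable[[str], bool],
                requests: int) -> List[str]:
    """Assign `requests` requests to the healthy servers in round-robin order."""
    pool = healthy_servers(servers, is_healthy)
    if not pool:
        raise RuntimeError("no healthy servers behind the load balancer")
    rotation = cycle(pool)
    return [next(rotation) for _ in range(requests)]


if __name__ == "__main__":
    # Hypothetical health status: web2 fails its check and gets no traffic.
    status = {"web1": True, "web2": False, "web3": True}
    print(round_robin(status, status.get, 4))
```

Note that when every server fails the check, there is nowhere left to send traffic, which is why the sketch raises an error in that case.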

It is worth defining the health check endpoint so that it goes through every component of your stack, for example by running a database query or calling a background worker, so that you test almost everything by hitting this one endpoint. Yes, I know: if the database is having issues, the health check will fail on all the servers behind the load balancer, so no traffic will be distributed at all. But that surely indicates that your website is down, and someone needs to jump behind the steering wheel and take control. On the other hand, it will catch the issues that only one of the servers may have.
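
As a minimal sketch of such an endpoint's logic in Python: the function below runs a set of component checks and decides which HTTP status to return. The check names and the lambdas in the example are placeholders; in a real application they would run an actual database query or ping a worker queue, and the function would be wired into whatever web framework you use.

```python
from typing import Callable, Dict, Tuple


def run_health_check(
    checks: Dict[str, Callable[[], bool]]
) -> Tuple[int, Dict[str, str]]:
    """Run every component check and return (HTTP status, per-component report).

    A check that returns False or raises an exception counts as a failure.
    """
    report: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            report[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check must not crash the endpoint
            report[name] = f"error: {exc}"
    # 200 tells the load balancer the server is healthy, 503 that it is not.
    status = 200 if all(value == "ok" for value in report.values()) else 503
    return status, report


if __name__ == "__main__":
    # Placeholder checks standing in for real database and worker probes.
    status, report = run_health_check({
        "database": lambda: True,
        "background_workers": lambda: False,
    })
    print(status, report)
```

Returning the per-component report alongside the status makes it easier for whoever takes control to see which part of the stack is failing.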

For instance, imagine that memory usage on one of your servers is so high that background jobs are not working properly, while a simple health check endpoint still responds. Having an endpoint that checks all of these components with one request is a big improvement. However, it may come at a cost: an endpoint that goes through all the components needs resources, which can affect the performance of the actual website. Finding that balance is totally up to you, and it is attainable through trial and error. Maybe have a complex health check endpoint, but call it less frequently?
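
One way to get a thorough check without paying for it on every request is to cache its result for a while. The Python sketch below is only one possible approach, and the `CachedCheck` class and its `ttl` parameter are names invented for this example; the right interval is something you would tune for your own stack.

```python
import time
from typing import Callable, Optional


class CachedCheck:
    """Wrap an expensive check so it runs at most once every `ttl` seconds."""

    def __init__(self, check: Callable[[], bool], ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check
        self._ttl = ttl
        self._clock = clock  # injectable so the class can be tested without sleeping
        self._last_run: Optional[float] = None
        self._last_result = False

    def __call__(self) -> bool:
        now = self._clock()
        # Re-run the real check only on first use or when the cached result expired.
        if self._last_run is None or now - self._last_run >= self._ttl:
            self._last_result = self._check()
            self._last_run = now
        return self._last_result


if __name__ == "__main__":
    # Placeholder for an expensive check that touches every component.
    deep_check = CachedCheck(lambda: True, ttl=30.0)
    print(deep_check(), deep_check())  # the second call reuses the cached result
```

The trade-off is that a failure can go unnoticed for up to `ttl` seconds, so a longer interval means less load but slower detection.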

Failover Groups

In computing, failover is switching to a redundant or standby computer server, system, hardware component or network upon the failure or abnormal termination of the previously active application.

In Cloud 66 jargon, a failover group is a managed, quick-response DNS address that automatically follows your stack's web endpoints. You can connect it to up to two stacks at any time: a primary and a secondary stack. Should you need to switch traffic between your stacks, simply toggle the switch and your traffic will flow to the secondary stack within 5 minutes.

Let's imagine you want to set up your secondary stack. It is best to build it in another region (or whatever term your cloud provider uses), so that a problem affecting one geographical location does not take down both stacks.

Monitoring

Although Cloud 66 detects server connectivity issues, we don’t currently detect application states. There is a lot of complexity around the numerous scenarios possible, and some applications such as APIs may not have a healthy state in the same way that web applications do. Other applications may not be open to the world.

There are external services tailored for this type of monitoring, which allow you to customize your monitoring as required. For example, Pingdom offers a great amount of flexibility in this area.

If you are using a monitoring tool, you can define an endpoint in your application for the monitoring system to check. Ideally this endpoint exercises everything, such as database queries and background workers, so that you test almost everything by hitting it; alternatively, you can have one endpoint for each component.

With everything set up correctly, your site will be in fairly good shape in terms of availability. That being said, you need to be ready for the time when things go south, which I will cover in the next post.

PS. The price of a highly available site goes up exponentially: if being up 90% of the time costs $10, then 99% and 99.9% will cost roughly $100 and $1000 respectively.
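
To put numbers on that, here is a small Python sketch that converts an availability percentage into hours of downtime per year and applies the "ten times more per extra nine" rule of thumb from the PS. The function names are made up for this example, and the cost figures are only the illustrative ones above, not real prices.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a site is down at the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def estimated_cost(nines: int, base_cost: float = 10.0) -> float:
    """Cost under the rule of thumb that each extra nine costs ten times more.

    One nine is 90%, two nines is 99%, three nines is 99.9%, and so on.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return base_cost * 10 ** (nines - 1)


if __name__ == "__main__":
    for nines, percent in ((1, 90.0), (2, 99.0), (3, 99.9)):
        hours = downtime_hours_per_year(percent)
        print(f"{percent}%: {hours:.2f} h/year down, ${estimated_cost(nines):.0f}")
```

In other words, going from 90% to 99.9% cuts yearly downtime from roughly 876 hours to under 9 hours, while the illustrative cost grows a hundredfold.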