Achieving High-Availability for Client-Server Architectures

  • Read-time: 20 minutes
  • Difficulty: Medium (DNS terms, networking terms, light math)
  • Requirements: none

Many system architects and administrators need their systems to withstand failure, routing around faults without manual intervention. Automated fault detection and self-healing methods quickly pay off, ensuring revenue stability and a quality user experience. Just as an example, a single day of downtime on a server takes your annual up-time availability from 100% to roughly 99.7%. One bad day can easily ruin a whole year’s worth of availability goals (not to mention revenue and reputation!). How, then, do we achieve three or even four nines of availability without spending a ton of money on bespoke high-availability platforms?

You can employ one or more well-known tactics to auto-detect failure conditions and route client requests around them. In this article we explore some of the most common techniques.

Defining Availability

You would think that the math surrounding availability would be straightforward – no? Well, once you put it through the legal engines of large companies it becomes anything but…

Still, the basis for availability is straightforward:

Availability = (Number of seconds in a year - Outage seconds) / Total seconds in a year

Using the example above, being down for a day (or 86,400 seconds) yields the following:

(31,536,000 - 86,400) / 31,536,000 ≈ 99.73%
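
If you want to sanity-check the math, a few lines of Python do the job (the outage duration here is just the one-day example from above):

# Quick sanity check of the availability math above.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60      # 31,536,000

def availability(outage_seconds: float) -> float:
    """Fraction of the year the service was up."""
    return (SECONDS_PER_YEAR - outage_seconds) / SECONDS_PER_YEAR

print(f"{availability(86_400):.2%}")       # one full day down -> 99.73%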

Simple – yes? Well, here come the details… Do we mean “unplanned outage time,” or does this include planned maintenance outages? That is, if your provider tells you that they’re going to be down for the month of January, does that count against the time? Typically, no. Availability numbers only count unplanned outage events. So, there’s caveat number one.

Then there are the time boundaries that govern the outage. Do we mean when the system actually went down (that seems appropriate, right)? Well, typically, no – but it depends on the contract. The legal jargon can go many ways, but many companies start the downtime clock “at time of detection,” or even “at time of ticket” (which is really bad if their ticket response time is slow). Caveat number two.

It’s less ambiguous to time when service is up, right? Not so fast! Most legal contracts will establish that the outage stops with “initial service restoration”. So, if you have five web hosts serving your content, and all are down, the clock stops when the first one is back on-line. To the layman, this makes sense. In practice – it’s less obvious.

Let’s delve in a bit:

Using the five web servers we posited above – let’s stop the clock when the first one is back (listening again on ports 80 and 443). Great, right?! Not quite. The one server that is up and working is going to get the load typically meant for five servers. Depending on how well this server manages its resources, it might be able to keep up with some of the traffic.

However, if the website is complex, chances are high that the server is crushed under the flood of request traffic (meant for many more servers) and becomes ineffectual to the point of crashing. This cycle often continues until enough of the five web servers are up to effectively handle the load. This might mean three or even four servers. There’s another caveat.

Oh my, is there no end? Well, fortunately for us in the technical space, these are details that are left to others – lawyers and executives, usually. We just need to focus on getting systems up and healthy. However, it is critical that both architects and administrators know where the lines are drawn.

Timing the Failure and Recovery

Without re-hashing the caveats above – what is a failure, really? From a technical perspective, a failure is made up of two components:

  • Component 1: The outage itself – services going offline (like ports being down or unresponsive)
  • Component 2: The convergence (or self-healing). That is, the systems detecting and routing around the failure.

The first component – the failure itself – is out of our hands. For the sake of this writing we’re going to assume that failure is obvious and instantaneous. In reality, faults can be subtle and hard to pinpoint.

The second component is a bit more interesting to this paper. It’s made up of two sub-pieces:

  • Sub-Piece 1: the detection of the failure (e.g., a heart-beat violation or not responding to X pings)
  • Sub-Piece 2: the convergence – having the system route around the failure

The first piece (detection) can take many forms depending on the situation and system.
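
As a concrete (purely illustrative) example of application-level detection, a monitor might try to open a TCP connection to the service on a schedule and declare a failure only after several consecutive misses. A minimal Python sketch, with made-up host, port, and thresholds:

import socket
import time

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def watch(host: str, port: int, misses_allowed: int = 3, interval: float = 5.0) -> None:
    """Declare the target down after N consecutive missed probes."""
    misses = 0
    while misses < misses_allowed:
        misses = 0 if port_is_open(host, port) else misses + 1
        time.sleep(interval)
    print(f"{host}:{port} declared DOWN after {misses_allowed} consecutive misses")

# watch("web-01.example.com", 443)   # hypothetical host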

Using networking as a convenient example: Layer-2 switches have different modes of detecting “ethernet link down” when a far-end device is disconnected. Most switches decide the ethernet link is gone when neither an RD-input signal nor a link pulse has been received within a specified time (don’t worry if you don’t know these terms – just read that the network switch doesn’t see expected traffic for some time period). This is the failure condition – or failure detection. The switches then communicate among each other so that they are all aware of the failure; they can then begin the recovery process.

The second piece, recovery or restoration, often accounts for far more of the overall outage duration than detection does (speaking broadly). Continuing with the example above, let’s say that all the switches in the network see the fault and adapt (if possible), sending traffic down alternate paths.

In ethernet networking this is handled by the “spanning tree protocol” (STP). It builds a graph of the available links and determines alternate routes to the end-point. You can think of it as routing around a traffic accident by using side-streets to get to your destination. Once all devices have updated their routing around the fault, the network is said to have converged on the new path to the end-point – meaning that the network is again in a unified and stable state.

The sum of all of that is the outage time:

Outage time = detection method time + convergence time
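
For example, if link-loss detection takes roughly 2 seconds and STP convergence takes roughly 30 seconds (both figures purely illustrative), clients see an outage of roughly 32 seconds.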

Client-Server Recovery Methods

In the example above, the network recovered using STP and routing around the failure. This allowed traffic to continue regardless of the fault. We want to do something similar – but at the application layer.

For client-server models “Availability” means that our client is given a response to its request. Let’s go back to our original example – having some web servers that are serving up web pages. How do we achieve high uptimes with this build, regardless of faults?

There are three main ways we can route a client’s requests while accounting for failure and convergence:

1. We can use the client’s DNS request to select the IP address of a healthy machine (and take unhealthy machines out of the selection).

2. We can use an application load balancer to proxy connections and route around faults (sending traffic only to healthy and responsive web servers).

3. We can use network “anycast” to deliver the request to the nearest on-line server that is healthy.

Let’s look at each of these along with their respective pros and cons:

DNS Routing

I won’t go into the details of DNS routing here; that’s left to other sources. I’m going to assume you have some basic DNS understanding for the following explanation. In general, DNS servers are going to receive the client’s request and select one or more IP addresses to handle that request (sending back the IP address of one of our five web servers).

In the DNS world there are really two ways to route around failures:

1. use “smart DNS” or “policy DNS”

2. hand out multiple DNS answers to the client.

In the first case, smart DNS, the DNS service itself constantly monitors the health of the controlled machines (testing the A-record IP addresses that it maps to the domain). When a server (IP address) stops being responsive (failed pings, an unresponsive TCP port, etc.) it takes that IP out of the mix of possible answers – marking it as “unhealthy”.

Note that most DNS servers don’t do this natively – however, additional modules and services have come about that make it possible. Many large DNS providers now offer this service for additional cost (e.g., AWS Route 53 has many forms of server health monitoring).
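
To make the policy concrete, here is a rough sketch (not any particular product’s implementation) of what “answer only with healthy A-records” amounts to; the addresses and health-check port are invented for illustration:

import random
import socket

# The full pool of A-records for the domain (addresses are made up).
A_RECORD_POOL = ["203.0.113.11", "203.0.113.12", "203.0.113.13",
                 "203.0.113.14", "203.0.113.15"]

def healthy(ip: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Health probe: can we open a TCP connection to the server?"""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def answer_for_query() -> str:
    """Return one healthy IP to hand back as the A-record answer."""
    candidates = [ip for ip in A_RECORD_POOL if healthy(ip)]
    return random.choice(candidates if candidates else A_RECORD_POOL)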

Pros: fairly cheap to implement, simple to moderate complexity, keeps knowledge server specific (nothing client-side to worry about)

Cons: need to find or build DNS servers that can do this, need to configure policies for each authoritative domain, the DNS server needs network reachability to the servers it monitors, works for small or medium installations – doesn’t scale to tens of thousands of machines, failover timing is bounded by the DNS TTL – which can be tens or hundreds of seconds

The second DNS method – client multi-answer – is simply having the DNS server send back multiple answers (A-records). From there, the client can pick one and, if it’s not responsive, move on to the next answer. When the client finds a server that responds, it caches that answer and uses it for further look-ups.

Pros: fairly cheap to implement, simple – DNS does this naturally, works for installations of all sizes

Cons: failover timing is still bounded by the DNS TTL (as above), not many clients do this natively (usually requires a custom client handler or control over the client – see the sketch below)
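
The “custom client handler” boils down to something like this sketch: resolve all of the A-records, then walk the list until one accepts a connection (the hostname and port are placeholders):

import socket

def connect_with_failover(hostname: str, port: int = 443, timeout: float = 2.0) -> socket.socket:
    """Try each A-record returned for the hostname until one accepts a connection."""
    answers = []
    for info in socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM):
        ip = info[4][0]
        if ip not in answers:              # de-duplicate while preserving DNS order
            answers.append(ip)
    for ip in answers:
        try:
            return socket.create_connection((ip, port), timeout=timeout)
        except OSError:
            continue                       # this server is down; try the next answer
    raise ConnectionError(f"no responsive server found for {hostname}")

# sock = connect_with_failover("www.example.com")   # hypothetical hostname

A real client would also cache the working address for subsequent requests, as described above.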

Web or Application Load Balancers

Load balancers are meant specifically for this use-case. They sit between the client request and the server farm as a traffic-cop and health-checker. Clients connect to the load balancer (instead of directly to the web server) and the load balancer then uses its policy engine to select the best candidate from the pool of controlled servers.
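
In essence, the policy engine does something like the following on every request: filter the pool down to healthy backends and pick one. The sketch below uses simple round-robin with an invented backend list; real products offer far richer policies (least-connections, response-time weighting, and so on):

from itertools import cycle

# Hypothetical pool of backend web servers behind the load balancer.
BACKENDS = ["10.0.0.11:80", "10.0.0.12:80", "10.0.0.13:80",
            "10.0.0.14:80", "10.0.0.15:80"]
HEALTHY = set(BACKENDS)          # maintained by a background health-check loop

_rotation = cycle(BACKENDS)

def pick_backend() -> str:
    """Round-robin over the pool, skipping backends currently marked unhealthy."""
    for _ in range(len(BACKENDS)):
        candidate = next(_rotation)
        if candidate in HEALTHY:
            return candidate
    raise RuntimeError("no healthy backends available")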

There are many different companies and methods to do this – from simple to extremely complex. And, as you can imagine, the price reflects that. This is often the selected approach for companies that have the means and need high-availability. For example, the company F5 sells load balancers.

Pros: Very effective at routing around failures, can also account for slow responses (busy servers), can watch trends and route around failures before they happen, timing is typically fast – using advanced detection methods and trending

Cons: Can be costly, need to buy multiple load balancers to remove single point of failure and have them clustered, yet another device to buy, secure, control, and manage

Anycast routing

Anycast routing is a networking term that means, “send the traffic to any one of the servers that share a given IP address.” It uses a routing protocol (BGP) to have multiple servers present the same IP address, and the network then selects one.

This is done using the BGP protocol (not covered here) to essentially find the closest instance of the IP address at which to terminate the traffic. You see, BGP has no problem finding the “shortest” path (typically the fewest hops) to route the client’s IP traffic to the closest IP address. In this case, the client has already done DNS resolution, received an A-record with an IP address, and makes its TCP / HTTP request to that server. It just so happens that five different servers exist with that exact IP address – perhaps geographically spread out for speed.

If one of those servers fails (and stops announcing the address), the client’s IP request simply ends up terminating at the next-closest server – as though there were no outage.

Pros: Cheap, fast convergence using the BGP network, simple to use once the boxes have the correct interface plumbed and the network supports BGP, supports geographic closeness

Cons: Need to install and run a BGP daemon on each server (bird, quagga, zebra, etc.), need to make sure that machines that are impaired or failed “take down” their IP interface so BGP doesn’t see it, need to make sure the network has the correct connectivity to perform the BGP magic (speaks BGP), little control over routing selection other than taking the server interface down, can be hard to determine request flow since it happens over the network (and not within the application stack)
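
As the first “con” hints, each anycast server typically runs a small local watchdog: if the local service looks unhealthy, it removes the anycast address so the BGP daemon stops announcing it and the network routes new requests elsewhere. A rough sketch assuming Linux and the ip command, with a placeholder address and interface:

import socket
import subprocess
import time

ANYCAST_ADDR = "198.51.100.10/32"   # the shared anycast service address (placeholder)
ANYCAST_IFACE = "lo"                # interface the address is plumbed on (placeholder)

def service_healthy(port: int = 443, timeout: float = 2.0) -> bool:
    """Is the local web service still answering?"""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout):
            return True
    except OSError:
        return False

def withdraw_anycast() -> None:
    """Remove the anycast address so the BGP daemon stops advertising this box."""
    subprocess.run(["ip", "addr", "del", ANYCAST_ADDR, "dev", ANYCAST_IFACE], check=False)

if __name__ == "__main__":
    while True:
        if not service_healthy():
            withdraw_anycast()
        time.sleep(5)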

Choosing the Right Method

Choosing the right one for you is going to come down to the usual things: What kind of availability do you need? Three nines? Four nines? Maybe even five? (sheesh). If you need super-high availability then you might look at combining these technologies – a hybrid, if you will. I believe that either load balancers or anycast will have the fastest detection and convergence – mostly because DNS clients can cache answers for long TTLs – so consider that.

Also, if you’re doing this on the cheap, or don’t have permission to deploy additional hardware into the network, then there’s a good chance that load balancers are out.

If you don’t feel like tackling the network portion – maybe keep away from Anycast – it’s fast and cheap – but you’ll need to roll up your sleeves, install and configure a routing daemon, and start making networking changes to the boxes.

DNS can be a simple way to do this; it sits in an easy-to-use application space and with some DNS mods in place comes nearly auto-magically. However, you need to either self-host DNS or use a provider with smart DNS capability.

What About “The Cloud?”

Okay – but what if we’re not back in 2002? What if there was this thing called “the cloud” and we could let it do all this for us? After all – isn’t that a huge portion of what the cloud does?

Yes – but nothing is free. AWS, Azure, and GCP route around failures every day (heck, every second?). But they’ve used the tricks above to make that work (or have built full-scope application and network solutions to mimic them).

The cloud is a great choice – but you should understand a few items still:

  • Cost – cloud can be expensive, and worse, hard to predict and budget. It creates additional instances using intelligent scaling groups, Kubernetes, or micro-services, but it’s hard to forecast what your spend is going to be in advance. Of course, each of these technologies comes with its own set of pros and cons.
  • Vendor Lock-in – Once you build for Azure, it’s not trivial to move to AWS, or to move back to a partial on-prem solution to defray the cost. Once you start down the dark path, forever will it dominate your destiny! Well, that might be a bit much – but Yoda knows it can certainly be hard to migrate away from a cloud-based offering.
  • Complexity – you’re going to give up control of how and when you route around failures… unless you really dig in and use the CSP features. This can be rewarding (they have a lot of intelligence wrapped around scaling and fault handling), but it can be daunting as well.

Outro

Even AWS has outages that make the news. They aren’t frequent, but they happen. Those outages typically come from software- or human-based failures – a global configuration snafu, routing table mis-configs, security attacks. But, while nobody is perfect, planning for the failure condition is part and parcel of system design.

Even if you don’t implement one or more of the methods above, knowing what your options are, along with their pros and cons, helps you decide when and if you need to use them.

Now go forth and heal thyself doctor.
