
7.4. Remote Services Redundancy

We'll be talking about general application scalability and redundancy in Chapter 9, but there are some redundancy issues that are specifically applicable to remote services.

Assuming you have some degree of control over the various remote services you wish to use, you'll need to calculate carefully how redundant each component needs to be (where the component can be set up redundantly at all). General BCP (business continuity planning) might dictate that you need one or more hot spares, depending on the number of online nodes that comprise a component. Judging how many spare nodes you'll need should take several metrics into account:


Number of nodes in the group

The greater the number of nodes in a single group, the more simultaneous failures you can expect. One spare for a group of two might be fine, whereas one spare for a group of 100 would be less acceptable.


MTBF (Mean Time Between Failures) for nodes

This can be a very difficult calculation to make, since software (unlike simple hardware) doesn't have a published MTBF, especially if you wrote it yourself. The shorter the MTBF, the more likely you are to have multiple nodes fail at once.

MTBF is an interesting statistic, mostly because it's so often wrongly interpreted. When a drive claims an MTBF of 500,000 hours, which is about 57 years, that doesn't mean it will last for 57 years without failing. You also need to take into account the service life of the device. Imagine our drive has an MTBF of 57 years and a service life of one year. If we replaced the disk every year at the end of its service life, then we should expect to go 57 years before seeing a drive fail. Equivalently, if we had 57 disks running for a year, we would expect on average one of them to fail within that year (this arithmetic is sketched in code after this list). The service life (and the warranty!) is the important statistic to watch out for.


Number of nodes your service needs to continue operating

If you have 10 servers performing the same function, and your site can run on 9, then you already have one spare. But you can't afford to have more than one node fail at once.


TCO (total cost of ownership) per node

Every piece of hardware costs money, but also sucks up power and system administrator time. Purchasing, racking, installing, and configuring hardware should be taken into account for the TCO.


Development resources needed to accommodate extra nodes

If adding more nodes will require extra development resources, then that needs to be factored into the cost. A system that can easily survive with three of its five nodes operational, but would require three months of work to support more than five nodes, may not be worth expanding.
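
To make the MTBF arithmetic from the list above concrete, here's a minimal back-of-envelope sketch. The function name and its linear failure model are our own illustrative assumptions, not a rigorous reliability formula:

function expected_annual_failures($node_count, $mtbf_hours){
    // assume failures are independent and spread evenly across
    // the MTBF - an approximation, not a reliability model
    $hours_per_year = 24 * 365;
    return $node_count * ($hours_per_year / $mtbf_hours);
}

// 57 disks, each with a 500,000 hour MTBF: expect roughly one
// failure over the course of a year
echo expected_annual_failures(57, 500000); // ~0.999

Comparing this expected failure count against the number of nodes you can afford to lose at once gives a first guess at how many hot spares to provision.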

Unlike an application as a whole, a key remote service can be disabled while your primary service is still running, so users are still able to initiate actions that rely on making remote service requests. This is in stark contrast to web application server failures: if Apache crashes, you can no longer serve pages, so you never have to deal with requests that came in and partially succeeded. With a failed remote service, requests keep arriving and may partially succeed; perhaps they have already sent back a response, or written data to storage.

It's important to remember that any component in the chain can fail: the local machine's network interface could go down, DNS could go down, the network switch could collapse, one of the routing points between the hosts could fail, or some part of the remote server itself could fail. Failures at each of these points in the calling chain manifest themselves in different ways, and some are difficult to monitor (although we'll talk about that more in Chapter 10).

When components in the system fail, we want the system as a whole to carry on if it possibly can. Ideally, we want as many components as possible to provide hot failover behavior. So what do we mean by hot failover in this case? When we have multiple available instances of a remote service, hot failover is the ability to automatically migrate traffic from a failed node to a functioning node.

For some services this means using a dedicated hardware or software load-balancing appliance that monitors the nodes it's balancing; you then need multiple load balancers to ensure hot failover should one of the balancers itself fail. For nonbalanced services, this can mean trying a list of hosts until a functional one is found.

It's not just fully failed components we need to skip over, either. If a service is reachable but returns a certain class of error response (one indicating that the remote service itself failed), we might want our application to retry the request on a sibling server. Here a load balancer doesn't really help us: it can only detect that the service is available, not that it can successfully execute a given request.

User-facing components such as web servers tend to need specialist software or hardware load balancers to handle hot failover; we've already discussed load balancing in Chapter 2. Behind the scenes, components can be given hot failover abilities right in your application code, which can reduce the complexity and cost of your architecture by eliminating extra balancing and routing nodes.

The most basic example is for software database load balancing. For the cluster of database servers we want to connect to, we have a list of hostnames. First we shuffle the list so that we pick a random server to connect to each time. Next we iterate over the list, trying to connect to each in turn. When we find a host we can connect to, we stop looking and return the connection handle. If we try all hosts in the list and don't manage to connect to any, we return zero and let the application logic worry about what needs to be done:

function db_connect($hosts, $user, $pass){

    // try the hosts in a random order so load spreads evenly
    shuffle($hosts);

    foreach($hosts as $host){
        debug("Trying to connect to $host...");

        // the @ suppresses connection warnings; the fourth
        // argument asks for a new link rather than a cached one
        $dbh = @mysql_connect($host, $user, $pass, 1);

        if ($dbh){
            debug("Connected to $host!");
            return $dbh;
        }

        debug("Failed to connect to $host!");
    }

    debug("Failed to connect to all hosts in list - giving up!");
    return 0;
}

A slightly more complex example might be for a service where more than the connection matters: a service where even if we manage to connect, we might not be able to converse, or the service might not be able to fulfill our request:

function store_file($storage_hosts, $filename){

    // try storage hosts in a random order until one both
    // accepts the connection and completes the operation
    shuffle($storage_hosts);

    foreach($storage_hosts as $host){
        $result = store_file_2($host, $filename);
        if ($result){ return $result; }
    }

    return 0;
}

function store_file_2($host, $filename){
    ...
    if ($connection_failed){
        return 0;
    }
    ...
    if ($operation_failed){
        return 0;
    }
    return $result;
}

Here we shuffle and loop over each possible hostname, retrying the operation until it succeeds. Success is defined as connecting to the remote service, issuing our command, and getting the correct response. In the case where we contact the service but the service fails to respond or gives us a failure response, we move on and try the next one.

In some cases we won't have hot failover capacity, or all servers in the pool will be unavailable. What to do in these circumstances depends on the kind of action being performed. If we were querying a remote service for search results, then we might need to display an error message to the user when all else fails. If we were updating a remote system, then we might want to queue the update request locally so we can resend it later when the remote service becomes available. In the case of a read request to a remote service, we can sometimes fall back on a local cache of frequently requested values. All of these, of course, depend on the nature of the request. We don't want to cache time-sensitive data or queue search queries.
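
As a sketch of these last-resort strategies, following the shuffle-and-retry pattern from earlier; the helpers remote_fetch, cache_fetch, remote_update, and queue_store are hypothetical stand-ins for your own service, cache, and queue code:

function fetch_with_fallback($hosts, $key){
    shuffle($hosts);
    foreach($hosts as $host){
        $value = remote_fetch($host, $key);
        if ($value){ return $value; }
    }
    // every host failed - fall back on a local cache, which is
    // only safe for data that isn't time sensitive
    return cache_fetch($key);
}

function update_with_fallback($hosts, $update){
    shuffle($hosts);
    foreach($hosts as $host){
        if (remote_update($host, $update)){ return 1; }
    }
    // every host failed - queue the update locally so we can
    // resend it once the remote service becomes available
    queue_store($update);
    return 0;
}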

