Tuesday, July 29, 2014

ElasticSearch Resiliency and Failover

ElasticSearch is designed with the assumption that things break. Hardware fails and software crashes. That’s a fact of life. ElasticSearch mainly deals with this through the use of clustering. That means that many machines can be combined together to work as a single unit.

Data can be replicated across multiple servers so that the cluster can tolerate the loss of one or more nodes and still respond to requests in a timely manner. How many failures can be tolerated obviously depends on the exact configuration.

The primary mechanism for ElasticSearch resiliency is divided into 3 parts - nodes, shards, and replicas. Nodes are the individual servers, and as nodes are added to the cluster, data is spread across them in the form of shards and replicas. A shard is a slice or section of the data; the number of shards is specified when an index is created and cannot be changed afterward. A replica is simply a copy of a shard, and the number of replicas can be changed on the fly.
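As a sketch of how this looks in practice (the index name and counts here are illustrative, and this assumes a cluster listening on localhost:9200), the shard count is fixed at index creation while the replica count can be updated on a live index:

```shell
# Create an index with 5 shards and 1 replica per shard.
# The shard count is fixed for the life of the index.
curl -XPUT 'http://localhost:9200/my_index' -d '{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'

# The replica count, however, can be changed at any time:
curl -XPUT 'http://localhost:9200/my_index/_settings' -d '{
  "number_of_replicas": 2
}'
```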

If a cluster is formed from a single node, then all primary shards are on that node. If the node fails, or if a shard is lost, then data will likely be lost.

When you add a node to the cluster so that you now have 2 nodes, the primary shards will be spread across both nodes. If a node is lost, some data will be unavailable until that node is restored. If that node cannot be restored, then that data is lost.

The next step would be to add a replica. Again, a replica is a copy of the data. In a 2-node scenario, the primary shards will be spread across both nodes, but a replica of each shard will be allocated to the alternate node. The effect is that both nodes will hold a full copy of the data, but neither node will hold all of the primary shards. If a node is lost in this case, the remaining node can handle all requests. When the failed node is restored or a replacement node is added to the cluster, ElasticSearch will replicate shards to that node and achieve equilibrium again.

Adding a third node spreads the 2 copies of the data across all 3 nodes. The cluster can handle the failure of a single node, but not 2 nodes.

If a second replica is added (making 3 copies of the data), then each node would effectively have a copy of the data. This configuration should be able to handle the loss of 2 of the 3 nodes, though you really don’t want that to happen.
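The rule of thumb running through the scenarios above can be sketched as a tiny calculation. The helper below is mine, not part of ElasticSearch, and it assumes ElasticSearch has placed each copy of a shard on a distinct node:

```shell
# Hypothetical helper: the number of node failures a cluster can survive
# while still holding at least one complete copy of every shard.
# Assumes each copy of a shard lives on a distinct node.
tolerable_failures() {
  nodes=$1
  replicas=$2
  max_loss=$((nodes - 1))      # you can never lose every node
  if [ "$replicas" -lt "$max_loss" ]; then
    echo "$replicas"           # limited by the number of extra copies
  else
    echo "$max_loss"
  fi
}

tolerable_failures 2 1   # 2 nodes, 1 replica  -> prints 1
tolerable_failures 3 2   # 3 nodes, 2 replicas -> prints 2
```

With 3 nodes and only 1 replica, the same helper yields 1: a second simultaneous failure could take out both copies of some shard.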

ElasticSearch has some features that allow you to influence shard allocation. There are a couple of different algorithms that it can use to control shard placement. Recently, they have added the size of the shard into the allocation strategy, so that one node does not end up having lots of large shards while another node has primarily small shards.
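One concrete example of this family of settings is the disk-based allocation decider, which keeps new shards away from nodes that are running low on disk (the watermark values below are illustrative; the setting names are as documented for the 1.x releases):

```shell
# Stop allocating new shards to a node above 85% disk usage,
# and start relocating shards away above 90%.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disk.threshold_enabled": true,
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%"
  }
}'
```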

ElasticSearch also has a rack awareness feature. This allows you to tell ElasticSearch that some nodes are physically close to each other while other nodes are physically far apart. For example, you can have some nodes in one data center and other nodes of the same cluster in another data center. ElasticSearch will try to keep requests localized as much as possible. Having a cluster split across data centers is not really recommended for performance reasons, but is an option.
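Rack awareness is configured with allocation awareness attributes in elasticsearch.yml (the attribute name rack_id and the value rack_one are illustrative; any attribute name can be used):

```yaml
# Tag this node with its physical location.
node.rack_id: rack_one

# Tell ElasticSearch to spread copies of each shard
# across nodes with different rack_id values.
cluster.routing.allocation.awareness.attributes: rack_id
```

With this in place, ElasticSearch tries to avoid putting a primary and its replica on the same rack, so losing a whole rack should still leave a complete copy of the data.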

ElasticSearch has a federated search feature. This allows 2 clusters to run in physically separate data centers while, essentially, a third cluster arbitrates requests across them. This is a very welcome feature for ElasticSearch.

ElasticSearch has also added a snapshot-restore feature as of the 1.0 release. This allows an entire cluster to be backed up and restored, or one or more indices can be specified for a particular snapshot. Snapshots can be taken on a fully running cluster and can be scripted and taken periodically.

Once the initial snapshot is taken, subsequent snapshots are incremental in nature. See the ElasticSearch announcement: http://www.elasticsearch.org/blog/introducing-snapshot-restore/
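As a sketch of the snapshot API (the repository name, snapshot name, and filesystem path are illustrative; a shared filesystem repository must be mounted at the same location on every node):

```shell
# Register a snapshot repository backed by a shared filesystem.
curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -d '{
  "type": "fs",
  "settings": { "location": "/mount/backups/my_backup" }
}'

# Take a snapshot of the whole cluster and wait for it to finish.
curl -XPUT 'http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'

# Restore the snapshot into a cluster.
curl -XPOST 'http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore'
```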

One of the great things about snapshot/restore is that it can be used to keep a near-line cluster that mirrors the production cluster. If the production cluster goes down, the backup cluster can be brought online at roughly the same point in time, document-wise.

This works by shipping the snapshots to the near-line cluster, then applying them to that cluster. Since the snapshots are incremental, the process should be pretty quick, even for clusters with a good amount of volume.

There are a variety of ways that ElasticSearch can be made resilient depending on the exact architecture. ElasticSearch continues to evolve in this regard as more and more companies rely on it for mission critical applications.

  - Craig


  1. Hi Craig

    Can you tell me how long it takes for a replica shard to be promoted to primary if the original primary fails (underlying hardware fails)? I assume there are some delays until a primary shard is declared as unavailable? What happens once this node comes back online? Will it now replicate in the other direction?

    Thanks for your help!

  2. There are some settings in ES that can control this promotion rate. Normally, ES will promote a replica to primary pretty quickly, and also shuffle other shards and create replicas when it determines a node is offline. This isn't always what you want as the node might be down temporarily. As long as there is at least one complete set of shards, primary or not, then ES will function properly.

    When that node comes back online, I'm not certain whether it will promote the original shard back to primary if it has already selected another one; I would guess that it is just a replica now. If the node comes back before ES starts shuffling shards around and making new primary promotions, then everything should be as it was.

    Does that help?

  3. Thanks for the fast reply! Yes, that definitely helps! The reason I was asking is because we are considering deploying Elasticsearch on a hyper-converged solution (Nutanix) where all stored data is already redundant. The goal is to disable replica shards altogether and let a failed node restart automatically using the vSphere cluster. This will take under 3 minutes, and we think that's acceptable since a node failure happens rarely. For upgrades we can plan maintenance windows.

    Have you seen Elasticsearch installations using VMs and shared storage before?