Tuesday, July 29, 2014

ElasticSearch Resiliency and Failover

ElasticSearch is designed with the assumption that things break. Hardware fails and software crashes. That’s a fact of life. ElasticSearch mainly deals with this through the use of clustering, meaning that many machines can be combined to work as a single unit.

Data can be replicated, or copied, across multiple servers so that the cluster can tolerate the loss of one or more nodes and still respond to requests in a timely manner. How much failure it can tolerate obviously depends on the exact configuration.

The primary mechanism for ElasticSearch resiliency is divided into 3 parts: nodes, shards, and replicas. Nodes are the individual servers; as nodes are added to the cluster, data is spread across them in the form of shards and replicas. A shard is a slice or section of an index's data; the number of shards is specified when an index is created and cannot be changed afterward. A replica is simply a copy of a shard, and the number of replicas can be changed on the fly.
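
As a concrete illustration, here is a minimal sketch using Python and the requests library against a local node. The index name "myindex" and the host are assumptions; the settings API itself is standard ElasticSearch.

    import requests

    ES = "http://localhost:9200"  # assumed local dev node

    # Create an index with 5 shards and 1 replica. The shard count is
    # fixed for the life of the index; the replica count is not.
    requests.put(ES + "/myindex", json={
        "settings": {
            "number_of_shards": 5,
            "number_of_replicas": 1
        }
    })

    # Later, raise the replica count on the fly -- no reindexing required.
    requests.put(ES + "/myindex/_settings", json={
        "index": {"number_of_replicas": 2}
    })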

If a cluster is formed from a single node, then all primary shards are on that node. If there is a failure of the node or of a shard, then data will likely be lost.

When you add a node to the cluster so that you now have 2 nodes, the primary shards will be spread across both nodes. If a node fails, some data will not be available until that node is restored. If that node can never be restored, its data is lost for good.

The next step would be to add a replica. Again, a replica is a copy of a shard. In a 2-node scenario, the primary shards will be spread across both nodes, but a replica of each shard will be allocated to the alternate node. The effect is that both nodes hold a full copy of the data, though neither node holds all of the primary shards. If a node is lost in this case, the remaining node can handle all requests. When the failed node is restored or a replacement node is added to the cluster, ElasticSearch will replicate shards to the second node and achieve equilibrium again.
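
One way to watch that rebalancing happen is to poll the cluster health API. A quick sketch, again assuming a local node:

    import requests

    # "green" means all primary and replica shards are allocated,
    # "yellow" means some replicas are unassigned (e.g., a node just
    # died), and "red" means at least one primary shard is missing.
    health = requests.get("http://localhost:9200/_cluster/health").json()
    print(health["status"], health["unassigned_shards"])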

Adding a third node spreads the 2 copies of the data across all 3 nodes. The cluster can handle the failure of a single node, but not 2 nodes.

If a second replica is added (making 3 copies of the data), then each node would effectively have a copy of the data. This configuration should be able to handle the loss of 2 of the 3 nodes, though you really don’t want that to happen.

ElasticSearch has some features that allow you to influence shard allocation. There are a couple of different algorithms that it can use to control shard placement. Recently, the size of the shard has been added into the allocation strategy so that one node does not end up with lots of large shards while another holds primarily small shards.
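
As one example of influencing allocation, the disk-based allocation settings can be updated dynamically through the cluster settings API. A hedged sketch; the setting names are from the 1.x docs and the watermark values are just illustrative:

    import requests

    # Tune the disk-based shard allocation decider so that nodes low
    # on disk stop receiving new shards.
    requests.put("http://localhost:9200/_cluster/settings", json={
        "transient": {
            "cluster.routing.allocation.disk.threshold_enabled": True,
            "cluster.routing.allocation.disk.watermark.low": "85%",
            "cluster.routing.allocation.disk.watermark.high": "90%"
        }
    })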

ElasticSearch also has a rack awareness feature. This allows you to tell ElasticSearch that some nodes are physically close to each other while other nodes are physically far apart. For example, you can have some nodes in one data center and other nodes of the same cluster in another data center. ElasticSearch will try to keep requests localized as much as possible. Having a cluster split across data centers is not really recommended for performance reasons, but is an option.
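
A sketch of how this might be wired up: each node declares an attribute in its elasticsearch.yml, and the cluster is told to use that attribute for allocation awareness. The attribute name "rack_id" is just an illustrative convention:

    import requests

    # Each node would carry a line like this in its elasticsearch.yml:
    #   node.rack_id: rack_one    (or rack_two, and so on)
    #
    # Then tell the cluster to spread primaries and replicas across racks.
    requests.put("http://localhost:9200/_cluster/settings", json={
        "persistent": {
            "cluster.routing.allocation.awareness.attributes": "rack_id"
        }
    })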

ElasticSearch has a federated search feature. This allows 2 clusters to sit in physically separate data centers while a third cluster essentially arbitrates requests across the two. This is a very welcome feature for ElasticSearch.

ElasticSearch has also added a snapshot/restore feature as of the 1.0 release. This allows an entire cluster to be backed up and restored, or one or more indices can be specified for a particular snapshot. Snapshots can be taken on a fully running cluster, and the process can be scripted to run periodically.

Once the initial snapshot occurs, subsequent snapshots are incremental in nature. See http://www.elasticsearch.org/blog/introducing-snapshot-restore/ for details.
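
A minimal sketch of the workflow in Python; the repository name, filesystem path, and snapshot name are assumptions, while the _snapshot API is the one introduced in 1.0:

    import requests

    ES = "http://localhost:9200"

    # Register a shared-filesystem snapshot repository. The path must
    # be visible to every node, e.g. an NFS mount.
    requests.put(ES + "/_snapshot/my_backup", json={
        "type": "fs",
        "settings": {"location": "/mnt/backups/my_backup"}
    })

    # Snapshot one or more indices; omit "indices" to snapshot the
    # whole cluster. Repeated snapshots are incremental.
    requests.put(ES + "/_snapshot/my_backup/snapshot_1", json={
        "indices": "myindex"
    })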

One of the great things about snapshot/restore is that it can be used to keep a near-line cluster that mirrors the production cluster. If the production cluster goes down, the backup cluster can be brought online at roughly the same point in time, document-wise.

This feature works by shipping the snapshots to the near-line cluster, then applying them there. Since the snapshots are incremental, the process should be pretty quick, even for clusters with a good amount of volume.
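
On the near-line cluster, applying a shipped snapshot is a restore call against the same repository. A small sketch with assumed names:

    import requests

    MIRROR = "http://mirror-cluster:9200"  # assumed near-line cluster

    # Restore the latest shipped snapshot from the shared repository;
    # the target indices must not be open on the mirror when restoring.
    requests.post(MIRROR + "/_snapshot/my_backup/snapshot_1/_restore")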

There are a variety of ways that ElasticSearch can be made resilient depending on the exact architecture. ElasticSearch continues to evolve in this regard as more and more companies rely on it for mission-critical applications.

  - Craig

Tuesday, July 22, 2014

What is Big Data?

I've been wanting to write this for a long time, but now have time to put some thoughts down based on discussions and comments from various groups of technical and business people.

The question is, What is Big Data?

Is Big Data just that, big ... data? Well, there have always been very large data sets, constrained, of course, by the technology limitations of the time. Storage systems that once were measured in gigabytes, then terabytes, are now measured in petabytes. Certainly our ability to store vast quantities of data has increased dramatically.

With the advent of the Internet and the World Wide Web, our ability to collect information has also grown dramatically. Where companies and individuals were once limited to collecting data within their particular sphere, collecting data from locations and individuals around the globe is now a daily occurrence.

I've heard people sometimes assert that Big Data is simply doing the same things we were doing before, just in a different way, maybe with different tools. I can understand that point of view. I think that is generally the first step in the process of moving toward a Big Data mindset.

I would put forth that Big Data really is something different than just working with large data sets. To me, Big Data is really a different mindset. We live in a vastly more connected world than just a decade ago. The daily effect of data on people's lives is incredibly pervasive. It's almost impossible to escape.

Big Data is about thinking differently about your data.

Big Data is about connecting data sources. What other data is available that can enhance what you have? Can you add user behavior from your web site? How about advertising or industry data? How about public sources from government or private sources like Twitter, Facebook, search engines like Google, Bing, Yahoo and others?

Big Data is more about openness and sharing. Not everything can be shared, and you really, really need to be careful with user-identifiable information, but what can you share? What can you publish, or what data sets can you contribute to that enrich the overall community?

Big Data many times involves working with unstructured data. SQL databases, by their nature, enforce very strict structures on your data. If your data doesn't conform, then you're kind of out of luck. There are some things that SQL databases do extremely well, but there are real limits.

There are many new tools and techniques for working with unstructured data and extracting value from them. So called NOSQL data stores are designed to work with data that has little or no structure, providing new capabilities. Open source search engines like ElasticSearch and SOLR provide incredible search and faceting/aggregation abilities.
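
For instance, a terms aggregation in ElasticSearch (the aggregations API that arrived in 1.0) gives facet-style counts over a field. A small sketch, with the index and field names assumed:

    import requests

    # Count documents per category -- the classic faceting use case.
    resp = requests.post("http://localhost:9200/myindex/_search", json={
        "size": 0,  # we only want the aggregation, not the hits
        "aggs": {
            "by_category": {"terms": {"field": "category"}}
        }
    }).json()

    for bucket in resp["aggregations"]["by_category"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])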

We have many machine learning algorithms and tools that let us dissect and layer structure on top of our data to help us make sense of it. Algorithms help us to cluster, classify, and figure out which documents/data are similar to other ones.

We can process volumes of data in ways that we couldn't before. Traditional computing requires the data to be moved to the application/algorithm, with the answer then written back to storage. Now we have platforms like Hadoop that distribute large data sets and move the algorithm to the data, allowing it to be processed and answers written in place, or distributed elsewhere.
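
To make the paradigm concrete, here is a toy, framework-free sketch of the map/reduce pattern in Python. It runs in a single process; Hadoop's contribution is running the same two functions next to the data on many machines at once:

    from collections import defaultdict

    def map_phase(line):
        # Emit (word, 1) pairs, as a Hadoop mapper would.
        for word in line.split():
            yield word.lower(), 1

    def reduce_phase(pairs):
        # Sum the counts for each word, as a Hadoop reducer would.
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return counts

    lines = ["big data is big", "data moves to the algorithm"]
    pairs = (p for line in lines for p in map_phase(line))
    print(dict(reduce_phase(pairs)))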

Does Big Data require big/large data to be useful? You can run most of these tools on your laptop, so no, not really. You can even run many tools like Map/Reduce on a $35 Raspberry Pi. With Big Data tools, we can do things on our laptops that required data centers in the past.

Big Data requires experimentation. It requires people to ask new questions and develop new answers. What can we do now that we couldn't do before? It requires a different mindset.

  - Craig

Elasticsearch 1.2.2 released

Spreading the word about the latest ElasticSearch release. The latest release is based on Lucene 4.8.1 and includes some nice bug fixes.

There are fixes for possible translog corruption and some file locking issues with previous Lucene releases. It also includes a fix for caching issues with percolation of nested documents.

Great work by the ElasticSearch team. Be sure to get the new release! http://www.elasticsearch.org/downloads/1-2-2/

  - Craig


Beta launch of ExecuteSales.com

https://beta.executesales.com/ is now open and available for beta. ExecuteSales is a platform connecting companies and sales service providers, enabling both to generate new business and achieve higher sales revenue.

I've been working on the search core of ExecuteSales for the last 18 months along with some other fantastic team members and owners, and am very excited to get to beta! The beta is totally free, so check it out!

  - Craig