Friday, July 20, 2012

Time To Build Your Big-Data Muscles

The article Time To Build Your Big-Data Muscles talks mainly about getting educated for a Big Data job. It cites an interesting statistic: for every 100 open Big Data jobs, there are only 2 qualified candidates. Definitely a good place to be if you're looking for work in this area.

Having a background in mathematics and engineering is your best bet. Load up on statistics if you're able to. I'm more on the engineering side myself. My work involves taking the algorithms and principles that have been developed and applying them to business data sets. There is plenty of work to do in the application of algorithms alone.

  - Craig

Wednesday, July 18, 2012

Cameron Befus speaks at UHUG (Hadoop)

This evening, we heard from Cameron Befus, former CTO of Tynt which was purchased by 33 Across Inc. Here are some notes from the presentation on Hadoop.

Why use Hadoop?

A framework for parallel processing of large data sets.

Design considerations
 - System will manage and heal itself
 - Performance scales linearly
 - Compute: move the algorithm to the data

Transition cluster size - local -> cloud -> dedicated hardware.

Hadoop processing is optimized for data retrieval:
 - schema on read
 - NoSQL databases
 - map reduce
 - asynchronous
 - parallel computing
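The map reduce model from the notes above can be sketched in plain Python. This is not Hadoop itself, just a single-process simulation of the three phases (map, shuffle, reduce) that Hadoop distributes across a cluster:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document.
    for word in document.lower().split():
        yield (word, 1)

def shuffle(mapped_pairs):
    # Shuffle: group all values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data moves to compute no compute moves to data"]
mapped = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
print(counts["big"], counts["data"])  # 2 3
```

In real Hadoop, the map tasks run on the nodes that already hold the input blocks, which is the "move the algorithm to the data" design point above.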

Built for unreliable, commodity hardware.
Scale by adding boxes
More cost effective

Sends the program to the data.
 - Larger choke points occur in transferring large volumes of data around the cluster

#1 Data Driven Decision Making

Everyone wants to make good decisions, and as everyone knows, to make good decisions you need data. Not just any data, the right data.

Walmart mined sales that occurred prior to an approaching hurricane and found the highest-selling products were:
  #1 batteries,
  #2 Pop-Tarts.

This means $$$ to Walmart.

Google can predict flu trends around the world just from search queries. Query velocity matches medical data.

With the advent of large disks, it becomes cost effective to simply store everything, then use a system like Hadoop to run through and process the data to find value.

Combining data sets can extract value when they may not be valuable on their own. Turning lead into gold as it were.

How big is big data?
  not just size,
  rate of growth

Other uses,
  load testing
  number crunching
  building Lucene indexes
  just about anything that can be easily parallelized
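For embarrassingly parallel work like the uses listed above, you don't always need a cluster. A sketch with Python's standard multiprocessing module shows the same fan-out idea on one box (the `crunch` function is a hypothetical stand-in for whatever the unit of work is):

```python
from multiprocessing import Pool

def crunch(n):
    # A stand-in for any independent unit of work: number crunching,
    # building a Lucene index shard, firing load-test requests, etc.
    return n * n

def run_parallel(items, workers=4):
    # Fan the work out across processes; each item is independent,
    # so nothing needs to be shared or coordinated between workers.
    with Pool(processes=workers) as pool:
        return pool.map(crunch, items)

if __name__ == "__main__":
    print(run_parallel(range(5)))  # [0, 1, 4, 9, 16]
```

Hadoop applies the same pattern, but across machines and with the data pre-distributed.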

  - Craig

Saturday, July 14, 2012

Whither MongoDB?

So, I've been thinking of writing this for a while. MongoDB is very popular, maybe the most popular of the document store DBs. It certainly has a lot of mind share in the NoSQL space. I've played around with it and I've taken a lot of time trying to figure out what its scalability and durability models are.

As with any technology, it's very important to understand the strengths and weaknesses of a product. With something as new as MongoDB or other NoSQL datastores, it's tough to really understand a product without using it and getting used to it. MongoDB has been around long enough now that there is quite a bit of information and customer experience with it.

As far as I can see, there are 3 big reasons that MongoDB has become popular. The first is a pretty easy-to-use API. The second is that it is pretty easy to get a single node up and running. The third is that it has good integration with 3rd party applications and frameworks. If you combine those 3 things, it makes it very easy for a developer to start an instance and start developing against it with a pretty simple API and framework integration. There are also some similarities between its query language and SQL, which gives the user some familiarity.

On the surface, MongoDB is pretty neat. It's easy to get going, operates quickly, has an easy API, and touts scalability and durability. As with any technology, there are some definite drawbacks as well. MongoDB's real niche is as a read-mostly, eventually consistent system. You could think of it more as a caching system with on-disk storage. I know it's used for more than that, but to me, that's a space it fits well in.

MongoDB has a type of durability, but that's not really a focus of the system. It's really designed as an n1-r1-w1 system, and this makes it a very fast system. It's just not a highly durable system. Contrast this with a system like Riak, which concentrates far less on speed but takes great pains to be a highly scalable and highly durable system. Riak buckets are all CAP tunable, so if your data needs a high level of durability and consistency, you can specify that and Riak handles it fine. You can specify n3-r3-w3, or even n5-r5-w5 if you want it. It looks like the best you can really do with MongoDB is n3-r3-w2.

MongoDB really wants to be n1-r1-w1 for all operations. n1-r1-w1 is fast, but not durable at all. If n>w, you're always in an eventually consistent state. That means that if you read after a write, you are not guaranteed to read the last copy of your data. This is more evident with threading. You can help the problem by specifying w=2, but it looks like you can't get w=3. If you need to guarantee that reads will be consistent, then you can specify that all reads and writes are executed on the replica set master. That however limits your scalability, as the other nodes in the replica set essentially exist only as failover. To get around that problem, you have to add more shards to your cluster, which means even more machines that are sitting around only as cold failover. It seems like a very inefficient system to me.
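The n-r-w shorthand above follows the Dynamo-style quorum rule of thumb: a read is guaranteed to overlap the latest write when r + w > n. A tiny illustrative sketch (plain Python, not any particular client API):

```python
def read_your_writes(n, r, w):
    # Dynamo-style rule of thumb: a read quorum of r and a write
    # quorum of w out of n replicas always overlap when r + w > n,
    # so every read touches at least one replica with the latest write.
    return r + w > n

print(read_your_writes(1, 1, 1))  # True  - but n=1 means a single copy
print(read_your_writes(3, 1, 1))  # False - fast, eventually consistent
print(read_your_writes(3, 2, 2))  # True  - consistent reads, 3 copies
print(read_your_writes(3, 3, 2))  # True  - roughly the n3-r3-w2 case above
```

This is why n1-r1-w1 is fast but fragile: there is only one copy and nothing to overlap with.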

MongoDB's scale model is to start with a single node. If you need more durability, then you can add nodes to form a replica set. Replica sets are essentially an advanced form of master-slave replication. One of the big improvements is that "slave" nodes in the replica set can assume master if the master goes down. When you switch from a single node to a replica set, your application needs to be changed to configure the extra slave nodes. That creates some pain for the developer that some other systems do not.

Once you need more scalability, you have to go to sharding. This is where things get very complicated. You now have to add 2 more types of nodes. You need to add a config server. This guy's job is to manage where the data blocks are on each replica set. This job is VERY important. If you have a problem with or lose a config server, then you are in a lot of trouble. It looks like if your config server gets wiped out, then your cluster is not recoverable, similar to Hadoop losing its NameNode. The web site doesn't say much about this problem, only that it is critical not to lose the config server(s).

For dev, you can run a single config server, but for production, you have to run 3 of them for durability. If you are running 3 config servers in production and only 1 of them becomes unavailable, then your cluster is running in a degraded state. It will apparently still serve reads and some writes, but block functions will not work. Presumably if your writes fill a block and that block needs to be split or moved, you would have a problem. This is something that rubs me very much the wrong way. Even in the most durable MongoDB cluster, losing a single config node puts your cluster in a degraded state.

The other node type you have to add is mongos, which is a load balancer. Your client communicates with a mongos, and the mongos proxies the request to the proper shard. The mongos appears to hold a cached copy of the config server's data. If the mongos that your client is talking to goes down, then your client can't communicate with the cluster, and the mongos would need to be restarted. I don't know if a client can send requests to multiple mongos instances; it would seem that would be helpful.

So once you start sharding, you have to deal with 3 types of nodes, plus master and slave nodes in a replica set that act very differently as well. This is very complicated to me. I'm much more a fan of systems like Riak and ElasticSearch, where all nodes are equal and node use is very efficient. You can scale MongoDB, but it is very hard and not efficient unless you're doing something like n1-r1-w1. Of course, if you do this, then your cluster is not at all durable.

This is one of the things that causes me issues with MongoDB. It just seems like there are a lot of compromises in the architecture. There are ways you can get around the problems, but that usually involves very large tradeoffs. For example, the "safe mode" setting being off by default is a problem and has caused unexpected data loss for customers. You can turn it on, but many people don't find this out until after they have lost data. People really expect that their data stores will be durable and not lose their data, by default.

Another example is the fsync problem. Normally MongoDB lets the OS fsync when it wants to, but this creates an open window for data loss if the node fails before the fsync happens. You can force an fsync every minute if one has not happened, but you've only shrunk the data loss window. To get around that, you can enable journaling to make the node more durable. So now you can avoid data loss, but you're giving up much of the performance that 10gen touts about MongoDB.
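The fsync window is easy to see at the file level. A minimal Python sketch of a write that is actually forced to disk, rather than left sitting in the OS page cache:

```python
import os
import tempfile

def durable_write(path, data):
    # Write, flush Python's buffer, then force the OS to flush its
    # page cache to the device. Without the fsync call, a crash after
    # write() can lose data the application believed was saved - the
    # same window the post describes for a database that defers fsync.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # flush the user-space buffer
        os.fsync(f.fileno())  # flush the kernel buffer to disk

path = os.path.join(tempfile.mkdtemp(), "demo.bin")
durable_write(path, b"payload")
```

Every fsync costs a disk round trip, which is exactly the speed-vs-durability tradeoff being made.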

Another tough example is the very well known global write lock. 10gen is working on the problem, but the current solution, from what I understand, is to reduce the scope of the global write lock from the entire cluster to each node in the cluster. This is definitely better, but still doesn't seem that great. You're still blocking the entire node for writes. If you get many writes across a number of databases, you're still pretty hosed. I assume that there is some underlying technical issue why the write lock exists and that removing it is a very tough problem.
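The granularity argument can be sketched in a few lines. This is an illustration of lock scope in general, not MongoDB's actual implementation: with one lock per database instead of one lock for everything, writers to different databases stop serializing against each other:

```python
import threading
from collections import defaultdict

class LockedStore:
    # One lock per database rather than one global lock: only writers
    # to the SAME database contend, so unrelated writes can proceed
    # concurrently. A single global lock would serialize all of them.
    def __init__(self):
        self._locks = defaultdict(threading.Lock)
        self._data = defaultdict(dict)

    def write(self, db, key, value):
        with self._locks[db]:  # contend only with writers to this db
            self._data[db][key] = value

    def read(self, db, key):
        return self._data[db].get(key)

store = LockedStore()
store.write("users", "u1", "alice")
store.write("logs", "l1", "boot")
print(store.read("users", "u1"))  # alice
```

Finer granularity still (per-collection, per-document) reduces contention further, at the cost of more bookkeeping.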

MongoDB is an interesting system and is being used by a lot of companies. For me, there are too many tradeoffs, unless you're using it as a prototype where you don't need to scale, you have a lot of in-house expertise with MongoDB, or you're using a service like MongoHQ that takes care of the operations issues for you.

There are a number of NoSQL datastores in this space, and there is going to have to be fallout and consolidation as products mature. This is easy to understand. While MongoDB is popular now, it seems that some push-back is building due to some of its shortcomings. I personally wonder if MongoDB will be very popular in another 2-3 years. I think as better systems mature, MongoDB may be in danger of being surpassed unless some of its core problems can be fixed.

Go ahead and give MongoDB a try and see if it works for you. Just be sure to understand what you may be getting into.

  - Craig

Friday, July 13, 2012

A Brief Introduction to Riak, from Clusters to Links to MapReduce

I'm a big fan of Riak, especially its architecture. I came across this article on DZone today, A Brief Introduction to Riak, from Clusters to Links to MapReduce by Scott Leberknight.

Scott gives a great introduction to Riak and runs down all of its basic uses, including linking and map/reduce. Nice job, Scott!

  - Craig

Saturday, July 7, 2012

Goodbye MongoDB

I ran across these blog posts a while ago and wanted to post them. It's interesting to see how some companies that have been working with MongoDB for some time are faring with it. No datastore is perfect, but it seems that MongoDB has more than its share of technical and architectural problems. I know they got $42m in funding recently, but it seems like they're going to have to completely re-architect their product to overcome many of their core problems.

MongoDB's claim to fame is a very easy-to-use API, and it is pretty easy to start a single node and start developing. That's very commendable. Once you go beyond a single node though, MongoDB gets difficult very quickly. Most other NoSQL datastores are much, much easier to scale, but suffer on the API side and ease of use. To me, it is going to be much easier to build a better API and put that in front of many of the other datastores than it's going to be for 10gen to fix many of the core problems with MongoDB.

Read for yourself and make up your own mind.

Kiip - A Year with MongoDB

ZOPYX - Goodbye MongoDB

  - Craig

Friday, July 6, 2012

ElasticSearch videos - Berlin Buzzwords 2012

I wanted to pass on these videos on ElasticSearch from Berlin Buzzwords 2012. Enjoy!

The videos/slides from the Berlin Buzzwords 2012 conference are now live

Big Data, Search and Analytics, by kimchy

Three Nodes and One Cluster by Lukas and Karel

Querying 24 Billion Records in 900ms by Jodok

Scaling massive elasticsearch clusters by Rafal

  - Craig