Thursday, August 9, 2012

Querying 24 Billion Records in 900ms (ElasticSearch)

This is a video I came across from Berlin Buzzwords: Querying 24 Billion Records in 900ms. The presentation is only 20 minutes long, but it includes slides, which is very helpful.

The speaker gives a lot of good information about trying to scale out ES on AWS to handle 24B records. They were able to do it successfully, which is excellent. They ended up moving to dedicated hardware to reduce the AWS cost.

  - Craig

foursquare now uses ElasticSearch

foursquare now uses ElasticSearch! This is pretty exciting news! It's great to see people switching over to ElasticSearch and seeing such great success. The article doesn't have a ton of details, but it's definitely worth looking at.

  - Craig

Wednesday, August 8, 2012

NoSQLUnit 0.3.2 Released

This is something quite interesting that I ran across today - NoSQLUnit 0.3.2. I didn't realize that there was a project out there like this, very cool!

Some NoSQL projects are easier to run embedded than others, and this could certainly help that. You always need helper classes to facilitate running unit tests against a data source. I've written unit tests against ElasticSearch, which has a local/memory mode that is great for writing tests against. Still, you have to generate your test data and load it into the search engine for testing.
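To give a flavor of what that looks like, here's a rough sketch of the kind of test setup I mean, using the ElasticSearch Java API as it stood around this time. The index name, type, and settings are just made-up examples for illustration, not anything from the NoSQLUnit project:

import static org.elasticsearch.common.settings.ImmutableSettings.settingsBuilder;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.node.Node;

public class LocalNodeExample {
    public static void main(String[] args) {
        // Start an in-JVM node; "index.store.type=memory" keeps the index off disk.
        Node node = nodeBuilder()
                .local(true)
                .settings(settingsBuilder()
                        .put("index.store.type", "memory")
                        .put("gateway.type", "none"))
                .node();
        Client client = node.client();

        // Load a test document, then refresh so it is visible to searches.
        client.prepareIndex("test-index", "doc", "1")
                .setSource("{\"title\": \"hello world\"}")
                .execute().actionGet();
        client.admin().indices().prepareRefresh("test-index").execute().actionGet();

        // Query it back, the way a unit test assertion would.
        SearchResponse response = client.prepareSearch("test-index")
                .setQuery(QueryBuilders.termQuery("title", "hello"))
                .execute().actionGet();
        System.out.println("hits: " + response.getHits().getTotalHits());

        node.close();
    }
}

Something like NoSQLUnit could wrap up the node lifecycle and the test-data loading so each test doesn't have to repeat all of that boilerplate.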

Great project!

  - Craig

Friday, July 20, 2012

Time To Build Your Big-Data Muscles

Time To Build Your Big-Data Muscles from fastcompany.com talks mainly about getting educated for a Big Data job. They provide an interesting statistic that for every 100 open Big Data jobs, there are only 2 qualified candidates. Definitely a good place to be if you're looking for work in this area.

Having a background in mathematics and engineering is your best bet. Load up on statistics if you're able to. I'm more on the engineering side myself. My work involves taking the algorithms and principles that have been developed and applying them to business data sets. There is so much work to do just in the application of algorithms alone.

  - Craig

Wednesday, July 18, 2012

Cameron Befus speaks at UHUG (Hadoop)

This evening, we heard from Cameron Befus, former CTO of Tynt, which was purchased by 33 Across Inc. Here are some notes from the presentation on Hadoop.

Why use Hadoop?

A framework for parallel processing of large data sets.

Design considerations
 - The system will manage and heal itself
 - Performance scales linearly
 - Compute: move the algorithm to the data

Transition cluster size - local -> cloud -> dedicated hardware.

Hadoop processing is optimized for data retrieval
 - schema on read
 - nosql databases
 - map reduce
 - asynchronous
 - parallel computing

Built for unreliable, commodity hardware.
Scale by adding boxes
More cost effective

Sends the program to the data.
 - Larger choke points occur in transferring large volumes of data around the cluster

#1: Data-driven decision making. Everyone wants to make good decisions, and as everyone knows, to make good decisions you need data: not just any data, but the right data.

Walmart mined sales that occurred prior to approaching hurricanes and found that the highest-selling products were:
  #1 batteries,
  #2 pop tarts.

This means $$$ to Walmart.

Google can predict flu trends around the world just from search queries. Query velocity matches medical data.

With the advent of large disks, it becomes cost-effective to simply store everything, then use a system like Hadoop to run through and process the data to find value.

Combining data sets can extract value when they may not be valuable on their own. Turning lead into gold as it were.

How big is big data?
  not just size,
  complexity
  rate of growth
  performance
  retention

Other uses:
  load testing
  number crunching
  building Lucene indexes
  just about anything that can be easily parallelized


  - Craig

Saturday, July 14, 2012

Wither MongoDB?

So, I've been thinking of writing this for a while. MongoDB is very popular, maybe the most popular of the document store DBs. It certainly has a lot of mind share in the NoSQL space. I've played around with it and I've taken a lot of time trying to figure out what its scalability and durability models are.

As with any technology, it's very important to understand the strengths and weaknesses of the product. With something as new as MongoDB or the other NoSQL datastores, it's kind of tough to really understand a product without using it and getting used to it. MongoDB has been around long enough now that there is quite a bit of information and customer experience with it.

As far as I can see, there are 3 big reasons that MongoDB has become popular. The first is a pretty easy-to-use API. The second is that it is pretty easy to get a single node up and running. The third is that they have good integration with 3rd party applications and frameworks. If you combine those 3 things, it makes it very easy for a developer to start an instance and start developing against it with a pretty simple API and framework integration. There are also some similarities in its query language to SQL, which gives the user some familiarity.

On the surface, MongoDB is pretty neat. It's easy to get going, operates quickly, has an easy API, and touts scalability and durability. As with any technology, there are some definite drawbacks as well. MongoDB's real niche is as a read-mostly, eventually consistent system. You could think of it more as a caching system with on-disk storage. I know it's used for more than that, but to me, that's a space that it fits well in.

MongoDB has a type of durability, but that's not really a focus of the system. It's really designed as an n1-r1-w1 system, and this makes it a very fast system. It's just not a highly durable system. Contrast this with a system like Riak that concentrates far less on speed, but takes great pains to be a highly scalable and highly durable system. Riak buckets are all CAP-tunable, so if your data needs a high level of durability and consistency, you can specify that and Riak handles it fine. You can specify n3-r3-w3, or even n5-r5-w5 if you want it. It looks like the best you can really do with MongoDB is n3-r3-w2.

MongoDB really wants to be n1-r1-w1 for all operations. n1-r1-w1 is fast, but not durable at all. If n>w, you're always in an eventually consistent state. That means that if you read after a write, you are not guaranteed to read the last copy of your data. This is even more evident with threading. You can help the problem by specifying w=2, but it looks like you can't get w=3. If you need to guarantee that reads will be consistent, then you can specify that all reads and writes are executed on the replica set master. That, however, limits your scalability, as the other nodes in the replica set essentially exist only as failover. To get around that problem, you have to add more shards to your cluster, which means even more machines that are sitting around only as cold failover. It seems like a very inefficient system to me.
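For reference, here's roughly what tuning the write side looks like from the 2.x Java driver of that era; the database and collection names are just placeholders, and the exact constants have shifted between driver versions:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;
import com.mongodb.WriteConcern;

public class WriteConcernExample {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost", 27017);
        DB db = mongo.getDB("testdb");                   // hypothetical database name
        DBCollection users = db.getCollection("users");  // hypothetical collection name

        BasicDBObject doc = new BasicDBObject("name", "craig").append("role", "blogger");

        // Fire-and-forget, roughly the fast n1-r1-w1 behavior described above:
        users.insert(doc, WriteConcern.NORMAL);

        // Block until the write has reached at least two replica set members (w=2),
        // trading speed for durability:
        users.insert(new BasicDBObject("name", "another user"), WriteConcern.REPLICAS_SAFE);

        // Other options trade more speed for durability, e.g. WriteConcern.JOURNAL_SAFE
        // waits for the write to hit the journal, and new WriteConcern(3) would wait
        // for three members (if your replica set actually has three data-bearing nodes).

        mongo.close();
    }
}

The point is that the fast default and the durable configuration are very different animals, and you have to opt into the latter.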

MongoDB's scale model is to start with a single node. If you need more durability, then you can add nodes to form a replica set. Replica sets are essentially an advanced form of master-slave replication. One of the big improvements is that "slave" nodes in the replica set can take over as master if the master goes down. When you switch from a single node to a replica set, your application needs to be changed to configure the extra slave nodes. That creates some pain for the developer that some other systems do not.

Once you need more scalability, you have to go to sharding. This is where things get very complicated. You now have to add 2 more types of nodes. You need to add a config server. This guy's job is to manage where the data chunks are on each replica set. This job is VERY important. If you have a problem with or lose a config server, then you are in a lot of trouble. It looks like if your config server gets wiped out, then your cluster is not recoverable, similar to Hadoop losing its NameNode. The web site doesn't say much about this problem, only that it is critical not to lose the config server(s).

For dev, you can run a single config server, but for production, you have to run 3 of them for durability. If you are running 3 config servers in production and only 1 of them becomes unavailable, then your cluster is running in a degraded state. It will apparently still serve reads and some writes, but chunk operations will not work. Presumably, if your writes fill a chunk and that chunk needs to be split or migrated, you would have a problem. This is something that rubs me very much the wrong way. Even in the most durable MongoDB cluster, losing a single config node puts your cluster in a degraded state.

The other node type you have to add is mongos (mongo s), which acts as a query router and load balancer. Your client will communicate with a mongos, and the mongos will proxy the request to the proper shard. The mongos appears to work from a cached copy of the config server's data. If the mongos that your client is talking with goes down, then your client can't communicate with the cluster; the mongos would need to be restarted. I don't know if a client can send requests to multiple mongos instances. It would seem that this would be helpful.

So once you start sharding, you have to deal with 3 types of nodes, plus master and slave nodes in a replica set that act very differently as well. This is very complicated to me. I'm much more a fan of systems like Riak and ElasticSearch, where all nodes are equal and node use is very efficient. You can scale MongoDB, but it is very hard and not efficient unless you're doing something like n1-r1-w1. Of course, if you do this, then your cluster is not at all durable.

This is one of the things that causes me issues with MongoDB. It just seems like there are a lot of compromises in the architecture. There are ways that you can get around the problems, but that usually involves very large tradeoffs. For example, the "safe mode" setting being off by default is a problem and has caused unexpected data loss for customers. You can turn it on, but many people don't find this out until after they have lost data. People really expect that their data stores will be durable by default and not lose their data.

Another example is the fsync problem. Normally MongoDB lets the OS fsync when it wants to, but this creates an open window for data loss if the node fails before the fsync. You can force an fsync every minute if one has not happened, but you've only shrunk the data loss window. To get around that, you can enable journaling to make the node more durable. So now you can avoid data loss, but you're giving up much of the performance that 10Gen touts about MongoDB.

Another tough example is the very well-known global write lock. 10Gen is working on the problem, but the current solution, from what I understand, would be to reduce the scope of the global write lock from the entire cluster to each node in the cluster. This is definitely better, but still doesn't seem that great. You're still blocking the entire node for writes. If you've got many writes across a number of databases, you're still pretty hosed. I assume that there is some underlying technical issue why the write lock exists and that removing it is a very tough problem.

MongoDB is an interesting system and is being used by a lot of companies. For me, there are too many tradeoffs, unless you're using it for a prototype where you don't need to scale, you have a lot of in-house expertise with MongoDB, or you're using a service like MongoHQ that takes care of the operations issues for you.

There are a number of NoSQL datastores in this space, and there is going to have to be fallout and consolidation as products mature. This is easy to understand. While MongoDB is popular now, it seems that there is some push-back building due to some of its shortcomings. I personally wonder if MongoDB will be very popular in another 2-3 years. I think that as better systems mature, MongoDB may be in danger of being surpassed unless some of its core problems can be fixed.

Go ahead and give MongoDB a try and see if it works for you. Just be sure to understand what you may be getting into.

  - Craig

Friday, July 13, 2012

A Brief Introduction to Riak, from Clusters to Links to MapReduce

I'm a big fan of Riak, especially its architecture. I came across this article on dzone today, A Brief Introduction to Riak, from Clusters to Links to MapReduce by Scott Leberknight.

Scott gives a great introduction to Riak and runs down all of its basic uses, including linking and map/reduce. Nice job, Scott!

  - Craig

Saturday, July 7, 2012

Goodbye MongoDB

I ran across these blog posts a while ago and wanted to post them. It's interesting to see how some companies that have been working with MongoDB for some time are faring with it. No datastore is perfect, but it seems that MongoDB has more than its share of technical and architectural problems. I know they got $42m in funding recently, but it seems like they're going to have to completely re-architect their product to overcome many of their core problems.

MongoDB's claim to fame is a very easy-to-use API, and it is pretty easy to start a single node and start developing. That's very commendable. Once you go beyond a single node though, MongoDB gets difficult very quickly. Most other NoSQL datastores are much, much easier to scale, but suffer on the API side and ease of use. To me, it is going to be much easier to build a better API and put that in front of many of the other datastores than it's going to be for 10Gen to fix many of the core problems with MongoDB.

Read for yourself and make up your own mind.

Kiip - A Year with MongoDB

ZOPYX - Goodbye MongoDB

  - Craig

Friday, July 6, 2012

ElasticSearch videos - Berlin Buzzwords 2012

I wanted to pass on these videos on ElasticSearch from Berlin Buzzwords 2012. Enjoy!

The videos/slides from the Berlin Buzzwords 2012 conference are now live
on: http://www.elasticsearch.org/videos/

Big Data, Search and Analytics, by kimchy
http://www.elasticsearch.org/videos/2012/06/05/big-data-search-and-analytics.html

Three Nodes and One Cluster by Lukas and Karel
http://www.elasticsearch.org/videos/2012/06/05/three-nodes-and-one-cluster.html

Querying 24 Billion Records in 900ms by Jodok
http://www.elasticsearch.org/videos/2012/06/05/querying-24-billion-records-in-900ms.html

Scaling massive elasticsearch clusters by Rafal
http://www.elasticsearch.org/videos/2012/06/05/scaling-massive-elasticsearch-clusters.html

  - Craig

Thursday, June 21, 2012

MapR at UJUG - Real-time Hadoop (Ted Dunning)


Real-time and Long-time with Storm and MapR - http://info.mapr.com/ted-utahjug. Slides may take some time to come online.

Hadoop is great for processing vats of data, but sucks for real-time (by design).
Real-time and Long-time working together: Hadoop for the Long-time part, Storm for the Real-time part. The presentation shows how to blend the two parts into a cohesive whole.

Simply dumping into a NoSQL engine doesn't quite work.
The insert rate is limited.

I have 15 versions of my landing page
Each visitor is assigned to a version
 - Which version?
A conversion or sale or whatever can happen
 - how long to wait?

Probability as expressed by humans is subjective and depends on information and experience.

Bayesian Bandit
 - Compute the distribution
 - Sample p1 and p2 from these distributions
 - Put a coin in bandit 1 if p1 > p2
 - Else, put the coin in bandit 2

Very interesting; wish I could capture more of the slides and demos. Basically, it's about trying to maximize the value of what you're looking for with little knowledge. Regret is the difference between a perfect outcome and the actual outcome.

Ted showed a very cool video that shows the algorithm learning in real time. Video link may be in the slides.

We can encode a distribution by sampling.

Original Bayesian Bandit only requires real-time
Generalized Bandit may require access to long history for learning
Bandit variables can include content
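
Here's a rough sketch of the Bayesian Bandit idea from the notes above, boiled down to two landing page versions instead of fifteen. It leans on commons-math3 for Beta sampling, which is my own assumption, not something from the talk:

import org.apache.commons.math3.distribution.BetaDistribution;

public class BayesianBanditSketch {

    // Beta(1 + conversions, 1 + misses) is the posterior over each page's
    // conversion rate when we start from a uniform prior.
    private final int[] conversions = new int[2];
    private final int[] misses = new int[2];

    // Pick a page by sampling a plausible conversion rate for each one
    // and playing the page whose sample is highest (Thompson sampling).
    public int choosePage() {
        double best = -1.0;
        int pick = 0;
        for (int i = 0; i < conversions.length; i++) {
            double sample = new BetaDistribution(1 + conversions[i], 1 + misses[i]).sample();
            if (sample > best) {
                best = sample;
                pick = i;
            }
        }
        return pick;
    }

    // Record what happened so the posteriors sharpen over time.
    public void recordResult(int page, boolean converted) {
        if (converted) {
            conversions[page]++;
        } else {
            misses[page]++;
        }
    }
}

Each result sharpens the Beta posteriors, so the better page gets played more and more often while the worse one still gets occasional exploration.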

  - Craig

MapR at UJUG - Hadoop, why should I care? (Ted Dunning)

We've got Ted Dunning from MapR speaking at UJUG (Utah Java User Group) tonight. First up: Hadoop, why should I care?

Hadoop is a project and community. Programs are written using Map-Reduce. Map functions do data processing. Reduce functions do aggregations. Parallelism is handled automatically. Input is broken into chunks and passed to a corresponding map function. The output of the map is the input to the reduce function; it is combined, shuffled and sorted, and finally reduced.
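The canonical example of this flow is word count; here's a minimal sketch against the standard Hadoop MapReduce API (not something from the talk, just the usual illustration):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input chunk handed to this mapper.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: the framework has already shuffled and sorted by word,
    // so we just sum the counts for each one.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The framework handles splitting the input, shipping the mapper to the data, and shuffling/sorting the (word, 1) pairs before they hit the reducer.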

Problems that work well on Map-Reduce:
1) Counting problems
2) Sessionize user logs
3) Very large scale joins
4) Large scale matrix recommendations
5) Computing the quadrillionth digit of pi. (apparently done using FFTs)

Map-Reduce is inherently batch oriented.

Many programs require iterative execution of MapReduce:
1) k-means
2) Alternating least squares
3) Page rank

Iteration of Map-Reduce in Hadoop is hopelessly bad
1) 30s delay between iterations
2) Loss of all memory state, side connections
3) Requires serialization of intermediate data to disk at every iteration.

Need to either change the algorithm or change the framework.

Google also put out a paper for large scale graph processing called Pregel.

Apache Giraph is an 80% clone of Pregel
 - Nodes have state based on reading a chunk of input
 - Nodes can send messages to other nodes
 - Next super-step: each node receives all messages, sends new ones
 - When all nodes agree to stop, we are done.

Programming model is similar to Map-Reduce, but iterative. It is also still batch oriented.

Neither Map-Reduce nor Giraph is good for generating progressive answers.

For real-time requirements, Storm + online algorithms work.
 - Must have O(1) total memory and O(1) time per input record
 - With a suitable persistence substrate, we can cleanly combine Map-Reduce + Storm.

What is Storm?
A storm program is called a topology
 - Spouts inject data into a topology
 - Bolts process data
 - The units of data are called tuples
 - All processing is flow-through
 - Bolts can buffer or persist
 - Output tuples can be anchored

Output generally goes to places like Lucene/SOLR, HBase, HDFS, or the MapR transactional file system.
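
To make the spout/bolt/tuple terminology concrete, here's a minimal sketch of a topology using the Storm API of the time (backtype.storm); the spout, bolt, and names are made up for illustration, not taken from the talk:

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class SimpleTopology {

    // Spout: injects tuples into the topology (here, a hard-coded sentence).
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values("real time and long time working together"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt: flow-through processing of each tuple as it arrives.
    public static class SplitBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            for (String word : tuple.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences");

        // In-process cluster for experimentation; a real deployment would use StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("demo", new Config(), builder.createTopology());
    }
}

Everything is flow-through: the spout emits sentences, the bolt splits them into words as they arrive, and a real topology would hang a counting bolt or a persistence bolt off the end.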

Very cool presentation! Ted Dunning has incredible experience in Big Data processing algorithms and has been with MapR since 2010.

Here are some more presentations from Ted Dunning.

  - Craig

Monday, June 18, 2012

Introducing Distributed Execution and MapReduce Framework

Read through Introducing Distributed Execution and MapReduce Framework today. I had read about some of the plans for Infinispan a while ago and was quite interested to see how it was coming along.

If you're not familiar with Infinispan, it's a Java-based, in-memory data grid sponsored by JBoss/Red Hat. It's really a very cool project. What I think is most impressive about Infinispan is the ability to run Map/Reduce jobs across the data grid using the Callable interface. It looks like they have an extended DistributedCallable and a DistributedExecutorService to help with running tasks on the data grid.

They are modeling their Map/Reduce framework after Hadoop while providing what could be a more familiar Callable/ExecutorService model. It may lower the barrier to entry for a subset of developers more familiar with standard Java threading models. I guess the only real drawback is that since this is an in-memory data grid, you're limited in how much data you can traverse. However, it should be very fast since you're memory-centric instead of file-centric as with Hadoop.
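Here's a rough sketch of what I understand the distributed execution side to look like in the Infinispan 5.x API; treat the exact classes and method names as my recollection rather than gospel, and note that a real task would implement DistributedCallable to get a handle on the local cache:

import java.io.Serializable;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

import org.infinispan.Cache;
import org.infinispan.distexec.DefaultExecutorService;
import org.infinispan.distexec.DistributedExecutorService;
import org.infinispan.manager.DefaultCacheManager;

public class DistExecExample {

    // A plain Callable that runs on a grid node; it must be Serializable
    // so Infinispan can ship it to the nodes that own the data.
    public static class LocalWorkTask implements Callable<Integer>, Serializable {
        public Integer call() {
            // Placeholder work; a DistributedCallable would get the local cache injected
            // and could work directly against the data it owns.
            return 42;
        }
    }

    public static void main(String[] args) throws Exception {
        Cache<String, String> cache = new DefaultCacheManager().getCache();

        // Sends the task to the grid instead of just a local thread pool.
        DistributedExecutorService executor = new DefaultExecutorService(cache);
        List<Future<Integer>> results = executor.submitEverywhere(new LocalWorkTask());

        int total = 0;
        for (Future<Integer> future : results) {
            total += future.get();
        }
        System.out.println("total = " + total);
    }
}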

  - Craig

Wednesday, June 13, 2012

ElasticSearch #5 of 9 Open Source Big Data Technologies to Watch

ElasticSearch shows up as #5 on the CIO.com 9 Open Source Big Data Technologies to Watch! Hooray for ElasticSearch! It really is a great project if you're looking at search. Check it out!

  - Craig

Friday, June 8, 2012

Big Search with Big Data Principles - Eric Pugh

This is a presentation given by Eric Pugh at Lucene Revolution 2012. There is a lot of good information in Big Search with Big Data Principles, though the audio is a bit hokey at the beginning.

Eric goes through some of his experiences working with a large vendor, some of the obstacles encountered, and the solutions implemented.

Got hundreds of millions of documents to search? DataImportHandler blowing up while indexing? Random thread errors thrown by Solr Cell during document extraction? Query performance collapsing? Then you're searching at Big Data scale. This talk will focus on the underlying principles of Big Data, and how to apply them to Solr. This talk isn't a deep dive into SolrCloud, though we'll talk about it. It also isn't meant to be a talk on traditional scaling of Solr. Instead we'll talk about how to apply principles of big data like "Bring the code to the data, not the data to the code" to Solr. How to answer the question "How many servers will I need?" when your volume of data is exploding. Some examples of models for predicting server and data growth, and how to look back and see how good your models are! You'll leave this session armed with an understanding of why Big Data is the buzzword of the year, and how you can apply some of the principles to your own search environment.

  - Craig

Thursday, June 7, 2012

3 Secrets to Lightning Fast Mobile Design at Instagram

This article comes from highscalability.com, 3 Secrets to Lightning Fast Mobile Design at Instagram. The article is based on a slide deck from Mike Krieger, co-founder of Instagram.

Very interestingly, they talk about focusing on the user experience first and worrying about the coding second. The focus is to make the user feel productive by performing actions optimistically, then backing out the actions if they are canceled. Photo sharing is a good example: they automatically start uploading the photo, and if the user cancels or otherwise decides not to share it, then it is deleted on the back end. If the user clicks the like button or posts a comment, the action is immediately noted in the UI while it is completed in the background.

This really is in stark contrast to traditional apps where the app only goes to work after the user takes some action. The user is forced to wait while the app works, and this is bad for the user experience. This does put more load on the back end, but really improves the user experience. Speed is king!

  - Craig

Wednesday, June 6, 2012

Slides: Scaling Massive ElasticSearch Clusters

This set of slides, Slides: Scaling Massive ElasticSearch Clusters, comes from Rafal Kuc at Sematext. Rafal gave this talk just a day or 2 ago at Berlin Buzzwords.

ElasticSearch does not have as large a community as SOLR, but it is really gaining a lot of mind share and converts. ElasticSearch is really a great project. Check it out when you get a chance!

  - Craig

Tuesday, June 5, 2012

Search is the Dominant Metaphor for Working with Big Data

Found another good article on dzone.com from Eric Pugh of the Apache Software Foundation. He wrote down some quick thoughts about his attendance at the recent Lucene Revolution 2012. He has a few more articles on dzone.com here.

If you're not paying attention to search, you really need to be. Search absolutely is a core component of Big Data processing. It's great that you can process data with Hadoop, but what do you do with it? Search/Lucene/SOLR/ElasticSearch can be used on both the input and output side. You can pre-process your data with Lucene analyzers or use search to define the data sets you need to run through Hadoop. You can also have search as your endpoint, especially in a streaming process, to make your processed data available to your users.

Search is king, and nowhere is that more evident than with Big Data!

  - Craig

Using Java to access MongoDB, Redis, CouchDB, Riak, Cassandra

I ran across this article, Using Java to access MongoDB, Redis, CouchDB, Riak, Cassandra, on java.dzone.com and thought it was interesting. It looks like this guy is playing around with building his own persistence layer in an append-only logger style, kinda cool.

One of the nice things about the article is that he shows a very basic persistence example for 5 different NoSQL data stores. I haven't seen anyone show side-by-side examples like this before. Cassandra seems to be the most involved in setting up a connection and performing a quick append, though none of the examples look very difficult.

Good luck Carlo Scarioni!

  - Craig

Saturday, June 2, 2012

Build your own twitter like real time analytics - a step by step guide

This is a quick little trip over to highscalability.com for an article entitled Build your own twitter like real time analytics - a step by step guide.

You can certainly build a batch-based system using Hadoop or something similar. The real trick is to build a real-time analytics system. The article points to the following 3 features:

  1. Use an In-Memory Data Grid (XAP) for handling the real-time stream data processing.
  2. Use a Big Data database (Cassandra) for storing the historical data and managing the trend analytics.
  3. Use Cloudify (cloudifysource.org) for managing and automating the deployment on a private or public cloud.

I have been working on some batch-oriented techniques for processing data. I will need to employ some different approaches to move to a streaming process.

  - Craig

Friday, May 18, 2012

Martin Fowler on ORM Hate

Just picked up this article on java.dzone.com, Martin Fowler on ORM Hate. Fowler always has very interesting things to say due to his expertise in the industry.

There has been a lot of push-back against ORMs, especially in the NoSQL space, and Fowler takes some time to talk about this. ORMs are a tool like any other, and a lot depends on how we use them. One of the biggest complaints is that they are a big black box, which can be true, but they also are a HUGE help with relational mapping. Even manual SQL queries force you to do some sort of mapping operation since you are going from relational to object.

One trend I see is incorporating NoSQL support into various ORMs. Document stores seem to be popular for this, possibly because of the simplicity of the data model. The capabilities of each store, however, vary greatly, so this is difficult. Some stores like MongoDB and CouchDB provide support for more advanced queries, even SQL-like queries, while other stores like Riak provide relatively limited support, essentially key-only access. Basho is working on adding more advanced support now with secondary indexes and Riak Search.

  - Craig

Thursday, May 10, 2012

Cell Architectures

Back to one of my favorite sites again. highscalability.com brings us a great article on Cell Architectures.

A cell is essentially a cluster of services that forms some scalable part of the architecture. Cells are self-contained and can be swapped out as needed. Each cell will generally serve a specific portion of traffic. If a cell fails, then only a subset of users is affected.

My original idea of the size of a cell was much smaller than the examples given. For example, a cell may be a cluster of 50 or more servers. Obviously, these are larger companies that have large cells like this. Still, it's interesting to think about how cells could be put together on a smaller basis, depending on need.

From the article:
Cell Architectures have several advantages:
  • Cells provide a unit of parallelization that can be adjusted to any size as the user base grows.
  • Cells are added in an incremental fashion as more capacity is required.
  • Cells isolate failures. One cell failure does not impact other cells.
  • Cells provide isolation as the storage and application horsepower to process requests is independent of other cells.
  • Cells enable nice capabilities like the ability to test upgrades, implement rolling upgrades, and test different versions of software.
  • Cells can fail, be upgraded, and distributed across datacenters independent of other cells.

Enjoy!

  - Craig

Thursday, April 19, 2012

Firebase - Realtime Data Synchronization for Clients

I ran across a link for this product today and decided to check it out. Go to www.firebase.com and check out the examples and tutorial. The product is currently pre-release. You can sign up for an account if you want to play with their API.

The product API is JavaScript-based and supports a REST interface for use in other applications and also via node.js. It looks like it is essentially a document store that is more property-oriented, similar to Amazon's DynamoDB. Individual properties can be updated without having to touch or index the entire document as with a traditional document store. Every property has its own URI that can be accessed via the API or REST.

You can attach JS callbacks to the data so that your app can be notified when that data has been changed by another client. That's really its claim to fame. One drawback that I see right now is that there is no client authentication and all of the data is wide open. The examples are in JS, so the code and paths are all open. Anyone taking a look at your JS can play with all of your data.

Firebase definitely looks like an interesting product and certainly fills an interesting niche. Take a look and let me know what you think!

Friday, April 13, 2012

An Introduction to NoSQL Patterns

I read this article today over at dzone, An Introduction to NoSQL Patterns. Ricky Ho has really put a lot of information into his article, and I think it deserves to be called more than an introduction.

  - Craig

Tuesday, March 27, 2012

JustOneDB - NEWSQL looks like Postgresql

I was browsing around some of the add-ons available on Heroku and discovered a NEWSQL-style database called JustOneDB. The interesting thing is that to your application, it looks and acts just like PostgreSQL, but it has more of a NOSQL-style back end. Of course, that's kind of the whole purpose of NEWSQL anyway, but it's pretty interesting nonetheless.

There isn't a lot of information available on the project yet, but they tie in with Heroku and AppHarbor. I did find an entry on Bloor Research that has a section talking about JustOneDB.

Here's a small section out of their technical data sheet. It looks like they are using a very different storage architecture that may be very compelling. One of the very interesting things their back end apparently provides is fully indexed query performance without actually requiring indexes. Cool. It looks like this is a commercial product. Too bad, it would be fun to take a look at how they are doing all of their cool things :)
JustOneDB uses a unique storage model which is neither row nor column based; but instead stores data by relationship. Data is duplicated and compressed along each relationship such that every query encounters data arranged in optimal spatial locality for a given query. This allows JustOneDB to fully exploit the virtuous characteristics of modern hardware such as CPU cache pre-fetch, bulk transfer rates and multiple core architectures to achieve stellar performance on a modest hardware footprint.

  - Craig

Tuesday, March 13, 2012

How Eventual is Eventual Consistency? Probabilistic Bounded Staleness (PBS)

So I came across this article on eventual consistency on dzone and thought I'd pass it along. It comes from a Riak meetup last month and talks about calculating what eventual consistency actually means on Dynamo-style systems. It's really quite interesting and comes with a link to a calculator. The idea is that if you switch from, say, an N=3, R=W=1 system to an N=2, R=2, W=1 system, how does that affect consistency, and at what point in time could you reasonably expect your data to be perfectly consistent? (The usual rule of thumb is that R + W > N guarantees reads overlap the latest write; PBS quantifies how stale reads can get when you relax that.)

http://java.dzone.com/articles/nosql-weekly-how-eventual

and a link to the calculator

http://bailis.org/projects/pbs/#demo

Enjoy!

  - Craig

Wednesday, February 15, 2012

MySQL Cluster 7.2 Released with 70x Increased Performance and NoSQL Features

I ran across a tweet about this on infoq.com and thought it was pretty interesting. MySQL Cluster 7.2 Released with 70x Increased Performance and NoSQL Features.

I know that there are several companies that are using MySQL in a NoSQL style, usually simplifying and de-normalizing the schema, doing mostly key-based access, things like that. It's pretty neat to see Oracle trying to take things to a new level with MySQL. I'll need to look into this some more.

  - Craig

Saturday, February 11, 2012

Martin Fowler - Polyglot Persistence

I don't think this article is particularly new, but it certainly makes a good point about persistence. We're not really transitioning to a NoSQL persistence world as much as we are transitioning to a polyglot persistence world.

Anybody that is doing anything out there right now is using some RDBMS. As with any data system, there is a lot of inertia in that data. The larger the data collection, the larger the inertia. Most NoSQL systems are being brought in to supplement existing RDBMS systems. This may be because some NoSQL systems have a particular strength or advantage over the existing RDBMS, or a NoSQL system is brought in simply for experimentation. There are very few places that replace their existing RDBMS. Especially for transactional systems, I think the RDBMS certainly still has the upper hand.

Enjoy!

  - Craig

Saturday, January 21, 2012

Iteration Speed Controls Innovation Speed

I was talking to some colleagues at work and trying to explain some of the benefits in the speed at which we can iterate our product. The underlying point is that the faster we can iterate our product, the faster we can innovate for our customers. If we can build certain high-speed processes, it would give us the ability to do things that our competitors simply could not duplicate.

Think of it this way. If you have a process that takes a week to complete, then you could probably only repeat that process maybe 2-3 times per month. Since the process takes so long to complete, it forces longer preparation and analysis phases because you don't want to run a week-long process without making sure you are well prepared.

If you could take that same process and reduce that week of time down to an hour, it fundamentally transforms how fast you can innovate. You can now run your process multiple times per day if you want and still have time to analyze your results. You can also reduce the time between runs now that you can run so quickly. It now becomes economical to make small, incremental changes to the process and re-run and check the results. You can experiment as much as you want.

I think a really great example of this is Google Chrome versus Microsoft IE. The IE release cycle is typically measured in years while the Chrome release cycle is about 1 or 2 months. Chrome is able to react to the market and add features at a rate that is currently not possible for IE. The result is that Chrome has been gaining browser market share at an incredible rate, and much of that at the expense of IE.

Microsoft is trying to increase their release rate and hence their innovation rate. IE 8, 9, and now 10 have been coming out at much faster rates than has been typical for them. However, they are nowhere near the rate at which Chrome is innovating and improving. The Firefox team has also been generating releases at a much faster rate than they have historically, and hence innovating at a much faster rate.

I think the question becomes: how do you increase your rate of iteration, and hence innovation? I think it really depends on the organization. It's easy enough to improve certain processes and make even considerable gains. However, in order to make gains that drive 1, 2 or even 3 orders of magnitude in iteration speed, you have to fundamentally transform how you approach and tackle problems.

At my company, we have been discussing some ways that we can operate that are fundamentally different from our competitors. The idea is to try and transform our industry in ways that our competitors can not. We can do this by drastically increasing our iteration rate and hence the rate at which we can innovate.

  - Craig

Wednesday, January 18, 2012

SOPA and PIPA explained by Clay Shirkey of TED on GigaOM

What does a kid’s birthday cake have to do with SOPA?

An incredibly good presentation on the effects of SOPA and PIPA and the backstory behind them. If you want to know why these bills are being produced and the real effects and desires of the media industry, this is the plain story. I had some ideas about this, but after watching this presentation, everything becomes very, very clear.

  - Craig

Amazon DynamoDB

This article comes from gigaom.com and contains a bit of an interesting take on the Amazon DynamoDB announcement.

The interesting thing they take note of is that the service is built on SSDs instead of plain old HDDs. This is likely the wave of the future as memory/flash prices continue to fall at a great rate. HDDs still remain vastly cheaper on a per-GB basis and are likely to remain so for the foreseeable future.

SSD-backed services will remain premium offerings, but they may well be a worthwhile tradeoff for a number of companies. If you can't get enough I/Os on, say, EC2, then you're pretty much forced to run everything out of RAM, which can be even more expensive.

It is very nice to see Amazon running services on SSDs, and here's hoping they keep going in that direction, especially for EC2 and EBS instances!

  - Craig

Thursday, January 5, 2012

Flexible Indexing in Lucene 4

Just finished viewing a presentation on some of the new Lucene 4 features by Uwe Schindler: Flexible Indexing in Lucene 4.

If you haven't been following the development of Lucene and some of the related projects, SOLR and ElasticSearch, you really need to pay attention. The rate of advancement in these projects is just amazing to me, as is the huge impact they are making on business and the web in general.

As they say, Search is King, and Lucene is king of open source search.

  - Craig

Wednesday, January 4, 2012

Book Review: Mahout In Action by Manning

I just finished reading Mahout in Action by Manning late last night. I originally bought it as a MEAP, but didn't get around to reading it until I had the published copy of the book.

To me, there are 2 ways to look at this book. The first is as a book on Mahout, which is obviously the direct subject matter. The second way is as a general book on machine learning algorithms, particularly those dealing with recommendations, clustering, and classification.

As a book on Mahout, I think the book serves its purpose well. It explains the reasons for the project and some of the design decisions around the library itself. Since Mahout is designed to work with very large data sets, some particular decisions were made around collections, specifically not using the standard Java collections libraries due to the storage overhead associated with them. I thought it was a little funny at first, but it makes great sense given the subject matter and needs.

I really like the Mahout focus on measurement and debugging facilities. The real trick in using Mahout and machine learning algorithms in general is really the input data, as well as the desired application. Your input data needs to be transformed in a way that is compatible with the algorithm and purpose. If the input is transformed well, then the algorithm can do its job. If it is not transformed well, then the algorithm is going to give poor or unpredictable results.

Mahout has very strong tools to interrogate the model that is built from the input data. This is an imperative part of having success. Along with this, substantial tools are provided to measure the success of the algorithm. Different measures can be used depending on the algorithm used. The book does an excellent job of walking you through these processes as well as providing tips on how you can tell if you've got something wrong. For example, if your classification algorithm is providing near 100% accuracy on your test data, you've likely done something wrong, as the best-tuned algorithms only provide around 84% accuracy across a variety of input data.
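To give a feel for both halves, building a recommender and then measuring it, here's the kind of minimal Taste-style example the early chapters walk through; the ratings.csv file and the specific similarity/neighborhood choices are just placeholders:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv is a hypothetical "userID,itemID,preference" file.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        RecommenderBuilder builder = new RecommenderBuilder() {
            public Recommender buildRecommender(DataModel model) throws TasteException {
                UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
                UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
                return new GenericUserBasedRecommender(model, neighborhood, similarity);
            }
        };

        // Produce a few recommendations for user 1.
        List<RecommendedItem> items = builder.buildRecommender(model).recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item);
        }

        // Measure the recommender: train on 70% of the data, score against the rest.
        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
        double score = evaluator.evaluate(builder, null, model, 0.7, 1.0);
        System.out.println("average absolute difference: " + score);
    }
}

The evaluator withholds part of the data, asks the recommender to estimate the held-out preferences, and reports how far off it was, which is exactly the kind of measurement discipline the book pushes.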

I really enjoyed the layout of the book. Each section essentially starts with an introduction on the subject being discussed, then takes you into some basic exercises with one of the algorithms, then takes you through a real example and then finishes with some advice on possible issues that could come up and associated solutions. It felt very thorough and complete to me.

I think the book also needs to be considered as a very good, general book on machine learning algorithms. This is the third book now that I've read on the subject and I am very happy with it in that regard.

The machine learning topics and algorithms covered in this book are a bit more focused than in the other books I've read, but the advantage is much deeper coverage of the chosen topics. There are some new concepts and terminology here that I had not seen covered before. One in particular is the concept of target leaks in classification.

This book does not provide a mathematical basis for understanding machine learning or the covered algorithms. It also does not cover a great breadth of algorithms or even all of those that are provided by Mahout. There are other books on machine learning that do a better job at both of those.

If you're just generally interested in the machine learning algorithms and techniques, this is really a great book to use as it covers the topics end-to-end very well. You will not be disappointed. You get the added benefit of learning Mahout to boot.

As a book on learning Mahout, this is the only game in town; very luckily for us, it is a very good game. All in all, I highly recommend this book!

  - Craig