NoSql Tips and Tricks: January 2012

Saturday, January 21, 2012

Iteration Speed Controls Innovation Speed

I was talking to some colleagues at work and trying to explain some of the benefits of the speed at which we can iterate on our product. The underlying point is that the faster we can iterate on our product, the faster we can innovate for our customers. If we can build certain high-speed processes, it would give us the ability to do things that our competitors simply could not duplicate.

Think of it this way. If you have a process that takes a week to complete, then you could probably only repeat that process maybe 2-3 times/month. Since the process takes so long to complete, it forces longer preparation and analysis phases because you don't want to run a week-long process without making sure you are well prepared.

If you could take that same process and reduce that week of time down to an hour, it fundamentally transforms how fast you can innovate. You can now run your process multiple times per day if you want and still have time to analyze your results. You can also reduce the time between runs now that you can run so quickly. It now becomes economical to make small, incremental changes to the process and re-run and check the results. You can experiment as much as you want.

I think a really great example of this is Google Chrome versus Microsoft IE. The IE release cycle is typically measured in years while the Chrome release cycle is about 1 or 2 months. Chrome is able to react to the market and add features at a rate that is currently not possible for IE. The result is that Chrome has been gaining browser market share at an incredible rate, and much of that at the expense of IE.

Microsoft is trying to increase their release rate and hence their innovation rate. IE 8, 9, and now 10 have been coming out at much faster rates than has been typical for them. However, they are nowhere near the rate at which Chrome is innovating and improving. The Firefox team has also been generating releases at a much faster rate than they have historically, and hence innovating at a much faster rate.

I think the question becomes: how do you increase your rate of iteration and hence innovation? I think it really depends on the organization. It's easy enough to improve certain processes and make even considerable gains. However, in order to make gains that drive 1, 2 or even 3 orders of magnitude in iteration speed, you have to fundamentally transform how you approach and tackle problems.

At my company, we have been discussing some ways that we can operate that are fundamentally different from our competitors. The idea is to try to transform our industry in ways that our competitors cannot. We can do this by drastically increasing our iteration rate and hence the rate at which we can innovate.

- Craig

Wednesday, January 18, 2012

SOPA and PIPA explained by Clay Shirky of TED on GigaOM

What does a kid’s birthday cake have to do with SOPA?

An excellent presentation on the effects of SOPA and PIPA and the backstory. If you want to know why these bills are being produced, and the real effects and desires of the media industry, this is the plain story. I had some ideas about this, but after watching this presentation, everything becomes very, very clear.

- Craig

Amazon DynamoDB

This article comes from gigaom.com and contains a bit of an interesting take on the Amazon DynamoDB announcement.

The interesting thing they note is that the service is built on SSDs instead of plain old HDDs. This is likely the wave of the future as memory/flash prices continue to fall at a great rate, although HDDs still remain vastly cheaper on a per-GB basis and are likely to remain so for the foreseeable future.

Services with SSDs will remain a premium service, but may well be a worthwhile tradeoff for a number of companies. If you can't get enough I/O on, say, EC2, then you're pretty much forced to run everything out of RAM, which can be even more expensive.

It is very nice to see Amazon running services on SSDs and here's hoping they keep going in that direction, especially for EC2 and EBS instances!

- Craig

Thursday, January 5, 2012

Flexible Indexing in Lucene 4

Just finished viewing a presentation by Uwe Schindler on some of the new Lucene 4 features: Flexible Indexing in Lucene 4.

If you haven't been following the development of Lucene and the related projects Solr and ElasticSearch, you really need to pay attention. The rate of advancement in these projects is just amazing to me, as is the huge impact they are making on business and the web in general.

As they say, Search is King, and Lucene is king of open source search.

- Craig

Wednesday, January 4, 2012

Book Review: Mahout In Action by Manning

I just finished reading Mahout in Action by Manning late last night. I originally bought it as a MEAP, but didn't get around to reading it until I had the published copy of the book.

To me, there are 2 ways to look at this book. The first is as a book on Mahout, which is obviously the direct subject matter. The second way is as a general book on machine learning algorithms, particularly those dealing with recommendations, clustering, and classification.

As a book on Mahout, I think the book serves its purpose well. It explains the reasons for the project and some of the design decisions around the library itself. Since Mahout is designed to work with very large data sets, some particular decisions were made around collections, specifically not using the standard Java collections libraries due to the storage overhead associated with Java Collections. I thought it was a little funny at first, but it makes great sense given the subject matter and needs.

I really like the Mahout focus on measurement and debugging facilities. The real trick in using Mahout and machine learning algorithms in general is really the input data, as well as the desired application. Your input data needs to be transformed in a way that is compatible with the algorithm and purpose. If the input is transformed well, then the algorithm can do its job. If it is not transformed well, then the algorithm is going to give poor or unpredictable results.

Mahout has very strong tools to interrogate the model that is built from the input data. This is an essential part of having success. Along with this, substantial tools are provided to measure the success of the algorithm. Different measures can be used depending on the algorithm used. The book does an excellent job of walking you through these processes as well as providing tips on how you can tell if you've got something wrong. For example, if your classification algorithm is providing near 100% accuracy on your test data, you've likely done something wrong, as the best tuned algorithms only provide around 84% accuracy across a variety of input data.
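The "too good to be true" check is easy to automate. What follows is only a minimal sketch in plain Python, not Mahout's own evaluation API: the function names and the 0.99 threshold are assumptions made up for illustration, and the tiny data set is invented.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the held-out labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one example")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def looks_too_good(acc, threshold=0.99):
    """Hypothetical rule of thumb: near-perfect accuracy deserves a second look."""
    return acc >= threshold


# Invented held-out labels and a classifier that gets every one right.
actual = ["spam", "ham", "spam", "ham", "spam"]
predicted = ["spam", "ham", "spam", "ham", "spam"]

acc = accuracy(predicted, actual)
if looks_too_good(acc):
    print(f"accuracy {acc:.0%} looks suspicious; check how the test data was built")
```

A check like this does not say what went wrong, only that the result should not be trusted until the input data has been inspected.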

I really enjoyed the layout of the book. Each section essentially starts with an introduction on the subject being discussed, then takes you into some basic exercises with one of the algorithms, then takes you through a real example and then finishes with some advice on possible issues that could come up and associated solutions. It felt very thorough and complete to me.

I think the book also needs to be considered as a very good, general book on machine learning algorithms. This is the third book now that I've read on the subject and I am very happy with it in that regard.

The machine learning topics and algorithms covered in this book are a bit more focused than the other books I read, but the advantage is much deeper coverage on the chosen topics. There are some new concepts and terminology here that I had not seen covered before. One in particular is the concept of target leaks in classification.
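A target leak, as the term is generally used, is a feature in the training data that gives away the answer the classifier is supposed to predict. The sketch below is one rough way to spot the most blatant cases, assuming small categorical data held as Python dictionaries; the `leaky_features` helper and the sample rows are hypothetical and are not taken from the book or from Mahout.

```python
from collections import defaultdict


def leaky_features(rows, label_key):
    """Return names of features whose value alone determines the label.

    Features with a distinct value on every row (such as IDs) are skipped,
    since they trivially "determine" the label without being informative.
    """
    labels_by_value = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for name, value in row.items():
            if name != label_key:
                labels_by_value[name][value].add(row[label_key])

    suspects = []
    for name, by_value in labels_by_value.items():
        one_label_per_value = all(len(labels) == 1 for labels in by_value.values())
        if one_label_per_value and len(by_value) < len(rows):
            suspects.append(name)
    return sorted(suspects)


# Invented example: "folder" was filled in after the mail was already labeled,
# so it encodes the target directly.
rows = [
    {"subject_len": 10, "folder": "junk", "label": "spam"},
    {"subject_len": 10, "folder": "inbox", "label": "ham"},
    {"subject_len": 25, "folder": "junk", "label": "spam"},
    {"subject_len": 25, "folder": "inbox", "label": "ham"},
]

print(leaky_features(rows, "label"))  # → ['folder']
```

Real leaks are usually subtler than a one-to-one mapping, so a clean result here proves nothing; a hit, however, is a strong hint that the feature should not be available to the classifier.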

This book does not provide a mathematical basis for understanding machine learning or the covered algorithms. It also does not cover a great breadth of algorithms or even all of those that are provided by Mahout. There are other books on machine learning that do a better job at both of those.

If you're just generally interested in the machine learning algorithms and techniques, this is really a great book to use as it covers the topics end-to-end very well. You will not be disappointed. You get the added benefit of learning Mahout to boot.

As a book on learning Mahout, this is the only game in town. Luckily for us, it is a very good game. All in all, I highly recommend this book!

- Craig