Wednesday, January 4, 2012

Book Review: Mahout In Action by Manning

I just finished reading Mahout in Action by Manning late last night. I originally bought it as a MEAP, but didn't get around to reading it until I had the published copy of the book.

To me, there are two ways to look at this book. The first is as a book on Mahout, which is obviously the direct subject matter. The second is as a general book on machine learning algorithms, particularly those dealing with recommendations, clustering, and classification.

As a book on Mahout, I think the book serves its purpose well. It explains the reasons for the project and some of the design decisions around the library itself. Since Mahout is designed to work with very large data sets, some particular decisions were made around collections, specifically not using the standard Java collections libraries due to the storage overhead associated with Java Collections. I thought it was a little funny at first, but it makes great sense given the subject matter and needs.
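
To give a feel for the storage overhead argument, here is a minimal sketch of a map from long IDs to float values backed by primitive arrays. This is not Mahout's actual code; the class name PrimitiveLongFloatMap and all of its methods are hypothetical and only meant to show the idea. A standard HashMap<Long, Float> allocates a boxed key, a boxed value, and an entry object for every mapping, while the sketch below stores everything in three flat arrays.

```java
// Hypothetical illustration, not Mahout's implementation: a long-to-float map
// that keeps keys and values in primitive arrays instead of boxed objects.
class PrimitiveLongFloatMap {
    private long[] keys;
    private float[] values;
    private boolean[] used;   // marks which slots hold an entry
    private int size;

    PrimitiveLongFloatMap(int initialCapacity) {
        int capacity = Math.max(4, initialCapacity);
        keys = new long[capacity];
        values = new float[capacity];
        used = new boolean[capacity];
    }

    // Linear probing: start at the hashed slot and walk forward until we
    // find the key or an empty slot. The load factor is kept at or below
    // 0.5, so an empty slot always exists and the loop terminates.
    private int findSlot(long key) {
        int slot = (Long.hashCode(key) & 0x7fffffff) % keys.length;
        while (used[slot] && keys[slot] != key) {
            slot = (slot + 1) % keys.length;
        }
        return slot;
    }

    void put(long key, float value) {
        if ((size + 1) * 2 > keys.length) {
            resize();
        }
        int slot = findSlot(key);
        if (!used[slot]) {
            used[slot] = true;
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    // Returns Float.NaN when the key is absent.
    float get(long key) {
        int slot = findSlot(key);
        return used[slot] ? values[slot] : Float.NaN;
    }

    int size() {
        return size;
    }

    // Doubles the arrays and re-inserts every existing entry.
    private void resize() {
        long[] oldKeys = keys;
        float[] oldValues = values;
        boolean[] oldUsed = used;
        keys = new long[oldKeys.length * 2];
        values = new float[keys.length];
        used = new boolean[keys.length];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldUsed[i]) {
                put(oldKeys[i], oldValues[i]);
            }
        }
    }

    public static void main(String[] args) {
        PrimitiveLongFloatMap ratings = new PrimitiveLongFloatMap(16);
        ratings.put(1001L, 4.5f);
        ratings.put(1002L, 3.0f);
        System.out.println("rating for 1001 = " + ratings.get(1001L));
        System.out.println("entries = " + ratings.size());
    }
}
```

The sketch leaves out removal and iteration to stay short, and it uses NaN as the "missing" marker, which would not work if NaN were a legitimate value.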

I really like the Mahout focus on measurement and debugging facilities. The real trick in using Mahout, and machine learning algorithms in general, is the input data and the desired application. Your input data needs to be transformed in a way that is compatible with the algorithm and purpose. If the input is transformed well, then the algorithm can do its job. If it is not transformed well, then the algorithm is going to give poor or unpredictable results.

Mahout has very strong tools to interrogate the model that is built from the input data. This is an essential part of having success. Along with this, substantial tools are provided to measure the success of the algorithm. Different measures can be used depending on the algorithm used. The book does an excellent job of walking you through these processes as well as providing tips on how you can tell if you've got something wrong. For example, if your classification algorithm is providing near 100% accuracy on your test data, you've likely done something wrong, as the best-tuned algorithms only provide around 84% accuracy across a variety of input data.
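
As a rough illustration of that accuracy sanity check, here is a small plain-Java sketch. It does not use Mahout's evaluation classes; the AccuracyCheck class, its method names, and the 0.99 cutoff are all assumptions made up for this example. It computes the fraction of correct predictions on held-out test labels and flags a score that looks too good to be true.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mahout: measures accuracy on held-out
// test data and flags results that are suspiciously close to perfect.
class AccuracyCheck {
    // Assumed cutoff for "too good to be true"; pick one that suits your data.
    static final double SUSPICIOUS_THRESHOLD = 0.99;

    // Fraction of predictions that match the actual labels.
    static double accuracy(List<String> actual, List<String> predicted) {
        if (actual.isEmpty() || actual.size() != predicted.size()) {
            throw new IllegalArgumentException(
                "label lists must be non-empty and the same size");
        }
        int correct = 0;
        for (int i = 0; i < actual.size(); i++) {
            if (actual.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return (double) correct / actual.size();
    }

    // True when the score is high enough to warrant a second look at the
    // input data before trusting the model.
    static boolean looksTooGood(double accuracy) {
        return accuracy >= SUSPICIOUS_THRESHOLD;
    }

    public static void main(String[] args) {
        List<String> actual = Arrays.asList("spam", "ham", "spam", "ham");
        List<String> predicted = Arrays.asList("spam", "ham", "ham", "ham");
        double acc = accuracy(actual, predicted);
        System.out.println("accuracy = " + acc);
        System.out.println("suspicious = " + looksTooGood(acc));
    }
}
```

A flag from looksTooGood is only a prompt to go back and inspect the input data, not proof that anything is broken.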

I really enjoyed the layout of the book. Each section essentially starts with an introduction on the subject being discussed, then takes you into some basic exercises with one of the algorithms, then takes you through a real example and then finishes with some advice on possible issues that could come up and associated solutions. It felt very thorough and complete to me.

I think the book also needs to be considered as a very good, general book on machine learning algorithms. This is the third book now that I've read on the subject and I am very happy with it in that regard.

The machine learning topics and algorithms covered in this book are a bit more focused than the other books I read, but the advantage is much deeper coverage on the chosen topics. There are some new concepts and terminology here that I had not seen covered before. One in particular is the concept of target leaks in classification.
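
A target leak happens when one of the input features quietly carries the answer the classifier is supposed to predict. The sketch below is one crude way to look for that in plain Java. It is not taken from the book or from Mahout; the TargetLeakCheck class, the sample column meanings, and the heuristic itself are assumptions for illustration. It flags any feature column whose value alone determines the label on every row.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical heuristic, not from Mahout: flags feature columns that
// perfectly determine the label, which can be a sign of a target leak.
class TargetLeakCheck {
    // Returns the indexes of columns where each distinct feature value
    // always appears with the same label.
    static List<Integer> suspectColumns(String[][] features, String[] labels) {
        List<Integer> suspects = new ArrayList<>();
        if (features.length == 0) {
            return suspects;
        }
        int columns = features[0].length;
        for (int col = 0; col < columns; col++) {
            Map<String, String> valueToLabel = new HashMap<>();
            boolean determinesLabel = true;
            for (int row = 0; row < features.length && determinesLabel; row++) {
                // putIfAbsent returns the label seen earlier for this value, if any.
                String earlier = valueToLabel.putIfAbsent(features[row][col], labels[row]);
                if (earlier != null && !earlier.equals(labels[row])) {
                    determinesLabel = false;
                }
            }
            if (determinesLabel) {
                suspects.add(col);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Made-up data: column 0 = "refund issued", column 1 = "region".
        String[][] features = {
            {"yes", "east"},
            {"no", "east"},
            {"yes", "west"},
            {"no", "west"},
        };
        String[] labels = {"churned", "stayed", "churned", "stayed"};
        System.out.println("suspect columns = " + suspectColumns(features, labels));
    }
}
```

Note that this check also flags columns with a unique value per row, such as record IDs, so every hit still needs a human look to decide whether it is a real leak.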

This book does not provide a mathematical basis for understanding machine learning or the covered algorithms. It also does not cover a great breadth of algorithms or even all of those that are provided by Mahout. There are other books on machine learning that do a better job at both of those.

If you're just generally interested in the machine learning algorithms and techniques, this is really a great book to use as it covers the topics end-to-end very well. You will not be disappointed. You get the added benefit of learning Mahout to boot.

As a book on learning Mahout, this is the only game in town. Very lucky for us, it is a very good game. All in all, I highly recommend this book!

  - Craig