Thursday, June 21, 2012

MapR at UJUG - Hadoop, why should I care? (Ted Dunning)

We've got Ted Dunning from MapR speaking at UJUG (Utah Java User Group) tonight. First up, Hadoop, why should I care.

Hadoop is a project and community. Programmed using Map-Reduce. Map functions do data processing. Reduce functions do aggregations. Parallelism is done automatically. Input is broken into chunks and passed to a corresponding map function. Output of map is input to reduce function. The reduce function will combine, shuffle and short, and finally reduce the output.

Problems that work well on Map-Reduce:
1) Counting problems
2) Sessionize used logs
3) Very large scale joins
4) Large scale matix recommendations
5) Computing the quadrillionth digit of pi. (apparently done using FFTs)

Map-Reduce in inherently batch oriented.

Many programs require iterative execution of MapReduce:
1) k-means
2) Alternating least squares
3) Page rank

Iteration of Map-Reduce in Hadoop is hopelessly bad
1) 30s delay between iteratiob
2) Loss of all memory state, side connections
3) Requires serialization of intermediate data to disk at every iteration.

Need to either change the algorithm or change the framework.

Google also put out a paper for large scale graph processing called Pregal.

Apache Giraph is an 80% clone of Pregal
 - Nodes have stat base on reading a chunk of input
 - Nodes can send messages to other nodes
 - Next super-step. each node receive all messages, sends new ones
 - When all nodes agree to stop, we are done.

Programming model is similar to Map-Reduce, but iterative. It is also still batch oriented.

Neither Map-Reduce or Giraph or good for generating progressing answers.

For real-time requirements, Storm+ online algorithms works.
  - Must have O(1) total memory and O91) time per input records
 - With suitable persistence substrate, we can cleanly combine Map-Reduce + Storm.

What is Storm?
A storm program is called a topology
 - Spouts inject data into a topology
 - Bolts process data
 - The units of data are called tuples
 - All processing is flow-through
 - Bolts can buffer or persist
 - Output tuples can by anchored

Output general goes to places like Lucene/SOLR, HBase, HDFS, or MapR transactional file system

Very cool presentation! Ted Dunning has incredible experience is BigData processing algorithms and has been with MapR since 2010.

Here are some more presentations from Ted Dunning.

  - Craig