Thursday, June 21, 2012

MapR at UJUG - Real-time Hadoop (Ted Dunning)

Real-time and Long-time with Storm and MapR - Slides may take some time to come online.

Hadoop is great for processing vats of data, but sucks for real-time (by design).
Real-time and Long-time working together. Hadoop for the Long-time part, Storm for the Real-time part. Presentation can blend the two parts into a cohesive whole.

Simply dumping into NoSQL engine doesn't quite work.
Insert rate is limited

I have 15 versions of my landing page
Each visitor is assigned to a version
 - Which version?
A conversion or sale or whatever can happen
 - how long to wait?

Probability as expressed by humans is subjective and depends on information and experience.

Bayesian Bandit
 - Compute the distribution
 - Sample p1 and p2 from these distributions
 - Put a coin in bandit 1 if p1 > p2
 - Else, put the coin in bandit 2

Very interesting, which I could capture more of the slides and demos. Basically, it's about trying to maximize value of what you're looking for with little knowledge. Regret is the different between a perfect outcome and the actual outcome.

Ted showed a very cool video that shows the algorithm learning in real time. Video link may be in the slides.

We can encode a distribution by sampling.

Original Bayesian Bandit only requires real-time
Generalized Bandit may require access to lng history for learning
Bandit variables can include content

  - Craig

MapR at UJUG - Hadoop, why should I care? (Ted Dunning)

We've got Ted Dunning from MapR speaking at UJUG (Utah Java User Group) tonight. First up, Hadoop, why should I care.

Hadoop is a project and community. Programmed using Map-Reduce. Map functions do data processing. Reduce functions do aggregations. Parallelism is done automatically. Input is broken into chunks and passed to a corresponding map function. Output of map is input to reduce function. The reduce function will combine, shuffle and short, and finally reduce the output.

Problems that work well on Map-Reduce:
1) Counting problems
2) Sessionize used logs
3) Very large scale joins
4) Large scale matix recommendations
5) Computing the quadrillionth digit of pi. (apparently done using FFTs)

Map-Reduce in inherently batch oriented.

Many programs require iterative execution of MapReduce:
1) k-means
2) Alternating least squares
3) Page rank

Iteration of Map-Reduce in Hadoop is hopelessly bad
1) 30s delay between iteratiob
2) Loss of all memory state, side connections
3) Requires serialization of intermediate data to disk at every iteration.

Need to either change the algorithm or change the framework.

Google also put out a paper for large scale graph processing called Pregal.

Apache Giraph is an 80% clone of Pregal
 - Nodes have stat base on reading a chunk of input
 - Nodes can send messages to other nodes
 - Next super-step. each node receive all messages, sends new ones
 - When all nodes agree to stop, we are done.

Programming model is similar to Map-Reduce, but iterative. It is also still batch oriented.

Neither Map-Reduce or Giraph or good for generating progressing answers.

For real-time requirements, Storm+ online algorithms works.
  - Must have O(1) total memory and O91) time per input records
 - With suitable persistence substrate, we can cleanly combine Map-Reduce + Storm.

What is Storm?
A storm program is called a topology
 - Spouts inject data into a topology
 - Bolts process data
 - The units of data are called tuples
 - All processing is flow-through
 - Bolts can buffer or persist
 - Output tuples can by anchored

Output general goes to places like Lucene/SOLR, HBase, HDFS, or MapR transactional file system

Very cool presentation! Ted Dunning has incredible experience is BigData processing algorithms and has been with MapR since 2010.

Here are some more presentations from Ted Dunning.

  - Craig

Monday, June 18, 2012

Introducing Distributed Execution and MapReduce Framework

Read through Introducing Distributed Execution and MapReduce Framework today. I had read about some of the plans for Infinispan a while ago and was quite interested to see how it was coming along.

If you're not familiar with Infinispan, it's a Java-based, in-memory data grid sponsored by JBoss/Redhat. It's really a very cool project. What I think is most impressive about Infinispan is the ability to run Map/Reduce jobs across the data grid using the Callable interface. It looks like they have an extended DistributedCallable and DistributedExecutorServiceto help with running tasks on the data grid.

They are modeling their Map/Reduce framework after Hadoop while providing what could a more familiar Callable/ExecutorService model. It may lower the barrier to entry for a subset of developers more familiar with standard java threading models. I guess the only real drawback is that since this is an in-memory data grid, you're limited on how much data that you can traverse. However, it should be very fast since you're memory-centric instead of file-centric as with Hadoop.

  - Craig

Wednesday, June 13, 2012

ElasticSearch #5 of 9 Open Source Big Data Technologies to Watch

ElasticSearch shows up as #5 on the 9 Open Source Big Data Technologies to Watch! Hooray for ElasticSearch! It really is a great project if you're looking at search. Check it out!

  - Craig

Friday, June 8, 2012

Big Search with Big Data Principles - Eric Pugh

This is a presentation given by Eric Pugh at Lucene Revolution 2012. There is a lot of good information in Big Search with Big Data Principles, thought the audio is a bit hokey at the beginning.

Eric goes through some of his experiences with working with a large vendor, some of the obsticles encountered and solutions implemented.

Got hundreds of millions of documents to search? DataImportHandler blowing up while indexing? Random thread errors thrown by Solr Cell during document extraction? Query performance collapsing? Then you've searching at Big Data scale. This talk will focus on the underlying principles of Big Data, and how to apply them to Solr. This talk isn't a deep dive into SolrCloud, though we'll talk about it. It also isn't meant to be a talk on traditional scaling of Solr. Instead we'll talk about how to apply principles of big data like "Bring the code to the data, not the data to the code" to Solr. How to answer the question "How many servers will I need?" when your volume of data is exploding. Some examples of models for predicting server and data growth, and how to look back and see how good your models are! You'll leave this session armed with an understanding of why Big Data is the buzzword of the year, and how you can apply some of the principles to your own search environment.

  - Craig

Thursday, June 7, 2012

3 Secrets to Lightning Fast Mobile Design at Instagram

This article comes from, 3 Secrets to Lightning Fast Mobile Design at Instagram. The article is based on a slide deck from Mike Krieger, co-founder of Instagram.

Very interestingly, they talk about focusing on the user experience and worrying about the coding second. The focus is to make the user feel productive by performing actions optimistically, then backing out the actions if they are canceled, like file sharing. They automatically start uploading the photo. If the user cancels or otherwise does not want to share the photo, then it is deleted on the back end. If the user clicks the like button or posts a comment, then the action is immediately noted in the UI while the action is completed in the background.

This really is in stark contrast to traditional apps where the app only goes to work after the user takes some action. The user is forced to wait while the app works, and this is bad for the user experience. This does put more load on the back end, but really improves the user experience. Speed is king!

  - Craig

Wednesday, June 6, 2012

Slides: Scaling Massive ElasticSearch Clusters

This set of slides, Slides: Scaling Massive ElasticSearch Clusters, comes from Rafal Kuc at Sematext. Rafal gave this talk just a day or 2 ago at Berlin Buzzwords.

ElasticSearch does not have as large of a community as SOLR, but is really gaining a lot of mind share and is receiving a lot of converts. ElasticSearch is really a great project. Check it out when you get a chance!

  - Craig

Tuesday, June 5, 2012

Search is the Dominant Metaphor for Working with Big Data

Found another good article on from Eric Pugh of the Apache Software Foundation. He wrote down some quick thoughts around his attendance of the recent LuceneRevolution 2012. He has a few more articles on here.

If you'r enot paying attending to search, you really need to be. Search absolutely is a core component of Big Data processing. It's great that you can process data with Hadoop, but what do you do with it? Search/Lucene/SOLR/ElasticSearch can be used on both the input and output side. You can pre-process your data with Lucene analyzers or use search to define the data sets you need to run through Hadoop. You can also have search as your endpoint, especially in a streaming process to make your processed data available to your users.

Search is king, and no where is that more evident than with Big Data!

  - Craig

Using Java to access MongoDB, Redis, CouchDB, Riak, Cassandra

I ran across this article ,Using Java to access MongoDB, Redis, CouchDB, Riak, Cassandra, on and thought it was interesting. It looks like this guy is playing around with building his own persistence layer in an append-only logger style, kinda cool.

One of the nice things about the article is that he shows a very basic persistence example for 5 different NoSQL data stores. I haven't seen anyone show side-by-side examples like this before. Cassandra seems to be the most involved in setting up a connection and performing a quick append, though none of the examples look very difficult.

Good luck Carlo Scarioni!

  - Craig

Saturday, June 2, 2012

Build your own twitter like real time analytics - a step by step guide

This is a quick little trip over to for an article entitled Build your own twitter like real time analytics - a step by step guide.

You can certainly build a batch-based system using hadoop or something similar. The real trick is to build a real-time analytics system. The article points to the following 3 features:

  1. Use In Memory Data Grid (XAP) for handling the real time stream data-processing.
  2. BigData data-base (Cassandra) for storing the historical data and manage the trend analytics 
  3. Use Cloudify (  for managing and automating the deployment on private or public cloud 

I have been working on some batch processing techniques for processing data. I need to employ some different processing to move it to a streaming process.

  - Craig