Wednesday, July 18, 2012

Cameron Befus speaks at UHUG (Hadoop)

This evening, we heard from Cameron Befus, former CTO of Tynt, which was purchased by 33 Across Inc. Here are some notes from his presentation on Hadoop.

Why use Hadoop?

A framework for parallel processing of large data sets.

Design considerations
 - The system manages and heals itself
 - Performance should scale linearly as nodes are added
 - Compute: move the algorithm to the data, not the data to the algorithm

Cluster hosting tends to transition as you grow: local -> cloud -> dedicated hardware.

Hadoop processing is optimized for data retrieval:
 - schema on read
 - NoSQL databases
 - map reduce (see the sketch below)
 - asynchronous
 - parallel computing
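
For a concrete picture of the map reduce model mentioned above, the canonical example is word count: map tasks run in parallel over the input splits and emit (word, 1) pairs, then reduce tasks sum the counts for each word. Here is a minimal sketch against the standard org.apache.hadoop.mapreduce API; the class name and paths are placeholders, not anything from the talk.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on each input split, close to the data.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // emit (word, 1)
      }
    }
  }

  // Reduce phase: receives every count for a given word and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

It would be launched with something like hadoop jar wordcount.jar WordCount <input dir> <output dir>.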

Built for unreliable, commodity hardware.
Scale by adding boxes.
More cost effective than scaling up a single machine.

Sends the program to the data.
 - The larger choke points occur when transferring large volumes of data around the cluster
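
One way to see what makes this possible: HDFS tracks which hosts hold each block of a file, and the MapReduce scheduler uses that information to start map tasks on (or near) those hosts. Here is a small sketch, assuming the cluster configuration is on the classpath, that just prints the block locations of a file; the class name and example path are made up.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster settings (core-site.xml etc.) from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path(args[0]); // e.g. /data/logs/2012-07-18.log
    FileStatus status = fs.getFileStatus(file);

    // HDFS reports the hosts holding each block; the scheduler uses exactly
    // this information to run map tasks where the data already lives.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset %d, length %d, hosts %s%n",
          block.getOffset(), block.getLength(), Arrays.toString(block.getHosts()));
    }
  }
}
```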

#1 Data Driven Decision Making

Everyone wants to make good decisions, and as everyone knows, to make good decisions you need data. Not just any data, the right data.

Walmart mined sales data from the days before approaching hurricanes and found the highest-selling products were:
  #1 batteries,
  #2 Pop-Tarts.

This means $$$ to Walmart.

Google can predict flu trends around the world just from search queries. Query velocity matches medical data.

With the advent of large disks, it becomes cost effective to simply store everything, then use a system like Hadoop to run through the data and find the value in it.

Combining data sets can extract value from data that may not be valuable on its own. Turning lead into gold, as it were.
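
As an illustration of combining data sets, a common MapReduce pattern is a reduce-side join: each input gets its own mapper that tags its records, and every record sharing a key arrives at the same reduce call, where the data sets can be merged. The rough sketch below uses the Walmart hurricane example as a stand-in; the file layouts, field order, and class names are invented for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JoinByStore {

  // Tags each sales record so the reducer knows which data set it came from.
  // Assumed layout: store_id,product,units
  public static class SalesMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(",");
      ctx.write(new Text(f[0]), new Text("SALE\t" + f[1] + "\t" + f[2]));
    }
  }

  // Assumed layout: store_id,hurricane_warning
  public static class WeatherMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(",");
      ctx.write(new Text(f[0]), new Text("WEATHER\t" + f[1]));
    }
  }

  // Every record sharing a store_id reaches the same reduce call,
  // so the two data sets can be combined here.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text storeId, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String> sales = new ArrayList<String>();
      String weather = "unknown";
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("WEATHER\t")) {
          weather = s.substring(8);
        } else {
          sales.add(s.substring(5));
        }
      }
      for (String sale : sales) {
        ctx.write(storeId, new Text(sale + "\t" + weather));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "join sales with weather");
    job.setJarByClass(JoinByStore.class);
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, SalesMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, WeatherMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```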

How big is big data? It is not just about size:
  complexity
  rate of growth
  performance
  retention

Other uses:
  load testing
  number crunching
  building Lucene indexes
  just about anything that can be easily parallelized


  - Craig