This evening, we heard from Cameron Befus, former CTO of Tynt, which was purchased by 33 Across Inc. Here are some notes from his presentation on Hadoop.
Why use Hadoop?
A framework for parallel processing of large data sets.
Design considerations
- The system should manage and heal itself
- Performance should scale linearly as nodes are added
- Move the computation (the algorithm) to the data, not the data to the computation
Clusters tend to transition in size: local -> cloud -> dedicated hardware.
The Hadoop processing model is optimized for data retrieval:
- schema on read
- nosql databases
- map reduce
- asynchronous
- parallel computing
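To make the map reduce bullet concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce API (org.apache.hadoop.mapreduce). This is just the canonical example, not anything Cameron showed; the class names and input/output paths are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: runs in parallel on each input split, emitting (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: receives all counts for a given word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles splitting the input, scheduling the map tasks near the data, shuffling intermediate keys, and retrying failed tasks; the programmer only writes the two functions above.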
Built for unreliable, commodity hardware.
Scale by adding boxes
More cost-effective than scaling up a single machine
Sends the program to the data.
- The larger choke point is moving large volumes of data around the cluster, so the computation goes to where the data already lives (see the sketch below)
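Part of what makes commodity hardware workable is that HDFS replicates each block across several nodes (three by default), so a dead disk or box just triggers re-replication rather than data loss. Here is a small sketch using the real HDFS FileSystem API to inspect and change a file's replication factor; the file path is hypothetical and this is my illustration, not from the talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path p = new Path("/data/example.log"); // hypothetical HDFS file
    FileStatus status = fs.getFileStatus(p);
    System.out.println("Replication factor: " + status.getReplication());

    // A frequently-read file can be replicated more widely with one call.
    fs.setReplication(p, (short) 5);
  }
}
```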
#1 Data Driven Decision Making
Everyone wants to make good decisions, and as everyone knows, to make good decisions you need data: not just any data, the right data.
Walmart mined sales data from the days before approaching hurricanes and found the highest-selling products were:
#1 batteries,
#2 Pop-Tarts.
This means $$$ to Walmart.
Google can predict flu trends around the world just from search queries. Query velocity matches medical data.
With the advent of large disks, it becomes cost-effective to simply store everything, then use a system like Hadoop to run through and process the data to find value.
Combining data sets can extract value from data that isn't valuable on its own: turning lead into gold, as it were.
How big is big data?
Not just size, but also:
- complexity
- rate of growth
- performance
- retention
Other uses:
- load testing
- number crunching
- building Lucene indexes
- just about anything that can be easily parallelized
- Craig