tag:blogger.com,1999:blog-66543215924274085062024-03-12T17:51:29.212-07:00NoSql Tips and TricksCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comBlogger84125tag:blogger.com,1999:blog-6654321592427408506.post-11770387974071928772015-10-29T07:42:00.001-07:002015-10-29T07:42:26.136-07:00Elasticsearch 2.0 released!Elasticsearch 2.0 is finally out! They actually released a <a href="https://www.elastic.co/blog/release-we-have?blade=tw">number of new releases</a> including Kibana 4.2, Marvel 2.0 and Sense 2.0. Definitely a lot of new stuff! <br />
<br />
<blockquote class="tr_bq">
<strong>Released</strong> <strong><a href="https://www.elastic.co/blog/elasticsearch-2-0-0-released">Elasticsearch 2.0</a></strong>.
A major milestone and achievement of the whole team, and wonderful
contributions from the community. New type of aggregations called
pipeline aggs, simplified query DSL by merging query and filter
concepts, better compression options, hardened security by enabling
security manager, hardening of FS behavior (fsync, more checksums,
atomic renames), performance, consistent mapping behavior, and many
more. Also, it bundles Lucene 5 release, which includes numerous
improvements, several of them led by our team.</blockquote>
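The pipeline aggregations mentioned above let one aggregation consume the output of another instead of re-scanning documents. As a rough sketch (the "sales" index and the "date"/"price" fields below are hypothetical, not from the release notes), a derivative pipeline agg over a monthly sum looks like this:

```shell
# Hypothetical Elasticsearch 2.0 pipeline aggregation: the "derivative" agg
# reads the sibling "sales" metric through buckets_path.
QUERY='{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": { "field": "date", "interval": "month" },
      "aggs": {
        "sales": { "sum": { "field": "price" } },
        "sales_deriv": { "derivative": { "buckets_path": "sales" } }
      }
    }
  }
}'
echo "$QUERY"
# Against a live cluster this would be posted with something like:
# curl -XGET 'localhost:9200/sales/_search' -d "$QUERY"
```

The point of `buckets_path` is that the derivative is computed from the already-built histogram buckets, which is what "pipeline" means here.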
<br />
<a href="https://www.youtube.com/watch?v=KlXG2XLHYc0">Watch the release webinar now!</a><br />
<br />
- Craig Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-74185419561967722472015-05-22T10:55:00.000-07:002015-05-22T10:55:34.892-07:00FastCompany: Googles InnerTube Project to Retool YouTube<a href="http://m.fastcompany.com/3044995/to-take-on-hbo-and-netflix-youtube-had-to-rewire-itself">Fast Company: Google InnerTube project to retool YouTube</a> is a great article talking about a lot of the more recent changes that Google has been making in order to fix some internal development issues and make YouTube more of a competitive destination for watching content.<br />
<br />
It's funny hearing them talk about the mess of Frankenstein projects that had grown up over the years. It seems like every project in continuous development turns into a Frankenstein. It's a constant battle to keep a project clean. If you don't, you end up reaching a point where you almost have to start over from scratch to get the project where you need it to be.<br />
<br />
Enjoy!<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-41301056248385896742015-01-10T13:59:00.001-08:002015-01-10T13:59:10.602-08:00Nathan Paco - A New Year in Data Science: ML Unpaused<a href="http://www.slideshare.net/pacoid/a-new-year-in-data-science-ml-unpaused">A New Year in Data Science: ML Unpaused</a><br />
A good read on Data Science and some of the advances being made more recently. He presents some history around machine learning and looks at what the real job of Data Science involves.<br />
<br />
There are some good insights around the need to focus on data and features instead of focusing on which algorithm to use. Many times people look more at how to build something instead of what problem they are trying to solve. <br />
<br />
Nathan certainly knows what he's talking about. I met him last year at Hadoop Summit in San Jose. He has been working with the Spark guys and Databricks. Take a minute and look him up.<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-84648061047098768122014-12-31T10:18:00.003-08:002014-12-31T10:27:53.763-08:00Elasticsearch 5-node PicoCluster test - 1,000,000 tweets and 3495.1316MB in just under 9 hours.We've been doing some benchmarks with Elasticsearch on the 5-node PicoCluster. We were able to push 1,000,000 tweets and 3495.1316MB in just under 9 hours. The tweets are small, around 1KB each, but have a lot of fields and are pretty complicated. That's pretty good considering that the SD cards are not very good at small reads and writes.<br />
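As a quick sanity check on those numbers (arithmetic only, not part of the original test harness): 1,000,000 docs over 531 minutes 3 seconds works out to roughly the 31.38 docs/second the log reports.

```shell
# Cross-check the reported aggregate rate: 1,000,000 docs in 531m 3s.
total_docs=1000000
total_secs=$(( 531 * 60 + 3 ))   # 31863 seconds, just under 9 hours
rate=$(awk -v d="$total_docs" -v s="$total_secs" 'BEGIN { printf "%.2f", d / s }')
echo "$rate docs/sec"   # 31.38 docs/sec, in line with the log's 31.38405
```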
<br />
<br />
Pushed 1000 docs in 34.457 secs at 29.021679 docs/sec 3.4431486 MB/sec, 980000 total docs in 520 minutes 33 seconds at 31.37682 docs/second 98.0% complete<br />
Pushed 1000 docs in 36.632 secs at 27.298536 docs/sec 3.5004349 MB/sec, 981000 total docs in 521 minutes 9 seconds at 31.372042 docs/second 98.1% complete<br />
Pushed 1000 docs in 34.607 secs at 28.89589 docs/sec 3.5908194 MB/sec, 982000 total docs in 521 minutes 44 seconds at 31.369303 docs/second 98.2% complete<br />
Pushed 1000 docs in 30.67 secs at 32.605152 docs/sec 3.3349895 MB/sec, 983000 total docs in 522 minutes 15 seconds at 31.370514 docs/second 98.299995% complete<br />
Pushed 1000 docs in 31.243 secs at 32.007168 docs/sec 3.431964 MB/sec, 984000 total docs in 522 minutes 46 seconds at 31.37115 docs/second 98.4% complete<br />
Pushed 1000 docs in 28.858 secs at 34.652435 docs/sec 3.4087648 MB/sec, 985000 total docs in 523 minutes 15 seconds at 31.374163 docs/second 98.5% complete<br />
Pushed 1000 docs in 29.598 secs at 33.786068 docs/sec 3.4104357 MB/sec, 986000 total docs in 523 minutes 44 seconds at 31.376436 docs/second 98.6% complete<br />
Pushed 1000 docs in 32.356 secs at 30.90617 docs/sec 3.4084692 MB/sec, 987000 total docs in 524 minutes 17 seconds at 31.375952 docs/second 98.7% complete<br />
Pushed 1000 docs in 37.807 secs at 26.450129 docs/sec 3.4255342 MB/sec, 988000 total docs in 524 minutes 55 seconds at 31.370039 docs/second 98.799995% complete<br />
Pushed 1000 docs in 33.404 secs at 29.936535 docs/sec 3.4184904 MB/sec, 989000 total docs in 525 minutes 28 seconds at 31.36852 docs/second 98.9% complete<br />
Pushed 1000 docs in 34.465 secs at 29.014942 docs/sec 3.4793549 MB/sec, 990000 total docs in 526 minutes 2 seconds at 31.36595 docs/second 99.0% complete<br />
Pushed 1000 docs in 30.792 secs at 32.475967 docs/sec 3.4305592 MB/sec, 991000 total docs in 526 minutes 33 seconds at 31.367033 docs/second 99.1% complete<br />
Pushed 1000 docs in 29.749 secs at 33.614574 docs/sec 3.4574842 MB/sec, 992000 total docs in 527 minutes 3 seconds at 31.369146 docs/second 99.2% complete<br />
Pushed 1000 docs in 32.825 secs at 30.464584 docs/sec 3.370614 MB/sec, 993000 total docs in 527 minutes 36 seconds at 31.368208 docs/second 99.299995% complete<br />
Pushed 1000 docs in 37.048 secs at 26.99201 docs/sec 3.451209 MB/sec, 994000 total docs in 528 minutes 13 seconds at 31.36309 docs/second 99.4% complete<br />
Pushed 1000 docs in 35.307 secs at 28.322996 docs/sec 3.3885374 MB/sec, 995000 total docs in 528 minutes 48 seconds at 31.35971 docs/second 99.5% complete<br />
Pushed 1000 docs in 37.64 secs at 26.567482 docs/sec 3.4242926 MB/sec, 996000 total docs in 529 minutes 26 seconds at 31.35403 docs/second 99.6% complete<br />
Pushed 1000 docs in 28.108 secs at 35.57706 docs/sec 3.4203835 MB/sec, 997000 total docs in 529 minutes 54 seconds at 31.357765 docs/second 99.7% complete<br />
Pushed 1000 docs in 28.886 secs at 34.618847 docs/sec 3.4412107 MB/sec, 998000 total docs in 530 minutes 23 seconds at 31.360725 docs/second 99.8% complete<br />
Pushed 1000 docs in 40.074 secs at 24.953835 docs/sec 3.4108858 MB/sec, 999000 total docs in 531 minutes 3 seconds at 31.352667 docs/second 99.9% complete<br />
Pushed 1000 docs in 0.0 secs at Infinity docs/sec 3.4554148 MB/sec, 1000000 total docs in 531 minutes 3 seconds at 31.38405 docs/second 100.0% complete<br />
<br />
Pushed 1000000 total docs and 3495.1316MB in 531 minutes 3 seconds at 31.38405 docs per second. Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-25299947958360510512014-12-18T17:25:00.000-08:002014-12-18T17:25:19.294-08:0010GB Terasort Benchmark on 5-node Raspberry PI Cluster - 2H 52m 56s
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">We just set a new record for the 10GB terasort on a 5-node PicoCluster! We cut over an hour off the benchmark time, bringing the total to under 3 hours! Pretty amazing! </span></span></h1>
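For context, a 10GB TeraSort on Hadoop 1.x is normally driven by the bundled examples jar; the HDFS paths below are illustrative, not taken from this run. TeraGen writes 100-byte rows, so 100,000,000 rows gives exactly the 10GB seen in the counters:

```shell
# TeraGen rows are 100 bytes each; check that 100M rows = 10 GB of input.
rows=100000000
bytes=$(( rows * 100 ))
echo "$bytes"   # 10000000000, matching the Map output bytes counter below
# The jobs themselves would be launched roughly like this (illustrative paths):
# bin/hadoop jar hadoop-examples-1.2.1.jar teragen 100000000 /terasort-in
# bin/hadoop jar hadoop-examples-1.2.1.jar terasort /terasort-in /terasort-out
```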
Hadoop job_201412181311_0002 on <a href="http://pi0:50030/jobtracker.jsp">master</a>
<b>User:</b> hadoop<br />
<b>Job Name:</b> TeraSort<br />
<b>Job File:</b> <a href="http://pi0:50030/jobconf.jsp?jobid=job_201412181311_0002">hdfs://pi0:54310/tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201412181311_0002/job.xml</a><br />
<b>Submit Host:</b> pi0<br />
<b>Submit Host Address:</b> 10.1.10.120<br />
<b>Job-ACLs: All users are allowed</b><br /><b>Job Setup:</b><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201412181311_0002&type=setup&pagenum=1&state=completed"> Successful</a><br />
<b>Status:</b> Succeeded<br />
<b>Started at:</b> Thu Dec 18 14:54:20 MST 2014<br />
<b>Finished at:</b> Thu Dec 18 17:47:16 MST 2014<br />
<b>Finished in:</b> 2hrs, 52mins, 56sec<br />
<b>Job Cleanup:</b><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201412181311_0002&type=cleanup&pagenum=1&state=completed"> Successful</a><br />
<hr />
<table border="2" cellpadding="5" cellspacing="2"><tbody>
<tr><th>Kind</th><th>% Complete</th><th>Num Tasks</th><th>Pending</th><th>Running</th><th>Complete</th><th>Killed</th><th><a href="http://pi0:50030/jobfailures.jsp?jobid=job_201412181311_0002">Failed/Killed<br />Task Attempts</a></th></tr>
<tr><th><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201412181311_0002&type=map&pagenum=1">map</a></th><td align="right">100.00%<table border="1px" style="width: 80pxpx;"><tbody>
<tr><td cellspacing="0" class="perc_filled" width="100%"><br /></td></tr>
</tbody></table>
</td><td align="right">80</td><td align="right">0</td><td align="right">0</td><td align="right"><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201412181311_0002&type=map&pagenum=1&state=completed">80</a></td><td align="right">0</td><td align="right">0 / 0</td></tr>
<tr><th><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201412181311_0002&type=reduce&pagenum=1">reduce</a></th><td align="right">100.00%<table border="1px" style="width: 80pxpx;"><tbody>
<tr><td cellspacing="0" class="perc_filled" width="100%"><br /></td></tr>
</tbody></table>
</td><td align="right">80</td><td align="right">0</td><td align="right">0</td><td align="right"><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201412181311_0002&type=reduce&pagenum=1&state=completed">80</a></td><td align="right">0</td><td align="right">0 / 0</td></tr>
</tbody></table>
<br />
<table border="2" cellpadding="5" cellspacing="2">
<tbody>
<tr>
<th><br /></th>
<th>Counter</th>
<th>Map</th>
<th>Reduce</th>
<th>Total</th>
</tr>
<tr>
<td rowspan="17">
Map-Reduce Framework</td>
<td>Spilled Records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">300,000,000</td>
</tr>
<tr>
<td>Map output materialized bytes</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,200,038,400</td>
</tr>
<tr>
<td>Reduce input records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>Virtual memory (bytes) snapshot</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">46,356,074,496</td>
</tr>
<tr>
<td>Map input records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>SPLIT_RAW_BYTES</td>
<td align="right">8,800</td>
<td align="right">0</td>
<td align="right">8,800</td>
</tr>
<tr>
<td>Map output bytes</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,000,000,000</td>
</tr>
<tr>
<td>Reduce shuffle bytes</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,200,038,400</td>
</tr>
<tr>
<td>Physical memory (bytes) snapshot</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">32,931,528,704</td>
</tr>
<tr>
<td>Map input bytes</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,000,000,000</td>
</tr>
<tr>
<td>Reduce input groups</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>Combine output records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td>Reduce output records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>Map output records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>Combine input records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td>CPU time spent (ms)</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">27,827,080</td>
</tr>
<tr>
<td>Total committed heap usage (bytes)</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">32,344,113,152</td>
</tr>
<tr>
<td rowspan="1">
File Input Format Counters </td>
<td>Bytes Read</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,000,144,320</td>
</tr>
<tr>
<td rowspan="4">
FileSystemCounters</td>
<td>HDFS_BYTES_READ</td>
<td align="right">10,000,153,120</td>
<td align="right">0</td>
<td align="right">10,000,153,120</td>
</tr>
<tr>
<td>FILE_BYTES_WRITTEN</td>
<td align="right">20,404,679,750</td>
<td align="right">10,204,290,230</td>
<td align="right">30,608,969,980</td>
</tr>
<tr>
<td>FILE_BYTES_READ</td>
<td align="right">10,265,248,834</td>
<td align="right">10,200,000,960</td>
<td align="right">20,465,249,794</td>
</tr>
<tr>
<td>HDFS_BYTES_WRITTEN</td>
<td align="right">0</td>
<td align="right">10,000,000,000</td>
<td align="right">10,000,000,000</td>
</tr>
<tr>
<td rowspan="1">
File Output Format Counters </td>
<td>Bytes Written</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,000,000,000</td>
</tr>
<tr>
<td rowspan="8">
Job Counters </td>
<td>Launched map tasks</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">80</td>
</tr>
<tr>
<td>Launched reduce tasks</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">80</td>
</tr>
<tr>
<td>SLOTS_MILLIS_REDUCES</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">28,079,434</td>
</tr>
<tr>
<td>Total time spent by all reduces waiting after reserving slots (ms)</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td>SLOTS_MILLIS_MAPS</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">22,051,330</td>
</tr>
<tr>
<td>Total time spent by all maps waiting after reserving slots (ms)</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td>Rack-local map tasks</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">30</td>
</tr>
<tr>
<td>Data-local map tasks</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">50</td>
</tr>
</tbody></table>
<hr />
Map Completion Graph -
<a href="http://pi0:50030/jobdetails.jsp?jobid=job_201412181311_0002&refresh=0&map.graph=off"> close </a>
<br />
<hr />
Reduce Completion Graph -
<a href="http://pi0:50030/jobdetails.jsp?jobid=job_201412181311_0002&refresh=0&reduce.graph=off"> close </a>
<br />
<hr />
<hr />
<hr />
<a href="http://pi0:50030/jobtracker.jsp">Go back to JobTracker</a><br /><hr />
This is <a href="http://hadoop.apache.org/">Apache Hadoop</a> release 1.2.1
<table border="0"> <tbody>
<tr>
</tr>
</tbody></table>
Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-41331968052878155662014-12-04T16:31:00.001-08:002014-12-04T16:31:31.744-08:00The Daily WTF: The Robot GuysNot really Big Data, but pretty funny :)<br />
<br />
<a href="http://thedailywtf.com/articles/the-robot-guys">http://thedailywtf.com/articles/the-robot-guys</a><br />
<br />
- Craig <br />
<br />Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-29940014857441028592014-11-13T21:36:00.002-08:002014-11-13T21:36:42.289-08:0010GB Terasort Benchmark on 5-node Raspberry PI Cluster - 4H 12m 5s
After more than a week of testing, tweaking, and retesting, we were able to successfully run a 10GB Terasort on a 5-node Raspberry PI cluster! Each node has 512MB ram and a 16GB SD card.<br />
<br />
<h1>
Hadoop job_201411131351_0001 on <a href="http://pi0:50030/jobtracker.jsp">master</a></h1>
<b>User:</b> hadoop<br />
<b>Job Name:</b> TeraSort<br />
<b>Job File:</b> <a href="http://pi0:50030/jobconf.jsp?jobid=job_201411131351_0001">hdfs://pi0:54310/tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201411131351_0001/job.xml</a><br />
<b>Submit Host:</b> pi0<br />
<b>Submit Host Address:</b> 10.1.10.120<br />
<b>Job-ACLs: All users are allowed</b><br /><b>Job Setup:</b><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201411131351_0001&type=setup&pagenum=1&state=completed"> Successful</a><br />
<b>Status:</b> Succeeded<br />
<b>Started at:</b> Thu Nov 13 14:23:35 MST 2014<br />
<b>Finished at:</b> Thu Nov 13 18:35:40 MST 2014<br />
<b>Finished in:</b> 4hrs, 12mins, 5sec<br />
<b>Job Cleanup:</b><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201411131351_0001&type=cleanup&pagenum=1&state=completed"> Successful</a><br />
<hr />
<table border="2" cellpadding="5" cellspacing="2"><tbody>
<tr><th>Kind</th><th>% Complete</th><th>Num Tasks</th><th>Pending</th><th>Running</th><th>Complete</th><th>Killed</th><th><a href="http://pi0:50030/jobfailures.jsp?jobid=job_201411131351_0001">Failed/Killed<br />Task Attempts</a></th></tr>
<tr><th><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201411131351_0001&type=map&pagenum=1">map</a></th><td align="right">100.00%<table border="1px" style="width: 80pxpx;"><tbody>
<tr><td cellspacing="0" class="perc_filled" width="100%"><br /></td></tr>
</tbody></table>
</td><td align="right">152</td><td align="right">0</td><td align="right">0</td><td align="right"><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201411131351_0001&type=map&pagenum=1&state=completed">152</a></td><td align="right">0</td><td align="right">0 / 0</td></tr>
<tr><th><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201411131351_0001&type=reduce&pagenum=1">reduce</a></th><td align="right">100.00%<table border="1px" style="width: 80pxpx;"><tbody>
<tr><td cellspacing="0" class="perc_filled" width="100%"><br /></td></tr>
</tbody></table>
</td><td align="right">152</td><td align="right">0</td><td align="right">0</td><td align="right"><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201411131351_0001&type=reduce&pagenum=1&state=completed">152</a></td><td align="right">0</td><td align="right">0 / 0</td></tr>
</tbody></table>
<br />
<table border="2" cellpadding="5" cellspacing="2">
<tbody>
<tr>
<th><br /></th>
<th>Counter</th>
<th>Map</th>
<th>Reduce</th>
<th>Total</th>
</tr>
<tr>
<td rowspan="1">
File Input Format Counters </td>
<td>Bytes Read</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,000,298,372</td>
</tr>
<tr>
<td rowspan="8">
Job Counters </td>
<td>SLOTS_MILLIS_MAPS</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">24,993,499</td>
</tr>
<tr>
<td>Launched reduce tasks</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">152</td>
</tr>
<tr>
<td>Total time spent by all reduces waiting after reserving slots (ms)</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td>Rack-local map tasks</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">144</td>
</tr>
<tr>
<td>Total time spent by all maps waiting after reserving slots (ms)</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td>Launched map tasks</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">152</td>
</tr>
<tr>
<td>Data-local map tasks</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">8</td>
</tr>
<tr>
<td>SLOTS_MILLIS_REDUCES</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">34,824,665</td>
</tr>
<tr>
<td rowspan="1">
File Output Format Counters </td>
<td>Bytes Written</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,000,000,000</td>
</tr>
<tr>
<td rowspan="4">
FileSystemCounters</td>
<td>FILE_BYTES_READ</td>
<td align="right">10,341,496,856</td>
<td align="right">10,200,000,912</td>
<td align="right">20,541,497,768</td>
</tr>
<tr>
<td>HDFS_BYTES_READ</td>
<td align="right">10,000,315,092</td>
<td align="right">0</td>
<td align="right">10,000,315,092</td>
</tr>
<tr>
<td>FILE_BYTES_WRITTEN</td>
<td align="right">20,409,243,506</td>
<td align="right">10,208,123,719</td>
<td align="right">30,617,367,225</td>
</tr>
<tr>
<td>HDFS_BYTES_WRITTEN</td>
<td align="right">0</td>
<td align="right">10,000,000,000</td>
<td align="right">10,000,000,000</td>
</tr>
<tr>
<td rowspan="17">
Map-Reduce Framework</td>
<td>Map output materialized bytes</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,200,138,624</td>
</tr>
<tr>
<td>Map input records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>Reduce shuffle bytes</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,200,138,624</td>
</tr>
<tr>
<td>Spilled Records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">300,000,000</td>
</tr>
<tr>
<td>Map output bytes</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,000,000,000</td>
</tr>
<tr>
<td>Total committed heap usage (bytes)</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">57,912,754,176</td>
</tr>
<tr>
<td>CPU time spent (ms)</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">40,328,090</td>
</tr>
<tr>
<td>Map input bytes</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,000,000,000</td>
</tr>
<tr>
<td>SPLIT_RAW_BYTES</td>
<td align="right">16,720</td>
<td align="right">0</td>
<td align="right">16,720</td>
</tr>
<tr>
<td>Combine input records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td>Reduce input records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>Reduce input groups</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>Combine output records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td>Physical memory (bytes) snapshot</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">52,945,952,768</td>
</tr>
<tr>
<td>Reduce output records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>Virtual memory (bytes) snapshot</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">123,024,928,768</td>
</tr>
<tr>
<td>Map output records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
</tbody></table>
<hr />
Map Completion Graph
<hr />
Reduce Completion Graph
<hr />
<hr />
<hr />
<a href="http://pi0:50030/jobtracker.jsp">Go back to JobTracker</a><br /><hr />
This is <a href="http://hadoop.apache.org/">Apache Hadoop</a> release 1.2.1
<table border="0"> <tbody>
<tr>
</tr>
</tbody></table>
<table border="0"><tbody>
<tr></tr>
</tbody></table>
Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-81512162661524176942014-11-13T20:50:00.000-08:002014-11-13T20:50:14.621-08:00Installing Hadoop on a Raspberry PI clusterThis document details setting up and testing Hadoop on a Raspberry PI cluster. The details are almost exactly the same for any Linux flavor, particularly Debian/Ubuntu. There are only a couple of PI specifics.<br />
<br />
In this case, we have a 5-node cluster. If your cluster is a different size, make sure the hadoop distribution is on each of the machines, make sure you update the slaves file, and copy the config files to each machine as shown.<br />
<br />
<h3>
Machine Setup</h3>
The first thing to do is to add DNS entries for all of your cluster machines. If you have a DNS server available, that's great. If you don't, you can edit the hosts file and add the correct entries there. This has to be done on every machine in the cluster. My entries look like this:<br />
<br />
<pre>10.1.10.120 pi0 master
10.1.10.121 pi1
10.1.10.122 pi2
10.1.10.123 pi3
10.1.10.124 pi4</pre>
For each machine, the hostname needs to be changed to reflect the hostnames defined above. The default Raspbian hostname is <code>raspberrypi</code>.<br />
<br />
Edit the <code>/etc/hostname</code> file and change the entry to the hostname that you want. In my case, the first PI is <code>pi0</code>. A quick reboot of each node should verify the correct hostname of each node.<br />
<br />
We want to add a hadoop user to each machine. This is not strictly necessary for hadoop to run, but we do this to keep our hadoop stuff separate from anything else on these machines.<br />
<br />
<pre>pi@pi0 ~ $ sudo adduser hadoop
Adding user `hadoop' ...
Adding new group `hadoop' (1004) ...
Adding new user `hadoop' (1001) with group `hadoop' ...
Creating home directory `/home/hadoop' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hadoop
Enter the new value, or press ENTER for the default
Full Name []: Hadoop
Room Number []:
Work Phone []:
Home Phone []:
Other []:
Is the information correct? [Y/n] Y </pre>
Log out as the pi user and log back in as the hadoop user on your master node.
Create a private identity key and copy it to your authorized_keys file.
<br />
<pre>ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys</pre>
Copy the identity key to all of your other nodes.<br />
<pre>ssh-copy-id -i .ssh/id_dsa.pub hadoop@pi1
ssh-copy-id -i .ssh/id_dsa.pub hadoop@pi2
ssh-copy-id -i .ssh/id_dsa.pub hadoop@pi3
ssh-copy-id -i .ssh/id_dsa.pub hadoop@pi4</pre>
You can also copy the keys directly to the new machine with something like this.<br />
<br />
<pre>scp .ssh/id_dsa.pub hadoop@pi1:.ssh/authorized_keys</pre>
<h3>
Hadoop Setup </h3>
Copy the distribution to the hadoop user account and decompress it.
I always create a symlink to make things a bit easier:<br />
<pre>ln -s hadoop-1.x.x hadoop</pre>
Now we need to edit some hadoop config files to make everything work.<br />
<br />
<pre>cd hadoop/conf
ls
</pre>
<br />
<br />
This should give you a listing that looks like this:
<br />
<pre>hadoop@raspberrypi ~/hadoop/conf $ ls
capacity-scheduler.xml core-site.xml hadoop-env.sh hadoop-policy.xml
log4j.properties mapred-site.xml slaves ssl-server.xml.example
configuration.xsl fair-scheduler.xml hadoop-metrics2.properties hdfs-site.xml
mapred-queue-acls.xml masters ssl-client.xml.example taskcontroller.cfg
</pre>
<br />
<br />
The files we're interested in are masters and slaves. Add the address of the node that will be the primary NameNode and JobTracker to the masters file. Add the addresses of the nodes that will be the DataNodes and TaskTrackers to the slaves file. You can also add the master node address to the slaves file if you want it to store data and run map reduce jobs, but I'm not doing that in this case.
These files look like this for my cluster:
<br />
<pre>hadoop@raspberrypi ~/hadoop/conf $ more masters
pi0
hadoop@raspberrypi ~/hadoop/conf $ more slaves
pi1
pi2
pi3
pi4
</pre>
<br />
<br />
These files need to be copied to all the nodes in the cluster. The easy way to do this would be as follows:<br />
<br />
<pre>hadoop@pi0 ~/hadoop/conf $ scp masters slaves pi1:hadoop/conf/.
masters 100% 12 0.0KB/s 00:00
slaves 100% 48 0.1KB/s 00:00
hadoop@pi0 ~/hadoop/conf $ scp masters slaves pi2:hadoop/conf/.
masters 100% 12 0.0KB/s 00:00
slaves 100% 48 0.1KB/s 00:00
hadoop@pi0 ~/hadoop/conf $ scp masters slaves pi3:hadoop/conf/.
masters 100% 12 0.0KB/s 00:00
slaves 100% 48 0.1KB/s 00:00
hadoop@pi0 ~/hadoop/conf $ scp masters slaves pi4:hadoop/conf/.
masters 100% 12 0.0KB/s 00:00
slaves 100% 48 0.1KB/s 00:00
</pre>
<br />
<br />
We need to define the JAVA_HOME setting in the hadoop-env.sh file. Since Hadoop runs on Java, it needs to know where the executable is. The Oracle JDK comes standard with the latest Raspbian release. We can set JAVA_HOME as follows:<br />
<br />
<pre># The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/jdk-7-oracle-armhf
</pre>
<br />
<br />
Next we need to look at the core-site.xml, hdfs-site.xml and mapred-site.xml files. These are the configuration files for the distributed file system HDFS and for the code that runs the map reduce programs we want to run. By default, both of these processes put their files in temporary file system space. That's fine for playing around, but as soon as you reboot, all of your data disappears and you have to start over again.
We first need to tell each machine where the NameNode is. This is configured in the core-site.xml file.
<br />
<pre><property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>
</pre>
Next we need to tell each node where the JobTracker is, and give it a safe place to store data so that it stays permanent. This is configured in the mapred-site.xml file.<br />
<br />
<pre><property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/home/hadoop/mapred</value>
</property>
</pre>
<br />
Lastly we need to specify safe directories to store our files for HDFS.<br />
<br />
<pre><property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/data</value>
</property>
</pre>
All of these files need to be copied to each of the other nodes in the cluster. We can do that like we did before with the masters and slaves files.
<br />
<pre>hadoop@pi0 ~/hadoop/conf $ scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml hadoop@pi1:hadoop/conf/.
hadoop-env.sh 100% 2218 2.2KB/s 00:00
core-site.xml 100% 268 0.3KB/s 00:00
hdfs-site.xml 100% 347 0.3KB/s 00:00
mapred-site.xml 100% 356 0.4KB/s 00:00
hadoop@pi0 ~/hadoop/conf $ scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml hadoop@pi2:hadoop/conf/.
hadoop-env.sh 100% 2218 2.2KB/s 00:00
core-site.xml 100% 268 0.3KB/s 00:00
hdfs-site.xml 100% 347 0.3KB/s 00:00
mapred-site.xml 100% 356 0.4KB/s 00:00
hadoop@pi0 ~/hadoop/conf $ scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml hadoop@pi3:hadoop/conf/.
hadoop-env.sh 100% 2218 2.2KB/s 00:00
core-site.xml 100% 268 0.3KB/s 00:00
hdfs-site.xml 100% 347 0.3KB/s 00:00
mapred-site.xml 100% 356 0.4KB/s 00:00
hadoop@pi0 ~/hadoop/conf $ scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml hadoop@pi4:hadoop/conf/.
hadoop-env.sh 100% 2218 2.2KB/s 00:00
core-site.xml 100% 268 0.3KB/s 00:00
hdfs-site.xml 100% 347 0.3KB/s 00:00
mapred-site.xml 100% 356 0.4KB/s 00:00
</pre>
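Typing the same scp command once per node gets tedious as the cluster grows. A small loop handles it; this is a sketch that matches this cluster's worker hostnames (pi1-pi4) and config path, so adjust both for your own setup:

```shell
# Copy the Hadoop config files to every worker node in one loop instead
# of four hand-typed scp commands. Each command is logged to scp_cmds.txt
# for review. DRY_RUN=1 (the default here) only prints the commands;
# set DRY_RUN=0 to actually copy.
: "${DRY_RUN:=1}"
FILES="hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml"
: > scp_cmds.txt
for host in pi1 pi2 pi3 pi4; do
    cmd="scp $FILES hadoop@$host:hadoop/conf/."
    echo "$cmd" | tee -a scp_cmds.txt
    [ "$DRY_RUN" = 1 ] || $cmd
done
```

Adding a node later only means adding one hostname to the list.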
<br />
<br />
We have one interesting task left. By default, the hadoop script wants to run the DataNode in server mode. The Oracle Java distribution for the Raspberry Pi does not support this option, so we need to edit the bin/hadoop script to remove it.<br />
<br />
<pre>cd bin
vi hadoop
Search for “-server”
Change this:
HADOOP_OPTS="$HADOOP_OPTS -server $HADOOP_DATANODE_OPTS"
to this:
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_DATANODE_OPTS" </pre>
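The vi edit above can also be done non-interactively with GNU sed. The snippet below demonstrates the edit on a stand-in file containing just the relevant line; on the cluster you would run the same sed command against bin/hadoop itself (keep a backup first):

```shell
# Reproduce the relevant line from bin/hadoop in a stand-in file.
printf '%s\n' 'HADOOP_OPTS="$HADOOP_OPTS -server $HADOOP_DATANODE_OPTS"' > hadoop-demo
# Remove the -server flag, exactly as the manual vi edit does.
sed -i 's/ -server / /' hadoop-demo
cat hadoop-demo   # -> HADOOP_OPTS="$HADOOP_OPTS $HADOOP_DATANODE_OPTS"
```

Doing it with sed means the change can be scripted across all five nodes instead of edited by hand on each one.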
We need to copy this file to all of the nodes in the cluster.<br />
<br />
<pre>hadoop@pi0 ~/hadoop/bin $ scp hadoop hadoop@pi1:hadoop/bin/.
hadoop 100% 14KB 13.8KB/s 00:00
hadoop@pi0 ~/hadoop/bin $ scp hadoop hadoop@pi2:hadoop/bin/.
hadoop 100% 14KB 13.8KB/s 00:00
hadoop@pi0 ~/hadoop/bin $ scp hadoop hadoop@pi3:hadoop/bin/.
hadoop 100% 14KB 13.8KB/s 00:00
hadoop@pi0 ~/hadoop/bin $ scp hadoop hadoop@pi4:hadoop/bin/.
hadoop 100% 14KB 13.8KB/s 00:00
</pre>
<br />
<br />
Now we're ready to get things up and running!
First we need to format the NameNode; this sets up the storage structure for HDFS. This is done with the bin/hadoop namenode -format command as follows:<br />
<br />
<pre>hadoop@pi0 ~/hadoop $ bin/hadoop namenode -format
14/11/04 10:53:28 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = pi0/10.1.10.120
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.1.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1411108; compiled by 'hortonfo' on Mon Nov 19 10:48:11 UTC 2012
************************************************************/
14/11/04 10:53:30 INFO util.GSet: VM type = 32-bit
14/11/04 10:53:30 INFO util.GSet: 2% max memory = 19.335 MB
14/11/04 10:53:30 INFO util.GSet: capacity = 2^22 = 4194304 entries
14/11/04 10:53:30 INFO util.GSet: recommended=4194304, actual=4194304
14/11/04 10:53:34 INFO namenode.FSNamesystem: fsOwner=hadoop
14/11/04 10:53:35 INFO namenode.FSNamesystem: supergroup=supergroup
14/11/04 10:53:35 INFO namenode.FSNamesystem: isPermissionEnabled=true
14/11/04 10:53:35 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
14/11/04 10:53:35 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
14/11/04 10:53:35 INFO namenode.NameNode: Caching file names occuring more than 10 times
14/11/04 10:53:37 INFO common.Storage: Image file of size 112 saved in 0 seconds.
14/11/04 10:53:38 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/home/hadoop/name/current/edits
14/11/04 10:53:38 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/home/hadoop/name/current/edits
14/11/04 10:53:39 INFO common.Storage: Storage directory /home/hadoop/name has been successfully formatted.
14/11/04 10:53:39 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at pi0/10.1.10.120
************************************************************/</pre>
Since we set up passwordless SSH logins earlier, we can now start the entire cluster with a single command!
<br />
<pre>hadoop@pi0 ~/hadoop $ bin/start-all.sh
starting namenode, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-namenode-pi0.out
pi1: starting datanode, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-datanode-pi1.out
pi3: starting datanode, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-datanode-pi3.out
pi4: starting datanode, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-datanode-pi4.out
pi2: starting datanode, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-datanode-pi2.out
pi0: starting secondarynamenode, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-secondarynamenode-pi0.out
starting jobtracker, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-jobtracker-pi0.out
pi1: starting tasktracker, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-tasktracker-pi1.out
pi2: starting tasktracker, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-tasktracker-pi2.out
pi4: starting tasktracker, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-tasktracker-pi4.out
pi3: starting tasktracker, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-tasktracker-pi3.out
</pre>
<br />
<br />
Open up a browser and go to the following URLs (pi0 is the master, or NameNode, in your cluster):
<br />
<pre>http://pi0:50070/dfshealth.jsp
http://pi0:50030/jobtracker.jsp
</pre>
<br />
<h3>
Testing the Setup</h3>
Now we can run something simple and quick to make sure things are working.<br />
<br />
<pre>hadoop@pi0 ~/hadoop $ bin/hadoop jar hadoop-examples-1.1.1.jar pi 4 1000
Number of Maps = 4
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Starting Job
14/11/04 11:50:50 INFO mapred.FileInputFormat: Total input paths to process : 4
14/11/04 11:50:55 INFO mapred.JobClient: Running job: job_201411041127_0001
14/11/04 11:50:56 INFO mapred.JobClient: map 0% reduce 0%
14/11/04 11:51:49 INFO mapred.JobClient: map 25% reduce 0%
14/11/04 11:51:54 INFO mapred.JobClient: map 100% reduce 0%
14/11/04 11:52:22 INFO mapred.JobClient: map 100% reduce 66%
14/11/04 11:52:25 INFO mapred.JobClient: map 100% reduce 100%
14/11/04 11:52:44 INFO mapred.JobClient: Job complete: job_201411041127_0001
14/11/04 11:52:44 INFO mapred.JobClient: Counters: 30
14/11/04 11:52:44 INFO mapred.JobClient: Job Counters
14/11/04 11:52:44 INFO mapred.JobClient: Launched reduce tasks=1
14/11/04 11:52:44 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=129653
14/11/04 11:52:44 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/11/04 11:52:44 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/11/04 11:52:44 INFO mapred.JobClient: Launched map tasks=4
14/11/04 11:52:44 INFO mapred.JobClient: Data-local map tasks=4
14/11/04 11:52:44 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=34464
14/11/04 11:52:44 INFO mapred.JobClient: File Input Format Counters
14/11/04 11:52:44 INFO mapred.JobClient: Bytes Read=472
14/11/04 11:52:44 INFO mapred.JobClient: File Output Format Counters
14/11/04 11:52:44 INFO mapred.JobClient: Bytes Written=97
14/11/04 11:52:44 INFO mapred.JobClient: FileSystemCounters
14/11/04 11:52:44 INFO mapred.JobClient: FILE_BYTES_READ=94
14/11/04 11:52:44 INFO mapred.JobClient: HDFS_BYTES_READ=956
14/11/04 11:52:44 INFO mapred.JobClient: FILE_BYTES_WRITTEN=119665
14/11/04 11:52:44 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215
14/11/04 11:52:44 INFO mapred.JobClient: Map-Reduce Framework
14/11/04 11:52:44 INFO mapred.JobClient: Map output materialized bytes=112
14/11/04 11:52:44 INFO mapred.JobClient: Map input records=4
14/11/04 11:52:44 INFO mapred.JobClient: Reduce shuffle bytes=112
14/11/04 11:52:44 INFO mapred.JobClient: Spilled Records=16
14/11/04 11:52:44 INFO mapred.JobClient: Map output bytes=72
14/11/04 11:52:44 INFO mapred.JobClient: Total committed heap usage (bytes)=819818496
14/11/04 11:52:44 INFO mapred.JobClient: CPU time spent (ms)=16580
14/11/04 11:52:44 INFO mapred.JobClient: Map input bytes=96
14/11/04 11:52:44 INFO mapred.JobClient: SPLIT_RAW_BYTES=484
14/11/04 11:52:44 INFO mapred.JobClient: Combine input records=0
14/11/04 11:52:44 INFO mapred.JobClient: Reduce input records=8
14/11/04 11:52:44 INFO mapred.JobClient: Reduce input groups=8
14/11/04 11:52:44 INFO mapred.JobClient: Combine output records=0
14/11/04 11:52:44 INFO mapred.JobClient: Physical memory (bytes) snapshot=586375168
14/11/04 11:52:44 INFO mapred.JobClient: Reduce output records=0
14/11/04 11:52:44 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1716187136
14/11/04 11:52:44 INFO mapred.JobClient: Map output records=8
Job Finished in 116.488 seconds
Estimated value of Pi is 3.14000000000000000000
</pre>
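The estimate works the way the example's name suggests: each map task scatters points in the unit square and counts how many land inside the quarter circle, and four times that fraction approaches pi. The same idea fits in a few lines of awk (plain pseudo-random sampling here; Hadoop's version uses a Halton sequence, which converges faster):

```shell
# Monte Carlo estimate of pi: sample n random points in the unit square
# and count the fraction that fall inside the quarter circle of radius 1.
pi_est=$(awk 'BEGIN {
    srand(42); n = 100000; inside = 0
    for (i = 0; i < n; i++) {
        x = rand(); y = rand()
        if (x*x + y*y <= 1) inside++
    }
    printf "%.4f", 4 * inside / n
}')
echo "Estimated value of Pi is $pi_est"   # a value close to 3.14
```

The Hadoop job obviously isn't the fastest way to compute pi; the point is that it exercises HDFS, the JobTracker, and all four TaskTrackers end to end.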
Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-2599123504111716392014-08-26T12:28:00.001-07:002014-08-26T12:47:19.941-07:00All About ElasticSearch ScriptingThis is taken from the ElasticSearch blog of a similar name, <a href="http://www.elasticsearch.org/blog/scripting/">all about scripting</a>.<br />
<br />
There is a lot of good news in here. This is as of the 1.3 branch:<br />
<br />
<ul>
<li>Dynamic scripting disabled by default for security</li>
<li>Moving from MVEL to Groovy</li>
<li>Added Lucene Expressions</li>
<li><code>field_value_factor</code> functions</li>
</ul>
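To illustrate the last item: a function_score query with field_value_factor lets a numeric field boost relevance. A sketch of the request body (the articles index and its likes field are made up for illustration):

```json
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "elasticsearch" } },
      "field_value_factor": {
        "field": "likes",
        "modifier": "log1p",
        "factor": 2
      }
    }
  }
}
```

Sent as the body of a search request, this multiplies each matching document's score by log1p(2 * likes), so popular articles rank higher without swamping text relevance.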
<br />
Here's an earlier article about <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html">scripting and script security</a>.<br />
<br />
I personally haven't done that much with ElasticSearch scripting, but I need to start playing with it to see how it can help. I'd encourage you to do the same.<br />
<br />
- Craig Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-76441810672393168902014-07-29T16:57:00.001-07:002014-07-29T16:57:28.172-07:00ElasticSearch Resiliency and Failover<div dir="ltr" id="docs-internal-guid-27040a6a-848d-aa57-0dc8-8535cc9c58ca" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">ElasticSearch is designed with the assumption that things break. Hardware fails and software crashes. That’s a fact of life. ElasticSearch mainly deals with this through the use of clustering. That means that many machines can be combined together to work as a single unit.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Data can be replicated or copied across multiple servers so that the loss of one or more nodes can be tolerated by the cluster and still have the cluster respond to requests in a timely manner. This obviously depends on the exact configuration.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">The primary mechanism for ElasticSearch resiliency is divided into 3 parts - nodes, shards, and replicas. Nodes are the individual servers, and as nodes are added to the cluster, data is spread across them in terms of shards and replicas. A shard is a slice or section of the data; the number of shards is specified when an index is created and cannot be changed. A replica is simply a copy of a shard. The number of replicas can be changed on the fly.</span></div>
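In API terms, the shard count is part of the index settings fixed at creation time, while the replica count can be updated on a live index. A sketch of the creation settings (the counts here are just examples):

```json
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
```

PUT this body when creating an index; later, a body of {"number_of_replicas": 2} PUT to that index's _settings endpoint raises the replica count on the fly, while number_of_shards cannot be changed without reindexing.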
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">If a cluster is formed from a single node, then all primary shards are on that node. If there is a failure of the node or of a shard, then data will likely be lost.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">When you add a node to the cluster so you now have 2 nodes, then primary shards will be spread across both nodes. If a node is lost, then some data will not be available until that node is restored. If that node cannot be restored, then data will be lost.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">The next step would be to add a replica. Again, a replica is a copy of the data. In a 2-node scenario, the primary shards will be spread across both nodes, but a replica of each shard will be allocated to the alternate node. The effect is that both nodes will have a full copy of the data, but neither node will hold all of the primary shards. If a node is lost in this case, the remaining node can handle all requests. When the failed node is restored or a replacement node is added to the cluster, ElasticSearch will replicate shards to the second node and achieve equilibrium again.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Adding a third node to the cluster will spread the 2 copies of the data across all 3 nodes. The cluster can handle the failure of a single node, but not 2 nodes.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">If a second replica is added (making 3 copies of the data), then each node would effectively have a copy of the data. This configuration should be able to handle the loss of 2 of the 3 nodes, though you really don’t want that to happen.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">ElasticSearch has some features that allow you to influence shard allocation. There are a couple of different algorithms that it can use to control shard placement. Recently, they have added the size of the shard into the allocation strategy such that one node does not end up with lots of large shards while another node has primarily small shards.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">ElasticSearch also has a rack awareness feature. This allows you to tell ElasticSearch that some nodes are physically close to each other while other nodes are physically far apart. For example, you can have some nodes in one data center and other nodes of the same cluster in another data center. ElasticSearch will try to keep requests localized as much as possible. Having a cluster split across data centers is not really recommended for performance reasons, but is an option.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">ElasticSearch has a federated search feature. This allows 2 clusters to be in physically separate data centers while essentially a third cluster will arbitrate requests across clusters. This is a very welcome feature for ElasticSearch.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">ElasticSearch has also added a snapshot-restore feature as of the 1.0 release. This allows an entire cluster to be backed up and restored, or one or more indices can be specified for a particular snapshot. Snapshots can be taken on a fully running cluster and can be scripted and taken periodically.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Once the initial snapshot occurs, subsequent snapshots taken are incremental in nature. </span><a href="http://www.elasticsearch.org/blog/introducing-snapshot-restore/" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">http://www.elasticsearch.org/blog/introducing-snapshot-restore/</span></a></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-snapshots.html" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-snapshots.html</span></a></div>
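Per those docs, a repository is registered first and snapshots are then taken into it. A sketch of a shared-filesystem repository definition (the repository name and path are illustrative):

```json
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/my_backup",
    "compress": true
  }
}
```

PUT this to _snapshot/my_backup to register the repository; a PUT to _snapshot/my_backup/snapshot_1 then takes a snapshot, and later snapshots into the same repository only store what has changed.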
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">One of the great things about snapshot/restore is that it can be used to keep a near-line cluster that mirrors the production cluster. If the production cluster goes down, the backup cluster can be brought back online at roughly the same point in time document wise.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">This feature works by shipping the snapshots to the near-line cluster, then applying the snapshots to that cluster. Since the snapshots are incremental, the process should be pretty quick, even for clusters with a good amount of volume.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">There are a variety of ways that ElasticSearch can be made resilient depending on the exact architecture. ElasticSearch continues to evolve in this regard as more and more companies rely on it for mission critical applications.</span></div>
<br /> - CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-21443182808644685342014-07-22T16:36:00.000-07:002015-05-28T12:35:44.061-07:00What is Big Data?I've been wanting to write this for a long time, but now have time to put some thoughts down based on discussions and comments from various groups of technical and business people.<br />
<br />
The question is, What is Big Data?<br />
<br />
Is Big Data just that, big ... data? Well, there have always been very large data sets, certainly constrained to the technology limitations of the time. Storage systems that once were measured in gigabytes, then terabytes, are now measured in petabytes. Certainly our ability to store vast quantities of data has increased dramatically.<br />
<br />
With the advent of the Internet and the World Wide Web, our ability to collect information has also grown dramatically. Where companies and individuals were once limited to collecting data within their particular sphere, collecting data from locations and individuals around the globe is now a daily occurrence.<br />
<br />
I've heard people sometimes assert that Big Data is simply doing the same things we were doing before, just in a different way, maybe with different tools. I can understand that point of view. I think that is generally the first step in the process of moving towards a Big Data mindset.<br />
<br />
I would put forth that Big Data really is something different than just working with large data sets. To me, Big Data is really a different mindset. We live in a vastly more connected world than just a decade ago. The daily effect of data on people's lives is incredibly pervasive. It's almost impossible to escape.<br />
<br />
Big Data is about thinking differently about your data.<br />
<br />
Big Data is about connecting data sources. What other data is available that can enhance what you have? Can you add user behavior from your web site? How about advertising or industry data? How about public sources from government or private sources like Twitter, Facebook, search engines like Google, Bing, Yahoo and others?<br />
<br />
Big Data is more about openness and sharing. Not everything can be shared and you really, really need to be careful of user identifiable information, but what can you share? What can you publish or what data sets can you contribute to that enrich the overall community?<br />
<br />
Big Data many times involves working with unstructured data. SQL databases, by their nature, enforce very strict structures on your data. If your data doesn't conform, then you're kind of out of luck. There are some things that SQL databases do extremely well, but there are real limits.<br />
<br />
There are many new tools and techniques for working with unstructured data and extracting value from it. So-called NoSQL data stores are designed to work with data that has little or no structure, providing new capabilities. Open source search engines like ElasticSearch and SOLR provide incredible search and faceting/aggregation abilities.<br />
<br />
We have many machine learning algorithms and tools that let us dissect and layer structure on top of our data to help us make sense of it. Algorithms help us to cluster, classify, and figure out which documents/data are similar to other ones.<br />
<br />
We can process volumes of data in ways that we couldn't before. Traditional computing requires the data to be moved to the application/algorithm, with the answer then written back to storage. Now we have platforms like Hadoop that effectively distribute large data sets and move the algorithm to the data, allowing it to be processed and the answers written in place, or distributed elsewhere.<br />
<br />
Does Big Data require big/large data to be useful? You can run most of these tools on your laptop, so no, not really. You can even run many tools like Map/Reduce on a $35 Raspberry Pi. With Big Data tools, we can do things on our laptops that required data centers in the past.<br />
<br />
Big Data requires experimentation. It requires people to ask new questions and develop new answers. What can we do now that we couldn't do before? It requires a different mindset.<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-57903363383203640282014-07-22T15:11:00.001-07:002014-07-22T15:11:04.607-07:00Elasticsearch 1.2.2 releasedSpreading the word about the latest <a href="http://www.elasticsearch.org/blog/elasticsearch-1-2-2-released/">ElasticSearch </a>release. The latest release is based on Lucene 4.8.1 and includes some nice bug fixes.<br />
<br />
There are fixes for possible translog corruption and some file locking issues with previous Lucene releases. It also includes a fix for caching issues with percolation of nested documents.<br />
<br />
Great work by the ElasticSearch team. Be sure to get the new release! <a href="http://www.elasticsearch.org/downloads/1-2-2/">http://www.elasticsearch.org/downloads/1-2-2/</a><br />
<br />
- Craig<br />
<br />
<br />Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-19927064460989974002014-07-22T14:13:00.002-07:002014-07-22T14:14:04.211-07:00Beta launch of ExecuteSales.com<a href="https://beta.executesales.com/">https://beta.executesales.com/</a> is now open and available for beta. ExecuteSales is a platform connecting companies and sales service providers, enabling both
to generate new business and achieve higher sales revenue.<br />
<br />
I've been working on the search core of ExecuteSales for the last 18 months along with some other fantastic team members and owners, and am very excited to get to beta! Beta is totally free, so check it out!<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-49978103183337084272014-04-08T10:55:00.001-07:002014-04-08T10:59:01.694-07:00Blue Raspberry Pi Cluster - 10 Nodes Running ElasticSearch, Hadoop, and Tomcat!This is my latest project. It's a 10-node Raspberry Pi Cluster! This was built as part of the Utah Big Mountain Data Conference. It is a competition prize and will be given away as a promotional item from <a href="http://nosqlrevolution.com/">NosqlRevolution LLC</a> which is a Nosql/ElasticSearch/Hadoop/Machine Learning consulting company that I founded.<br />
<br />
The idea behind the box is to be able to show and demonstrate many of the concepts that are being talked about during the conference. It also gives an idea of how an individual may be able to work and study these concepts in a very small form factor.<br />
<br />
From the Application node, the USB and HDMI ports are extended to the outside of the box. A network port from the 16-port switch is also extended. You can plug in a keyboard, mouse, video, and network and then use the box much as you would a Linux PC.<br />
<br />
The cluster is currently pulling Big Data related tweets directly into ElasticSearch via the Twitter river plugin. Periodically, some of the tweets are pulled into the Hadoop cluster for some basic processing, then written back to ElasticSearch for display. A Java REST application with an AngularJS front end provides a search interface for the tweets and displays the results and basic trending provided by Hadoop.<br />
<br />
There you go. A single box that lets us run all of this software and demonstrate an end-to-end "Big Data" system.<br />
<br />
Stats:<br />
<br />
<ul>
<li>10 Raspberry Pi Model B with 512MB RAM and a 32GB SD card each.</li>
<li>Total cluster memory - 5GB</li>
<li>Total cluster storage - 320GB</li>
<li>5 Hadoop nodes - 1 name node and 4 data nodes</li>
<li>4 ElasticSearch nodes</li>
<li>1 Application node running tomcat</li>
<li>1 16-port network switch</li>
</ul>
<br />
<br />
Please join us on April 12, 2014 in Salt Lake City, UT. See <a href="http://www.uhug.org/">www.uhug.org</a> and <a href="http://utahbigmountaindata.com/">utahbigmountaindata.com</a> and <a href="http://utahcodecamp.com/Event/BigMountainDataSpring2014/Sessions">utahcodecamp.com</a> for more information.<br />
<div>
<br />
- Craig<br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgX5tdJIBXTXywMaFoECDu7Bv6j6_uhAR67x0fsvvT1MWHY53Dn7rPiNkaLci8Ulu-iFQd5LR6txaOLzQBB3FKkExlYAvaEWGGqm34a4UgR2heBkqpDx9A0hLxhym1t9_NHCSJaCcvNhuk/s1600/IMG_3640.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgX5tdJIBXTXywMaFoECDu7Bv6j6_uhAR67x0fsvvT1MWHY53Dn7rPiNkaLci8Ulu-iFQd5LR6txaOLzQBB3FKkExlYAvaEWGGqm34a4UgR2heBkqpDx9A0hLxhym1t9_NHCSJaCcvNhuk/s1600/IMG_3640.JPG" height="480" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6RY9cBzBERa42eP3D3FpKrJ5hyRMQq1riLLak5esupqZVkNwG4gqbhx43WBG8d4G2VB1ihPKaf7qzSbF6tHj6Il5WVFJ0R1qcNW5MwwlcgWtww6Ac06QbTr12jJ5EI3o5s_hXi1ofLIM/s1600/IMG_3621.JPG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6RY9cBzBERa42eP3D3FpKrJ5hyRMQq1riLLak5esupqZVkNwG4gqbhx43WBG8d4G2VB1ihPKaf7qzSbF6tHj6Il5WVFJ0R1qcNW5MwwlcgWtww6Ac06QbTr12jJ5EI3o5s_hXi1ofLIM/s1600/IMG_3621.JPG" height="480" width="640" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgh5lyl8f-AwPH_2d4Lx7t6ZExGZURv9aAKij-t83v7MGyPOQUgSHQtEIH0FkqvOhbFfo4ad_AjfV1hslfnqtX7s27SYnK8RUzuM3H4hj-k9N2Wp62VrF2i71y_TlZpXLigExwNp3zrVBQ/s1600/IMG_3645.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgh5lyl8f-AwPH_2d4Lx7t6ZExGZURv9aAKij-t83v7MGyPOQUgSHQtEIH0FkqvOhbFfo4ad_AjfV1hslfnqtX7s27SYnK8RUzuM3H4hj-k9N2Wp62VrF2i71y_TlZpXLigExwNp3zrVBQ/s1600/IMG_3645.JPG" height="480" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNf7FI6tr7fHKmEdA9cPRkRVnHUWxWPhlLOIN4s3WC-ZiL-R5G85TJPLSEBt_IqTyCGdxZQjARYpt9sYwkaTWqX2MMihpoT1N63AkyKSbeMojnIPXsgnwOrdWB2jGZbNswiYhIqyW1US4/s1600/IMG_3618.JPG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNf7FI6tr7fHKmEdA9cPRkRVnHUWxWPhlLOIN4s3WC-ZiL-R5G85TJPLSEBt_IqTyCGdxZQjARYpt9sYwkaTWqX2MMihpoT1N63AkyKSbeMojnIPXsgnwOrdWB2jGZbNswiYhIqyW1US4/s1600/IMG_3618.JPG" height="480" width="640" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDAvp0KViT27FG1Mq7mvL9OqjeWmNPXOWlNJK2iy0aO-ULq2Rgpckhgh3BhWVvO10GivvPdgvqjI3sctJf4TyHpPXLc6xYKXkav-xfGL-OyPW377V9XLjAEoVA1ojIcAQW1Q0LtwAmwAw/s1600/IMG_3648.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDAvp0KViT27FG1Mq7mvL9OqjeWmNPXOWlNJK2iy0aO-ULq2Rgpckhgh3BhWVvO10GivvPdgvqjI3sctJf4TyHpPXLc6xYKXkav-xfGL-OyPW377V9XLjAEoVA1ojIcAQW1Q0LtwAmwAw/s1600/IMG_3648.JPG" height="480" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWAnoHZslOP2z7-qClx6X9Ct1NZkPDceeDZM4LR-1SISeCcdpTFWlDVQTE1V66VRxLX0OyReUckoRPzzXF0fK1iG0eLdyBhzlyDl-VAtqybjjMCcc9GaVhPF1F2I0Kh1L70FpuVLeRPkM/s1600/IMG_3653.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWAnoHZslOP2z7-qClx6X9Ct1NZkPDceeDZM4LR-1SISeCcdpTFWlDVQTE1V66VRxLX0OyReUckoRPzzXF0fK1iG0eLdyBhzlyDl-VAtqybjjMCcc9GaVhPF1F2I0Kh1L70FpuVLeRPkM/s1600/IMG_3653.JPG" height="480" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqgIgNz5FeAgFIFLUTHOcSkA-Jj01p_TUvw_nlmelcUktMvO9Si7ZJUhlgZ4zxl4grNs4vx37hjERPVo8AJIj_XdxEv7GOCOii64PXA_h3ydrR4mJo5MiUL-zmfCs0k1QIykkuvL-mgo8/s1600/IMG_3657.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqgIgNz5FeAgFIFLUTHOcSkA-Jj01p_TUvw_nlmelcUktMvO9Si7ZJUhlgZ4zxl4grNs4vx37hjERPVo8AJIj_XdxEv7GOCOii64PXA_h3ydrR4mJo5MiUL-zmfCs0k1QIykkuvL-mgo8/s1600/IMG_3657.JPG" height="480" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyMwwAHoXc4YLgMme_XLkp-M3ON3HnJ5Dy5VdqxncutcQrsRmqxzKUr9HG5qJr6-7bvfISbM3QmAW7htqliqJ5haXyhHmgkJ0keiXiJgCVpa1gietCDiLQuyi4-DcObu0l9MG9pByJUlk/s1600/IMG_3670.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyMwwAHoXc4YLgMme_XLkp-M3ON3HnJ5Dy5VdqxncutcQrsRmqxzKUr9HG5qJr6-7bvfISbM3QmAW7htqliqJ5haXyhHmgkJ0keiXiJgCVpa1gietCDiLQuyi4-DcObu0l9MG9pByJUlk/s1600/IMG_3670.JPG" height="480" width="640" /></a></div>
<br /></div>
Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-52444845414191422582014-01-15T15:46:00.001-08:002014-01-15T15:52:09.678-08:00OSv: The Open Source Cloud Operating System That is Not LinuxI just ran across this today and found it very interesting. <a href="http://www.linux.com/news/featured-blogs/200-libby-clark/748578-osv-the-open-source-cloud-operating-system-that-is-not-linux">OSv: The Open Source Cloud Operating System That is Not Linux</a> on <a href="http://www.linux.com/">linux.com</a>. It looks like they are building a stripped-down OS that is designed to run JVM applications. Since the JVM runs Java, Ruby, Scala, JavaScript, and others, it gives you a lot of options for running your applications. Of course, not everything will run on the JVM, so they indicate they'll be working with other projects to port them to the JVM.<br />
<br />
Part of what makes this interesting is that <span style="font-size: x-small;"><span><span id="docs-internal-guid-48d4dfa8-1988-af88-54f2-3bf399ec60bb"><span style="background-color: transparent; color: black; vertical-align: baseline; white-space: pre-wrap;">OS</span><span style="background-color: transparent; color: black; vertical-align: super; white-space: pre-wrap;">v</span></span></span></span> is a single-user, single-application OS layer that runs on either KVM or Xen. To quote one of the authors:<br />
<blockquote class="tr_bq">
<span style="font-size: 11px; line-height: 18px;">Avi: The traditional
OS supports multiple users with multiple applications running on top.
OSV doesn't do any of that. It trusts the hypervisor to provide the
multi-tenancy and it just runs a single one. We do a smaller job and
that allows us to do it well and allows the administration of that
system to be easier as well because there's simply less to administer.</span></blockquote>
<br />
From the web site:<br />
<blockquote class="tr_bq">
<span style="font-size: x-small;"><span><span id="docs-internal-guid-48d4dfa8-1988-af88-54f2-3bf399ec60bb"><span style="background-color: transparent; color: black; vertical-align: baseline; white-space: pre-wrap;">OS</span><span style="background-color: transparent; color: black; vertical-align: super; white-space: pre-wrap;">v</span><span style="background-color: transparent; color: black; vertical-align: baseline; white-space: pre-wrap;"> reduces the memory and cpu overhead imposed by traditional OS.
Scheduling is lightweight, the application and the kernel cooperate,
memory pools are shared. It</span></span> provides unparalleled short
latencies and constant predictable performance, translated directly to
capex saving by reduction of the number of OS instances/sizes.</span></span> </blockquote>
<br />
I'm excited to check this out since you can run it easily in an existing Linux environment. I'd really like to see how ElasticSearch performs on <span style="font-size: x-small;"><span><span id="docs-internal-guid-48d4dfa8-1988-af88-54f2-3bf399ec60bb"><span style="background-color: transparent; color: black; vertical-align: baseline; white-space: pre-wrap;">OS</span><span style="background-color: transparent; color: black; vertical-align: super; white-space: pre-wrap;">v</span></span></span></span>. You can find out more information on their site, <a href="http://osv.io/">osv.io</a>. <br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-32279126280691111552013-09-18T18:50:00.000-07:002013-09-18T18:50:02.532-07:00Companion slides for ElasticSearch talk at the Utah Big Mountain Data Conference.<br />
<br />
<iframe allowfullscreen="" frameborder="0" height="356" marginheight="0" marginwidth="0" mozallowfullscreen="" scrolling="no" src="http://www.slideshare.net/slideshow/embed_code/26330952" style="border-width: 1px 1px 0; border: 1px solid #CCC; margin-bottom: 5px;" webkitallowfullscreen="" width="427"> </iframe> <br />
<div style="margin-bottom: 5px;">
<b> <a href="https://www.slideshare.net/imarcticblue/craig-brown-speaks-on-elast" target="_blank" title="Craig Brown speaks on ElasticSearch">Craig Brown speaks on ElasticSearch</a> </b> from <b><a href="http://www.slideshare.net/imarcticblue" target="_blank">imarcticblue</a></b> </div>
Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-59150946905671830422013-09-17T19:48:00.002-07:002013-09-17T19:48:39.955-07:00Craig Brown speaking on ElasticSearch at Utah Big Mountain Data Conference<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/77UbSZKWhPA?feature=player_embedded' frameborder='0'></iframe></div>
Enjoy!<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-47108994382450325392013-03-12T17:54:00.001-07:002013-03-12T17:54:46.034-07:00Acunu - Eric Evans - Castle-enhanced Cassandra<a href="http://vimeo.com/43903966#">Castle-enhanced Cassandra</a> is a 20-minute video from Berlin Buzzwords, given by Eric Evans of Acunu.<br />
<br />
Castle is a new FLOSS storage backend for Linux, delivered as an LKM (loadable kernel module). It's designed to work with the SSTables that Cassandra uses for storage, and it's a write-optimized storage system built to perform well on both rotational disks and SSDs.<br />
<br />
It looks like a great project, and I'm glad something like this is open source. It would be interesting to see if other projects can take advantage of it. There are other NoSQL projects that, while not using SSTables, use similar write-append strategies.<br />
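The write-append idea is worth a closer look. Here is a toy Python sketch of the pattern behind SSTable-style storage: buffer recent writes in memory, then flush them as immutable sorted runs that are never rewritten in place. The class and names here are purely illustrative, not how Castle or Cassandra are actually implemented.

```python
import bisect

class TinyLSM:
    """Toy illustration of the write-append/SSTable pattern (not real Cassandra)."""

    def __init__(self, flush_at=3):
        self.memtable = {}      # recent writes, mutable, in memory
        self.runs = []          # flushed runs: immutable, sorted lists of (key, value)
        self.flush_at = flush_at

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_at:
            # Flush: sort once on the way out; the run is never modified again,
            # so writes to disk would be purely sequential appends.
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest run first, so the most recent flushed value wins.
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Reads may have to consult several runs; real systems compact runs together in the background to keep that cost bounded, which is exactly where a storage layer like Castle can help.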
<br />
Thanks Acunu!<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-71050622410778927152013-03-06T21:56:00.000-08:002013-03-06T21:56:44.840-08:00Wired - Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon<a href="http://www.wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos/">Nice article</a> on Wired.com that revolves around the open source Apache incubator project <a href="http://incubator.apache.org/mesos/">Mesos</a>.<br />
<br />
Mesos is considered to be cluster management software. It is designed to take your entire data center and virtualize the resources for applications running under Mesos.<br />
<br />
You don't have to set up a dedicated cluster of machines for a single purpose like running Hadoop. Mesos allows you to set up multiple Hadoop cluster instances over the same hardware set to make more efficient use of resources.<br />
<br />
Google's system is referred to as Borg, which is fitting. It is being upgraded soon to Omega. The underlying goal is the same: to efficiently use computing resources for all of the required tasks.<br />
<br />
Mesos is being supported by a number of engineers and companies including Twitter and some former Google engineers who worked on Borg.<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-63772373513925404592013-02-05T20:26:00.000-08:002013-02-05T20:26:00.180-08:00Testing MapReduce with MRUnit By Mansoor Ashraf: <a href="http://m-mansur-ashraf.blogspot.com/2013/02/testing-mapreduce-with-mrunit.html">Testing MapReduce with MRUnit</a>.<br />
<br />
Here is the <a href="http://mrunit.apache.org/">MRUnit home page</a>.<br />
<br />
I ran across this blog today and thought it was really worthwhile. I don't know how many of you are into writing MR jobs, but MRUnit is an invaluable piece of kit to have available.<br />
<br />
Many MR jobs are very time consuming; even small jobs lose real time to spinning up and reviewing the results, and cluster resources may not be available when you want to test. On top of that, there can be additional expense involved if you're running something like Elastic MapReduce on AWS.<br />
<br />
MRUnit is also a good learning tool, giving you a chance to try out MR features or test new algorithms in a quick, safe environment. Unit tests in general help other people come up to speed quickly on how your code works, provide usage documentation, and offer some protection against your code being broken.<br />
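MRUnit itself is a Java library, but the idea it enables can be sketched in a few lines of Python: write your map and reduce logic as plain functions and exercise them directly, with no cluster and no job submission. The function names below are illustrative, not MRUnit's API.

```python
# Sketch of the MRUnit idea in Python: test map and reduce logic as
# plain functions, one record at a time, with no cluster involved.

def word_count_map(line):
    """Mapper: emit a (word, 1) pair for every word in the input line."""
    return [(word.lower(), 1) for word in line.split()]

def word_count_reduce(word, counts):
    """Reducer: sum all the counts emitted for a single word."""
    return (word, sum(counts))

# The "unit test": feed one record in, assert on exactly what comes out.
assert word_count_map("Hello hello world") == [("hello", 1), ("hello", 1), ("world", 1)]
assert word_count_reduce("hello", [1, 1]) == ("hello", 2)
```

MRUnit's drivers do the same thing for real Hadoop `Mapper` and `Reducer` classes, handling the framework plumbing so each test stays this small.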
<br />
Go ahead and give it a try.<br />
<br />
- Craig<br />
<br />Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-30331811211471442502012-08-09T11:40:00.000-07:002012-08-09T11:40:00.727-07:00Querying 24 Billion Records in 900ms (ElasticSearch)This is a video I came across from Berlin Buzzwords: <a href="http://www.elasticsearch.org/videos/2012/06/05/querying-24-billion-records-in-900ms.html">Querying 24 Billion Records in 900ms</a>. The presentation is only 20 minutes long, but it includes slides, which is very helpful.<br />
<br />
The speaker gives a lot of good information about trying to scale out ES on AWS to handle 24B records. They were able to do it successfully, which is excellent. They ended up moving to dedicated hardware to reduce the AWS cost.<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-29033117038384744482012-08-09T11:36:00.001-07:002012-08-09T11:36:47.914-07:00foursquare now uses ElasticSearch<a href="http://engineering.foursquare.com/2012/08/09/foursquare-now-uses-elastic-search-and-on-a-related-note-slashem-also-works-with-elastic-search/">foursquare now uses ElasticSearch</a>! This is pretty exciting news! It's great to see people switching over to ElasticSearch and seeing such great success. The article doesn't have a ton of details, but it's definitely worth a look.<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-58646306412121800942012-08-08T14:18:00.000-07:002012-08-08T14:18:48.436-07:00NoSQLUnit 0.3.2 ReleasedThis is something quite interesting that I ran across today - <a href="http://java.dzone.com/articles/nosqlunit-032-released">NoSQLUnit 0.3.2</a>. I didn't realize that there was a project out there like this, very cool!<br />
<br />
Some NoSQL projects are easier to run embedded than others, and this could certainly help with that. You always need helper classes to facilitate running unit tests against a data source. I've written unit tests against ElasticSearch, which has a local/in-memory mode that is great for writing tests against. Still, you have to generate your test data and load it into the search engine for testing.<br />
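As a rough illustration of that fixture-loading chore, here is a Python sketch using a hypothetical in-memory stand-in for a search engine; NoSQLUnit solves the same problem for real Java stores with dataset annotations.

```python
import unittest

class FakeSearchIndex:
    """Hypothetical in-memory stand-in for an embedded search engine."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, doc):
        self.docs[doc_id] = doc

    def search(self, field, value):
        return [d for d in self.docs.values() if d.get(field) == value]

class ProductSearchTest(unittest.TestCase):
    def setUp(self):
        # The fixture-loading step: seed known test data before every test,
        # so each test runs against a predictable data set.
        self.engine = FakeSearchIndex()
        self.engine.index("1", {"name": "widget", "color": "red"})
        self.engine.index("2", {"name": "gadget", "color": "blue"})

    def test_search_by_color(self):
        hits = self.engine.search("color", "red")
        self.assertEqual([h["name"] for h in hits], ["widget"])
```

The value of a tool like NoSQLUnit is that the `setUp` boilerplate above gets replaced by declarative fixtures, so the seeding code doesn't have to be rewritten for every test class.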
<br />
Great project!<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-58273628396442249762012-07-20T21:52:00.001-07:002012-07-20T21:52:09.673-07:00Time To Build Your Big-Data Muscles<a href="http://www.fastcompany.com/1842928/time-to-build-your-big-data-muscles">Time To Build Your Big-Data Muscles </a>from fastcompany.com talks mainly about getting educated for a Big Data job. They provide an interesting statistic that for every 100 open Big Data jobs, there are only 2 qualified candidates. Definitely a good place to be if you're looking for work in this area.<br />
<br />
A background in mathematics or engineering is your best bet. Load up on statistics if you're able to. I'm more on the engineering side myself. My work involves taking the algorithms and principles that have been developed and applying them to business data sets. There is so much work to do just in the application of algorithms alone.<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-84473850608854455092012-07-18T21:05:00.002-07:002012-07-18T21:09:01.123-07:00Cameron Befus speaks at UHUG (Hadoop)This evening, we heard from Cameron Befus, former CTO of Tynt, which was purchased by 33 Across Inc. Here are some notes from his presentation on Hadoop.<br />
<br />
Why use Hadoop?<br />
<br />
A framework for parallel processing of large data sets.<br />
<br />
Design considerations<br />
- System will manage and heal itself<br />
- Performance to scale linearly<br />
- Compute: move the algorithm to the data<br />
<br />
Transition cluster size - local -> cloud -> dedicated hardware.<br />
<br />
Hadoop processing is optimized for data retrieval<br />
- schema on read<br />
- nosql databases<br />
- map reduce<br />
- asynchronous<br />
- parallel computing<br />
<br />
Built for unreliable, commodity hardware.<br />
Scale by adding boxes<br />
More cost effective<br />
<br />
Sends the program to the data.<br />
- Larger choke points occur in transferring large volumes of data around the cluster<br />
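The map/shuffle/reduce model from the notes above can be sketched as a toy, single-process Python walk-through; in a real Hadoop cluster each phase runs in parallel across many nodes, and the map code is shipped to wherever the data lives.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process illustration of the three MapReduce phases."""
    # Map phase: each input record independently becomes (key, value) pairs.
    pairs = [kv for record in records for kv in map_fn(record)]
    # Shuffle phase: group all values by key -- in a cluster this is the
    # network-heavy step where the choke points mentioned above appear.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: one reduce call per distinct key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Example: total sales per product across a pile of sales records.
sales = [("batteries", 4), ("pop tarts", 2), ("batteries", 1)]
totals = run_mapreduce(sales, lambda r: [r], lambda k, vs: sum(vs))
# totals == {"batteries": 5, "pop tarts": 2}
```

Because each map call and each reduce call is independent, both phases parallelize trivially, which is exactly why "just about anything that can be easily parallelized" fits this model.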
<br />
#1 Data Driven Decision Making<br />
<br />
Everyone wants to make good decisions, and as everyone knows, to make good decisions you need data: not just any data, but the right data.<br />
<br />
Walmart mined sales data from the days before an approaching hurricane and found that the highest-selling products were:<br />
#1 batteries,<br />
#2 pop tarts.<br />
<br />
This means $$$ to Walmart.<br />
<br />
Google can predict flu trends around the world just from search queries. Query velocity matches medical data.<br />
<br />
With the advent of large disks, it becomes cost-effective to simply store everything, then use a system like Hadoop to run through and process the data to find value.<br />
<br />
Combining data sets can extract value when they may not be valuable on their own. Turning lead into gold as it were.<br />
<br />
How big is big data?<br />
not just size,<br />
complexity<br />
rate of growth<br />
performance<br />
retention<br />
<br />
Other uses,<br />
load testing<br />
number crunching<br />
building lucene indexes<br />
just about anything that can be easily parallelized<br />
<br />
<br />
- Craig<br />
<br />Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.com