tag:blogger.com,1999:blog-66543215924274085062024-03-12T17:51:29.212-07:00NoSql Tips and TricksCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comBlogger84125tag:blogger.com,1999:blog-6654321592427408506.post-11770387974071928772015-10-29T07:42:00.001-07:002015-10-29T07:42:26.136-07:00Elasticsearch 2.0 released!Elasticsearch 2.0 is finally out! They actually released a <a href="https://www.elastic.co/blog/release-we-have?blade=tw">number of new releases</a> including Kibana 4.2, Marvel 2.0 and Sense 2.0. Definitely a lot of new stuff! <br />
<br />
<blockquote class="tr_bq">
<strong>Released</strong> <strong><a href="https://www.elastic.co/blog/elasticsearch-2-0-0-released">Elasticsearch 2.0</a></strong>.
A major milestone and achievement of the whole team, and wonderful
contributions from the community. New type of aggregations called
pipeline aggs, simplified query DSL by merging query and filter
concepts, better compression options, hardened security by enabling
security manager, hardening of FS behavior (fsync, more checksums,
atomic renames), performance, consistent mapping behavior, and many
more. Also, it bundles Lucene 5 release, which includes numerous
improvements, several of them led by our team.</blockquote>
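The pipeline aggregations mentioned above let one aggregation consume the output of another instead of re-scanning documents. As a rough sketch (the "sales" index and the "date"/"price" fields below are hypothetical, not from the release notes), a derivative pipeline agg over a monthly sum looks like this:

```shell
# Hypothetical Elasticsearch 2.0 pipeline aggregation: the "derivative" agg
# reads the sibling "sales" metric through buckets_path.
QUERY='{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": { "field": "date", "interval": "month" },
      "aggs": {
        "sales": { "sum": { "field": "price" } },
        "sales_deriv": { "derivative": { "buckets_path": "sales" } }
      }
    }
  }
}'
echo "$QUERY"
# Against a live cluster this would be posted with something like:
# curl -XGET 'localhost:9200/sales/_search' -d "$QUERY"
```

The point of `buckets_path` is that the derivative is computed from the already-built histogram buckets, which is what "pipeline" means here.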
<br />
<a href="https://www.youtube.com/watch?v=KlXG2XLHYc0">Watch the release webinar now!</a><br />
<br />
- Craig Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-74185419561967722472015-05-22T10:55:00.000-07:002015-05-22T10:55:34.892-07:00FastCompany: Googles InnerTube Project to Retool YouTube<a href="http://m.fastcompany.com/3044995/to-take-on-hbo-and-netflix-youtube-had-to-rewire-itself">Fast Company: Google InnerTube project to retool YouTube</a> is a great article talking about a lot of the more recent changes that Google has been making in order to fix some internal development issues and make YouTube more of a competitive destination for watching content.<br />
<br />
It's funny hearing them talk about the mess of Frankenstein projects that had grown up over the years. It seems like every project in continuous development turns into a Frankenstein. It's a constant battle to keep a project clean. If you don't, you end up reaching a point where you almost have to start over from scratch to get the project where you need it to be.<br />
<br />
Enjoy!<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-41301056248385896742015-01-10T13:59:00.001-08:002015-01-10T13:59:10.602-08:00Nathan Paco - A New Year in Data Science: ML Unpaused<a href="http://www.slideshare.net/pacoid/a-new-year-in-data-science-ml-unpaused">A New Year in Data Science: ML Unpaused</a><br />
A good read on Data Science and some of the advances being made more recently. He presents some history around machine learning and looks at what the real job of Data Science involves.<br />
<br />
There are some good insights around the need to focus on data and features instead of focusing on which algorithm to use. Many times people look more at how to build something instead of what problem they are trying to solve. <br />
<br />
Nathan certainly knows what he's talking about. I met him last year at Hadoop Summit in San Jose. He has been working with the Spark guys and Databricks. Take a minute and look him up.<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-84648061047098768122014-12-31T10:18:00.003-08:002014-12-31T10:27:53.763-08:00Elasticsearch 5-node PicoCluster test - 1,000,000 tweets and 3495.1316MB in just under 9 hours.We've been doing some benchmarks with Elasticsearch on the 5-node PicoCluster. We were able to push 1,000,000 tweets and 3495.1316MB in just under 9 hours. The tweets are small, around 1KB each, but have a lot of fields and are pretty complicated. That's pretty good considering that the SD cards are not very good at small reads and writes.<br />
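As a quick sanity check on those numbers (arithmetic only, not part of the original test harness): 1,000,000 docs over 531 minutes 3 seconds works out to roughly the 31.38 docs/second the log reports.

```shell
# Cross-check the reported aggregate rate: 1,000,000 docs in 531m 3s.
total_docs=1000000
total_secs=$(( 531 * 60 + 3 ))   # 31863 seconds, just under 9 hours
rate=$(awk -v d="$total_docs" -v s="$total_secs" 'BEGIN { printf "%.2f", d / s }')
echo "$rate docs/sec"   # 31.38 docs/sec, in line with the log's 31.38405
```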
<br />
<br />
Pushed 1000 docs in 34.457 secs at 29.021679 docs/sec 3.4431486 MB/sec, 980000 total docs in 520 minutes 33 seconds at 31.37682 docs/second 98.0% complete<br />
Pushed 1000 docs in 36.632 secs at 27.298536 docs/sec 3.5004349 MB/sec, 981000 total docs in 521 minutes 9 seconds at 31.372042 docs/second 98.1% complete<br />
Pushed 1000 docs in 34.607 secs at 28.89589 docs/sec 3.5908194 MB/sec, 982000 total docs in 521 minutes 44 seconds at 31.369303 docs/second 98.2% complete<br />
Pushed 1000 docs in 30.67 secs at 32.605152 docs/sec 3.3349895 MB/sec, 983000 total docs in 522 minutes 15 seconds at 31.370514 docs/second 98.299995% complete<br />
Pushed 1000 docs in 31.243 secs at 32.007168 docs/sec 3.431964 MB/sec, 984000 total docs in 522 minutes 46 seconds at 31.37115 docs/second 98.4% complete<br />
Pushed 1000 docs in 28.858 secs at 34.652435 docs/sec 3.4087648 MB/sec, 985000 total docs in 523 minutes 15 seconds at 31.374163 docs/second 98.5% complete<br />
Pushed 1000 docs in 29.598 secs at 33.786068 docs/sec 3.4104357 MB/sec, 986000 total docs in 523 minutes 44 seconds at 31.376436 docs/second 98.6% complete<br />
Pushed 1000 docs in 32.356 secs at 30.90617 docs/sec 3.4084692 MB/sec, 987000 total docs in 524 minutes 17 seconds at 31.375952 docs/second 98.7% complete<br />
Pushed 1000 docs in 37.807 secs at 26.450129 docs/sec 3.4255342 MB/sec, 988000 total docs in 524 minutes 55 seconds at 31.370039 docs/second 98.799995% complete<br />
Pushed 1000 docs in 33.404 secs at 29.936535 docs/sec 3.4184904 MB/sec, 989000 total docs in 525 minutes 28 seconds at 31.36852 docs/second 98.9% complete<br />
Pushed 1000 docs in 34.465 secs at 29.014942 docs/sec 3.4793549 MB/sec, 990000 total docs in 526 minutes 2 seconds at 31.36595 docs/second 99.0% complete<br />
Pushed 1000 docs in 30.792 secs at 32.475967 docs/sec 3.4305592 MB/sec, 991000 total docs in 526 minutes 33 seconds at 31.367033 docs/second 99.1% complete<br />
Pushed 1000 docs in 29.749 secs at 33.614574 docs/sec 3.4574842 MB/sec, 992000 total docs in 527 minutes 3 seconds at 31.369146 docs/second 99.2% complete<br />
Pushed 1000 docs in 32.825 secs at 30.464584 docs/sec 3.370614 MB/sec, 993000 total docs in 527 minutes 36 seconds at 31.368208 docs/second 99.299995% complete<br />
Pushed 1000 docs in 37.048 secs at 26.99201 docs/sec 3.451209 MB/sec, 994000 total docs in 528 minutes 13 seconds at 31.36309 docs/second 99.4% complete<br />
Pushed 1000 docs in 35.307 secs at 28.322996 docs/sec 3.3885374 MB/sec, 995000 total docs in 528 minutes 48 seconds at 31.35971 docs/second 99.5% complete<br />
Pushed 1000 docs in 37.64 secs at 26.567482 docs/sec 3.4242926 MB/sec, 996000 total docs in 529 minutes 26 seconds at 31.35403 docs/second 99.6% complete<br />
Pushed 1000 docs in 28.108 secs at 35.57706 docs/sec 3.4203835 MB/sec, 997000 total docs in 529 minutes 54 seconds at 31.357765 docs/second 99.7% complete<br />
Pushed 1000 docs in 28.886 secs at 34.618847 docs/sec 3.4412107 MB/sec, 998000 total docs in 530 minutes 23 seconds at 31.360725 docs/second 99.8% complete<br />
Pushed 1000 docs in 40.074 secs at 24.953835 docs/sec 3.4108858 MB/sec, 999000 total docs in 531 minutes 3 seconds at 31.352667 docs/second 99.9% complete<br />
Pushed 1000 docs in 0.0 secs at Infinity docs/sec 3.4554148 MB/sec, 1000000 total docs in 531 minutes 3 seconds at 31.38405 docs/second 100.0% complete<br />
<br />
Pushed 1000000 total docs and 3495.1316MB in 531 minutes 3 seconds at 31.38405 docs per second. Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-25299947958360510512014-12-18T17:25:00.000-08:002014-12-18T17:25:19.294-08:0010GB Terasort Benchmark on 5-node Raspberry PI Cluster - 2H 52m 56s
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">We just set a new record for the 10GB terasort on a 5-node PicoCluster! We cut over an hour off the benchmark time, bringing the total to under 3 hours! Pretty amazing! </span></span></h1>
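For context, a 10GB TeraSort on Hadoop 1.x is normally driven by the bundled examples jar; the HDFS paths below are illustrative, not taken from this run. TeraGen writes 100-byte rows, so 100,000,000 rows gives exactly the 10GB seen in the counters:

```shell
# TeraGen rows are 100 bytes each; check that 100M rows = 10 GB of input.
rows=100000000
bytes=$(( rows * 100 ))
echo "$bytes"   # 10000000000, matching the Map output bytes counter below
# The jobs themselves would be launched roughly like this (illustrative paths):
# bin/hadoop jar hadoop-examples-1.2.1.jar teragen 100000000 /terasort-in
# bin/hadoop jar hadoop-examples-1.2.1.jar terasort /terasort-in /terasort-out
```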
Hadoop job_201412181311_0002 on <a href="http://pi0:50030/jobtracker.jsp">master</a>
<b>User:</b> hadoop<br />
<b>Job Name:</b> TeraSort<br />
<b>Job File:</b> <a href="http://pi0:50030/jobconf.jsp?jobid=job_201412181311_0002">hdfs://pi0:54310/tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201412181311_0002/job.xml</a><br />
<b>Submit Host:</b> pi0<br />
<b>Submit Host Address:</b> 10.1.10.120<br />
<b>Job-ACLs: All users are allowed</b><br /><b>Job Setup:</b><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201412181311_0002&type=setup&pagenum=1&state=completed"> Successful</a><br />
<b>Status:</b> Succeeded<br />
<b>Started at:</b> Thu Dec 18 14:54:20 MST 2014<br />
<b>Finished at:</b> Thu Dec 18 17:47:16 MST 2014<br />
<b>Finished in:</b> 2hrs, 52mins, 56sec<br />
<b>Job Cleanup:</b><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201412181311_0002&type=cleanup&pagenum=1&state=completed"> Successful</a><br />
<hr />
<table border="2" cellpadding="5" cellspacing="2"><tbody>
<tr><th>Kind</th><th>% Complete</th><th>Num Tasks</th><th>Pending</th><th>Running</th><th>Complete</th><th>Killed</th><th><a href="http://pi0:50030/jobfailures.jsp?jobid=job_201412181311_0002">Failed/Killed<br />Task Attempts</a></th></tr>
<tr><th><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201412181311_0002&type=map&pagenum=1">map</a></th><td align="right">100.00%<table border="1px" style="width: 80pxpx;"><tbody>
<tr><td cellspacing="0" class="perc_filled" width="100%"><br /></td></tr>
</tbody></table>
</td><td align="right">80</td><td align="right">0</td><td align="right">0</td><td align="right"><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201412181311_0002&type=map&pagenum=1&state=completed">80</a></td><td align="right">0</td><td align="right">0 / 0</td></tr>
<tr><th><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201412181311_0002&type=reduce&pagenum=1">reduce</a></th><td align="right">100.00%<table border="1px" style="width: 80pxpx;"><tbody>
<tr><td cellspacing="0" class="perc_filled" width="100%"><br /></td></tr>
</tbody></table>
</td><td align="right">80</td><td align="right">0</td><td align="right">0</td><td align="right"><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201412181311_0002&type=reduce&pagenum=1&state=completed">80</a></td><td align="right">0</td><td align="right">0 / 0</td></tr>
</tbody></table>
<br />
<table border="2" cellpadding="5" cellspacing="2">
<tbody>
<tr>
<th><br /></th>
<th>Counter</th>
<th>Map</th>
<th>Reduce</th>
<th>Total</th>
</tr>
<tr>
<td rowspan="17">
Map-Reduce Framework</td>
<td>Spilled Records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">300,000,000</td>
</tr>
<tr>
<td>Map output materialized bytes</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,200,038,400</td>
</tr>
<tr>
<td>Reduce input records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>Virtual memory (bytes) snapshot</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">46,356,074,496</td>
</tr>
<tr>
<td>Map input records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>SPLIT_RAW_BYTES</td>
<td align="right">8,800</td>
<td align="right">0</td>
<td align="right">8,800</td>
</tr>
<tr>
<td>Map output bytes</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,000,000,000</td>
</tr>
<tr>
<td>Reduce shuffle bytes</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,200,038,400</td>
</tr>
<tr>
<td>Physical memory (bytes) snapshot</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">32,931,528,704</td>
</tr>
<tr>
<td>Map input bytes</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,000,000,000</td>
</tr>
<tr>
<td>Reduce input groups</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>Combine output records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td>Reduce output records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>Map output records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>Combine input records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td>CPU time spent (ms)</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">27,827,080</td>
</tr>
<tr>
<td>Total committed heap usage (bytes)</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">32,344,113,152</td>
</tr>
<tr>
<td rowspan="1">
File Input Format Counters </td>
<td>Bytes Read</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,000,144,320</td>
</tr>
<tr>
<td rowspan="4">
FileSystemCounters</td>
<td>HDFS_BYTES_READ</td>
<td align="right">10,000,153,120</td>
<td align="right">0</td>
<td align="right">10,000,153,120</td>
</tr>
<tr>
<td>FILE_BYTES_WRITTEN</td>
<td align="right">20,404,679,750</td>
<td align="right">10,204,290,230</td>
<td align="right">30,608,969,980</td>
</tr>
<tr>
<td>FILE_BYTES_READ</td>
<td align="right">10,265,248,834</td>
<td align="right">10,200,000,960</td>
<td align="right">20,465,249,794</td>
</tr>
<tr>
<td>HDFS_BYTES_WRITTEN</td>
<td align="right">0</td>
<td align="right">10,000,000,000</td>
<td align="right">10,000,000,000</td>
</tr>
<tr>
<td rowspan="1">
File Output Format Counters </td>
<td>Bytes Written</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,000,000,000</td>
</tr>
<tr>
<td rowspan="8">
Job Counters </td>
<td>Launched map tasks</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">80</td>
</tr>
<tr>
<td>Launched reduce tasks</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">80</td>
</tr>
<tr>
<td>SLOTS_MILLIS_REDUCES</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">28,079,434</td>
</tr>
<tr>
<td>Total time spent by all reduces waiting after reserving slots (ms)</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td>SLOTS_MILLIS_MAPS</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">22,051,330</td>
</tr>
<tr>
<td>Total time spent by all maps waiting after reserving slots (ms)</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td>Rack-local map tasks</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">30</td>
</tr>
<tr>
<td>Data-local map tasks</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">50</td>
</tr>
</tbody></table>
<hr />
Map Completion Graph -
<a href="http://pi0:50030/jobdetails.jsp?jobid=job_201412181311_0002&refresh=0&map.graph=off"> close </a>
<br />
<hr />
Reduce Completion Graph -
<a href="http://pi0:50030/jobdetails.jsp?jobid=job_201412181311_0002&refresh=0&reduce.graph=off"> close </a>
<br />
<hr />
<hr />
<hr />
<a href="http://pi0:50030/jobtracker.jsp">Go back to JobTracker</a><br /><hr />
This is <a href="http://hadoop.apache.org/">Apache Hadoop</a> release 1.2.1
<table border="0"> <tbody>
<tr>
</tr>
</tbody></table>
Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-41331968052878155662014-12-04T16:31:00.001-08:002014-12-04T16:31:31.744-08:00The Daily WTF: The Robot GuysNot really Big Data, but pretty funny :)<br />
<br />
<a href="http://thedailywtf.com/articles/the-robot-guys">http://thedailywtf.com/articles/the-robot-guys</a><br />
<br />
- Craig <br />
<br />Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-29940014857441028592014-11-13T21:36:00.002-08:002014-11-13T21:36:42.289-08:0010GB Terasort Benchmark on 5-node Raspberry PI Cluster - 4H 12m 5s
After more than a week of testing, tweaking, and retesting, we were able to successfully run a 10GB Terasort on a 5-node Raspberry PI cluster! Each node has 512MB ram and a 16GB SD card.<br />
<br />
<h1>
Hadoop job_201411131351_0001 on <a href="http://pi0:50030/jobtracker.jsp">master</a></h1>
<b>User:</b> hadoop<br />
<b>Job Name:</b> TeraSort<br />
<b>Job File:</b> <a href="http://pi0:50030/jobconf.jsp?jobid=job_201411131351_0001">hdfs://pi0:54310/tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201411131351_0001/job.xml</a><br />
<b>Submit Host:</b> pi0<br />
<b>Submit Host Address:</b> 10.1.10.120<br />
<b>Job-ACLs: All users are allowed</b><br /><b>Job Setup:</b><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201411131351_0001&type=setup&pagenum=1&state=completed"> Successful</a><br />
<b>Status:</b> Succeeded<br />
<b>Started at:</b> Thu Nov 13 14:23:35 MST 2014<br />
<b>Finished at:</b> Thu Nov 13 18:35:40 MST 2014<br />
<b>Finished in:</b> 4hrs, 12mins, 5sec<br />
<b>Job Cleanup:</b><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201411131351_0001&type=cleanup&pagenum=1&state=completed"> Successful</a><br />
<hr />
<table border="2" cellpadding="5" cellspacing="2"><tbody>
<tr><th>Kind</th><th>% Complete</th><th>Num Tasks</th><th>Pending</th><th>Running</th><th>Complete</th><th>Killed</th><th><a href="http://pi0:50030/jobfailures.jsp?jobid=job_201411131351_0001">Failed/Killed<br />Task Attempts</a></th></tr>
<tr><th><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201411131351_0001&type=map&pagenum=1">map</a></th><td align="right">100.00%<table border="1px" style="width: 80pxpx;"><tbody>
<tr><td cellspacing="0" class="perc_filled" width="100%"><br /></td></tr>
</tbody></table>
</td><td align="right">152</td><td align="right">0</td><td align="right">0</td><td align="right"><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201411131351_0001&type=map&pagenum=1&state=completed">152</a></td><td align="right">0</td><td align="right">0 / 0</td></tr>
<tr><th><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201411131351_0001&type=reduce&pagenum=1">reduce</a></th><td align="right">100.00%<table border="1px" style="width: 80pxpx;"><tbody>
<tr><td cellspacing="0" class="perc_filled" width="100%"><br /></td></tr>
</tbody></table>
</td><td align="right">152</td><td align="right">0</td><td align="right">0</td><td align="right"><a href="http://pi0:50030/jobtasks.jsp?jobid=job_201411131351_0001&type=reduce&pagenum=1&state=completed">152</a></td><td align="right">0</td><td align="right">0 / 0</td></tr>
</tbody></table>
<br />
<table border="2" cellpadding="5" cellspacing="2">
<tbody>
<tr>
<th><br /></th>
<th>Counter</th>
<th>Map</th>
<th>Reduce</th>
<th>Total</th>
</tr>
<tr>
<td rowspan="1">
File Input Format Counters </td>
<td>Bytes Read</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,000,298,372</td>
</tr>
<tr>
<td rowspan="8">
Job Counters </td>
<td>SLOTS_MILLIS_MAPS</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">24,993,499</td>
</tr>
<tr>
<td>Launched reduce tasks</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">152</td>
</tr>
<tr>
<td>Total time spent by all reduces waiting after reserving slots (ms)</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td>Rack-local map tasks</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">144</td>
</tr>
<tr>
<td>Total time spent by all maps waiting after reserving slots (ms)</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td>Launched map tasks</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">152</td>
</tr>
<tr>
<td>Data-local map tasks</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">8</td>
</tr>
<tr>
<td>SLOTS_MILLIS_REDUCES</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">34,824,665</td>
</tr>
<tr>
<td rowspan="1">
File Output Format Counters </td>
<td>Bytes Written</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,000,000,000</td>
</tr>
<tr>
<td rowspan="4">
FileSystemCounters</td>
<td>FILE_BYTES_READ</td>
<td align="right">10,341,496,856</td>
<td align="right">10,200,000,912</td>
<td align="right">20,541,497,768</td>
</tr>
<tr>
<td>HDFS_BYTES_READ</td>
<td align="right">10,000,315,092</td>
<td align="right">0</td>
<td align="right">10,000,315,092</td>
</tr>
<tr>
<td>FILE_BYTES_WRITTEN</td>
<td align="right">20,409,243,506</td>
<td align="right">10,208,123,719</td>
<td align="right">30,617,367,225</td>
</tr>
<tr>
<td>HDFS_BYTES_WRITTEN</td>
<td align="right">0</td>
<td align="right">10,000,000,000</td>
<td align="right">10,000,000,000</td>
</tr>
<tr>
<td rowspan="17">
Map-Reduce Framework</td>
<td>Map output materialized bytes</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,200,138,624</td>
</tr>
<tr>
<td>Map input records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>Reduce shuffle bytes</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,200,138,624</td>
</tr>
<tr>
<td>Spilled Records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">300,000,000</td>
</tr>
<tr>
<td>Map output bytes</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,000,000,000</td>
</tr>
<tr>
<td>Total committed heap usage (bytes)</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">57,912,754,176</td>
</tr>
<tr>
<td>CPU time spent (ms)</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">40,328,090</td>
</tr>
<tr>
<td>Map input bytes</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">10,000,000,000</td>
</tr>
<tr>
<td>SPLIT_RAW_BYTES</td>
<td align="right">16,720</td>
<td align="right">0</td>
<td align="right">16,720</td>
</tr>
<tr>
<td>Combine input records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td>Reduce input records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>Reduce input groups</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>Combine output records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
<tr>
<td>Physical memory (bytes) snapshot</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">52,945,952,768</td>
</tr>
<tr>
<td>Reduce output records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
<tr>
<td>Virtual memory (bytes) snapshot</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">123,024,928,768</td>
</tr>
<tr>
<td>Map output records</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">100,000,000</td>
</tr>
</tbody></table>
<hr />
Map Completion Graph
<hr />
Reduce Completion Graph
<hr />
<hr />
<hr />
<a href="http://pi0:50030/jobtracker.jsp">Go back to JobTracker</a><br /><hr />
This is <a href="http://hadoop.apache.org/">Apache Hadoop</a> release 1.2.1
<table border="0"> <tbody>
<tr>
</tr>
</tbody></table>
<table border="0"><tbody>
<tr></tr>
</tbody></table>
Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-81512162661524176942014-11-13T20:50:00.000-08:002014-11-13T20:50:14.621-08:00Installing Hadoop on a Raspberry PI clusterThis document details setting up and testing Hadoop on a Raspberry PI cluster. The details are almost exactly the same for any Linux flavor, particularly Debian/Ubuntu. There are only a couple of PI specifics.<br />
<br />
In this case, we have a 5-node cluster. If your cluster is a different size, make sure the hadoop distribution is on each of the machines, make sure you update the slaves file, and copy the config files to each machine as shown.<br />
<br />
<h3>
Machine Setup</h3>
The first thing to do is to add DNS entries for all of your cluster machines. If you have a DNS server available, that's great. If you don't, you can edit the hosts file and add the correct entries there. This has to be done on every machine in the cluster. My entries look like this:<br />
<br />
<pre>10.1.10.120 pi0 master
10.1.10.121 pi1
10.1.10.122 pi2
10.1.10.123 pi3
10.1.10.124 pi4</pre>
For each machine, the hostname needs to be changed to reflect the hostnames defined above. The default Raspbian hostname is <code>raspberrypi</code>.<br />
<br />
Edit the <code>/etc/hostname</code> file and change the entry to the hostname that you want. In my case, the first PI is <code>pi0</code>. A quick reboot of each node should verify the correct hostname of each node.<br />
<br />
We want to add a hadoop user to each machine. This is not strictly necessary for hadoop to run, but we do this to keep our hadoop stuff separate from anything else on these machines.<br />
<br />
<pre>pi@pi0 ~ $ sudo adduser hadoop
Adding user `hadoop' ...
Adding new group `hadoop' (1004) ...
Adding new user `hadoop' (1001) with group `hadoop' ...
Creating home directory `/home/hadoop' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hadoop
Enter the new value, or press ENTER for the default
Full Name []: Hadoop
Room Number []:
Work Phone []:
Home Phone []:
Other []:
Is the information correct? [Y/n] Y </pre>
Log out as the pi user and log back in as the hadoop user on your master node.
Create a private identity key and copy it to your authorized_keys file.
<br />
<pre>ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys</pre>
Copy the identity key to all of your other nodes.<br />
<pre>ssh-copy-id -i .ssh/id_dsa.pub hadoop@pi1
ssh-copy-id -i .ssh/id_dsa.pub hadoop@pi2
ssh-copy-id -i .ssh/id_dsa.pub hadoop@pi3
ssh-copy-id -i .ssh/id_dsa.pub hadoop@pi4</pre>
You can also copy the keys directly to the new machine with something like this.<br />
<br />
<pre>scp .ssh/id_dsa.pub hadoop@pi1:.ssh/authorized_keys</pre>
<h3>
Hadoop Setup </h3>
Copy the distribution to the hadoop user account and decompress it.
I always create a symlink to make things a bit easier:<br />
<pre>ln -s hadoop-1.x.x hadoop</pre>
Now we need to edit some hadoop config files to make everything work.<br />
<br />
<pre>cd hadoop/conf
ls
</pre>
<br />
<br />
This should give you a listing that looks like this:
<br />
<pre>hadoop@raspberrypi ~/hadoop/conf $ ls
capacity-scheduler.xml core-site.xml hadoop-env.sh hadoop-policy.xml
log4j.properties mapred-site.xml slaves ssl-server.xml.example
configuration.xsl fair-scheduler.xml hadoop-metrics2.properties hdfs-site.xml
mapred-queue-acls.xml masters ssl-client.xml.example taskcontroller.cfg
</pre>
<br />
<br />
The files we're interested in are masters and slaves. Add the address of the node that will be the primary NameNode and JobTracker to the masters file. Add the addresses of the nodes that will be the DataNodes and TaskTrackers to the slaves file. You can also add the master node address to the slaves file if you want it to store data and run map reduce jobs, but I'm not doing that in this case.
These files look like this for my cluster:
<br />
<pre>hadoop@raspberrypi ~/hadoop/conf $ more masters
pi0
hadoop@raspberrypi ~/hadoop/conf $ more slaves
pi1
pi2
pi3
pi4
</pre>
<br />
<br />
These files need to be copied to all the nodes in the cluster. The easy way to do this would be as follows:<br />
<br />
<pre>hadoop@pi0 ~/hadoop/conf $ scp masters slaves pi1:hadoop/conf/.
masters 100% 12 0.0KB/s 00:00
slaves 100% 48 0.1KB/s 00:00
hadoop@pi0 ~/hadoop/conf $ scp masters slaves pi2:hadoop/conf/.
masters 100% 12 0.0KB/s 00:00
slaves 100% 48 0.1KB/s 00:00
hadoop@pi0 ~/hadoop/conf $ scp masters slaves pi3:hadoop/conf/.
masters 100% 12 0.0KB/s 00:00
slaves 100% 48 0.1KB/s 00:00
hadoop@pi0 ~/hadoop/conf $ scp masters slaves pi4:hadoop/conf/.
masters 100% 12 0.0KB/s 00:00
slaves 100% 48 0.1KB/s 00:00
</pre>
<br />
<br />
We need to define the JAVA_HOME setting in the hadoop-env.sh file. Since Hadoop runs on Java, it needs to know where the executable is. The Oracle JDK comes standard with the latest Raspbian release. We can set JAVA_HOME as follows:<br />
<br />
<pre># The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/jdk-7-oracle-armhf
</pre>
<br />
<br />
Next we need to look at the core-site.xml, hdfs-site.xml and mapred-site.xml files. These are the configuration files for the distributed file system HDFS and for the code that runs the map reduce programs we want to run. By default, both of these processes put their files in temporary file system space. That's fine for playing around, but as soon as you reboot, all of your data disappears and you have to start over again.
We first need to tell each machine where the NameNode is. This is configured in the core-site.xml file.
<br />
<pre><property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>
</pre>
Next we need to tell each node where the JobTracker is, and give it a safe place to store data so that it stays permanent. This is configured in the mapred-site.xml file.<br />
<br />
<pre><property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/home/hadoop/mapred</value>
</property>
</pre>
<br />
Lastly we need to specify safe directories to store our files for HDFS.<br />
<br />
<pre><property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/data</value>
</property>
</pre>
All of these files need to be copied to each of the other nodes in the cluster. We can do that like we did before with the masters and slaves files.
<br />
<pre>hadoop@pi0 ~/hadoop/conf $ scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml hadoop@pi1:hadoop/conf/.
hadoop-env.sh 100% 2218 2.2KB/s 00:00
core-site.xml 100% 268 0.3KB/s 00:00
hdfs-site.xml 100% 347 0.3KB/s 00:00
mapred-site.xml 100% 356 0.4KB/s 00:00
hadoop@pi0 ~/hadoop/conf $ scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml hadoop@pi2:hadoop/conf/.
hadoop-env.sh 100% 2218 2.2KB/s 00:00
core-site.xml 100% 268 0.3KB/s 00:00
hdfs-site.xml 100% 347 0.3KB/s 00:00
mapred-site.xml 100% 356 0.4KB/s 00:00
hadoop@pi0 ~/hadoop/conf $ scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml hadoop@pi3:hadoop/conf/.
hadoop-env.sh 100% 2218 2.2KB/s 00:00
core-site.xml 100% 268 0.3KB/s 00:00
hdfs-site.xml 100% 347 0.3KB/s 00:00
mapred-site.xml 100% 356 0.4KB/s 00:00
hadoop@pi0 ~/hadoop/conf $ scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml hadoop@pi4:hadoop/conf/.
hadoop-env.sh 100% 2218 2.2KB/s 00:00
core-site.xml 100% 268 0.3KB/s 00:00
hdfs-site.xml 100% 347 0.3KB/s 00:00
mapred-site.xml 100% 356 0.4KB/s 00:00
</pre>
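Typing the same scp command once per node gets tedious as the cluster grows. A small loop handles it; this is a sketch that matches this cluster's worker hostnames (pi1-pi4) and config path, so adjust both for your own setup:

```shell
# Copy the Hadoop config files to every worker node in one loop instead
# of four hand-typed scp commands. Each command is logged to scp_cmds.txt
# for review. DRY_RUN=1 (the default here) only prints the commands;
# set DRY_RUN=0 to actually copy.
: "${DRY_RUN:=1}"
FILES="hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml"
: > scp_cmds.txt
for host in pi1 pi2 pi3 pi4; do
    cmd="scp $FILES hadoop@$host:hadoop/conf/."
    echo "$cmd" | tee -a scp_cmds.txt
    [ "$DRY_RUN" = 1 ] || $cmd
done
```

Adding a node later only means adding one hostname to the list.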
<br />
<br />
We have one interesting task left. By default, the hadoop script wants to run the DataNode in server mode. The Oracle Java distribution for the Raspberry Pi does not support this option, so we need to edit the bin/hadoop script to remove it.<br />
<br />
<pre>cd bin
vi hadoop
Search for “-server”
Change this:
HADOOP_OPTS="$HADOOP_OPTS -server $HADOOP_DATANODE_OPTS"
to this:
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_DATANODE_OPTS" </pre>
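The vi edit above can also be done non-interactively with GNU sed. The snippet below demonstrates the edit on a stand-in file containing just the relevant line; on the cluster you would run the same sed command against bin/hadoop itself (keep a backup first):

```shell
# Reproduce the relevant line from bin/hadoop in a stand-in file.
printf '%s\n' 'HADOOP_OPTS="$HADOOP_OPTS -server $HADOOP_DATANODE_OPTS"' > hadoop-demo
# Remove the -server flag, exactly as the manual vi edit does.
sed -i 's/ -server / /' hadoop-demo
cat hadoop-demo   # -> HADOOP_OPTS="$HADOOP_OPTS $HADOOP_DATANODE_OPTS"
```

Doing it with sed means the change can be scripted across all five nodes instead of edited by hand on each one.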
We need to copy this file to all of the nodes in the cluster.<br />
<br />
<pre>hadoop@pi0 ~/hadoop/bin $ scp hadoop hadoop@pi1:hadoop/bin/.
hadoop 100% 14KB 13.8KB/s 00:00
hadoop@pi0 ~/hadoop/bin $ scp hadoop hadoop@pi2:hadoop/bin/.
hadoop 100% 14KB 13.8KB/s 00:00
hadoop@pi0 ~/hadoop/bin $ scp hadoop hadoop@pi3:hadoop/bin/.
hadoop 100% 14KB 13.8KB/s 00:00
hadoop@pi0 ~/hadoop/bin $ scp hadoop hadoop@pi4:hadoop/bin/.
hadoop 100% 14KB 13.8KB/s 00:00
</pre>
<br />
<br />
Now we're ready to get things up and running!
First we need to format the NameNode; this sets up the storage structure for HDFS. This is done with the bin/hadoop namenode -format command as follows:<br />
<br />
<pre>hadoop@pi0 ~/hadoop $ bin/hadoop namenode -format
14/11/04 10:53:28 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = pi0/10.1.10.120
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.1.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1411108; compiled by 'hortonfo' on Mon Nov 19 10:48:11 UTC 2012
************************************************************/
14/11/04 10:53:30 INFO util.GSet: VM type = 32-bit
14/11/04 10:53:30 INFO util.GSet: 2% max memory = 19.335 MB
14/11/04 10:53:30 INFO util.GSet: capacity = 2^22 = 4194304 entries
14/11/04 10:53:30 INFO util.GSet: recommended=4194304, actual=4194304
14/11/04 10:53:34 INFO namenode.FSNamesystem: fsOwner=hadoop
14/11/04 10:53:35 INFO namenode.FSNamesystem: supergroup=supergroup
14/11/04 10:53:35 INFO namenode.FSNamesystem: isPermissionEnabled=true
14/11/04 10:53:35 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
14/11/04 10:53:35 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
14/11/04 10:53:35 INFO namenode.NameNode: Caching file names occuring more than 10 times
14/11/04 10:53:37 INFO common.Storage: Image file of size 112 saved in 0 seconds.
14/11/04 10:53:38 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/home/hadoop/name/current/edits
14/11/04 10:53:38 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/home/hadoop/name/current/edits
14/11/04 10:53:39 INFO common.Storage: Storage directory /home/hadoop/name has been successfully formatted.
14/11/04 10:53:39 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at pi0/10.1.10.120
************************************************************/</pre>
Since we set up passwordless SSH logins earlier, we can now start the entire cluster with a single command!
<br />
<pre>hadoop@pi0 ~/hadoop $ bin/start-all.sh
starting namenode, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-namenode-pi0.out
pi1: starting datanode, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-datanode-pi1.out
pi3: starting datanode, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-datanode-pi3.out
pi4: starting datanode, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-datanode-pi4.out
pi2: starting datanode, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-datanode-pi2.out
pi0: starting secondarynamenode, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-secondarynamenode-pi0.out
starting jobtracker, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-jobtracker-pi0.out
pi1: starting tasktracker, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-tasktracker-pi1.out
pi2: starting tasktracker, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-tasktracker-pi2.out
pi4: starting tasktracker, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-tasktracker-pi4.out
pi3: starting tasktracker, logging to /home/hadoop/hadoop-1.1.1/libexec/../logs/hadoop-hadoop-tasktracker-pi3.out
</pre>
<br />
<br />
Open up a browser and go to the following URLs (pi0 is the master, or NameNode, in your cluster):
<br />
<pre>http://pi0:50070/dfshealth.jsp
http://pi0:50030/jobtracker.jsp
</pre>
<br />
<h3>
Testing the Setup</h3>
Now we can run something simple and quick to make sure things are working.<br />
<br />
<pre>hadoop@pi0 ~/hadoop $ bin/hadoop jar hadoop-examples-1.1.1.jar pi 4 1000
Number of Maps = 4
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Starting Job
14/11/04 11:50:50 INFO mapred.FileInputFormat: Total input paths to process : 4
14/11/04 11:50:55 INFO mapred.JobClient: Running job: job_201411041127_0001
14/11/04 11:50:56 INFO mapred.JobClient: map 0% reduce 0%
14/11/04 11:51:49 INFO mapred.JobClient: map 25% reduce 0%
14/11/04 11:51:54 INFO mapred.JobClient: map 100% reduce 0%
14/11/04 11:52:22 INFO mapred.JobClient: map 100% reduce 66%
14/11/04 11:52:25 INFO mapred.JobClient: map 100% reduce 100%
14/11/04 11:52:44 INFO mapred.JobClient: Job complete: job_201411041127_0001
14/11/04 11:52:44 INFO mapred.JobClient: Counters: 30
14/11/04 11:52:44 INFO mapred.JobClient: Job Counters
14/11/04 11:52:44 INFO mapred.JobClient: Launched reduce tasks=1
14/11/04 11:52:44 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=129653
14/11/04 11:52:44 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/11/04 11:52:44 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/11/04 11:52:44 INFO mapred.JobClient: Launched map tasks=4
14/11/04 11:52:44 INFO mapred.JobClient: Data-local map tasks=4
14/11/04 11:52:44 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=34464
14/11/04 11:52:44 INFO mapred.JobClient: File Input Format Counters
14/11/04 11:52:44 INFO mapred.JobClient: Bytes Read=472
14/11/04 11:52:44 INFO mapred.JobClient: File Output Format Counters
14/11/04 11:52:44 INFO mapred.JobClient: Bytes Written=97
14/11/04 11:52:44 INFO mapred.JobClient: FileSystemCounters
14/11/04 11:52:44 INFO mapred.JobClient: FILE_BYTES_READ=94
14/11/04 11:52:44 INFO mapred.JobClient: HDFS_BYTES_READ=956
14/11/04 11:52:44 INFO mapred.JobClient: FILE_BYTES_WRITTEN=119665
14/11/04 11:52:44 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215
14/11/04 11:52:44 INFO mapred.JobClient: Map-Reduce Framework
14/11/04 11:52:44 INFO mapred.JobClient: Map output materialized bytes=112
14/11/04 11:52:44 INFO mapred.JobClient: Map input records=4
14/11/04 11:52:44 INFO mapred.JobClient: Reduce shuffle bytes=112
14/11/04 11:52:44 INFO mapred.JobClient: Spilled Records=16
14/11/04 11:52:44 INFO mapred.JobClient: Map output bytes=72
14/11/04 11:52:44 INFO mapred.JobClient: Total committed heap usage (bytes)=819818496
14/11/04 11:52:44 INFO mapred.JobClient: CPU time spent (ms)=16580
14/11/04 11:52:44 INFO mapred.JobClient: Map input bytes=96
14/11/04 11:52:44 INFO mapred.JobClient: SPLIT_RAW_BYTES=484
14/11/04 11:52:44 INFO mapred.JobClient: Combine input records=0
14/11/04 11:52:44 INFO mapred.JobClient: Reduce input records=8
14/11/04 11:52:44 INFO mapred.JobClient: Reduce input groups=8
14/11/04 11:52:44 INFO mapred.JobClient: Combine output records=0
14/11/04 11:52:44 INFO mapred.JobClient: Physical memory (bytes) snapshot=586375168
14/11/04 11:52:44 INFO mapred.JobClient: Reduce output records=0
14/11/04 11:52:44 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1716187136
14/11/04 11:52:44 INFO mapred.JobClient: Map output records=8
Job Finished in 116.488 seconds
Estimated value of Pi is 3.14000000000000000000
</pre>
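The estimate works the way the example's name suggests: each map task scatters points in the unit square and counts how many land inside the quarter circle, and four times that fraction approaches pi. The same idea fits in a few lines of awk (plain pseudo-random sampling here; Hadoop's version uses a Halton sequence, which converges faster):

```shell
# Monte Carlo estimate of pi: sample n random points in the unit square
# and count the fraction that fall inside the quarter circle of radius 1.
pi_est=$(awk 'BEGIN {
    srand(42); n = 100000; inside = 0
    for (i = 0; i < n; i++) {
        x = rand(); y = rand()
        if (x*x + y*y <= 1) inside++
    }
    printf "%.4f", 4 * inside / n
}')
echo "Estimated value of Pi is $pi_est"   # a value close to 3.14
```

The Hadoop job obviously isn't the fastest way to compute pi; the point is that it exercises HDFS, the JobTracker, and all four TaskTrackers end to end.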
Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-2599123504111716392014-08-26T12:28:00.001-07:002014-08-26T12:47:19.941-07:00All About ElasticSearch ScriptingThis is taken from the ElasticSearch blog of a similar name, <a href="http://www.elasticsearch.org/blog/scripting/">all about scripting</a>.<br />
<br />
There is a lot of good news in here. This is as of the 1.3 branch:<br />
<br />
<ul>
<li>Dynamic scripting disabled by default for security</li>
<li>Moving from MVEL to Groovy</li>
<li>Added Lucene Expressions</li>
<li><code>field_value_factor</code> functions</li>
</ul>
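To illustrate the last item: a function_score query with field_value_factor lets a numeric field boost relevance. A sketch of the request body (the articles index and its likes field are made up for illustration):

```json
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "elasticsearch" } },
      "field_value_factor": {
        "field": "likes",
        "modifier": "log1p",
        "factor": 2
      }
    }
  }
}
```

Sent as the body of a search request, this multiplies each matching document's score by log1p(2 * likes), so popular articles rank higher without swamping text relevance.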
<br />
Here's an earlier article about <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html">scripting and script security</a>.<br />
<br />
I personally haven't done that much with ElasticSearch scripting, but I need to start playing with it to see how it can help. I'd encourage you to do the same.<br />
<br />
- Craig Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-76441810672393168902014-07-29T16:57:00.001-07:002014-07-29T16:57:28.172-07:00ElasticSearch Resiliency and Failover<div dir="ltr" id="docs-internal-guid-27040a6a-848d-aa57-0dc8-8535cc9c58ca" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">ElasticSearch is designed with the assumption that things break. Hardware fails and software crashes. That’s a fact of life. ElasticSearch mainly deals with this through the use of clustering. That means that many machines can be combined together to work as a single unit.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Data can be replicated or copied across multiple servers so that the loss of one or more nodes can be tolerated by the cluster and still have the cluster respond to requests in a timely manner. This obviously depends on the exact configuration.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">The primary mechanism for ElasticSearch resiliency is divided into 3 parts - nodes, shards, and replicas. Nodes are the individual servers, and as nodes are added to the cluster, data is spread across them in terms of shards and replicas. A shard is a slice or section of the data; the number of shards is specified when an index is created and cannot be changed. A replica is simply a copy of a shard. The number of replicas can be changed on the fly.</span></div>
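In API terms, the shard count is part of the index settings fixed at creation time, while the replica count can be updated on a live index. A sketch of the creation settings (the counts here are just examples):

```json
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
```

PUT this body when creating an index; later, a body of {"number_of_replicas": 2} PUT to that index's _settings endpoint raises the replica count on the fly, while number_of_shards cannot be changed without reindexing.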
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">If a cluster is formed from a single node, then all primary shards are on that node. If there is a failure of the node or of a shard, then data will likely be lost.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">When you add a node to the cluster so you now have 2 nodes, then primary shards will be spread across both nodes. If a node is lost, then some data will not be available until that node is restored. If that node cannot be restored, then data will be lost.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">The next step would be to add a replica. Again, a replica is a copy of the data. In a 2-node scenario, the primary shards will be spread across both nodes, but a replica of each shard will be allocated to the alternate node. The effect is that both nodes will have a full copy of the data, but neither node will hold all of the primary shards. If a node is lost in this case, the remaining node can handle all requests. When the failed node is restored or a replacement node is added to the cluster, ElasticSearch will replicate shards to the second node and achieve equilibrium again.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Adding a third node to the cluster will spread the 2 copies of the data across all 3 nodes. The cluster can handle the failure of a single node, but not 2 nodes.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">If a second replica is added (making 3 copies of the data), then each node would effectively have a copy of the data. This configuration should be able to handle the loss of 2 of the 3 nodes, though you really don’t want that to happen.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">ElasticSearch has some features that allow you to influence shard allocation. There are a couple of different algorithms that it can use to control shard placement. Recently, they have added the size of the shard into the allocation strategy such that one node does not end up with lots of large shards while another node has primarily small shards.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">ElasticSearch also has a rack awareness feature. This allows you to tell ElasticSearch that some nodes are physically close to each other while other nodes are physically far apart. For example, you can have some nodes in one data center and other nodes of the same cluster in another data center. ElasticSearch will try to keep requests localized as much as possible. Having a cluster split across data centers is not really recommended for performance reasons, but is an option.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">ElasticSearch has a federated search feature. This allows 2 clusters to be in physically separate data centers while essentially a third cluster will arbitrate requests across clusters. This is a very welcome feature for ElasticSearch.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">ElasticSearch has also added a snapshot-restore feature as of the 1.0 release. This allows an entire cluster to be backed up and restored, or one or more indices can be specified for a particular snapshot. Snapshots can be taken on a fully running cluster and can be scripted and taken periodically.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Once the initial snapshot occurs, subsequent snapshots taken are incremental in nature. </span><a href="http://www.elasticsearch.org/blog/introducing-snapshot-restore/" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">http://www.elasticsearch.org/blog/introducing-snapshot-restore/</span></a></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-snapshots.html" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-snapshots.html</span></a></div>
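Per those docs, a repository is registered first and snapshots are then taken into it. A sketch of a shared-filesystem repository definition (the repository name and path are illustrative):

```json
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/my_backup",
    "compress": true
  }
}
```

PUT this to _snapshot/my_backup to register the repository; a PUT to _snapshot/my_backup/snapshot_1 then takes a snapshot, and later snapshots into the same repository only store what has changed.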
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">One of the great things about snapshot/restore is that it can be used to keep a near-line cluster that mirrors the production cluster. If the production cluster goes down, the backup cluster can be brought back online at roughly the same point in time document wise.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">This feature works by shipping the snapshots to the near-line cluster, then applying the snapshots to that cluster. Since the snapshots are incremental, the process should be pretty quick, even for clusters with a good amount of volume.</span></div>
<br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">There are a variety of ways that ElasticSearch can be made resilient depending on the exact architecture. ElasticSearch continues to evolve in this regard as more and more companies rely on it for mission critical applications.</span></div>
<br /> - CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-21443182808644685342014-07-22T16:36:00.000-07:002015-05-28T12:35:44.061-07:00What is Big Data?I've been wanting to write this for a long time, but now have time to put some thoughts down based on discussions and comments from various groups of technical and business people.<br />
<br />
The question is, What is Big Data?<br />
<br />
Is Big Data just that, big ... data? Well, there have always been very large data sets, certainly constrained to the technology limitations of the time. Storage systems that once were measured in gigabytes, then terabytes, are now measured in petabytes. Certainly our ability to store vast quantities of data has increased dramatically.<br />
<br />
With the advent of the Internet and the World Wide Web, our ability to collect information has also grown dramatically. Where companies and individuals were once limited to collecting data within their particular sphere, collecting data from locations and individuals around the globe is now a daily occurrence.<br />
<br />
I've heard people sometimes assert that Big Data is simply doing the same things we were doing before, just in a different way, maybe with different tools. I can understand that point of view. I think that is generally the first step in the process of moving towards a Big Data mindset.<br />
<br />
I would put forth that Big Data really is something different than just working with large data sets. To me, Big Data is really a different mindset. We live in a vastly more connected world than just a decade ago. The daily effect of data on people's lives is incredibly pervasive. It's almost impossible to escape.<br />
<br />
Big Data is about thinking differently about your data.<br />
<br />
Big Data is about connecting data sources. What other data is available that can enhance what you have? Can you add user behavior from your web site? How about advertising or industry data? How about public sources from government or private sources like Twitter, Facebook, search engines like Google, Bing, Yahoo and others?<br />
<br />
Big Data is more about openness and sharing. Not everything can be shared and you really, really need to be careful of user identifiable information, but what can you share? What can you publish or what data sets can you contribute to that enrich the overall community?<br />
<br />
Big Data many times involves working with unstructured data. SQL databases, by their nature, enforce very strict structures on your data. If your data doesn't conform, then you're kind of out of luck. There are some things that SQL databases do extremely well, but there are real limits.<br />
<br />
There are many new tools and techniques for working with unstructured data and extracting value from it. So-called NoSQL data stores are designed to work with data that has little or no structure, providing new capabilities. Open source search engines like ElasticSearch and SOLR provide incredible search and faceting/aggregation abilities.<br />
<br />
We have many machine learning algorithms and tools that let us dissect and layer structure on top of our data to help us make sense of it. Algorithms help us to cluster, classify, and figure out which documents/data are similar to other ones.<br />
<br />
We can process volumes of data in ways that we couldn't before. Traditional computing requires the data to be moved to the application/algorithm, with the answer then written back to storage. Now we have platforms like Hadoop that effectively distribute large data sets and move the algorithm to the data, allowing it to be processed and the answers written in place, or distributed elsewhere.<br />
<br />
Does Big Data require big/large data to be useful? You can run most of these tools on your laptop, so no, not really. You can even run many tools like Map/Reduce on a $35 Raspberry Pi. With Big Data tools, we can do things on our laptops that required data centers in the past.<br />
<br />
Big Data requires experimentation. It requires people to ask new questions and develop new answers. What can we do now that we couldn't do before? It requires a different mindset.<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-57903363383203640282014-07-22T15:11:00.001-07:002014-07-22T15:11:04.607-07:00Elasticsearch 1.2.2 releasedSpreading the word about the latest <a href="http://www.elasticsearch.org/blog/elasticsearch-1-2-2-released/">ElasticSearch </a>release. The latest release is based on Lucene 4.8.1 and includes some nice bug fixes.<br />
<br />
There are fixes for possible translog corruption and some file locking issues with previous Lucene releases. It also includes a fix for caching issues with percolation of nested documents.<br />
<br />
Great work by the ElasticSearch team. Be sure to get the new release! <a href="http://www.elasticsearch.org/downloads/1-2-2/">http://www.elasticsearch.org/downloads/1-2-2/</a><br />
<br />
- Craig<br />
<br />
<br />Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-19927064460989974002014-07-22T14:13:00.002-07:002014-07-22T14:14:04.211-07:00Beta launch of ExecuteSales.com<a href="https://beta.executesales.com/">https://beta.executesales.com/</a> is now open and available for beta. ExecuteSales is a platform connecting companies and sales service providers, enabling both
to generate new business and achieve higher sales revenue.<br />
<br />
I've been working on the search core of ExecuteSales for the last 18 months along with some other fantastic team members and owners, and am very excited to get to beta! Beta is totally free, so check it out!<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-49978103183337084272014-04-08T10:55:00.001-07:002014-04-08T10:59:01.694-07:00Blue Raspberry Pi Cluster - 10 Nodes Running ElasticSearch, Hadoop, and Tomcat!This is my latest project. It's a 10-node Raspberry Pi Cluster! This was built as part of the Utah Big Mountain Data Conference. It is a competition prize and will be given away as a promotional item from <a href="http://nosqlrevolution.com/">NosqlRevolution LLC</a> which is a Nosql/ElasticSearch/Hadoop/Machine Learning consulting company that I founded.<br />
<br />
The idea behind the box is to be able to show and demonstrate many of the concepts that are being talked about during the conference. It also gives an idea of how an individual may be able to work and study these concepts in a very small form factor.<br />
<br />
From the Application node, the USB and HDMI ports are extended to the outside of the box. A network port from the 16-port switch is also extended. You can plug in a keyboard, mouse, video, and network and then use the box much as you would a Linux PC.<br />
<br />
The cluster is currently pulling Big Data related tweets directly into ElasticSearch via the Twitter river plugin. Periodically, some of the tweets are pulled into the Hadoop cluster for some basic processing, then written back to ElasticSearch for display. A Java REST application with an AngularJS front end provides a search interface for the tweets and displays the results and basic trending provided by Hadoop.<br />
<br />
There you go. A single box that lets us run all of this software and demonstrate an end-to-end "Big Data" system.<br />
<br />
Stats:<br />
<br />
<ul>
<li>10 Raspberry Pi Model B with 512MB RAM and a 32GB SD card each.</li>
<li>Total cluster memory - 5GB</li>
<li>Total cluster storage - 320GB</li>
<li>5 Hadoop nodes - 1 name node and 4 data nodes</li>
<li>4 ElasticSearch nodes</li>
<li>1 Application node running tomcat</li>
<li>1 16-port network switch</li>
</ul>
<br />
<br />
Please join us on April 12, 2014 in Salt Lake City, UT. See <a href="http://www.uhug.org/">www.uhug.org</a> and <a href="http://utahbigmountaindata.com/">utahbigmountaindata.com</a> and <a href="http://utahcodecamp.com/Event/BigMountainDataSpring2014/Sessions">utahcodecamp.com</a> for more information.<br />
<div>
<br />
- Craig<br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgX5tdJIBXTXywMaFoECDu7Bv6j6_uhAR67x0fsvvT1MWHY53Dn7rPiNkaLci8Ulu-iFQd5LR6txaOLzQBB3FKkExlYAvaEWGGqm34a4UgR2heBkqpDx9A0hLxhym1t9_NHCSJaCcvNhuk/s1600/IMG_3640.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgX5tdJIBXTXywMaFoECDu7Bv6j6_uhAR67x0fsvvT1MWHY53Dn7rPiNkaLci8Ulu-iFQd5LR6txaOLzQBB3FKkExlYAvaEWGGqm34a4UgR2heBkqpDx9A0hLxhym1t9_NHCSJaCcvNhuk/s1600/IMG_3640.JPG" height="480" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6RY9cBzBERa42eP3D3FpKrJ5hyRMQq1riLLak5esupqZVkNwG4gqbhx43WBG8d4G2VB1ihPKaf7qzSbF6tHj6Il5WVFJ0R1qcNW5MwwlcgWtww6Ac06QbTr12jJ5EI3o5s_hXi1ofLIM/s1600/IMG_3621.JPG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6RY9cBzBERa42eP3D3FpKrJ5hyRMQq1riLLak5esupqZVkNwG4gqbhx43WBG8d4G2VB1ihPKaf7qzSbF6tHj6Il5WVFJ0R1qcNW5MwwlcgWtww6Ac06QbTr12jJ5EI3o5s_hXi1ofLIM/s1600/IMG_3621.JPG" height="480" width="640" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgh5lyl8f-AwPH_2d4Lx7t6ZExGZURv9aAKij-t83v7MGyPOQUgSHQtEIH0FkqvOhbFfo4ad_AjfV1hslfnqtX7s27SYnK8RUzuM3H4hj-k9N2Wp62VrF2i71y_TlZpXLigExwNp3zrVBQ/s1600/IMG_3645.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgh5lyl8f-AwPH_2d4Lx7t6ZExGZURv9aAKij-t83v7MGyPOQUgSHQtEIH0FkqvOhbFfo4ad_AjfV1hslfnqtX7s27SYnK8RUzuM3H4hj-k9N2Wp62VrF2i71y_TlZpXLigExwNp3zrVBQ/s1600/IMG_3645.JPG" height="480" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNf7FI6tr7fHKmEdA9cPRkRVnHUWxWPhlLOIN4s3WC-ZiL-R5G85TJPLSEBt_IqTyCGdxZQjARYpt9sYwkaTWqX2MMihpoT1N63AkyKSbeMojnIPXsgnwOrdWB2jGZbNswiYhIqyW1US4/s1600/IMG_3618.JPG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNf7FI6tr7fHKmEdA9cPRkRVnHUWxWPhlLOIN4s3WC-ZiL-R5G85TJPLSEBt_IqTyCGdxZQjARYpt9sYwkaTWqX2MMihpoT1N63AkyKSbeMojnIPXsgnwOrdWB2jGZbNswiYhIqyW1US4/s1600/IMG_3618.JPG" height="480" width="640" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDAvp0KViT27FG1Mq7mvL9OqjeWmNPXOWlNJK2iy0aO-ULq2Rgpckhgh3BhWVvO10GivvPdgvqjI3sctJf4TyHpPXLc6xYKXkav-xfGL-OyPW377V9XLjAEoVA1ojIcAQW1Q0LtwAmwAw/s1600/IMG_3648.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDAvp0KViT27FG1Mq7mvL9OqjeWmNPXOWlNJK2iy0aO-ULq2Rgpckhgh3BhWVvO10GivvPdgvqjI3sctJf4TyHpPXLc6xYKXkav-xfGL-OyPW377V9XLjAEoVA1ojIcAQW1Q0LtwAmwAw/s1600/IMG_3648.JPG" height="480" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWAnoHZslOP2z7-qClx6X9Ct1NZkPDceeDZM4LR-1SISeCcdpTFWlDVQTE1V66VRxLX0OyReUckoRPzzXF0fK1iG0eLdyBhzlyDl-VAtqybjjMCcc9GaVhPF1F2I0Kh1L70FpuVLeRPkM/s1600/IMG_3653.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWAnoHZslOP2z7-qClx6X9Ct1NZkPDceeDZM4LR-1SISeCcdpTFWlDVQTE1V66VRxLX0OyReUckoRPzzXF0fK1iG0eLdyBhzlyDl-VAtqybjjMCcc9GaVhPF1F2I0Kh1L70FpuVLeRPkM/s1600/IMG_3653.JPG" height="480" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqgIgNz5FeAgFIFLUTHOcSkA-Jj01p_TUvw_nlmelcUktMvO9Si7ZJUhlgZ4zxl4grNs4vx37hjERPVo8AJIj_XdxEv7GOCOii64PXA_h3ydrR4mJo5MiUL-zmfCs0k1QIykkuvL-mgo8/s1600/IMG_3657.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqgIgNz5FeAgFIFLUTHOcSkA-Jj01p_TUvw_nlmelcUktMvO9Si7ZJUhlgZ4zxl4grNs4vx37hjERPVo8AJIj_XdxEv7GOCOii64PXA_h3ydrR4mJo5MiUL-zmfCs0k1QIykkuvL-mgo8/s1600/IMG_3657.JPG" height="480" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyMwwAHoXc4YLgMme_XLkp-M3ON3HnJ5Dy5VdqxncutcQrsRmqxzKUr9HG5qJr6-7bvfISbM3QmAW7htqliqJ5haXyhHmgkJ0keiXiJgCVpa1gietCDiLQuyi4-DcObu0l9MG9pByJUlk/s1600/IMG_3670.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyMwwAHoXc4YLgMme_XLkp-M3ON3HnJ5Dy5VdqxncutcQrsRmqxzKUr9HG5qJr6-7bvfISbM3QmAW7htqliqJ5haXyhHmgkJ0keiXiJgCVpa1gietCDiLQuyi4-DcObu0l9MG9pByJUlk/s1600/IMG_3670.JPG" height="480" width="640" /></a></div>
<br /></div>
Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-52444845414191422582014-01-15T15:46:00.001-08:002014-01-15T15:52:09.678-08:00OSv: The Open Source Cloud Operating System That is Not LinuxI just ran across this today and found it very interesting. <a href="http://www.linux.com/news/featured-blogs/200-libby-clark/748578-osv-the-open-source-cloud-operating-system-that-is-not-linux">OSv: The Open Source Cloud Operating System That is Not Linux</a> on <a href="http://www.linux.com/">linux.com</a>. It looks like they are building a stripped-down OS that is designed to run JVM applications. Since the JVM runs Java, Ruby, Scala, JavaScript, and others, it gives you a lot of options for running your applications. Of course, not everything will run on the JVM, so they indicate they'll be working with other projects to port them to the JVM.<br />
<br />
Part of what makes this interesting is that <span style="font-size: x-small;"><span><span id="docs-internal-guid-48d4dfa8-1988-af88-54f2-3bf399ec60bb"><span style="background-color: transparent; color: black; vertical-align: baseline; white-space: pre-wrap;">OS</span><span style="background-color: transparent; color: black; vertical-align: super; white-space: pre-wrap;">v</span></span></span></span> is a single-user, single-application OS layer that runs on either KVM or Xen. To quote one of the authors:<br />
<blockquote class="tr_bq">
<span style="font-size: 11px; line-height: 18px;">Avi: The traditional
OS supports multiple users with multiple applications running on top.
OSV doesn't do any of that. It trusts the hypervisor to provide the
multi-tenancy and it just runs a single one. We do a smaller job and
that allows us to do it well and allows the administration of that
system to be easier as well because there's simply less to administer.</span></blockquote>
<br />
From the web site:<br />
<blockquote class="tr_bq">
<span style="font-size: x-small;"><span><span id="docs-internal-guid-48d4dfa8-1988-af88-54f2-3bf399ec60bb"><span style="background-color: transparent; color: black; vertical-align: baseline; white-space: pre-wrap;">OS</span><span style="background-color: transparent; color: black; vertical-align: super; white-space: pre-wrap;">v</span><span style="background-color: transparent; color: black; vertical-align: baseline; white-space: pre-wrap;"> reduces the memory and cpu overhead imposed by traditional OS.
Scheduling is lightweight, the application and the kernel cooperate,
memory pools are shared. It</span></span> provides unparalleled short
latencies and constant predictable performance, translated directly to
capex saving by reduction of the number of OS instances/sizes.</span></span> </blockquote>
<br />
I'm excited to check this out since you can run it easily in an existing Linux environment. I'd really like to see how ElasticSearch performs on <span style="font-size: x-small;"><span><span id="docs-internal-guid-48d4dfa8-1988-af88-54f2-3bf399ec60bb"><span style="background-color: transparent; color: black; vertical-align: baseline; white-space: pre-wrap;">OS</span><span style="background-color: transparent; color: black; vertical-align: super; white-space: pre-wrap;">v</span></span></span></span>. You can find out more information on their site, <a href="http://osv.io/">osv.io</a>. <br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-32279126280691111552013-09-18T18:50:00.000-07:002013-09-18T18:50:02.532-07:00Companion slides for ElasticSearch talk at the Utah Big Mountain Data Conference.<br />
<br />
<iframe allowfullscreen="" frameborder="0" height="356" marginheight="0" marginwidth="0" mozallowfullscreen="" scrolling="no" src="http://www.slideshare.net/slideshow/embed_code/26330952" style="border-width: 1px 1px 0; border: 1px solid #CCC; margin-bottom: 5px;" webkitallowfullscreen="" width="427"> </iframe> <br />
<div style="margin-bottom: 5px;">
<b> <a href="https://www.slideshare.net/imarcticblue/craig-brown-speaks-on-elast" target="_blank" title="Craig Brown speaks on ElasticSearch">Craig Brown speaks on ElasticSearch</a> </b> from <b><a href="http://www.slideshare.net/imarcticblue" target="_blank">imarcticblue</a></b> </div>
Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-59150946905671830422013-09-17T19:48:00.002-07:002013-09-17T19:48:39.955-07:00Craig Brown speaking on ElasticSearch at Utah Big Mountain Data Conference<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/77UbSZKWhPA?feature=player_embedded' frameborder='0'></iframe></div>
Enjoy!<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-47108994382450325392013-03-12T17:54:00.001-07:002013-03-12T17:54:46.034-07:00Acunu - Eric Evans - Castle-enhanced Cassandra<a href="http://vimeo.com/43903966#">Castle-enhanced Cassandra</a> is a 20-minute video from Berlin Buzzwords, given by Eric Evans of Acunu.<br />
<br />
Castle is a new FLOSS storage backend for Linux, delivered as an LKM (loadable kernel module). It's designed to work with the SSTables that Cassandra uses for storage, and it's a write-optimized storage system built to perform well on both rotational disks and SSDs.<br />
<br />
It looks like a great project, and I'm glad something like this is open source. It would be interesting to see if other projects can take advantage of it. There are other NoSQL projects that, while not using SSTables, use similar write-append strategies.<br />
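The write-append idea is worth a closer look. Here is a toy Python sketch of the pattern behind SSTable-style storage: buffer recent writes in memory, then flush them as immutable sorted runs that are never rewritten in place. The class and names here are purely illustrative, not how Castle or Cassandra are actually implemented.

```python
import bisect

class TinyLSM:
    """Toy illustration of the write-append/SSTable pattern (not real Cassandra)."""

    def __init__(self, flush_at=3):
        self.memtable = {}      # recent writes, mutable, in memory
        self.runs = []          # flushed runs: immutable, sorted lists of (key, value)
        self.flush_at = flush_at

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_at:
            # Flush: sort once on the way out; the run is never modified again,
            # so writes to disk would be purely sequential appends.
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest run first, so the most recent flushed value wins.
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Reads may have to consult several runs; real systems compact runs together in the background to keep that cost bounded, which is exactly where a storage layer like Castle can help.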
<br />
Thanks Acunu!<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-71050622410778927152013-03-06T21:56:00.000-08:002013-03-06T21:56:44.840-08:00Wired - Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon<a href="http://www.wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos/">Nice article</a> on Wired.com that revolves around the open source Apache incubator project <a href="http://incubator.apache.org/mesos/">Mesos</a>.<br />
<br />
Mesos is considered to be cluster management software. It is designed to take your entire data center and virtualize the resources for applications running under Mesos.<br />
<br />
You don't have to set up a dedicated cluster of machines for a single purpose like running Hadoop. Mesos allows you to set up multiple Hadoop cluster instances over the same hardware set to make more efficient use of resources.<br />
<br />
Google's system is referred to as Borg, which is fitting. It is being upgraded soon to Omega. The underlying goal is the same: to efficiently use computing resources for all of the required tasks.<br />
<br />
Mesos is being supported by a number of engineers and companies including Twitter and some former Google engineers who worked on Borg.<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-63772373513925404592013-02-05T20:26:00.000-08:002013-02-05T20:26:00.180-08:00Testing MapReduce with MRUnit By Mansoor Ashraf: <a href="http://m-mansur-ashraf.blogspot.com/2013/02/testing-mapreduce-with-mrunit.html">Testing MapReduce with MRUnit</a>.<br />
<br />
Here is the <a href="http://mrunit.apache.org/">MRUnit home page</a>.<br />
<br />
I ran across this blog today and thought it was really worthwhile. I don't know how many of you are into writing MR jobs, but MRUnit is an invaluable piece of kit to have available.<br />
<br />
Many MR jobs are very time consuming; even small jobs lose real time to spinning up and reviewing the results, and cluster resources may not be available when you want to test. On top of that, there can be additional expense involved if you're running something like Elastic MapReduce on AWS.<br />
<br />
MRUnit is also a good learning tool, giving you a chance to try out MR features or test new algorithms in a quick, safe environment. Unit tests in general help other people come up to speed quickly on how your code works, provide usage documentation, and offer some protection against your code being broken.<br />
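MRUnit itself is a Java library, but the idea it enables can be sketched in a few lines of Python: write your map and reduce logic as plain functions and exercise them directly, with no cluster and no job submission. The function names below are illustrative, not MRUnit's API.

```python
# Sketch of the MRUnit idea in Python: test map and reduce logic as
# plain functions, one record at a time, with no cluster involved.

def word_count_map(line):
    """Mapper: emit a (word, 1) pair for every word in the input line."""
    return [(word.lower(), 1) for word in line.split()]

def word_count_reduce(word, counts):
    """Reducer: sum all the counts emitted for a single word."""
    return (word, sum(counts))

# The "unit test": feed one record in, assert on exactly what comes out.
assert word_count_map("Hello hello world") == [("hello", 1), ("hello", 1), ("world", 1)]
assert word_count_reduce("hello", [1, 1]) == ("hello", 2)
```

MRUnit's drivers do the same thing for real Hadoop `Mapper` and `Reducer` classes, handling the framework plumbing so each test stays this small.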
<br />
Go ahead and give it a try.<br />
<br />
- Craig<br />
<br />Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-30331811211471442502012-08-09T11:40:00.000-07:002012-08-09T11:40:00.727-07:00Querying 24 Billion Records in 900ms (ElasticSearch)This is a video I came across from Berlin Buzzwords: <a href="http://www.elasticsearch.org/videos/2012/06/05/querying-24-billion-records-in-900ms.html">Querying 24 Billion Records in 900ms</a>. The presentation is only 20 minutes long, but it includes slides, which is very helpful.<br />
<br />
The speaker gives a lot of good information about trying to scale out ES on AWS to handle 24B records. They were able to do it successfully, which is excellent. They ended up moving to dedicated hardware to reduce the AWS cost.<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-29033117038384744482012-08-09T11:36:00.001-07:002012-08-09T11:36:47.914-07:00foursquare now uses ElasticSearch<a href="http://engineering.foursquare.com/2012/08/09/foursquare-now-uses-elastic-search-and-on-a-related-note-slashem-also-works-with-elastic-search/">foursquare now uses ElasticSearch</a>! This is pretty exciting news! It's great to see people switching over to ElasticSearch and seeing such great success. The article doesn't have a ton of details, but it's definitely worth a look.<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-58646306412121800942012-08-08T14:18:00.000-07:002012-08-08T14:18:48.436-07:00NoSQLUnit 0.3.2 ReleasedThis is something quite interesting that I ran across today - <a href="http://java.dzone.com/articles/nosqlunit-032-released">NoSQLUnit 0.3.2</a>. I didn't realize that there was a project out there like this, very cool!<br />
<br />
Some NoSQL projects are easier to run embedded than others, and this could certainly help with that. You always need helper classes to facilitate running unit tests against a data source. I've written unit tests against ElasticSearch, which has a local/in-memory mode that is great for writing tests against. Still, you have to generate your test data and load it into the search engine for testing.<br />
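As a rough illustration of that fixture-loading chore, here is a Python sketch using a hypothetical in-memory stand-in for a search engine; NoSQLUnit solves the same problem for real Java stores with dataset annotations.

```python
import unittest

class FakeSearchIndex:
    """Hypothetical in-memory stand-in for an embedded search engine."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, doc):
        self.docs[doc_id] = doc

    def search(self, field, value):
        return [d for d in self.docs.values() if d.get(field) == value]

class ProductSearchTest(unittest.TestCase):
    def setUp(self):
        # The fixture-loading step: seed known test data before every test,
        # so each test runs against a predictable data set.
        self.engine = FakeSearchIndex()
        self.engine.index("1", {"name": "widget", "color": "red"})
        self.engine.index("2", {"name": "gadget", "color": "blue"})

    def test_search_by_color(self):
        hits = self.engine.search("color", "red")
        self.assertEqual([h["name"] for h in hits], ["widget"])
```

The value of a tool like NoSQLUnit is that the `setUp` boilerplate above gets replaced by declarative fixtures, so the seeding code doesn't have to be rewritten for every test class.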
<br />
Great project!<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-58273628396442249762012-07-20T21:52:00.001-07:002012-07-20T21:52:09.673-07:00Time To Build Your Big-Data Muscles<a href="http://www.fastcompany.com/1842928/time-to-build-your-big-data-muscles">Time To Build Your Big-Data Muscles </a>from fastcompany.com talks mainly about getting educated for a Big Data job. They provide an interesting statistic that for every 100 open Big Data jobs, there are only 2 qualified candidates. Definitely a good place to be if you're looking for work in this area.<br />
<br />
A background in mathematics or engineering is your best bet. Load up on statistics if you're able to. I'm more on the engineering side myself. My work involves taking the algorithms and principles that have been developed and applying them to business data sets. There is so much work to do just in the application of algorithms alone.<br />
<br />
- CraigCraig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.comtag:blogger.com,1999:blog-6654321592427408506.post-84473850608854455092012-07-18T21:05:00.002-07:002012-07-18T21:09:01.123-07:00Cameron Befus speaks at UHUG (Hadoop)This evening, we heard from Cameron Befus, former CTO of Tynt, which was purchased by 33 Across Inc. Here are some notes from his presentation on Hadoop.<br />
<br />
Why use Hadoop?<br />
<br />
A framework for parallel processing of large data sets.<br />
<br />
Design considerations<br />
- System will manage and heal itself<br />
- Performance to scale linearly<br />
- Compute: move the algorithm to the data<br />
<br />
Transition cluster size - local -> cloud -> dedicated hardware.<br />
<br />
Hadoop processing is optimized for data retrieval<br />
- schema on read<br />
- nosql databases<br />
- map reduce<br />
- asynchronous<br />
- parallel computing<br />
<br />
Built for unreliable, commodity hardware.<br />
Scale by adding boxes<br />
More cost effective<br />
<br />
Sends the program to the data.<br />
- Larger choke points occur in transferring large volumes of data around the cluster<br />
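The map/shuffle/reduce model from the notes above can be sketched as a toy, single-process Python walk-through; in a real Hadoop cluster each phase runs in parallel across many nodes, and the map code is shipped to wherever the data lives.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process illustration of the three MapReduce phases."""
    # Map phase: each input record independently becomes (key, value) pairs.
    pairs = [kv for record in records for kv in map_fn(record)]
    # Shuffle phase: group all values by key -- in a cluster this is the
    # network-heavy step where the choke points mentioned above appear.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: one reduce call per distinct key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Example: total sales per product across a pile of sales records.
sales = [("batteries", 4), ("pop tarts", 2), ("batteries", 1)]
totals = run_mapreduce(sales, lambda r: [r], lambda k, vs: sum(vs))
# totals == {"batteries": 5, "pop tarts": 2}
```

Because each map call and each reduce call is independent, both phases parallelize trivially, which is exactly why "just about anything that can be easily parallelized" fits this model.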
<br />
#1 Data Driven Decision Making<br />
<br />
Everyone wants to make good decisions, and as everyone knows, to make good decisions you need data: not just any data, but the right data.<br />
<br />
Walmart mined sales data from the days before an approaching hurricane and found that the highest-selling products were:<br />
#1 batteries,<br />
#2 pop tarts.<br />
<br />
This means $$$ to Walmart.<br />
<br />
Google can predict flu trends around the world just from search queries. Query velocity matches medical data.<br />
<br />
With the advent of large disks, it becomes cost-effective to simply store everything, then use a system like Hadoop to run through and process the data to find value.<br />
<br />
Combining data sets can extract value when they may not be valuable on their own. Turning lead into gold as it were.<br />
<br />
How big is big data?<br />
not just size,<br />
complexity<br />
rate of growth<br />
performance<br />
retention<br />
<br />
Other uses,<br />
load testing<br />
number crunching<br />
building lucene indexes<br />
just about anything that can be easily parallelized<br />
<br />
<br />
- Craig<br />
<br />Craig Brownhttp://www.blogger.com/profile/11290264435311956668noreply@blogger.com