Thursday, October 29, 2015

Elasticsearch 2.0 released!

Elasticsearch 2.0 is finally out! Alongside it they shipped a number of other new releases, including Kibana 4.2, Marvel 2.0, and Sense 2.0. Definitely a lot of new stuff!

Elasticsearch 2.0 is a major milestone and an achievement of the whole team, with wonderful contributions from the community. It adds a new type of aggregation called pipeline aggregations, simplifies the query DSL by merging the query and filter concepts, offers better compression options, hardens security by enabling the Java Security Manager, hardens filesystem behavior (fsync, more checksums, atomic renames), improves performance, makes mapping behavior consistent, and much more. It also bundles the Lucene 5 release, which includes numerous improvements, several of them led by the Elasticsearch team.
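To get a feel for the merged query and filter concepts, here is a minimal sketch using the official Python client. The index name "tweets" and the fields "text" and "created_at" are made-up placeholders, not anything from the release itself:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# In 2.0 a filter is no longer a separate top-level construct; it is just
# the non-scoring "filter" clause of a bool query.
body = {
    "query": {
        "bool": {
            "must": [{"match": {"text": "elasticsearch"}}],
            "filter": [{"range": {"created_at": {"gte": "2015-01-01"}}}]
        }
    }
}

print(es.search(index="tweets", body=body)["hits"]["total"])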

Watch the release webinar now!

  - Craig

Friday, May 22, 2015

Fast Company: Google's InnerTube Project to Retool YouTube

Google's InnerTube project to retool YouTube, from Fast Company, is a great article about many of the recent changes Google has been making to fix some internal development issues and make YouTube a more competitive destination for watching content.

It's funny hearing them talk about the mess of Frankenstein projects that had grown up over the years. It seems like every project in continuous development turns into a Frankenstein. It's a constant battle to keep a project clean, and if you don't, you end up reaching a point where you almost have to start over from scratch to get the project where you need it to be.

Enjoy!

  - Craig

Saturday, January 10, 2015

Paco Nathan - A New Year in Data Science: ML Unpaused

A New Year in Data Science: ML Unpaused
A good read on Data Science and some of the advances being made in the field recently. He presents some history of machine learning and the developments shaping the real job of Data Science.

There are some good insights around the need to focus on data and features instead of on which algorithm to use. Too often, people look at how to build something rather than at what problem they are trying to solve.

Paco certainly knows what he's talking about. I met him last year at Hadoop Summit in San Jose. He has been working with the Spark folks and Databricks. Take a minute and look him up.

  - Craig

Wednesday, December 31, 2014

Elasticsearch 5-node PicoCluster test - 1,000,000 tweets and 3495.1316MB in just under 9 hours.

We've been doing some benchmarks with Elasticsearch on the 5-node PicoCluster. We were able to push 1,000,000 tweets and 3495.1316MB in just under 9 hours. The tweets are small, around 1KB each, but they have a lot of fields and are fairly complicated documents. That's pretty good considering that the SD cards are not very good at small reads and writes.


Pushed 1000 docs in 34.457 secs at 29.021679 docs/sec 3.4431486 MB/sec, 980000 total docs in 520 minutes 33 seconds at 31.37682 docs/second 98.0% complete
Pushed 1000 docs in 36.632 secs at 27.298536 docs/sec 3.5004349 MB/sec, 981000 total docs in 521 minutes 9 seconds at 31.372042 docs/second 98.1% complete
Pushed 1000 docs in 34.607 secs at 28.89589 docs/sec 3.5908194 MB/sec, 982000 total docs in 521 minutes 44 seconds at 31.369303 docs/second 98.2% complete
Pushed 1000 docs in 30.67 secs at 32.605152 docs/sec 3.3349895 MB/sec, 983000 total docs in 522 minutes 15 seconds at 31.370514 docs/second 98.299995% complete
Pushed 1000 docs in 31.243 secs at 32.007168 docs/sec 3.431964 MB/sec, 984000 total docs in 522 minutes 46 seconds at 31.37115 docs/second 98.4% complete
Pushed 1000 docs in 28.858 secs at 34.652435 docs/sec 3.4087648 MB/sec, 985000 total docs in 523 minutes 15 seconds at 31.374163 docs/second 98.5% complete
Pushed 1000 docs in 29.598 secs at 33.786068 docs/sec 3.4104357 MB/sec, 986000 total docs in 523 minutes 44 seconds at 31.376436 docs/second 98.6% complete
Pushed 1000 docs in 32.356 secs at 30.90617 docs/sec 3.4084692 MB/sec, 987000 total docs in 524 minutes 17 seconds at 31.375952 docs/second 98.7% complete
Pushed 1000 docs in 37.807 secs at 26.450129 docs/sec 3.4255342 MB/sec, 988000 total docs in 524 minutes 55 seconds at 31.370039 docs/second 98.799995% complete
Pushed 1000 docs in 33.404 secs at 29.936535 docs/sec 3.4184904 MB/sec, 989000 total docs in 525 minutes 28 seconds at 31.36852 docs/second 98.9% complete
Pushed 1000 docs in 34.465 secs at 29.014942 docs/sec 3.4793549 MB/sec, 990000 total docs in 526 minutes 2 seconds at 31.36595 docs/second 99.0% complete
Pushed 1000 docs in 30.792 secs at 32.475967 docs/sec 3.4305592 MB/sec, 991000 total docs in 526 minutes 33 seconds at 31.367033 docs/second 99.1% complete
Pushed 1000 docs in 29.749 secs at 33.614574 docs/sec 3.4574842 MB/sec, 992000 total docs in 527 minutes 3 seconds at 31.369146 docs/second 99.2% complete
Pushed 1000 docs in 32.825 secs at 30.464584 docs/sec 3.370614 MB/sec, 993000 total docs in 527 minutes 36 seconds at 31.368208 docs/second 99.299995% complete
Pushed 1000 docs in 37.048 secs at 26.99201 docs/sec 3.451209 MB/sec, 994000 total docs in 528 minutes 13 seconds at 31.36309 docs/second 99.4% complete
Pushed 1000 docs in 35.307 secs at 28.322996 docs/sec 3.3885374 MB/sec, 995000 total docs in 528 minutes 48 seconds at 31.35971 docs/second 99.5% complete
Pushed 1000 docs in 37.64 secs at 26.567482 docs/sec 3.4242926 MB/sec, 996000 total docs in 529 minutes 26 seconds at 31.35403 docs/second 99.6% complete
Pushed 1000 docs in 28.108 secs at 35.57706 docs/sec 3.4203835 MB/sec, 997000 total docs in 529 minutes 54 seconds at 31.357765 docs/second 99.7% complete
Pushed 1000 docs in 28.886 secs at 34.618847 docs/sec 3.4412107 MB/sec, 998000 total docs in 530 minutes 23 seconds at 31.360725 docs/second 99.8% complete
Pushed 1000 docs in 40.074 secs at 24.953835 docs/sec 3.4108858 MB/sec, 999000 total docs in 531 minutes 3 seconds at 31.352667 docs/second 99.9% complete
Pushed 1000 docs in 0.0 secs at Infinity docs/sec 3.4554148 MB/sec, 1000000 total docs in 531 minutes 3 seconds at 31.38405 docs/second 100.0% complete

Pushed 1000000 total docs and 3495.1316MB in 531 minutes 3 seconds at 31.38405 docs/second.
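A loader like the one producing the log above is essentially a loop that batches documents and pushes them with the bulk API. Here is a rough sketch of what that might look like with the Python client; the node address, index name, and read_tweets source are hypothetical placeholders, not our actual code:

import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://pi0:9200"])  # hypothetical node address

def read_tweets():
    # Placeholder: yield tweet dicts from wherever they are stored.
    yield from []

batch = []
total = 0
start = time.time()
for tweet in read_tweets():
    batch.append({"_index": "tweets", "_type": "tweet", "_source": tweet})
    if len(batch) == 1000:
        t0 = time.time()
        helpers.bulk(es, batch)  # one bulk request per 1000 docs
        total += len(batch)
        elapsed = time.time() - start
        print("Pushed %d docs in %.3f secs, %d total docs at %f docs/second"
              % (len(batch), time.time() - t0, total, total / elapsed))
        batch = []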

Thursday, December 18, 2014

10GB Terasort Benchmark on 5-node Raspberry PI Cluster - 2H 52m 56s

We just set a new record for the 10GB terasort on a 5-node PicoCluster! We cut over an hour off the previous benchmark time, bringing the total to under 3 hours! Pretty amazing!
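For anyone who wants to try this at home, the run boils down to the stock teragen/terasort/teravalidate jobs from the Hadoop examples jar. A rough sketch of kicking them off follows; the jar path is an assumption for a Hadoop 1.2.1 install, and 100,000,000 rows of 100 bytes gives the 10GB input:

import subprocess

EXAMPLES_JAR = "/usr/local/hadoop/hadoop-examples-1.2.1.jar"  # assumed install path

# Generate 10GB of input (100,000,000 rows x 100 bytes), sort it, then validate it.
subprocess.check_call(["hadoop", "jar", EXAMPLES_JAR, "teragen",
                       "100000000", "/terasort/input"])
subprocess.check_call(["hadoop", "jar", EXAMPLES_JAR, "terasort",
                       "/terasort/input", "/terasort/output"])
subprocess.check_call(["hadoop", "jar", EXAMPLES_JAR, "teravalidate",
                       "/terasort/output", "/terasort/validate"])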

Hadoop job_201412181311_0002 on master

User: hadoop
Job Name: TeraSort
Job File: hdfs://pi0:54310/tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201412181311_0002/job.xml
Submit Host: pi0
Submit Host Address: 10.1.10.120
Job-ACLs: All users are allowed
Job Setup: Successful
Status: Succeeded
Started at: Thu Dec 18 14:54:20 MST 2014
Finished at: Thu Dec 18 17:47:16 MST 2014
Finished in: 2hrs, 52mins, 56sec
Job Cleanup: Successful

Kind      % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map       100.00%      80          0         0         80         0        0 / 0
reduce    100.00%      80          0         0         80         0        0 / 0


Counter Map Reduce Total
Map-Reduce Framework Spilled Records 0 0 300,000,000
Map output materialized bytes 0 0 10,200,038,400
Reduce input records 0 0 100,000,000
Virtual memory (bytes) snapshot 0 0 46,356,074,496
Map input records 0 0 100,000,000
SPLIT_RAW_BYTES 8,800 0 8,800
Map output bytes 0 0 10,000,000,000
Reduce shuffle bytes 0 0 10,200,038,400
Physical memory (bytes) snapshot 0 0 32,931,528,704
Map input bytes 0 0 10,000,000,000
Reduce input groups 0 0 100,000,000
Combine output records 0 0 0
Reduce output records 0 0 100,000,000
Map output records 0 0 100,000,000
Combine input records 0 0 0
CPU time spent (ms) 0 0 27,827,080
Total committed heap usage (bytes) 0 0 32,344,113,152
File Input Format Counters Bytes Read 0 0 10,000,144,320
FileSystemCounters HDFS_BYTES_READ 10,000,153,120 0 10,000,153,120
FILE_BYTES_WRITTEN 20,404,679,750 10,204,290,230 30,608,969,980
FILE_BYTES_READ 10,265,248,834 10,200,000,960 20,465,249,794
HDFS_BYTES_WRITTEN 0 10,000,000,000 10,000,000,000
File Output Format Counters Bytes Written 0 0 10,000,000,000
Job Counters Launched map tasks 0 0 80
Launched reduce tasks 0 0 80
SLOTS_MILLIS_REDUCES 0 0 28,079,434
Total time spent by all reduces waiting after reserving slots (ms) 0 0 0
SLOTS_MILLIS_MAPS 0 0 22,051,330
Total time spent by all maps waiting after reserving slots (ms) 0 0 0
Rack-local map tasks 0 0 30
Data-local map tasks 0 0 50


This is Apache Hadoop release 1.2.1

Thursday, December 4, 2014

The Daily WTF: The Robot Guys

Not really Big Data, but pretty funny :)

http://thedailywtf.com/articles/the-robot-guys

  - Craig

Thursday, November 13, 2014

10GB Terasort Benchmark on 5-node Raspberry PI Cluster - 4H 12m 5s

After more than a week of testing, tweaking, and retesting, we were able to successfully run a 10GB Terasort on a 5-node Raspberry PI cluster! Each node has 512MB of RAM and a 16GB SD card.

Hadoop job_201411131351_0001 on master

User: hadoop
Job Name: TeraSort
Job File: hdfs://pi0:54310/tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201411131351_0001/job.xml
Submit Host: pi0
Submit Host Address: 10.1.10.120
Job-ACLs: All users are allowed
Job Setup: Successful
Status: Succeeded
Started at: Thu Nov 13 14:23:35 MST 2014
Finished at: Thu Nov 13 18:35:40 MST 2014
Finished in: 4hrs, 12mins, 5sec
Job Cleanup: Successful

Kind      % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map       100.00%      152         0         0         152        0        0 / 0
reduce    100.00%      152         0         0         152        0        0 / 0


Counter Map Reduce Total
File Input Format Counters Bytes Read 0 0 10,000,298,372
Job Counters SLOTS_MILLIS_MAPS 0 0 24,993,499
Launched reduce tasks 0 0 152
Total time spent by all reduces waiting after reserving slots (ms) 0 0 0
Rack-local map tasks 0 0 144
Total time spent by all maps waiting after reserving slots (ms) 0 0 0
Launched map tasks 0 0 152
Data-local map tasks 0 0 8
SLOTS_MILLIS_REDUCES 0 0 34,824,665
File Output Format Counters Bytes Written 0 0 10,000,000,000
FileSystemCounters FILE_BYTES_READ 10,341,496,856 10,200,000,912 20,541,497,768
HDFS_BYTES_READ 10,000,315,092 0 10,000,315,092
FILE_BYTES_WRITTEN 20,409,243,506 10,208,123,719 30,617,367,225
HDFS_BYTES_WRITTEN 0 10,000,000,000 10,000,000,000
Map-Reduce Framework Map output materialized bytes 0 0 10,200,138,624
Map input records 0 0 100,000,000
Reduce shuffle bytes 0 0 10,200,138,624
Spilled Records 0 0 300,000,000
Map output bytes 0 0 10,000,000,000
Total committed heap usage (bytes) 0 0 57,912,754,176
CPU time spent (ms) 0 0 40,328,090
Map input bytes 0 0 10,000,000,000
SPLIT_RAW_BYTES 16,720 0 16,720
Combine input records 0 0 0
Reduce input records 0 0 100,000,000
Reduce input groups 0 0 100,000,000
Combine output records 0 0 0
Physical memory (bytes) snapshot 0 0 52,945,952,768
Reduce output records 0 0 100,000,000
Virtual memory (bytes) snapshot 0 0 123,024,928,768
Map output records 0 0 100,000,000


This is Apache Hadoop release 1.2.1