Monday, May 30, 2011

NoSQL Cassandra write load performance, ranged queries, and Acunu

I just read this 2-part series from the Acunu blog.
Cassandra under heavy write load, Part I
Cassandra under heavy write load, Part II

Acunu has some really interesting new tech to underlie Cassandra with some great results for inserts and ranged queries under load.

My personal takeaway: I didn't realize Cassandra is so heavily write-optimized. I think this is a very important consideration. You always need to match your DB to the kind of load you are expecting. Whether you're write-heavy, read-heavy, or somewhere in between, you need to make sure your DB is tuned and scaled for it.

In this particular testing scenario, the focus was on ranged queries, as it appears that normal reads are not affected by substantial write bursts. If you're relying heavily on ranged queries, however, you really need to understand how your app will be affected when Cassandra performs SSTable compaction under load. It looks to be a significant impact on response time.
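To make the tradeoff concrete, here's a toy sketch (not Cassandra's actual code) of an LSM-style store: writes land in an in-memory memtable and flush to sorted runs, so write bursts are cheap, but a range scan has to merge every run until compaction folds them back together.

```python
# Toy LSM-style store (a sketch, not Cassandra's implementation) showing
# why writes stay cheap while ranged queries degrade as SSTables pile up.
class ToyLSM:
    def __init__(self, memtable_limit=3):
        self.memtable = {}    # in-memory buffer: absorbs write bursts
        self.sstables = []    # each flush adds another sorted run on "disk"
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch the memtable -- no seeks, no merging.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def range_query(self, lo, hi):
        # A range scan must consult the memtable plus *every* SSTable,
        # so its cost grows with the number of un-compacted runs.
        merged = {}
        for run in self.sstables + [sorted(self.memtable.items())]:
            for k, v in run:
                if lo <= k <= hi:
                    merged[k] = v  # newer runs overwrite older values
        return sorted(merged.items())

    def compact(self):
        # Compaction merges all runs into one, restoring cheap range
        # scans -- but doing this while under write load is what hurts.
        merged = {}
        for run in self.sstables:
            merged.update(run)
        self.sstables = [sorted(merged.items())] if merged else []
```

With a memtable limit of 2, six writes leave three SSTables behind; a range query then has to visit all three, and a single compact() folds them back into one run.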

Acunu is definitely worth a look if Cassandra is your weapon of choice. Along with that, check out the latest updates for Cassandra. It looks like the team is also actively focused on making Cassandra a better and faster product with each iteration.

  - Craig

Saturday, May 28, 2011

Acunu storage platform for Cassandra

Data Startup Acunu Opens The Castle Gates brings to light a startup that's building a new storage platform, focused primarily on Cassandra at this time. They've recently open sourced their platform and they're looking for people to start working with it.

If you're working with Cassandra, this looks like it's really worth checking out.

It looks like you need to sign up for the beta program to try it; the beta comes with instructions to install. There are a variety of options, including a USB stick image and VM images for a variety of popular platforms, including XenServer, KVM, VirtualBox, and VMware Fusion.

  - Craig

Saturday, May 21, 2011

AWS and Elastic Beanstalk

I decided to play around with AWS and in particular, their Elastic Beanstalk product. It is designed to make it very simple and easy to deploy a war file to AWS and have the system take care of the deployment. You still have a console to be able to control the app, or just let AWS take care of it for you.

This is a link to a small deployment I just completed. It was generated from NetBeans 7.0 and is based on JSF 2, CDI, and PrimeFaces 2.2.1. It is running on Tomcat 7 on AWS. Everything worked on the first deploy, very cool!

To quote from Amazon's description of the service, and it really does work about that easily:

To deploy Java applications using Elastic Beanstalk, you simply:
  • Create your application as you normally would using any editor or IDE (e.g. Eclipse).
  • Package your deployable code into a standard Java Web Application Archive (WAR file).
  • Upload your WAR file to Elastic Beanstalk using the AWS Management Console, the AWS Toolkit for Eclipse, the web service APIs, or the Command Line Tools.
  • Deploy your application. Behind the scenes, Elastic Beanstalk handles the provisioning of a load balancer and the deployment of your WAR file to one or more EC2 instances running the Apache Tomcat application server.
  • Within a few minutes you will be able to access your application at a customized URL.

You can sign up for an account for free; the AWS free usage tier has the details. Basically you get an EC2 Linux micro instance, plus EBS, S3, and SimpleDB, to play with at no cost. This makes it super easy to experiment and get some experience on probably the best cloud out there right now. I was really surprised by how simple it was to get started and deploy a war file.

There is a very nice SDK to help you out, and there is a plugin for Eclipse as well. I prefer NetBeans for my own stuff, however. It looks like the Eclipse integration allows you to deploy directly from the IDE. I used the web console since NetBeans does not have this integration. You basically give your app a name, which becomes the third-level domain, a description, and then locate the war file on your file system. AWS takes care of the rest.

It took maybe 5 minutes from the time I uploaded the war file until the app was completely active. You can also provide a health check url for AWS so it can check to see if your app is active. Presumably it can notify you if the health check fails. Very nice.
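Since the load balancer only needs an HTTP endpoint that returns 200 when the app is healthy, the handler itself can be trivial. Here's a minimal sketch (in Python rather than Java, with a hypothetical /health path) of the kind of endpoint you'd point the health check at:

```python
# Minimal health-check endpoint sketch (hypothetical /health path).
# A load balancer polls it and treats anything but HTTP 200 as unhealthy.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body, status = b"OK", 200
        else:
            body, status = b"Not Found", 404
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet

def start_health_server(port=0):
    """Serve on a background thread; port=0 picks a free port.
    Returns (server, actual_port)."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]
```

In a real WAR deployment this would just be a servlet mapped to the health-check path, returning 200 as long as the app's dependencies are reachable.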

Amazon has an excellent and easy to use product here. I'd advise you to check it out, especially since you can use it for free!

  - Craig

Friday, May 20, 2011

NoSQL: Respect the problem

It's Time to Drop the "F" Bomb - or "Lies, Damn Lies, and NoSQL."

This article comes from the Basho Blog. This is my favorite quote from the article, talking about some of the challenges that are faced in this space.
Immutable laws are not marketing. And therefore, marketing can’t release you from the bonds of immutable laws. You can’t solve the intractable problems of distributed systems so eloquently summarized with three letters – C-A-P – by Google’s cloud architect (and Basho Board member) Dr. Eric Brewer (a man both lauded and whose full impact on our world has not yet been reckoned), with specious claims about downloads and full consistency.
To wit:
  • Memory is not storage.
  • Trading the RDBMS world for uptime is hard. There are no half-steps. No transitional phases.
  • The geometry of a spinning disk matters for your app. You can’t escape this.
  • Your hardware RAID controller is not perfect. It screws things up and needs to be debugged.
  • Replication between two data centers is hard, let alone replication between three or 15 data centers.
  • Easily adding nodes to a cluster under load impacts performance for a period determined by the amount of data stored on the existing nodes and the load on the system…and the kind of servers you are using…and a dozen other things. It looks easy in the beginning.

  - Craig

Tuesday, May 10, 2011

Data Deduplication and NoSql

Paper: A Study of Practical Deduplication

This is another interesting paper, this time on data deduplication. It has huge implications for big data systems: the more data these systems hold, the more important dedup becomes. My favorite implementation is ZFS dedup. I run Solaris 10 on my own systems, but I haven't tried the dedup feature yet. I've read several articles on ZFS dedup, including one on combining dedup with the compression feature, which is a bit tricky.
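The core idea is easy to sketch: chop the data into fixed-size blocks, hash each block, and store each unique block only once. A toy model (not ZFS's actual implementation, which checksums per-block inside the storage pool and also has to handle variable block sizes):

```python
# Toy model of block-level deduplication: fixed-size blocks, content
# hashing, count how many blocks dedup would actually have to store.
import hashlib

def dedup_stats(data: bytes, block_size: int = 4096):
    """Return (total_blocks, unique_blocks) for the given data.

    Blocks with identical content hash to the same digest, so a
    content-addressed store keeps only one physical copy of each.
    """
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    unique = {hashlib.sha256(b).digest() for b in blocks}
    return len(blocks), len(unique)
```

For example, three identical 4 KB blocks of "A" plus one block of "B" logically occupy four blocks but dedup down to two, a 2x space saving, which is exactly the kind of ratio the paper tries to measure on real-world data.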

Viddler Architecture - 7 Million Embeds a Day and 1500 Req/Sec Peak

This is a really interesting article on the architecture. They are using HDFS and Amazon S3 for storage. Amazon EC2 servers do all of the encoding work. They have experimented with Cassandra for storing dashboard positions.

Overall, their architecture is all over the place, but they are working on improving that situation.

The real takeaway is their focus on customer service:
What matters in the end is what the user sees, not the architecture. Iterate and focus on customer experience above all else. Customer service is even valued above sane or maintainable architecture. Build only what is needed. They could not have kick-started this company while maintaining 100% employee ownership without running ultra scrappy. They are now taking what was learned in the scrappy stage and building a very resilient multi-site architecture in a top-tier facility. Their system is not the most efficient or the prettiest; the path they took is: the customer needs something, so they built it. They go after what the customer needs. The way they went about selecting hardware and building software, with an emphasis on getting the job done, is what built the company.