Monday, August 22, 2011

NOSQL job trends

NOSQL job trends

Saw this on DZone and wanted to pass it along. There are certainly a number of companies looking for skilled people to work with some of these technologies. It would be interesting to see how these positions are being filled. I personally know very few people who have any experience with these data stores, let alone significant experience.

  - Craig

Saturday, August 20, 2011

Netflix Culture

Netflix Culture

This has probably made the rounds already, but I found it very interesting. I'm not sure how completely it is actually implemented at Netflix, but I sure like the theory :)

  - Craig

Tuesday, August 16, 2011

LevelDB: A Fast Persistent Key-Value Store

LevelDB: A Fast Persistent Key-Value Store

News about LevelDB has been making the rounds lately. It's another Google project and it definitely looks interesting. LevelDB is a key-value store in library form. It does not appear to be a complete product in and of itself, but can be utilized by many other products.

One very interesting note is that Basho is working on using LevelDB as one of the back end storage options for Riak! It looks like LevelDB would work well for those who are currently using the InnoDB back end. The Bitcask back end keeps all of its keys in memory and therefore limits how many values can be stored per node. InnoDB doesn't have this limitation and neither does LevelDB.

  - Craig

Monday, August 15, 2011

Backblaze Blog - Petabytes on a Budget

Petabytes on a Budget v2.0: Revealing More Secrets

This really isn't about big NoSQL as such; I was just really fascinated by the fact that this company is putting its "secret sauce" out in public. I mean, how many companies would do that? Obviously you would also need their software to recreate their entire service, but it's very forward thinking nonetheless. Good job guys!

  - Craig


Wednesday, July 27, 2011

Wednesday, July 20, 2011

NoSQL Presentation Tonight and Tomorrow at UJUG

I'm presenting at Provo UJUG tonight and Murray UJUG tomorrow night. I'll be presenting on the overall NoSQL Landscape.

This is a really high-level view designed to give users some background to the space, a comparison to the SQL space, and a breakdown of the NoSQL space, along with some detail of each category.

I'll post the slides and video of the presentations a bit later.

  - Craig

Hello youwho.com!

Sorry for my absence over the last 7 weeks or so. I left a great job at Overstock.com and am now gainfully employed at a new startup called youwho.com. This new position has been taking a lot of my time, but I think I'm finally getting settled in so I can get back to blogging again :)

  - Craig

Monday, May 30, 2011

NoSQL Cassandra write load performance, ranged queries, and Acunu

I just read this 2-part series from the Acunu blog.
Cassandra under heavy write load, Part I
Cassandra under heavy write load, Part II

Acunu has some really interesting new tech to underlie Cassandra with some great results for inserts and ranged queries under load.

My personal takeaway: I didn't realize that Cassandra is so heavily write-optimized. I think this is a very important consideration. You always need to match your DB to the kind of load you are expecting. Whether you're write-heavy, read-heavy, or in a balanced scenario, you need to make sure your DB is tuned and scaled for it.

In this particular testing scenario, the focus was on ranged queries as it appears that normal reads are not affected by substantial write bursts. If you're relying heavily on ranged queries however, you really need to understand how your app is going to be affected due to Cassandra trying to perform SSTable compaction under load. It looks to be a significant impact on response time.

Acunu is definitely worth a look if Cassandra is your weapon of choice. Along with that, check out the latest updates for Cassandra. It looks like the team is also actively focused on making Cassandra a better and faster product with each iteration.

  - Craig

Saturday, May 28, 2011

Acunu storage platform for Cassandra

Data Startup Acunu Opens The Castle Gates brings to light a startup that's building a new storage platform, focused primarily on Cassandra at this time. They've recently open sourced their platform and they're looking for people to start working with it. http://www.acunu.com/2011/05/open-castle/#more-1930.

If you're working with Cassandra, this looks like it's really worth checking out. http://www.acunu.com/product/benefits/.

It looks like you need to sign up for the beta program to try it. http://www.acunu.com/product/beta/

http://www.acunu.com/docs/html/installation.html has the instructions to install. There are a variety of options including a USB stick image, or VM images for a variety of popular platforms including XenServer, KVM, VirtualBox, and VMware Fusion.

  - Craig

Saturday, May 21, 2011

AWS and Elastic Beanstalk

I decided to play around with AWS and in particular, their Elastic Beanstalk product. It is designed to make it very simple and easy to deploy a war file to AWS and have the system take care of the deployment. You still have a console to be able to control the app, or just let AWS take care of it for you.

This is a link to a small deployment I just completed http://awshelloworldeb.elasticbeanstalk.com/. It was generated from NetBeans 7.0 and is based on JSF2, CDI, and Primefaces 2.2.1. It is running on Tomcat 7 on AWS. Everything worked first deploy, very cool!

To quote from http://aws.amazon.com/elasticbeanstalk/ (and it really does work about that easily):

To deploy Java applications using Elastic Beanstalk, you simply:
  • Create your application as you normally would using any editor or IDE (e.g. Eclipse).
  • Package your deployable code into a standard Java Web Application Archive (WAR file).
  • Upload your WAR file to Elastic Beanstalk using the AWS Management Console, the AWS Toolkit for Eclipse, the web service APIs, or the Command Line Tools.
  • Deploy your application. Behind the scenes, Elastic Beanstalk handles the provisioning of a load balancer and the deployment of your WAR file to one or more EC2 instances running the Apache Tomcat application server.
  • Within a few minutes you will be able to access your application at a customized URL (e.g. http://myapp.elasticbeanstalk.com/).

You can sign up for an account for free; http://aws.amazon.com/free/ has the details. Basically you get an EC2 Linux micro instance, plus EBS, S3, and SimpleDB, to play with at no cost. This makes it super easy to get some experience on probably the best cloud out there right now. I was really surprised with how simple it was to get started and deploy a war file.

There is a very nice SDK and a plugin for Eclipse to help you out. I prefer NetBeans for my own stuff, however. It looks like the Eclipse integration allows you to deploy directly from the IDE. I used the web console since NetBeans does not have this integration. You basically give your app a name, which becomes the third-level domain, add a description, and then locate the war file on your file system. AWS takes care of the rest.

It took maybe 5 minutes from the time I uploaded the war file until the app was completely active. You can also provide a health check url for AWS so it can check to see if your app is active. Presumably it can notify you if the health check fails. Very nice.

Amazon has an excellent and easy to use product here. I'd advise you to check it out, especially since you can use it for free!

  - Craig

Friday, May 20, 2011

NoSQL: Respect the problem

It's Time to Drop the "F" Bomb - or "Lies, Damn Lies, and NoSQL."

This article comes from the Basho Blog. This is my favorite quote from the article, talking about some of the challenges that are faced in this space.
Immutable laws are not marketing. And therefore, marketing can’t release you from the bonds of immutable laws. You can’t solve the intractable problems of distributed systems so eloquently summarized with three letters – C-A-P – by Google’s cloud architect (and Basho Board member) Dr. Eric Brewer (a man both lauded and whose full impact on our world has not yet been reckoned), with specious claims about downloads and full consistency.
To wit:
  • Memory is not storage.
  • Trading the RDBMS world for uptime is hard. There are no half-steps. No transitional phases.
  • The geometry of a spinning disk matters for your app. You can’t escape this.
  • Your hardware RAID controller is not perfect. It screws things up and needs to be debugged.
  • Replication between two data centers is hard, let alone replication between three or 15 data centers.
  • Easily adding nodes to a cluster under load impacts performance for a period determined by the amount of data stored on the existing nodes and the load on the system…and the kind of servers you are using…and a dozen other things. It looks easy in the beginning.

  - Craig

Tuesday, May 10, 2011

Data Deduplication and NoSql

Paper: A Study of Practical Deduplication

This is another highscalability.com article. This really has huge implications for big data systems. The more data these systems hold, the more important this becomes. My favorite for this is ZFS. I run Solaris 10 on my own systems, but I haven't tried using the dedup feature yet. I've read several articles on ZFS dedup, including one on using dedup with the compression feature, which is a bit tricky.

Viddler Architecture - highscalability.com

Viddler Architecture - 7 Million Embeds a Day and 1500 Req/Sec Peak

This is a really interesting article on the Viddler.com architecture. They are using HDFS and Amazon S3 for storage. Amazon EC2 servers do all of the encoding work. They have experimented with Cassandra for storing dashboard positions.

Overall, their architecture is all over the place, but they are working on improving that situation.

The real takeaway is their focus on customer service:
What matters in the end is what the users sees, not the architecture. Iterate and focus on customer experience above all else. Customer service is even valued above sane or maintainable architecture. Build only what is needed. They could not have kick started this company maintaining 100% employee ownership without running ultra scrappy. They are now taking what was learned in scrappy stage and building a very resilient multi-site architecture in a top-tier facility. Their system is not the most efficient, or the prettiest, the path they took is the customer needs something so they built it. They go after what the customer needs. The way they went about selecting hardware, and building software, with an emphasis on getting the job done, is what built the company.

Saturday, April 30, 2011

Setting up a Riak NoSQL cluster in 7 minutes


This is my first video on NoSQL technology. It involves setting up a 3-node Riak cluster, complete with software installation, config file edits, starting each node and then connecting the cluster. I was trying to do it in under 5 minutes, but I did it in under 7 minutes instead. That's still very good. It took me several evenings to set up a 3-node Hadoop cluster the first time I did it :)

The video is not the best quality but I think it still works. I'll try this again another time, but with a different recording method. 

NoSQL cluster setup


I thought I'd share a photo of my cluster setup. I have 7 Sun V20z servers. They are older, but total investment is < $1500 so that is pretty good. Six are dual single-core servers and one is dual dual-core. I can upgrade 2 more to dual dual-core, but I need a firmware upgrade that is only available on service contract now. Some of the servers have 4GB RAM with 2 x 36GB 10K rpm drives and the others have 16GB with 2 x 73GB 15K rpm drives. That doesn't give me much storage space, but a lot of RAM. That's good enough for the kind of testing I'm doing.

They are all running Solaris 10 U9 which is the first Oracle Solaris release. A couple of things I really like about Solaris are ZFS and Zones. Zones is an OS level virtualization technology. Zones are very lightweight and easy to set up and administer, especially in conjunction with ZFS. I currently have zones set up for Riak and Hadoop. I plan on setting up a number of NoSQL data stores and seeing how they operate, administer, and run on my cluster.

NoSQL Landscape Presentation UJUG July 21, 2011

Come and see me give my NoSQL Landscape presentation on July 21, 2011 for UJUG. I'll post the slides here on the blog for distribution.

CAP Theorem Diagram for distribution


There are a number of CAP Theorem diagrams that people have put together. This is the one that I've done for a presentation I'm going to be doing soon for UJUG. This one I'm making available to anyone who wants it under Creative Commons. All I am asking in return is credit back to my blog.

I've read a number of blogs myself to garner this information. This is my best effort to explain CAP Theorem as I understand it.

I've done my best to note and give credit for other CAP Theorem articles that have been written by others and will continue to do so.

NoSQL Required Reading

I came across this today. It's an older post, but has some excellent resource links. I've read some of these, but need to read the rest.

NoSQL Required Reading

Thursday, April 28, 2011

Orderly for NoSQL

I picked this up from highscalability.com. It looks like it's not a separate NoSQL data store but is designed to work with Hadoop and others. Looks like it provides key support and is focused on compact representation while preserving sort order. I'm not sure how mature this product is, but looks like it just recently joined the open source community.

Here are a couple of quotes I picked up.

"The goal of this project is to produce extremely space efficient
byte serializations for common data types while ensuring that the
resulting byte array sorts correctly."

"Orderly is more of a focused solution on producing byte arrays
for use in projects like HBase, without requiring those projects
to integrate a serialization system."
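To make the "sorts correctly" part of that goal concrete, here is a minimal sketch of one well-known order-preserving encoding for signed ints. This is NOT Orderly's actual format; the class and method names (SortableIntCodec, encode, decode, compareBytes) are my own for illustration. The idea is that a store comparing row keys as unsigned bytes will only sort numbers correctly if the bytes are big-endian and the sign bit is flipped.

```java
import java.util.Arrays;

/** Illustrative only: a sort-order-preserving int encoding, not Orderly's wire format. */
final class SortableIntCodec {

    /** Encode a signed int as 4 big-endian bytes with the sign bit flipped. */
    static byte[] encode(int value) {
        // Flipping the sign bit maps Integer.MIN_VALUE..MAX_VALUE onto 0x00000000..0xFFFFFFFF.
        int flipped = value ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    /** Reverse of encode(): rebuild the int and flip the sign bit back. */
    static int decode(byte[] bytes) {
        int flipped = ((bytes[0] & 0xff) << 24)
                    | ((bytes[1] & 0xff) << 16)
                    | ((bytes[2] & 0xff) << 8)
                    |  (bytes[3] & 0xff);
        return flipped ^ 0x80000000;
    }

    /** Unsigned lexicographic byte comparison, the ordering a byte-sorted key store applies. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        int[] values = {42, -7, 0, Integer.MIN_VALUE, Integer.MAX_VALUE};
        byte[][] keys = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            keys[i] = encode(values[i]);
        }
        // Sort by raw bytes only, then decode to show numeric order was preserved.
        Arrays.sort(keys, SortableIntCodec::compareBytes);
        for (byte[] key : keys) {
            System.out.println(decode(key));
        }
    }
}
```

Without the sign-bit flip, negative numbers would sort after positive ones because their first byte starts at 0x80 or higher.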

Best article on CAP Theorem

This is the best article that I've found on CAP. It's thorough and has been well reviewed by experts.

you-cant-sacrifice-partition-tolerance

The takeaway for me is this:

1) On a non-Partitioned or non-distributed system, you can guarantee Consistency and Availability.

2) Once you introduce Partitions, then you have to choose between Consistency and Availability. You can't have both. Most companies will choose Availability over Consistency.

note: Some systems claim that they can guarantee C, A and P. The caveat is that the guarantee does not apply to a single piece of data. It's still CA, CP, or AP, but the system can support multiple tunings at the same time.
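One common way such "multiple tunings" show up in practice is per-request quorum settings. The sketch below is my own illustration and is not taken from the article: it assumes a store with N replicas where each read waits for R replicas and each write waits for W. The names (QuorumTuning, quorumsOverlap) are hypothetical. The rule of thumb is that when R + W > N, every read quorum overlaps every write quorum, so that request leans toward consistency; otherwise it leans toward availability.

```java
/** Illustrative quorum arithmetic; names and parameters are assumptions, not any product's API. */
final class QuorumTuning {

    /**
     * Returns true when any set of r replicas must share at least one replica
     * with any set of w replicas out of n, i.e. when r + w > n.
     */
    static boolean quorumsOverlap(int n, int r, int w) {
        if (n < 1 || r < 1 || w < 1 || r > n || w > n) {
            throw new IllegalArgumentException("need 1 <= r <= n and 1 <= w <= n");
        }
        return r + w > n;
    }

    public static void main(String[] args) {
        // Two different tunings against the same hypothetical 3-replica cluster.
        System.out.println("N=3 R=2 W=2 overlap: " + quorumsOverlap(3, 2, 2));
        System.out.println("N=3 R=1 W=1 overlap: " + quorumsOverlap(3, 1, 1));
    }
}
```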

Saturday, April 23, 2011

Integrating NoSQL into your SQL application

There are basically 4 ways you can integrate the use of NoSQL in your SQL application.


1) Just stick to SQL.
The issue here is that there is likely something you need to do with your data that your SQL database is not efficient with. That may be why you're looking at some NoSQL data store in the first place. Just sticking with your SQL database may be valid depending on your needs. It is well known, safe, and has excellent tools support. There are definite drawbacks to splitting your data and bringing in a new and unknown data store. However, you're likely giving up something by sticking to SQL only.


2) Convert everything to NoSQL.
This might be a real stretch, especially for a transactional application. Your data will need to be denormalized as a start. You really need to understand the limits of the NoSQL data store as well as the advantages that you'll gain. Going to NoSQL can be a bit of a culture shock. You're likely going to be better off looking to a column store for your choice as some things will at least look familiar to you. One of the big advantages is that you're not splitting your data between multiple stores.

3) Split your data, use SQL for transactional things and NoSQL for its specialty.
This is probably the most common use case as you're getting the best of both worlds here. Let the SQL database do what it does best, usually transactions and speed of development with the great tool support. Pick a specific NoSQL data store and let it do what it does best. Perhaps a graph data store to handle relationship data or a document data store to handle BLOB/CLOB storage.
The really big drawback here is that you're splitting your data. Whenever you do that, consistency becomes a real issue though this may be tempered by the character of your data. Companies handle this problem in a variety of ways, but that's a whole discussion all by itself.

4) Use map/reduce on NoSQL to process data, then feed results back to SQL or NoSQL.
This is also a very common use case. This could be either a variation of or an enhancement to item 3, depending on how it's used. Again, we're using each data store type for its strengths. If you feed the results back to your SQL database, then you're mostly avoiding the consistency problem of split data. Otherwise, you're going to have to deal with consistency problems as in item 3.
The big advantage here is that you have the most flexibility with this setup. You can process or otherwise transform your input data into a more usable form for your application then import it into the data store of your choice. Map/reduce is your best friend here. You need to remember that you're going to add some latency to your data flow because of this process flow.
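To show the shape of the map/reduce step in item 4, here is a tiny in-memory sketch. It is only an illustration of the pattern, not a Hadoop job: the record format ("user,page") and the names MapReduceSketch and countByKey are made up for this example. The map step pulls a key out of each raw record, and the reduce step sums the records per key. The resulting small summary is the kind of thing you would then load back into your SQL database.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Illustrative in-memory map/reduce; record layout and names are assumptions. */
final class MapReduceSketch {

    /** Map each "user,page" record to its user key, then reduce by counting per key. */
    static Map<String, Long> countByKey(List<String> records) {
        return records.stream()
                // map: extract the key (text before the first comma)
                .map(record -> record.split(",")[0])
                // reduce: group identical keys and count them, sorted by key
                .collect(Collectors.groupingBy(key -> key, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> pageViews = Arrays.asList(
                "alice,/home", "bob,/cart", "alice,/cart", "carol,/home", "alice,/checkout");
        // Each entry is one summary row that could be inserted into a SQL table.
        for (Map.Entry<String, Long> row : countByKey(pageViews).entrySet()) {
            System.out.println(row.getKey() + " -> " + row.getValue());
        }
    }
}
```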

Monday, April 18, 2011

Saturday, April 16, 2011

Introducing Doozer
This is an interesting data store I came across. It looks like it fills a similar role to Apache Zookeeper. HBase uses Zookeeper in order to manage multiple nodes. While Zookeeper is written in Java, Doozer is written in Go with clients in Ruby and Erlang. It looks like the comm protocol is a combination of 9p (Plan9) and Protocol Buffers. Interesting.

Doozer's focus is on providing a highly-consistent and highly-available data store. To quote "Doozer is a highly-available, completely consistent store for small amounts of extremely important data."

Sounds like it's worth a look.

Comparison of Java and Erlang VMs

Erlang vs Java memory architecture

This is definitely an interesting article. I'm a Java developer by trade but am learning Erlang as well. It's a real mind-bend going from an imperative to a functional language. You really have to understand how each style of language wants to work in order to get the most out of it. It's much more than just syntax.

There are several NoSQL data stores that are written in both of these languages as well as map/reduce in both languages. I'm pretty familiar with how the Java VM works and am interested in learning more about the Erlang VM and how this affects each of these data stores.

Friday, April 1, 2011

Generating a Type 1 UUID with Ethernet Address

Now that we know how to get our hardware or MAC address, we can use it to generate a type 1 UUID. The reason to add the ethernet address to the UUID is to guarantee uniqueness when you're generating UUIDs on multiple machines.

The drawback is that the address becomes part of the UUID which could be a security concern. The other possibility is to generate a fake address for each machine which could also guarantee uniqueness while getting around the possible security issue.
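Here is one way the "fake address" idea could look, as a sketch only. RFC 4122 allows a random 48-bit node value in place of a real MAC, as long as the multicast bit (the lowest bit of the first byte) is set so it can never collide with a real hardware address. The class and method names below are my own. Keep in mind that a random value gives you very-likely uniqueness rather than a guarantee, and you would want to generate it once per machine and store it rather than regenerate it on every start.

```java
import java.security.SecureRandom;

/** Illustrative helper; FakeNodeAddress and randomNodeAddress are hypothetical names. */
final class FakeNodeAddress {

    /** Build a random 6-byte node address with the multicast bit set, per RFC 4122. */
    static byte[] randomNodeAddress() {
        byte[] node = new byte[6];
        new SecureRandom().nextBytes(node);
        // Setting the lowest bit of the first byte marks this as a non-hardware address.
        node[0] |= 0x01;
        return node;
    }

    public static void main(String[] args) {
        byte[] node = randomNodeAddress();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < node.length; i++) {
            if (i > 0) {
                sb.append(':');
            }
            // Mask with 0xff because Java bytes are signed.
            sb.append(String.format("%02X", node[i] & 0xff));
        }
        System.out.println("Fake node address: " + sb);
    }
}
```

The resulting byte[] is the same shape as what getHardwareAddress() returns, so it could be handed to the same EthernetAddress constructor used in the service class.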

Here's the code to do it. Yes, this is using CDI to inject the UUID. You can replicate this using any DI framework or by simply using the class directly.


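import java.io.Serializable;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Enumeration;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.annotation.PostConstruct;
import javax.enterprise.context.ApplicationScoped;
import javax.inject.Named;
// JUG imports below assume the JUG 2.x package names.
import org.safehaus.uuid.EthernetAddress;
import org.safehaus.uuid.UUIDGenerator;

@Named
@ApplicationScoped
public class UuidService implements Serializable {
    private UUIDGenerator uuidGen = null;
    private EthernetAddress ethernet = null;

    /**
     * Set up the UUID generator and look up a hardware address for it to use.
     */
    @PostConstruct
    private void init() {
        uuidGen = UUIDGenerator.getInstance();
        try {
            Enumeration<NetworkInterface> ns = NetworkInterface.getNetworkInterfaces();
            while (ns != null && ns.hasMoreElements()) {
                NetworkInterface n = ns.nextElement();
                if (n.isUp() && !n.isLoopback()) {
                    // Grab the first external network interface that reports a hardware address
                    byte[] byteAddress = n.getHardwareAddress();
                    if (byteAddress != null) {
                        ethernet = new EthernetAddress(byteAddress);
                        break; // stop at the first match instead of overwriting it with later interfaces
                    }
                }
            }
        } catch (SocketException ex) {
            Logger.getLogger(UuidService.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    /**
     * Generate the next type 1 (time-based) UUID.
     * @return the UUID as a String
     */
    public String nextId() {
        // If no hardware address was found, fall back to the no-argument variant.
        return (ethernet != null)
                ? uuidGen.generateTimeBasedUUID(ethernet).toString()
                : uuidGen.generateTimeBasedUUID().toString();
    }
}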


Here is the injection part.


    @Inject UuidService service;
    User u = new User(service.nextId());

Working with java.net.NetworkInterface in JDK6

JDK6 has a neat little class that lets you get information about all of your network interfaces. We'll look at some code that prints out most of the available information, including decoding the hardware or MAC address. It comes to us as a byte[] (6 bytes for a standard Ethernet MAC). We also encounter an IP address encoded as a byte[] that we'll decode.


try {
    Enumeration<NetworkInterface> ns = NetworkInterface.getNetworkInterfaces();
    while (ns.hasMoreElements()) {           
        NetworkInterface n = ns.nextElement();
        System.out.println("** Ethernet Name=" + n.getName());
        System.out.println("** Ethernet DisplayName=" + n.getDisplayName());
        byte[] b = n.getHardwareAddress();
        System.out.println("** Ethernet HardwareAddress=" + formatHardwareAddress(b));
        Enumeration<InetAddress> p = n.getInetAddresses();
        while (p.hasMoreElements()) {
            InetAddress ia = p.nextElement();
            System.out.println("**** Address CanonicalHostName=" + ia.getCanonicalHostName());
            System.out.println("**** Address HostAddress=" + ia.getHostAddress());
            System.out.println("**** Address HostName=" + ia.getHostName());
            System.out.println("**** Address Raw=" + formatRawIpAddress(ia.getAddress()));
        }
        for (InterfaceAddress ia: n.getInterfaceAddresses()) {
            System.out.println("**** Interface Address=" + ia.getAddress());
            System.out.println("**** Interface Broadcast=" + ia.getBroadcast());
            System.out.println("**** Interface NetworkPrefixLength=" + ia.getNetworkPrefixLength());
        }
    }
} catch (SocketException ex) {
    Logger.getLogger(UuidService.class.getName()).log(Level.SEVERE, null, ex);
}


First, we'll get a list of available NetworkInterfaces. We'll then iterate through and print out that information. Normally you'll see your external interface along with localhost.

We need 2 more methods to round out this example. We need to be able to decode both the hardware address and the byte[] encoded ip address.


private String formatHardwareAddress(byte[] hardwareAddress) {
    if (hardwareAddress == null) { return ""; }

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < hardwareAddress.length; i++) {
        if (i > 0) { sb.append(":"); }
        // Mask with 0xff to treat the byte as unsigned; %02X zero-pads so 0x0A prints as "0A", not "A".
        sb.append(String.format("%02X", hardwareAddress[i] & 0xff));
    }

    return sb.toString();
}



private String formatRawIpAddress(byte[] rawAddress) {
    if (rawAddress == null) { return ""; }

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < rawAddress.length; i++) {
        if (i > 0) { sb.append("."); }
        // Java bytes are signed, so mask with 0xff; otherwise an octet like 192 would print as -64.
        sb.append(rawAddress[i] & 0xff);
    }

    return sb.toString();
}


There you have it. Run the code and see what you get on your own machine.

Thursday, March 31, 2011

The MySpace decline and Microsoft

8 Lessons We Can Learn from the MySpace Incident - Balance, Vision, Fearlessness

I like to read http://highscalability.com/ and this is one set of articles I wanted to comment on. This article is a summary of a lot of reader comments. There are links at the bottom of this article that refer to the original articles.

One of the questions posed is "Did the Microsoft software stack kill MySpace?" The consensus seems to be no. A number of people brought up examples of high-traffic sites running a Microsoft stack. Curiously, all of the examples were Microsoft sites.

I'd agree with most people that it was the business culture that killed the popularity of the site. Specifically the lack of innovation that was driven by technical problems along with an outsourced infrastructure. It seems that MySpace really did not have a technical or innovation driven culture and that really hurt.

My comment about the Microsoft stack is that all of the well-known, high-traffic sites that I'm aware of really own their stack and make it do what they want despite the difficulties. Whatever stack you choose can completely own you. You buy into all of its strengths and weaknesses. If your stack is not flexible enough or does not have the features you need, then you are either stuck or you need to go outside of that stack to get what you need. Since MySpace outsourced their stack to Microsoft to run, I think that really limited what MySpace could do. However, in this case, I think Microsoft is probably more of a savior than a hindrance. Unfortunately for big M$, MySpace is a lot of bad news and they are directly attached to it.

The point I really wanted to bring out, as far as the NoSql world is concerned, is that Microsoft really is not a player here. Most NoSql data stores run very specifically on Linux/Unix platforms and do not target Windows. Even Hadoop with its huge ecosystem only runs on Windows with cygwin, and then only for development, not production.

MySpace appears to be running everything on MS SQL Server, which would make sense for an M$ stack. Google, Facebook, Twitter and others run multiple data stores depending on their needs. You could set up any SQL database to manage relationships between users. However, if you want to figure out the intersection between two users' groups of friends, you want to use a graph database because it is designed to model relationships very efficiently.
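The "friends in common" question is, at its core, a set intersection over each user's list of connections. Here is a minimal in-memory sketch of that operation. It is only meant to show the shape of the query, not how any graph database implements it, and the user names and the MutualFriends class are made up for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative adjacency-set lookup; all names here are hypothetical. */
final class MutualFriends {

    /** Return the friends that users a and b have in common, sorted by name. */
    static Set<String> mutual(Map<String, Set<String>> graph, String a, String b) {
        // Copy a's friends so the graph itself is never modified.
        Set<String> result = new TreeSet<>(graph.getOrDefault(a, Collections.emptySet()));
        // Keep only the entries that also appear in b's friends.
        result.retainAll(graph.getOrDefault(b, Collections.emptySet()));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> graph = new HashMap<>();
        graph.put("ann", new HashSet<>(Arrays.asList("bob", "cat", "dan")));
        graph.put("eve", new HashSet<>(Arrays.asList("cat", "dan", "fay")));
        System.out.println("Friends in common: " + mutual(graph, "ann", "eve"));
    }
}
```

In a relational schema the same question typically becomes a self-join on a friendship table, which is where the graph model starts to look attractive as the relationships get deeper.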

My question is: is it a problem for M$ if they are not part of this space? There are only a few large sites that are known to run an M$ stack. Most of the NoSql space is simply not in the M$ corner. Is NoSql a big enough space for M$ to even pay attention to?

My opinion is that it will be a problem. Some of this software is going to be ported to, and strongly supported on, Windows. The mind share is still strongly elsewhere and where those leaders go, the rest will follow. Many companies are just becoming aware of this space and are intrigued by how this software can impact their business.

It will be incredibly interesting to see how all of this plays out so stay tuned ...

Monday, March 28, 2011

UUID key generation

Many times when you're inserting data into a NoSql style data store, you need to provide your own keys. There are several ways this can be done in Java.

  • Java 5 has java.util.UUID: http://download.oracle.com/javase/1.5.0/docs/api/java/util/UUID.html
  • http://jug.safehaus.org/ seems to be popular and is what I’m using. You need to provide the hardware ethernet address on your own for type 1 UUIDs.
  • http://johannburkard.de/software/uuid/ seems like an interesting alternative as well. It uses some OS specific utilities to obtain the hardware ethernet address for use in the UUID.

UUID generation is pretty straightforward. These examples are for JUG.

private UUIDGenerator uuidGen = UUIDGenerator.getInstance();
// Generate a type 1 UUID without a network interface reference.
String myUuid1 = uuidGen.generateTimeBasedUUID().toString();


That’s all that is needed to get your UUID to use as a key in your data store.
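For comparison, here is a small sketch using only java.util.UUID from the JDK, with no extra library. Note that the JDK class does not generate type 1 (time-based) UUIDs: randomUUID() gives a type 4 (random) UUID and nameUUIDFromBytes() gives a type 3 (name-based, MD5) UUID. The wrapper class and method names are my own for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Illustrative key helpers built on java.util.UUID; class and method names are hypothetical. */
final class JdkUuidKeys {

    /** Type 4: random, different on every call. */
    static String randomKey() {
        return UUID.randomUUID().toString();
    }

    /** Type 3: derived from the given name, so the same name always yields the same key. */
    static String nameBasedKey(String name) {
        return UUID.nameUUIDFromBytes(name.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        System.out.println("Random key:     " + randomKey());
        System.out.println("Name-based key: " + nameBasedKey("user:craig"));
    }
}
```

The name-based form can be handy when you want a repeatable key for a natural identifier, while the random form is the closer stand-in for the generated keys shown above.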
Next post we’ll look at using JDK 6 to get the network address to use along with the type 1 UUID.