Saturday, April 30, 2011

Setting up a Riak NoSQL cluster in 7 minutes


This is my first video on NoSQL technology. It involves setting up a 3-node Riak cluster, complete with software installation, config file edits, starting each node and then connecting the cluster. I was trying to do it in under 5 minutes, but I did it in under 7 minutes instead. That's still very good. It took me several evenings to set up a 3-node Hadoop cluster the first time I did it :)

The video is not the best quality but I think it still works. I'll try this again another time, but with a different recording method. 

NoSQL cluster setup


I thought I'd share a photo of my cluster setup. I have 7 Sun V20z servers. They are older, but the total investment is under $1500, so that is pretty good. Six are dual single-core servers and one is dual dual-core. I can upgrade 2 more to dual dual-core, but I need a firmware upgrade that is now only available on a service contract. Some of the servers have 4GB of RAM with two 36GB 10K rpm drives and the others have 16GB with two 73GB 15K rpm drives. That doesn't give me much storage space, but a lot of RAM. That's good enough for the kind of testing I'm doing.

They are all running Solaris 10 U9 which is the first Oracle Solaris release. A couple of things I really like about Solaris are ZFS and Zones. Zones is an OS level virtualization technology. Zones are very lightweight and easy to set up and administer, especially in conjunction with ZFS. I currently have zones set up for Riak and Hadoop. I plan on setting up a number of NoSQL data stores and seeing how they operate, administer, and run on my cluster.

NoSQL Landscape Presentation UJUG July 21, 2011

Come and see me give my NoSQL Landscape presentation on July 21, 2011 for UJUG. I'll post the slides here on the blog for distribution.

CAP Theorem Diagram for distribution


There are a number of CAP Theorem diagrams that people have put together. This is the one that I've done for a presentation I'm going to be doing soon for UJUG. This one I'm making available to anyone who wants it under Creative Commons. All I am asking in return is credit back to my blog.

I've read a number of blogs myself to garner this information. This is my best effort to explain CAP Theorem as I understand it.

I've done my best to note and give credit for other CAP Theorem articles that have been written by others and will continue to do so.

NoSQL Required Reading

I came across this today. It's an older post, but has some excellent resource links. I've read some of these, but need to read the rest.

NoSQL Required Reading

Thursday, April 28, 2011

Orderly for NoSQL

I picked this up from highscalability.com. It looks like it's not a separate NoSQL data store but is designed to work with Hadoop and others. Looks like it provides key support and is focused on compact representation while preserving sort order. I'm not sure how mature this product is, but looks like it just recently joined the open source community.

Here are a couple of quotes I picked up.

"The goal of this project is to produce extremely space efficient
byte serializations for common data types while ensuring that the
resulting byte array sorts correctly."

"Orderly is more of a focused solution on producing byte arrays
for use in projects like HBase, without requiring those projects
to integrate a serialization system."
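
To make "sorts correctly" concrete, here is a minimal sketch of one well-known order-preserving trick: write a signed int big-endian with its sign bit flipped, so that comparing the bytes as unsigned values gives the same order as comparing the numbers. This is only an illustration of the idea. It is not Orderly's actual format or API, and the class and method names are made up for this example.

```java
public class SortableIntEncoding {
    /** Encode a signed int as 4 big-endian bytes with the sign bit flipped. */
    static byte[] encode(int value) {
        int flipped = value ^ 0x80000000; // negative values now sort before positive ones
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    /** Compare two byte arrays left to right, treating each byte as unsigned. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        int[] values = {-5, -1, 0, 1, 42};
        for (int i = 0; i < values.length - 1; i++) {
            boolean ordered = compareUnsigned(encode(values[i]), encode(values[i + 1])) < 0;
            System.out.println(values[i] + " sorts before " + values[i + 1] + " as bytes: " + ordered);
        }
    }
}
```

Without the sign-bit flip, -1 (0xFFFFFFFF) would sort after every positive number in a store that orders keys by raw bytes.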

Best article on CAP Theorem

This is the best article that I've found on CAP. It's thorough and has been well reviewed by experts.

you-cant-sacrifice-partition-tolerance

The takeaway for me is this:

1) On a non-Partitioned or non-distributed system, you can guarantee Consistency and Availability.

2) Once you introduce Partitions, then you have to choose between Consistency and Availability. You can't have both. Most companies will choose Availability over Consistency.

note: Some systems claim that they can guarantee C, A and P. The caveat is that the guarantee does not apply to a single piece of data. It's still CA, CP, or AP, but the system can support multiple tunings at the same time.

Saturday, April 23, 2011

Integrating NoSQL into your SQL application

There are basically 4 ways you can integrate the use of NoSQL in your SQL application.


1) Just stick to SQL.
The issue here is that there is likely something you need to do with your data that your SQL database is not efficient with. That may be why you're looking at some NoSQL data store in the first place. Just sticking with your SQL database may be valid depending on your needs. It is well known, safe, and has excellent tools support. There are definite drawbacks to splitting your data and bringing in a new and unknown data store. However, you're likely giving up something by sticking to SQL only.


2) Convert everything to NoSQL.
This might be a real stretch, especially for a transactional application. Your data will need to be denormalized as a start. You really need to understand the limits of the NoSQL data store as well as the advantages that you'll gain. Going to NoSQL can be a bit of a culture shock. You're likely going to be better off looking to a column store for your choice as some things will at least look familiar to you. One of the big advantages is that you're not splitting your data between multiple stores.

3) Split your data, use SQL for transactional things and NoSQL for its specialty.
This is probably the most common use case as you're getting the best of both worlds here. Let the SQL database do what it does best, usually transactions and speed of development with the great tool support. Pick a specific NoSQL data store and let it do what it does best. Perhaps a graph data store to handle relationship data or a document data store to handle BLOB/CLOB storage.
The really big drawback here is that you're splitting your data. Whenever you do that, consistency becomes a real issue though this may be tempered by the character of your data. Companies handle this problem in a variety of ways, but that's a whole discussion all by itself.

4) Use map/reduce on NoSQL to process data, then feed results back to SQL or NoSQL.
This is also a very common use case. This could be either a variation or enhancement to item 3 depending on how it's used. Again, we're using each data store type for its strengths. If you feed the results back to your SQL database, then you're mostly avoiding the consistency problem of split data. Otherwise, you're going to have to deal with consistency problems as in item 3.
The big advantage here is that you have the most flexibility with this setup. You can process or otherwise transform your input data into a more usable form for your application then import it into the data store of your choice. Map/reduce is your best friend here. You need to remember that you're going to add some latency to your data flow because of this process flow.
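
As a rough illustration of option 4, here is a small in-process sketch of the map/reduce shape using Java streams. It is only a stand-in for a real Hadoop or Riak map/reduce job. The "user,page" log format, the class name, and the idea of loading the totals into a SQL table are all assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class PageViewRollup {
    /** Map step: pull the page key out of a raw log line of the hypothetical form "user,page". */
    static String pageOf(String logLine) {
        return logLine.split(",")[1].trim();
    }

    /** Reduce step: count how many times each page key was emitted. */
    static Map<String, Long> countByPage(List<String> logLines) {
        return logLines.stream()
                .map(PageViewRollup::pageOf)
                .collect(Collectors.groupingBy(p -> p, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("alice,/home", "bob,/home", "alice,/cart");
        // Each entry is a small summary row that could be loaded into a SQL table.
        countByPage(raw).forEach((page, views) ->
                System.out.println("summary row: page=" + page + ", views=" + views));
    }
}
```

The point is the shape of the flow: the bulky raw data stays in the NoSQL store, and only the small, aggregated result moves on to the store your application queries.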

Monday, April 18, 2011

Saturday, April 16, 2011

Introducing Doozer
This is an interesting data store I came across. It looks like it fills a similar role to Apache Zookeeper. HBase uses Zookeeper in order to manage multiple nodes. While Zookeeper is written in Java, Doozer is written in Go with clients in Ruby and Erlang. It looks like the comm protocol is a combination of 9p (Plan9) and Protocol Buffers. Interesting.

Doozer's focus is on providing a highly-consistent and highly-available data store. To quote "Doozer is a highly-available, completely consistent store for small amounts of extremely important data."

Sounds like it's worth a look.

Comparison of Java and Erlang VMs

Erlang vs Java memory architecture

This is definitely an interesting article. I'm a Java developer by trade but am learning Erlang as well. It's a real mind-bend going from an imperative to a functional language. You really have to understand how each style of language wants to work in order to get the most out of it. It's much more than just syntax.

There are several NoSQL data stores that are written in both of these languages as well as map/reduce in both languages. I'm pretty familiar with how the Java VM works and am interested in learning more about the Erlang VM and how this affects each of these data stores.

Friday, April 1, 2011

Generating a Type 1 UUID with Ethernet Address

Now that we know how to get our hardware or MAC address, we can use it to generate a type 1 UUID. The reason to add the Ethernet address to the UUID is to guarantee uniqueness when you're generating UUIDs on multiple machines.

The drawback is that the address becomes part of the UUID which could be a security concern. The other possibility is to generate a fake address for each machine which could also guarantee uniqueness while getting around the possible security issue.
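
As a side note, here is a minimal sketch of both points, using only the JDK. It reads the node field back out of a type 1 UUID to show that the address really is visible, and it builds a random 6-byte "fake" address with the multicast bit set, which is the convention RFC 4122 describes for node IDs that are not real hardware addresses. The class name and the sample UUID are made up for illustration. A random node ID makes a collision between machines very unlikely rather than impossible, so assign the fake addresses deliberately if you need a hard guarantee.

```java
import java.security.SecureRandom;
import java.util.UUID;

public class UuidNodeDemo {
    /** Build a random 6-byte node ID with the multicast bit set. */
    static byte[] randomNodeId() {
        byte[] node = new byte[6];
        new SecureRandom().nextBytes(node);
        node[0] |= 0x01; // multicast bit: real unicast hardware addresses have this bit cleared
        return node;
    }

    /** Read the 48-bit node field back out of a type 1 UUID. */
    static long nodeOf(String type1Uuid) {
        // node() throws UnsupportedOperationException if the UUID is not version 1
        return UUID.fromString(type1Uuid).node();
    }

    public static void main(String[] args) {
        // Hand-written version 1 UUID whose node field is 00:11:22:33:44:55 (an illustrative value).
        long node = nodeOf("00000000-0000-1000-8000-001122334455");
        System.out.println("node field (hex) = " + Long.toHexString(node));
        System.out.println("fake node has multicast bit set: " + ((randomNodeId()[0] & 0x01) == 1));
    }
}
```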

Here's the code to generate the UUID with the real hardware address. Yes, this is using CDI to inject the UUID service. You can replicate this using any DI framework or by simply using the class directly.


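// JDK and Java EE (CDI) imports.
import java.io.Serializable;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Enumeration;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.annotation.PostConstruct;
import javax.enterprise.context.ApplicationScoped;
import javax.inject.Named;
// Assumption: UUIDGenerator and EthernetAddress come from JUG 2.x (Java Uuid Generator).
import org.safehaus.uuid.EthernetAddress;
import org.safehaus.uuid.UUIDGenerator;

@Named
@ApplicationScoped
public class UuidService implements Serializable {
    private UUIDGenerator uuidGen = null;
    private EthernetAddress ethernet = null;

    /**
     * Set up the UUID generator and look up this machine's hardware address.
     */
    @PostConstruct
    private void init() {
        uuidGen = UUIDGenerator.getInstance();
        try {
            Enumeration<NetworkInterface> ns = NetworkInterface.getNetworkInterfaces();
            while (ns != null && ns.hasMoreElements()) {
                NetworkInterface n = ns.nextElement();
                if (n.isUp() && !n.isLoopback()) {
                    // Grab the first external network interface that exposes a hardware address
                    byte[] byteAddress = n.getHardwareAddress();
                    if (byteAddress != null) {   // null when not available or not accessible
                        ethernet = new EthernetAddress(byteAddress);
                        break;                   // stop here so later interfaces don't overwrite it
                    }
                }
            }
        } catch (SocketException ex) {
            Logger.getLogger(UuidService.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    /**
     * Generate the next time-based (type 1) UUID.
     * @return the UUID as a string
     */
    public String nextId() {
        return uuidGen.generateTimeBasedUUID(ethernet).toString();
    }
}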


Here is the injection part.


    @Inject UuidService service;
    User u = new User(service.nextId());

Working with java.net.NetworkInterface in JDK6

JDK6 has a neat little class that lets you get information about all of your network interfaces. We'll look at some code that prints out most of the available information, including decoding the hardware or MAC address. It comes to us as a byte[] (6 bytes for a typical Ethernet interface). We also encounter an IP address encoded as a byte[] that we'll decode.


try {
    Enumeration<NetworkInterface> ns = NetworkInterface.getNetworkInterfaces();
    while (ns.hasMoreElements()) {           
        NetworkInterface n = ns.nextElement();
        System.out.println("** Ethernet Name=" + n.getName());
        System.out.println("** Ethernet DisplayName=" + n.getDisplayName());
        byte[] b = n.getHardwareAddress();
        System.out.println("** Ethernet HardwareAddress=" + formatHardwareAddress(b));
        Enumeration<InetAddress> p = n.getInetAddresses();
        while (p.hasMoreElements()) {
            InetAddress ia = p.nextElement();
            System.out.println("**** Address CanonicalHostName=" + ia.getCanonicalHostName());
            System.out.println("**** Address HostAddress=" + ia.getHostAddress());
            System.out.println("**** Address HostName=" + ia.getHostName());
            System.out.println("**** Address Raw=" + formatRawIpAddress(ia.getAddress()));
        }
        for (InterfaceAddress ia: n.getInterfaceAddresses()) {
            System.out.println("**** Interface Address=" + ia.getAddress());
            System.out.println("**** Interface Broadcast=" + ia.getBroadcast());
            System.out.println("**** Interface NetworkPrefixLength=" + ia.getNetworkPrefixLength());
        }
    }
} catch (SocketException ex) {
    Logger.getLogger(UuidService.class.getName()).log(Level.SEVERE, null, ex);
}


First, we'll get a list of available NetworkInterfaces. We'll then iterate through and print out that information. Normally you'll see your external interface along with localhost.

We need 2 more methods to round out this example. We need to be able to decode both the hardware address and the byte[] encoded ip address.


private String formatHardwareAddress(byte[] hardwareAddress) {
    // getHardwareAddress() returns null when the address is unavailable (e.g. loopback)
    if (hardwareAddress == null || hardwareAddress.length == 0) { return ""; }

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < hardwareAddress.length; i++) {
        if (i > 0) { sb.append(":"); }
        // Mask with 0xff to treat the byte as unsigned; %02X keeps the leading zero (0A, not A)
        sb.append(String.format("%02X", hardwareAddress[i] & 0xff));
    }

    return sb.toString();
}



private String formatRawIpAddress(byte[] rawAddress) {
    if (rawAddress == null || rawAddress.length == 0) { return ""; }

    // Dotted decimal is the IPv4 convention; a 16-byte IPv6 address is printed the same way here.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < rawAddress.length; i++) {
        if (i > 0) { sb.append("."); }
        // Java bytes are signed, so mask with 0xff or 192 would print as -64
        sb.append(rawAddress[i] & 0xff);
    }

    return sb.toString();
}
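
One thing to watch when decoding these arrays is that Java bytes are signed, so any octet above 127 comes back negative unless you mask it with 0xff. Here is a small standalone sketch of that pitfall. It also shows that the JDK can do the IP decoding for you through InetAddress.getByAddress. The class name and sample address are made up for this example.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class ByteDecodingCheck {
    /** Format a byte as two upper-case hex digits, treating it as unsigned. */
    static String hex(byte b) {
        return String.format("%02X", b & 0xff);
    }

    /** Let the JDK turn a raw 4-byte or 16-byte address into its textual form. */
    static String toText(byte[] rawAddress) {
        try {
            return InetAddress.getByAddress(rawAddress).getHostAddress();
        } catch (UnknownHostException ex) {
            // Thrown when the array is not 4 or 16 bytes long
            throw new IllegalArgumentException("not a valid raw IP address", ex);
        }
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 192, (byte) 168, 1, 10};
        System.out.println("signed view of first octet:   " + raw[0]);
        System.out.println("unsigned view of first octet: " + (raw[0] & 0xff));
        System.out.println("decoded by the JDK:           " + toText(raw));
        System.out.println("hex with leading zero:        " + hex((byte) 0x0a));
    }
}
```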


There you have it. Run the code and see what you get on your own machine.