Tuesday, July 22, 2014

What is Big Data?

I've been wanting to write this for a long time, but now have time to put some thoughts down based on discussions and comments from various groups of technical and business people.

The question is, What is Big Data?

Is Big Data just that, big ... data? Well, there have always been very large data sets, certainly constrained to the technology limitations of the time. Storage systems that once were measured in gigabytes, then terabytes, are now measured in petabytes. Certainly out ability to store vast quantities of data have increased dramatically.

With the advent of the Internet and the World Wide Web, our ability to collect information has also grown dramatically as well. Where companies and individuals were limited to collecting data within their particular sphere, now collecting data from locations and individuals around the globe is a daily occurance.

I've hear people sometimes assert that Big Data is simply doing the same things we were before, just in a different way, maybe with different tools. I can understand that point of view. I think that is generally the first step in the process of moving towards a Big Data mindset.

I would put forth that Big Data really is something different than just working with large data sets. To me, Big Data is really a different mind set. We live in a vastly more connected world than just a decade ago. The daily effect of Data in people's lives in general is incredibly pervasive. It's almost impossible to escape.

Big Data is about thinking differently about your data.

Big Data is about connecting data sources. What other data is available that can enhance what you have? Can you add user behavior from your web site? How about advertising or industry data? How about public sources from government or private sources like Twitter, Facebook, search engines like Google, Bing, Yahoo and others?

Big Data is more about openness and sharing. Not everything can be shared and you really, really need to be careful of user identifiable information, but what can you share? What can you publish or what data sets can you contribute to that enrich the overall community?

Big Data many times involves working with unstructured data. SQL database by their nature, enforce very strict structures on your data. If your data doesn't conform, then you're kind of out of luck. There are some things that SQL databases do extremely well, but there are real limits.

There are many new tools and techniques for working with unstructured data and extracting value from them. So called NOSQL data stores are designed to work with data that has little or no structure, providing new capabilities. Open source search engines like ElasticSearch and SOLR provide incredible search and faceting/aggregation abilities.

We have many machine learning algorithms and tools that let us dissect and layer structure on top of our data to help us make sense of it. Algorithms help us to cluster, classify, and figure out which documents/data are similar to other ones.

We can process volumes of data in ways that we couldn't before. Traditional compute requires the data to be moved to the application/algorithm, then the answer was written back to storage. Now we have platforms like Hadoop that effectively distribute large data sets and move the algorithm to the data allowing it to be processed and answers written in place, or to be distributed elsewhere.

Does Big Data require big/large data to be useful? You can run most of these tools on your laptop, so no, not really. You can even run many tools like Map/Reduce on a $35 Raspberry Pi. With Big Data tools, we can do things on our laptops that required data centers in the past.

Big Data requires experimentation. It requires people to ask new questions and develop new answers. What can we do now that we couldn't do before? It requires a different mindset.

  - Craig