Sometimes the buzz around a new technology is SO strong that it’s hard to see the trees or the forest. Take big data, for example. Crack open the NY Times, Computerworld, or whatever technology magazine you’d like and you’ll often read about the “Next Big Technology Wave” or how there’s a shortage of big data professionals. And in the end you have to ask yourself: what exactly is big data? Here’s the long and short of it from my viewpoint.
Big data is a bit of a buzzword and technology meme that describes a scenario where an organization has SO much data, in a LOT of different formats, generated so quickly that it can’t keep up with it. Or at least it can’t derive value from it. While many companies store the information, most aren’t doing any real analytics on it because there is just TOO much. Enter big data technology, which was made possible by cheap storage, cheap compute cycles (distributed and cloud computing), and software that can process huge amounts of information.
While big data is not solely focused on unstructured data, that category of data does pose the biggest hurdles. Relational databases are built from the ground up to store corporate information in neat rows and columns for fast, optimized queries. But as the world evolved into Web 2.0, so much user-generated data started coming out of internet activity (Facebook, Yahoo, Google) that it’s been nearly impossible to keep up with, never mind use. If you had someone analyze all of the search queries you’ve made on Google or Bing, what would they know about you? My guess is the answer would be “plenty”!
So you’ve got cheap storage and computing to hold all this wonderful information, but how do you analyze it and use it? Technology companies have tried to uncover this holy grail of information processing, and there are several proprietary offerings that fall into the big data category: EMC Greenplum, Oracle’s Big Data Appliance, SAP HANA, and more. But as you can guess, most vendors are going to be biased toward their own technology, and given the complexity of these potential solutions, they are not inexpensive.
Enter Apache Hadoop! Hadoop is an open-source project housed at the Apache Software Foundation and based upon technical papers written by Google. Hadoop founder Doug Cutting (currently an Architect at Cloudera) leveraged his search background and expertise to create a big data framework focused on storing, processing, and analyzing large streams of data. The two key technologies within Apache Hadoop are the Hadoop Distributed File System (scalable, highly available, distributed data storage) and MapReduce (an application framework for parallel processing of data). As you can guess, given its origins as a non-vendor technology, Apache Hadoop had a vendor-agnostic advantage in solving a problem that spans multiple technologies and vendors. And given the fact that open-source software is freely distributed and used, the only real hurdle to using Apache Hadoop is LEARNING how to use the technology. So Apache Hadoop became a leader in the big data space due to its agnosticism and slowly began to be supported by more traditional, proprietary software vendors. Microsoft, IBM, and EMC all stepped up to work with and integrate Hadoop technology into their offerings.
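To make the MapReduce idea a little more concrete, here is a minimal sketch of the classic word-count pattern in plain Python. It runs in a single process and is NOT Hadoop's actual API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. In a real cluster, the map and reduce steps would run in parallel on many machines, with the input records read from the distributed file system.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence (hypothetical local driver)."""
    pairs = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
```

The point of splitting the work this way is that each call to `map_phase` touches only one record and each call to `reduce_phase` touches only one key, so in principle both steps can be spread across as many machines as you have.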
Now that Hadoop is embraced by many across the technology landscape, where do you, as a business customer, go to learn how to Hadoop it? While the software is available through the Apache Software Foundation, commercial distributions of Apache Hadoop sprang up from the fertile big data earth. Cloudera, Hortonworks, and MapR are three examples of Apache Hadoop vendors that stepped up to act as both contributors to and supporters of the open-source project. They also packaged commercial distributions providing support, tools, and training for Apache Hadoop to make it more usable and stable for corporate environments, much as Red Hat led the adoption of enterprise-ready Linux in companies today.
So there you go, the Apache Hadoop tree in the middle of a huge big data forest! In the end, big data is really a superset of technologies designed to solve a business problem: far too much data and information, too little knowledge. Add in a little bit of distributed storage, a little bit of parallel processing, and a robust open-source application and, as usual in the IT world, it’s pretty complicated. And big data technology is evolving quickly as we speak. But the reality is that the problems being solved by big data won’t be going away any time soon. I can’t imagine any of us spending less time on the internet moving forward. Just hold on tight because this big data, Apache Hadoop ride is just getting started…