In the world of big data, most industry experts agree that Hadoop is the tool of choice for ingestion, analysis, and interpretation of the massive amounts of data that nearly every business finds itself swimming in. Business leaders have discovered that this data has real and profound impacts on the bottom line, and as a result, more and more IT departments are tasked with creating and maintaining software to pull value out of that data.
Either by using the framework itself or one of the multitude of “ecosystem” projects around its periphery, people use Hadoop as an extract, transform, load (ETL) platform (Sqoop, HParser), a NoSQL database (HBase), a file server, an OLAP replacement, a business intelligence framework, a real-time data ingestion service (Flume, Chukwa), a distributed coordination service (ZooKeeper), a real-time big data query engine (Impala, Drill), and a couple dozen other technologies that we don’t have space to even introduce.
For a long time in the Hadoop world, individuals have built and chained these pieces together in ways that solve novel problems, making Hadoop something of the Swiss Army Knife of IT. Let’s look at each of these uses in more detail.
Cost-Effective Storage at the Core
Regardless of whether you decide to use stock Hadoop Distributed File System (HDFS) or a replacement storage engine, the end result is that you still end up with a very cost-effective, highly redundant, scalable storage solution.
If you find yourself with a lot of unused disk space after adding more nodes for processing, you will find that Hadoop is a great place to store much of the data that currently lives on your NetApp or EMC file servers. The Hadoop ecosystem provides FuseDFS, HttpFS, WebDAV, FTP, and many other access protocols and packages.
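One of those access paths, WebHDFS (the REST interface that HttpFS also exposes), makes the cluster reachable from any HTTP client. The sketch below builds a WebHDFS URL for a directory listing; the hostname, port, and path are placeholder assumptions, not values from this article.

```python
from urllib.parse import urlencode

# Hypothetical cluster endpoint: the NameNode's HTTP port (50070 in
# Hadoop 1.x-era deployments); an HttpFS gateway would work the same way.
NAMENODE = "namenode.example.com:50070"

def webhdfs_url(path, op, **params):
    """Build a WebHDFS REST URL for the given HDFS path and operation."""
    query = urlencode(dict(params, op=op))
    return f"http://{NAMENODE}/webhdfs/v1{path}?{query}"

# Listing a directory is a plain HTTP GET -- curl, a browser, or a BI
# tool can read from the cluster without any Hadoop client libraries.
url = webhdfs_url("/data/logs", "LISTSTATUS")
```

Because the interface is plain HTTP, file-server-style access needs nothing more exotic than an HTTP library on the client side.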
Cost Savings AND a Strategic Business Advantage
The reality is that you’re more likely to analyze data once it’s stored in your Hadoop cluster. The MapReduce algorithms that perform that analysis are often quite trivial pieces of code, usually no more than a few hundred lines, and a variety of alternate interfaces like Hive and Pig make that data even more accessible.
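To give a sense of how trivial that code can be, here is a minimal word-count sketch in the style of a Hadoop Streaming job, written as plain Python functions over iterables of lines so it runs anywhere; a real job would wire the same logic to stdin and stdout.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word.

    In a real job, Hadoop sorts map output by key before the reduce
    phase; sorted() stands in for that shuffle in this local sketch.
    """
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

counts = dict(reducer(mapper(["the cat sat", "the mat"])))
```

The entire analysis fits in two short functions; the framework supplies the distribution, fault tolerance, and data movement.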
One analysis role that Hadoop commonly finds itself tasked with is Online Analytics Processing (OLAP). OLAP is the practice of materializing views of your data in a data warehouse so that specific aggregate lookups are very fast. With a bit of data re-organization, Hadoop can often serve a great many of these OLAP queries faster, more flexibly, and more cost-effectively.
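The idea of a materialized view can be shown with a toy example. The fact table and column names below are invented for illustration; in a real deployment the rollup would be a batch job over billions of rows in HDFS.

```python
from collections import defaultdict

# Toy fact table: (region, product, revenue) rows.
facts = [
    ("east", "widget", 120.0),
    ("east", "gadget", 80.0),
    ("west", "widget", 200.0),
]

# Materialize the (region, product) -> total revenue view once...
cube = defaultdict(float)
for region, product, revenue in facts:
    cube[(region, product)] += revenue

# ...so each OLAP-style lookup becomes a constant-time read of the
# pre-aggregated view rather than a scan over the raw facts.
east_widget_revenue = cube[("east", "widget")]
```

The “re-organization” the article mentions is exactly this: paying the aggregation cost up front, in batch, so interactive lookups stay cheap.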
Not Your Father’s Data Analysis Oldsmobile
Whether you’re a large enterprise or a scrappy startup, your first encounter with Hadoop may well come via HBase, ZooKeeper, or Impala. HBase is a distributed NoSQL datastore that rides on top of HDFS. It is a column-oriented key-value store that delivers extremely high throughput on massive amounts of data. To achieve that throughput, it bypasses Hadoop’s MapReduce for serving reads and writes, relying instead on its own mechanisms: mainly caching, in-memory operations, and ZooKeeper-based coordination.
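HBase’s data model is easier to picture than to describe: rows are kept sorted by key, and each cell is addressed by a row key plus a “family:qualifier” column name. The class below is a toy model of that layout, not the real HBase client API; the table and row keys are invented for illustration.

```python
from bisect import insort

class ToyHBaseTable:
    """Toy model of HBase's data model: rows sorted by row key, cells
    addressed by (row key, 'family:qualifier'). Not the real client API."""

    def __init__(self):
        self.row_keys = []   # kept sorted, as HBase keeps rows within regions
        self.rows = {}       # row key -> {column: value}

    def put(self, row_key, column, value):
        if row_key not in self.rows:
            insort(self.row_keys, row_key)   # maintain sort order on insert
            self.rows[row_key] = {}
        self.rows[row_key][column] = value

    def get(self, row_key, column):
        return self.rows.get(row_key, {}).get(column)

    def scan(self, start, stop):
        """Range scan over sorted row keys -- the access pattern the
        sorted layout is designed to serve quickly."""
        for key in self.row_keys:
            if start <= key < stop:
                yield key, self.rows[key]

table = ToyHBaseTable()
table.put("user#100", "info:name", "Ada")
table.put("user#042", "info:name", "Alan")
```

The sorted-by-key layout is why HBase row-key design matters so much in practice: keys that sort together scan together.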
The Hadoop Evolution: From Batch to Real-Time Queries
Impala is a Cloudera-developed, real-time query engine that, like HBase, uses its own processing architecture instead of MapReduce. Whereas a MapReduce job might take minutes or hours to complete, an Impala query might return in milliseconds, allowing internal or external users to query HDFS or HBase in real time. Although Cloudera expects its first production-ready code drop of Impala sometime in Q1 of 2013, many organizations are already rumored to be deploying it in production.
Big Data Analysis Meets Open-Source Innovation
Just as Linux spawned decades of innovation and real-world problem-solving in the infrastructure domain, Hadoop’s influence in the big data domain will be real, lasting, and felt for decades to come.