So you’ve done your research and quickly realized that Hadoop is going to be at the very core of your company’s Big Data Platform, given that data storage and processing will never get more cost effective than open-source software running on commodity hardware. The next level down the rabbit hole has you in a quandary, though. Beyond Hadoop, you start running into technologies like Hive, Impala, Pig, Storm, YARN and other elements from Hadoop’s periodic table of technologies.
You also start to become highly suspicious: Is all of this as easy and low-cost as it seems? Are the claims made about Hadoop valid? What dragons lurk in the shadows?
In this short blog post, we hope to shine some light on the most common, and perhaps biggest, surprises (both good and bad) that new users of Hadoop run into in their first 12–18 months.
The first surprise that many users encounter is that MapReduce (the half of Hadoop dedicated to processing your big data sets) is slow. The primary reasons are that MapReduce involves a lot of disk and network I/O, and that JVM (Java Virtual Machine) spin-up and teardown are computationally expensive (every map and reduce task runs in a separate JVM). Even the simplest MapReduce job, one that does nothing but pass data straight through (basically just performing a sort), looks like this:
- Mappers read data from disk (disk read and some network I/O)
- Mappers write intermediate data to local disk (disk write)
- Reducers copy data from mappers (possibly a lot of network I/O)
- Reducers write data back to disk (disk write and some network I/O)
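Conceptually, the pass-through job above boils down to three phases. Here's a minimal sketch in Python (not Hadoop code, just the data flow) showing that even an identity mapper and identity reducer still imply a full sort-by-key shuffle in the middle:

```python
from itertools import groupby
from operator import itemgetter

def identity_mapper(key, value):
    # Pass-through mapper: emit the record unchanged.
    yield key, value

def identity_reducer(key, values):
    # Pass-through reducer: emit every value for the key unchanged.
    for v in values:
        yield key, v

def run_job(records, mapper, reducer):
    # Map phase: in a real cluster this is a disk read per input split.
    intermediate = [kv for k, v in records for kv in mapper(k, v)]
    # Shuffle/sort phase: intermediate data is written to local disk,
    # copied across the network to reducers, and merge-sorted by key.
    # This is the unavoidable I/O cost, even when the job "does nothing."
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: each reducer sees one key with all of its values,
    # then writes its output back to disk.
    out = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        out.extend(reducer(key, (v for _, v in group)))
    return out

records = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]
print(run_job(records, identity_mapper, identity_reducer))
# → [('a', 1), ('a', 4), ('b', 2), ('b', 3)]
```

Notice the output comes back sorted by key even though neither the mapper nor the reducer did anything: the sort is baked into the framework's shuffle, along with all the disk and network I/O it entails.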
This is a long-standing complaint about Hadoop, and fortunately the last couple of years have seen a flurry of lower-latency, largely in-memory frameworks (Impala, Spark and Storm among them, with Kafka feeding them streaming data) come to the rescue and find use in production workloads.
Chances are, if you're spinning up a cluster today, your users are going to want some extra memory in those worker machines to run one or more of these in-memory engines alongside MapReduce. If you have extra dollars to spend on your cluster, that money is best spent on extra memory and/or higher-speed networking equipment.
You’ll likely still be using MapReduce heavily for after-the-fact batch processing, but despite the Java-heavy focus of most of the MapReduce documentation, your main interface will probably be Hive, Pig or Impala. You’ll almost never write raw Java.
Pig, Hive and Impala (which uses a subset of Hive's interface language, HiveQL) are all much simpler to learn than Java. Even if your data analysts do know Java well, it makes no sense to spend an hour writing Java code when you can get the same end result from a five- to 10-line Pig Latin, HiveQL or Impala query.
Hive and Pig still run MapReduce under the hood, but they abstract away a great deal of the complexity that MapReduce requires, and in most cases they come close to the speed of well-written Java. Most folks report that their Hive or Pig queries run only 10–15 percent slower than well-written Java code, meaning that even with the slightly slower run time, the productivity gain from using these interfaces results in overall faster start-to-finish analysis.
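To make the productivity gap concrete, here's a sketch of the kind of grouped aggregation that takes dozens of lines of MapReduce Java but only a few lines of HiveQL. The HiveQL appears in the comment (the table and column names are hypothetical, for illustration only), and the equivalent semantics fit in a few lines of Python:

```python
from collections import Counter

# Roughly what a one-line HiveQL query expresses (hypothetical schema):
#   SELECT page, COUNT(*) FROM page_views GROUP BY page;
def count_by_page(page_views):
    # Group-by-and-count over an iterable of (page, user) records.
    return Counter(page for page, _user in page_views)

views = [("/home", "u1"), ("/about", "u2"), ("/home", "u3")]
print(count_by_page(views)["/home"])  # → 2
```

Writing the same group-by as a Java MapReduce job means a mapper class, a reducer class, a driver and a build step; the query form is why analysts rarely touch raw Java.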
Impala is a very different beast from either Hive or Pig. It uses a completely separate set of daemons from MapReduce, and those daemons favor speed over fault tolerance, making your queries run perhaps 10 to 70 times faster than the equivalent Hive or Pig. That makes Impala excellent for the ad hoc data discovery and analysis that many Hadoop users want and need.
One of the few times you'll write raw Java is when you need to squeeze out that last bit of performance to ensure that your job runs faster, much as you might rewrite a hot path from an interpreted language like PHP, Python or Ruby in low-level C.
If you’re planning on sending some of your users off for Hadoop training, you may want to consider sending them to a class focused on the tools they’ll be using the most — Hive, Pig or Impala — rather than a session that teaches them how to write MapReduce in Java.
Before you run off and build your cluster, you’ll also want to take heed of two other big surprises: that managing a cluster is a complicated task (which many companies underestimate) and that, at the end of it all, you may not even need your own cluster.
Hadoop is complicated, and although tools like Cloudera Manager greatly simplify the installation, operation and maintenance of a cluster, you'll never escape the routine tasks of node commissioning and decommissioning, hardware and software monitoring, and failed-component replacement. And here's the rub: you may not even need to do any of this to begin with.
If your big data analysis needs are sporadic (say, only five or so shorter jobs running over the course of a week), a "Hadoop as a Service" product like Amazon Web Services' Elastic MapReduce ("EMR") may let you simplify your big data analysis even further.
EMR and tools like it allow you to spin up hundreds of machines and use them (and pay for them) only for the hours your job is running. Using EMR or a similar tool lets your team focus completely on the analysis tasks, without the burden of hardware and software installation, monitoring and maintenance.
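As a back-of-the-envelope illustration of the pay-per-hour model (the rates below are hypothetical placeholders, not real AWS pricing; plug in actual quotes for your own comparison):

```python
# Hypothetical rates, for illustration only -- not real pricing.
ON_DEMAND_RATE = 0.50    # $ per machine-hour for a rented node
OWNED_MONTHLY = 9000.0   # $ per month assumed for an always-on owned cluster

def monthly_rental_cost(machines, hours_per_job, jobs_per_month):
    # Pay only for the machine-hours actually used while jobs run.
    return machines * hours_per_job * jobs_per_month * ON_DEMAND_RATE

# Five 2-hour jobs a week (~20 per month) on 100 rented machines:
rental = monthly_rental_cost(machines=100, hours_per_job=2, jobs_per_month=20)
print(rental)                   # → 2000.0
print(rental < OWNED_MONTHLY)   # → True
```

The point isn't the specific numbers; it's that sporadic workloads only accrue cost while jobs run, whereas an owned cluster costs money around the clock whether it's busy or idle.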
If you’re just getting started with Hadoop, using EMR can enable your team to get straight to the coding and experience the power of Hadoop within just a few hours. And, should your big data analysis needs become less sporadic, you can always pull your source data back out of the cloud and set up your cluster locally.
It’s our sincere hope that these “from the trenches” tips will enable you to better plan your Hadoop usage over the first few months. If you have additional tips or tricks, we’d love to hear your feedback in the comments below.