Learning how to program and develop for the Hadoop platform can lead to lucrative new career opportunities in Big Data. But like the problems it solves, the Hadoop framework can be quite complex and challenging. Building a strong foundation, leveraging online resources, and focusing on the basics with professional training can help neophytes across the Hadoop finish line. If I’ve learned one thing in two decades of IT, it’s that the learning never ends. In the following posts, I’d like to walk you through the path that I took in identifying Hadoop as a “must have” skill and how I quickly got ramped up on the technology.
Hadoop is a paradigm-shifting technology that lets you do things you could not do before – namely compile and analyze vast stores of data that your business has collected. “What would you want to analyze?” you may ask. How about customer click and/or buying patterns? How about buying recommendations? How about personalized ad targeting or more efficient use of marketing dollars?
From a business perspective, Hadoop is often used to build deeper relationships with external customers, providing them with valuable features like recommendations, fraud detection, and social graph analysis. In-house, Hadoop is used for log analysis, data mining, image processing, extract-transform-load (ETL), network monitoring – anywhere you’d want to process gigabytes, terabytes, or petabytes of data.
Hadoop allows businesses to find answers to questions they didn’t even know how to ask, providing insights into daily operations, driving new product ideas, or putting compelling recommendations and/or advertisements in front of consumers who are likely to buy. The fact that Hadoop can do all the above is not the compelling argument for its use. Other technologies have been around for a long, long while which can and do address everything we’ve listed so far. What makes Hadoop shine however, is that it performs these tasks in minutes or hours for little or no cost versus the days or weeks and substantial costs (licensing, product, specialized hardware) of previous solutions.
Hadoop does this by abstracting out all of the difficult work in analyzing large data sets, performing its work on commodity hardware and scaling linearly. Add twice as many worker nodes and your processing will generally complete two times faster. With datasets growing larger and larger, Hadoop has become the solitary solution businesses turn to when they need fast, reliable processing of large, growing data sets for little cost.
Because it needs only commodity hardware to operate, Hadoop also works incredibly well with public cloud infrastructure. Spin up a large cluster only when you need to, then turn everything off once the analysis is done. In the hands of a business-savvy technologist, Hadoop makes the impossible look trivial.
The Pillars of Hadoop — HDFS and MapReduce
Architecturally, Hadoop is just the combination of two technologies: the Hadoop Distributed File System (HDFS) that provides storage and the MapReduce programming model which provides processing. HDFS exists to split, distribute, and manage chunks of the overall data set, which could be a single file or a directory full of files. These chunks of data are pre-loaded onto the worker nodes, which later process them in the MapReduce phase. By having the data local at process time, HDFS saves all of the headache and inefficiency of shuffling data back and forth across the network.
In the MapReduce phase, each worker node spins up one or more tasks (which can either be Map or Reduce). Map tasks are assigned based on data locality if at all possible. A Map task will be assigned to the worker node where the data resides. Reduce tasks (which are optional) then typically aggregate the output of all of the dozens, hundreds, or thousands of map tasks and produce final output.
The Map and Reduce programs are where your specific logic lies, and seasoned programmers will immediately recognize Map as a common built-in function or data type in many languages. For example, we could define a function squareMe, which does nothing but return the square of a number. We could then pass an array of numbers to a map call, telling it to run squareMe on each. So an input array of (2,3,4,5) would return (4,9,16,25), and our call would look like (in Python) map(“squareMe”,array(‘i’,[2,3,4,5]). Hadoop will parse the data in HDFS into user-defined keys and values, and each key and value will then be passed to your Mapper code. In the case of image processing, each value may be the binary contents of your image file, and your Mapper may simply run a user-defined convertToPdf function against each file. In this case, you wouldn’t even need a Reducer, as the Mappers would simply write out the PDF file to some datastore (like HDFS or S3).
Consider, however, if you wished to count the occurrences of a list of “good/bad” keywords in all customer chat sessions, twitter feeds, public Facebook posts, and/or e-mails in order to gauge customer satisfaction. Your good list may look like happy, appreciate, “great job”, awesome, etc., while your bad list may look like unhappy, angry, mad, horrible, etc., and your total data set of all chat sessions and emails may be hundreds of gigabytes. In this case, each Mapper would work only on a subset of that overall data, and the Reducer would be used to compile the final count, summing up outputs of all the Map tasks. At its core, Hadoop is really that simple. It takes care of all the underlying complexity, making sure that each record is processed, that the overall job runs quickly, and that failure of any individual task (or hardware/network failure) is handled gracefully. You simply bring your Map (and optionally Reduce) logic, and Hadoop processes every record in your dataset with that logic.
Solid Linux and Java Will Speed Your Success
Hadoop is written in Java and is optimized to run Map and Reduce tasks that were written in Java as well. If your Java is rusty, you may want to spend a few hours with your Java for Dummies book before you even begin looking at Hadoop. Although Java familiarity also implies good coding practices (especially Object-Oriented Design (OOD) coding practices), you may want to additionally brush up on your Object Oriented skills and have a clear understanding of concepts like Interfaces, Abstract Objects, Static Methods, and Variables, etc. Weekend afternoons at Barnes and Noble or brownbag sessions with other developers at work are the quickest ways I know to come up to speed on topics like this.
Although it does offer a Streaming API which allows you to write your basic Map and Reduce code in any language of your choice, you’ll find that most code examples and supporting packages are Java-based, that deeper development (such as writing Partitioners) still requires Java, and that your Streaming API code will run up to 25% slower than a Java implementation. Although Hadoop can run on Windows, it was built initially on Linux, and Linux is the preferred method for both installing and managing Hadoop. The Cloudera Distribution of Hadoop (CDH) is only officially supported on Linux derivatives like Ubuntu ® and RedHat ®. Having a solid understanding of getting around in a Linux shell will also help you tremendously in digesting Hadoop, especially with regards to many of the HDFS command line parameters. Again, a Linux for Dummies book will probably be all you need.
Once you’re comfortable in Java, Linux, and good OOD coding practices, the next logical step would be getting your hands dirty by either installing a CDH Virtual Machine or a CDH distribution. When you install a Cloudera provided CDH distribution of Hadoop, you’re getting assurance that some of the best minds in the Hadoop community have carefully reviewed and chosen security, functionality, and supporting patches, tested them together, and provided a working, easy-to-install package.
Although you can install Hadoop from scratch, it is both a daunting and unnecessary task that could burn up several weeks of your time. Instead, download either a local virtual machine (VM), which you can run on your workstation, or install the CDH package (CDH4 is the latest) for your platform – either of which will only take 10 minutes. If you’re running in a local VM, you can run the full Hadoop stack in pseudo-distributed mode, which basically mimics the operation of a real production cluster right on your workstation. This is fantastic for jumping right in and exploring.
Reproduced from Global Knowledge White Paper: Learning How To Learn Hadoop