Once you’re doing real development, get into the habit of using smaller test datasets on your local machine and running your code iteratively: first in Local Jobrunner Mode (which lets you locally test and debug your Map and Reduce code), then in Pseudo-Distributed Mode (which more closely mimics the production environment), and finally in Fully-Distributed Mode (your real production cluster). Developing iteratively lets you work out bugs on smaller subsets of the data, so that when you run on your full dataset with real production resources, your job won’t crash three-quarters of the way in.
Remember that in Hadoop, your Map (and possibly Reduce) code will be running on dozens, hundreds, or thousands of nodes. Any bugs or inefficiencies will get amplified in the production environment. In addition to developing iteratively through Local, Pseudo-Distributed, and Fully-Distributed modes with increasingly larger subsets of test data, you’ll also want to code defensively, making heavy use of try/catch blocks and gracefully handling malformed or missing data (which you’re sure to encounter).
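As a minimal sketch of what coding defensively can look like, the plain Java below parses records without ever letting a bad line throw, and counts what it skips. It is written without any Hadoop classes so it runs on its own; the class name DefensiveParser, the "customerId,amount" record layout, and the method names are all assumptions made for illustration. In a real job, the same checks would sit inside your map() method, and the skipped count would more likely be reported through a Hadoop counter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Illustrative only: defensive parsing of a hypothetical "customerId,amount" record. */
final class DefensiveParser {

    /** Holds the sum of all valid amounts and the number of lines skipped. */
    static final class Summary {
        final double total;
        final int skipped;

        Summary(double total, int skipped) {
            this.total = total;
            this.skipped = skipped;
        }
    }

    /** Returns the amount, or Optional.empty() if the line is null, blank, or malformed. */
    static Optional<Double> parseAmount(String line) {
        if (line == null || line.trim().isEmpty()) {
            return Optional.empty();
        }
        String[] fields = line.split(",", -1);
        if (fields.length != 2) {
            return Optional.empty(); // wrong number of columns
        }
        try {
            return Optional.of(Double.parseDouble(fields[1].trim()));
        } catch (NumberFormatException e) {
            return Optional.empty(); // amount is missing or not numeric
        }
    }

    /** Sums the valid amounts and counts bad lines instead of failing on them. */
    static Summary summarize(List<String> lines) {
        double total = 0.0;
        int skipped = 0;
        for (String line : lines) {
            Optional<Double> amount = parseAmount(line);
            if (amount.isPresent()) {
                total += amount.get();
            } else {
                skipped++;
            }
        }
        return new Summary(total, skipped);
    }

    public static void main(String[] args) {
        Summary s = summarize(Arrays.asList("c1,10.5", "bad line", "c2,", "c3,4.5"));
        System.out.println("total=" + s.total + " skipped=" + s.skipped);
    }
}
```

Counting the skipped records rather than silently dropping them is a deliberate choice here: a sudden jump in that number on a larger dataset is often the first sign that the input format has changed.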
Chances are also very high that once you or others in your company come across Pig or Hive, you’ll never write another line of Java again. Pig and Hive represent two different approaches to the same problem: writing good Java code to run on MapReduce is hard, and unfamiliar territory for many. Both products provide simplified interfaces to the MapReduce paradigm, making the power of Hadoop accessible to nondevelopers. In the case of Hive, a SQL-like language called HiveQL provides this interface. Users simply submit HiveQL queries like SELECT * FROM SALES WHERE amount > 100 AND region = 'US', and Hive translates each query into one or more MapReduce jobs, submits those jobs to your Hadoop cluster, and returns the results. Hive was heavily influenced by MySQL, and those familiar with that database will be quite at ease with HiveQL.
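To get a feel for what that one-line query saves you, here is a rough sketch of the per-record filtering logic you would otherwise write by hand in Java. This is not the code Hive generates; it only illustrates the same WHERE clause as a plain Java predicate. The class name SalesFilter and the "region,amount" record layout are assumptions for the example, and a real job would wrap this logic in a Mapper with all the usual job setup around it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: the filter "amount > 100 AND region = 'US'" written by hand. */
final class SalesFilter {

    /** Returns true if a hypothetical "region,amount" line satisfies the WHERE clause. */
    static boolean matches(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 2) {
            return false; // malformed record: treat as non-matching
        }
        try {
            double amount = Double.parseDouble(fields[1].trim());
            return amount > 100 && "US".equals(fields[0].trim());
        } catch (NumberFormatException e) {
            return false; // non-numeric amount
        }
    }

    /** Keeps only the matching lines, much as a map-only filtering job would. */
    static List<String> select(List<String> lines) {
        List<String> selected = new ArrayList<>();
        for (String line : lines) {
            if (matches(line)) {
                selected.add(line);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> sales = Arrays.asList("US,150.00", "US,99.99", "CA,500", "US,100.01");
        System.out.println(select(sales));
    }
}
```

Even this stripped-down version needs parsing, error handling, and a loop before any Hadoop plumbing is added, which is the gap HiveQL closes.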
Pig takes a very similar approach, using a high-level programming language called Pig Latin, which contains familiar constructs such as FOREACH; arithmetic, comparison, and boolean operators; and SQL-like operations such as MIN, MAX, and JOIN. When users run a Pig Latin program, Pig converts the code into one or more MapReduce jobs and submits them to the Hadoop cluster, just as Hive does. What these two interfaces have in common is that they are incredibly easy to use, and they both create highly optimized MapReduce jobs, which often run even faster than similar code developed in a non-Java language via the Streaming API.
If you’re not a developer or you don’t want to write your own Java code, mastery of Pig and Hive is probably where you want to spend your time and training budgets. Because of the value they provide, it’s believed that the vast majority of Hadoop jobs are actually Pig or Hive jobs, even in such technology-savvy companies as Facebook. As you dig deeper into the Hadoop ecosystem, you’ll quickly trip across some other supporting products like Flume, Sqoop, Oozie, and ZooKeeper, which we didn’t have time to mention here.
Applying the Knowledge at Work
At this point, your education could take one of many routes. If at all possible, consider an official Cloudera training class. With a powerful technology such as Hadoop, you want to make sure you get the basics down as soon as possible, and taking a course will quickly pay for itself by helping you avoid costly mistakes as well as introducing you to concepts and ideas that would’ve taken you months to learn on your own. There is simply no substitute for having dedicated time with a domain expert whom you can ask questions and get clarifications from.
If training is not available to you, probably the next best way to learn is by giving yourself a real-world task, or, even better, by getting company approval to use Hadoop at work. Some potential tasks you may want to look at are: counting and ranking the number of interactions (e-mails, chat sessions, etc.) per customer agent, crawling web logs looking for errors or common “drop off” pages, building search indices for large document stores, or monitoring social media channels for brand sentiment. The only requirement of an initial project is that it should be relatively simple and low-risk. You’ll want to take baby steps before tackling harder tasks.
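To make the first of those tasks concrete, here is a small sketch of counting and ranking interactions per customer agent. It imitates the map, group, and reduce steps in plain Java so you can run it on your laptop before touching a cluster; it does not use the Hadoop API. The class name AgentInteractionCount and the "agentId,channel" record layout are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: count and rank interactions per agent, MapReduce-style. */
final class AgentInteractionCount {

    /** "Map" step: pull the agent ID out of a hypothetical "agentId,channel" line. */
    static String mapToAgent(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 2 || fields[0].trim().isEmpty()) {
            return null; // malformed record: emit nothing
        }
        return fields[0].trim();
    }

    /** "Shuffle and reduce" steps: group by agent ID and add 1 per interaction. */
    static Map<String, Integer> countByAgent(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String agent = mapToAgent(line);
            if (agent != null) {
                counts.merge(agent, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Ranks agent IDs by interaction count, highest first; ties are broken by agent ID. */
    static List<String> rank(Map<String, Integer> counts) {
        List<String> agents = new ArrayList<>(counts.keySet());
        agents.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return agents;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "a1,email", "a2,chat", "a1,chat", "a3,email", "a2,email", "a1,email", "garbage");
        Map<String, Integer> counts = countByAgent(log);
        for (String agent : rank(counts)) {
            System.out.println(agent + "\t" + counts.get(agent));
        }
    }
}
```

On a real cluster, the counting would typically be one MapReduce job and the ranking a second step over its much smaller output, but the per-record logic stays essentially the same.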
As you grow to understand and appreciate the power of Hadoop, you’ll be uniquely positioned to identify opportunities for its use inside of your business. You may find it useful to initiate meetings with gatekeepers or executives inside your business in order to help them understand and leverage Hadoop on data that may be just sitting around unused. Many businesses see a 3% to 5% jump in sales after implementing a recommendation engine alone. Whatever tasks you decide to tackle in Hadoop, you’ll also find that there are abundant, easy-to-find code walkthroughs online.
If you’re stuck for ideas, or still confused about some concepts in Hadoop, you’ll find that Cloudera produces some excellent (and free) online courses at Cloudera University.
Step Back To Move Forward
With a product as deep and wide as Hadoop, time spent making sure you understand the foundation will more than pay for itself when you get to higher-level concepts and supporting packages. Although it may be frustrating or humbling to go back and re-read a Linux or Java “Dummies” book, you’ll be well rewarded when you inevitably encounter some bizarre behavior, even in a Pig or Hive query, and need to look under the hood to debug and resolve the issue.
Whether you choose formal training, on-the-job training, or just slogging through code examples you find on the Web, make sure you have a firm foundation in what Hadoop does and how it does it.
Reproduced from Global Knowledge White Paper: Learning How To Learn Hadoop