
Hadoop Ecosystem

This learning path provides an explanation and demonstration of the most popular components in the Hadoop ecosystem.

GK# 7315

Course Overview


Apache Hadoop is open-source software for affordable supercomputing; it provides the distributed file system and the parallel processing framework required to run a massive computing cluster. This learning path explains and demonstrates the most popular components in the Hadoop ecosystem. It defines and describes theory and architecture while also providing instruction on installation, configuration, usage, and low-level use cases. The learning path can also help you prepare for the Cloudera Certified Developer for Hadoop, HDP Certified Developer, Cloudera Certified Administrator for Hadoop, or Hadoop 2.0 Administrator certification exams.
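To make the parallel processing model concrete before the modules begin, here is a minimal sketch of MapReduce's map, shuffle, and reduce phases in plain Python. This is an illustration only (no Hadoop cluster involved; all function names are ours, not Hadoop APIs), using the word-count problem the course itself treats as Hadoop's "Hello World":

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) key-value pair for every word in the input split
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values; here, sum the counts per word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["the"])  # 2
```

In a real cluster the map and reduce phases run in parallel across many data servers, and the shuffle moves data over the network, but the key-value flow is the same.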




What You'll Learn

  • Ecosystem for Hadoop
  • Installation of Hadoop
  • Data Repository with HDFS and HBase
  • Data Repository with Flume
  • Data Repository with Sqoop
  • Data Refinery with YARN and MapReduce
  • Data Factory with Hive
  • Data Factory with Pig
  • Data Factory with Oozie and Hue
  • Data Flow for the Hadoop Ecosystem



On-Demand Outline

Ecosystem for Hadoop

A Map for Big Data

  • Mapping Big Data
  • Continuing to Map Big Data

Key Terminology for Big Data

  • Defining Big Data
  • Terms for Data

Ecosystem for Hadoop

  • Mapping the Big Data Stack
  • Introducing Data Repository Components
  • Introducing Data Refinery Components
  • Introducing Data Factory Components

Theory for Hadoop

  • Design Principle for Hadoop
  • The Principle of Sharing Nothing
  • The Principle of Embracing Failure

Data Repository for Hadoop

  • Hadoop Distributed File System (HDFS)
  • Introducing HDFS Daemons

Data Refinery for Hadoop

  • Introducing Hadoop YARN
  • YARN Daemons on the Master Server
  • YARN Daemons on the Data Server
  • Introducing MapReduce

Data Analytics

  • Introducing Data Analytics

Hadoop Ecosystem Complexities

  • Mastering the Hadoop Ecosystem

Installation of Hadoop

Configuration of User Environments

  • Selecting an Environment

Pre-installation for Hadoop

  • Creating a Development Environment
  • Installing Java
  • Setting up SSH

Setup of Hadoop

  • The History of Hadoop Versions
  • Setting Up Hadoop
  • Installing Hadoop
  • Configuring Hadoop Environment
  • Configuring HDFS
  • Starting and Stopping HDFS
  • Configuring YARN and MapReduce
  • Starting and Stopping YARN
  • Stress Testing Hadoop
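The "Configuring HDFS" step above typically means editing hdfs-site.xml. As a rough illustration of what the module covers, a minimal pseudo-distributed configuration might look like the following (the property names are standard Hadoop keys; the file paths are examples only):

```xml
<!-- hdfs-site.xml: minimal pseudo-distributed settings (paths are examples) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value> <!-- single node, so one replica per block -->
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/hadoop/dfs/data</value>
  </property>
</configuration>
```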

Operations for Hadoop

  • First Use of HDFS
  • Introducing WordCount
  • First Use of WordCount

Monitoring for Hadoop

  • Hadoop Web UIs
  • NameNode and Resource Manager Web UIs

Troubleshooting of Hadoop Installation

  • Configuration Changes
  • Troubleshooting Hadoop
  • Troubleshooting Installation Errors

Data Repository with HDFS and HBase

Theory of HDFS

  • Describing HDFS Data Blocks
  • Describing HDFS URI
  • Describing the NameNode
  • NameNode in Detail
  • Describing the DataNode
  • Describing the Checkpoint Node
  • Describing the Backup Node

Operations for HDFS

  • HDFS Command Categories
  • HDFS Commands for Managing Data Files
  • HDFS Commands for Managing Web Logs

Troubleshooting of HDFS

  • HDFS Administration
  • HDFS Configuration
  • Troubleshooting HDFS

Theory for NoSQL and RDBMS

  • Comparing NoSQL to RDBMS

Overview of HBase and ZooKeeper

  • Introducing HBase and ZooKeeper
  • Installing ZooKeeper
  • Installing HBase

Operations for HBase

  • Using the HBase Command Line
  • Working with HBase Tables
  • Working with HBase Data

Data Repository with Flume

The Purpose of Flume

  • Defining Flume Data
  • Introducing cURL
  • Using cURL for Web Data

Setup of Flume

  • Defining Flume
  • Installing Flume

Operations for Flume

  • Creating Flume Agents
  • Describing a Flume Agent
  • First Flume Agent

Sources, Sinks, and Channels

  • Flume Sources
  • Flume Sinks
  • Flume Channels
  • File Channel to HDFS

Serializing Data with Avro

  • Avro Serialization
  • Using Avro Source
  • Multiple Flume Agents

Multiplex Agents for Flume

  • Flume Timestamps
  • Timestamping with Flume
  • Multiple Sources with Flume
  • Multi-flow Flume Agents
  • Multi-source Flume Agents
  • Multiple Sinks with Flume
  • Multi-sink Flume Agents

Troubleshooting of Flume

  • Flume Troubleshooting
  • Logging to the Flume Log File

Data Repository with Sqoop

Setup of MySQL

  • Overviewing MySQL
  • Installing MySQL
  • Creating a Database
  • Loading Data into MySQL Tables

The Purpose of Sqoop

  • Overviewing Sqoop
  • Architecture of Sqoop

Setup of Sqoop

  • Overview of Sqoop Configuration
  • Installing Sqoop

Operations for Sqoop

  • Overview of Importing with Sqoop
  • Performing a Sqoop Import
  • Overview of Exporting with Sqoop
  • Performing a Sqoop Export
  • Sqoop and HBase
  • Exporting into HBase

Troubleshooting of Sqoop

  • Sqoop Troubleshooting
  • Correcting a Database Connect Failure

Data Refinery with YARN and MapReduce

Theory for YARN

  • Explaining Parallel Processing
  • YARN Key Concepts
  • YARN Resource Manager
  • YARN Node Manager
  • YARN ApplicationMaster
  • YARN Job Failure
  • YARN Configurations

Theory for Key-value Pairs

  • Key-Value Pairs
  • MapReduce and Key-Value Pairs

Operations for MapReduce

  • WordCount, the Hello World of Hadoop
  • MapReduce
  • MapReduce Step-by-Step

First Program for MapReduce

Exploring Hadoop Classpath

Writing a MapReduce Job

APIs for MapReduce

  • The Mapper Java API
  • The Reducer Java API
  • The Driver Java API

Second Program for MapReduce

  • Writing a MapReduce Job for Inventory

Streaming for MapReduce

  • Hadoop Streaming
  • Running a Streaming Job
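Hadoop Streaming lets any executable that reads stdin and writes stdout serve as a mapper or reducer. A sketch of the classic word-count pair in Python follows; the file names and the hadoop command in the comment are illustrative, and the local dry run below stands in for a real cluster job:

```python
from io import StringIO
from itertools import groupby

def mapper(stdin, stdout):
    # Mapper: emit a "word<TAB>1" line for every word read from stdin
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word.lower()}\t1\n")

def reducer(stdin, stdout):
    # Reducer: streaming input arrives sorted by key, so lines for the
    # same word are consecutive and can be summed with groupby
    pairs = (line.rstrip("\n").split("\t") for line in stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        stdout.write(f"{word}\t{total}\n")

# Local dry run; a real job would ship these as mapper.py / reducer.py, e.g.:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py ...
mapped = StringIO()
mapper(StringIO("the quick brown fox\nthe lazy dog\n"), mapped)
shuffled = "\n".join(sorted(mapped.getvalue().splitlines())) + "\n"
reduced = StringIO()
reducer(StringIO(shuffled), reduced)
print(reduced.getvalue())
```

The `sorted` call here imitates the sort-and-shuffle that Hadoop performs between the two scripts on a cluster.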

Data Factory with Hive

The Purpose of Hive

  • Overviewing Hive

Setup of Hive

  • Overview of Hive Configuration
  • Installing Hive
  • Using Derby for Hive
  • Setting Up MySQL for Hive

Details of Hive

  • Hive Data Types
  • Hive Operators

Operations for Hive

  • Hive CREATE Statements
  • Hive SELECT Statements
  • Hive SQL Statements
  • Writing a First Hive Script

Joins and Views for Hive

  • Hive Joins and Views
  • Joining and Viewing in Hive

Partitions and Buckets for Hive

  • Overview of Partitioning Hive Data
  • Writing a Hive Partition Script
  • Overview of Bucketing Hive Data
  • Bucketing Hive Data

User-defined Functions for Hive

  • Hive User-defined Functions
  • Create a Hive UDF

Troubleshooting for Hive

  • Hive Troubleshooting
  • Using a Hive Explain Plan

Data Factory with Pig

The Purpose of Pig

  • Overviewing Pig

Setup for Pig

  • Overview of Pig Configuration
  • Installing Pig

Details of Pig

  • Pig Data Types
  • Pig Operators

Operations for Pig

  • Pig Command Line
  • Pig Scripts
  • First Pig Script
  • Pig Filtering
  • Pig Parameters and Arguments
  • Pig Functions

Working with Pig Operators

  • Pig JOIN
  • Pig GROUP

User-defined Functions for Pig

  • Pig User-defined Functions
  • Creating a Pig UDF

Troubleshooting for Pig

  • Pig Troubleshooting
  • Debugging with Pig

Data Factory with Oozie and Hue

The Purpose of Hive Daemons

  • The Purpose of Metastore and HiveServer2
  • Creating an Oozie Workflow
  • Installing HiveServer2
  • The Purpose of HCatalog
  • Installing WebHCat
  • Using HCatalog

The Purpose of Oozie

  • Overviewing Oozie

Setup for Oozie

  • Setting up Oozie
  • Installing Oozie
  • Configuring Oozie
  • Configuring Oozie with MySQL
  • Enabling the Oozie Examples

Operations for Oozie

  • Oozie Workflows
  • Submitting an Oozie Workflow
  • Creating an Oozie Workflow
  • Running an Oozie Workflow

The Purpose of Hue

  • Overviewing Hue

Setup for Hue

  • Setting up Hue
  • Installing Hue
  • Configuring Hue
  • Configuring Hue with MySQL

Operations for Hue

  • Hue in Action

Data Flow for the Hadoop Ecosystem

The World of Data

  • The World of Data

Flowing Data with Sqoop

  • Sqoop and Hive
  • Loading SQL Data Tables
  • Importing Data into Hive
  • Sqoop and Hive Exports
  • Exporting Data from Hive
  • Working with Date Data Types
  • Importing Datetime Stamps
  • Exporting Datetime Stamps

Flowing Data with Hive

  • Preprocessing Data
  • Cleaning with Functions
  • Cleaning with Regular Expressions

Administration for the Ecosystem

  • Selecting Additional Ecosystem Components
  • Best Practices for Pseudo-Mode
  • Admin Scripts in Action
  • Class Paths in Action
  • Config Files in Action



In the modern world, data is being generated at an exponential rate, and business data is no exception. Only a small percentage of it is structured data held in the rows and columns of databases. This proliferation requires a rethinking of traditional techniques for capture, storage, and processing. "Big data" describes data sets so large they cannot be managed with traditional database systems; it also names the collection of tools and techniques aimed at solving that problem. This learning path covers the current thinking and state of the art for managing and manipulating large data sets using those tools and techniques.

Who Should Attend


Technical personnel with a background in Linux, SQL, and programming who intend to join a Hadoop engineering team in roles such as Hadoop developer, data architect, or data engineer, or in roles related to technical project management, cluster operations, or data analysis.

Follow-On Courses


Apache Hadoop is an open-source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale from a single server to thousands of machines with a very high degree of fault tolerance. In this learning path you will learn about cluster planning, installation and administration, resource management, and monitoring and logging.

Course Delivery

This course is available in the following formats:


Train at your own pace with 24/7 access to courses that help you acquire must-have technology skills.
