

JumpStart to Developing in Apache Spark | Spark Programs, RDDs, NoSQL, Spark SQL, Machine Learning & More

Gain core Spark skills to execute programs, work with databases, integrate machine learning, create streaming applications and more

GK# 9360

Course Overview


Apache Spark, a significant component in the Hadoop ecosystem, is a cluster computing engine used in Big Data. Building on top of the Hadoop YARN and HDFS ecosystem, Spark offers order-of-magnitude faster processing than Map/Reduce for many in-memory computing tasks. It can be programmed in Java, Scala, Python, and R (the favorite languages of Data Scientists), along with SQL-based front ends.

With advanced libraries like Mahout and MLlib for Machine Learning, GraphX or Neo4j for rich graph data processing, as well as access to other NoSQL data stores, rule engines, and other enterprise components, Spark is a linchpin in modern Big Data and Data Science computing.

The JumpStart to Developing in Spark course introduces developers to enterprise-grade Spark programming, interacting with the significant components mentioned above to craft complete data science solutions. This is a fast-paced, breadth-focused course intended to provide topical overviews and "big-picture" interactions while giving students key hands-on experience with the technologies. This course is offered in Java, with some alternatives offered in Python. Other variants of this course emphasize Scala or R.

This course can be further tailored, adjusting the agenda, labs, tools, and topics to the needs of your team or organization.



What You'll Learn


This course provides practical grounding in the umbrella of technologies on the leading edge of data science development, focused on Spark and related tools.

Working in a hands-on learning environment, students will learn:

  • The essentials of Spark architecture and applications
  • How to execute Spark Programs
  • How to create and manipulate both RDDs (Resilient Distributed Datasets) and Unified DataFrames
  • How to persist and restore data frames
  • Essential NoSQL access
  • How to integrate machine learning into Spark applications
  • How to use Spark Streaming and Kafka to create streaming applications



Virtual Classroom Live Outline

Session: Getting Started with Spark


Lesson: Overview of Spark

  • Hadoop Ecosystem
  • Hadoop YARN vs. Mesos
  • Spark vs. Map/Reduce
  • Spark: Lambda Architecture
  • Spark in the Enterprise Data Science Architecture


Lesson: Spark Component Overview

  • Spark Shell
  • RDDs: Resilient Distributed Datasets
  • Data Frames
  • Spark 2 Unified DataFrames
  • Spark Sessions
  • Functional Programming
  • Spark SQL
  • MLlib
  • Structured Streaming
  • Spark R
  • Spark and Python
  • Exercise: Hello, Spark


Lesson: RDDs: Resilient Distributed Datasets

  • Coding with RDDs
  • Transformations
  • Actions
  • Lazy Evaluation and Optimization
  • RDDs in Map/Reduce
  • Exercise: Working with RDDs


Lesson: DataFrames

  • RDDs vs. DataFrames
  • Unified DataFrames in Spark 2.x
  • Partitioning
  • Exercise: Working with Unified DataFrames


Lesson: DataFrame Persistence

  • RDD Persistence
  • DataFrame and Unified DataFrame Persistence
  • Distributed Persistence
  • Exercise: Saving and Restoring DataFrames


Lesson: Accessing NoSQL Data

  • Ingesting data
  • Relational Databases and Sqoop
  • Interacting with Hive
  • Graph Data
  • Accessing Cassandra Data
  • Exercise: NoSQL Data Access


Lesson: Spark SQL

  • Spark SQL
  • SQL and DataFrames
  • Spark SQL and Hive
  • Spark SQL and JDBC
  • Exercise: Working with SparkSQL


Lesson: Machine Learning

  • MLlib
  • Mahout
  • Exercise: Hello, MLlib


Lesson: Spark Streaming

  • Streaming Overview
  • Streams
  • Structured Streaming
  • Lambda Streaming
  • Spark and Kafka
  • Exercise: Hello, Spark Streaming



Virtual Classroom Live Labs

This “skills-centric” course is about 50% hands-on lab and 50% lecture, designed to train attendees in core Spark programming and data analytics skills, coupling the most current, effective techniques with the soundest industry practices. Throughout the course, students will be led through a series of progressively advanced topics, where each topic consists of lecture, group discussion, comprehensive hands-on lab exercises, and lab review.



Prerequisites

Students should have basic skills in these areas:

  • Java Programming Fundamentals
  • Introduction to Python Programming
  • Introduction to SQL (Basic familiarity is needed, not in-depth SQL skills)

Who Should Attend


This intermediate-level course is geared for experienced Developers and Architects (with development experience) who seek to be proficient in advanced, modern development skills working with Apache Spark in an enterprise data environment.

Course Delivery

This course is available in the following formats:

Virtual Classroom Live

Experience expert-led online training from the convenience of your home, office or anywhere with an internet connection.

Duration: 3 days

Request this course in a different delivery format.