Apache Spark Primer | Spark Essentials, Components, RDDs & UDFs (TTDS6500)

Data Science Primer Series: Working with Spark Components for Developers

GK# 8749

Course Overview


Apache Spark, a significant component of the Hadoop ecosystem, is a cluster computing engine used in Big Data. Building on top of Hadoop YARN and HDFS, it offers order-of-magnitude faster processing for many in-memory computing tasks compared to Map/Reduce. It can be programmed in Java, Scala, Python, and R - the favorite languages of Data Scientists - along with SQL-based front-ends.

The Spark Primer course introduces Scala, Python, or R developers to the world of Spark programming. Intended as an overview of topics for students to explore later, the course begins with a tour of the ecosystem and hands-on experience with platform essentials such as working with the Spark Shell and using RDDs and DataFrames. Afterwards, a wider-scoped introduction to NoSQL, Spark Streaming, Spark SQL, and Spark MLlib demonstrates how the pieces are put together in a larger application.



What You'll Learn


This course provides an introduction to the practical use of the umbrella of technologies on the leading edge of data science development, focused on Spark and related tools. Working in a hands-on learning environment led by our expert practitioner, students will learn:

  • The essentials of Spark architecture and applications
  • How to execute Spark Programs
  • How to create and manipulate both RDDs (Resilient Distributed Datasets) and UDFs (Unified DataFrames)
  • How Spark core components come together for complete applications


Virtual Classroom Live Outline

Session: Getting Started with Spark


Lesson: Overview of Spark

  • Hadoop Ecosystem
  • Hadoop YARN vs. Mesos
  • Spark vs. Map/Reduce
  • Spark: Lambda Architecture
  • Spark in the Enterprise Data Science Architecture


Lesson: Spark Component Overview

  • Spark Shell
  • RDDs: Resilient Distributed Datasets
  • Data Frames
  • Spark 2 Unified DataFrames
  • Spark Sessions
  • Functional Programming
  • Spark SQL
  • MLlib
  • Structured Streaming
  • Spark R
  • Spark and Python
  • Exercise: Hello, Spark


Lesson: RDDs: Resilient Distributed Datasets

  • Coding with RDDs
  • Transformations
  • Actions
  • Lazy Evaluation and Optimization
  • RDDs in Map/Reduce
  • Exercise: Working with RDDs


Lesson: DataFrames

  • RDDs vs. DataFrames
  • Unified DataFrames (UDFs) in Spark 2.x
  • Partitioning
  • Exercise: Working with Unified DataFrames


Lesson: Advanced Spark Overview

  • Spark SQL
  • Spark Streaming
  • Spark MLlib
  • Demo/Lab [Optional]: Advanced Spark Overview



Prerequisites

Students should have basic skills in these areas (only one language is necessary):

  • Python Programming Basics
  • R Essentials, or Equivalent
  • Basic familiarity with SQL (in-depth SQL skills are not required)

Who Should Attend


This is an intermediate-level course geared for Data Scientists, Software Engineers, Data Engineers, or developers who have basic experience working with Python, R, or Scala and who need to learn the essentials of Spark interaction. This course supports students using Scala, Python, or R. Incoming students should have prior experience with the basics of at least one of these languages and should know the basics of SQL.

Course Delivery

This course is available in the following formats:

Virtual Classroom Live

Experience expert-led online training from the convenience of your home, office or anywhere with an internet connection.

Duration: 1 day

Request this course in a different delivery format.