Apache Spark for Data Scientists

Gain core Spark skills to execute programs, work with databases, integrate machine learning, create streaming applications, and more.

GK# 9400


Course Overview


Apache Spark is a powerful, open-source processing engine for data in a Hadoop cluster, optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex, iterative algorithms, enabling applications to run up to 100x faster than traditional Hadoop MapReduce programs. With Spark, you can write sophisticated parallel applications that support faster decisions, better decisions, and real-time actions across a wide variety of use cases, architectures, and industries.

Apache Spark for Data Scientists is a three-day, hands-on course geared toward technical business professionals who wish to solve real-world, data-related problems using Apache Spark. This course explores using Apache Spark for common data-related activities.

Although the topics outlined in the agenda align with similar Spark course descriptions, the lab treatment and focus in this course are geared toward the data science aspects of Spark and related tools. Students who want a more developer-oriented edition of this course should consider the three-day TT6503 Working with Apache Spark course, which aligns in subject coverage but is geared toward developers instead of data scientists.


Virtual Classroom Live Outline

Spark Overview

  • Data Science: The State of the Art
  • Hadoop, YARN, and Spark
  • Architectural Overview
  • Spark and Storm
  • MLlib and Mahout
  • Distributed vs. Local Run Modes
  • Hello, Spark


Spark Overview

  • Spark Core
  • Spark SQL
  • Spark and Hive
  • MLlib
  • Mahout
  • Spark Streaming
  • Spark API



  • DataFrames and Resilient Distributed Datasets (RDDs)
  • Partitions
  • DataFrame Types
  • DataFrame Operations
  • Map/Reduce with DataFrames


Spark SQL

  • Spark SQL Overview
  • Data stores: HDFS, Cassandra, HBase, Hive, and S3
  • Table Definitions
  • ETL in Spark
  • Queries
  • Graph computation


Performance and Tuning

  • Broadcast variables
  • Accumulators
  • Memory Management


Cluster Mode

  • Standalone Cluster
  • Masters and Workers
  • Configurations
  • Working with large data sets


Virtual Classroom Live Labs

This “skills-centric” course is about 50% hands-on lab and 50% lecture, designed to train attendees in core Spark and data analytics skills, coupling the most current, effective techniques with the soundest industry practices. Throughout the course, students are led through a series of progressively advanced topics, each consisting of lecture, group discussion, comprehensive hands-on lab exercises, and lab review.

Who Should Attend


This is an introductory-level and beyond course. Typical attendees include systems administrators, testers, and those in technical data-related roles who need to learn to use Spark for data analysis or data processing.

Attending students should have the following background:

  • Introduction to Java Programming (at least exposure to basic Java syntax)
  • Introduction to SQL (familiarity with SQL basics)
  • Basic knowledge of Statistics and Probability
  • Data Science background

Course Delivery

This course is available in the following formats:

Virtual Classroom Live

Experience expert-led online training from the convenience of your home, office or anywhere with an internet connection.

Duration: 3 days
