Spark / R Programming for Data Scientists (Workshop)

Spark skills workshop for experienced data scientists already working in R

GK# 9313

Course Overview


Spark is a highly optimized data science environment that runs on Hadoop YARN, with support for machine learning through MLlib and Mahout, SQL, DataFrames, and streaming. In this course, data scientists dive into the details of practical data science on the Spark platform, including real-world interaction with other systems in modern data science environments.

This course is intended for existing Data Scientists already fluent in data science techniques in other languages such as SAS and already comfortable with R. The course is presented in a "rolling lab" format: a continuous workshop of real-world data exploration built around real-world problems. Problems and opportunities are explored as the data suggests and as questions arise. "Lecture" material is provided only as necessary to explain the background of the approach being used at the moment. The times and ordering of the material are highly flexible and should be treated only as estimates. Student questions and requests will also significantly alter the direction of the workshop.



What You'll Learn


The objective of the course is to transition these data scientists to the R/Spark/Hadoop environment in a practical way, so that they become comfortable with the tools and machine learning libraries and can conduct the statistical and machine learning analyses they have already been performing in SAS or similar environments.


Virtual Classroom Live Outline

Course Overview (0.5 hr lecture)

  • Our Data and our problem set
  • Accessing the cluster, the data, and the tools
  • The Continuous Workshop approach
  • "Let's build a model together"
  • Focus on analysis, exploration, data munging, algorithms
  • Tooling and fundamentals as necessary to get the job done

Spark Overview (1 hr lecture, 2 hr lab)

  • Data Science: The State of the Art
  • Hadoop, YARN, and Spark
  • Architectural Overview
  • MLlib Overview
  • Accessing HDFS data
  • Lab Focus
  • Working with HDFS data
  • Distributed vs. Local Run Modes
  • Spark vs. Other tools (when is Spark the right tool for the job?)
  • Spark vs. SAS
  • Spark Languages (Java, R, Python, and Scala)
  • Hello, Spark

Spark Overview (0.75 hr lecture, 1 hr lab)

  • Spark Core
  • Spark SQL
  • Spark and Hive
  • Lab
  • MLlib
  • Spark Streaming
  • Spark API

DataFrames (0.75 hr lecture, 1 hr lab)

  • DataFrames and Resilient Distributed Datasets (RDDs)
  • Partitions
  • Adding variables to a DataFrame
  • DataFrame Types
  • DataFrame Operations
  • Dependent vs. Independent variables
  • Map/Reduce with DataFrames

Spark SQL (0.5 hr lecture, 1-2 hr lab)

  • Spark SQL Overview
  • Data stores: HDFS, Cassandra, HBase, Hive, and S3
  • Table Definitions
  • Queries

Spark MLlib (0.5 hr lecture, 3 hr+ lab)

  • MLlib overview
  • MLlib Algorithms Overview
  • Classification Algorithms
  • Regression Algorithms
  • Lab Focus
  • Brief Comparison to SAS
  • Here's your split, how to tune regression
  • Decision Trees and forests
  • Lab Focus
  • Brief Comparison to SAS
  • Stepwise approach to Decision Trees
  • Working with Exit Criteria
  • Recommendation with ALS
  • Clustering Algorithms
  • Lab Focus
  • Key Clustering Algorithms
  • Choosing Clustering Algorithms
  • Working with key algorithms
  • Machine Learning Pipelines
  • Linear Algebra (SVD, PCA)
  • Statistics in MLlib

Spark Streaming (0.25 hr lecture, 0 - 1 hr lab)

  • Streaming overview

Streaming with Kafka (0.25-0.5 hr lecture, 0 - 1 hr lab)

  • Kafka overview
  • Kafka and Spark Streaming

Data Flow with NiFi (0.25 hr lecture, 0 - 1 hr lab)

  • Apache NiFi overview
  • NiFi data flows with Spark/R

Cluster Mode (0.25 hr lecture, 0 - 0.5 hr lab)

  • Standalone Cluster
  • Masters and Workers

Spark - the Big Picture (0.5-1 hr lecture, 0 - 2 hr lab)

  • Spark in Real-Time and near-Real-Time Decision Support Systems
  • Spark in the Enterprise
  • Best Practices



Virtual Classroom Live Labs

This “skills-centric” course is about 50% hands-on lab and 50% lecture, designed to train attendees in core R programming and data analytics skills, coupling the most current, effective techniques with the soundest industry practices. Throughout the course students will be led through a series of progressively advanced topics, where each topic consists of lecture, group discussion, comprehensive hands-on lab exercises, and lab review.



Incoming students should have skills equivalent to the topics covered in the following prerequisite course, or should have recently attended it: R Essentials for Data Scientists (TTDS6680)

Who Should Attend


This course is intended for existing Data Scientists already fluent in data science techniques in other languages such as SAS and already comfortable with R. 

Course Delivery

This course is available in the following formats:

Virtual Classroom Live

Experience expert-led online training from the convenience of your home, office or anywhere with an internet connection.

Duration: 2 days