
JumpStart to Developing in Apache Spark | Spark Programs, RDDs, NoSQL, Spark SQL, Machine Learning & More

Gain core Spark skills to execute programs, work with databases, integrate machine learning, create streaming applications and more

GK# 9360

Course Overview


Apache Spark, a significant component of the Hadoop ecosystem, is a cluster computing engine used in Big Data. Built on top of Hadoop YARN and HDFS, it offers order-of-magnitude faster processing than MapReduce for many in-memory computing tasks. It can be programmed in Java, Scala, Python, and R (the favorite languages of data scientists), along with SQL-based front-ends.

With advanced libraries such as Mahout and MLlib for machine learning, GraphX or Neo4j for rich graph data processing, and access to other NoSQL data stores, rule engines, and other enterprise components, Spark is a linchpin in modern Big Data and data science computing.

The JumpStart to Developing in Spark course introduces developers to enterprise-grade Spark programming, interacting with the significant components mentioned above to craft complete data science solutions. This fast-paced, breadth-focused course presents topical overviews and big-picture interactions while giving students key hands-on experience with the technologies. The course is taught in Java, with some alternatives offered in Python; other variants of this course emphasize Scala or R.

This course can be further tailored to suit the agenda, labs, tools and topics that suit the needs of your team or organization.


What You'll Learn


This course provides a practical introduction to the umbrella of technologies at the leading edge of data science development, focused on Spark and related tools.

Working in a hands-on learning environment, students will learn:

  • The essentials of Spark architecture and applications
  • How to execute Spark Programs
  • How to create and manipulate both RDDs (Resilient Distributed Datasets) and UDFs (Unified DataFrames)
  • How to persist and restore data frames
  • Essential NoSQL access
  • How to integrate machine learning into Spark applications
  • How to use Spark Streaming and Kafka to create streaming applications

Outline


Virtual Classroom Live Outline

Session: Getting Started with Spark

 

Lesson: Overview of Spark

  • Hadoop Ecosystem
  • Hadoop YARN vs. Mesos
  • Spark vs. MapReduce
  • Spark: Lambda Architecture
  • Spark in the Enterprise Data Science Architecture

 

Lesson: Spark Component Overview

  • Spark Shell
  • RDDs: Resilient Distributed Datasets
  • Data Frames
  • Spark 2 Unified DataFrames
  • Spark Sessions
  • Functional Programming
  • Spark SQL
  • MLlib
  • Structured Streaming
  • Spark R
  • Spark and Python
  • Exercise: Hello, Spark
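
Spark's programming model is functional: data flows through map- and reduce-style operations. As a plain-Python sketch (no Spark cluster required), the shape of the canonical "Hello, Spark" word-count exercise looks like this; in real Spark the same pipeline would be written with `flatMap`, `map`, and `reduceByKey` on an RDD:

```python
from collections import Counter

# Plain-Python illustration of the classic Spark word-count pipeline
# (flatMap -> map -> reduceByKey), using stdlib tools to show the
# functional shape of the computation. Not the actual Spark API.
lines = ["hello spark", "hello world"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map + reduceByKey: count occurrences per word
counts = Counter(words)

print(counts["hello"])  # 2
```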

 

Lesson: RDDs: Resilient Distributed Datasets

  • Coding with RDDs
  • Transformations
  • Actions
  • Lazy Evaluation and Optimization
  • RDDs in MapReduce
  • Exercise: Working with RDDs
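
Lazy evaluation is the key idea in this lesson: Spark records transformations as a lineage and computes nothing until an action forces the pipeline to run. Python generators give a minimal stand-in for that behavior (an illustration of the RDD model, not the actual Spark API):

```python
# Pure-Python sketch of lazy transformations vs. eager actions.
data = range(1, 6)

# "Transformations": generators record the pipeline but compute nothing yet.
doubled = (x * 2 for x in data)          # like rdd.map(lambda x: x * 2)
evens   = (x for x in doubled if x > 4)  # like .filter(lambda x: x > 4)

# "Action": iterating finally forces the whole pipeline to execute.
result = list(evens)                     # like rdd.collect()
print(result)  # [6, 8, 10]
```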

 

Lesson: DataFrames

  • RDDs vs. DataFrames
  • Unified DataFrames (UDFs) in Spark 2.x
  • Partitioning
  • Exercise: Working with Unified DataFrames
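
Partitioning is what lets a DataFrame span a cluster: each row is assigned to a partition, typically by hashing a key, so related rows land together. A plain-Python sketch of hash partitioning (illustration only, not the Spark API):

```python
# Illustrative sketch of hash partitioning, the scheme Spark uses to
# distribute rows across executors.
NUM_PARTITIONS = 3

rows = [("alice", 1), ("bob", 2), ("carol", 3), ("dave", 4)]

partitions = {i: [] for i in range(NUM_PARTITIONS)}
for key, value in rows:
    # Each row lands in the partition given by hash(key) mod #partitions,
    # so rows with the same key always end up in the same partition.
    partitions[hash(key) % NUM_PARTITIONS].append((key, value))

# Every row lives in exactly one partition.
total = sum(len(p) for p in partitions.values())
print(total)  # 4
```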

 

Lesson: DataFrame Persistence

  • RDD Persistence
  • DataFrame and Unified DataFrame Persistence
  • Distributed Persistence
  • Exercise: Saving and Restoring DataFrames
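
The persist/restore round trip this lesson covers would be `df.write` and `spark.read` in real Spark; the shape of that workflow can be sketched with an in-memory CSV standing in for the storage layer (a plain-Python illustration, not the Spark API):

```python
import csv
import io

# Sketch of saving and restoring frame-like rows; an in-memory CSV
# stands in for distributed storage (illustration only).
rows = [{"name": "alice", "age": "34"}, {"name": "bob", "age": "29"}]

# "Save": serialize the rows.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "age"])
writer.writeheader()
writer.writerows(rows)

# "Restore": read them back and verify the round trip.
buf.seek(0)
restored = list(csv.DictReader(buf))
print(restored == rows)  # True
```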

 

Lesson: Accessing NOSQL Data

  • Ingesting data
  • Relational Databases and Sqoop
  • Interacting with Hive
  • Graph Data
  • Accessing Cassandra Data
  • Exercise: NoSQL Data Access

 

Lesson: Spark SQL

  • Spark SQL
  • SQL and DataFrames
  • Spark SQL and Hive
  • Spark SQL and JDBC
  • Exercise: Working with SparkSQL
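
Spark SQL lets you register a DataFrame as a temporary view and query it with SQL (`createOrReplaceTempView` plus `spark.sql`). Without a cluster at hand, the SQL-over-tables workflow can be sketched with the stdlib `sqlite3` module as a stand-in engine (illustration only, not Spark SQL itself):

```python
import sqlite3

# Stand-in for the Spark SQL workflow: load structured rows, then
# query them with SQL. sqlite3 plays the role of the SQL engine here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("alice", 34), ("bob", 29), ("carol", 41)])

# In Spark SQL this would be spark.sql("SELECT ... FROM people ...").
cursor = conn.execute("SELECT name FROM people WHERE age > 30 ORDER BY name")
names = [row[0] for row in cursor]
print(names)  # ['alice', 'carol']
```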

 

Lesson: Machine Learning

  • MLlib
  • Mahout
  • Exercise: Hello, MLlib
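
MLlib fits models over distributed data; the simplest member of that family, linear regression, can be shown in miniature with the closed-form least-squares solution for one feature (pure Python, illustration only, not the MLlib API):

```python
# Ordinary least squares for y = slope*x + intercept on a few points,
# the kind of model MLlib fits at cluster scale.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(slope, intercept)  # 2.0 1.0
```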

 

Lesson: Spark Streaming

  • Streaming Overview
  • Streams
  • Structured Streaming
  • Lambda Streaming
  • Spark and Kafka
  • Exercise: Hello, Spark Streaming
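
Structured Streaming treats an unbounded source as a series of micro-batches and keeps running state that each trigger updates incrementally. That model can be sketched in plain Python (an illustration, not the Spark Streaming API):

```python
from collections import Counter

# Micro-batch sketch: batches arrive from an unbounded source (here a
# generator), and running state is updated per batch, much like a
# streaming aggregation in update mode.
def micro_batches():
    yield ["click", "view"]
    yield ["click", "click"]
    yield ["view"]

running_counts = Counter()        # the "state" the streaming query keeps
for batch in micro_batches():     # each trigger processes one micro-batch
    running_counts.update(batch)  # incremental update of the aggregation

print(dict(running_counts))  # {'click': 3, 'view': 2}
```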

Labs


Virtual Classroom Live Labs

This “skills-centric” course is about 50% hands-on lab and 50% lecture, designed to train attendees in core Spark programming and data analytics skills, coupling the most current, effective techniques with the soundest industry practices. Throughout the course, students are led through a series of progressively advanced topics, where each topic consists of lecture, group discussion, comprehensive hands-on lab exercises, and lab review.

Prerequisites


Students should have basic skills in these areas:

  • Java Programming Fundamentals
  • Introduction to Python Programming
  • Introduction to SQL (Basic familiarity is needed, not in-depth SQL skills)

Who Should Attend


This intermediate-level course is geared for experienced developers and architects (with development experience) who seek proficiency in advanced, modern development skills for working with Apache Spark in an enterprise data environment.

Course Delivery

This course is available in the following formats:

Virtual Classroom Live

Experience expert-led online training from the convenience of your home, office or anywhere with an internet connection.

Duration: 3 days

Request this course in a different delivery format.