Practical Apache Spark for Data Pipelines

Executive Summary

This 21-hour course equips participants with practical skills to develop, manage, and optimize Apache Spark pipelines on GCP Dataproc Serverless through targeted lectures, hands-on labs, and a capstone project. By the end, attendees will understand Spark batch and streaming use-cases, master its execution model and core data structures, and build reusable, performance-tuned pipelines for diverse data workloads.

Description

This course gives participants practical skills to develop, manage, and optimize Apache Spark pipelines on GCP Dataproc Serverless. Through targeted lectures, hands-on labs, and a capstone project, attendees master Spark's architecture, core data structures, pipeline development, and performance tuning, preparing them to maintain and expand DPP data pipelines, create reusable code, and work in both batch and streaming contexts.

Objectives

  • Understand use-cases and benefits of Spark Batch and Structured Streaming.
  • Gain working knowledge of Spark's execution model to support pipelines.
  • Develop reusable code for batch and streaming contexts.
  • Build and optimize Spark pipelines on GCP Dataproc Serverless.
  • Master core data structures, operations, and performance tuning.

Duration

21 hours of intensive training with live instruction delivered over three to five days to accommodate varied scheduling needs.

Course Outline

Spark Overview
  • Introduction to Apache Spark and its ecosystem
  • Spark Fundamentals Overview
  • Pipeline Development Overview
  • Advanced Spark and Optimization Overview
Spark Architecture and Use-Cases
  • Spark topology: master, driver, worker nodes, executors
  • Use-cases for Batch and Structured Streaming
  • Spark's role in data engineering
Core Data Structures
  • DataFrames and Spark SQL basics
  • Overview of Datasets and RDDs
  • Core operations: filtering, aggregations, joins
Hands-On: DataFrame Processing
  • Load a CSV dataset into a DataFrame
  • Apply transformations
  • Query with Spark SQL (see the sketch below)
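
For reference, a minimal PySpark sketch of this lab's workflow; the file path and column names are hypothetical placeholders for the lab dataset:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dataframe-lab").getOrCreate()

# Load a CSV dataset into a DataFrame, inferring the schema.
df = spark.read.csv("data/events.csv", header=True, inferSchema=True)

# Apply transformations: filter rows, then aggregate per category.
by_category = (
    df.filter(F.col("amount") > 0)
      .groupBy("category")
      .agg(F.count("*").alias("events"), F.avg("amount").alias("avg_amount"))
)
by_category.show()

# Query with Spark SQL via a temporary view.
df.createOrReplaceTempView("events")
spark.sql(
    "SELECT category, COUNT(*) AS events "
    "FROM events GROUP BY category ORDER BY events DESC LIMIT 10"
).show()
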
Spark Execution Model
  • Partitioning
  • Lazy Execution (see the sketch after this list)
  • Fault Tolerance
  • Checkpointing
  • Serialization
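
A small sketch of how these concepts surface in code, assuming a local SparkSession: transformations only build a plan, and nothing executes until an action runs.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("execution-model").getOrCreate()

df = spark.range(1_000_000)                     # transformation: builds a plan only
doubled = df.withColumn("x2", F.col("id") * 2)  # still lazy, no cluster work yet
wider = doubled.repartition(8)                  # requests 8 partitions (adds a shuffle)

print(wider.rdd.getNumPartitions())  # inspect partitioning without running the full job
print(wider.count())                 # action: triggers actual execution
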
Batch and Streaming Pipelines
  • Designing Batch Pipelines
  • Structured Streaming Fundamentals
  • Building Reusable Code Components
Hands-On: Batch & Streaming Pipelines
  • Create a batch pipeline for a log dataset, including a reusable data cleaning function
  • Build a streaming pipeline for a simulated real-time dataset (e.g., sensor data); see the sketch below
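
A sketch of the lab's overall shape, assuming Spark's built-in rate source as the simulated sensor feed; the clean() helper, paths, and derived column are hypothetical:

from pyspark.sql import SparkSession, DataFrame, functions as F

spark = SparkSession.builder.appName("pipelines-lab").getOrCreate()

def clean(df: DataFrame) -> DataFrame:
    """Reusable cleaning step shared by the batch and streaming pipelines."""
    return df.dropna().dropDuplicates()  # on a stream, dedup keeps state; fine for a short demo

# Batch pipeline: read logs, clean them, write the result out.
logs = spark.read.json("data/logs/")
clean(logs).write.mode("overwrite").parquet("out/logs_clean/")

# Streaming pipeline: the rate source emits (timestamp, value) rows,
# standing in for real-time sensor readings.
sensors = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
query = (
    clean(sensors)
    .withColumn("reading", F.col("value") % 100)  # hypothetical derived field
    .writeStream.format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination(30)  # run briefly for the demo
query.stop()
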
Advanced Features
  • Broadcast Variables
  • Accumulators (see the sketch after this list)
  • Serialization Challenges
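
A minimal sketch of broadcast variables and accumulators; the lookup table and input codes are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-vars").getOrCreate()
sc = spark.sparkContext

# Broadcast: ship a read-only lookup table to each executor once,
# instead of serializing it into every task closure.
countries = sc.broadcast({"US": "United States", "DE": "Germany"})

# Accumulator: executors add to it; only the driver reads the total.
unknown = sc.accumulator(0)

def resolve(code):
    name = countries.value.get(code)
    if name is None:
        unknown.add(1)  # count codes missing from the lookup
        return "unknown"
    return name

print(sc.parallelize(["US", "DE", "FR", "US"]).map(resolve).collect())
print("unmatched codes:", unknown.value)  # read after the action has run
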
Performance Tuning
  • Resource management: memory, CPU, partitioning
  • Optimization: caching, shuffle reduction (see the sketch below)
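
A sketch of two common tuning moves, caching a reused DataFrame and narrowing the shuffle; the partition count and data volume are illustrative, not recommendations:

from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("tuning")
    # Fewer shuffle partitions than the default of 200 suits modest data volumes.
    .config("spark.sql.shuffle.partitions", "64")
    .getOrCreate()
)

df = spark.range(10_000_000).withColumn("bucket", F.col("id") % 100)

# Cache a DataFrame that several downstream queries reuse,
# so it is computed once rather than once per action.
df.cache()
df.count()  # materialize the cache

df.groupBy("bucket").agg(F.avg("id").alias("avg_id")).show(5)

df.unpersist()  # release executor memory when finished
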
Pipeline Optimization Capstone
  • Optimize a batch or streaming pipeline
  • Utilize reusable code components
Case Study and Wrap-up
  • Discuss real-world Spark applications
  • Review takeaways

Prerequisites

  • Familiarity with Python (PySpark)
  • Basic data processing knowledge
  • Access to a GCP account with Dataproc Serverless configured (provided if needed)

Training Materials

Students receive comprehensive courseware, including slides, code samples, and lab guides with pre-configured datasets.