Practical Apache Spark for Data Pipelines

Executive Summary

This 21-hour course equips participants with practical skills to develop, manage, and optimize Apache Spark pipelines on GCP Dataproc Serverless through targeted lectures, hands-on labs, and a capstone project. By the end, attendees will understand Spark batch and streaming use-cases, master its execution model and core data structures, and build reusable, performance-tuned pipelines for diverse data workloads.

Description

This course gives participants practical skills to develop, manage, and optimize Apache Spark pipelines on GCP Dataproc Serverless. Through targeted lectures, hands-on labs, and a capstone project, attendees master Spark's architecture, core data structures, pipeline development, and performance tuning, preparing them to maintain and expand DPP data pipelines, create reusable code, and work in both batch and streaming contexts.

Objectives

  • Understand use-cases and benefits of Spark Batch and Structured Streaming.
  • Gain working knowledge of Spark's execution model to support pipelines.
  • Develop reusable code for batch and streaming contexts.
  • Build and optimize Spark pipelines on GCP Dataproc Serverless.
  • Master core data structures, operations, and performance tuning.

Duration

21 hours of intensive training with live instruction delivered over three to five days to accommodate varied scheduling needs.

Course Outline

Spark Overview
  • Introduction to Apache Spark and its ecosystem
  • Spark Fundamentals Overview
  • Pipeline Development Overview
  • Advanced Spark and Optimization Overview
Spark Architecture and Use-Cases
  • Spark topology: master, driver, worker nodes, executors
  • Use-cases for Batch and Structured Streaming
  • Spark's role in data engineering
Core Data Structures
  • DataFrames and Spark SQL basics
  • Overview of Datasets and RDDs
  • Core operations: filtering, aggregations, joins
Hands-On: DataFrame Processing
  • Load a CSV dataset into a DataFrame
  • Apply transformations
  • Query with Spark SQL (see the sketch below)
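
For reference, a minimal PySpark sketch of this lab's workflow; the file path and column names are hypothetical placeholders for the lab dataset:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dataframe-lab").getOrCreate()

# Load a CSV dataset into a DataFrame, inferring the schema.
df = spark.read.csv("data/events.csv", header=True, inferSchema=True)

# Apply transformations: filter rows, then aggregate per category.
by_category = (
    df.filter(F.col("amount") > 0)
      .groupBy("category")
      .agg(F.count("*").alias("events"), F.avg("amount").alias("avg_amount"))
)
by_category.show()

# Query with Spark SQL via a temporary view.
df.createOrReplaceTempView("events")
spark.sql(
    "SELECT category, COUNT(*) AS events "
    "FROM events GROUP BY category ORDER BY events DESC LIMIT 10"
).show()
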
Spark Execution Model
  • Partitioning
  • Lazy Execution (see the sketch after this list)
  • Fault Tolerance
  • Checkpointing
  • Serialization
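
A small sketch of how these concepts surface in code, assuming a local SparkSession: transformations only build a plan, and nothing executes until an action runs.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("execution-model").getOrCreate()

df = spark.range(1_000_000)                     # transformation: builds a plan only
doubled = df.withColumn("x2", F.col("id") * 2)  # still lazy, no cluster work yet
wider = doubled.repartition(8)                  # requests 8 partitions (adds a shuffle)

print(wider.rdd.getNumPartitions())  # inspect partitioning without running the full job
print(wider.count())                 # action: triggers actual execution
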
Batch and Streaming Pipelines
  • Designing Batch Pipelines
  • Structured Streaming Fundamentals
  • Building Reusable Code Components
Hands-On: Batch & Streaming Pipelines
  • Create a batch pipeline for a log dataset, including a reusable data cleaning function
  • Build a streaming pipeline for a simulated real-time dataset (e.g., sensor data); see the sketch below
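
A sketch of the lab's overall shape, assuming Spark's built-in rate source as the simulated sensor feed; the clean() helper, paths, and derived column are hypothetical:

from pyspark.sql import SparkSession, DataFrame, functions as F

spark = SparkSession.builder.appName("pipelines-lab").getOrCreate()

def clean(df: DataFrame) -> DataFrame:
    """Reusable cleaning step shared by the batch and streaming pipelines."""
    return df.dropna().dropDuplicates()  # on a stream, dedup keeps state; fine for a short demo

# Batch pipeline: read logs, clean them, write the result out.
logs = spark.read.json("data/logs/")
clean(logs).write.mode("overwrite").parquet("out/logs_clean/")

# Streaming pipeline: the rate source emits (timestamp, value) rows,
# standing in for real-time sensor readings.
sensors = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
query = (
    clean(sensors)
    .withColumn("reading", F.col("value") % 100)  # hypothetical derived field
    .writeStream.format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination(30)  # run briefly for the demo
query.stop()
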
Advanced Features
  • Broadcast Variables
  • Accumulators (see the sketch after this list)
  • Serialization Challenges
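
A minimal sketch of broadcast variables and accumulators; the lookup table and input codes are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-vars").getOrCreate()
sc = spark.sparkContext

# Broadcast: ship a read-only lookup table to each executor once,
# instead of serializing it into every task closure.
countries = sc.broadcast({"US": "United States", "DE": "Germany"})

# Accumulator: executors add to it; only the driver reads the total.
unknown = sc.accumulator(0)

def resolve(code):
    name = countries.value.get(code)
    if name is None:
        unknown.add(1)  # count codes missing from the lookup
        return "unknown"
    return name

print(sc.parallelize(["US", "DE", "FR", "US"]).map(resolve).collect())
print("unmatched codes:", unknown.value)  # read after the action has run
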
Performance Tuning
  • Resource management: memory, CPU, partitioning
  • Optimization: caching, shuffle reduction (see the sketch below)
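
A sketch of two common tuning moves, caching a reused DataFrame and narrowing the shuffle; the partition count and data volume are illustrative, not recommendations:

from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("tuning")
    # Fewer shuffle partitions than the default of 200 suits modest data volumes.
    .config("spark.sql.shuffle.partitions", "64")
    .getOrCreate()
)

df = spark.range(10_000_000).withColumn("bucket", F.col("id") % 100)

# Cache a DataFrame that several downstream queries reuse,
# so it is computed once rather than once per action.
df.cache()
df.count()  # materialize the cache

df.groupBy("bucket").agg(F.avg("id").alias("avg_id")).show(5)

df.unpersist()  # release executor memory when finished
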
Pipeline Optimization Capstone
  • Optimize a batch or streaming pipeline
  • Utilize reusable code components
Case Study and Wrap-up
  • Discuss real-world Spark applications
  • Review takeaways

Prerequisites

  • Familiarity with Python (PySpark)
  • Basic data processing knowledge
  • Access to a GCP account with Dataproc Serverless configured (provided if needed)

Training Materials

Students receive comprehensive courseware, including slides, code samples, and lab guides with pre-configured datasets.