Course Outline
Introduction, Objectives, and Migration Strategy
- Course goals, alignment with participant profiles, and success criteria.
- Overview of high-level migration approaches and their associated risks.
- Setting up workspaces, repositories, and lab datasets.
Day 1 — Migration Fundamentals and Architecture
- Core Lakehouse concepts, an overview of Delta Lake, and Databricks architecture.
- Differences between SMP and MPP architectures and their implications for migration.
- Medallion (Bronze→Silver→Gold) design principles and Unity Catalog overview.
Day 1 Lab — Translating a Stored Procedure
- Hands-on migration of a sample stored procedure into a notebook.
- Mapping temporary tables and cursors to DataFrame transformations.
- Validation and comparison against original output.
Day 2 — Advanced Delta Lake & Incremental Loading
- ACID transactions, commit logs, versioning, and time travel capabilities.
- Auto Loader, MERGE INTO patterns, upserts, and schema evolution.
- OPTIMIZE, VACUUM, Z-ordering (ZORDER BY), partitioning, and storage tuning techniques.
Day 2 Lab — Incremental Ingestion & Optimization
- Implementing Auto Loader ingestion and MERGE workflows.
- Applying OPTIMIZE, Z-ordering (ZORDER BY), and VACUUM; validating results.
- Measuring improvements in read/write performance.
Day 3 — SQL in Databricks, Performance & Debugging
- Analytical SQL features: window functions, higher-order functions, and JSON/array handling.
- Reading the Spark UI (DAGs, stages, tasks, and shuffles) and diagnosing bottlenecks.
- Query tuning patterns: broadcast joins, hints, caching, and spill reduction.
Day 3 Lab — SQL Refactoring & Performance Tuning
- Refactoring a complex SQL process into optimized Spark SQL.
- Using Spark UI traces to identify and resolve skew and shuffle issues.
- Benchmarking before/after results and documenting tuning steps.
Day 4 — Tactical PySpark: Replacing Procedural Logic
- Spark execution model: driver, executors, lazy evaluation, and partitioning strategies.
- Transforming loops and cursors into vectorized DataFrame operations.
- Modularization, UDFs/pandas UDFs, widgets, and creating reusable libraries.
Day 4 Lab — Refactoring Procedural Scripts
- Refactoring a procedural ETL script into modular PySpark notebooks.
- Introducing parametrization, unit-style tests, and reusable functions.
- Conducting code reviews and applying best-practice checklists.
Day 5 — Orchestration, End-to-End Pipeline & Best Practices
- Databricks Workflows: job design, task dependencies, triggers, and error handling.
- Designing incremental Medallion pipelines with quality rules and schema validation.
- Integration with Git (GitHub/Azure DevOps), CI, and testing strategies for PySpark logic.
Day 5 Lab — Build a Complete End-to-End Pipeline
- Assembling a Bronze→Silver→Gold pipeline orchestrated with Workflows.
- Implementing logging, auditing, retries, and automated validations.
- Running the full pipeline, validating outputs, and preparing deployment notes.
Operationalization, Governance, and Production Readiness
- Best practices for Unity Catalog governance, lineage, and access controls.
- Cost management, cluster sizing, autoscaling, and job concurrency patterns.
- Deployment checklists, rollback strategies, and runbook creation.
Final Review, Knowledge Transfer, and Next Steps
- Participant presentations of migration work and lessons learned.
- Gap analysis, recommended follow-up activities, and training materials handoff.
- References, further learning paths, and support options.
Requirements
- A foundational understanding of data engineering concepts.
- Practical experience with SQL and stored procedures (Synapse / SQL Server).
- Familiarity with ETL orchestration concepts (for example, Azure Data Factory or a similar tool).
Audience
- Technology managers with a data engineering background.
- Data engineers looking to transition procedural OLAP logic to Lakehouse patterns.
- Platform engineers responsible for driving Databricks adoption.