Data engineering with Azure Databricks


Data engineering with Azure Databricks is a free online course provided by Microsoft Learn, comprising 10-11 hours of material. The course is taught in English.

Overview
    • Module 1: Describe Azure Databricks
    • In this module, you will:

      • Understand the Azure Databricks platform
      • Create your own Azure Databricks workspace
      • Create a notebook inside your home folder in Databricks
      • Understand the fundamentals of Apache Spark notebooks
      • Create, or attach to, a Spark cluster
      • Identify the types of tasks well-suited to the unified analytics engine Apache Spark
    • Module 2: Spark architecture fundamentals
    • In this module, you will:

      • Understand the architecture of an Azure Databricks Spark cluster
      • Understand the architecture of a Spark job
    • Module 3: Read and write data in Azure Databricks
    • In this module, you will (see the sketch after this list):

      • Use Azure Databricks to read multiple file types, both with and without a schema.
      • Combine inputs from files and data stores, such as Azure SQL Database.
      • Transform and store that data for advanced analytics.
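
      A minimal PySpark sketch of the read and write patterns this module covers; the paths, table names, and credentials are hypothetical, and spark and dbutils are predefined in Databricks notebooks:

        from pyspark.sql.types import StructType, StructField, StringType, IntegerType

        # Read CSV and let Spark infer the schema (costs an extra pass over the data).
        df_inferred = (spark.read.option("header", "true")
                                 .option("inferSchema", "true")
                                 .csv("/mnt/data/people.csv"))

        # Read the same CSV with an explicit schema (no inference pass needed).
        schema = StructType([StructField("name", StringType(), True),
                             StructField("age", IntegerType(), True)])
        df_typed = spark.read.option("header", "true").schema(schema).csv("/mnt/data/people.csv")

        # Combine with a table from Azure SQL Database over JDBC.
        df_sql = (spark.read.format("jdbc")
                  .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
                  .option("dbtable", "dbo.customers")
                  .option("user", "admin")
                  .option("password", dbutils.secrets.get("my-scope", "sql-password"))
                  .load())

        # Transform and store the joined result as Parquet for downstream analytics.
        df_typed.join(df_sql, "name").write.mode("overwrite").parquet("/mnt/data/curated/people")
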
    • Module 4: Work with DataFrames in Azure Databricks
    • In this module, you will (see the sketch after this list):

      • Use the count() method to count rows in a DataFrame
      • Use the display() function to display a DataFrame in the Notebook
      • Cache a DataFrame for quicker operations if the data is needed a second time
      • Use limit() to display a small set of rows from a larger DataFrame
      • Use select() to select a subset of columns from a DataFrame
      • Use distinct() and dropDuplicates() to remove duplicate data
      • Use drop() to remove columns from a DataFrame
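
      A minimal PySpark sketch of these methods against a hypothetical dataset (spark and display are predefined in Databricks notebooks):

        df = spark.read.parquet("/mnt/data/curated/people")   # hypothetical path

        df.count()                        # count rows (an action: triggers execution)
        display(df)                       # render the DataFrame in the notebook
        df.cache()                        # keep the data in memory for repeated use
        display(df.limit(5))              # show a small sample of rows
        df.select("name", "age")          # project a subset of columns
        df.distinct()                     # remove fully duplicate rows
        df.dropDuplicates(["name"])       # remove rows duplicated on given columns
        df.drop("age")                    # remove a column
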
    • Module 5: Describe lazy evaluation and other performance features in Azure Databricks
    • In this module, you will (see the sketch after this list):

      • Describe the difference between eager and lazy execution
      • Define and identify transformations
      • Define and identify actions
      • Describe the fundamentals of how the Catalyst Optimizer works
      • Differentiate between wide and narrow transformations
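
      A short sketch of eager versus lazy execution, assuming the hypothetical dataset from above:

        df = spark.read.parquet("/mnt/data/curated/people")

        # Transformations are lazy: they only extend the logical plan.
        filtered = df.filter(df.age > 21)          # narrow transformation (no shuffle)
        grouped = filtered.groupBy("age").count()  # wide transformation (requires a shuffle)

        # Nothing has run yet. The Catalyst Optimizer rewrites the plan (for example,
        # pushing the filter toward the data source) before any job is executed.
        grouped.explain()   # inspect the optimized physical plan
        grouped.show()      # an action: triggers execution of the whole pipeline
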
    • Module 6: Work with DataFrames columns in Azure Databricks
    • In this module, you will (see the sketch after this list):

      • Learn the syntax for specifying column values for filtering and aggregations
      • Understand the use of the Column class
      • Sort and filter a DataFrame based on column values
      • Use collect() and take() to return records from a DataFrame to the driver of the cluster
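
      A minimal sketch of column expressions, again over hypothetical data:

        from pyspark.sql.functions import col, avg

        df = spark.read.parquet("/mnt/data/curated/people")

        # col("age") returns a Column object; comparisons build column expressions.
        adults = df.filter(col("age") >= 18).orderBy(col("age").desc())

        # Aggregations take column expressions too.
        display(adults.select(avg(col("age")).alias("avg_age")))

        # collect() returns ALL rows to the driver (use with care on large data);
        # take(n) returns only the first n rows.
        first_five = adults.take(5)
        all_rows = adults.collect()
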
    • Module 7: Work with DataFrames advanced methods in Azure Databricks
    • In this module, you will (see the sketch after this list):

      • Manipulate date and time values in Azure Databricks
      • Rename columns in Azure Databricks
      • Aggregate data in Azure Databricks DataFrames
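
      A minimal sketch of these methods over a hypothetical sales dataset:

        from pyspark.sql import functions as F

        df = spark.read.parquet("/mnt/data/curated/sales")   # hypothetical dataset

        df = (df.withColumn("order_date", F.to_date("order_ts"))   # timestamp -> date
                .withColumnRenamed("amt", "amount"))               # rename a column

        # Aggregate: total amount per year and month.
        monthly = (df.groupBy(F.year("order_date").alias("yr"),
                              F.month("order_date").alias("mo"))
                     .agg(F.sum("amount").alias("total")))
        display(monthly)
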
    • Module 8: Describe platform architecture, security, and data protection in Azure Databricks
    • In this module, you will (see the sketch after this list):

      • Learn the Azure Databricks platform architecture and how it is secured.
      • Use Azure Key Vault to store secrets used by Azure Databricks and other services.
      • Access Azure Storage with Key Vault-based secrets.
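
      A minimal sketch of reading a secret from a Key Vault-backed scope and using it to reach Azure Storage; the scope, key, account, and container names are hypothetical:

        # Assumes a Key Vault-backed secret scope "key-vault-secrets" exists in the workspace.
        storage_account = "mystorageaccount"
        key = dbutils.secrets.get(scope="key-vault-secrets", key="storage-account-key")

        # Authenticate Spark to the storage account with the secret.
        spark.conf.set(f"fs.azure.account.key.{storage_account}.blob.core.windows.net", key)

        df = spark.read.parquet(f"wasbs://mycontainer@{storage_account}.blob.core.windows.net/data")
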
    • Module 9: Build and query a Delta Lake
    • In this module, you will (see the sketch after this list):

      • Learn about the key features and use cases of Delta Lake.
      • Use Delta Lake to create, append, and upsert tables.
      • Perform optimizations in Delta Lake.
      • Compare different versions of a Delta table using Time Machine.
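
      A minimal sketch of the Delta Lake operations this module covers (the paths and the input DataFrames df, new_df, and updates_df are hypothetical):

        from delta.tables import DeltaTable

        # Create a Delta table, then append further records.
        df.write.format("delta").mode("overwrite").save("/mnt/delta/events")
        new_df.write.format("delta").mode("append").save("/mnt/delta/events")

        # Upsert (MERGE) with the DeltaTable API.
        target = DeltaTable.forPath(spark, "/mnt/delta/events")
        (target.alias("t")
               .merge(updates_df.alias("u"), "t.id = u.id")
               .whenMatchedUpdateAll()
               .whenNotMatchedInsertAll()
               .execute())

        # Optimize the file layout, then query an earlier version of the table.
        spark.sql("OPTIMIZE delta.`/mnt/delta/events` ZORDER BY (id)")
        old = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/events")
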
    • Module 10: Process streaming data with Azure Databricks structured streaming
    • In this module, you will (see the sketch after this list):

      • Learn the key features and uses of Structured Streaming.
      • Stream data from a file and write it out to a distributed file system.
      • Use sliding windows to aggregate over chunks of data rather than all data.
      • Apply watermarking to discard stale data that you do not have space to keep.
      • Connect to Event Hubs to read and write streams.
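
      A minimal Structured Streaming sketch reading files, aggregating over sliding windows with a watermark, and writing to a distributed file system; the paths and schema are hypothetical (an Event Hubs source would use the same API with the Event Hubs connector's format and options):

        from pyspark.sql.types import StructType, StructField, StringType, TimestampType
        from pyspark.sql.functions import window

        event_schema = StructType([StructField("id", StringType(), True),
                                   StructField("event_time", TimestampType(), True)])

        stream = (spark.readStream
                       .schema(event_schema)           # streaming file reads need an explicit schema
                       .json("/mnt/landing/events"))

        # 10-minute windows sliding every 5 minutes; events over 1 hour late are dropped.
        counts = (stream.withWatermark("event_time", "1 hour")
                        .groupBy(window("event_time", "10 minutes", "5 minutes"))
                        .count())

        (counts.writeStream
               .format("delta")
               .outputMode("append")                   # a window is emitted once it is finalized
               .option("checkpointLocation", "/mnt/checkpoints/events")
               .start("/mnt/delta/event_counts"))
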
    • Module 11: Describe Azure Databricks Delta Lake architecture
    • In this module, you will (see the sketch after this list):

      • Process batch and streaming data with Delta Lake.
      • Learn how Delta Lake architecture enables unified streaming and batch analytics with transactional guarantees within a data lake.
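
      A minimal sketch of the unified pattern: a stream writes to a Delta table while batch queries read the same table (the bronze/silver paths and column names are hypothetical):

        from pyspark.sql.functions import col

        # Bronze: raw events already landing in a Delta table; read them as a stream.
        raw = spark.readStream.format("delta").load("/mnt/delta/bronze")

        # Silver: refine the stream continuously into another Delta table.
        (raw.filter(col("id").isNotNull())
            .writeStream.format("delta")
            .option("checkpointLocation", "/mnt/checkpoints/silver")
            .start("/mnt/delta/silver"))

        # Gold: a plain batch query over the very table the stream is writing to;
        # Delta's transaction log keeps the two views consistent.
        gold = spark.read.format("delta").load("/mnt/delta/silver")
        display(gold.groupBy("category").count())
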
    • Module 12: Create production workloads on Azure Databricks with Azure Data Factory
    • In this module, you'll (see the sketch after this list):

      • Create an Azure Data Factory pipeline with a Databricks activity.
      • Execute a Databricks notebook with a parameter.
      • Retrieve and log a parameter passed back from the notebook.
      • Monitor your Data Factory pipeline.
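
      A minimal sketch of the notebook side of this flow; the widget name and logic are hypothetical, while the pipeline side is configured in the Data Factory UI:

        # Inside the notebook that the Data Factory Databricks activity runs:
        dbutils.widgets.text("input_path", "")              # parameter the pipeline passes in
        input_path = dbutils.widgets.get("input_path")

        row_count = spark.read.parquet(input_path).count()

        # Return a value to the pipeline; Data Factory surfaces it as runOutput
        # in the activity's output, where it can be retrieved and logged.
        dbutils.notebook.exit(str(row_count))
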
    • Module 13: Implement CI/CD with Azure DevOps
    • In this module, you will:

      • Learn about CI/CD and how it applies to data engineering.
      • Use Azure DevOps as a source code repository for Azure Databricks notebooks.
      • Create build and release pipelines in Azure DevOps to automatically deploy a notebook from a development to a production Azure Databricks workspace.
    • Module 14: Integrate Azure Databricks with Azure Synapse
    • In this module, you will:

      • Access Azure Synapse Analytics from Azure Databricks by using the SQL Data Warehouse connector (see the sketch below).
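
      A minimal sketch of the connector; it stages data through an Azure storage account, and the server, database, table, and storage names below are hypothetical:

        df = (spark.read
              .format("com.databricks.spark.sqldw")
              .option("url", "jdbc:sqlserver://mysynapse.database.windows.net:1433;database=mydw")
              .option("tempDir", "wasbs://tempdata@mystorageaccount.blob.core.windows.net/tmp")
              .option("forwardSparkAzureStorageCredentials", "true")
              .option("dbTable", "dbo.SalesFact")
              .load())

        # Writing back uses the same format: df.write.format("com.databricks.spark.sqldw")...
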
    • Module 15: Describe Azure Databricks best practices
    • In this module, you will learn best practices in the following categories:

      • Workspace administration
      • Security
      • Tools & integration
      • Databricks runtime
      • High availability and disaster recovery (HA/DR)
      • Clusters

Syllabus
    • Module 1: Describe Azure Databricks
      • Introduction
      • Explain Azure Databricks
      • Create an Azure Databricks workspace and cluster
      • Understand Azure Databricks Notebooks
      • Exercise: Work with Notebooks
      • Knowledge check
      • Summary
    • Module 2: Spark architecture fundamentals
      • Introduction
      • Understand the architecture of an Azure Databricks Spark cluster
      • Understand the architecture of a Spark job
      • Knowledge check
      • Summary
    • Module 3: Read and write data in Azure Databricks
      • Introduction
      • Read data in CSV format
      • Read data in JSON format
      • Read data in Parquet format
      • Read data stored in tables and views
      • Write data
      • Exercises: Read and write data
      • Knowledge check
      • Summary
    • Module 4: Work with DataFrames in Azure Databricks
      • Introduction
      • Describe a DataFrame
      • Use common DataFrame methods
      • Use the display function
      • Exercise: Distinct articles
      • Knowledge check
      • Summary
    • Module 5: Describe lazy evaluation and other performance features in Azure Databricks
      • Introduction
      • Describe the difference between eager and lazy execution
      • Describe the fundamentals of how the Catalyst Optimizer works
      • Define and identify actions and transformations
      • Describe performance enhancements enabled by shuffle operations and Tungsten
      • Knowledge check
      • Summary
    • Module 6: Work with DataFrames columns in Azure Databricks
      • Introduction
      • Describe the column class
      • Work with column expressions
      • Exercise: Washingtons and Marthas
      • Knowledge check
      • Summary
    • Module 7: Work with DataFrames advanced methods in Azure Databricks
      • Introduction
      • Perform date and time manipulation
      • Use aggregate functions
      • Exercise: Deduplication of data
      • Knowledge check
      • Summary
    • Module 8: Describe platform architecture, security, and data protection in Azure Databricks
      • Introduction
      • Describe the Azure Databricks platform architecture
      • Perform data protection
      • Describe Azure Key Vault and Databricks security scopes
      • Secure access with Azure IAM and authentication
      • Describe security
      • Exercise: Access Azure Storage with key vault-backed secrets
      • Knowledge check
      • Summary
    • Module 9: Build and query a Delta Lake
      • Introduction
      • Describe the open source Delta Lake
      • Exercise: Work with basic Delta Lake functionality
      • Describe how Azure Databricks manages Delta Lake
      • Exercise: Use the Delta Lake Time Machine and perform optimization
      • Knowledge check
      • Summary
    • Module 10: Process streaming data with Azure Databricks structured streaming
      • Introduction
      • Describe Azure Databricks structured streaming
      • Perform stream processing using structured streaming
      • Work with Time Windows
      • Process data from Event Hubs with structured streaming
      • Knowledge check
      • Summary
    • Module 11: Describe Azure Databricks Delta Lake architecture
      • Introduction
      • Describe bronze, silver, and gold architecture
      • Perform batch and stream processing
      • Knowledge check
      • Summary
    • Module 12: Create production workloads on Azure Databricks with Azure Data Factory
      • Introduction
      • Schedule Databricks jobs in a data factory pipeline
      • Pass parameters into and out of Databricks jobs in data factory
      • Knowledge check
      • Summary
    • Module 13: Implement CI/CD with Azure DevOps
      • Introduction
      • Describe CI/CD
      • Create a CI/CD process with Azure DevOps
      • Knowledge check
      • Summary
    • Module 14: Integrate Azure Databricks with Azure Synapse
      • Introduction
      • Integrate with Azure Synapse Analytics
      • Knowledge check
      • Summary
    • Module 15: Describe Azure Databricks best practices
      • Introduction
      • Understand workspace administration best practices
      • List security best practices
      • Describe tools and integration best practices
      • Explain Databricks runtime best practices
      • Understand cluster best practices
      • Knowledge check
      • Summary