Friday, May 3, 2024

Microsoft Fabric Machine Learning Tutorial - Part 2 - Data Validation with Great Expectations


In part 2 of this course, Barry Smart, Director of Data and AI, walks through a demo showing how you can use Microsoft Fabric to set up a "data contract" that establishes minimum data quality standards for data being processed by a data pipeline. He deliberately passes bad data into the pipeline to show how the process can be set up to "fail elegantly" by dropping the bad rows and continuing with only the good rows. Finally, he uses the new Teams pipeline activity in Fabric to show how you can send a message to the data stewards responsible for the data set, informing them that validation has failed and itemising, in the body of the message, the specific rows that failed and the validation errors that were generated.

The demo uses the popular Titanic data set to show features of the Data Engineering experience in Fabric, including Notebooks, Pipelines and the Lakehouse. It uses the popular Great Expectations Python package to establish the data contract and Microsoft's mssparkutils Python package to pass the exit value of the Notebook back to the Pipeline that triggered it (illustrative sketches of these steps follow the chapter list below). Barry begins the video by explaining the architecture adopted in the demo, including the Medallion Architecture and DataOps practices, and how these patterns have been applied to create a data product that provides Diagnostic Analytics of the Titanic data set. This forms part of an end-to-end demo of Microsoft Fabric that we will be providing as a series of videos over the coming weeks.

00:12 Overview of the architecture
00:36 The focus for this video is processing data to Silver
00:55 The DataOps principles of data validation and alerting will be applied
02:19 Tour of the artefacts in the Microsoft Fabric workspace
02:56 Open the "Validation Location" notebook and view the contents
03:30 Inspect the reference data that is going to be validated by the notebook
05:14 Overview of the key stages in the notebook
05:39 Set up the notebook, using %run to establish utility functions
06:21 Set up a "data contract" using the Great Expectations package
07:45 Load the data from the Bronze area of the lake
08:18 Validate the data by applying the "data contract" to it
08:36 Remove any bad records to create a clean data set
09:04 Write the clean data to the Lakehouse in Delta format
09:52 Exit the notebook using mssparkutils to pass back validation results
10:53 Lineage is used to discover the pipeline that triggers the notebook
11:01 Exploring the "Process to Silver" pipeline
11:35 An "If Condition" is configured to process the notebook exit value
11:56 A Teams pipeline activity is set up to notify users
12:51 Title and body of the Teams message are populated with dynamic information
13:08 A word of caution about exposing sensitive information
13:28 What's in the next episode?

#microsoftfabric #dataengineering #greatexpectations #course #tutorial
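
To give a feel for the "data contract" step, here is a minimal sketch using the classic Great Expectations API against a small, hypothetical sample of the Titanic data. The column names, expectations and API style are illustrative and may differ from the notebook shown in the video.

```python
import pandas as pd
import great_expectations as ge  # classic (pre-1.0) API

# Hypothetical sample standing in for the reference data loaded from Bronze.
raw = pd.DataFrame({
    "PassengerId": [1, 2, None, 4],
    "Age": [22.0, 38.0, 26.0, -5.0],
    "Embarked": ["S", "C", "Q", "X"],
})

# Wrap the DataFrame so expectations can be declared interactively.
dataset = ge.from_pandas(raw)

# The "data contract": minimum quality standards the data must meet.
dataset.expect_column_values_to_not_be_null("PassengerId")
dataset.expect_column_values_to_be_between("Age", min_value=0, max_value=120)
dataset.expect_column_values_to_be_in_set("Embarked", ["S", "C", "Q"])

# Validate and collect the indexes of rows that breached any expectation.
results = dataset.validate(result_format="COMPLETE")
bad_indexes = set()
for res in results["results"]:
    bad_indexes.update(res["result"].get("unexpected_index_list", []))

# "Fail elegantly": drop the bad rows and continue with only the good ones.
clean = raw.drop(index=list(bad_indexes))
print(f"Dropped {len(bad_indexes)} bad row(s), kept {len(clean)}")
```

Note that the unexpected_index_list detail is only populated when the result_format is "COMPLETE", which is what makes it possible to identify and drop the specific failing rows rather than failing the whole load.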
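
Writing the cleaned rows to the Lakehouse in Delta format is then a short step in a Fabric notebook, where a spark session is available by default. This continues from the sketch above; the table name is a hypothetical placeholder.

```python
# Convert the cleaned pandas frame to Spark and save it as a Delta table in the
# attached Lakehouse. "silver_location" is a hypothetical table name.
spark.createDataFrame(clean).write.format("delta").mode("overwrite").saveAsTable("silver_location")
```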
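
Finally, a sketch of how the validation outcome might be handed back to the calling pipeline with mssparkutils, continuing from the validation sketch above. The payload shape and field names are assumptions for illustration; the exit value is what the "If Condition" and Teams activities in the "Process to Silver" pipeline act on.

```python
import json
from notebookutils import mssparkutils  # available in Fabric Spark notebooks

# Hypothetical summary of the validation run; the field names are illustrative.
exit_payload = {
    "validation_passed": len(bad_indexes) == 0,
    "rows_dropped": len(bad_indexes),
    "failed_expectations": [
        res["expectation_config"]["expectation_type"]
        for res in results["results"]
        if not res["success"]
    ],
}

# The exit value surfaces in the calling pipeline (typically via
# activity('<notebook activity name>').output.result.exitValue), where the
# "If Condition" can branch on it and the Teams activity can include it
# in the message body.
mssparkutils.notebook.exit(json.dumps(exit_payload))
```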
