Skip to content

Latest commit

 

History

History
419 lines (313 loc) · 32.2 KB

File metadata and controls

419 lines (313 loc) · 32.2 KB

DataOps - Parking Sensor Demo

The sample demonstrate how DevOps principles can be applied end to end Data Pipeline Solution built according to the Modern Data Warehouse (MDW) pattern.

Contents


Solution Overview

The solution pulls near realtime Melbourne Parking Sensor data from a publicly available REST api endpoint and saves this to Azure Data Lake Gen2. It then validates, cleanses, and transforms the data to a known schema using Azure Databricks. A second Azure Databricks job then transforms these into a Star Schema which are then loaded into Azure Synapse Analytics (formerly SQLDW) using Polybase. The entire pipeline is orchestrated with Azure Data Factory.

Architecture

The following shows the overall architecture of the solution.

Architecture

Sample PowerBI report

PowerBI report

Continuous Integration and Continuous Delivery (CI/CD)

The following shows the overall CI/CD process end to end.

CI/CD

See here for details.

Technologies used

It makes use of the following azure services:

For a detailed walk-through of the solution and key concepts, watch the following video recording:

IMAGE ALT TEXT HERE

Key Learnings

The following summarizes key learnings and best practices demonstrated by this sample solution:

1. Use Data Tiering in your Data Lake

  • Generally, you want to divide your data lake into three major areas which contain your Bronze, Silver and Gold datasets.
    1. Bronze - This is a landing area for your raw datasets with no to minimal data transformations applied, and therefore are optimized for writes / ingestion. Treat these datasets as an immutable, append only store.
    2. Silver - These are cleansed, semi-processed datasets. These conform to a known schema and predefined data invariants and might have further data augmentation applied. These are typically used by Data Scientists.
    3. Gold - These are highly processed, highly read-optimized datasets primarily for consumption of business users. Typically, these are structured in your standard Fact and Dimension tables.

2. Validate data early in your pipeline

  • Add data validation between the Bronze and Silver datasets. By validating early in your pipeline, you can ensure all succeeding datasets conform to a specific schema and known data invariants. This also can potentially prevent data pipeline failures in cases of unexpected changes to the input data.
  • Data that does not pass this validation stage can be rerouted to a Malformed Record store for diagnostic purpose.
  • It may be tempting to add validation prior to landing in the Bronze area of your data lake. This is generally not recommended. Bronze datasets are there to ensure you have as close of a copy of the source system data. This can used to replay the data pipeline for both testing (ei. testing data validation logic) and data recovery purposes (ei. data corruption is introduced due to a bug in the data transformation code and thus pipeline needs to be replayed).

3. Make your data pipelines replayable and idempotent

  • Silver and Gold datasets can get corrupted due to a number of reasons such as unintended bugs, unexpected input data changes, and more. By making data pipelines replayable and idempotent, you can recover from this state through deployment of code fix and replaying the data pipelines.
  • Idempotency also ensures data-duplication is mitigated when replaying your data pipelines.

4. Ensure data transformation code is testable

  • Abstracting away data transformation code from data access code is key to ensuring unit tests can be written against data transformation logic. An example of this is moving transformation code from notebooks into packages.
  • While it is possible to run tests against notebooks, by shifting tests left you increase developer productivity by increasing the speed of the feedback cycle.

5. Have a CI/CD pipeline

  • This means including all artifacts needed to build the data pipeline from scratch in source control. This includes infrastructure-as-code artifacts, database objects (schema definitions, functions, stored procedures, etc), reference/application data, data pipeline definitions, and data validation and transformation logic.
  • There should also be a safe, repeatable process to move changes through dev, test and finally production.

6. Secure and centralize configuration

  • Maintain a central, secure location for sensitive configuration such as database connection strings that can be access by the appropriate services within the specific environment.
  • Any example of this is securing secrets in KeyVault per environment, then having the relevant services query KeyVault for the configuration.

7. Monitor infrastructure, pipelines and data

  • A proper monitoring solution should be in-place to ensure failures are identified, diagnosed and addressed in a timely manner. Aside from the base infrastructure and pipeline runs, data quality should also be monitored. A common area that should have data monitoring is the malformed record store.
  • As an example this repository showcases how to use open source framework Great Expectations to define, measure and report data quality metrics at different stages of the data pipeline. Captured Data Quality metrics are reported to Azure Monitor for further visualizing and alerting. Take a look at sample Data Quality report generated with Azure Monitor workbook. Great Expectations can be configured to generate HTML reports and host directly as static site on Azure Blob Storage. Read more on How to host and share Data Docs on Azure Blob Storage.

Key Concepts

Build and Release Pipeline

The Build and Release Pipelines definitions can be found here.

Environments

  1. Sandbox and Dev- the DEV resource group is used by developers to build and test their solutions. It contains two logical environments - (1) a Sandbox environment per developer so each developer can make and test their changes in isolation prior committing to main, and (2) a shared Dev environment for integrating changes from the entire development team. "Isolated" sandbox environment are accomplish through a number of practices depending on the Azure Service.
    • Databricks - developers use their dedicated Workspace folder to author and save notebooks. Developers can choose to spin up their own dedicated clusters or share a High-concurrency cluster.
    • DataLake Gen2 - a "sandbox" file system is created. Each developer creates their own folder within this Sandbox filesystem.
    • AzureSQL or SQLDW - A transient database (restored from DEV) is spun up per developer on demand.
    • Data Factory - git integration allows them to make changes to their own branches and debug runs independently.
  2. Stage - the STG resource group is used to test deployments prior to going to production in a production-like environment. Integration tests are run in this environment.
  3. Production - the PROD resource group is the final Production environment.

Build and Release Sequence

There are eight numbered orange boxes describing the sequence from sandbox development to target environments:

CI/CD

  1. Developers develop in their own Sandbox environments within the DEV resource group and commit changes into their own short-lived git branches. (i.e. <developer_name>/<branch_name>)
  2. When changes are complete, developers raise a PR to main for review. This automatically kicks-off the PR validation pipeline which runs the unit tests, linting and DACPAC builds.
  3. On PR completion, the commit to main will trigger a Build pipeline -- publishing all necessary Build Artifacts.
  4. The completion of a successful Build pipeline will trigger the first stage of the Release pipeline. This deploys the publish build artifacts into the DEV environment, with the exception of Azure Data Factory*.
  5. Developers perform a Manual Publish to the DEV ADF from the collaboration branch (main). This updates the ARM templates in in the adf_publish branch.
  6. On the successful completion of the first stage, this triggers an Manual Approval Gate**. On Approval, the release pipeline continues with the second stage -- deploying changes to the Staging environment.
  7. Integration tests are run to test changes in the Staging environment.
  8. ***On the successful completion of the second stage, this triggers a second Manual Approval Gate. On Approval, the release pipeline continues with the third stage -- deploying changes to the Production environment.

Notes:

  • This is a simplified Build and Release process for demo purposes based on Trunk-based development practices.
  • *A manual publish is required -- currently, this cannot be automated.
  • **The solution deployment script does not configure Approval Gates at the moment. See Known Issues, Limitations and Workarounds
  • ***Many organization use dedicated Release Branches (including Microsoft) instead of deploying from main. See Release Flow.

More resources:

Testing

  • Unit Testing - These test small pieces of functionality within your code. Data transformation code should have unit tests and can be accomplished by abstracting Data Transformation logic into packages. Unit tests along with linting are automatically run when a PR is raised to main.

    • See here for unit tests within the solution and the corresponding QA Pipeline that executes the unit tests on every PR.
  • Integration Testing - These are run to ensure integration points of the solution function as expected. In this demo solution, an actual Data Factory Pipeline run is automatically triggered and its output verified as part of the Release to the Staging Environment.

More resources:

Observability / Monitoring

Observability-as-Code - Few key components of Observability and Monitoring are deployed and configured through Observability-as-Code at the time on Azure resources deployment. This includes log analytics workspace to collect monitoring data from key resources, central Azure dashboard to monitor key metrics and alerts to monitor the data pipelines. To learn more on monitoring specific service read below.

Databricks

Data Factory

How to use the sample

Prerequisites

  1. Github account
  2. Azure Account
  3. Azure DevOps Project

Software pre-requisites if you don't use dev container

Software pre-requisites if you use dev container

Setup and Deployment

IMPORTANT NOTE: As with all Azure Deployments, this will incur associated costs. Remember to teardown all related resources after use to avoid unnecessary costs. See here for list of deployed resources. NOTE: This deployment was tested using WSL 2 (Ubuntu 18.04) and Debian GNU/Linux 9.9 (stretch)

  1. Initial Setup

    • Ensure that:

      • You are logged in to the Azure CLI. To login, run

        az login
      • Azure CLI is targeting the Azure Subscription you want to deploy the resources to. To set target Azure Subscription, run

        az account set -s <AZURE_SUBSCRIPTION_ID>
      • Azure CLI is targeting the Azure DevOps organization and project you want to deploy the pipelines to. To set target Azure DevOps project, run

        az devops configure --defaults organization=https://dev.azure.com/<MY_ORG>/ project=<MY_PROJECT>
    • Fork this repository into a new Github repo.

    • Set the following required environment variables:

      • GITHUB_REPO - Name of your forked github repo in this form <my_github_handle>/<repo>. (ei. "devlace/mdw-dataops-import")
      • GITHUB_PAT_TOKEN - a Github PAT token. Generate them here. This requires "repo" scope.

      Optionally, set the following environment variables:

      • AZURE_LOCATION - Azure location to deploy resources. Default: westus.
      • AZURE_SUBSCRIPTION_ID - Azure subscription id to use to deploy resources. Default: default azure subscription. To see your default, run az account list.
      • DEPLOYMENT_ID - string appended to all resource names. This is to ensure uniqueness of azure resource names. Default: random five character string.
      • AZDO_PIPELINES_BRANCH_NAME - git branch where Azure DevOps pipelines definitions are retrieved from. Default: main.
      • AZURESQL_SERVER_PASSWORD - Password of the SQL Server instance. Default: random string.
    • If you are using dev container, follow the below steps:

      • Rename .envtemplate under ".devcontainer" folder to devcontainer.env and update the values as mentioned above instead of setting those as environment variables.
      • Open the project inside the vscode dev container (see details here).
    • To further customize the solution, set parameters in arm.parameters files located in the infrastructure folder.

      • To enable Observability and Monitoring components through code(Observability-as-code), please set enable_monitoring parameter to true in arm.parameters files located in the infrastructure folder. This will deploy log analytics workspace to collect monitoring data from key resources, setup an Azure dashboards to monitor key metrics and configure alerts for ADF pipelines.
  2. Deploy Azure resources

    • cd into the e2e_samples/parking_sensors folder of the repo.

    • Configure your default AzDo Organization and Project

      az devops configure --defaults organization="$AZDO_ORGANIZATION_URL" project="$AZDO_PROJECT"
    • Run ./deploy.sh.

      • This may take around ~30mins or more to run end to end. So grab yourself a cup of coffee... ☕
      • After a successful deployment, you will find .env.{environment_name} files containing essential configuration information per environment. See here for list of deployed resources.
      • Note that if you are using dev container, you would run the same script but inside the dev container terminal.
    • As part of the deployment script, this updated the Azure DevOps Release Pipeline YAML definition to point to your Github repository. Commit and push up these changes.

      • This will trigger a Build and Release which will fail due to a lacking adf_publish branch -- this is expected. This branch will be created once you've setup git integration with your DEV Data Factory and publish a change.
  3. Setup ADF git integration in DEV Data Factory

    IMPORTANT NOTE: Only the DEV Data Factory should be setup with Git integration. Do not setup git integration in the STG and PROD Data Factories.

    • In the Azure Portal, navigate to the Data Factory in the DEV environment and launch the Data Factory portal.
    • On the landing page, select "Set up code repository". For more information, see here.
    • Fill in the repository settings with the following:
      • Repository type: Github
      • Use GitHub Enterprise Server: Unselected, unless you are using GitHub Enterprise Server
      • Github Account: your_Github_account
      • Git repository (select Use repository link, if forked): forked Github repository url
      • Collaboration branch: main
      • Root folder: /e2e_samples/parking_sensors/adf
      • Import existing resources to repository: Selected
      • Import resource into this branch: main
    • When prompted to select a working branch, check Use existing and select main

    Ensure you Import Existing Data Factory resources to repository. The deployment script deployed ADF objects with Linked Service configurations in line with the newly deployed environments. Importing existing ADF resources definitions to the repository overrides any default Linked Services values so they are correctly in sync with your DEV environment.

  4. Trigger an initial Release

    • In the DEV Data Factory portal, navigate to "Manage > Triggers". Select the T_Sched trigger and activate it by clicking on the "Play" icon next to it. Click Publish to publish changes.
      • Publishing a change is required to generate the adf_publish branch which is used in the Release pipelines.
    • In Azure DevOps, notice a new run of the Build Pipeline (mdw-park-ci-artifacts) off main. This will build the Python package and SQL DACPAC, then publish these as Pipeline Artifacts.
    • After completion, this should automatically trigger the Release Pipeline (mdw-park-cd-release). This will deploy the artifacts across environments.
      • You may need to authorize the Pipelines initially to use the Service Connection and deploy the target environments for the first time. Release Pipeline
    • Optional. Trigger the Data Factory Pipelines per environment.
      • In the Data Factory portal of each environment, navigate to "Author", then select the P_Ingest_MelbParkingData.
      • Select "Trigger > Trigger Now".
      • To monitor the run, go to "Monitor > Pipeline runs". Data Factory Run
      • Currently, the data pipeline is configured to use "on-demand" databricks clusters so it takes a few minutes to spin up. That said, it is not uncommon to change these to point to "existing" running clusters in Development for faster data pipeline runs.
  5. Optional. Visualize data in PowerBI

    This requires PowerBI Desktop App installed.

    • Open the provided PowerBi pbix (PowerBI_ParkingSensors.pbix) under reports folder.
    • Under Queries, select "Transform Data" > "Data source settings".
    • Select "Change Source..." and enter the Server and Database details of your SQL Dedicated Pool. Click "Ok".

      You can retrieve these from the Azure Portal under "Connection Strings" of your SQL Dedicated Pool Instance.

    • Select "Edit Permissions...". Under "Credentials", select "Edit...". Select the "Database" tab. Enter the User name and password of your SQL Dedicated Pool Instance.

      You can retrieve these from the Secrets in your KeyVault instance.

    • Close the Data Source tabs.
    • Click on Refresh data.

      Your Dashboard will initially be empty. You will need your data pipeline to run a few times for the data in your SQL Dedicated Pool to populate.

Congratulations!! 🥳 You have successfully deployed the solution and accompanying Build and Release Pipelines. For next steps, we recommend watching this presentation for a detailed walk-through of the running solution.

If you've encountered any issues, please review the Troubleshooting section. If you are still stuck, please file a Github issue with the relevant error message, error screenshots, and replication steps.

Deployed Resources

After a successful deployment, you should have the following resources:

  • In Azure, three (3) Resource Groups (one per environment) each with the following Azure resources.
    • Data Factory - with pipelines, datasets, linked services, triggers deployed and configured correctly per environment.
    • Data Lake Store Gen2 and a Service Principal (SP) with Storage Contributor rights assigned.
    • Databricks workspace
      • notebooks uploaded at /notebooks folder in the workspace
      • SparkSQL tables created
      • ADLS Gen2 mounted at dbfs:/mnt/datalake using the Storage Service Principal.
      • Databricks KeyVault secrets scope created
    • Log Analytics Workspace - including a kusto query on Query explorer -> Saved queries, to verify results that will be logged on Synapse notebooks (notebooks are not deployed yet).
    • Azure Synapse SQL Dedicated Pool (formerly SQLDW) - currently, empty. The Release Pipeline will deploy the SQL Database objects.
    • Azure Synapse Spark Pool - currently, empty. Configured to point the deployed Log Analytics workspace, under "Apache Spark Configuration".
    • Azure Synapse Workspace - currently, empty.
    • Application Insights
    • KeyVault with all relevant secrets stored.
  • In Azure DevOps
    • Four (4) Azure Pipelines
      • mdwdops-cd-release - Release Pipeline
      • mdwdops-ci-artifacts - Build Pipeline
      • mdwdops-ci-qa-python - "QA" pipeline runs on PR to main
      • mdwdops-ci-qa-sql - "QA" pipeline runs on PR to main
    • Three (6) Variables Groups - two per environment
      • mdwdops-release-dev
      • mdwdops-secrets-dev*
      • mdwdops-release-stg
      • mdwdops-secrets-stg*
      • mdwdops-release-prod
      • mdwdops-secrets-prod*
    • Four (4) Service Connections
      • Three Azure Service Connections (one per environment) each with a Service Principal with Contributor rights to the corresponding Resource Group.
        • mdwdops-serviceconnection-dev
        • mdwdops-serviceconnection-stg
        • mdwdops-serviceconnection-prod
      • Github Service Connection for retrieving code from Github
        • mdwdops-github
    • Three additional Service Principals (one per environment) with Data Factory Contributor role for running Integration Tests

Notes:

Clean up

This sample comes with an optional, interactive clean-up script which will delete resources with mdwdops in its name. It will list resources to be deleted and will prompt before continuing. IMPORTANT NOTE: As it simply searches for mdwdops in the resource name, it could list resources not part of the deployment! Use with care.

Data Lake Physical layout

ADLS Gen2 is structured as the following:

    datalake                    <- filesystem
        /sys/databricks/libs    <- contains all libs, jars, wheels needed for processing
        /data
            /lnd                <- Bronze - landing folder where all data files are ingested into.
            /interim            <- Silver - interim (cleansed) tables
            /dw                 <- Gold - final tables 

Known Issues, Limitations and Workarounds

The following lists some limitations of the solution and associated deployment script:

  • Azure DevOps Variable Groups linked to KeyVault can only be created via the UI, cannot be created programmatically and was not incorporated in the automated deployment of the solution.
    • Workaround: Deployment add sensitive configuration as "secrets" in Variable Groups with the downside of duplicated information. If you wish, you may manually link a second Variable Group to KeyVault to pull out the secrets. KeyVault secret names should line up with required variables in the Azure DevOps pipelines. See here for more information.
  • Azure DevOps Environment and Approval Gates can only be managed via the UI, cannot be managed programmatically and was not incorporated in the automated deployment of the solution.
    • Workaround: Approval Gates can be easily configured manually. See here for more information.