
Metadata-driven dynamic loading of modules in Synapse

Contents

1. Solution Overview
   1.1. Scope
   1.2. Use Case
   1.3. Architecture
   1.4. Technologies used
2. How to use this sample
   2.1. Prerequisites
   2.2. Setup and deployment
   2.3. Deployed Resources
   2.4. Deployment validation and Execution
   2.5. Clean-up
3. Troubleshooting

1. Solution Overview

This sample demonstrates how business logic transformations can be separated into different modules, packaged, and loaded dynamically into Synapse Spark pools based on defined metadata.

The metadata can be defined in a separate database, in a JSON configuration file, or passed as Synapse pipeline parameters. To keep this sample simple and lightweight, pipeline parameters are used to store the metadata for the pipeline runs.

This sample focuses on provisioning a Synapse workspace and the required resources so that you can run the Synapse pipeline and see how the same pipeline can be used with multiple datasets to perform data-specific transformations by loading the appropriate module dynamically at run time.
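To illustrate the core idea, the orchestrator notebook can read the module name from the metadata and import the corresponding wheel package at run time. The sketch below is a minimal, simplified version of that pattern using Python's importlib; the transform(df, config) interface and the metadata keys are assumptions made for illustration and may differ from the actual contract used in this sample.

```python
# Minimal sketch of metadata-driven dynamic module loading (illustrative only).
# Assumes each transformation wheel exposes a `transform(df, config)` function.
import importlib
import json


def run_transformation(df, metadata: dict):
    """Load the module named in the metadata and apply it to the DataFrame."""
    module_name = metadata["moduleName"]                    # e.g. "md5" or "data_filter"
    module_config = json.loads(metadata.get("moduleConfig") or "{}")

    # Import the wheel package that was attached to the Spark pool.
    module = importlib.import_module(module_name)

    # Delegate the data-specific logic to the dynamically loaded module.
    return module.transform(df, module_config)
```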

1.1. Scope

The following list captures the scope of this sample:

  1. Provision an Azure Synapse environment using a shell script.
  2. The following services will be provisioned as a part of this sample setup:
    1. Synapse workspace.
    2. Azure Storage account with a root container containing a sample CSV data file.
    3. Synapse Spark pool.
    4. Synapse pipeline.

Details about how to use this sample can be found in the later sections of this document.

1.2. Use Case

For our sample use case, we have a country list CSV file that is processed by a Synapse pipeline, with the resulting data stored in Spark external tables.
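Writing the processed output to a Spark external table typically looks like the short sketch below; the target dictionary keys mirror the shape of the pipeline's targetTable parameter described later and are assumptions for illustration.

```python
# Sketch: persisting a DataFrame as a Spark external table (illustrative only).
from pyspark.sql import DataFrame


def write_to_external_table(df: DataFrame, target: dict, database: str = "default") -> None:
    """Write parquet files to the given path and register them as an external table."""
    (df.write
        .mode("overwrite")
        .option("path", target["path"])                # parquet files are stored here
        .saveAsTable(f"{database}.{target['name']}"))  # e.g. default.country_list
```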

We have two sample transformation modules:

  • md5: Calculates a hash across all the columns and adds it as a separate column to the country list data (a sketch of what such a module might look like is shown at the end of this section).
  • data_filter: Returns a filtered dataset based on the condition passed to the module, e.g. get the countries where the region is Asia.

The Synapse pipeline is run twice to demonstrate how the two different transformations are applied to the country list data.

Details about how to run the pipeline can be found in the later sections of this document.
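As mentioned above, a sketch of what an md5-style transformation module could look like is shown below. It concatenates all columns and appends an MD5 hash column; this is an assumption about the module's internals made for illustration, and the packaged wheel in this sample may be implemented differently.

```python
# Sketch of an "md5"-style transformation module (illustrative, not the shipped code).
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def transform(df: DataFrame, config: dict) -> DataFrame:
    """Add an MD5 hash of all columns as a separate column."""
    hash_column = config.get("hashColumnName", "row_hash")  # assumed config key
    # Concatenate every column (cast to string) and hash the result.
    concatenated = F.concat_ws("||", *[F.col(c).cast("string") for c in df.columns])
    return df.withColumn(hash_column, F.md5(concatenated))
```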

1.3. Architecture

The diagram below illustrates the design and the flow of the system:

(Architecture diagram)

1.4. Technologies used

The following technologies are used to build this sample:

  • Azure Synapse Analytics (workspace, Spark pool, pipelines, and notebooks)
  • Azure Storage
  • Azure CLI
  • Python (PySpark transformation modules packaged as wheels)

2. How to use this sample

This section provides usage instructions for this sample.

2.1. Prerequisites

The following are the prerequisites for deploying this sample:

  1. GitHub account
  2. Azure Account

2.1.1. Software Prerequisites

  1. Azure CLI installed on the local machine
    • Installation instructions can be found here
  2. For Windows users,
    1. Option 1: Windows Subsystem for Linux
    2. Option 2: Use the devcontainer published here as a host for the bash shell. For more information about Devcontainers, see here.

2.2. Setup and deployment

IMPORTANT NOTE: As with all Azure deployments, this will incur associated costs. Remember to tear down all related resources after use to avoid unnecessary costs. See here for a list of deployed resources.

The steps to deploy this sample are listed below:

  1. Fork and clone this repository. Navigate to single_tech_samples/synapseanalytics/sample1_loading_dynamic_modules/setup/.

  2. The sample depends on the following variables being set before the deployment script is run. You can set them as environment variables; otherwise, the deploy script will prompt you to provide values for them:

    • DEPLOYMENT_PREFIX - Prefix for the resource names which will be created as a part of this deployment. Default: syndyn.
    • AZURE_SUBSCRIPTION_ID - Subscription ID of the Azure subscription where the resources should be deployed. Default: your default Azure subscription.
    • AZURE_RESOURCE_GROUP_LOCATION - Azure region where the resources will be deployed (e.g. australiaeast, eastus, etc.). Default: eastus.
  3. Run ./deploy.sh

    Note: The script will prompt you to log in to the Azure account for authorization to deploy resources.

    The script will create the Synapse Analytics workspace, Azure Storage account, Synapse pipelines, and Synapse notebook. It will also upload the sample file to blob storage and the wheel packages to the Synapse Spark pool.


2.3. Deployed Resources

The following resources will be deployed as a part of this sample once the script is executed:

1. Azure Synapse Analytics workspace.

2. Azure Storage account with the sample file.

3. Azure Synapse Spark pool with wheel packages.

2.4. Deployment validation and Execution

The following steps can be performed to validate the correct deployment and execution of the sample:

  • Users with appropriate access rights should be able to:

    1. Launch the Synapse workspace from the Azure portal.
    2. Access the control plane for the storage account through the Azure portal.
    3. View the files uploaded to the Azure Storage container.
  • Detailed steps on how to execute the sample:

    1. Launch the Azure Synapse workspace and open Synapse pipelines.
    2. Select DataProcessPipeline and edit the configuration of the IterateFiles (ForEach) activity.
    3. Select the InvokeOrchestratorNotebook activity, go to the Settings tab, and set the Spark pool to SparkPool1.
    4. Click Debug or Trigger to run the pipeline.

    Note: As part of the deployment, the two modules have already been uploaded to the Azure Synapse Spark pool.

    The pipeline takes the following parameters; an example set of values is shown after the list:

    • storageAccountName: The Azure Storage account name.

    • containerName: The Azure blob storage container name.

    • basePath: The folder inside the container from which the files are picked up.

    • filePath: The file name regex used to select the files.

    • archivePath: The folder inside the Azure blob container where processed files are archived; defaults to archive.

    • moduleName: The name of the transformation module (wheel package) used to transform the data during the pipeline run; defaults to md5.

    • moduleConfig: The configuration required by the transformation module. It can be empty in case the module doesn't take any configuration.

    • targetTable: A JSON object with the table name and the path where the parquet files for the external table will be stored.

    • database: The Spark database to use; defaults to default.
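For reference, a first run with the md5 module might use parameter values along the following lines. This is only an illustrative sketch: the storage account, container, paths, table name, and the exact JSON shape of targetTable are placeholders and assumptions, not values shipped with the sample.

```python
# Illustrative pipeline parameter values for a run with the "md5" module.
# Storage account, container, paths, and the targetTable JSON keys are placeholders.
pipeline_parameters = {
    "storageAccountName": "<your-storage-account>",
    "containerName": "<your-container>",
    "basePath": "input",                    # folder to pick files from (assumed)
    "filePath": "country_list.*\\.csv",     # file name regex (assumed pattern)
    "archivePath": "archive",               # default archive folder
    "moduleName": "md5",                    # transformation module to load
    "moduleConfig": "{}",                   # empty config (assumed sufficient for md5)
    "targetTable": '{"name": "country_list_md5", "path": "/data/country_list_md5"}',
    "database": "default",
}
```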

      1. Fill in the parameters as required (or keep the default values) and run the pipeline. Once the pipeline succeeds, you'll see the data in the Spark table defined by the targetTable parameter.

      2. Now let's run the same pipeline with another module, keeping everything the same except the module name, module configuration, and target table.

      3. Run the pipeline again with moduleName set to data_filter. This module filters the data based on the filter criteria provided in the moduleConfig parameter, which in our case is "Asia" on the region column (a sketch of what such a module might look like follows these steps).

      4. Once the pipeline finishes, you'll see a new Spark table named country_list_asia created successfully.
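As referenced in step 3, a data_filter-style module could look roughly like the following sketch, filtering rows by an equality condition taken from the module configuration; the config keys shown are assumptions for illustration, not necessarily those used by the shipped wheel.

```python
# Sketch of a "data_filter"-style transformation module (illustrative, not the shipped code).
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def transform(df: DataFrame, config: dict) -> DataFrame:
    """Return only the rows matching the configured equality condition."""
    column = config["filterColumn"]   # e.g. "region" (assumed config key)
    value = config["filterValue"]     # e.g. "Asia"   (assumed config key)
    return df.filter(F.col(column) == value)
```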

    With this approach, you can build generic pipelines that load transformation modules dynamically, based on configuration, to process your data.

2.5. Clean-up

The clean-up script removes the resources provisioned by this sample. Follow these steps to execute it:

  1. Navigate to single_tech_samples/synapseanalytics/sample1_loading_dynamic_modules/setup/.

  2. Run ./destroy.sh

3. Troubleshooting

SqlServerPasswordTooShort

If you run the ./deploy.sh script on macOS, you may run into the following error:

(ValidationFailed) Workspace request validation failed, check error details for more information
Code: ValidationFailed
Message: Workspace request validation failed, check error details for more information
Exception Details:(SqlServerPasswordTooShort) Sql Server password must be atleast 8 characters long.
Code: SqlServerPasswordTooShort
Message: Sql Server password must be atleast 8 characters long.

This is usually preceded by the following error earlier in the log: tr: Illegal byte sequence. The core issue is that the script generates a random password, and on macOS the tr command fails when the random input contains byte sequences that are not valid characters in the default UTF-8 locale.

To fix this, set the following environment variables before running the script:

export LC_ALL=C
export LC_CTYPE=C