This sample demonstrates how we can separate our business logic transformations into different modules that can be packaged and loaded dynamically into Synapse Spark pools based on the defined metadata.
We can define the metadata in a separate database, in a JSON configuration file, or as Synapse pipeline parameters. To keep this sample simple and lightweight, we are using pipeline parameters to store the metadata for running our pipelines.
This sample focuses on provisioning a Synapse workspace and the required resources, so that you can run the Synapse pipeline and see how the same pipeline can be used with multiple datasets to perform dataset-specific transformations by loading the module dynamically at run time.
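The heart of the pattern is a small piece of orchestration code that resolves the module to load from metadata. The snippet below is a minimal sketch and not the sample's actual orchestrator notebook; the `transform(df, config)` entry point, the parameter values, and the storage path are assumptions used only for illustration.

```python
import importlib
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# In the sample these values arrive as Synapse pipeline parameters;
# they are shown here as plain variables for illustration.
module_name = "md5"                 # which transformation wheel to apply
module_config = json.loads("{}")    # module-specific configuration (may be empty)

# Read the source data that the pipeline copied into the storage account.
# Replace the placeholders with your container and storage account names.
df = spark.read.csv(
    "abfss://<container>@<account>.dfs.core.windows.net/country_list.csv",
    header=True,
)

# Import the transformation module that was installed on the Spark pool as a
# wheel package, then call its (assumed) transform(df, config) entry point.
module = importlib.import_module(module_name)
transformed_df = module.transform(df, module_config)
transformed_df.show()
```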
The following list captures the scope of this sample:
- Provision an Azure Synapse environment with a shell script.
- The following services will be provisioned as part of this sample setup:
  - Synapse workspace.
  - Azure Storage account with a root container containing a sample CSV data file.
  - Synapse Spark pool.
  - Synapse pipeline.
Details about how to use this sample can be found in the later sections of this document.
For our sample use case we have a country list CSV file that is processed by a Synapse pipeline, and the resulting data is stored in Spark external tables.
We have two sample transformation modules:
- md5: This module calculates the hash of all the columns and adds it as a separate column to the country list data.
- data_filter: This module returns a filtered dataset based on the condition passed to the module, e.g. get the countries where the region is Asia. (Illustrative sketches of both modules are shown below.)
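For illustration, the two modules could be implemented roughly as shown below. This is a hedged sketch under the assumption that each wheel exposes a `transform(df, config)`-style entry point; the actual packages in this sample may use different function names and configuration schemas.

```python
from pyspark.sql import DataFrame, functions as F


def md5_transform(df: DataFrame, config: dict) -> DataFrame:
    """Sketch of the md5 module: add an MD5 hash of all columns as an extra column."""
    return df.withColumn("md5_hash", F.md5(F.concat_ws("|", *df.columns)))


def filter_transform(df: DataFrame, config: dict) -> DataFrame:
    """Sketch of the data_filter module: keep rows matching one equality condition.

    Assumes a config such as {"filter_column": "region", "filter_value": "Asia"}.
    """
    return df.filter(F.col(config["filter_column"]) == config["filter_value"])
```

In the real sample each of these functions would live in its own wheel package and be exposed under a common entry-point name so that the orchestrator notebook can call it generically.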
The Synapse pipeline will be run twice to demonstrate how the two different transformations are applied to the country list data.
Details about how to run the pipeline can be found in the later sections of this document.
The diagram below illustrates the design and the flow of the system:
The following technologies are used to build this sample:
This section provides usage instructions for this sample.
The following are the prerequisites for deploying this sample:
- GitHub account
- Azure account
  - Permissions needed: the ability to create and deploy to an Azure resource group, to create a service principal, and to grant the Contributor role to the service principal over the resource group.
  - An active subscription with the following resource providers enabled:
    - Microsoft.Synapse
    - Microsoft.Storage
- Azure CLI installed on the local machine
  - Installation instructions can be found here
  - For Windows users,
    - Option 1: Windows Subsystem for Linux
    - Option 2: Use the devcontainer published here as a host for the bash shell. For more information about Devcontainers, see here.
IMPORTANT NOTE: As with all Azure deployments, this sample will incur associated costs. Remember to tear down all related resources after use to avoid unnecessary costs. See here for a list of deployed resources.
The following are the steps to deploy this sample:

- Fork and clone this repository. Navigate to `single_tech_samples/synapseanalytics/sample1_loading_dynamic_modules/setup/`.
- The sample depends on a few variables being set before the deployment script is run. You can set these as environment variables; otherwise, the deploy script will prompt you to provide input for the following variables:
  - `DEPLOYMENT_PREFIX` - Prefix for the names of the resources that will be created as part of this deployment. Default: `syndyn`.
  - `AZURE_SUBSCRIPTION_ID` - Subscription ID of the Azure subscription where the resources should be deployed. Default: your default Azure subscription.
  - `AZURE_RESOURCE_GROUP_LOCATION` - Azure region where the resources will be deployed (e.g. `australiaeast`, `eastus`, etc.). Default: `eastus`.
- Run `./deploy.sh`.

  Note: The script will prompt you to log in to your Azure account to authorize the deployment of resources.

  The script creates the Synapse Analytics workspace, Azure storage, Synapse pipelines, and a Synapse notebook. It also uploads the sample file to blob storage and the wheel packages to Azure Synapse.
The following resources will be deployed as a part of this sample once the script is executed:
1. Azure Synapse Analytics workspace.
2. Azure Storage account with the sample file.
3. Azure Synapse Spark pool with wheel packages.
The following steps can be performed to validate the correct deployment and execution of the sample:

- Users with appropriate access rights should be able to:
  - Launch the Synapse workspace from the Azure portal.
  - Access the control plane for the storage account through the Azure portal.
  - View the files uploaded to the Azure storage container (a programmatic check is also sketched below).
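As a quick programmatic alternative to browsing the portal, the sketch below lists the blobs in the deployed container using the `azure-storage-blob` and `azure-identity` packages. This snippet is not part of the sample; the account and container names are placeholders you must replace with the values from your deployment.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Replace with the storage account and container created by the deployment script.
account_name = "<storage-account-name>"
container_name = "<container-name>"

service = BlobServiceClient(
    account_url=f"https://{account_name}.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)

# List the uploaded files (e.g. the sample country list CSV).
for blob in service.get_container_client(container_name).list_blobs():
    print(blob.name)
```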
- Detailed steps on how to execute the sample:
  - Launch the Azure Synapse workspace and open the Synapse pipelines.
  - Select `DataProcessPipeline` and click on edit configurations for the `IterateFiles` (ForEach) activity.
  - Select the `InvokeOrchestratorNotebook` activity, click on the `Settings` tab, and set the Spark pool to `SparkPool1`.
  - Click on debug or trigger to run the pipeline.
Note: As part of the deployment, two modules have already been uploaded to Azure Synapse.
The pipeline takes the following parameters:

- `storageAccountName`: the Azure storage account name.
- `containerName`: the Azure blob storage container name.
- `basePath`: the folder inside the container from which the files are picked up.
- `filPath`: the file name regex used to select the files.
- `arhivePath`: defaults to the `archive` folder inside the Azure blob container.
- `moduleName`: the name of the transformation module (wheel package) used to transform the data during the pipeline run. Defaults to `md5`.
- `moduleConfig`: the configuration required for the transformation module. It can be empty in case the module doesn't take any configuration.
- `targetTable`: a JSON object with the table name and the path where the parquet files for the external table will be stored (a sketch of how such an external table might be created appears right after this list).
- `database`: input for the Spark database; defaults to `default`.
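As a rough idea of how the `targetTable` and `database` parameters might be used, the sketch below continues the loader sketch from earlier in this document (where `spark` and `transformed_df` are already defined). The table name, path, and SQL shown here are assumptions, not code taken from the sample.

```python
# Illustrative values only; in the sample these come from the
# targetTable and database pipeline parameters.
database = "default"
table_name = "country_list_md5"  # hypothetical table name
table_path = "abfss://<container>@<account>.dfs.core.windows.net/curated/country_list_md5"

# Write the transformed data as parquet, then register an external
# (unmanaged) Spark table over that location.
transformed_df.write.mode("overwrite").parquet(table_path)
spark.sql(
    f"CREATE TABLE IF NOT EXISTS {database}.{table_name} "
    f"USING PARQUET LOCATION '{table_path}'"
)
```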
- Fill in the parameters as required (or keep the default values) and run the pipeline. Once the pipeline succeeds, you'll see the data in the Spark table defined in the `targetTable` parameter. (Illustrative parameter values for both runs are sketched after this list.)
- Now let's run the same pipeline with another module, keeping everything the same except the module name, module configuration, and target table.
- Run the pipeline again with `moduleName` set to `data_filter`. This module will filter the data based on the filter criteria provided in the `moduleConfig` parameter, which in our case is "Asia" on the `region` column.
- Once the pipeline finishes, you'll see a new Spark table named `country_list_asia` created successfully.
- With this, you can have generic pipelines that load transformation modules dynamically based on configuration to process your data.
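For illustration, the two runs described above might use parameter values along the following lines. These are assumptions: the exact `moduleConfig` schema is defined by each module, and the table names and paths (other than `country_list_asia`) are placeholders rather than values documented by the sample.

```python
# Storage-related parameters (storageAccountName, containerName, basePath, ...)
# are omitted; they keep the values from your deployment.

# Run 1: md5 module, no module configuration needed.
run_1_parameters = {
    "moduleName": "md5",
    "moduleConfig": "",
    "targetTable": {"name": "country_list_md5", "path": "/curated/country_list_md5"},
    "database": "default",
}

# Run 2: data_filter module, filtering the region column for "Asia".
# The filter keys below are an assumed schema, not one documented by the sample.
run_2_parameters = {
    "moduleName": "data_filter",
    "moduleConfig": {"filter_column": "region", "filter_value": "Asia"},
    "targetTable": {"name": "country_list_asia", "path": "/curated/country_list_asia"},
    "database": "default",
}
```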
Please follow the steps below to clean up your environment:

The clean-up script removes the resources provisioned by this sample. The following are the steps to execute the script:
- Navigate to `single_tech_samples/synapseanalytics/sample1_loading_dynamic_modules/setup/`.
- Run `./destroy.sh`.
If you run the `./deploy.sh` script on macOS, you may run into the following error:

(ValidationFailed) Workspace request validation failed, check error details for more information
Code: ValidationFailed
Message: Workspace request validation failed, check error details for more information
Exception Details: (SqlServerPasswordTooShort) Sql Server password must be atleast 8 characters long.
Code: SqlServerPasswordTooShort
Message: Sql Server password must be atleast 8 characters long.

This is usually preceded by the error `tr: Illegal byte sequence` earlier in the log. The root cause is that the script generates a random password from random bytes, and on macOS `tr` fails when the byte stream contains sequences that are not valid characters in the current locale.
To fix this, set the following environment variables before running the script:
export LC_ALL=C
export LC_CTYPE=C