- 1. Solution Overview
- 2. How to use this template
When building a project in Databricks, we typically start from a notebook and implement the business logic in Python or Spark SQL. Before going to production, we need to create CI/CD pipelines. To reduce the effort of building them, this Git repository serves as a template that includes sample notebooks and unit tests, together with CI/CD pipelines defined in Azure DevOps YAML files.
It provides the scaffolding for an Azure Databricks project.
To keep it easy to extend, the notebooks and Python code contain only very simple logic, and the unit tests are implemented with pytest and nutter.
This template focuses on the CI/CD pipelines and demonstrates two approaches to implementing a Spark application: the "notebook job" and the "spark python job". It also includes a Python package that is imported as a library by both the notebooks and the spark python job.
The following list captures the scope of this template:
- Sample code for the two Databricks job types, plus a common Python library
  - Notebook Job
  - Spark Python Job
  - A Python package used as a common library, imported by "notebook" and "spark python"
- Testing
  - pytest for the common library and spark python
  - nutter for notebooks
- DevOps pipelines that build, test and deploy the Spark jobs
Details about how to use this sample can be found in the later sections of this document.
The below diagram illustrates the deployment process flow followed in this template:
The following technologies are used to build this template:
- Azure DevOps
- Azure Databricks
- Azure Resource Manager
- nutter
- databricks cli
- dbx - Databricks CLI eXtensions
This section describes how to use this template.
The following are the prerequisites for deploying this template:
You need three Databricks workspaces, one each for 'develop', 'staging' and 'production'. You can set up the Azure Databricks services using the IaC samples from here
| .gitignore
| pytest.ini
| README.md
| requirements.txt
| setup.py
|
+---common
| | module_a.py
| | __init__.py
| |
| +---tests
| module_a_test.py
|
+---conf
| deployment_notebook.json
| deployment_notebook_new_cluster.json
| deployment_spark_python.json
| deployment_spark_python_new_cluster.json
|
+---devops
| | lib-pipelines.yml
| | notebook-pipelines.yml
| | spark-python-pipelines.yml
| |
| \---template
| create-deployment-json.yml
| deploy-lib-job.yml
| deploy-notebook-job.yml
| deploy-spark-python-job.yml
| test-lib-job.yml
| test-notebook-job.yml
| test-spark-python-job.yml
|
+---notebook_jobs
| | main_notebook_a.py
| | main_notebook_b.py
| | main_notebook_sql.py
| | module_b_notebook.py
| |
| \---tests
| main_notebook_a_test.py
| main_notebook_b_test.py
| main_notebook_sql_test.py
| module_b_notebook_test.py
|
+---spark_python_jobs
| main.py
| __init__.py
|
+---tests
+---integration
| main_test.py
| __init__.py
|
\---unit
main_test.py
__init__.py
This approach supports notebook jobs, the typical way to build a Databricks application. The template contains 4 notebooks and 4 nutter-based test notebooks.
- main_notebook_a.py
This notebook imports the library "common.module_a" and uses its "add_mount" method.
from common.module_a import add_mount
- main_notebook_b.py
This notebook imports a method declared in module_b_notebook.py.
%run ./module_b_notebook
- module_b_notebook.py
This notebook defines a method that is used by the notebook main_notebook_b.py.
- main_notebook_sql.py
This notebook shows how to use Spark SQL to process data.
- tests/main_notebook_a_test.py
This nutter-based test notebook first runs the notebook under test:
%run ../main_notebook_a
It then compares the expected result with the actual result:
from runtime.nutterfixture import NutterFixture

class Test1Fixture(NutterFixture):
    def __init__(self):
        self.actual_df = None
        NutterFixture.__init__(self)

    # df and expected_df are assumed to be defined earlier in the test notebook.
    def run_test_transform_data(self):
        self.actual_df = transform_data(df)

    def assertion_test_transform_data(self):
        assert self.actual_df.collect() == expected_df.collect()

    def after_test_transform_data(self):
        print('done')
- tests/main_notebook_b_test.py
This nutter-based test notebook first runs the notebook under test:
%run ../main_notebook_b
It then compares the expected result with the actual result:
from runtime.nutterfixture import NutterFixture

class Test1Fixture(NutterFixture):
    def __init__(self):
        self.actual_df = None
        NutterFixture.__init__(self)

    # df and expected_df are assumed to be defined earlier in the test notebook.
    def run_test_transform_data(self):
        self.actual_df = transform_data(df)

    def assertion_test_transform_data(self):
        assert self.actual_df.collect() == expected_df.collect()

    def after_test_transform_data(self):
        print('done')
- tests/module_b_notebook_test.py
This nutter-based test notebook first runs the notebook under test:
%run ../module_b_notebook
It then compares the expected result with the actual result:
from runtime.nutterfixture import NutterFixture

class Test1Fixture(NutterFixture):
    def __init__(self):
        self.actual_df = None
        NutterFixture.__init__(self)

    # df and expected_df are assumed to be defined earlier in the test notebook.
    def run_test_add_mount(self):
        self.actual_df = add_mount(df, 10)

    def assertion_test_add_mount(self):
        assert self.actual_df.collect() == expected_df.collect()

    def after_test_add_mount(self):
        print('done')
- tests/main_notebook_sql_test.py
This nutter-based test notebook runs the notebook under test with a timeout of 600 seconds:
dbutils.notebook.run('../main_notebook_sql', 600)
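The nutter fixtures listed above pair methods by name: a `run_<test>` method executes the code under test, `assertion_<test>` checks the result, and `after_<test>` cleans up. The toy runner below is only a sketch of that naming convention in plain Python. `MiniFixture` and `TransformFixture` are hypothetical names, the list-doubling logic stands in for a real Spark transformation, and none of this is nutter's actual implementation.

```python
class MiniFixture:
    """Toy stand-in for a nutter-style fixture (NOT nutter's real code)."""

    def execute_tests(self):
        results = {}
        # Discover test names from the methods prefixed with "run_".
        names = sorted(
            attr[len("run_"):] for attr in dir(self) if attr.startswith("run_")
        )
        for name in names:
            try:
                getattr(self, "run_" + name)()        # execute the code under test
                getattr(self, "assertion_" + name)()  # check the outcome
                results[name] = "passed"
            except AssertionError:
                results[name] = "failed"
            finally:
                # The cleanup hook is optional in this sketch.
                after = getattr(self, "after_" + name, None)
                if after is not None:
                    after()
        return results


class TransformFixture(MiniFixture):
    def __init__(self):
        self.actual = None

    def run_test_transform_data(self):
        # Hypothetical transformation: double every value.
        self.actual = [value * 2 for value in [1, 2, 3]]

    def assertion_test_transform_data(self):
        assert self.actual == [2, 4, 6]

    def after_test_transform_data(self):
        print("done")


results = TransformFixture().execute_tests()
print(results)
```

Keeping the run step and the assertion step in separate methods lets the runner report which phase failed, which is the main reason for the paired names.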
The bash script below creates a standalone Git repository from this sample. First create a project in Azure DevOps and a repository in that project, then replace [your repo url] in the script with your repository URL.
mkdir [your project name]
cd [your project name]
git clone https://github.com/Azure-Samples/modern-data-warehouse-dataops.git
cd modern-data-warehouse-dataops
git checkout single-tech/databricks-ops
git archive --format=tar single-tech/databricks-ops:single_tech_samples/databricks/sample4_ci_cd | tar -x -C ../
cd ..
rm -rf modern-data-warehouse-dataops
git init
git remote add origin [your repo url]
git add -A
git commit -m "first commit"
git push -u origin --all
git branch develop master
git branch staging develop
git branch production staging
git push -u origin develop
git push -u origin staging
git push -u origin production
After running the script, open your repository URL to check that the code has been pushed.
There are three branches in the repository:
- develop branch is the code base of development
- staging branch is for integration testing
- production branch is for production deployment
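Because each branch corresponds to one environment, a pipeline step can derive its target from the branch it runs on. The snippet below is a minimal sketch of that lookup, assuming the pipeline passes a Git ref such as `refs/heads/staging`. The function name `environment_for_ref` and the mapping are illustrative and are not taken from the template's YAML files.

```python
# Hypothetical mapping: the branch names come from this template,
# the environment labels are illustrative.
BRANCH_TO_ENVIRONMENT = {
    "develop": "develop",
    "staging": "staging",
    "production": "production",
}


def environment_for_ref(ref):
    """Return the target environment for a Git ref such as 'refs/heads/staging'."""
    prefix = "refs/heads/"
    branch = ref[len(prefix):] if ref.startswith(prefix) else ref
    try:
        return BRANCH_TO_ENVIRONMENT[branch]
    except KeyError:
        # Refuse to deploy from branches that have no environment.
        raise ValueError("No environment is mapped to branch '%s'" % branch) from None


print(environment_for_ref("refs/heads/staging"))  # prints "staging"
```

Raising an error for unmapped branches keeps a feature branch from being deployed to any workspace by accident.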
Refer to the branch policy documentation to set branch policies on these branches.
This repo contains several YAML files that define the pipelines supporting CI/CD. You need to import each YAML file as a build pipeline.
- Import ./devops/notebook-pipelines.yml as a build pipeline. This pipeline tests the notebooks and uploads them to the Databricks workspace.
Here is a post that explains how to import a YAML file from an Azure DevOps repository as an Azure DevOps pipeline.
- Import ./devops/lib-pipelines.yml as a build pipeline. This pipeline tests the Python library and uploads it to the Databricks cluster as a library.
You need to select the branch from which to run the pipeline for each environment.
- Manually run the pipeline from the develop branch to deploy the library to Databricks in the develop environment.
- Manually run the pipeline from the staging branch to deploy the library to Databricks in the staging environment.
- Manually run the pipeline from the production branch to deploy the library to Databricks in the production environment.
If your notebooks project does not require a library, remove the import statement from the notebooks.
- Create three variable groups with the names below.
Each variable group has three variables:
See the documentation on how to create variable groups and on how to get the token.
Follow this document to import the notebooks from the repository into the Databricks workspace.
- Switch to the develop branch.
- Open one of the notebooks to edit.
- Open the relevant test notebook, run it, and check the results.
- Commit and push the changes to the develop branch.
- Create a pull request from the develop branch to the staging branch.
- Complete the merge. This triggers the pipeline to run the tests on the staging Databricks cluster.
Or
- Create a pull request from the staging branch to the production branch, or run the pipeline directly on the release branch.
- This triggers the pipeline to run the tests and import the notebooks into the production Databricks workspace.
Or
- Manually run the pipeline from the production branch.
The pipeline does not create a job from the notebooks.
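To make the notebook upload step more concrete, the sketch below builds the request body that the Databricks Workspace API (`POST /api/2.0/workspace/import`) expects for a Python source notebook. It is an illustration only: the template's pipeline may well use the databricks CLI instead, the helper name `build_workspace_import_body` is made up, and the workspace path is a hypothetical value.

```python
import base64
import json


def build_workspace_import_body(source_code, workspace_path, overwrite=True):
    """Build the JSON body for POST /api/2.0/workspace/import."""
    # The API expects the notebook content as a base64-encoded string.
    encoded = base64.b64encode(source_code.encode("utf-8")).decode("ascii")
    return {
        "path": workspace_path,
        "format": "SOURCE",
        "language": "PYTHON",
        "content": encoded,
        "overwrite": overwrite,
    }


notebook_source = "# Databricks notebook source\nprint('hello')\n"
# Hypothetical target path inside the workspace.
body = build_workspace_import_body(
    notebook_source, "/Shared/notebook_jobs/main_notebook_a"
)
print(json.dumps(body, indent=2))
```

Sending the body requires the workspace URL and an access token, which in this template would come from the variable group of the target environment.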
This approach supports Spark Submit jobs. With it, you can develop a Spark application in a local IDE and submit it to a Databricks cluster to run.
Please follow 2.4.1 Repository setup
This repo contains several YAML files that define the pipelines supporting CI/CD. You need to import the YAML file as a build pipeline.
- Import ./devops/spark-python-pipelines.yml as a build pipeline. This pipeline tests the Spark Python job and deploys it to Databricks.
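For orientation, the sketch below shows the general shape of a job definition with a `spark_python_task`, as accepted by the Databricks Jobs API 2.0 `jobs/create` endpoint. It is not the content of the `conf/deployment_spark_python*.json` files, whose exact format is not shown here. The helper name, job name, file path and cluster values are all hypothetical. The choice between an existing cluster and a new cluster mirrors the two variants of the deployment files.

```python
import json


def build_spark_python_job(name, python_file, parameters=None,
                           existing_cluster_id=None, new_cluster=None):
    """Build a Jobs API 2.0 style job definition for a Spark Python task."""
    # Exactly one cluster option must be supplied.
    if (existing_cluster_id is None) == (new_cluster is None):
        raise ValueError("Provide exactly one of existing_cluster_id or new_cluster")
    job = {
        "name": name,
        "spark_python_task": {
            "python_file": python_file,
            "parameters": list(parameters or []),
        },
    }
    if existing_cluster_id is not None:
        job["existing_cluster_id"] = existing_cluster_id
    else:
        job["new_cluster"] = new_cluster
    return job


# Hypothetical values, for illustration only.
job = build_spark_python_job(
    name="sample-spark-python-job",
    python_file="dbfs:/jobs/spark_python_jobs/main.py",
    new_cluster={
        "spark_version": "10.4.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 1,
    },
)
print(json.dumps(job, indent=2))
```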
- Clone the repo into a local folder and open the folder in VSCode.
- If needed, install the Microsoft VSCode Remote-Containers extension.
- In VSCode, open the Command Palette and type
Remote-Containers: Open Folder in Container...
- Choose the folder named
***\sample4_ci_cd
- Wait for the devcontainer to start and then in the VSCode terminal window, run the script below to start the tests.
pytest common/tests
pytest spark_python_jobs/tests/unit
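By default, pytest collects test functions whose names start with `test_` from files named `test_*.py` or `*_test.py`, which matches the repo's `module_a_test.py` and `main_test.py` (the repo's `pytest.ini` may adjust this). The sketch below shows the general shape of such a test file. The function `add_amount` is a hypothetical stand-in that works on plain dictionaries rather than Spark DataFrames and is not the template's `add_mount`.

```python
def add_amount(rows, amount):
    """Hypothetical library logic: add a constant to every row's 'amount'."""
    return [dict(row, amount=row["amount"] + amount) for row in rows]


def test_add_amount_increments_every_row():
    rows = [{"id": 1, "amount": 5}, {"id": 2, "amount": 0}]
    expected = [{"id": 1, "amount": 15}, {"id": 2, "amount": 10}]
    assert add_amount(rows, 10) == expected


def test_add_amount_does_not_mutate_input():
    rows = [{"id": 1, "amount": 5}]
    add_amount(rows, 10)
    # The original rows must stay unchanged.
    assert rows == [{"id": 1, "amount": 5}]


if __name__ == "__main__":
    # pytest would discover these automatically; calling them directly
    # also works when pytest is not installed.
    test_add_amount_increments_every_row()
    test_add_amount_does_not_mutate_input()
    print("all tests passed")
```

Keeping the logic in a plain function that takes and returns data makes it testable without a running cluster, which is what allows these tests to run inside the dev container.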
- Set up a local Spark installation by following this document.
- Open a cmd terminal window and run the script below to set up the project for development.
pip install -r requirements.txt
- Edit the main.py file in VSCode.
- In the cmd terminal window, run the script below to start the tests.
pytest common/tests
pytest spark_python_jobs/tests/unit
- Commit and push the changes to the develop branch.
- Create a pull request from the develop branch to the staging branch.
- Complete the merge. This triggers the pipeline to run the tests on the staging Databricks cluster.
Or
- Create a pull request from the staging branch to the production branch, or run the pipeline directly.
- Complete the merge. This triggers the pipeline to run the tests and create a job in the production Databricks workspace.
Or