In this sample we are doing a few things:
- Uploading some sample data into a newly created container
- Adding the logged-in account as a Storage Blob Data Contributor to enable passthrough access
- Deploying a default Databricks cluster and a High-Concurrency Databricks cluster
- Accessing data from the default cluster by mounting the Data Lake using the Account Key
- Accessing data from the high-concurrency cluster by mounting the Data Lake using AD Credential Passthrough
- You need to have already deployed the Databricks workspace using sample1_basic_azure_databricks_environment
- You need to have rights to assign account roles to the storage account
If you would like to run this sample with sample 2, please follow this for a few extra setup steps.
- Azure CLI installed on the local machine
  - Installation instructions can be found here
- For Windows users,
  - Option 1: Windows Subsystem for Linux
  - Option 2: Use the dev container published here as a host for the bash shell.
- Databricks CLI
First, configure your environment variable file .env, which should look the same as, or similar to, the one you are using in sample1. Then run:
source .env
This loads the configuration needed for the steps below.
Run ./deploy.sh to start the deployment process described in the steps below.
./deploy.sh
This will execute a few scripts:
This step will create a new container $storageAccountContainerName in your storage account and upload the sample data file sample-data.us-population.json into the container.
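As a rough illustration of this step, the container creation and upload could be done with the Azure CLI along the following lines. This is a sketch, not the sample's actual script: the function name and its arguments are hypothetical, and authenticating with the first account key is an assumption.

```shell
# Sketch only: create a container and upload one file using the account key.
# The function name and argument order are illustrative, not from the sample.
upload_sample_data() {
  local resource_group="$1" account="$2" container="$3" file="$4"
  local key

  # Fetch the first access key of the storage account.
  key="$(az storage account keys list \
    --resource-group "$resource_group" \
    --account-name "$account" \
    --query "[0].value" --output tsv)"

  # Create the container in the storage account.
  az storage container create \
    --account-name "$account" \
    --name "$container" \
    --account-key "$key"

  # Upload the file, using its base name as the blob name.
  az storage blob upload \
    --account-name "$account" \
    --container-name "$container" \
    --name "$(basename "$file")" \
    --file "$file" \
    --account-key "$key"
}

# Usage (resource group and account names are placeholders):
# upload_sample_data "my-rg" "mystorageaccount" "$storageAccountContainerName" sample-data.us-population.json
```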
It will then add your current account as a "Storage Blob Data Contributor" on the storage account, which will allow you to access the storage account later using AD credential passthrough. Without this step, you will get an unauthorized error when trying to read the mount point.
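A role assignment like this one can be sketched with the Azure CLI as shown below. The function name and arguments are hypothetical, and the `id` query field is an assumption about the installed CLI version (older Azure CLI versions expose the same value as `objectId`).

```shell
# Sketch only: grant the signed-in user "Storage Blob Data Contributor"
# on one storage account. Names and arguments are illustrative.
assign_blob_contributor() {
  local resource_group="$1" account="$2"
  local user_id scope

  # Object ID of the signed-in user (field is `objectId` on older CLI versions).
  user_id="$(az ad signed-in-user show --query id --output tsv)"

  # Full resource ID of the storage account, used as the role scope.
  scope="$(az storage account show \
    --resource-group "$resource_group" \
    --name "$account" \
    --query id --output tsv)"

  az role assignment create \
    --assignee "$user_id" \
    --role "Storage Blob Data Contributor" \
    --scope "$scope"
}

# Usage (placeholders):
# assign_blob_contributor "my-rg" "mystorageaccount"
```

Role assignments can take a few minutes to propagate, so a read attempted immediately afterwards may still be rejected.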
This step will create two clusters using the Azure Databricks REST API and the JSON files located in the sample directory. One of the clusters is a basic cluster, while the other is a high-concurrency cluster. The high-concurrency cluster is required for the AD passthrough configuration, which also requires a Premium Databricks workspace.
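For illustration, a cluster can be created by posting a JSON definition to the Clusters API `clusters/create` endpoint. The sketch below assumes a bearer token and the workspace hostname are already available; the function name and the JSON file names in the usage comment are hypothetical.

```shell
# Sketch only: create a cluster from a JSON definition file.
# host  = workspace hostname without scheme
# token = AD or personal access token accepted by the workspace
create_cluster() {
  local host="$1" token="$2" definition="$3"

  curl --silent --fail \
    --request POST "https://${host}/api/2.0/clusters/create" \
    --header "Authorization: Bearer ${token}" \
    --header "Content-Type: application/json" \
    --data "@${definition}"
}

# Usage (file names are placeholders for the JSON files in the sample directory):
# create_cluster "$workspace_host" "$token" default-cluster.json
# create_cluster "$workspace_host" "$token" high-concurrency-cluster.json
```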
This step will create a file ~/.databrickscfg with a newly fetched AD token and the hostname of your Databricks workspace. This will allow you to use the databricks command in later steps. Note that this token expires quickly (probably after around 30 minutes), so you may need to run this step again if you keep working for longer than that.
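The configuration file written in this step is a small INI-style file with a host and a token. A minimal sketch of producing it is shown below; the function name is hypothetical, and the optional third argument (an alternative output path) is added only to make the sketch easy to try without overwriting an existing configuration.

```shell
# Sketch only: write a Databricks CLI config with a host and a token.
# Third argument is optional and defaults to ~/.databrickscfg.
write_databrickscfg() {
  local host="$1" token="$2" cfg="${3:-$HOME/.databrickscfg}"

  printf '[DEFAULT]\nhost = https://%s\ntoken = %s\n' "$host" "$token" > "$cfg"
}

# Usage: fetch an AD token for the Azure Databricks resource, then write the file.
# 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d is the application ID of the Azure Databricks service.
# token="$(az account get-access-token \
#   --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d \
#   --query accessToken --output tsv)"
# write_databrickscfg "<workspace-hostname>" "$token"
```

Because the token is short-lived, re-running these two commands is usually enough to restore access for the databricks command.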
This step will upload the following scripts into the root level of your Databricks workspace:
access-data-directly-via-account-key.ipy
access-data-mount-via-ad-passthrough.ipy
Note that a search-and-replace is performed on each template before it is uploaded, which replaces the storage account name with the correct prefix.
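The search-and-replace can be pictured as a simple sed substitution over the template before upload. In the sketch below, the placeholder string `__STORAGE_ACCOUNT__` and the function name are assumptions made for illustration; the sample's templates may use a different marker.

```shell
# Sketch only: substitute a placeholder in a notebook template.
# `__STORAGE_ACCOUNT__` is a hypothetical marker, not taken from the sample.
render_notebook() {
  local template="$1" storage_account="$2" output="$3"

  # Storage account names contain only lowercase letters and digits,
  # so they are safe to use directly in the sed replacement.
  sed "s/__STORAGE_ACCOUNT__/${storage_account}/g" "$template" > "$output"
}

# Usage (paths are placeholders):
# render_notebook notebook.template "mystorageaccount" rendered-notebook
```

The rendered file could then be uploaded with the `databricks workspace import` command, whose exact options depend on the installed Databricks CLI version.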
After running this script, log in to your Databricks workspace and run the notebooks.
Note that you need to run the AD Passthrough example on the High-Concurrency cluster; otherwise you will get an error saying you cannot mount the storage account.
To clean up the resources, go back to sample1 and run ./destroy.sh
./destroy.sh
If you want to run this sample with more restrictive network settings, like those in Sample 2, you need to add your IP address to the Storage Account firewall before the deployment.
Two options:
- From the Azure Portal, by going to "Networking" -> "Firewalls and virtual networks" and ticking "Add your client IP address"
- From the Azure CLI, using az storage account network-rule add ...
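For the Azure CLI option, one possible invocation is sketched below. The function name and its arguments are hypothetical, and the caller is assumed to already know the public IP address to allow.

```shell
# Sketch only: allow a single public IP address through the storage account firewall.
allow_ip_on_storage_firewall() {
  local resource_group="$1" account="$2" ip="$3"

  az storage account network-rule add \
    --resource-group "$resource_group" \
    --account-name "$account" \
    --ip-address "$ip"
}

# Usage (placeholders):
# allow_ip_on_storage_firewall "my-rg" "mystorageaccount" "203.0.113.10"
```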