Databricks Plugin Setup Doc Enhancement #4445

Merged 8 commits on Nov 28, 2023
110 changes: 89 additions & 21 deletions rsts/deployment/plugins/webapi/databricks.rst
Databricks workspace
--------------------
To set up your Databricks account, follow these steps:

1. Create a `Databricks account <https://www.databricks.com/>`__.

.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/databricks_workspace.png
:alt: A screenshot of Databricks workspace creation.

2. Ensure that you have a Databricks workspace up and running.

.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/open_workspace.png
:alt: A screenshot of Databricks workspace.

3. Generate a `personal access token
   <https://docs.databricks.com/dev-tools/auth.html#databricks-personal-access-token-authentication>`__ to be used in the Flyte configuration.
   You can find it in the workspace under ``User settings`` -> ``Developer`` -> ``Access tokens``.

.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/databricks_access_token.png
:alt: A screenshot of access token.
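
To confirm the token works before wiring it into Flyte, you can call any authenticated REST endpoint, such as the Clusters API (a quick sanity check; ``<databricks-instance>`` is your workspace hostname):

.. code-block:: bash

   # List clusters; a JSON response confirms the token is valid.
   curl -s -H "Authorization: Bearer <your-personal-access-token>" \
     "https://<databricks-instance>/api/2.0/clusters/list"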

4. Enable custom containers on your Databricks cluster before you trigger the workflow.

.. code-block:: bash

curl -X PATCH -n -H "Authorization: Bearer <your-personal-access-token>" \
https://<databricks-instance>/api/2.0/workspace-conf \
-d '{"enableDcs": "true"}'

For more detail, see `custom containers <https://docs.databricks.com/administration-guide/clusters/container-services.html>`__.
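
To verify that the flag took effect, you can read it back from the same ``workspace-conf`` endpoint:

.. code-block:: bash

   curl -s -H "Authorization: Bearer <your-personal-access-token>" \
     "https://<databricks-instance>/api/2.0/workspace-conf?keys=enableDcs"
   # Expected response: {"enableDcs":"true"}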

5. Create an `instance profile
<https://docs.databricks.com/administration-guide/cloud-configurations/aws/instance-profiles.html>`__
for the Spark cluster. This profile enables the Spark job to access your data in the S3 bucket.
Please follow all four steps specified in the documentation.

Create an instance profile using the AWS console (For AWS Users)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. In the AWS console, go to the IAM service.
2. Click the Roles tab in the sidebar.
3. Click Create role.

   a. Under Trusted entity type, select AWS service.
   b. Under Use case, select **EC2**.
   c. Click Next.
   d. On the Add permissions page, select the **AmazonS3FullAccess** policy, then click Next.
   e. In the Role name field, type a role name.
   f. Click Create role.

4. In the role list, click the role you just created.

In the role summary, copy the Role ARN.

.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/s3_arn.png
:alt: A screenshot of s3 arn.
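
If you prefer the AWS CLI to the console, the following sketch creates an equivalent role and instance profile; the name ``databricks-s3-access`` is only an example:

.. code-block:: bash

   # Create the role with an EC2 trust policy (role name is an example).
   aws iam create-role \
     --role-name databricks-s3-access \
     --assume-role-policy-document '{
       "Version": "2012-10-17",
       "Statement": [{
         "Effect": "Allow",
         "Principal": {"Service": "ec2.amazonaws.com"},
         "Action": "sts:AssumeRole"
       }]
     }'

   # Attach the AmazonS3FullAccess managed policy.
   aws iam attach-role-policy \
     --role-name databricks-s3-access \
     --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

   # Wrap the role in an instance profile so the Spark cluster's EC2 nodes can use it.
   aws iam create-instance-profile --instance-profile-name databricks-s3-access
   aws iam add-role-to-instance-profile \
     --instance-profile-name databricks-s3-access \
     --role-name databricks-s3-access

   # Copy the Role ARN for the next steps.
   aws iam get-role --role-name databricks-s3-access --query Role.Arn --output text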

Locate the IAM role that created the Databricks deployment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you don’t know which IAM role created the Databricks deployment, do the following:

1. As an account admin, log in to the account console.
2. Go to ``Workspaces`` and click your workspace name.
3. In the Credentials box, note the role name at the end of the Role ARN.

For example, in the Role ARN ``arn:aws:iam::123456789123:role/finance-prod``, the role name is ``finance-prod``.

Edit the IAM role that created the Databricks deployment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. In the AWS console, go to the IAM service.
2. Click the Roles tab in the sidebar.
3. Click the role that created the Databricks deployment.
4. On the Permissions tab, click the policy.
5. Click Edit Policy.
6. Append the following block to the end of the ``Statement`` array. Ensure that you don't overwrite any of the existing policy. Replace ``<iam-role-for-s3-access>`` with the role you created in step 5 (*Configure S3 access with instance profiles*).

.. code-block:: json

{
"Effect": "Allow",
"Action": "iam:PassRole",
"Resource": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
}
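
Before editing, you can inspect the role's existing policies from the command line (a sketch; ``finance-prod`` reuses the example role name above, and ``<policy-name>`` is the policy you found in step 4):

.. code-block:: bash

   # List the inline policies attached to the deployment role.
   aws iam list-role-policies --role-name finance-prod

   # Print the current policy document before appending the new statement.
   aws iam get-role-policy --role-name finance-prod --policy-name <policy-name>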


6. Upload the following ``entrypoint.py`` file to either
`DBFS <https://docs.databricks.com/archive/legacy/data-tab.html>`__
(the final path will be ``dbfs:///FileStore/tables/entrypoint.py``) or S3.
This file will be executed by the Spark driver node, overriding the default command of the
`Databricks <https://docs.databricks.com/dev-tools/dbx.html>`__ job. This entrypoint file will:

1. Download the inputs from S3 to the local filesystem.
2. Execute the Spark task.
3. Upload the outputs from the local filesystem to S3 for the downstream tasks to consume.


.. TODO: A quick-and-dirty workaround for https://github.com/flyteorg/flyte/issues/3853 is to import pandas.
.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/dbfs.png
:alt: A screenshot of dbfs.

.. code-block:: python



   def main():
       args = sys.argv

       click_ctx = click.Context(click.Command("dummy"))
       if args[1] == "pyflyte-fast-execute":
           parser = _fast_execute_task_cmd.make_parser(click_ctx)
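
Once the file is saved locally, one way to get it into place is from the command line (a sketch, assuming the Databricks CLI and AWS CLI are already configured; the bucket name is a placeholder):

.. code-block:: bash

   # Upload to DBFS with the Databricks CLI...
   databricks fs cp entrypoint.py dbfs:/FileStore/tables/entrypoint.py

   # ...or to an S3 bucket instead.
   aws s3 cp entrypoint.py s3://<your-bucket>/entrypoint.py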

Specify plugin configuration
----------------------------
.. note::

   The demo cluster saves data to MinIO, but the Databricks job saves data to S3.
   Therefore, you need to update the AWS credentials for the single binary deployment
   so that the pod can access the S3 bucket that the Databricks job writes to.
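
One way to supply those credentials is to set environment variables on the single binary deployment (a sketch; the deployment name ``flyte-binary`` and the ``flyte`` namespace are assumptions that depend on your installation):

.. code-block:: bash

   # Inject AWS credentials into the Flyte single binary pod.
   # Deployment name and namespace are assumptions; adjust to your install.
   kubectl set env deployment/flyte-binary -n flyte \
     AWS_ACCESS_KEY_ID=<access-key-id> \
     AWS_SECRET_ACCESS_KEY=<secret-access-key> \
     AWS_DEFAULT_REGION=<region>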


.. tabs::

Add the Databricks access token to FlytePropeller:

.. code-block:: yaml

   apiVersion: v1
   data:
     FLYTE_DATABRICKS_API_TOKEN: <ACCESS_TOKEN>
     client_secret: Zm9vYmFy
   kind: Secret
   ...
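
Note that values under ``data`` in a Kubernetes Secret must be base64-encoded, so encode the token before pasting it in:

.. code-block:: bash

   # Kubernetes stores Secret `data` values base64-encoded.
   echo -n "<your-databricks-token>" | base64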

Wait for the upgrade to complete. You can check the status of the deployment pods by running the following command:

.. code-block:: bash

   kubectl get pods -n flyte

For the Databricks plugin on the Flyte cluster, please refer to the `Databricks Plugin Example <https://docs.flyte.org/projects/cookbook/en/latest/auto_examples/databricks_plugin/index.html>`_.