From ba3adfbcc49631c0f61cfecd48ad19a640bda66b Mon Sep 17 00:00:00 2001 From: Future Outlier Date: Fri, 17 Nov 2023 23:36:35 +0800 Subject: [PATCH 1/6] databricks plugin doc update Signed-off-by: Future Outlier --- rsts/deployment/plugins/webapi/databricks.rst | 34 +++++++++++++++++-- 1 file changed, 32 insertions(+), 2 deletions(-) diff --git a/rsts/deployment/plugins/webapi/databricks.rst b/rsts/deployment/plugins/webapi/databricks.rst index 671fdb4e18..266bfa42b1 100644 --- a/rsts/deployment/plugins/webapi/databricks.rst +++ b/rsts/deployment/plugins/webapi/databricks.rst @@ -54,12 +54,42 @@ To set up your Databricks account, follow these steps: `__ to generate access and secret keys, which can be used to access your preferred S3 bucket. -Create an `instance profile +Here's an example of your S3 bucket settings in your configmap. + +.. code-block:: bash + kubectl edit configmap flyte-sandbox-config -n flyte + +.. code-block:: bash + TODO: ADD MORE DETAILS + AWS_S3_ACCESS_KEY_ID: xxx + AWS_S3_SECRET_ACCESS_KEY: xxx + AWS_S3_REGION_NAME: xxx + AWS_S3_ENDPOINT_URL: xxx + +4. Enable custom containers on your Databricks cluster before you trigger the workflow. + +.. code-block:: bash + curl -X PATCH -n \ + -H "Authorization: Bearer " \ + https:///api/2.0/workspace-conf \ + -d '{ + "enableDcs": "true" + }' + +Here's the `custom containers + `__ + reference. + +5. Create an `instance profile `__ for the Spark cluster. This profile enables the Spark job to access your data in the S3 bucket. Please follow all four steps specified in the documentation. -Upload the following entrypoint.py file to either +Here's an example of your instance profile. +.. code-block:: bash + TODO: ADD MORE DETAILS + +6. Upload the following entrypoint.py file to either `DBFS `__ (the final path can be ``dbfs:///FileStore/tables/entrypoint.py``) or S3. This file will be executed by the Spark driver node, overriding the default command in the From ab24be53ab54813af3f0176b8e5d250ca7b922fa Mon Sep 17 00:00:00 2001 From: Future Outlier Date: Fri, 17 Nov 2023 23:38:18 +0800 Subject: [PATCH 2/6] update Signed-off-by: Future Outlier --- rsts/deployment/plugins/webapi/databricks.rst | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/rsts/deployment/plugins/webapi/databricks.rst b/rsts/deployment/plugins/webapi/databricks.rst index 266bfa42b1..2102871193 100644 --- a/rsts/deployment/plugins/webapi/databricks.rst +++ b/rsts/deployment/plugins/webapi/databricks.rst @@ -57,9 +57,11 @@ To set up your Databricks account, follow these steps: Here's an example of your S3 bucket settings in your configmap. .. code-block:: bash + kubectl edit configmap flyte-sandbox-config -n flyte .. code-block:: bash + TODO: ADD MORE DETAILS AWS_S3_ACCESS_KEY_ID: xxx AWS_S3_SECRET_ACCESS_KEY: xxx @@ -69,6 +71,7 @@ Here's an example of your S3 bucket settings in your configmap. 4. Enable custom containers on your Databricks cluster before you trigger the workflow. .. code-block:: bash + curl -X PATCH -n \ -H "Authorization: Bearer " \ https:///api/2.0/workspace-conf \ @@ -86,7 +89,9 @@ for the Spark cluster. This profile enables the Spark job to access your data in Please follow all four steps specified in the documentation. Here's an example of your instance profile. + .. code-block:: bash + TODO: ADD MORE DETAILS 6. 
Upload the following entrypoint.py file to either From e72df609ff1c73af1f4e82f090cb2ca51114a494 Mon Sep 17 00:00:00 2001 From: Future Outlier Date: Fri, 17 Nov 2023 23:41:16 +0800 Subject: [PATCH 3/6] update Signed-off-by: Future Outlier --- rsts/deployment/plugins/webapi/databricks.rst | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/rsts/deployment/plugins/webapi/databricks.rst b/rsts/deployment/plugins/webapi/databricks.rst index 2102871193..f66d38e191 100644 --- a/rsts/deployment/plugins/webapi/databricks.rst +++ b/rsts/deployment/plugins/webapi/databricks.rst @@ -72,16 +72,14 @@ Here's an example of your S3 bucket settings in your configmap. .. code-block:: bash - curl -X PATCH -n \ - -H "Authorization: Bearer " \ - https:///api/2.0/workspace-conf \ - -d '{ - "enableDcs": "true" - }' - -Here's the `custom containers - `__ - reference. + curl -X PATCH -n \ + -H "Authorization: Bearer " \ + https:///api/2.0/workspace-conf \ + -d '{ + "enableDcs": "true" + }' + +Here's the `custom containers `__ reference. 5. Create an `instance profile `__ From 7b3b4ff6ef9f973179de2b85df56e0343a5f2567 Mon Sep 17 00:00:00 2001 From: Kevin Su Date: Sat, 25 Nov 2023 21:05:41 -0800 Subject: [PATCH 4/6] Kevin Update Signed-off-by: Kevin Su --- rsts/deployment/plugins/webapi/databricks.rst | 112 +++++++++++------- 1 file changed, 70 insertions(+), 42 deletions(-) diff --git a/rsts/deployment/plugins/webapi/databricks.rst b/rsts/deployment/plugins/webapi/databricks.rst index f66d38e191..3ec56a5e0d 100644 --- a/rsts/deployment/plugins/webapi/databricks.rst +++ b/rsts/deployment/plugins/webapi/databricks.rst @@ -42,63 +42,94 @@ Databricks workspace To set up your Databricks account, follow these steps: 1. Create a `Databricks account `__. -2. Ensure that you have a Databricks workspace up and running. -3. Generate a `personal access token - `__ to be used in the Flyte configuration. - You can find the personal access token in the user settings within the workspace. - -.. note:: - - When testing the Databricks plugin on the demo cluster, create an S3 bucket because the local demo - cluster utilizes MinIO. Follow the `AWS instructions - `__ - to generate access and secret keys, which can be used to access your preferred S3 bucket. -Here's an example of your S3 bucket settings in your configmap. +.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/databricks_workspace.png + :alt: A screenshot of Databricks workspace creation. -.. code-block:: bash +2. Ensure that you have a Databricks workspace up and running. - kubectl edit configmap flyte-sandbox-config -n flyte +.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/open_workspace.png + :alt: A screenshot of Databricks workspace. -.. code-block:: bash +3. Generate a `personal access token + `__ to be used in the Flyte configuration. + You can find the personal access token in the user settings within the workspace. ``User settings`` -> ``Developer`` -> ``Access tokens`` - TODO: ADD MORE DETAILS - AWS_S3_ACCESS_KEY_ID: xxx - AWS_S3_SECRET_ACCESS_KEY: xxx - AWS_S3_REGION_NAME: xxx - AWS_S3_ENDPOINT_URL: xxx +.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/databricks_access_token.png + :alt: A screenshot of access token. 4. Enable custom containers on your Databricks cluster before you trigger the workflow. .. 
code-block:: bash
 
-    curl -X PATCH -n \
-      -H "Authorization: Bearer " \
+   curl -X PATCH -n -H "Authorization: Bearer " \
       https:///api/2.0/workspace-conf \
-      -d '{
-        "enableDcs": "true"
-      }'
+       -d '{"enableDcs": "true"}'
 
-Here's the `custom containers
- `__
- reference.
+For more detail, see `custom containers `__.
 
 5. Create an `instance profile
 `__
 for the Spark cluster. This profile enables the Spark job to access your data in the S3 bucket.
-Please follow all four steps specified in the documentation.
 
-Here's an example of your instance profile.
+Create an instance profile using the AWS console (For AWS Users)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+1. In the AWS console, go to the IAM service.
+2. Click the Roles tab in the sidebar.
+3. Click Create role.
+
+   a. Under Trusted entity type, select AWS service.
+   b. Under Use case, select **EC2**.
+   c. Click Next.
+   d. At the bottom of the page, click Next.
+   e. In the Role name field, type a role name.
+   f. Click Create role.
+
+4. In the role list, click the role you just created.
+5. On the Permissions tab, attach the **AmazonS3FullAccess** policy to the role.
+
+In the role summary, copy the Role ARN.
+
+.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/s3_arn.png
+    :alt: A screenshot of the S3 role ARN.
+
+Locate the IAM role that created the Databricks deployment
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+If you don’t know which IAM role created the Databricks deployment, do the following:
+
+1. As an account admin, log in to the account console.
+2. Go to ``Workspaces`` and click your workspace name.
+3. In the Credentials box, note the role name at the end of the Role ARN.
+
+For example, in the Role ARN ``arn:aws:iam::123456789123:role/finance-prod``, the role name is ``finance-prod``.
+
+Edit the IAM role that created the Databricks deployment
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+1. In the AWS console, go to the IAM service.
+2. Click the Roles tab in the sidebar.
+3. Click the role that created the Databricks deployment.
+4. On the Permissions tab, click the policy.
+5. Click Edit Policy.
+6. Append the following block to the end of the Statement array. Ensure that you don’t overwrite any of the existing policy. Replace with the role you created in Configure S3 access with instance profiles.
 
 .. code-block:: bash
 
-    TODO: ADD MORE DETAILS
+    {
+      "Effect": "Allow",
+      "Action": "iam:PassRole",
+      "Resource": "arn:aws:iam:::role/"
+    }
+
 
-6. Upload the following entrypoint.py file to either
+6. Upload the following ``entrypoint.py`` file to either
 `DBFS `__
-(the final path can be ``dbfs:///FileStore/tables/entrypoint.py``) or S3.
-This file will be executed by the Spark driver node, overriding the default command in the
+(the final path will be ``dbfs:///FileStore/tables/entrypoint.py``) or S3.
+This file will be executed by the Spark driver node, overriding the default command of the
 `dbx `__ job.
-.. TODO: A quick-and-dirty workaround to resolve https://github.com/flyteorg/flyte/issues/3853 issue is to import pandas.
 
+.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/dbfs.png
+    :alt: A screenshot of dbfs.
 
 .. 
code-block:: python
 
 def main():
-    args = sys.argv
-
     click_ctx = click.Context(click.Command("dummy"))
     if args[1] == "pyflyte-fast-execute":
         parser = _fast_execute_task_cmd.make_parser(click_ctx)
@@ -155,6 +184,12 @@ This file will be executed by the Spark driver node, overriding the default comm
 Specify plugin configuration
 ----------------------------
 
+.. note::
+
+   The demo cluster saves data to MinIO, but the Databricks job saves data to S3.
+   Therefore, you need to update the AWS credentials for the single binary deployment so that
+   the pod can access the S3 bucket that the Databricks job writes to.
+
 
 .. tabs::
 
@@ -363,7 +398,6 @@ Add the Databricks access token to FlytePropeller:
       apiVersion: v1
       data:
         FLYTE_DATABRICKS_API_TOKEN: 
-        client_secret: Zm9vYmFy
       kind: Secret
      ...
 
@@ -408,9 +442,3 @@ Wait for the upgrade to complete. You can check the status of the deployment pod
    .. code-block::
 
       kubectl get pods -n flyte
-
-.. note::
-
-   Make sure you enable `custom containers
-   `__
-   on your Databricks cluster before you trigger the workflow.
\ No newline at end of file

From 1eaa419cb597ced334ea0402031cfe21716501c3 Mon Sep 17 00:00:00 2001
From: Kevin Su
Date: Sat, 25 Nov 2023 21:15:18 -0800
Subject: [PATCH 5/6] Kevin Update

Signed-off-by: Kevin Su
---
 rsts/deployment/plugins/webapi/databricks.rst | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/rsts/deployment/plugins/webapi/databricks.rst b/rsts/deployment/plugins/webapi/databricks.rst
index 3ec56a5e0d..2460aae0fd 100644
--- a/rsts/deployment/plugins/webapi/databricks.rst
+++ b/rsts/deployment/plugins/webapi/databricks.rst
@@ -126,7 +126,12 @@ Edit the IAM role that created the Databricks deployment
 `DBFS `__
 (the final path will be ``dbfs:///FileStore/tables/entrypoint.py``) or S3.
 This file will be executed by the Spark driver node, overriding the default command of the
-`dbx `__ job.
+`Databricks `__ job. This entrypoint file will:
+
+1. Download the inputs from S3 to the local filesystem.
+2. Execute the Spark task.
+3. Upload the outputs from the local filesystem to S3 for the downstream tasks to consume.
+
 
 .. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/dbfs.png
     :alt: A screenshot of dbfs.

From ac3f6d9a2cbf97b5aa6ccf905c3da30b6d04b62a Mon Sep 17 00:00:00 2001
From: Future Outlier
Date: Mon, 27 Nov 2023 22:58:14 +0800
Subject: [PATCH 6/6] add example link

Signed-off-by: Future Outlier
---
 rsts/deployment/plugins/webapi/databricks.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/rsts/deployment/plugins/webapi/databricks.rst b/rsts/deployment/plugins/webapi/databricks.rst
index 2460aae0fd..ee38a481df 100644
--- a/rsts/deployment/plugins/webapi/databricks.rst
+++ b/rsts/deployment/plugins/webapi/databricks.rst
@@ -447,3 +447,5 @@ Wait for the upgrade to complete. You can check the status of the deployment pod
    .. code-block::
 
       kubectl get pods -n flyte
+
+For an end-to-end example of the Databricks plugin on a Flyte cluster, please refer to the `Databricks Plugin Example `_.
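+
+As a quick orientation, a minimal sketch of a Flyte task configured to run on Databricks is shown below.
+The cluster settings (``spark_version``, ``node_type_id``, the instance profile ARN) and the other
+Spark and Databricks configuration values are illustrative assumptions only; replace them with values
+that are valid in your workspace, and see the example linked above for the authoritative version.
+
+.. code-block:: python
+
+   import flytekit
+   from flytekit import task, workflow
+   from flytekitplugins.spark import Databricks
+
+
+   @task(
+       task_config=Databricks(
+           # Generic Spark settings forwarded to the job.
+           spark_conf={"spark.driver.memory": "1000M"},
+           # Databricks job settings; every value below is a placeholder.
+           databricks_conf={
+               "run_name": "flyte databricks example",
+               "new_cluster": {
+                   "spark_version": "12.2.x-scala2.12",  # placeholder
+                   "node_type_id": "m5.large",  # placeholder
+                   "num_workers": 2,
+                   "aws_attributes": {
+                       # Instance profile created in the steps above (placeholder ARN).
+                       "instance_profile_arn": "arn:aws:iam::123456789123:instance-profile/databricks-instance-profile",
+                   },
+               },
+               "timeout_seconds": 3600,
+               "max_retries": 1,
+           },
+       ),
+   )
+   def hello_spark(partitions: int) -> float:
+       # The Spark plugin injects a Spark session into the task context at runtime.
+       sess = flytekit.current_context().spark_session
+       count = sess.sparkContext.parallelize(range(1, partitions + 1), partitions).count()
+       return float(count)
+
+
+   @workflow
+   def databricks_wf(partitions: int = 10) -> float:
+       return hello_spark(partitions=partitions)
+
+When registering such a workflow, remember to build and supply a container image that includes
+``flytekitplugins-spark``, since the Databricks job runs your task inside that custom container.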