
Commit

Fixing coderabbitai PR comments
joelmataKPN committed Jul 1, 2024
1 parent c7f59d3 commit f6025a1
Showing 5 changed files with 12 additions and 13 deletions.
metadata-ingestion/docs/sources/abs/README.md (2 changes: 1 addition & 1 deletion)
@@ -37,4 +37,4 @@ We are working on using iterator-based JSON parsers to avoid reading in the enti

### Profiling

-Profiling is not available in the current release.
+Profiling is not available in the current release.
metadata-ingestion/docs/sources/abs/abs.md (8 changes: 4 additions & 4 deletions)
@@ -1,13 +1,13 @@

### Path Specs

-Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects where each individual `path_spec` represents one or more datasets. Include path (`path_spec.include`) represents formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent leaf level. If `*.[ext]` is provided then files with only specified extension type will be scanned. "`.[ext]`" can be any of [supported file types](#supported-file-types). Refer [example 1](#example-1---individual-file-as-dataset) below for more details.
+Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects, where each individual `path_spec` represents one or more datasets. The include path (`path_spec.include`) represents a formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent the leaf level. If `*.[ext]` is provided, then only files with the specified extension type will be scanned. "`.[ext]`" can be any of the [supported file types](#supported-file-types). Refer to [example 1](#example-1---individual-file-as-dataset) below for more details.

-All folder levels need to be specified in include path. You can use `/*/` to represent a folder level and avoid specifying exact folder name. To map folder as a dataset, use `{table}` placeholder to represent folder level for which dataset is to be created. For a partitioned dataset, you can use placeholder `{partition_key[i]}` to represent name of `i`th partition and `{partition[i]}` to represent value of `i`th partition. During ingestion, `i` will be used to match partition_key to partition. Refer [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
+All folder levels need to be specified in the include path. You can use `/*/` to represent a folder level and avoid specifying the exact folder name. To map a folder as a dataset, use the `{table}` placeholder to represent the folder level for which the dataset is to be created. For a partitioned dataset, you can use the placeholder `{partition_key[i]}` to represent the name of the `i`th partition and `{partition[i]}` to represent the value of the `i`th partition. During ingestion, `i` will be used to match the partition_key to the partition. Refer to [examples 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.

-Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to current `path_spec`. This path cannot have named variables ( `{}` ). Exclude path can have `**` to represent multiple folder levels. Refer [example 4](#example-4---folder-of-files-as-dataset-with-partitions-and-exclude-filter) below for more details.
+Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to the current `path_spec`. This path cannot have named variables (`{}`). The exclude path can have `**` to represent multiple folder levels. Refer to [example 4](#example-4---folder-of-files-as-dataset-with-partitions-and-exclude-filter) below for more details.

-Refer [example 5](#example-5---advanced---either-individual-file-or-folder-of-files-as-dataset) if your container has more complex dataset representation.
+Refer to [example 5](#example-5---advanced---either-individual-file-or-folder-of-files-as-dataset) if your container has a more complex dataset representation.

**Additional points to note**
- Folder names should not contain {, }, *, / in their names.
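To make the placeholder mechanics described in the paragraphs above concrete, here is a hypothetical `path_spec`, written as a Python dict purely for illustration. Only the `include`/`exclude` fields and the `{table}`, `{partition_key[i]}`, and `{partition[i]}` placeholders come from the documentation; the storage account, container, folder names, and file extension are invented for this example.

```python
# Hypothetical path_spec for an Azure Blob Storage container (illustration only).
path_spec = {
    # {table} marks the folder that becomes the dataset; {partition_key[0]} and
    # {partition[0]} capture the first partition's name and value. The path ends
    # with *.parquet, so only files with that extension are scanned.
    "include": (
        "https://myaccount.blob.core.windows.net/my-container/data/"
        "{table}/{partition_key[0]}={partition[0]}/*.parquet"
    ),
    # Exclude patterns may use ** to span multiple folder levels, but no named variables.
    "exclude": ["**/_tmp/**"],
}

# A blob such as .../my-container/data/sales/year=2024/part-000.parquet would then
# map to the dataset "sales", with partition_key[0]="year" and partition[0]="2024".
```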
metadata-ingestion/docs/sources/s3/README.md (2 changes: 1 addition & 1 deletion)
@@ -1,5 +1,5 @@
This connector ingests AWS S3 datasets into DataHub. It allows mapping an individual file or a folder of files to a dataset in DataHub.
-To specify the group of files that form a dataset, use `path_specs` configuration in ingestion recipe. Refer section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for more details.
+Refer to the section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for more details.

:::tip
This connector can also be used to ingest local files.
@@ -1,5 +1,4 @@
-import logging
import logging
import os
import re
from typing import Iterable, Optional, Dict, List
@@ -84,7 +83,7 @@ def get_abs_properties(
    use_abs_blob_properties: Optional[bool] = False,
) -> Dict[str, str]:
    if azure_config is None:
-        raise ValueError("No container_client available.")
+        raise ValueError("Azure configuration is not provided. Cannot retrieve container client.")

    blob_service_client = azure_config.get_blob_service_client()
    container_client = blob_service_client.get_container_client(
@@ -196,7 +195,7 @@ def get_abs_tags(
) -> Optional[GlobalTagsClass]:
    # Todo add the service_client, when building out this get_abs_tags
    if azure_config is None:
-        raise ValueError("container_client not set. Cannot browse abs")
+        raise ValueError("Azure configuration is not provided. Cannot retrieve container client.")

    tags_to_add: List[str] = []
    blob_service_client = azure_config.get_blob_service_client()
@@ -241,7 +240,7 @@ def list_folders(
    container_name: str, prefix: str, azure_config: Optional[AzureConnectionConfig]
) -> Iterable[str]:
    if azure_config is None:
-        raise ValueError("azure_config not set. Cannot browse Azure Blob Storage")
+        raise ValueError("Azure configuration is not provided. Cannot retrieve container client.")

    abs_blob_service_client = azure_config.get_blob_service_client()
    container_client = abs_blob_service_client.get_container_client(container_name)
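All three hunks in this file converge on the same `None` guard and the same error message. The sketch below shows one hedged way to express that shared guard as a helper; the `require_azure_config` name and the simplified `AzureConnectionConfig` stand-in are assumptions for illustration, not code from this commit.

```python
from typing import Optional


class AzureConnectionConfig:
    """Simplified stand-in for DataHub's Azure connection config class."""

    def get_blob_service_client(self):
        raise NotImplementedError  # the real class returns a BlobServiceClient


def require_azure_config(
    azure_config: Optional[AzureConnectionConfig],
) -> AzureConnectionConfig:
    """Return the config, or raise the standardized error introduced in this commit."""
    if azure_config is None:
        raise ValueError(
            "Azure configuration is not provided. Cannot retrieve container client."
        )
    return azure_config


# get_abs_properties, get_abs_tags, and list_folders could then all start with:
#     azure_config = require_azure_config(azure_config)
#     blob_service_client = azure_config.get_blob_service_client()
```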
@@ -113,11 +113,11 @@ def get_bucket_name(path: str) -> str:
raise ValueError(f"Unable to get bucket name from path: {path}")

def get_sub_types(self) -> str:
if self.platform == "s3":
if self.platform == PLATFORM_S3:
return DatasetContainerSubTypes.S3_BUCKET
elif self.platform == "gcs":
elif self.platform == PLATFORM_GCS:
return DatasetContainerSubTypes.GCS_BUCKET
elif self.platform == "abs":
elif self.platform == PLATFORM_ABS:
return DatasetContainerSubTypes.ABS_CONTAINER
raise ValueError(f"Unable to sub type for platform: {self.platform}")

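The hunk above swaps the string literals `"s3"`, `"gcs"`, and `"abs"` for named constants. A self-contained sketch of that pattern follows; the constant values and the `DatasetContainerSubTypes` stand-in are assumptions inferred from the literals visible in the diff, not DataHub's actual definitions.

```python
from enum import Enum

# Presumed values of the constants referenced in the diff.
PLATFORM_S3 = "s3"
PLATFORM_GCS = "gcs"
PLATFORM_ABS = "abs"


class DatasetContainerSubTypes(str, Enum):
    """Stand-in enum; only the members used in the diff are shown."""

    S3_BUCKET = "S3 bucket"
    GCS_BUCKET = "GCS bucket"
    ABS_CONTAINER = "ABS container"


def get_sub_type(platform: str) -> str:
    # A dict lookup is equivalent to the if/elif chain in the diff and keeps the
    # platform-to-subtype mapping in one place.
    subtypes = {
        PLATFORM_S3: DatasetContainerSubTypes.S3_BUCKET,
        PLATFORM_GCS: DatasetContainerSubTypes.GCS_BUCKET,
        PLATFORM_ABS: DatasetContainerSubTypes.ABS_CONTAINER,
    }
    try:
        return subtypes[platform]
    except KeyError:
        raise ValueError(f"Unable to get sub type for platform: {platform}")
```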
