
feat(ingest): add and use file system abstraction in file source #8415

Merged
merged 17 commits on Jul 1, 2024

Conversation

@simaov (Contributor) commented Jul 13, 2023

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Summary by CodeRabbit

  • New Features

    • Introduced support for S3, local file system, and HTTP file system plugins for enhanced data ingestion capabilities.
    • Added new FileInfo class to improve file management.
  • Improvements

    • Enhanced file reading logic to handle streaming and batch modes more efficiently.
    • Updated file handling methods to utilize the new FileInfo class.
    • Improved resource management with adjustments to close and close_if_possible methods.
  • Bug Fixes

    • Fixed issues in the get_filenames method by returning an iterable of FileInfo objects instead of strings.
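Taken together, the new flow looks roughly like the sketch below. This is a hedged illustration assembled from the review diffs later in this thread; the module paths and the exact create()/list() signatures come from intermediate snapshots of the PR, not necessarily the merged code.

    # Illustrative sketch of the file-system abstraction added in this PR.
    from datahub.ingestion.fs.fs_base import get_path_schema
    from datahub.ingestion.fs.fs_registry import fs_registry

    path = "s3://my-bucket/metadata/mces.json"
    schema = get_path_schema(path)         # e.g. "s3", "http", or "file"
    fs = fs_registry.get(schema).create()  # the plugin is imported lazily here
    for file_info in fs.list(path):        # FileInfo carries path, size, is_file
        if file_info.is_file:
            with fs.open(file_info.path) as fp:
                ...  # read MCE/MCP records in stream or batch mode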

@simaov simaov changed the title Remote ingest mcp Add and use file system abstraction in file source Jul 13, 2023
@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Jul 13, 2023
@hsheth2 (Collaborator) commented Jul 13, 2023

@simaov thanks for the PR - I haven't given it a detailed look yet, but overall seems pretty nifty.

I am wondering if using smart-open (https://pypi.org/project/smart-open/) or requests-file might yield a similar outcome with less code on our side. I think we have dependencies on both of those libraries in certain places already.

@simaov (Contributor, author) commented Jul 14, 2023

Hi @hsheth2, thank you for your comment. Sorry, I am not very familiar with the Python ecosystem. I will have a look. Thank you.

@simaov (Contributor, author) commented Jul 14, 2023

Hi @hsheth2, you are right, smart_open does what we need, so I used it to read data from different sources. Thank you.

@anshbansal anshbansal added the community-contribution PR or Issue raised by member(s) of DataHub Community label Jul 17, 2023
return [str(self.config.path)]
def get_filenames(self) -> Iterable[FileStatus]:
path_str = str(self.config.path)
fs = FileSystem.get(path_str)
Collaborator:

this overall feels a bit complex

is there a reason we can't use smart_open directly without building wrappers for each thing (e.g. http, local, s3)?

Contributor Author:

If I am not mistaken, smart_open does not have methods to list a path or get file info; it can only open a stream. Currently there are three file systems: local, S3, and HTTP. In the future there can be more, and if someone needs to add Azure support, for instance, the only thing that has to be done is to implement FileSystem for Azure. That's it.
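For readers following the thread, the interface being described might look something like this minimal sketch. It uses the names (FileInfo, create) that the review settles on later; the Azure subclass is purely hypothetical.

    from abc import ABCMeta, abstractmethod
    from dataclasses import dataclass
    from typing import Any, Iterable

    @dataclass
    class FileInfo:
        path: str
        size: int
        is_file: bool

    class FileSystem(metaclass=ABCMeta):
        @classmethod
        @abstractmethod
        def create(cls, **kwargs: Any) -> "FileSystem":
            ...

        @abstractmethod
        def open(self, path: str, **kwargs: Any) -> Any:  # a file-like object
            ...

        @abstractmethod
        def file_status(self, path: str) -> FileInfo:
            ...

        @abstractmethod
        def list(self, path: str) -> Iterable[FileInfo]:
            ...

    # Hypothetical: adding Azure support would mean one new subclass
    # implementing these four methods, and nothing else.
    class AzureFileSystem(FileSystem):
        ...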

@simaov simaov requested a review from hsheth2 August 2, 2023 13:55
@hsheth2 (Collaborator) commented Aug 3, 2023

@simaov you're right - I think we probably do need this sort of per-system implementation, and the complexity is warranted here

@asikowitz will chime in with some more detailed comments, but the main things I'm thinking about here:

  1. I don't want the file source to have a hard dependency on boto3 / other libraries. If they're installed, we should use them, but we shouldn't fail if they're missing. We already have a "registry" abstraction (the same one we use for sources/sinks/transformers) that supports this lazy loading, so ideally we reuse that.
  2. The FileSystem class should probably be an abstract class
  3. The file sizes aren't that important - for stuff like s3 where getting file size is a whole extra API call, is that really worth it just so we can show it in the file source report? Not sure though - I could go either way on this

@simaov (Contributor, author) commented Aug 14, 2023

@hsheth2, thanks for your comments. I agree that we need to avoid a hard dependency. Could you please share more details about the registry and how it can be used? Maybe some examples, or how it is used in the project.

@hsheth2 (Collaborator) commented Aug 15, 2023

@simaov here's where we're setting up the registry for sources:

source_registry = PluginRegistry[Source]()
, and we use it here
source_class = source_registry.get(source_type)

In this case, we should just use .register("s3", "path.to.import") or similar instead of using register_from_entrypoint. Then you can just use .get like a normal dictionary, and it will handle the lazy loading
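Applied to file systems, that pattern might look like the sketch below. The registration method name and import-path format follow the description above as I understand it, so treat both as assumptions; the PR as merged registers the plugins via the datahub.fs.plugins entry point instead.

    # Hedged sketch of a lazy file-system registry, mirroring source_registry.
    from datahub.ingestion.api.registry import PluginRegistry
    from datahub.ingestion.fs.fs_base import FileSystem

    fs_registry = PluginRegistry[FileSystem]()
    # Registering an import path means boto3 is only imported if and when
    # the "s3" plugin is actually requested.
    fs_registry.register_lazy("s3", "datahub.ingestion.fs.s3_fs.S3FileSystem")
    fs_registry.register_lazy("file", "datahub.ingestion.fs.local_fs.LocalFileSystem")
    fs_registry.register_lazy("http", "datahub.ingestion.fs.http_fs.HttpFileSystem")

    # Lookup then behaves like a normal dictionary.
    s3_fs_class = fs_registry.get("s3")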

@simaov (Contributor, author) commented Sep 8, 2023

Hi @hsheth2. I tried to address your comments:

  1. created a separate registry for file systems, so it is now lazy and easy to manage
  2. made FileSystem abstract
  3. I tried to keep the original counters and calculations; if you think we don't need them, I can remove them

@hsheth2 (Collaborator) left a review:

overall this looks pretty good

had a few questions about naming / config

class FileSystem(metaclass=ABCMeta):

@classmethod
def create_fs(cls) -> "FileSystem":
Collaborator:

would be more consistent to just call this create

also should this method be taking kwargs?

Contributor Author:

done



@dataclass
class FileStatus:
Collaborator:

maybe we should call this FileInfo?

Contributor Author:

done

raise NotImplementedError('File system implementations must implement "create_fs"')

@abstractmethod
def open(self, path: str, **kwargs):
Collaborator:

is it possible to add a return type annotation here, or is it too messy?

Contributor Author:

I think at this point we don't know the actual return type; it depends on the underlying implementation. smart_open says that its open method returns a file-like object.
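If an annotation were wanted anyway, one option is typing's IO, which captures "a file-like object" without committing to a concrete class. A sketch, assuming smart_open may return text or bytes depending on the mode:

    from abc import ABCMeta, abstractmethod
    from typing import IO, Any

    class FileSystem(metaclass=ABCMeta):
        @abstractmethod
        def open(self, path: str, **kwargs: Any) -> IO[Any]:
            """Open path and return a file-like object (text or binary,
            depending on the mode passed through to the implementation)."""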

return S3FileSystem()

def open(self, path: str, **kwargs):
transport_params = kwargs.update({'client': S3FileSystem._s3})
Collaborator:

how does an end user configure the s3 client?

Contributor Author:

Based on the boto3 docs there are three options. But in any case, I added the ability to configure S3FileSystem using a Config object.
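For reference, one way such a Config-driven setup could look; this is a sketch, not the PR's exact config model, and the field names are illustrative. Note that in the hunk quoted above, dict.update returns None, so transport_params has to be built separately rather than assigned from kwargs.update(...).

    from typing import Any

    import boto3
    import smart_open

    def open_s3(path: str, config: dict, **kwargs: Any) -> Any:
        # Hypothetical config fields; boto3 also supports env vars and
        # shared credential files, per its configuration docs.
        client = boto3.client(
            "s3",
            region_name=config.get("region_name"),
            endpoint_url=config.get("endpoint_url"),
            aws_access_key_id=config.get("aws_access_key_id"),
            aws_secret_access_key=config.get("aws_secret_access_key"),
        )
        # Build transport_params explicitly: dict.update() returns None.
        transport_params = {**kwargs, "client": client}
        return smart_open.open(path, transport_params=transport_params)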

@@ -273,15 +273,15 @@ def _iterate_file(self, file_status: FileStatus) -> Iterable[Tuple[int, Any]]:
def iterate_mce_file(self, path: str) -> Iterator[MetadataChangeEvent]:
schema = get_path_schema(path)
fs_class = fs_registry.get(schema)
fs = fs_class.create_fs()
fs = fs_class.create()
Collaborator:

we'll probably need to add a mechanism for passing config here from the recipe, but we can leave that for a follow up PR

@hsheth2 (Collaborator) commented Oct 15, 2023

@simaov looks like there are still a few small lint issues from isort

-from datahub.ingestion.source.fs.fs_base import FileSystem, FileInfo
-from typing import Iterable
 import os
 import pathlib
+from typing import Iterable
+
 import smart_open
+
+from datahub.ingestion.source.fs.fs_base import FileInfo, FileSystem

@hsheth2 hsheth2 added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter merge-pending-ci A PR that has passed review and should be merged once CI is green. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Feb 9, 2024
@hsheth2 (Collaborator) commented Feb 9, 2024

@simaov I went ahead and fixed up the code here. It looked like the read_mode config got lost, and supporting it again fixed the tests

@simaov (Contributor, author) commented Feb 22, 2024

Hi @hsheth2. Thanks for fixing up the code. To be honest, the idea was to unify reads: in general it does not matter what the source is, reading from a file can also be a stream read, and we could avoid the read mode entirely. But I am OK with it.

There is still one failing test. Is it related to the changes that were made?

@hsheth2 (Collaborator) commented Feb 23, 2024

@simaov doesn't look related. I just retriggered CI.

@hsheth2 (Collaborator) commented Mar 21, 2024

The smoke tests persistently fail with this error, which suggests something is actually broken here. Still need to investigate further, but it seems plausibly related to this change.

___________________________ test_create_data_product ___________________________

ingest_cleanup_data = None

    @tenacity.retry(
        stop=tenacity.stop_after_attempt(sleep_times), wait=tenacity.wait_fixed(sleep_sec)
    )
    @pytest.mark.dependency(depends=["test_healthchecks"])
    def test_create_data_product(ingest_cleanup_data):
        domain_urn = Urn("domain", [datahub_guid({"name": "Marketing"})])
        graph: DataHubGraph = DataHubGraph(config=DatahubClientConfig(server=get_gms_url()))
>       result = graph.execute_graphql(
            get_gql_query("tests/dataproduct/queries/add_dataproduct.graphql"),
            {
                "domainUrn": str(domain_urn),
                "name": "Test Data Product",
                "description": "Test Description",
            },
        )

tests/dataproduct/test_dataproduct.py:169: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = DataHubGraph: configured to talk to http://localhost:8080
query = 'mutation($domainUrn: String!, $name: String!, $description: String) {\n  createDataProduct(input: { properties: { nam...description:$description }, domainUrn:$domainUrn}) {\n    urn\n    type\n    properties {\n      name\n    }\n  }\n}\n'
variables = {'description': 'Test Description', 'domainUrn': 'urn:li:domain:e12349b93190625e55a98ab2c2c616eb', 'name': 'Test Data Product'}

    def execute_graphql(self, query: str, variables: Optional[Dict] = None) -> Dict:
        url = f"{self.config.server}/api/graphql"
    
        body: Dict = {
            "query": query,
        }
    
        if variables:
            body["variables"] = variables
    
        logger.debug(
            f"Executing graphql query: {query} with variables: {json.dumps(variables)}"
        )
        result = self._post_generic(url, body)
        if result.get("errors"):
>           raise GraphError(f"Error executing graphql query: {result['errors']}")
E           datahub.configuration.common.GraphError: Error executing graphql query: [{'message': 'The Domain provided dos not exist', 'locations': [{'line': 2, 'column': 3}], 'path': ['createDataProduct'], 'extensions': {'code': 400, 'type': 'BAD_REQUEST', 'classification': 'DataFetchingException'}}]

../metadata-ingestion/src/datahub/ingestion/graph/client.py:804: GraphError

@shirshanka shirshanka added the accepted An Issue that is confirmed as a bug by the DataHub Maintainers. label Jun 28, 2024
coderabbitai bot (Contributor) commented Jun 28, 2024


Commits

Files that changed from the base of the PR and between dd2b867 and cad285d.

Walkthrough

The recent updates introduce a modular file system plugin architecture for metadata-ingestion, supporting S3, local, and HTTP file systems. A new abstract class FileSystem sets the foundation, with specific implementations for each type of file system. Additionally, the get_filenames method in the file source now returns FileInfo objects, enhancing file management capabilities.

Changes

File Path | Summary
metadata-ingestion/setup.py | Added S3, local, and HTTP file system plugins in datahub.fs.plugins.
metadata-ingestion/src/.../fs_base.py | Introduced abstract FileSystem class and FileInfo data class for file system operations.
metadata-ingestion/src/.../fs_registry.py | Created plugin registry for file system plugins, registering them from entry points.
metadata-ingestion/src/.../http_fs.py | Developed HttpFileSystem class with methods for handling HTTP-based file operations.
metadata-ingestion/src/.../local_fs.py | Developed LocalFileSystem class for local file operations.
metadata-ingestion/src/.../s3_fs.py | Introduced S3 file system support, including classes and methods for interacting with Amazon S3 storage.
metadata-ingestion/src/.../source/file.py | Refactored get_filenames to return FileInfo objects, enhanced file reading and resource management.
metadata-ingestion/tests/unit/test_plugin_system.py | Added fs_registry to the list of registries in the test case.

Poem

In the meadow of code, plugins bloom,
With S3, local, and HTTP in tune.
Files dance as FileInfo comes to play,
Ingestion hums in a brand new way.
Resource handled with utmost care,
Data flows like the fresh summer air.
🐇—CodeRabbit whispers, “Hey, it’s all there!”



coderabbitai bot (Contributor) left a review:

Actionable comments posted: 4

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 62e6b7f and 1b42885.

Files selected for processing (8)
  • metadata-ingestion/setup.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/fs/fs_base.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/fs/fs_registry.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/fs/http_fs.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/fs/local_fs.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/fs/s3_fs.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/file.py (7 hunks)
  • metadata-ingestion/tests/unit/test_plugin_system.py (2 hunks)
Files not reviewed due to errors (2)
  • metadata-ingestion/src/datahub/ingestion/fs/fs_base.py (no review received)
  • metadata-ingestion/src/datahub/ingestion/fs/s3_fs.py (no review received)
Files skipped from review due to trivial changes (2)
  • metadata-ingestion/setup.py
  • metadata-ingestion/tests/unit/test_plugin_system.py
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/fs/local_fs.py

16-16: Use context handler for opening files

(SIM115)
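(For reference, SIM115 asks for the context-manager form, roughly as below; it flags bare open() calls whose handles may leak if an exception fires before close().)

    # Flagged: fp = open(path); ... ; fp.close()
    # Preferred: the context manager closes the handle even on exceptions.
    with open("example.json", "rb") as fp:
        data = fp.read()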

metadata-ingestion/src/datahub/ingestion/fs/s3_fs.py

57-57: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)
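(B904 asks for explicit exception chaining, along these hedged lines:)

    import botocore.exceptions

    def object_info(s3_client, bucket: str, key: str):
        try:
            return s3_client.head_object(Bucket=bucket, Key=key)
        except botocore.exceptions.ClientError as e:
            # "from e" keeps the original error visible as the cause,
            # distinguishing it from errors raised while handling it.
            raise FileNotFoundError(f"s3://{bucket}/{key} not found") from e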

metadata-ingestion/src/datahub/ingestion/source/file.py

295-296: Replace yield over for loop with yield from

Replace with yield from

(UP028)

Additional comments not posted (7)
metadata-ingestion/src/datahub/ingestion/fs/fs_registry.py (1)

1-5: Registry setup looks clean and efficient.

The use of PluginRegistry for FileSystem types and the registration from the entry point datahub.fs.plugins is a standard and effective way to handle plugin architectures in Python. This approach supports extensibility and modular design.

metadata-ingestion/src/datahub/ingestion/fs/http_fs.py (1)

14-15: Use of smart_open for file handling is appropriate.

The use of smart_open enables efficient handling of different types of streams with a uniform API, which is beneficial for an HTTP file system implementation.

metadata-ingestion/src/datahub/ingestion/source/file.py (5)

194-203: Review: Implementation of get_filenames method.

This method correctly utilizes the new file system abstraction to list files based on the schema derived from the path. The method is concise and effectively filters files based on the specified extension.


Line range hint 214-232: Review: get_workunits_internal method.

The method effectively generates work units based on the file content and configuration. The use of isinstance for type checks and the conditional logic for aspect filtering are correctly implemented.



266-287: Suggest performance improvement for _iterate_file_streaming.

While the method is correct, consider optimizing the element counting process to reduce overhead, especially for large files. Current implementation reads through the entire file just to count elements, which might not be efficient.
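One hedged way to avoid a separate counting pass is to count while yielding, for example with a streaming parser such as ijson (whether the source actually uses ijson here is an assumption; the report attribute below is hypothetical):

    import ijson

    def iterate_and_count(fp, report):
        # Count records as they stream past instead of pre-scanning the file.
        count = 0
        for row in ijson.items(fp, "item"):
            count += 1
            yield row
        report.total_parsed = count  # hypothetical report attribute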


Line range hint 97-135: Review: FileSourceReport class.

This class is well-designed to track and report on the progress of file ingestion. The methods for updating times and counts are clear and effectively utilize Python's datetime for precision.


367-371: Review: Static utility method close_if_possible.

This utility method correctly checks if the object has a close method before calling it, which is a good practice to avoid attribute errors.
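A close_if_possible consistent with that description might read as follows (shown here as a plain function for brevity):

    from typing import Any

    def close_if_possible(stream: Any) -> None:
        # Close only objects that actually expose a callable close().
        if hasattr(stream, "close") and callable(stream.close):
            stream.close()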

metadata-ingestion/src/datahub/ingestion/fs/http_fs.py (outdated review thread, resolved)
Comment on lines 24 to 25
else:
raise Exception(f"Failed to process {path}")
coderabbitai bot (Contributor):

Improve error handling by providing more specific error information.

-            raise Exception(f"Failed to process {path}")
+            raise FileNotFoundError(f"The specified path does not exist: {path}")

This change makes the error message more informative and specific, which aids in troubleshooting.


metadata-ingestion/src/datahub/ingestion/source/file.py (outdated review thread, resolved)
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
coderabbitai bot (Contributor) left a review:

Actionable comments posted: 4

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 1b42885 and dd2b867.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/file.py (7 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/file.py

296-297: Replace yield over for loop with yield from

Replace with yield from

(UP028)

Comment on lines +302 to +307
# TODO: Remove this method, as it appears to be unused.
schema = get_path_schema(path)
fs_class = fs_registry.get(schema)
fs = fs_class.create()
file_status = fs.file_status(path)
for obj in self._iterate_file(file_status):
coderabbitai bot (Contributor):

Consider removing or refactoring the unused method iterate_mce_file.

The method is marked with a TODO comment indicating it is unused. If it is indeed unused, it should be removed to clean up the codebase.

-    def iterate_mce_file(self, path: str) -> Iterator[MetadataChangeEvent]:
-        # TODO: Remove this method, as it appears to be unused.
-        schema = get_path_schema(path)
-        fs_class = fs_registry.get(schema)
-        fs = fs_class.create()
-        file_status = fs.file_status(path)
-        for obj in self._iterate_file(file_status):
-            mce: MetadataChangeEvent = MetadataChangeEvent.from_obj(obj)
-            yield mce
+    # Method removed as it was unused.

If this method is still needed for future compatibility or other reasons, consider documenting its purpose more clearly or refactoring it to fit the current architecture better.


Comment on lines +194 to +203
def get_filenames(self) -> Iterable[FileInfo]:
path_str = str(self.config.path)
schema = get_path_schema(path_str)
fs_class = fs_registry.get(schema)
fs = fs_class.create()
for file_info in fs.list(path_str):
if file_info.is_file and file_info.path.endswith(
self.config.file_extension
):
yield file_info
coderabbitai bot (Contributor):

Refactor the get_filenames method to enhance clarity and performance.

The method is well-implemented but could benefit from a minor refactor to enhance clarity and reduce the number of lines.

-        path_str = str(self.config.path)
-        schema = get_path_schema(path_str)
-        fs_class = fs_registry.get(schema)
-        fs = fs_class.create()
-        for file_info in fs.list(path_str):
+        schema = get_path_schema(str(self.config.path))
+        fs = fs_registry.get(schema).create()
+        for file_info in fs.list(self.config.path):

This change reduces the redundancy by removing the unnecessary path_str variable and combines some operations into single lines.


Comment on lines +290 to +299
def _iterate_file_batch(self, fp: Any) -> Iterable[Any]:
# Read the file.
contents = json.load(fp)

# Maintain backwards compatibility with the single-object format.
if isinstance(contents, list):
for row in contents:
yield row
else:
yield contents
coderabbitai bot (Contributor):

Utilize yield from for improved readability and performance in _iterate_file_batch.

The method can be optimized by using yield from when iterating through lists.

-        if isinstance(contents, list):
-            for row in contents:
-                yield row
-        else:
-            yield contents
+        yield from contents if isinstance(contents, list) else (contents,)

This change leverages yield from for better performance and readability, as suggested by the static analysis tool.


@hsheth2 hsheth2 removed the pending-submitter-response Issue/request has been reviewed but requires a response from the submitter label Jun 28, 2024
@hsheth2 hsheth2 merged commit 8b4e302 into datahub-project:master Jul 1, 2024
57 of 58 checks passed
yoonhyejin pushed a commit that referenced this pull request Jul 16, 2024
Co-authored-by: oleksandrsimonchuk <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>
Co-authored-by: Tamas Nemeth <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
aviv-julienjehannet pushed a commit to aviv-julienjehannet/datahub that referenced this pull request Jul 17, 2024
…ahub-project#8415)

Co-authored-by: oleksandrsimonchuk <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>
Co-authored-by: Tamas Nemeth <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Labels

  • accepted: An Issue that is confirmed as a bug by the DataHub Maintainers.
  • community-contribution: PR or Issue raised by member(s) of DataHub Community.
  • ingestion: PR or Issue related to the ingestion of metadata.
  • merge-pending-ci: A PR that has passed review and should be merged once CI is green.