feat(ingest): add and use file system abstraction in file source #8415

simaov · 2023-07-13T13:43:25Z

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Summary by CodeRabbit

New Features
- Introduced support for S3, local file system, and HTTP file system plugins for enhanced data ingestion capabilities.
- Added new FileInfo class to improve file management.
Improvements
- Enhanced file reading logic to handle streaming and batch modes more efficiently.
- Updated file handling methods to utilize the new FileInfo class.
- Improved resource management with adjustments to close and close_if_possible methods.
Bug Fixes
- Fixed issues in the get_filenames method by returning an iterable of FileInfo objects instead of strings.

hsheth2 · 2023-07-13T18:09:03Z

@simaov thanks for the PR - I haven't given it a detailed look yet, but overall seems pretty nifty.

I am wondering if using smart-open (https://pypi.org/project/smart-open/) or requests-file might yield a similar outcome with less code on our side. I think we have dependencies on both of those libraries in certain places already.

simaov · 2023-07-14T08:08:13Z

Hi @hsheth2, thank you for your comment. Sorry, I am not very familiar with the Python ecosystem. I will have a look. Thank you.

simaov · 2023-07-14T11:18:44Z

Hi @hsheth2, you are right, smart_open does what we need, so I used it to read data from different sources. Thank you.

hsheth2 · 2023-07-27T17:52:19Z

metadata-ingestion/src/datahub/ingestion/source/file.py

-            return [str(self.config.path)]
+    def get_filenames(self) -> Iterable[FileStatus]:
+        path_str = str(self.config.path)
+        fs = FileSystem.get(path_str)


this overall feels a bit complex

is there a reason we can't use smart_open directly without building wrappers for each thing (e.g. http, local, s3)?

If I am not mistaken, smart_open has no methods to list a path or get file info; it can only open a stream. Currently there are 3 file systems: local, s3, and http. In the future there can be more. If someone needs to add Azure support, for instance, the only thing that has to be done is to implement FileSystem for Azure. That's it.
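The argument above can be illustrated with a minimal sketch. It assumes an interface with `open`, `file_status`, and `list` methods plus a `FileInfo` record; those names appear elsewhere in this PR, but the method bodies and the `InMemoryFileSystem` backend below are hypothetical and only show the shape of a per-system implementation.

```python
import io
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import IO, Any, Dict, Iterable


@dataclass
class FileInfo:
    path: str
    size: int
    is_file: bool


class FileSystem(ABC):
    """Operations the file source needs beyond opening a stream."""

    @abstractmethod
    def open(self, path: str, **kwargs: Any) -> IO[Any]: ...

    @abstractmethod
    def file_status(self, path: str) -> FileInfo: ...

    @abstractmethod
    def list(self, path: str) -> Iterable[FileInfo]: ...


class InMemoryFileSystem(FileSystem):
    """Hypothetical backend, used only to illustrate the interface."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = files

    def open(self, path: str, **kwargs: Any) -> IO[Any]:
        return io.BytesIO(self._files[path])

    def file_status(self, path: str) -> FileInfo:
        return FileInfo(path=path, size=len(self._files[path]), is_file=True)

    def list(self, path: str) -> Iterable[FileInfo]:
        # Treat the argument as a prefix, similar to listing a directory.
        for name in sorted(self._files):
            if name.startswith(path):
                yield self.file_status(name)
```

Under this assumption, supporting a new storage system would mean adding one more subclass, while the calling code keeps working against `FileSystem`.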

hsheth2 · 2023-08-03T03:20:52Z

@simaov you're right - I think we probably do need this sort of per-system implementation, and the complexity is warranted here

@asikowitz will chime in with some more detailed comments, but the main things I'm thinking about here:

- I don't want the file source to have a hard dependency on boto3 / other libraries. If they're installed, we should use them, but we shouldn't fail if they're missing. We already have a "registry" abstraction (the same one we use for sources/sinks/transformers) that supports this lazy loading, so ideally we reuse that.
- The FileSystem class should probably be an abstract class
- The file sizes aren't that important - for stuff like s3 where getting file size is a whole extra API call, is that really worth it just so we can show it in the file source report? Not sure though - I could go either way on this

simaov · 2023-08-14T10:11:14Z

@hsheth2, thanks for your comments. I agree that we need to avoid a hard dependency. Could you please share more details about the registry and how it can be used? Maybe some examples, or pointers to where it is used in the project.

hsheth2 · 2023-08-15T17:30:45Z

@simaov here's where we're setting up the registry for sources:

datahub/metadata-ingestion/src/datahub/ingestion/source/source_registry.py

Line 7 in d733363

source_registry = PluginRegistry[Source]()

, and we use it here

datahub/metadata-ingestion/src/datahub/ingestion/run/pipeline.py

Line 223 in eac003c

source_class = source_registry.get(source_type)

In this case, we should just use .register("s3", "path.to.import") or similar instead of using register_from_entrypoint. Then you can just use .get like a normal dictionary, and it will handle the lazy loading
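The lazy-loading idea described above can be sketched roughly as follows. This is not DataHub's `PluginRegistry`; `LazyRegistry` is a hypothetical stand-in that stores a dotted import path at registration time and imports it only on the first `get`, so a missing optional dependency surfaces only when that key is actually requested.

```python
import importlib
from typing import Any, Dict


class LazyRegistry:
    """Hypothetical registry: maps a key to an import path, imports on first get()."""

    def __init__(self) -> None:
        self._paths: Dict[str, str] = {}
        self._loaded: Dict[str, Any] = {}

    def register(self, key: str, import_path: str) -> None:
        # Only the dotted path is stored; nothing is imported yet.
        self._paths[key] = import_path

    def get(self, key: str) -> Any:
        if key not in self._loaded:
            module_name, _, attr = self._paths[key].rpartition(".")
            module = importlib.import_module(module_name)  # may raise ImportError
            self._loaded[key] = getattr(module, attr)
        return self._loaded[key]


registry = LazyRegistry()
registry.register("ordered", "collections.OrderedDict")
# Deliberately points at a module that does not exist, to mimic a missing extra.
registry.register("s3", "module_that_is_not_installed.S3FileSystem")
```

Registering the `s3` entry succeeds even though its module is absent; only `registry.get("s3")` raises `ImportError`, which is the behavior the comment asks for.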

simaov · 2023-09-08T10:45:33Z

Hi @hsheth2. I tried to address your comments:

- Created a separate registry for file systems, so it is now lazy and easy to manage.
- Made FileSystem abstract.
- I tried to keep the original counters and calculations. If you think we don't need them, I can remove them.

hsheth2

overall this looks pretty good

had a few questions about naming / config

hsheth2 · 2023-09-12T21:39:03Z

metadata-ingestion/src/datahub/ingestion/source/fs/fs_base.py

+class FileSystem(metaclass=ABCMeta):
+
+    @classmethod
+    def create_fs(cls) -> "FileSystem":


would be more consistent to just call this create

also should this method be taking kwargs?

hsheth2 · 2023-09-12T21:40:02Z

metadata-ingestion/src/datahub/ingestion/source/fs/fs_base.py

+
+
+@dataclass
+class FileStatus:


maybe we should call this FileInfo?

hsheth2 · 2023-09-12T21:40:31Z

metadata-ingestion/src/datahub/ingestion/source/fs/fs_base.py

+        raise NotImplementedError('File system implementations must implement "create_fs"')
+
+    @abstractmethod
+    def open(self, path: str, **kwargs):


is it possible to add a return type annotation here, or is it too messy?

I think at this point we don't know the actual return type; it depends on the underlying implementation. The smart_open docs only say that the open method returns "a file-like object".
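One possible way to annotate a "file-like object" without committing to a concrete class is a structural type. The sketch below is only an illustration of that option, not what the PR adopted; `ReadableFile` and `read_all` are hypothetical names.

```python
import io
from typing import Any, Protocol


class ReadableFile(Protocol):
    """Hypothetical structural type: anything with read() and close()."""

    def read(self, size: int = -1) -> Any: ...

    def close(self) -> None: ...


def read_all(fp: ReadableFile) -> Any:
    # Works for binary and text streams alike, since only read()/close() are used.
    try:
        return fp.read()
    finally:
        fp.close()
```

Both `io.BytesIO` and `io.StringIO` satisfy this protocol, so an `open` method could be annotated with it regardless of which backend produced the stream.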

hsheth2 · 2023-09-12T21:41:39Z

metadata-ingestion/src/datahub/ingestion/source/fs/s3_fs.py

+        return S3FileSystem()
+
+    def open(self, path: str, **kwargs):
+        transport_params = kwargs.update({'client': S3FileSystem._s3})


how does an end user configure the s3 client?

Based on the boto3 docs there are 3 options. In any case, I added the ability to configure S3FileSystem using a Config object.
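A side note on the quoted `open` line: `dict.update` mutates the dictionary in place and returns `None`, so a name assigned from it is bound to `None`. A minimal sketch of a merge that returns a dictionary instead; `with_client`, the `buffer_size` key, and the `"fake-client"` value are placeholders, not part of the PR.

```python
from typing import Any, Dict


def with_client(kwargs: Dict[str, Any], client: Any) -> Dict[str, Any]:
    """Return a new dict with 'client' added, leaving kwargs untouched."""
    return {**kwargs, "client": client}


params = {"buffer_size": 1024}  # placeholder option
# update() mutates the copy in place; its return value is None.
returned = params.copy().update({"client": "fake-client"})
merged = with_client(params, "fake-client")
```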

hsheth2 · 2023-10-02T20:16:02Z

metadata-ingestion/src/datahub/ingestion/source/file.py

@@ -273,15 +273,15 @@ def _iterate_file(self, file_status: FileStatus) -> Iterable[Tuple[int, Any]]:
    def iterate_mce_file(self, path: str) -> Iterator[MetadataChangeEvent]:
        schema = get_path_schema(path)
        fs_class = fs_registry.get(schema)
-        fs = fs_class.create_fs()
+        fs = fs_class.create()


we'll probably need to add a mechanism for passing config here from the recipe, but we can leave that for a follow up PR
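A rough sketch of what such a mechanism could look like, assuming `create` accepts keyword arguments. Everything here is hypothetical: `path_scheme` is only a stand-in for `get_path_schema` (whose real behavior is not shown in this thread), and `DemoFileSystem`, `REGISTRY`, and `make_fs` are illustrative names.

```python
from typing import Any, Dict, Type
from urllib.parse import urlparse


def path_scheme(path: str) -> str:
    """Hypothetical stand-in for get_path_schema: map a path to a registry key."""
    scheme = urlparse(path).scheme
    if scheme in ("http", "https"):
        return "http"
    # Paths without a scheme are treated as local files in this sketch.
    return scheme or "file"


class DemoFileSystem:
    """Hypothetical file system whose create() accepts recipe-style options."""

    def __init__(self, options: Dict[str, Any]) -> None:
        self.options = options

    @classmethod
    def create(cls, **kwargs: Any) -> "DemoFileSystem":
        return cls(dict(kwargs))


REGISTRY: Dict[str, Type[DemoFileSystem]] = {"s3": DemoFileSystem, "file": DemoFileSystem}


def make_fs(path: str, fs_config: Dict[str, Any]) -> DemoFileSystem:
    # Look up the class by scheme, then forward the config to create().
    fs_class = REGISTRY[path_scheme(path)]
    return fs_class.create(**fs_config)
```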

hsheth2 · 2023-10-15T20:36:41Z

@simaov looks like there's still a few small lint issues from isort

-from datahub.ingestion.source.fs.fs_base import FileSystem, FileInfo
-from typing import Iterable
 import os
 import pathlib
+from typing import Iterable
+
 import smart_open
+
+from datahub.ingestion.source.fs.fs_base import FileInfo, FileSystem

hsheth2 · 2024-02-09T22:04:01Z

@simaov I went ahead and fixed up the code here. It looked like the read_mode config got lost, and supporting it again fixed the tests

simaov · 2024-02-22T11:02:05Z

Hi @hsheth2. Thanks for fixing up the code. To be honest, the idea was to unify reads: in general the source does not matter, reading from a file can also be a streaming read, and we could have avoided the read mode entirely. But I am OK with it.

One test is still failing. Is it related to the changes that were made?

hsheth2 · 2024-02-23T01:18:09Z

@simaov doesn't look related. I just retriggered CI.

hsheth2 · 2024-03-21T05:26:12Z

The smoke tests persistently fail with this error, which suggests something is actually broken here. Still need to investigate further, but it seems plausibly related to this change.

___________________________ test_create_data_product ___________________________

ingest_cleanup_data = None

    @tenacity.retry(
        stop=tenacity.stop_after_attempt(sleep_times), wait=tenacity.wait_fixed(sleep_sec)
    )
    @pytest.mark.dependency(depends=["test_healthchecks"])
    def test_create_data_product(ingest_cleanup_data):
        domain_urn = Urn("domain", [datahub_guid({"name": "Marketing"})])
        graph: DataHubGraph = DataHubGraph(config=DatahubClientConfig(server=get_gms_url()))
>       result = graph.execute_graphql(
            get_gql_query("tests/dataproduct/queries/add_dataproduct.graphql"),
            {
                "domainUrn": str(domain_urn),
                "name": "Test Data Product",
                "description": "Test Description",
            },
        )

tests/dataproduct/test_dataproduct.py:169: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = DataHubGraph: configured to talk to http://localhost:8080
query = 'mutation($domainUrn: String!, $name: String!, $description: String) {\n  createDataProduct(input: { properties: { nam...description:$description }, domainUrn:$domainUrn}) {\n    urn\n    type\n    properties {\n      name\n    }\n  }\n}\n'
variables = {'description': 'Test Description', 'domainUrn': 'urn:li:domain:e12349b93190625e55a98ab2c2c616eb', 'name': 'Test Data Product'}

    def execute_graphql(self, query: str, variables: Optional[Dict] = None) -> Dict:
        url = f"{self.config.server}/api/graphql"
    
        body: Dict = {
            "query": query,
        }
    
        if variables:
            body["variables"] = variables
    
        logger.debug(
            f"Executing graphql query: {query} with variables: {json.dumps(variables)}"
        )
        result = self._post_generic(url, body)
        if result.get("errors"):
>           raise GraphError(f"Error executing graphql query: {result['errors']}")
E           datahub.configuration.common.GraphError: Error executing graphql query: [{'message': 'The Domain provided dos not exist', 'locations': [{'line': 2, 'column': 3}], 'path': ['createDataProduct'], 'extensions': {'code': 400, 'type': 'BAD_REQUEST', 'classification': 'DataFetchingException'}}]

../metadata-ingestion/src/datahub/ingestion/graph/client.py:804: GraphError

coderabbitai · 2024-06-28T21:25:17Z

Commits

Files that changed from the base of the PR and between dd2b867 and cad285d.

Walkthrough

The recent updates introduce a modular file system plugin architecture for metadata-ingestion, supporting S3, local, and HTTP file systems. A new abstract class FileSystem sets the foundation, with specific implementations for each type of file system. Additionally, the get_filenames method in the file source now returns FileInfo objects, enhancing file management capabilities.

Changes

| File Path | Summary |
| --- | --- |
| `metadata-ingestion/setup.py` | Added S3, local, and HTTP file system plugins in `datahub.fs.plugins`. |
| `metadata-ingestion/src/.../fs_base.py` | Introduced abstract `FileSystem` class and `FileInfo` data class for file system operations. |
| `metadata-ingestion/src/.../fs_registry.py` | Created plugin registry for file system plugins, registering them from entry points. |
| `metadata-ingestion/src/.../http_fs.py` | Developed `HttpFileSystem` class with methods for handling HTTP-based file operations. |
| `metadata-ingestion/src/.../local_fs.py` | Developed `LocalFileSystem` class for local file operations. |
| `metadata-ingestion/src/.../s3_fs.py` | Introduced S3 file system support, including classes and methods for interacting with Amazon S3 storage. |
| `metadata-ingestion/src/.../source/file.py` | Refactored `get_filenames` to return `FileInfo` objects, enhanced file reading and resource management. |
| `metadata-ingestion/tests/unit/test_plugin_system.py` | Added `fs_registry` to the list of registries in the test case. |

coderabbitai

Actionable comments posted: 4

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 62e6b7f and 1b42885.

Files selected for processing (8)

metadata-ingestion/setup.py (1 hunks)
metadata-ingestion/src/datahub/ingestion/fs/fs_base.py (1 hunks)
metadata-ingestion/src/datahub/ingestion/fs/fs_registry.py (1 hunks)
metadata-ingestion/src/datahub/ingestion/fs/http_fs.py (1 hunks)
metadata-ingestion/src/datahub/ingestion/fs/local_fs.py (1 hunks)
metadata-ingestion/src/datahub/ingestion/fs/s3_fs.py (1 hunks)
metadata-ingestion/src/datahub/ingestion/source/file.py (7 hunks)
metadata-ingestion/tests/unit/test_plugin_system.py (2 hunks)

Files not reviewed due to errors (2)

metadata-ingestion/src/datahub/ingestion/fs/fs_base.py (no review received)
metadata-ingestion/src/datahub/ingestion/fs/s3_fs.py (no review received)

Files skipped from review due to trivial changes (2)

metadata-ingestion/setup.py
metadata-ingestion/tests/unit/test_plugin_system.py

Additional context used

Ruff

metadata-ingestion/src/datahub/ingestion/fs/local_fs.py

16-16: Use context handler for opening files

(SIM115)

metadata-ingestion/src/datahub/ingestion/fs/s3_fs.py

57-57: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

metadata-ingestion/src/datahub/ingestion/source/file.py

295-296: Replace yield over for loop with yield from

Replace with yield from

(UP028)
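As an illustration of the B904 finding above, here is a minimal sketch of exception chaining with `raise ... from err`. The `FileSystemError` class and `parse_size` function are hypothetical and unrelated to the actual code in `s3_fs.py`.

```python
class FileSystemError(Exception):
    """Hypothetical wrapper error for this sketch."""


def parse_size(raw: str) -> int:
    try:
        return int(raw)
    except ValueError as err:
        # 'from err' records the original exception as __cause__.
        raise FileSystemError(f"Invalid size: {raw!r}") from err
```

With the `from err` clause, the traceback shows the original `ValueError` as the direct cause rather than as an error that occurred while handling another one.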

Additional comments not posted (7)

metadata-ingestion/src/datahub/ingestion/fs/fs_registry.py (1)

1-5: Registry setup looks clean and efficient.

The use of PluginRegistry for FileSystem types and the registration from the entry point datahub.fs.plugins is a standard and effective way to handle plugin architectures in Python. This approach supports extensibility and modular design.

metadata-ingestion/src/datahub/ingestion/fs/http_fs.py (1)

14-15: Use of smart_open for file handling is appropriate.

The use of smart_open enables efficient handling of different types of streams with a uniform API, which is beneficial for an HTTP file system implementation.

metadata-ingestion/src/datahub/ingestion/source/file.py (5)

194-203: Review: Implementation of get_filenames method.

This method correctly utilizes the new file system abstraction to list files based on the schema derived from the path. The method is concise and effectively filters files based on the specified extension.

Line range hint 214-232: Review: get_workunits_internal method.

The method effectively generates work units based on the file content and configuration. The use of isinstance for type checks and the conditional logic for aspect filtering are correctly implemented.

Tools

Ruff

295-296: Replace yield over for loop with yield from

Replace with yield from

(UP028)

266-287: Suggest performance improvement for _iterate_file_streaming.

While the method is correct, consider optimizing the element counting process to reduce overhead, especially for large files. Current implementation reads through the entire file just to count elements, which might not be efficient.

Line range hint 97-135: Review: FileSourceReport class.

This class is well-designed to track and report on the progress of file ingestion. The methods for updating times and counts are clear and effectively utilize Python's datetime for precision.

367-371: Review: Static utility method close_if_possible.

This utility method correctly checks if the object has a close method before calling it, which is a good practice to avoid attribute errors.
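The behavior described for `close_if_possible` can be sketched as below. This is a standalone approximation based only on the description above, not the method from `file.py`.

```python
import io
from typing import Any


def close_if_possible(stream: Any) -> None:
    """Call stream.close() when such a callable attribute exists."""
    close = getattr(stream, "close", None)
    if callable(close):
        close()
```

Objects without a `close` attribute, including `None`, are simply ignored, which avoids an `AttributeError` when the reader returned something that is not a real stream.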

metadata-ingestion/src/datahub/ingestion/fs/http_fs.py

coderabbitai · 2024-06-28T21:28:39Z

metadata-ingestion/src/datahub/ingestion/fs/local_fs.py

+        else:
+            raise Exception(f"Failed to process {path}")


Improve error handling by providing more specific error information.

-            raise Exception(f"Failed to process {path}")
+            raise FileNotFoundError(f"The specified path does not exist: {path}")

This change makes the error message more informative and specific, which aids in troubleshooting.
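For context, a minimal sketch of a local listing helper that raises the more specific error. The `list_local` function is hypothetical and only approximates the file/directory/else structure implied by the snippet above.

```python
import os
from typing import Iterator


def list_local(path: str) -> Iterator[str]:
    """Yield file paths for a file or the direct children of a directory."""
    if os.path.isfile(path):
        yield path
    elif os.path.isdir(path):
        for name in sorted(os.listdir(path)):
            yield os.path.join(path, name)
    else:
        raise FileNotFoundError(f"The specified path does not exist: {path}")
```

Because this is a generator, the error is raised when the result is first iterated, not when `list_local` is called.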

metadata-ingestion/src/datahub/ingestion/fs/local_fs.py

metadata-ingestion/src/datahub/ingestion/source/file.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai

Actionable comments posted: 4

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 1b42885 and dd2b867.

Files selected for processing (1)

metadata-ingestion/src/datahub/ingestion/source/file.py (7 hunks)

Additional context used

Ruff

metadata-ingestion/src/datahub/ingestion/source/file.py

296-297: Replace yield over for loop with yield from

Replace with yield from

(UP028)

coderabbitai · 2024-06-28T21:32:07Z

metadata-ingestion/src/datahub/ingestion/source/file.py

+        # TODO: Remove this method, as it appears to be unused.
+        schema = get_path_schema(path)
+        fs_class = fs_registry.get(schema)
+        fs = fs_class.create()
+        file_status = fs.file_status(path)
+        for obj in self._iterate_file(file_status):


Consider removing or refactoring the unused method iterate_mce_file.

The method is marked with a TODO comment indicating it is unused. If it is indeed unused, it should be removed to clean up the codebase.

-    def iterate_mce_file(self, path: str) -> Iterator[MetadataChangeEvent]:
-        # TODO: Remove this method, as it appears to be unused.
-        schema = get_path_schema(path)
-        fs_class = fs_registry.get(schema)
-        fs = fs_class.create()
-        file_status = fs.file_status(path)
-        for obj in self._iterate_file(file_status):
-            mce: MetadataChangeEvent = MetadataChangeEvent.from_obj(obj)
-            yield mce
+    # Method removed as it was unused.

If this method is still needed for future compatibility or other reasons, consider documenting its purpose more clearly or refactoring it to fit the current architecture better.

coderabbitai · 2024-06-28T21:32:07Z

metadata-ingestion/src/datahub/ingestion/source/file.py

+    def get_filenames(self) -> Iterable[FileInfo]:
+        path_str = str(self.config.path)
+        schema = get_path_schema(path_str)
+        fs_class = fs_registry.get(schema)
+        fs = fs_class.create()
+        for file_info in fs.list(path_str):
+            if file_info.is_file and file_info.path.endswith(
+                self.config.file_extension
+            ):
+                yield file_info


Refactor the get_filenames method to enhance clarity and performance.

The method is well-implemented but could benefit from a minor refactor to enhance clarity and reduce the number of lines.

-        path_str = str(self.config.path)
-        schema = get_path_schema(path_str)
-        fs_class = fs_registry.get(schema)
-        fs = fs_class.create()
-        for file_info in fs.list(path_str):
+        schema = get_path_schema(str(self.config.path))
+        fs = fs_registry.get(schema).create()
+        for file_info in fs.list(self.config.path):

This change reduces the redundancy by removing the unnecessary path_str variable and combines some operations into single lines.

coderabbitai · 2024-06-28T21:32:07Z

metadata-ingestion/src/datahub/ingestion/source/file.py

+    def _iterate_file_batch(self, fp: Any) -> Iterable[Any]:
+        # Read the file.
+        contents = json.load(fp)
+
+        # Maintain backwards compatibility with the single-object format.
+        if isinstance(contents, list):
+            for row in contents:
+                yield row
+        else:
+            yield contents


Utilize yield from for improved readability and performance in _iterate_file_batch.

The method can be optimized by using yield from when iterating through lists.

-        if isinstance(contents, list):
-            for row in contents:
-                yield row
-        else:
-            yield contents
+        yield from contents if isinstance(contents, list) else (contents,)

This change leverages yield from for better performance and readability, as suggested by the static analysis tool.
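A standalone sketch of the suggested form, to show that it keeps the single-object fallback. `iterate_batch` is a hypothetical free function mirroring `_iterate_file_batch`, with the file pointer assumed to be a text stream.

```python
import io
import json
from typing import IO, Any, Iterator


def iterate_batch(fp: IO[str]) -> Iterator[Any]:
    contents = json.load(fp)
    # A top-level list yields each row; any other value is yielded as one item.
    yield from contents if isinstance(contents, list) else (contents,)
```

The conditional expression is evaluated first, so `yield from` always receives an iterable: either the parsed list or a one-element tuple.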

Tools

Ruff

296-297: Replace yield over for loop with yield from

Replace with yield from

(UP028)

metadata-ingestion/src/datahub/ingestion/source/file.py

Co-authored-by: oleksandrsimonchuk <[email protected]> Co-authored-by: Harshal Sheth <[email protected]> Co-authored-by: Tamas Nemeth <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

simaov changed the title ~~Remote ingest mcp~~ Add and use file system abstraction in file source Jul 13, 2023

vercel bot deployed to Preview July 13, 2023 13:57 View deployment

github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Jul 13, 2023

vercel bot deployed to Preview July 14, 2023 11:21 View deployment

anshbansal added the community-contribution PR or Issue raised by member(s) of DataHub Community label Jul 17, 2023

hsheth2 reviewed Jul 27, 2023

simaov requested a review from hsheth2 August 2, 2023 13:55

maggiehays assigned hsheth2 Aug 15, 2023

asikowitz self-requested a review August 17, 2023 16:09

vercel bot deployed to Preview September 8, 2023 10:20 View deployment

hsheth2 reviewed Sep 12, 2023

vercel bot deployed to Preview September 26, 2023 12:18 View deployment

hsheth2 approved these changes Oct 2, 2023

hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Oct 2, 2023

simaov force-pushed the remote-ingest-mcp branch from 87f2583 to c57d12c Compare October 6, 2023 10:41

vercel bot deployed to Preview October 6, 2023 10:58 View deployment

vercel bot deployed to Preview October 11, 2023 12:07 View deployment

simaov force-pushed the remote-ingest-mcp branch from 1ee9e09 to 3c2858f Compare October 12, 2023 08:18

vercel bot deployed to Preview October 12, 2023 08:34 View deployment

simaov force-pushed the remote-ingest-mcp branch from 3c2858f to a47b5ea Compare October 19, 2023 09:43

vercel bot deployed to Preview October 19, 2023 10:30 View deployment

feat(ingestion): add file system abstraction

17bd6c6

vercel bot deployed to Preview February 9, 2024 22:20 View deployment

Merge branch 'master' into remote-ingest-mcp

5ae8227

vercel bot deployed to Preview February 23, 2024 01:45 View deployment

Merge branch 'master' into remote-ingest-mcp

6936d40

vercel bot deployed to Preview March 6, 2024 03:48 View deployment

Merge branch 'master' into remote-ingest-mcp

e30a43b

vercel bot deployed to Preview March 21, 2024 08:46 View deployment

shirshanka added the accepted An Issue that is confirmed as a bug by the DataHub Maintainers. label Jun 28, 2024

Merge branch 'master' into remote-ingest-mcp

1b42885

fix bug

dd2b867

coderabbitai bot reviewed Jun 28, 2024

Update metadata-ingestion/src/datahub/ingestion/fs/http_fs.py

9b568d9

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai bot reviewed Jun 28, 2024

hsheth2 removed the pending-submitter-response Issue/request has been reviewed but requires a response from the submitter label Jun 28, 2024

vercel bot deployed to Preview June 28, 2024 22:02 View deployment

hsheth2 added 2 commits June 28, 2024 16:04

fix formatting

9a01cd7

tweak

cad285d

vercel bot deployed to Preview June 28, 2024 23:42 View deployment

hsheth2 merged commit 8b4e302 into datahub-project:master Jul 1, 2024
57 of 58 checks passed

hsheth2 mentioned this pull request Jul 3, 2024

fix(smoke-test): add suffix in temp file creation #10841

Merged
