-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using version uuid
in datachain pull
#621
Changes from all commits
b5afbb2
dd6e125
12ae692
7ef4352
21b62b5
80b64ce
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -400,6 +400,18 @@ def get_parser() -> ArgumentParser: # noqa: PLR0915 | |
"--edatachain-file", | ||
help="Use a different filename for the resulting .edatachain file", | ||
) | ||
parse_pull.add_argument( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [C] I still don't know if it is better to expose these or uuid to the user There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we should let client work with name / version. |
||
"--local-name", | ||
action="store", | ||
default=None, | ||
help="Name of the local dataset", | ||
) | ||
parse_pull.add_argument( | ||
"--local-version", | ||
action="store", | ||
default=None, | ||
help="Version of the local dataset", | ||
) | ||
|
||
parse_edit_dataset = subp.add_parser( | ||
"edit-dataset", parents=[parent_parser], description="Edit dataset metadata" | ||
|
@@ -1207,6 +1219,8 @@ def main(argv: Optional[list[str]] = None) -> int: # noqa: C901, PLR0912, PLR09 | |
catalog.pull_dataset( | ||
args.dataset, | ||
args.output, | ||
local_ds_name=args.local_name, | ||
local_ds_version=args.local_version, | ||
no_cp=args.no_cp, | ||
force=bool(args.force), | ||
edatachain=args.edatachain, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Q] Why not use
uuid
instead ofremote_ds_name
+remote_ds_version
&local_ds_name
+local_ds_version
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, there are other APIs that use
name
+version
as well, but I would leave that to followup issue not to bloat this one too much.In general, user does use explicit name and version when initiating pull, e.g
datachain pull ds://dogs@v3
but the use ofuuid
in the process after initial get dataset info would fix potential issue of someone renaming remote dataset right in the middle of pull process (not sue how likely is this to happen but it's still the potential risk)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of this needs to go into this PR but I wanted to mention it and this seems like an ok place.
I would advocate for hiding local/remote name/version from users as much as possible. (From experience) I think the concept will only confuse users. Only exposing it when we absolutely need to makes sense to me
For example:
If the user tries to pull
cat@v1
under the following circumstances we should throw an error stating there is a dataset with that name already and the uuid doesn't match. Ask them to rename the local dataset (as SaaS is the source of truth):I think we should update the output of
ls datasets
to have the same format as the above or at least give it the ability to show these discrepancies somehow.I am now second-guessing myself and wondering whether the local/remote name and version are that bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's how it works in this PR:
uuid
exists -> skip pulling and printing "Local copy of dataset cats@v1 already present". If it has different name locally it says "Local copy of dataset cats@v1 already present as dataset other_cats@v5"uuid
and there is already local dataset name + version present that holds completely different data we throw error with message "Local dataset cats@v1 already exists, please choose different local dataset name or version"I didn't want to mention
uuid
anywhere as I'm not sure yet if that's something we should bother user with, but maybe I could enhance message from 2) by mentioning it ... "EDIT: Changed it to "Local dataset cats@v1 already exists with different uuid, please choose different local dataset name or version"