[FEATURE] Delete branch #2341

Open
1 task
josefransaenz opened this issue May 9, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@josefransaenz

🚨🚨 Feature Request

It would be useful to delete unused or corrupted branches, or at least be able to rename them.

  • Related to an existing Issue
  • [X] A new implementation (Improvement, Extension)

Is your feature request related to a problem?

Kind of. Sometimes I need to discard branches because they are corrupted or unmergeable (my datasets are quite complex, and I still need to dedicate time to reproducing the problems in a simple way in order to open an issue). For instance, I wanted to have only one 'dev' branch in which to work and add new data, but after the first merge with 'main' it got corrupted (I think because I closed the interpreter after making some modifications without committing). So we moved to one branch per person making modifications, but that also failed after some merge problems, and now we are simply inventing names for each new change, with a proliferation of dead branches.

If your feature will improve HUB

...

Description of the possible solution

To implement a method on the Dataset class that can be called like ds.delete_branch('branch_name')

An alternative solution to the problem can look like

If there is a technical motivation for not having a delete_branch() method, a rename_branch() can also be useful to rename dead branches even if it doesn't solve the proliferation of unused branches

@josefransaenz josefransaenz added the enhancement New feature or request label May 9, 2023
@istranic
Contributor

istranic commented May 9, 2023

Hey @josefransaenz Thank you for raising the issue and explaining the motivation. We can add a method for deleting a branch, but it would only delete the branch at the meta level by making it unavailable. Deleting the actual commits is challenging because we don't duplicate all data during branching and merging, so the commits corresponding to various branches are necessary for reconstructing the dataset state.
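To illustrate why a branch can be hidden but its commits cannot simply be dropped, here is a minimal toy sketch in plain Python. It is not deeplake code: `ToyVersionStore` and all of its methods are hypothetical names, and it assumes a simplified model in which each commit stores only its own changes plus a link to its parent.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVersionStore:
    """Hypothetical toy model: commits hold only their own changes and a parent link."""
    commits: dict = field(default_factory=dict)   # commit_id -> (parent_id, changes)
    branches: dict = field(default_factory=dict)  # branch name -> head commit_id

    def commit(self, branch, commit_id, changes):
        # The new commit's parent is the current head of the branch (None for the first commit).
        parent = self.branches.get(branch)
        self.commits[commit_id] = (parent, dict(changes))
        self.branches[branch] = commit_id

    def checkout_new(self, source, new_branch):
        # Branching copies no data: the new name points at the same head commit.
        self.branches[new_branch] = self.branches[source]

    def fast_forward(self, target, source):
        # Simplest possible merge: move the target head to the source head.
        self.branches[target] = self.branches[source]

    def state(self, branch):
        # Reconstruct the dataset state by replaying the commit chain, oldest first.
        chain = []
        commit_id = self.branches[branch]
        while commit_id is not None:
            parent, changes = self.commits[commit_id]
            chain.append(changes)
            commit_id = parent
        result = {}
        for changes in reversed(chain):
            result.update(changes)
        return result

    def delete_branch(self, name):
        # Meta-level delete: only the name is removed, every commit is kept.
        del self.branches[name]


store = ToyVersionStore()
store.commit("main", "c1", {"img_0": "cat"})
store.checkout_new("main", "dev")
store.commit("dev", "c2", {"img_1": "dog"})
store.fast_forward("main", "dev")
store.delete_branch("dev")

main_state = store.state("main")
```

In this toy model, `main` still depends on commit `c2`, which was created on `dev`, so removing the commit along with the branch name would break `main`.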

The broader problem of corrupt and unusable dataset states is something we're working on. We will add back async flushing, which will substantially speed up merging, thus minimizing the corruption due to code interrupts. If you could send us any reproducible code that causes corruption, that would help us significantly for addressing the root issue.

Also, it sounds like you're interested in concurrent writes, which is why you're using multiple branches? If so, that's a feature we're currently working on!

@josefransaenz
Author

Hi @istranic! Thanks for your fast reply. I suspected that it was not possible to delete branches/commits for the reason you mention, but I'm happy to know that you are working hard on the merge issues and corrupted states. We are testing deeplake in a new project, and while I'm excited and happy with its power and potential, the merge problems are making some of my colleagues uneasy. But I don't want to go back to file-based datasets =)

In the meantime, it would be nice to have a meta-level solution that allows renaming, so that it's possible to reuse a single branch name (like 'develop' or 'annotation') and rename unusable/old states with names we don't have to care about or remember.
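As a rough sketch of what such a meta-level rename could amount to, assume (purely for illustration) that branch metadata is a mapping from branch name to head commit ID. The `rename_branch` helper, the commit IDs, and the branch names below are all hypothetical and are not part of deeplake.

```python
def rename_branch(branches, old, new):
    """Re-key a branch in a name -> head-commit mapping without touching any commits."""
    if old not in branches:
        raise KeyError(f"unknown branch: {old}")
    if new in branches:
        raise ValueError(f"branch already exists: {new}")
    branches[new] = branches.pop(old)


# Hypothetical metadata: 'develop' is in an unusable state.
branches = {"main": "c3", "develop": "c7"}

# Park the broken state under a throwaway name...
rename_branch(branches, "develop", "develop-old-1")

# ...and reuse the familiar name, starting again from the head of 'main'.
branches["develop"] = branches["main"]
```

The old state stays reachable under the throwaway name, which matches the constraint that commits themselves are never removed.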

Yes, we are interested in concurrent writes, especially because deeplake doesn't have a method for controlling the locking state, in particular for releasing the lock when read_only=False. For us it's very common to find that it's not possible to load the dataset with read_only=False even some minutes after the other process that had opened it with read_only=False has been closed. As you can imagine, this forces the use of another branch for writing in the meantime. So having a method to ensure the release of the lock before closing the process (or to change the read_only state without reloading the dataset) would be helpful. I don't think true concurrent writing is needed if you are not expecting big teams working on a dataset; better control of when and who can write could be enough.
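The kind of explicit lock control described above could look something like the following sketch. It is not how deeplake implements locking: `write_lock` is a hypothetical helper, and it assumes a simple lock file on a local filesystem. It uses only the Python standard library.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def write_lock(lock_path):
    """Hold an exclusive lock file for the duration of the with-block."""
    # O_CREAT | O_EXCL makes creation fail with FileExistsError if the lock is already held.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        # Record the owner so a stale lock can be traced back to a process.
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        # Release deterministically, even if the body raised an exception.
        os.close(fd)
        os.remove(lock_path)


lock_path = os.path.join(tempfile.mkdtemp(), "dataset.lock")

with write_lock(lock_path):
    held_inside = os.path.exists(lock_path)
    try:
        # A second writer is rejected while the lock is held.
        with write_lock(lock_path):
            second_writer_blocked = False
    except FileExistsError:
        second_writer_blocked = True

released_after = not os.path.exists(lock_path)
```

The point of the design is that the lock is released when the block exits rather than after a timeout, so a second process can open the data for writing immediately afterwards.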
