[FEATURE] Delete branch #2341

josefransaenz · 2023-05-09T14:04:04Z

🚨🚨 Feature Request

It would be useful to delete unused or corrupted branches, or at least be able to rename them.

Related to an existing Issue
[x] A new implementation (Improvement, Extension)

Is your feature request related to a problem?

Kind of. Sometimes I need to discard branches because they are corrupted or unmergeable (my datasets are quite complex, and I still need to dedicate time to reproducing the problems in a simple way in order to open an issue). For instance, I wanted to have only one 'dev' branch in which to work and add new data, but after the first merge with 'main' it got corrupted (I think because I closed the interpreter after making some modifications without committing). So we had to move to one branch per person making modifications, but that also failed after some merge problems, and now we are simply inventing names for each new change, with a proliferation of dead branches.

If your feature will improve `HUB`

...

Description of the possible solution

Implement a method on the Dataset class that can be called like `ds.delete_branch('branch_name')`.

An alternative solution to the problem can look like

If there is a technical reason for not having a `delete_branch()` method, a `rename_branch()` method would also be useful for renaming dead branches, even though it doesn't solve the proliferation of unused branches.


istranic · 2023-05-09T23:41:29Z

Hey @josefransaenz Thank you for raising the issue and explaining the motivation. We can add a method for deleting a branch, but it would only delete the branch at the meta level by making it unavailable. Deleting the actual commits is challenging because we don't duplicate all data during branching and merging, so the commits corresponding to various branches are necessary for reconstructing the dataset state.
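
To illustrate the point about meta-level deletion, here is a minimal toy model, not Deep Lake's actual storage format or API. It assumes each commit stores only the data written in it plus a pointer to its parent, and that a branch is just a name pointing at a head commit. `ToyRepo`, `Commit`, `chunks`, and every method name below are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Commit:
    commit_id: str
    parent: Optional[str]  # parent commit id, None for the root commit
    chunks: Dict[str, str] = field(default_factory=dict)  # data written in this commit only


class ToyRepo:
    """Hypothetical model: commits hold deltas, branches are names for head commits."""

    def __init__(self) -> None:
        self.commits: Dict[str, Commit] = {"c0": Commit("c0", None)}
        self.branches: Dict[str, str] = {"main": "c0"}

    def checkout_new(self, source: str, new_branch: str) -> None:
        # Branching copies no data; the new name points at the same head.
        self.branches[new_branch] = self.branches[source]

    def commit(self, branch: str, commit_id: str, chunks: Dict[str, str]) -> None:
        parent = self.branches[branch]
        self.commits[commit_id] = Commit(commit_id, parent, dict(chunks))
        self.branches[branch] = commit_id

    def state(self, branch: str) -> Dict[str, str]:
        # Rebuild the dataset state by walking from the head back to the root.
        chain: List[Commit] = []
        cid: Optional[str] = self.branches[branch]
        while cid is not None:
            chain.append(self.commits[cid])
            cid = self.commits[cid].parent
        result: Dict[str, str] = {}
        for c in reversed(chain):  # apply oldest commit first
            result.update(c.chunks)
        return result

    def delete_branch(self, name: str) -> None:
        # Meta-level delete: the name disappears, every commit stays.
        if name == "main":
            raise ValueError("refusing to delete 'main'")
        del self.branches[name]

    def rename_branch(self, old: str, new: str) -> None:
        # Meta-level rename: only the name changes, the head commit does not.
        if new in self.branches:
            raise ValueError(f"branch {new!r} already exists")
        self.branches[new] = self.branches.pop(old)


repo = ToyRepo()
repo.checkout_new("main", "dev")
repo.commit("dev", "c1", {"images/0": "cat"})
repo.checkout_new("dev", "feature")      # 'feature' starts from dev's head
repo.commit("feature", "c2", {"images/1": "dog"})

repo.delete_branch("dev")                # 'dev' is no longer listed...
print("dev" in repo.branches)
print(repo.state("feature"))             # ...but 'feature' still needs commit c1
```

In this model, physically removing commit `c1` along with `dev` would break `feature`, which is why only the branch name can be dropped safely.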

The broader problem of corrupt and unusable dataset states is something we're working on. We will add back async flushing, which will substantially speed up merging, thus minimizing the corruption due to code interrupts. If you could send us any reproducible code that causes corruption, that would help us significantly for addressing the root issue.

Also, it sounds like you're interested in concurrent writes, which is why you're using multiple branches? If so, that's a feature we're currently working on!

josefransaenz · 2023-05-10T07:32:10Z

Hi @istranic! Thanks for your fast reply. I suspected that it was not possible to delete branches/commits for the reason you mention, but I'm happy to know that you are working hard on the merge issues and corrupted states. We are testing deeplake in a new project, and while I'm excited and happy with its power and potential, the merge problems are making some of my colleagues uneasy. But I don't want to go back to file-based datasets =)

In the meantime, it would be nice to have a meta-level solution that allows renaming, so that it's possible to reuse a single branch name (like 'develop' or 'annotation') and give unusable/old states names we don't have to care about or remember.

Yes, we are interested in concurrent writes, especially because deeplake doesn't have a method for controlling the locking state, in particular for releasing the lock when read_only=False. For us it's very common to find that it's not possible to load the dataset with read_only=False, even some minutes after the other process that had opened it with read_only=False has been closed. And as you can imagine, this forces the use of another branch for writing in the meantime. So having a method to ensure the lock is released before closing the process (or to change the read_only state without reloading the dataset) would be helpful. I don't think true concurrent writes are needed if you are not expecting big teams working on a dataset; better control of when and who can write could be enough.
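
As a rough sketch of the kind of explicit lock release described above, the following uses a plain lock file and a context manager so that the lock is removed as soon as the writer finishes, even if it raises. This is a generic pattern and an assumption for illustration, not how deeplake implements its locking; `write_lock`, `LockHeldError`, and the `dataset.lock` file name are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


class LockHeldError(RuntimeError):
    """Raised when another writer already holds the lock (hypothetical name)."""


@contextmanager
def write_lock(lock_path: str):
    # O_CREAT | O_EXCL makes creation fail if the lock file already exists.
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise LockHeldError(f"lock already held: {lock_path}") from None
    try:
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))  # record the owner for debugging
        yield lock_path
    finally:
        # Release explicitly on exit instead of waiting for a timeout.
        os.remove(lock_path)


lock_file = os.path.join(tempfile.mkdtemp(), "dataset.lock")

with write_lock(lock_file):
    # A second writer is rejected while the first one holds the lock.
    try:
        with write_lock(lock_file):
            pass
    except LockHeldError as exc:
        print("second writer rejected:", exc)

# After the block exits, the lock is gone and a new writer can start at once.
with write_lock(lock_file):
    print("lock re-acquired immediately")
```

A lock file like this only covers a clean exit or an exception; a process that is killed outright would still leave the file behind, so a real implementation would also need some expiry or ownership check.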

josefransaenz added the enhancement New feature or request label May 9, 2023
