Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix XvcEntity for branches #198

Closed
iesahin opened this issue Jan 16, 2023 · 2 comments
Closed

Fix XvcEntity for branches #198

iesahin opened this issue Jan 16, 2023 · 2 comments
Assignees
Labels
bug Something isn't working focus

Comments

@iesahin
Copy link
Owner

iesahin commented Jan 16, 2023

Suppose Alice and Bob work on a project.

Alice adds a file to the project.

alice: git clone my-project ; xvc file track file-a.png

And Bob adds another file:

bob: git clone my-project ; xvc file track file-b.png

Now, suppose the initial entity counter in my-project was 1000. In Alice's version, 1001 will correspond to file-a.png, in Bob's version, 1001 will be file-b.png. This breaks the ECS.

We need a solution to this problem.

  • It may be possible to create another command xvc merge to merge two histories. Adding a new command for this frequent occasion makes user-education a must. Making users frustrated in such a common situation is bad.
  • It's possible to randomize entities. Currently they are usize integers. Randomization requires to check previous values for possible collisions. Although it's a low probability, two 64-bit integers may collide (due to Birthday paradox) if we pass a few billion mark for entities. Also, randomization for each entity is expensive.
  • It's possible use a timestamp (e.g. nanoseconds since 2000-1-1) as an XvcEntity. When we think about thousands of branches running in different boxes with the same base XvcEntity base value, this may lead to collisions as well. Also, not all boxes support nanosecond accuracy and I'd not want to depend on something like time which may go back and forth depending on your locale.
  • It's possible to use something from environment, e.g. branch name, user name etc. But these can also fail and very easy to duplicate.
  • We only have a basic counter now. It's 32-bit on 32-bit systems, and 64-bit on 64-bit systems. I think we need to add some randomization to this, but this randomization shouldn't lead to collisions easily.

The solution I'm able to come up is making XvcEntity a tuple of 64-bit integers. The first integer will be random and created when the command is run. The second is the counter we already use. So the EC file won't be needed to change and the collisions (even after a few billion elements) will be negligible. In each run, the initial 64-bit integer will be renewed and the second 64-bit will be loaded.

It might be simpler to use a single 128-bit integer and randomize it at the beginning. I think keeping the counter is better for sorting by creation of these objects. (e.g. when we need to have a consistent sort across entities, but the actual sort is not that important.)

This brings another problem to the light. Suppose, in above scenario, Alice and Bob add the same files.

alice: xvc file track file-a.png

bob: xvc file track file-a.png

Now, as the entities are randomized, we have two entities for file-a.png. When we build an index in XvcStore, this will fail. There will be two entities for the same XvcPath.

An extra command (called xvc fsck) may be required in this case. It will check entities and merge duplicates. Also, merge all store files to a single file for quick loading. Its capabilities may increase in the future.

@iesahin
Copy link
Owner Author

iesahin commented Jan 16, 2023

The thing with the second problem may be solved by using a hash for the first element in the tuple. But it doesn't solve the collision problem. We can check collisions and ask for another value (or append a counter) in case of a collision, but checking collisions require to load all values. Also, it's an expensive solution for a comparatively rare problem. Hashing all values to get their entity values is a constant factor to remedy an infrequent problem.

If we'd go to this route, we'd skip entity generation completely and hash the values to get their entity values. I think this is rather a large change that I'd not decide at the moment.

Having a semantically neutral way (like counters) to generate the hash values seems a better approach at the moment.

@iesahin iesahin added the focus label Jan 16, 2023
@iesahin iesahin self-assigned this Jan 16, 2023
@iesahin iesahin added the bug Something isn't working label Jan 17, 2023
iesahin added a commit that referenced this issue Jan 17, 2023
@iesahin
Copy link
Owner Author

iesahin commented Jan 18, 2023

Done in #201

@iesahin iesahin closed this as completed Jan 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working focus
Projects
None yet
Development

No branches or pull requests

1 participant