Fix `XvcEntity` for branches #198

iesahin · 2023-01-16T11:57:52Z

Suppose Alice and Bob work on a project.

Alice adds a file to the project.

alice: git clone my-project ; xvc file track file-a.png

And Bob adds another file:

bob: git clone my-project ; xvc file track file-b.png

Now, suppose the initial entity counter in my-project was 1000. In Alice's version, 1001 will correspond to file-a.png, in Bob's version, 1001 will be file-b.png. This breaks the ECS.

We need a solution to this problem.

It may be possible to create another command xvc merge to merge two histories. Adding a new command for this frequent occasion makes user-education a must. Making users frustrated in such a common situation is bad.
It's possible to randomize entities. Currently they are usize integers. Randomization requires to check previous values for possible collisions. Although it's a low probability, two 64-bit integers may collide (due to Birthday paradox) if we pass a few billion mark for entities. Also, randomization for each entity is expensive.
It's possible use a timestamp (e.g. nanoseconds since 2000-1-1) as an XvcEntity. When we think about thousands of branches running in different boxes with the same base XvcEntity base value, this may lead to collisions as well. Also, not all boxes support nanosecond accuracy and I'd not want to depend on something like time which may go back and forth depending on your locale.
It's possible to use something from environment, e.g. branch name, user name etc. But these can also fail and very easy to duplicate.
We only have a basic counter now. It's 32-bit on 32-bit systems, and 64-bit on 64-bit systems. I think we need to add some randomization to this, but this randomization shouldn't lead to collisions easily.

The solution I'm able to come up is making XvcEntity a tuple of 64-bit integers. The first integer will be random and created when the command is run. The second is the counter we already use. So the EC file won't be needed to change and the collisions (even after a few billion elements) will be negligible. In each run, the initial 64-bit integer will be renewed and the second 64-bit will be loaded.

It might be simpler to use a single 128-bit integer and randomize it at the beginning. I think keeping the counter is better for sorting by creation of these objects. (e.g. when we need to have a consistent sort across entities, but the actual sort is not that important.)

This brings another problem to the light. Suppose, in above scenario, Alice and Bob add the same files.

alice: xvc file track file-a.png

bob: xvc file track file-a.png

Now, as the entities are randomized, we have two entities for file-a.png. When we build an index in XvcStore, this will fail. There will be two entities for the same XvcPath.

An extra command (called xvc fsck) may be required in this case. It will check entities and merge duplicates. Also, merge all store files to a single file for quick loading. Its capabilities may increase in the future.

The text was updated successfully, but these errors were encountered:

iesahin · 2023-01-16T12:44:15Z

The thing with the second problem may be solved by using a hash for the first element in the tuple. But it doesn't solve the collision problem. We can check collisions and ask for another value (or append a counter) in case of a collision, but checking collisions require to load all values. Also, it's an expensive solution for a comparatively rare problem. Hashing all values to get their entity values is a constant factor to remedy an infrequent problem.

If we'd go to this route, we'd skip entity generation completely and hash the values to get their entity values. I think this is rather a large change that I'd not decide at the moment.

Having a semantically neutral way (like counters) to generate the hash values seems a better approach at the moment.

Fixes #198

iesahin · 2023-01-18T09:30:22Z

Done in #201

iesahin added the focus label Jan 16, 2023

iesahin self-assigned this Jan 16, 2023

iesahin added the bug Something isn't working label Jan 17, 2023

iesahin added a commit that referenced this issue Jan 17, 2023

Fix XvcEntity for branches

8d41367

Fixes #198

iesahin closed this as completed Jan 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix `XvcEntity` for branches #198

Fix `XvcEntity` for branches #198

iesahin commented Jan 16, 2023

iesahin commented Jan 16, 2023

iesahin commented Jan 18, 2023

Fix XvcEntity for branches #198

Fix XvcEntity for branches #198

Comments

iesahin commented Jan 16, 2023

iesahin commented Jan 16, 2023

iesahin commented Jan 18, 2023

Fix `XvcEntity` for branches #198

Fix `XvcEntity` for branches #198