Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-roundtrippable objects #931

Open
progval opened this issue Jan 29, 2022 · 4 comments
Open

Non-roundtrippable objects #931

progval opened this issue Jan 29, 2022 · 4 comments

Comments

@progval
Copy link
Contributor

progval commented Jan 29, 2022

Hi,

Continuing on this comment:

It looks like some of these will break roundtripping

which is a property I'd like to maintain.

Here are some other sources of non-roundtrippability unrelated to timezones (I listed non-roundtrippable timezones in that comment):

dir entry modes

040000 instead of 40000 in tree entry mode (found in 37k old directories generated by GitHub, and by a Ruby library that GitHub may or may not have been using) is fixed by Dulwich:

>>> b = b'040000 example\x00\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa'
>>> t = dulwich.objects.Tree.from_string(b)
>>> t._needs_serialization = True
>>> t.as_raw_string()
b'40000 example\x00\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa'

I have also seen about 1.4k commits with other types of broken permissions, most of which don't round-trip.

dir entry order

2k trees with various types of disordered entries:

>>> b = b'10644 example0\x00\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa40000 example\x00\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb'
>>> t = dulwich.objects.Tree.from_string(b)
>>> t._needs_serialization = True
>>> t.as_raw_string()
b'40000 example\x00\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb10644 example0\x00\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa'

This usually happened when people committed trees created with their home-made git implementation, which used a wrong sort (didn't sort at all, or used naive sort instead of git's, or wrote all file entries before all dir entries, etc.)

disordered commit headers

There are many types of these; eg with nonce or encoding added after gpgsig (I don't have stats on these ones, but I don't think there are more than 10k):

>>> b = b'tree aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa \nparent bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb\nauthor John Doe <[email protected]> 1614159930 +0100\ncommitter John Doe <[email protected]> 1614159930 +0100\ngpgsig abcd\nnonce efgh\n\nfoo'
>>> c = dulwich.objects.Commit.from_string(b)
>>> c.author = c.author
>>> c.as_raw_string()
b'tree aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa \nparent bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb\nauthor John Doe <[email protected]> 1614159930 +0100\ncommitter John Doe <[email protected]> 1614159930 +0100\nnonce efgh\ngpgsig abcd\n\nfoo'

Keep in mind these stats are over almost all publicly available git commits (ie. about 2 billions), so they are all a very small fraction of a percent, that may not be worth caring about.

Either way, they are not an issue for me; I just thought they may be relevant to you.

@jelmer
Copy link
Owner

jelmer commented Jan 29, 2022

I was aware of the directory order issue (it's really useful to see some stats) and I believe dulwich should reject them (preferable) or properly roundtrip them.

FTR I would like to support your use case; I just think we should find a good interface to do so that doesn't lead to surprises elsewhere.

hg-git and breezy both use Dulwich to convert import changes from git and then roundtrip back into git. There's als the issue that just setting a property to the same value may trigger a change of the hex sha and object, which would at the very least by surprising.

@jelmer
Copy link
Owner

jelmer commented Feb 1, 2022

There are a couple things we could do here, with decreasing levels of surprise/corruption for users:

  1. Don't support 100% roundtripping, simply parsing odd corner cases; this leads to surprises when e.g. object ids change
  2. Add attributes to represent every bit of oddness that's out there and fully support roundtripping - I think this is cumbersome and not worth the effort considering most of these cases are rare and due to broken git implementations.
  3. Add an argument of some sort to allow parsing broken objects, but raise an exception otherwise
  4. Simply have parsing functions (parse_commit) that can e.g. take such an argument, but don't support anything in the objects (Commit, Tree, etc)

I'd lean to (3) or (4); can you perhaps say more about how you're using the parse functions? Are you actually passing ShaFile objects around?

@progval
Copy link
Contributor Author

progval commented Feb 1, 2022

We considered this as well at @SoftwareHeritage, and we decided to only support offset oddities because they are the most common, and easy to support (ie. option 1).

For anything else, we use what Dulwich was able to parse, write it in in a database row, and store the original manifest alongside it if the parsed object doesn't round-trip, in case we ever need it. This is negligible in size, as only 1 commit in 2,500,000 and 1 tree in 30,000 don't round-trip.

This is why the status quo (plus #927) works for us.

@jelmer
Copy link
Owner

jelmer commented Mar 11, 2022

I haven't forgotten about this but haven't managed to take the time to dive deep. May take another week or two.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants