It doesn't work with large files (~40M in my case) #17

Open
peske opened this issue Dec 13, 2020 · 2 comments

peske commented Dec 13, 2020

Thanks for the effort, but it looks like it doesn't work, or at least not optimally. Here's my setup:

I have two binary files (DLLs), which have the exact same size (about 40M), and which differ very slightly. Here's the screenshot from BeyondCompare:

[Image: BeyondCompare screenshot comparing the two files]

As you can see, the files differ only in very few bytes.

But when I did the following (example from the documentation):

// Create a fingerprint of a file (block size 1024)
fingerprint := NewFingerprint("/path/foo_v1.binary", 1024)

// Say the file was updated.
// Let's generate the diff against the fingerprint.
diff := NewDiff("/path/foo_v2.binary", *fingerprint)

I found that the resulting diff has more than 725,000 blocks (Block). Serialized as JSON, the diff is about 9M. I also tried a smaller block size (64) and ended up with a diff of 150M in JSON.

Sadly I cannot share the actual DLLs (company secret), but I believe you can reproduce this by taking any DLL of a similar size, making a copy with a few bytes changed here and there, and running the same steps.
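For anyone who wants to try the reproduction without a real DLL, here is a minimal sketch that writes a pair of same-sized files differing in only a few bytes. It assumes seeded pseudo-random bytes are an acceptable stand-in for a DLL (real binaries have more structure, so results may differ); the helper names `mutateCopy` and `countDiffs`, the chosen offsets, and the temp directory prefix are all made up for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
	"os"
	"path/filepath"
)

// mutateCopy returns a copy of src in which the byte at each offset is flipped.
// The source slice is left untouched.
func mutateCopy(src []byte, offsets []int) []byte {
	dst := make([]byte, len(src))
	copy(dst, src)
	for _, off := range offsets {
		dst[off] ^= 0xFF // flipping every bit guarantees the byte changes
	}
	return dst
}

// countDiffs reports how many positions differ between two equal-length slices.
func countDiffs(a, b []byte) int {
	n := 0
	for i := range a {
		if a[i] != b[i] {
			n++
		}
	}
	return n
}

func main() {
	const size = 40 << 20 // roughly the 40M mentioned in the report

	// Assumption: seeded pseudo-random bytes stand in for the real DLL.
	rng := rand.New(rand.NewSource(1))
	v1 := make([]byte, size)
	rng.Read(v1)

	// Change a handful of bytes scattered through the file.
	v2 := mutateCopy(v1, []int{100, 1 << 20, 20 << 20, size - 1})

	dir, err := os.MkdirTemp("", "diff-repro")
	if err != nil {
		panic(err)
	}
	p1 := filepath.Join(dir, "foo_v1.binary")
	p2 := filepath.Join(dir, "foo_v2.binary")
	if err := os.WriteFile(p1, v1, 0o644); err != nil {
		panic(err)
	}
	if err := os.WriteFile(p2, v2, 0o644); err != nil {
		panic(err)
	}
	fmt.Printf("wrote %s and %s (%d bytes differ)\n", p1, p2, countDiffs(v1, v2))
}
```

The two resulting paths can then be passed to the fingerprint and diff calls quoted above.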


peske commented Dec 13, 2020

By the way, the vast majority (99.999% in my case) of the returned blocks have RawData=nil and HasData=false. Are they really needed? I see that they contain some checksums, maybe to check the input file before patching? If so, wouldn't it be better to ensure the input file's integrity in a cheaper way, such as including a checksum of the whole file in the output?


monmohan commented Dec 13, 2020

I have test cases with different binary files (including large files), so I know that diff generation and patching do work. But I do understand your point about the patch file size not being optimal. I need to look into more efficient ways of serializing the patch information.
