Very slow to compare significantly different directories #322

Open
vertigo220 opened this issue Apr 26, 2020 · 18 comments

Comments

@vertigo220

I have two directories full of subdirectories and more subdirectories, each full of pictures. I used another program to compare the two and remove all copies of images that are identical or near-identical (same image but rotated or tagged in one and not the other), which left me with ~15.5k files in one and 359 in the other. I then ran a comparison with WinMerge, which I expected to take a minute at most, since all it had to actually compare was those 359 files at most. However, it took several minutes, seeming to scan each and every file. Whatever method it uses for comparison doesn't seem to be very efficiently designed, and it seems there is a lot of room for improvement in speeding it up.

@sdottaka
Member

The default compare method, "Full Contents", is slow because it scans each file to detect its encoding and EOL type, even when the file exists on only one side.

Try using "Binary contents" instead.

image

@vertigo220
Author

I considered that, but the help file says the full contents comparison is the most complete and recommended method, and just says "TBD" for binary compare, so I wasn't sure if it was a good idea. Two questions: what's the downside to using binary compare, and what purpose is there for checking all that in full compare if there's nothing to compare against? If there's no corresponding file in the other folder(s), doing a "full" check on the one copy of the file that does exist seems pointless; it should just know it's "different" since it exists on one side and not the other.

@sdottaka
Member

> what's the downside to using binary compare

The "Ignore case" and "Ignore spaces" options are not applied. It also does not apply line filters, comment filters, or plugins.

Below is a table showing the characteristics of each comparison method.

| Method | Compares file contents? | Uses file size? | Uses modified date? | Detects number of differences? | Binary file detection | "Ignore codepage differences" applied? | "Ignore case" applied? | "Ignore spaces" applied? | "Ignore carriage return differences" applied? | "Filter comments" applied? | Line filters | Plugins | Speed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full contents (default) | Yes | - | - | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Low |
| Quick contents | Yes | - | - | - | Yes | Yes | Yes | Yes | Yes | - | - | Yes | Low |
| Binary contents | Yes | Yes | - | - | - | - | - | - | - | - | - | - | Medium |
| Modified Date | - | - | Yes | - | - | - | - | - | - | - | - | - | High |
| Modified Date and Size | - | Yes | Yes | - | - | - | - | - | - | - | - | - | High |
| Size | - | Yes | - | - | - | - | - | - | - | - | - | - | High |
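
To illustrate why the options listed above cannot apply to a binary comparison, here is a minimal Python sketch. It is not WinMerge's implementation; `binary_equal` and `text_equal` are hypothetical helpers written only for this example. A byte-for-byte check has no notion of lines, case, or whitespace, so two files that differ only in those respects are reported as different.

```python
def binary_equal(a: bytes, b: bytes) -> bool:
    """Byte-for-byte comparison: no text options can be applied."""
    return a == b


def text_equal(a: bytes, b: bytes, *, ignore_case: bool = False,
               ignore_spaces: bool = False, ignore_eol: bool = False) -> bool:
    """Line-based comparison with optional normalization (illustrative only)."""
    def normalize(data: bytes):
        text = data.decode("utf-8", errors="replace")
        if ignore_eol:
            # Treat CRLF, CR, and LF as the same line ending.
            text = text.replace("\r\n", "\n").replace("\r", "\n")
        lines = text.split("\n")
        if ignore_spaces:
            # Drop all whitespace inside each line.
            lines = ["".join(line.split()) for line in lines]
        if ignore_case:
            lines = [line.lower() for line in lines]
        return lines

    return normalize(a) == normalize(b)


left = b"Hello World\r\n"
right = b"hello  world\n"

print(binary_equal(left, right))  # False
print(text_equal(left, right, ignore_case=True,
                 ignore_spaces=True, ignore_eol=True))  # True
```

The normalization step is also where the extra cost comes from: the text has to be decoded and processed line by line before it can be compared.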

> what purpose is there for checking all that in full compare if there's nothing to compare against.

Probably the purpose is to display EOL-type, etc. in the ListView column of the folder comparison window.

@vertigo220
Author

> Probably the purpose is to display EOL-type, etc. in the ListView column of the folder comparison window.

So couldn't it just skip such files to make the comparison go much quicker, then leave that field blank in the comparison, since it wouldn't matter anyway?

@sdottaka
Member

I think we could add an option to do so.
It would be nice if someone could submit a pull request.

@ghost

ghost commented Feb 10, 2022

Hi, I have a similar problem. But at least in my case, it's not about the speed of comparing files that are on both sides,
but that it takes a long time to go through files that are only on one side.

So far it looks to me as if files that should just be quickly checked for absence on the other side
are being processed further.
Maybe they are read from disk, or something?

@sdottaka
Member

The "Full Contents" compare method, which is the default, detects the encoding and judges whether the file is binary or not, even if the file exists on only one side, which causes the file to be read.
For faster folder comparison, use the "Binary Contents" compare method, or the "Modified Date and Size" compare method, which compares only the size and modified date of files.

image
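
As a rough illustration of why a date-and-size comparison is fast, the following Python sketch compares two files using only filesystem metadata. It is an assumption-laden example, not WinMerge code: `same_size_and_mtime` and its `tolerance_s` parameter are hypothetical names introduced here.

```python
import os
import tempfile


def same_size_and_mtime(left_path: str, right_path: str,
                        tolerance_s: float = 0.0) -> bool:
    """Compare two files by metadata only; the contents are never opened."""
    left = os.stat(left_path)
    right = os.stat(right_path)
    return (left.st_size == right.st_size
            and abs(left.st_mtime - right.st_mtime) <= tolerance_s)


with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.bin")
    b = os.path.join(tmp, "b.bin")
    for path in (a, b):
        with open(path, "wb") as f:
            f.write(b"same payload")
        # Force identical access/modification times on both files.
        os.utime(path, (1_600_000_000, 1_600_000_000))

    result_same = same_size_and_mtime(a, b)

    # Append one byte to b, then restore its timestamp: only the size differs.
    with open(b, "ab") as f:
        f.write(b"x")
    os.utime(b, (1_600_000_000, 1_600_000_000))

    result_diff = same_size_and_mtime(a, b)

print(result_same, result_diff)
```

The trade-off is that two files with the same size and timestamp but different contents would be reported as identical.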

@ghost

ghost commented Feb 10, 2022

So in that case, I want files that are on both sides to be compared with full content method,
and files that are only on one side to be compared with modified date and size method (or whichever doesn't read the file).

How do I set that up?

@sdottaka
Member

Unfortunately, there is no way to make such a comparison.

@vertigo220
Author

What about checking for a file to compare against before checking the encoding? IOW, currently it's doing checks on a file that take a while, then attempting to compare that file against another, whereas it would be much more efficient to see if there's even a file to compare against first before bothering with the other stuff, and only doing that if necessary, i.e. if there's a corresponding file.
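
The ordering proposed here can be sketched in Python as follows. This is only an illustration of the idea, not how WinMerge is structured: `list_files` and `compare_trees` are hypothetical helpers, and path matching is done on exact relative paths, which ignores the case-insensitive matching a Windows tool would need. Files present on only one side are classified from the directory listing alone, and contents are read only for files that exist on both sides.

```python
import filecmp
import os
import tempfile


def list_files(root: str) -> set:
    """Collect relative paths of all files under root without opening them."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def compare_trees(left_root: str, right_root: str) -> dict:
    left, right = list_files(left_root), list_files(right_root)
    result = {
        "left_only": sorted(left - right),    # no file content is read
        "right_only": sorted(right - left),   # no file content is read
        "identical": [],
        "different": [],
    }
    # Only files with a counterpart on the other side are actually compared.
    for rel in sorted(left & right):
        same = filecmp.cmp(os.path.join(left_root, rel),
                           os.path.join(right_root, rel), shallow=False)
        result["identical" if same else "different"].append(rel)
    return result


def _write(path: str, data: bytes) -> None:
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)


with tempfile.TemporaryDirectory() as tmp:
    left_root = os.path.join(tmp, "left")
    right_root = os.path.join(tmp, "right")
    _write(os.path.join(left_root, "same.txt"), b"a")
    _write(os.path.join(right_root, "same.txt"), b"a")
    _write(os.path.join(left_root, "changed.txt"), b"a")
    _write(os.path.join(right_root, "changed.txt"), b"b")
    _write(os.path.join(left_root, "sub", "only.txt"), b"unpaired")
    report = compare_trees(left_root, right_root)

print(report)
```

With this ordering, the cost of the content comparison scales with the number of paired files rather than with the total number of files in both trees.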

@sdottaka
Member

sdottaka commented Mar 2, 2022

The reason for not doing so is that in the Full Contents comparison method, we want the columns to show the encoding, EOL type, and whether the file is binary, even if the file is missing on the left or right side.
However, if the columns for Encoding, EOL, etc. are hidden, it may be better to omit that process.
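
One way to picture skipping the detection when the columns are hidden is lazy evaluation: the file is scanned only the first time a visible column asks for the value. The Python sketch below is a hypothetical model, not WinMerge's design; `FileEntry`, `render_row`, and the `reads` counter are invented for the example, and the EOL detection is deliberately simplified (mixed line endings are not handled).

```python
import os
import tempfile
from functools import cached_property


class FileEntry:
    def __init__(self, path: str):
        self.path = path
        self.reads = 0  # counts how often the file content was actually read

    @cached_property
    def eol_type(self) -> str:
        """Scan the file on first access only; later accesses use the cache."""
        self.reads += 1
        with open(self.path, "rb") as f:
            data = f.read()
        if b"\r\n" in data:
            return "CRLF"
        if b"\n" in data:
            return "LF"
        if b"\r" in data:
            return "CR"
        return "none"


def render_row(entry: FileEntry, visible_columns: set) -> dict:
    row = {"name": os.path.basename(entry.path)}
    if "eol" in visible_columns:
        row["eol"] = entry.eol_type  # triggers the scan only when needed
    return row


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(b"first\r\nsecond\r\n")

    entry = FileEntry(path)
    row_hidden = render_row(entry, {"name"})
    reads_after_hidden = entry.reads

    row_shown = render_row(entry, {"name", "eol"})
    render_row(entry, {"name", "eol"})  # second render reuses the cached value
    reads_after_shown = entry.reads

print(row_hidden, reads_after_hidden)
print(row_shown, reads_after_shown)
```

In this model, showing a hidden column later would trigger the scan at that point instead of requiring the whole comparison to be rerun.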

@vertigo220
Author

I suppose that could be a solution, but I see a couple issues with it. While I don't think I've ever used those columns, requiring the user to hide them to speed this up means two things: they have to re-show them any time they want to use them (and possibly waste time running a scan again if they ran it with the columns hidden then realized they wanted to see them) and it would mean it would run slow by default unless the user knew to hide those columns.

Personally, I can't imagine why one would want to see that info on a file when there's no corresponding file, since the whole point is to check differences, and if there's no file to compare to, the difference is the entire file. So it still seems it would be best to simply skip said files and leave those columns blank for them, with an option to alter this behavior. It would seem best to default to skipping, and allow advanced users who actually have a need for this info, and are therefore willing to accept the performance hit, to manually set it to the current behavior. Wouldn't this be possible?

Then again, maybe it would be best to simply use binary mode, and I'll have to give that a try with some test comparisons, but there are problems with that, mainly that it requires using not only a non-default, but also non-recommended mode to get good performance, which means most users aren't going to know to use it or won't know if they even should, just as I'm still unsure. Maybe I'm wrong, but it seems the default mode should provide the fastest performance while providing the results/info that most users will need, and if users need more info than that at a cost to performance, that should require changing options to their non-default settings, not to mention having such options to begin with.

I'll also admit that while I've used WinMerge quite a bit, there's likely a lot about it and its capabilities I don't know and don't have experience with, so it's possible I'm simply missing something that makes my arguments invalid. So if they make sense and you decide to change things accordingly, great, but if they don't, then so be it, in which case at least maybe there should be some sort of prompt on first-run explaining the performance and usage differences of full-compare vs binary and ask the user to choose the default method right then and there.

@hadayovi

hadayovi commented Mar 5, 2023

I'm not sure if I'm understanding all of this detail correctly, but I thought I remembered WinMerge in past years having a quick compare that only checked for file name, size, and datetime. Currently for a movie file list I really only need name and size, however, I don't think the date compare is what's slowing down the compare to ridiculous levels.

I have Binary Compare method on, and it's taking around 10 seconds to compare each file.

image

I don't see how this is useful at all. I can spot check what I need in Windows Explorer, but that's mistake-prone and can be very time-consuming as well, including boredom-inducing.

Right now my best solution is to walk away from WinMerge for hours, then come back and see what it actually thinks is different. Most of the time there's no difference. Maybe I need a sync solution, but ones I've tried in the past on Windows were disastrous.

I use WinMerge for other things that may or may not need a sync (or even a simple one), so it's still useful to me, but when I can't tell why it's taking so long to compare, when it didn't used to before, I wonder if there's some sort of logical flaw that can't be easily addressed with a better option for a simpler compare.

@sdottaka
Member

sdottaka commented Mar 5, 2023

If you do not need to compare file contents, select a compare method such as Modified Date and Size, as shown below. This comparison method is faster because it does not compare file contents.

image

If it is still slow, you may have selected a hash value such as MD5 in the Additional Properties window as shown below.
If Additional Properties has been added, removing it will speed up the process.

image
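
The reason a hash property slows things down is that a digest such as MD5 can only be computed by reading every byte of the file, whereas size and date come from metadata. The Python sketch below shows this in general terms; it makes no claim about how WinMerge computes its Additional Properties, and `md5_of_file` is a hypothetical helper.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 1 << 20):
    """Return (hex digest, bytes read); the whole file must be streamed."""
    digest = hashlib.md5()
    bytes_read = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
            bytes_read += len(chunk)
    return digest.hexdigest(), bytes_read


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movie.bin")
    with open(path, "wb") as f:
        f.write(bytes(3 * 1024 * 1024))  # 3 MiB of zero bytes as a stand-in

    size_on_disk = os.stat(path).st_size   # metadata only, no content read
    md5_hex, bytes_read = md5_of_file(path)

print(size_on_disk, bytes_read)
```

For large files such as videos, the bytes read by the hash equal the full file size, so the time per file grows with file size even when the compare method itself looks only at date and size.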

@vertigo220
Author

@hadayovi Check out FreeFileSync. I suspect it may be better for what you're doing.

@3363686

3363686 commented Apr 18, 2023

@sdottaka I had a problem similar to the topic starter's, and your solution helped me. Thanks! But it's totally counter-intuitive, and should clearly be mentioned in the documentation.
Also, your table showing the characteristics of each comparison method is very useful, and could (imho) be included in the documentation too.
How would you pass this recommendation on to the documentation developers?

@sdottaka
Member

@3363686
There are no developers on our team who specialize in documentation. We welcome pull requests.

BTW, did you solve your problem by changing the comparison method? Or did you solve your problem by adjusting additional property settings?

@blank-teer

blank-teer commented Jul 23, 2023

Why not just see which folder has fewer files and do all the needed calculations for them and their siblings on the opposite side, and that's it? I just got bloody stuck trying to compare folders (columns are filename/folder/comparison result/extension), one containing 1k files and the other containing 47k;

it's "cOMpAriNg" something for ~50k files, what a nonsense.
this issue is 3 years old, and no one can do anything with that but expect some pull reqs from random people? pity
