use predictable ordering as much as possible #3317
Conversation
Not sure if that NullPointerException while downloading the NVD database is related to the changes in this PR.
Just went looking some more in the codebase and found out that in the past there have been PRs that actually removed the TreeSets and replaced them with HashSets, and removed or changed comparators to no longer ensure ordering.
I believe some of these collections were previously ordered (in addition to others). The ordering was removed due to performance issues (on these and other collections). Having done some test runs, I wonder if all of these need to be tree sets. My guess is that the evidence, projectReferences, relatedDependencies, vulnerabilities, and suppressed vulnerabilities do not need to be sorted. The ordering on the paths (https://github.com/jeremylong/DependencyCheck/pull/3317/files#diff-1593f793158c7b1dc3a7bd83329962ad1226667dc5d14ca1e68eef1bf39e89bbR319) is definitely needed.
We could probably write a book on all the different Sets and their properties :-) Even with your change and the TreeSets, I still see the reports differing between scans. Sometimes there's a bunch of extra dependencies reported (most of the time they are not reported, but they are present). Without the ordering it was impossible to see what was happening, short of doing in-memory comparisons between the sets and lists between runs.

Even if you prefer not to use the TreeSets, wouldn't you agree it would be very helpful if the report was ordered predictably? Maybe it would be OK to do a one-time ordering after the analysis or when writing the report? Make it an optional parameter for those who want to shave off those ms from their runs?

What would be the best way to provide more info on the differences in dependencies between runs? I haven't fully grasped the model and logic of DC in my mind yet to know what's going on. These microservices have lots of identical dependencies, so I wonder if using the file hashes or looking at parts of the file path is causing them to be lost sometimes.
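The one-time ordering suggested above (sort only when writing the report, so the analysis phase keeps its faster unordered collections) could be sketched roughly like this. Note this is a minimal illustration, not DependencyCheck code: the `Dep` record and `sortedForReport` method are hypothetical stand-ins for the real `Dependency` class and report-writing path.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ReportSortSketch {
    // Hypothetical minimal stand-in for
    // org.owasp.dependencycheck.dependency.Dependency
    record Dep(String actualFilePath) {}

    // One-time sort applied just before rendering the report; the
    // analysis itself can keep using unordered HashSets.
    static List<Dep> sortedForReport(List<Dep> deps) {
        List<Dep> copy = new ArrayList<>(deps);
        copy.sort(Comparator.comparing(Dep::actualFilePath));
        return copy;
    }

    public static void main(String[] args) {
        List<Dep> deps = List.of(new Dep("/libs/b.jar"), new Dep("/libs/a.jar"));
        // Prints the dependencies in a stable, path-based order
        System.out.println(sortedForReport(deps));
    }
}
```

An optional flag could skip this sort for users who want to save the few milliseconds it costs.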
Did some quick tests, and with both your and my changes together (i.e. this PR) the reports are different between runs. Already before the line with the new sorting from your change, the dependencies differ.
@valentijnscholten do you have any specific test cases you could share? I went down a slightly different path than ordering everything (yes, we can look at sorting for the reports later, but for now I'd at least like to get the related-dependency problem sorted). Take a look at #3343.
I just tested #3343, but got different reports with different sizes:
I just updated my branch. Take a look at #3343. I am now getting the same output from multiple executions (with the exception of the timestamp). |
Fixes Issue
#3291
Description of Change
As described in #3291, every run of dependency-check can generate a different report. As long as you scan one artifact, these differences are limited to the ordering of vulnerabilities etc. Once you start scanning multiple (2, 3, 4, ... 10) artifacts, the differences start to affect which dependencies are considered "top level" dependencies and which are seen as "related dependencies". In the end all dependencies and vulnerabilities should still be there, just differently ordered/grouped. However, in #3291 I have seen that there are bigger differences in the reports: there can be as much as a 25% size difference in the XML report, and importing the report into vulnerability management tools such as OWASP Defect Dojo shows a different total vulnerability count. Analyzing these 8-10 MB XML reports was hard, so in search of the cause I started making changes to get a predictable ordering in the reports (and internal data structures).
This PR realizes this, and it turns out the reports are now consistently the same across runs of the scanner: no more differences in vulnerability count, file size, or otherwise.
Implementation
The PR replaces some `HashSet`s with `TreeSet`s to ensure natural ordering. A comparator is added to `Dependency.java` to allow ordering. It's based on the same comparator used elsewhere outside the class (based on `getActualFilePath()`).

TODO
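The approach described above can be sketched as follows. This is an illustrative stand-in, not the actual DependencyCheck source: the `Dependency` class here is a hypothetical minimal version, ordered by `getActualFilePath()` so that a `TreeSet` iterates in a stable order across runs.

```java
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

public class OrderingSketch {
    // Hypothetical minimal stand-in for
    // org.owasp.dependencycheck.dependency.Dependency
    static class Dependency implements Comparable<Dependency> {
        // Comparator based on the actual file path, mirroring the one
        // used elsewhere outside the real class
        static final Comparator<Dependency> BY_PATH =
                Comparator.comparing(Dependency::getActualFilePath);

        private final String actualFilePath;

        Dependency(String actualFilePath) {
            this.actualFilePath = actualFilePath;
        }

        String getActualFilePath() {
            return actualFilePath;
        }

        // Natural ordering delegates to the path-based comparator,
        // so TreeSet iteration order is deterministic
        @Override
        public int compareTo(Dependency other) {
            return BY_PATH.compare(this, other);
        }

        @Override
        public String toString() {
            return actualFilePath;
        }
    }

    public static void main(String[] args) {
        // TreeSet keeps elements sorted regardless of insertion order
        Set<Dependency> deps = new TreeSet<>();
        deps.add(new Dependency("/libs/b.jar"));
        deps.add(new Dependency("/libs/a.jar"));
        deps.add(new Dependency("/libs/c.jar"));
        System.out.println(deps); // [/libs/a.jar, /libs/b.jar, /libs/c.jar]
    }
}
```

The trade-off is the O(log n) cost per insert/lookup versus a `HashSet`'s O(1), which is presumably why earlier PRs removed the ordering for performance.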
The `vulnerableSoftwareIdentifiers` property was also converted to a `TreeSet`, but still appears in a different order across runs. Not sure where this is caused. But for now I'd like to get feedback before continuing down this path.

Have test cases been added to cover the new functionality?
no, not sure if they are needed.