use predictable ordering as much as possible #3317

valentijnscholten · 2021-04-22T14:24:15Z

Fixes Issue

Description of Change

As described in #3291, every run of dependency check will/could generate a different report. As long as you scan 1 artifact, these differences are limited to the ordering of vulnerabilities etc. Once you start scanning multiple (2, 3, 4, ... 10) artifacts, the differences start to affect which dependencies are considered "top level" dependencies and which are seen as "related dependencies". In the end all dependencies and vulnerabilities should still be there, just ordered/grouped differently. However, in #3291 I have seen bigger differences between the reports. The xml report can differ by as much as 25% in size, and importing the report into a vulnerability management tool such as OWASP Defect Dojo shows a different total vulnerability count. Analyzing these 8-10MB xml reports was hard, so while searching for the cause I started making changes to get a predictable ordering in the reports (and internal data structures).
This PR realizes this, and now it turns out the reports are consistently the same across runs of the scanner. So no more difference in vulnerability count or filesize or otherwise.

Implementation

The PR replaces some HashSets with TreeSets to ensure natural ordering. A comparator is added to Dependency.java to allow ordering. It is based on the same comparator that is already used outside the class (based on getActualFilePath()).
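To illustrate the idea, here is a minimal sketch of a TreeSet ordered by file path. The `Dep` class, `OrderedSetSketch`, `BY_PATH` and `sortedPaths` are hypothetical names made up for this example; only the `getActualFilePath()` getter name is taken from the description above, and this is not the actual code in the PR.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a scanned dependency; not the real Dependency class.
final class Dep {
    private final String actualFilePath;

    Dep(String actualFilePath) {
        this.actualFilePath = actualFilePath;
    }

    String getActualFilePath() {
        return actualFilePath;
    }
}

class OrderedSetSketch {
    // Assumed comparator: orders dependencies by their file path.
    static final Comparator<Dep> BY_PATH = Comparator.comparing(Dep::getActualFilePath);

    // Adds the paths to a TreeSet and returns them in iteration order.
    static List<String> sortedPaths(List<String> paths) {
        Set<Dep> deps = new TreeSet<>(BY_PATH);
        for (String p : paths) {
            deps.add(new Dep(p));
        }
        List<String> out = new ArrayList<>();
        for (Dep d : deps) {
            out.add(d.getActualFilePath());
        }
        return out;
    }

    public static void main(String[] args) {
        // Iteration order depends only on the comparator, not on insertion order.
        System.out.println(sortedPaths(List.of("lib/c.jar", "lib/a.jar", "lib/b.jar")));
    }
}
```

One thing to keep in mind with this approach: a TreeSet treats two elements as duplicates when the comparator returns 0, so in this sketch two entries with the same path would collapse into one.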

TODO

The vulnerableSoftwareIdentifiers property was also converted to a TreeSet, but it still appears in a different order across runs. Not sure where this is caused. For now I'd like to get feedback before continuing on this path.

Have test cases been added to cover the new functionality?

no, not sure if they are needed.

valentijnscholten · 2021-04-22T14:36:15Z

Not sure if that NullPointerException while downloading the NVD database is related to the changes in this PR.

valentijnscholten · 2021-04-22T21:04:38Z

Just went looking some more in the codebase and found out that past PRs actually removed the TreeSets and replaced them with HashSets, and removed or changed comparators so that they no longer ensure ordering.
https://github.com/jeremylong/DependencyCheck/pull/983/files

jeremylong · 2021-04-24T14:02:36Z

I believe some of these collections were previously ordered (in addition to others). The ordering was removed due to performance issues (on these and other collections). Some test runs with Dependencies Scanned: 6464 (4644 unique) gave an average runtime of 117 seconds using the un-ordered version and 124 seconds using your ordered collections. Granted, that is a large number of dependencies and most scans are not that large, but there is definitely a performance hit.

I wonder if all of these need to be tree sets. My guess is that the evidence, projectReferences, relatedDependencies, vulnerabilities, and suppressed vulnerabilities do not need to be sorted. The ordering on the paths (https://github.com/jeremylong/DependencyCheck/pull/3317/files#diff-1593f793158c7b1dc3a7bd83329962ad1226667dc5d14ca1e68eef1bf39e89bbR319) is definitely needed.

valentijnscholten · 2021-04-24T14:25:49Z

We could probably write a book on all the different Sets and their properties :-) Even with your change and the TreeSets, I still see the reports differing between scans. Sometimes there's a bunch of extra dependencies reported (most of the time they are not reported, but they are present). Without the ordering it was impossible to see what was happening, short of doing in-memory comparisons of the sets and lists between runs.

Even if you prefer not to use the TreeSets, wouldn't you agree it would be very helpful if the report was ordered predictably? Maybe it would be OK to do a one-time ordering after the analysis or when writing the report? Make it an optional parameter for those who want to shave off those ms from their runs?
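A rough sketch of what such a one-time ordering could look like: the internal collection stays an unordered HashSet and a sorted copy is made only when the report is written. The class `ReportOrderingSketch`, the method `forReport` and the `sortOutput` flag are hypothetical names for illustration; no such option exists in dependency check as far as this thread shows.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ReportOrderingSketch {
    // Hypothetical helper: returns the entries in the order the report writer
    // should emit them. Sorting happens once, on a copy, and only if requested.
    static List<String> forReport(Collection<String> paths, boolean sortOutput) {
        List<String> out = new ArrayList<>(paths);
        if (sortOutput) {
            out.sort(Comparator.naturalOrder());
        }
        return out;
    }

    public static void main(String[] args) {
        // The analysis phase keeps using a plain HashSet.
        Set<String> analyzed = new HashSet<>(List.of("lib/c.jar", "lib/a.jar", "lib/b.jar"));
        System.out.println(forReport(analyzed, true));
    }
}
```

The cost is a single O(n log n) sort per collection at report time instead of ordered inserts during the whole analysis, and the input collection is left untouched.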

What would be the best way to provide more info on the differences in dependencies between runs? I haven't fully grasped the model and logic of DC yet to know what's going on. These microservices have lots of identical dependencies, so I wonder if using the file hashes or looking at parts of the file path is causing them to be lost sometimes.

valentijnscholten · 2021-04-24T16:42:34Z

Did some quick tests and with both your and my changes together (i.e. this PR), the reports still differ between runs. The dependencies are already different before the line with the new sorting from your change is reached.
Once I disable parallel processing, everything is consistent between runs (but slower of course).
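As a generic illustration of why parallel processing can change ordering (this is not dependency check code, and not necessarily the cause of the differences seen here): when workers append results to a shared collection, the insertion order depends on thread scheduling. Sorting the results afterwards gives the same sequence on every run. All names in this sketch are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.stream.IntStream;

class ParallelOrderSketch {
    // Simulates analyzers finishing in arbitrary order: each task appends its
    // result to a shared queue, so the queue order may vary between runs.
    static List<String> analyzeInParallel(int count) {
        Queue<String> results = new ConcurrentLinkedQueue<>();
        IntStream.range(0, count)
                .parallel()
                .forEach(i -> results.add(String.format(Locale.ROOT, "dep-%03d.jar", i)));
        return new ArrayList<>(results);
    }

    // Returns a sorted copy, which no longer depends on completion order.
    static List<String> normalized(List<String> results) {
        List<String> copy = new ArrayList<>(results);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<String> first = normalized(analyzeInParallel(200));
        List<String> second = normalized(analyzeInParallel(200));
        System.out.println("same after sorting: " + first.equals(second));
    }
}
```

Note that this only normalises ordering. If parallel processing changes which elements end up in a collection, sorting will make the difference easier to see but will not remove it.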

jeremylong · 2021-04-30T12:58:11Z

@valentijnscholten do you have any specific test cases you could share? I went down a slightly different path than ordering everything (yes, we can look at sorting for the reports later, but for now I'd at least like to get the related dependency problem sorted). Take a look at #3343.

valentijnscholten · 2021-04-30T16:36:03Z

I just tested #3343, but I still get different reports with different sizes.

Without ordering it's not possible (for me) to make a good comparison between the reports. I will send you some wars by e-mail to reproduce.

jeremylong · 2021-05-03T11:54:49Z

I just updated my branch. Take a look at #3343. I am now getting the same output from multiple executions (with the exception of the timestamp).

valentijnscholten · 2021-05-04T11:57:03Z

superseded by #3343 which seems to fix #3291 and ensures predictable ordering in reports :-)

use predictable ordering as much as possible

dcf4cbd

boring-cyborg bot added the "cli" (changes to the cli) and "core" (changes to core) labels Apr 22, 2021

sort vulnerableSoftware in xml report

9b0c11a

wip

3670712

valentijnscholten closed this May 4, 2021

github-actions bot locked as resolved and limited conversation to collaborators Jan 5, 2025
