Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use predictable ordering as much as possible #3317

Conversation

valentijnscholten
Copy link
Contributor

Fixes Issue

#3291

Description of Change

As described in #3291, every run of dependency check will/could generate a different report. As long as you scan 1 artifact, this differences are limited to ordering of vulnerabilities etc. Once you start scanning multiple (2, 3, 4. ... 10) artifacts, the differences start to affect which dependencies are considered "top level" dependencies, and which dependencies are seen as "related dependencies". In the end all dependencies and vulnerabilities should still be there, but differently ordered/grouped. However in #3291 I have seen that there are bigger differences in the reports. There can as much 25% in size difference in the xml report. And when importing the report into vulnerability management tools as OWASP Defect Dojo, shows difference total vulnerability count. Analyzing these 8-10MB xml reports was hard, so in search of finding the cause of this, I started to make some changes to have a predictable ordering in the reports (and internal data structures).
This PR realizes this, and now it turns out the reports are consistently the same across runs of the scanner. So no more difference in vulnerability count or filesize or otherwise.

Implementation

The PR replaces some HashSets with TreeSets to ensure natural ordering. A comparator is added to Dependency.java to allow ordering. It's based on the same comparator used elsewhere outside the class (based on getActualFilePath())

TODO

The vulnerableSoftwareIdentifiers property was also converted to a TreeSet, but still appears in different order across runs. Not where this is caused. But for now I'd like to get feedback before continuing on this path.

Have test cases been added to cover the new functionality?

no, not sure if they are needed.

@boring-cyborg boring-cyborg bot added cli changes to the cli core changes to core labels Apr 22, 2021
@valentijnscholten
Copy link
Contributor Author

Not sure if that NullPoint while downloading NVD database is related to the changes in this PR.

@valentijnscholten
Copy link
Contributor Author

valentijnscholten commented Apr 22, 2021

Just went looking some more in the codebase and found out that in the past there have been PRs that actually removed the TreeSets and replaced them with HashSets. And removing / changing comparators to no longer ensure ordering.
https://github.com/jeremylong/DependencyCheck/pull/983/files

@jeremylong
Copy link
Owner

I believe some of these collections were previously ordered (in addition to others). The ordering was removed due to performance issues (on these and other collections). Doing some test runs show that given Dependencies Scanned: 6464 (4644 unique) using the un-ordered version I got an average runtime of 117 seconds. Using your ordered collections I got an average runtime of 124 seconds. Given that is a large number of dependencies - and most scans are not that large there is definitely a performance hit.

I wonder if all of these need to be tree sets. My guess is that the evidence, projectReferences, relatedDependencies, vulnerabilities, and suppressed vulnerabilities do not need to be sorted. The ordering on the paths (https://github.com/jeremylong/DependencyCheck/pull/3317/files#diff-1593f793158c7b1dc3a7bd83329962ad1226667dc5d14ca1e68eef1bf39e89bbR319) is definitely needed.

@valentijnscholten
Copy link
Contributor Author

valentijnscholten commented Apr 24, 2021

We could probably write a book on all the different Sets and their properties :-) Even with your change and the TreeSets, I still see the reports differing between scans. Sometimes there's a bunch of extra dependencies reported (most of the time they are not reported, but they are present). Without the ordering it was impossible to see what was happening, unless doing in memory comparisons between the sets and lists between runs.

Even if you prefer not to use the TreeSets, wouldn't you agree it would be very helpful if the report was ordered predictably? Maybe it would be OK to do a one-time ordering after the analysis or when writing the report? Make a optional parameter for it for those who want to shave of those ms from their runs?

What would be the best way to provide more info on the differences in dependencies between runs? I haven't fully grasped the model and logic of DC in my mind yet to know what's going on. These microservices have lots of identical dependencies, so I wonder if by using the file hashes or looking at parts of the file path is causing them to be lost sometimes.

@valentijnscholten
Copy link
Contributor Author

valentijnscholten commented Apr 24, 2021

Did some quick tests and with both your and my changes together (i.e. this PR), the reports are different between runs. Already before the line with the new sorting from your change, the dependencies are different.
Once I disable parallel processing, everything is consistent between runs (but slower of course).

@jeremylong
Copy link
Owner

@valentijnscholten do you have any specific test cases you could share? I went down a slightly different path then ordering everything (yes, we can look at sorting for the reports later - but for now I'd at least like to get the related dependency problem sorted). Take a look at #3343.

@valentijnscholten
Copy link
Contributor Author

I just tested #3343, but get different reports, different sizes:
image
Without ordering it's not possible (for me) to make a good comparison between the reports. I will send you some wars by e-mail to reproduce.

@jeremylong
Copy link
Owner

I just updated my branch. Take a look at #3343. I am now getting the same output from multiple executions (with the exception of the timestamp).

@valentijnscholten
Copy link
Contributor Author

superseded by #3343 which seems to fix #3291 and ensures predictable ordering in reports :-)

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 5, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cli changes to the cli core changes to core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants