Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve analysis of maven based projects #7965

Open
netomi opened this issue Nov 30, 2023 · 10 comments · May be fixed by #7981
Open

Improve analysis of maven based projects #7965

netomi opened this issue Nov 30, 2023 · 10 comments · May be fixed by #7981
Labels
analyzer About the analyzer tool enhancement Issues that are considered to be enhancements

Comments

@netomi
Copy link

netomi commented Nov 30, 2023

I tried to use the ORT GH action to analyse a maven project, but it took more than 30 min to run only the analysis inside a GitHub runner (https://github.com/netomi/macos-notarization-service/actions/runs/7006736431).

The project has some additional repository setup and ORT seems to try find out from which repository a dependency is coming from by trying to download the dependency from each configured repository. There seems to be also some network throttling in place when run as a GH action, when running the same analysis locally, it completed in a couple of minutes.

However, there should be a way to speed this up and I worked on improving the resolution of dependencies in maven projects. In my fork at https://github.com/netomi/ort/tree/disable-remote-verification I did some experiments to get the resolved repository from maven itself (it stores that information in the _remote.repositories file in the local cache).

With these changes the run of the analysis on GitHub could be completed in around 2min, see https://github.com/netomi/macos-notarization-service/actions/runs/7021936345).

My approach so far is quick and dirty and more a PoC, but I will be working on a PR to make this as clean as possible.

@netomi netomi changed the title Improve analysis of maven based project Improve analysis of maven based projects Nov 30, 2023
@sschuberth sschuberth added enhancement Issues that are considered to be enhancements analyzer About the analyzer tool labels Nov 30, 2023
@sschuberth
Copy link
Member

My approach so far is quick and dirty and more a PoC, but I will be working on a PR to make this as clean as possible.

Looking forward to your PR, thanks a lot for looking into improving things!

As a semi-related side, I've been long aware that the Maven code probably is more complex than it needs to be by now, also see #4720.

@netomi
Copy link
Author

netomi commented Nov 30, 2023

My idea so far that worked is to let maven resolve all dependencies by itself which it can do best it seems (e.g. running a normal build is also fast, even when the artifacts are not cached). Afterwards try to get the repository that maven used to download the dependency from.

While debugging, it seems that the DefaultArtifactResolver does correctly set the repository, however this information is later lost, the artifacts that you receive from the project building request do not have a repository field being set though internally that information is known to maven.

This feels like a bug in maven, I could workaround that by registering a repositoryListener that gets called whenever an artifact is resolved. There the repository information is still present.

@mnonnenmacher
Copy link
Member

I tried to use the ORT GH action to analyse a maven project, but it took more than 30 min to run only the analysis inside a GitHub runner (https://github.com/netomi/macos-notarization-service/actions/runs/7006736431).

Such a slow resolution for Maven projects is unusual and often an indication of an issue with one of the used repositories. For example, sometimes repositories are configured which are not available anymore which leads to lots of time consuming network timeouts. I had a short look at the logs and it seems that the majority of the time is spent waiting for https://repository.jboss.org/. Each request takes several seconds, sometimes more than 10. As an example:

15:07:23.410 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection request: [route: {s}->https://repository.jboss.org:443][total available: 4; route allocated: 1 of 50; total allocated: 4 of 100]
15:07:23.410 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection leased: [id: 44][route: {s}->https://repository.jboss.org:443][total available: 3; route allocated: 1 of 50; total allocated: 4 of 100]
15:07:23.410 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.conn.DefaultManagedHttpClientConnection - http-outgoing-44: set socket timeout to 0
15:07:23.410 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.conn.DefaultManagedHttpClientConnection - http-outgoing-44: set socket timeout to 1800000
15:07:23.410 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.execchain.MainClientExec - Executing request GET /nexus/content/groups/public/org/apache/maven/maven/3.9.3/maven-3.9.3.pom HTTP/1.1
15:07:23.410 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.execchain.MainClientExec - Target auth state: UNCHALLENGED
15:07:23.410 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.execchain.MainClientExec - Proxy auth state: UNCHALLENGED
15:07:28.112 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.execchain.MainClientExec - Connection can be kept alive indefinitely
15:07:28.113 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.conn.DefaultManagedHttpClientConnection - http-outgoing-44: Close connection
15:07:28.113 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.execchain.MainClientExec - Connection discarded
15:07:28.113 [DefaultDispatcher-worker-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection released: [id: 44][route: {s}->https://repository.jboss.org:443][total available: 3; route allocated: 0 of 50; total allocated: 3 of 100]

Requests to other repositories take only a few milliseconds. I wonder if the JBoss repository is always so slow or if this is a temporary issue. What I would investigate is:

  • Why are the requests to the JBoss repository so slow?
  • If you do a clean build with an empty local Maven repository, does it also take so long? If not, this could be an indication that ORT does unnecessary requests. This should not happen, because ORT just follows the normal Maven resolution process and should stop at the first repository where an artifact is found.

The project has some additional repository setup and ORT seems to try find out from which repository a dependency is coming from by trying to download the dependency from each configured repository.

ORT only downloads the POM files for dependencies, for other artifacts only the checksums are downloaded. See:

/**
* A specialized [WorkspaceReader] implementation used when building a Maven project that prevents unnecessary
* downloads of binary artifacts.
*
* When building a Maven project from a POM using Maven's [ProjectBuilder] API clients have no control over the
* downloads of dependencies: If dependencies are to be resolved, all the artifacts of these dependencies are
* automatically downloaded. For the purpose of just constructing the dependency tree, this is not needed and only
* costs time and bandwidth.
*
* Unfortunately, there is no official API to prevent the download of dependencies. However, Maven can be tricked to
* believe that the artifacts are already present on the local disk - then the download is skipped. This is what
* this implementation does. It reports that all binary artifacts are available locally, and only treats POMs
* correctly, as they may be required for the dependency analysis.
*/
private class SkipBinaryDownloadsWorkspaceReader(
/** The real workspace reader to delegate to. */
val delegate: WorkspaceReader
) : WorkspaceReader by delegate {
/**
* Locate the given artifact on the local disk. This implementation does a correct location only for POM files;
* for all other artifacts it returns a non-null file. Note: For the purpose of analyzing the project's
* dependencies the artifact files are never accessed. Therefore, the concrete file returned here does not
* actually matter; it just has to be non-null to indicate that the artifact is present locally.
*/
override fun findArtifact(artifact: Artifact): File? {
return if (artifact.extension == "pom") {
delegate.findArtifact(artifact)
} else {
File(artifact.artifactId)
}
}
}

However, there should be a way to speed this up and I worked on improving the resolution of dependencies in maven projects. In my fork at https://github.com/netomi/ort/tree/disable-remote-verification I did some experiments to get the resolved repository from maven itself (it stores that information in the _remote.repositories file in the local cache).

This could be an issue with setups where the local Maven repository is kept for ORT scans of different Maven projects. If those projects use different repositories for the same artifacts (like a mirror of Maven central) the information from the local cache might be wrong, because Maven did not attempt a new resolution of artifacts that were already cached.

@netomi
Copy link
Author

netomi commented Nov 30, 2023

The project only has one additional repository setup: repo.eclipse.org for one dependency. The jboss repository probably comes from another dependency's pom and is not used generally. One negative side-effect of the current approach is that ORT tries to download from the repositories that are defined for the project in the order as it retrieved from the maven API. In this project, the repo.eclipse.org repository is first in the list, thus is always tried first to resolve the artifact, which is a waste ofc.

I understand that the information in the local maven cache might not reflect the real situation, however using ort like it is currently done does not seem to be working as expected when used in a GitHub Action. The existing ORT action does already some caching but it only affects the ORT job so I would not expect any negative side-effect of it. Maybe a special option could be added to enable this optimization when you are sure that the local maven cache will not have side-effects from other projects?

@netomi
Copy link
Author

netomi commented Nov 30, 2023

ORT only downloads the POM files for dependencies, for other artifacts only the checksums are downloaded. See:

I dont think that this is true, looking at

val binaryRemoteArtifact = localProject?.let {
RemoteArtifact.EMPTY
} ?: requestRemoteArtifact(artifact, repositories, useReposFromDependencies)
val isBinaryArtifactModified = isArtifactModified(artifact, binaryRemoteArtifact)
val sourceRemoteArtifact = when {
localProject != null -> RemoteArtifact.EMPTY
artifact.extension == "pom" -> binaryRemoteArtifact
else -> {
val sourceArtifact = artifact.let {
DefaultArtifact(it.groupId, it.artifactId, "sources", "jar", it.version)
}
requestRemoteArtifact(sourceArtifact, repositories, useReposFromDependencies)
}
}
and the code in requestRemoteArtifact looks like that the binary and even source archives are tried to be downloaded from all repositories to determine from which repository it comes from.

@mnonnenmacher
Copy link
Member

The project only has one additional repository setup: repo.eclipse.org for one dependency. The jboss repository probably comes from another dependency's pom and is not used generally. One negative side-effect of the current approach is that ORT tries to download from the repositories that are defined for the project in the order as it retrieved from the maven API. In this project, the repo.eclipse.org repository is first in the list, thus is always tried first to resolve the artifact, which is a waste ofc.

This is exactly what Maven does during dependency resolution. If the POM file of a dependency defines repositories, they are appended to the list of repositories and used to resolve the transitive dependencies of that dependency (but not for other branches of the dependency tree).

I understand that the information in the local maven cache might not reflect the real situation, however using ort like it is currently done does not seem to be working as expected when used in a GitHub Action. The existing ORT action does already some caching but it only affects the ORT job so I would not expect any negative side-effect of it. Maybe a special option could be added to enable this optimization when you are sure that the local maven cache will not have side-effects from other projects?

Yes, making this optimization optional would be a requirement for some ORT users, but I agree that for the GitHub action it could be a useful option.

@mnonnenmacher
Copy link
Member

ORT only downloads the POM files for dependencies, for other artifacts only the checksums are downloaded. See:

I dont think that this is true, looking at

val binaryRemoteArtifact = localProject?.let {
RemoteArtifact.EMPTY
} ?: requestRemoteArtifact(artifact, repositories, useReposFromDependencies)
val isBinaryArtifactModified = isArtifactModified(artifact, binaryRemoteArtifact)
val sourceRemoteArtifact = when {
localProject != null -> RemoteArtifact.EMPTY
artifact.extension == "pom" -> binaryRemoteArtifact
else -> {
val sourceArtifact = artifact.let {
DefaultArtifact(it.groupId, it.artifactId, "sources", "jar", it.version)
}
requestRemoteArtifact(sourceArtifact, repositories, useReposFromDependencies)
}
}

and the code in requestRemoteArtifact looks like that the binary and even source archives are tried to be downloaded from all repositories to determine from which repository it comes from.

The requestRemoteArtifact function returns on the first hit, see:

return RemoteArtifact(info.downloadUrl, hash).also { remoteArtifact ->
logger.debug { "Writing remote artifact for '$artifact' to disk cache." }
remoteArtifactCache.write(cacheKey, remoteArtifact.toYaml())
}

It also does not actually download the artifact but only does an existence check, see:

val artifactDownload = ArtifactDownload(artifact, "project", downloadFile, policy.checksumPolicy)
artifactDownload.isExistenceCheck = true

@netomi
Copy link
Author

netomi commented Nov 30, 2023

The project only has one additional repository setup: repo.eclipse.org for one dependency. The jboss repository probably comes from another dependency's pom and is not used generally. One negative side-effect of the current approach is that ORT tries to download from the repositories that are defined for the project in the order as it retrieved from the maven API. In this project, the repo.eclipse.org repository is first in the list, thus is always tried first to resolve the artifact, which is a waste ofc.

This is exactly what Maven does during dependency resolution. If the POM file of a dependency defines repositories, they are appended to the list of repositories and used to resolve the transitive dependencies of that dependency (but not for other branches of the dependency tree).

The order how these repos are tried to resolve an artifact matters. If any artifact is first tried on a repo that only hosts a single dependency but mavencentral is tried last, you are wasting a lot of connections while trying to resolve your artifacts, which in the case of GitHub can have all kinds of side-effect (runner might get throttled). It looks like the maven API that ORT uses returns the repos in LIFO order, so mavencentral is defined in the super pom, but in the list of repos its always last, so ORT tries mavencentral as last resort which is kind of counter-intuitive as mavencentral should be the first option. But thats just what I have seen from debugging this stuff for 1-2 days, will need certainly more time to understand what is going on.

Yes, making this optimization optional would be a requirement for some ORT users, but I agree that for the GitHub action it could be a useful option.

Will need to think about it, but my idea is that maven does a very good job to resolve artifacts and has proven this over many years, any custom resolution mechanism on top of it will be most likely be worse in terms of performance, especially in constrained environements like CI systems.

@sschuberth
Copy link
Member

ORT tries mavencentral as last resort which is kind of counter-intuitive as mavencentral should be the first option.

As @mnonnenmacher pointed out, ORT does here the same thing as Maven itself would do: Repositories that are defined "closer" to the project / dependency in question are tried first. Otherwise, projects would have no chance of overriding an artifact that also lives in Maven Central.

@netomi
Copy link
Author

netomi commented Nov 30, 2023

ok ty, I did some more digging into the logs from ORT and maven for resolving the dependencies and the way the artifacts are resolved is pretty much the same. Maven seems to do it much faster though. Will need to understand what is different. Connections are dropped / released quite often and have to be re-established, which could trigger some protection mechanisms on some repos.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
analyzer About the analyzer tool enhancement Issues that are considered to be enhancements
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants