Make classpath hashing more lightweight #433
See link for a proof of concept that basically removes the bottleneck entirely. It should really be implemented in the Zinc layer, but I have no way of doing that and testing it in a live sbt.
We can use the checksums in the jars themselves to detect whether something has changed (all zip files store checksums in their headers). In fact, it's not clear why we compute a hash of the jars instead of reusing the checksums already in the headers.
@jvican ha! That's a pretty good idea. Although I still think file size / timestamp is a pretty damn good low bar for everything. But having custom hashes sounds like an excellent idea, especially if the file already has a checksum!
For completeness I raised PRs against launcher and sbt to enable monkey patching to let me play around. Until they are working and merged I'm afraid I can't help.
I second the idea of using less safe but faster comparison options. Not sure about zip file signatures, though. Afaik zip files don't contain a hash over the whole content, only a CRC-32 per entry, which means you would have to read (probably significantly) into the file to collect all the checksums (though that would probably still be faster than reading all the data). Getting just file system metadata should be faster still.
Clear reproduction in https://github.com/cakesolutions/sbt-cake/tree/sbt-perf-regression, type …
This is not viable. System metadata is really unreliable, and I would not like the incremental compiler to misbehave because of it. I think that reading checksums from the zip file could be done in a fast way with some tweaks.
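For illustration, a minimal Scala sketch of that idea: the per-entry CRC-32s already sit in the jar's central directory, so opening the archive is enough to fold them into a single checksum without decompressing any entry. The combining scheme and the name `jarChecksum` are made up for the example; this is not necessarily what Zinc should do.

```scala
import java.util.zip.ZipFile
import scala.collection.JavaConverters._

// Fold the per-entry CRC-32s (read from the jar's central directory,
// no decompression needed) into a single checksum for the whole jar.
def jarChecksum(path: String): Long = {
  val zip = new ZipFile(path)
  try {
    zip.entries().asScala.foldLeft(0L) { (acc, entry) =>
      // getCrc returns -1 when unknown; fold it in regardless so a
      // changed or added entry still perturbs the result.
      31L * acc + entry.getCrc
    }
  } finally zip.close()
}
```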
> Hmm, relying on modification times, sizes, and other …

The incremental compiler is an optimization that will often be used in an interactive system. In that case, the system metadata should just be assumed to be reliable enough to be used. If it is not (for whatever reason), it should be the user who decides to switch to an even more reliable mechanism.
> system metadata like timestamps and file size?

I have never seen that be incorrect. Given that we're talking about external dependencies, it would be a fair assumption that they never change. I know for a fact that … The real hash can look inside the file, but the cache of the hash (which is what we're talking about here) is perfectly fine using file metadata.
I don't have any data to show you because I've rarely had problems with it, but I can point out two facts:

1. The incremental compiler is the default way of using the scalac compiler; I doubt users would tell you they consider it an optimization. Zinc is just another compiler, and as such it's better if it's as reproducible as possible.
2. That has to do with the classpath handling of scalac; it's not related to this ticket. I also believe it's improved in the last release, IIRC.
Which cache of the hash? We're discussing what Zinc should do to check whether the classpath has changed, and for that it currently hashes the contents. There's no cache of the hash... I believe that either the CRCs in the jar file or plain hashing (like we're doing right now) are the only two options here. I remind you that the reason it's so slow is that it's done transitively for all projects, not because doing it once is inherently slow. If we go with the hashing approach, the effect should be minimal. For example, xxHash can do 5.4 GB/s; even for a gigantic 1 GB classpath, that is about 185 ms of pure hashing, so no more than 250 ms with some margin for IO and latency.
The resolution to this ticket is to add a cache around the hash, using filesystem metadata to do so. See my proof of concept against sbt-io. This is proposed as a 1.0.x binary-compatible solution, so changing the actual hashing algorithm is out of scope and considered elsewhere.
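A minimal sketch of what such a metadata-keyed cache could look like. The names `ClasspathHashCache`, `Stamp`, and `hashFile` are hypothetical placeholders, not the actual sbt-io API:

```scala
import java.io.File
import java.util.concurrent.ConcurrentHashMap

// Recompute a jar's hash only when its (lastModified, size) pair
// changes; otherwise return the cached value.
object ClasspathHashCache {
  private final case class Stamp(lastModified: Long, size: Long)
  private val cache = new ConcurrentHashMap[File, (Stamp, Int)]()

  def hash(file: File, hashFile: File => Int): Int = {
    val stamp = Stamp(file.lastModified(), file.length())
    val cached = cache.get(file)
    if (cached != null && cached._1 == stamp) cached._2
    else {
      val h = hashFile(file) // cache miss: pay the full IO cost once
      cache.put(file, (stamp, h))
      h
    }
  }
}
```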
Exactly, so maybe this was just a misunderstanding? The idea was not to remove hashes completely but to avoid recalculation when files have not changed.
No, I really meant it: using filesystem metadata to figure out whether a file has changed is a bad idea. I'm all for removing the duplication in the transitive jar hashing that happens in multi-project builds, but I don't think the caching should do more than just that.
@jvican I'd need to see some solid evidence of the metadata being unreliable before I could agree with you. Removing transitives would be an even bigger exposure... what if I went and swapped out …

Timestamps can miss changes within their 1-second granularity (and the OS could lie and retain the original timestamp). File sizes can also lie (a single-character change). But the combination of both means the probability of not picking up a change is so low that it isn't worth considering. And a workaround is always to restart sbt or Zinc. Given that scalac currently can't even handle changing jars, I personally think the whole concept of hashing jars is moot.
Related ticket that explains why source file hashing is used, even though there's not much evidence: sbt/sbt#685. In essence, the precision of …

I've had a private chat with @fommil, and it looks like we've converged on using the filesystem metadata for jars, given that the jars mostly come from the ivy or coursier cache and not from the build (via …).
I would also like to say that I am in favour of a very rare case of undercompilation (which can be resolved by a clean) rather than a constant overhead to catch a rare edge case.
Sounds good.
Yes, that's a good point. The linked description of git's modification-detection logic also talked about that: git additionally uses the last-modified check to decide whether a file might have changed in the same second as the hash was computed.
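For reference, that git-style guard boils down to something like the sketch below, assuming 1-second timestamp resolution as discussed above (`isTrustworthy` is a made-up name):

```scala
// If the file's mtime is not strictly older than the second in which
// the cached hash was recorded, the entry is "racily clean" and must
// be re-hashed rather than trusted.
def isTrustworthy(fileMtimeMillis: Long, hashedAtMillis: Long): Boolean =
  fileMtimeMillis / 1000 < hashedAtMillis / 1000
```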
It is possible to have both correctness and performance for this case if file watching is used to invalidate cache entries. sbt already has facilities to do this, afaik (as do all other consumers of Zinc), so exposing a callback-based invalidation API would allow callers to bring their own invalidation. I believe @romanowski opened a PR for such a thing...? Assuming that invalidation API exists and is used by the relevant consumers, having a default/fallback implementation that uses filesystem metadata would be OK.
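Such a "bring your own invalidation" hook might be no more than this; a hypothetical shape, not an API that exists in Zinc:

```scala
import java.io.File

// A build tool that already watches the filesystem can evict cache
// entries itself; the metadata check then remains only as a fallback.
trait HashCacheInvalidation {
  def invalidate(jar: File): Unit
  def invalidateAll(): Unit
}
```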
And make it parallel!

This patch adds a cache that relies on filesystem metadata to reuse hashes for jars whose last-modified time is unchanged across compiler iterations. This is important because until now there was a significant overhead when running `compile` on multi-module builds with gigantic classpaths: the previous algorithm computed hashes for all jars transitively across all these projects.

This patch is conservative; there are several things wrong with the status quo of classpath hashing. The most important is that Zinc has been taking `hashCode` of a SHA-1 checksum, which doesn't make sense. The second is that we don't need a SHA-1 checksum for the kind of checks we want to do; sbt#371 explains why. The third limitation is that file hashes are represented internally as `int`s, which is not enough to capture the full checksum. My previous PR also tackles this problem, which will be solved in the long term.

Therefore, this pull request only tackles two things:

* Caching of classpath entry hashes.
* Parallelizing this IO-bound task.

Results, on my local machine:

- No parallel hashing of the first 500 jars in my ivy cache: 1330ms.
- Parallel hashing of the first 500 jars in my ivy cache: 770ms.
- Second parallel hashing of the first 500 jars in my ivy cache: 1ms.

Fixes sbt#433.
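A rough sketch of the parallel pass the description refers to, assuming the per-jar hash function passed in is already cache-backed so that a warm second run is nearly free (`hashClasspath` and `hashOne` are made-up names; the real patch lives in Zinc internals):

```scala
import java.io.File
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hash all classpath entries in parallel. Each cache miss pays the
// IO cost on a pool thread; cache hits return almost immediately.
def hashClasspath(jars: Seq[File], hashOne: File => Int): Seq[Int] = {
  val hashed = Future.traverse(jars)(jar => Future(hashOne(jar)))
  Await.result(hashed, 5.minutes)
}
```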
Fix #433: Make classpath hashing more lightweight
It would be good to have the option to bypass the calculation of a file's hash when neither its size nor its timestamp has changed (just the timestamp should be ok). This has huge performance potential... avoiding doing the work at all.
Possibly there could be some trigger that the file timestamp must be older than an hour or something... to avoid user interaction problems.
NIO provides a fast mechanism for collecting bulk metadata; it should be possible to traverse a directory and gather all of this information very quickly before deciding what needs to be rehashed.
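A sketch of that NIO traversal, collecting size and mtime for every regular file in one walk (`bulkMetadata` is a made-up name for illustration):

```scala
import java.nio.file.{Files, Path}
import java.nio.file.attribute.BasicFileAttributes
import scala.collection.JavaConverters._

// Collect (path, size, mtime) for everything under a directory in a
// single NIO walk; readAttributes fetches both values in one stat.
def bulkMetadata(root: Path): List[(Path, Long, Long)] =
  Files.walk(root).iterator().asScala
    .filter(Files.isRegularFile(_))
    .map { p =>
      val attrs = Files.readAttributes(p, classOf[BasicFileAttributes])
      (p, attrs.size(), attrs.lastModifiedTime().toMillis)
    }
    .toList
```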