Make classpath hashing more lightweight #433
See link for a proof of concept that basically removes the bottleneck entirely. It should really be implemented in the Zinc layer, but I have no way of doing that and testing it in a live sbt.
We can use the checksums in the jars themselves to detect whether something has changed (all zip files store checksums in their headers). In fact, it's not clear why we compute a hash of the jars instead of reusing the checksums already in the headers.
@jvican ha! That's a pretty good idea. Although I still think file size / timestamp is a pretty damn good low bar for everything. But having custom hashes sounds like an excellent idea, especially if the file already has a checksum!
For completeness I raised PRs against launcher and sbt to enable monkey patching to let me play around. Until they are working and merged I'm afraid I can't help.
I second the idea of using less safe but faster comparison options. Not sure about zip file signatures, though. Afaik zip files don't contain a hash over the whole content, only a CRC-32 per entry, which means you would have to read (probably significantly) into the file to collect all the checksums (though that would probably still be faster than reading all the data). Getting just file system metadata should be faster still.
Clear reproduction in https://github.com/cakesolutions/sbt-cake/tree/sbt-perf-regression, type …
This is not viable. System metadata is really unreliable, and I would not like the incremental compiler to misbehave because of it. I think that reading checksums from the zip file could be done in a fast way with some tweaks.
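For illustration, a minimal Scala sketch of that idea: the per-entry CRC-32s already sit in the jar's central directory, so opening the archive is enough to fold them into a single checksum without decompressing any entry. The combining scheme and the name `jarChecksum` are made up for the example; this is not necessarily what Zinc should do.

```scala
import java.util.zip.ZipFile
import scala.collection.JavaConverters._

// Fold the per-entry CRC-32s (read from the jar's central directory,
// no decompression needed) into a single checksum for the whole jar.
def jarChecksum(path: String): Long = {
  val zip = new ZipFile(path)
  try {
    zip.entries().asScala.foldLeft(0L) { (acc, entry) =>
      // getCrc returns -1 when unknown; fold it in regardless so a
      // changed or added entry still perturbs the result.
      31L * acc + entry.getCrc
    }
  } finally zip.close()
}
```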
> Hmm, relying on modification times, sizes, and other …

The incremental compiler is an optimization that will often be used in an interactive system. In that case, the system metadata should just be assumed to be reliable enough to be used. If it is not (for whatever reason), it should be the user who decides to switch to an even more reliable mechanism.
> system metadata like timestamps and file size?

I have never seen that be incorrect. Given that we're talking about external dependencies, it would be a fair assumption that they never change. I know for a fact that … The real hash can look inside the file, but the cache of the hash (which is what we're talking about here) is perfectly fine using file metadata.
I don't have any data to show you because I've rarely had problems with it, but I can point out two facts:

1. The incremental compiler is the default way of using the scalac compiler; I doubt users would tell you they consider it an optimization. Zinc is just another compiler, and as such it's better if it's as reproducible as possible.
2. That has to do with the classpath handling of scalac; it's not related to this ticket. I also believe it's improved in the last release, IIRC.
Which cache of the hash? We're discussing what Zinc should do to check whether the classpath has changed, and for that it currently hashes the contents. There's no cache of the hash... I believe that either the CRCs in the jar file or plain hashing (like we're doing right now) are the only two options here. I remind you that the reason it's so slow is that it's done transitively for all projects, not because doing it once is inherently slow. If we go with the hashing approach, the effect should be minimal. For example, xxHash can do 5.4 GB/s; even for a gigantic 1 GB classpath, that is about 185 ms of pure hashing, so no more than 250 ms with some margin for IO and latency.
The resolution to this ticket is to add a cache around the hash, using filesystem metadata to do so. See my proof of concept against sbt-io. This is proposed as a 1.0.x binary-compatible solution, so changing the actual hashing algorithm is out of scope and considered elsewhere.
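A minimal sketch of what such a metadata-keyed cache could look like. The names `ClasspathHashCache`, `Stamp`, and `hashFile` are hypothetical placeholders, not the actual sbt-io API:

```scala
import java.io.File
import java.util.concurrent.ConcurrentHashMap

// Recompute a jar's hash only when its (lastModified, size) pair
// changes; otherwise return the cached value.
object ClasspathHashCache {
  private final case class Stamp(lastModified: Long, size: Long)
  private val cache = new ConcurrentHashMap[File, (Stamp, Int)]()

  def hash(file: File, hashFile: File => Int): Int = {
    val stamp = Stamp(file.lastModified(), file.length())
    val cached = cache.get(file)
    if (cached != null && cached._1 == stamp) cached._2
    else {
      val h = hashFile(file) // cache miss: pay the full IO cost once
      cache.put(file, (stamp, h))
      h
    }
  }
}
```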
Exactly, so maybe this was just a misunderstanding? The idea was not to remove hashes completely but to avoid recalculation when files have not changed.
No, I really meant it: using filesystem metadata to figure out whether a file has changed is a bad idea. I'm all for removing the duplication in the transitive jar hashing that happens in multi-project builds, but I don't think the caching should do more than just that.
@jvican I'd need to see some solid evidence of the metadata being unreliable before I could agree with you. Removing transitives would be an even bigger exposure... what if I went and swapped out …

Timestamps can miss changes within their 1-second granularity (and the OS could lie and retain the original timestamp). File sizes can also lie (a single-character change). But the combination of both means the probability of not picking up a change is so low that it isn't worth considering. And a workaround is always to restart sbt or Zinc. Given that scalac currently can't even handle changing jars, I personally think the whole concept of hashing jars is moot.
Related ticket that explains why source file hashing is used, even though there's not much evidence: sbt/sbt#685. In essence, the precision of …

I've had a private chat with @fommil, and it looks like we've converged on using the filesystem metadata for jars, given that the jars mostly come from the ivy or coursier cache and not from the build (via …).
I would also like to say that I am in favour of a very rare case of undercompilation (which can be resolved by a clean) rather than a constant overhead to catch a rare edge case.
Sounds good.
Yes, that's a good point. The linked description of git's modification-detection logic also talked about that: git additionally uses the last-modified check to decide whether a file might have changed in the same second as the hash was computed.
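For reference, that git-style guard boils down to something like the sketch below, assuming 1-second timestamp resolution as discussed above (`isTrustworthy` is a made-up name):

```scala
// If the file's mtime is not strictly older than the second in which
// the cached hash was recorded, the entry is "racily clean" and must
// be re-hashed rather than trusted.
def isTrustworthy(fileMtimeMillis: Long, hashedAtMillis: Long): Boolean =
  fileMtimeMillis / 1000 < hashedAtMillis / 1000
```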
It is possible to have both correctness and performance for this case if file watching is used to invalidate cache entries. sbt already has facilities to do this, afaik (as do all other consumers of Zinc), so exposing a callback-based invalidation API would allow callers to bring their own invalidation. I believe @romanowski opened a PR for such a thing...? Assuming that invalidation API exists and is used by the relevant consumers, having a default/fallback implementation that uses filesystem metadata would be OK.
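Such a "bring your own invalidation" hook might be no more than this; a hypothetical shape, not an API that exists in Zinc:

```scala
import java.io.File

// A build tool that already watches the filesystem can evict cache
// entries itself; the metadata check then remains only as a fallback.
trait HashCacheInvalidation {
  def invalidate(jar: File): Unit
  def invalidateAll(): Unit
}
```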
And make it parallel!

This patch adds a cache that relies on filesystem metadata to reuse hashes for jars whose last-modified time is unchanged across compiler iterations. This is important because until now there was a significant overhead when running `compile` on multi-module builds with gigantic classpaths: the previous algorithm computed hashes for all jars transitively across all these projects.

This patch is conservative; there are several things wrong with the status quo of classpath hashing. The most important is that Zinc has been taking `hashCode` of a SHA-1 checksum, which doesn't make sense. The second is that we don't need a SHA-1 checksum for the kind of checks we want to do; sbt#371 explains why. The third limitation is that file hashes are represented internally as `int`s, which is not enough to capture the full checksum. My previous PR also tackles this problem, which will be solved in the long term.

Therefore, this pull request only tackles two things:

* Caching of classpath entry hashes.
* Parallelizing this IO-bound task.

Results, on my local machine:

- No parallel hashing of the first 500 jars in my ivy cache: 1330ms.
- Parallel hashing of the first 500 jars in my ivy cache: 770ms.
- Second parallel hashing of the first 500 jars in my ivy cache: 1ms.

Fixes sbt#433.
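A rough sketch of the parallel pass the description refers to, assuming the per-jar hash function passed in is already cache-backed so that a warm second run is nearly free (`hashClasspath` and `hashOne` are made-up names; the real patch lives in Zinc internals):

```scala
import java.io.File
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hash all classpath entries in parallel. Each cache miss pays the
// IO cost on a pool thread; cache hits return almost immediately.
def hashClasspath(jars: Seq[File], hashOne: File => Int): Seq[Int] = {
  val hashed = Future.traverse(jars)(jar => Future(hashOne(jar)))
  Await.result(hashed, 5.minutes)
}
```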
Fix #433: Make classpath hashing more lightweight
It would be good to have the option to bypass the calculation of a file's hash when neither its size nor its timestamp has changed (just the timestamp should be ok). This has huge performance potential... avoiding doing the work at all.
Possibly there could be some trigger that the file timestamp must be older than an hour or something... to avoid user interaction problems.
NIO provides a fast mechanism for collecting bulk metadata; it should be possible to traverse a directory and gather all of this information very quickly before deciding what needs to be rehashed.
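A sketch of that NIO traversal, collecting size and mtime for every regular file in one walk (`bulkMetadata` is a made-up name for illustration):

```scala
import java.nio.file.{Files, Path}
import java.nio.file.attribute.BasicFileAttributes
import scala.collection.JavaConverters._

// Collect (path, size, mtime) for everything under a directory in a
// single NIO walk; readAttributes fetches both values in one stat.
def bulkMetadata(root: Path): List[(Path, Long, Long)] =
  Files.walk(root).iterator().asScala
    .filter(Files.isRegularFile(_))
    .map { p =>
      val attrs = Files.readAttributes(p, classOf[BasicFileAttributes])
      (p, attrs.size(), attrs.lastModifiedTime().toMillis)
    }
    .toList
```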