Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make classpath hashing more lightweight #433

Closed
fommil opened this issue Oct 14, 2017 · 18 comments
Closed

Make classpath hashing more lightweight #433

fommil opened this issue Oct 14, 2017 · 18 comments

Comments

@fommil
Copy link

fommil commented Oct 14, 2017

It would be good to have the option to bypass the calculation of a file hash code if the size has not changed and the timestamp has not changed (just the timestamp should be ok). This has huge performance potential... avoiding doing work at all.

Possibly there could be some trigger that the file timestamp must be older than an hour or something... to avoid user interaction problems.

NIO allows a fast mechanism for collecting bulk metadata, it should be possible to traverse a directory and get all of this information very quickly before deciding what needs to be rehashed.

@fommil
Copy link
Author

fommil commented Oct 14, 2017

See link for a proof of concept that basically removes the bottleneck entirely. Should really be implemented in the zinc layer but I have no way of doing that and testing it in a live sbt.

@jvican jvican changed the title file hash shortcut Make classpath hashing more lightweight Oct 16, 2017
@jvican
Copy link
Member

jvican commented Oct 16, 2017

It would be good to have the option to bypass the calculation of a file hash code if the size has not changed and the timestamp has not changed (just the timestamp should be ok). This has huge performance potential... avoiding doing work at all.

We can use the checksums in the same jars to detect whether something has changed or not (all zip files have checksums in their headers). In fact, it's not clear why we compute a hash of the jars instead of reusing the checksum in the header.

@fommil
Copy link
Author

fommil commented Oct 16, 2017

@jvican ha! That's a pretty good idea. Although I still think file size / timestamp is a pretty damn good low bar for everything. But having custom hashes sounds like an excellent idea, especially if the file already has a checksum!

@fommil
Copy link
Author

fommil commented Oct 31, 2017

For completeness I raised PRs against launcher and sbt to enable monkey patching to let me play around. Until they are working and merged I'm afraid I can't help.

sbt/launcher#43

sbt/sbt#3643

@jrudolph
Copy link
Member

jrudolph commented Nov 7, 2017

I second the idea of using less safe but faster comparison options. Not sure about zip file signatures. Afaik zip files don't contain a hash over all the content but only crc32s for each file which would mean that you would have to read (probably significantly) into the content to collect all the hashes (but would probably still be faster than reading all the data). Getting just file system metadata should be still faster.

@fommil
Copy link
Author

fommil commented Nov 7, 2017

Clear reproduction in https://github.com/cakesolutions/sbt-cake/tree/sbt-perf-regression

type sbt cpl to compile all test phases.

@jvican
Copy link
Member

jvican commented Nov 8, 2017

Getting just file system metadata should be still faster.

This is not viable. System metadata is really unreliable, and I would not like the incremental compiler to misbehave because of it.

I think that reading checksums from the zip file could be done in a fast way with some tweaks.

@jrudolph
Copy link
Member

jrudolph commented Nov 8, 2017

This is not viable. System metadata is really unreliable, and I would not like the incremental compiler to misbehave because of it.

Hmm, relying on modification times, sizes, and other stat information is also how git and rsync work in the default cases. If it is reliable enough for those kinds of more critical infrastructure, wouldn't it be for zinc as well?

The incremental compiler is an optimization that will often be used in an interactive system. In that case, the system metadata should just be assumed to be reliable enough to be used. If it is not (for whatever reason) it should be the user who decides to switch to an even more reliable mechanism.

@fommil
Copy link
Author

fommil commented Nov 8, 2017

system metadata like timestamps and file size? I have never seen that to be incorrect. Given that we're talking about external dependencies, it would be a fair assumption that they never change.

I know for a fact that scalac can't handle the case where a jar file changes... the whole classpath needs to be reloaded and I doubt zinc is even considering that.

The real hash can look inside the file, but the cache of the hash (which is what we're talking about here) is perfectly fine using file metadata.

@jvican
Copy link
Member

jvican commented Nov 8, 2017

Hmm, relying on modification times, sizes, and other stat information is also how git and rsync work in the default cases. If it is reliable enough for those kinds of more critical infrastructure, wouldn't it be for zinc as well?

I don't have any data to show you because I've rarely had problems with it, but I can point out two facts:

  1. The Zinc compiler hashes source files to detect changes because there were problems in the past with just checking the file metadata. If that happened with source files, it will also happen with jars. Mark Harrah documented it somewhere.

  2. Pants and Bazel exist in part because you cannot trust system's metadata -- I believe reading they had some issues with it. As a result, they hash everything.

The incremental compiler is an optimization that will often be used in an interactive system.

The incremental compiler is the default way of using the Scalac compiler, I doubt users would tell you they consider it an optimization. Zinc is just another compiler and as such it's better if it's as reproducible as possible.

I know for a fact that scalac can't handle the case where a jar file changes... the whole classpath needs to be reloaded and I doubt zinc is even considering that.

That has to do with the classpath handling of scalac, it's not related to this ticket. I also believe it's improved in the last release IIRC.

The real hash can look inside the file, but the cache of the hash (which is what we're talking about here) is perfectly fine using file metadata.

Which cache of the hash? We're discussing what Zinc should do to check if the classpath has changed, and for that it currently uses hashing of the contents. There's no cache of the hash...

I believe that either the CRCs in the jar file or plain hashing (like we're doing right now) are the only two options here. I remind you that the reason why it's so slow is because it does it transitively for all projects, not because doing so is inherently slow.

If we assume that we go with the hashing aproach, the effect should be minimal. For example, xxHash can do 5.4GB/s. Even if we assume a gigantic classpath of 1gb, that would not even be more than 250 ms (giving it some margin for IO and latency).

@fommil
Copy link
Author

fommil commented Nov 8, 2017

the resolution to this ticket is to add a cache around the hash, using filesystem metadata to do so. See my Proof of Concept to sbt-io.

This is proposed as a 1.0.x binary compatible solution, so changing the actual hashing algorithm is out of scope and considered elsewhere.

@jrudolph
Copy link
Member

jrudolph commented Nov 9, 2017

the resolution to this ticket is to add a cache around the hash, using filesystem metadata to do so. See my Proof of Concept to sbt-io.

Exactly, so maybe this was just a misunderstanding? The idea was not to remove hashes completely but to avoid recalculation when files have not changed.

@jvican
Copy link
Member

jvican commented Nov 9, 2017

No, I really meant it: using filesystem metadata to figure out if a file has changed or not is a bad idea.

I'm all for removing the duplication for the transitive jar hashing that happens in multi-build project, but I don't think that the caching should do more than just that.

@fommil
Copy link
Author

fommil commented Nov 9, 2017

@jvican I'd need to see some solid evidence of the meta data being unreliable before I could agree with you. Removing transitives would be an even bigger exposure... what if I went and swapped out scala-library.jar or (more likely, and the reason why I suspect Eugene added jar caching in the first place) some dependent jar that is part of my build? Your approach wouldn't notice this.

Timestamps can miss changes with a fidelity of 1 second (and the OS could lie and retain the original timestamp). File sizes can also lie (a single char change). But the combination of both of these things means that the probability of not picking up on the change is so low that it isn't worth considering. And a workaround is always to restart sbt or zinc. Given that scalac currently can't even handle changing jars, I personally think the whole concept of hashing jars is moot.

@jvican
Copy link
Member

jvican commented Nov 9, 2017

Related ticket that explains why source file hashing is used, even though there's not much evidence: sbt/sbt#685. In essence, the precision of lastModified is poor.

I've had a private chat with @fommil and it looks like we've converged to use the filesystem metadata for jars, given that the jars mostly come from the ivy or coursier cache and not from the build (via packageBin).

@leonardehrenfried
Copy link

leonardehrenfried commented Nov 9, 2017 via email

@jrudolph
Copy link
Member

jrudolph commented Nov 9, 2017

I've had a private chat with @fommil and it looks like we've converged to use the filesystem metadata for jars, given that the jars mostly come from the ivy or coursier cache and not from the build (via packageBin).

Sounds good.

Timestamps can miss changes with a fidelity of 1 second (and the OS could lie and retain the original timestamp). File sizes can also lie (a single char change). But the combination of both of these things means that the probability of not picking up on the change is so low that it isn't worth considering. And a workaround is always to restart sbt or zinc.

Yes that's a good point. The link to the description of the git modification detection logic also talked about that. They also use the last modification check to decide whether a change might have changed in the same second as the hash has been computed or not.

@stuhood
Copy link

stuhood commented Nov 9, 2017

It is possible to have both correctness and performance for this case if file watching is used to invalidate cache entries. sbt already has facilities to do this, afaik (as do all other consumers of zinc), so exposing a callbacky invalidation API would allow callers to bring their own invalidation. I believe @romanowski opened a PR for such a thing...?

Assuming that that invalidation API exists and is used by the relevant consumers, having a default/fallback impl that used filesystem metadata would be ok.

jvican added a commit to scalacenter/zinc that referenced this issue Nov 9, 2017
And make it parallel!

This patch adds a cache that relies on filesystem metadata to cache
hashes for jars that have the same last modified time across different
compiler iterations. This is important because until now there was a
significant overhead when running `compile` on multi-module builds that
have gigantic classpaths. In this scenario, the previous algorithm
computed hashes for all jars transitively across all these projects.

This patch is conservative; there are several things that are wrong with
the status quo of classpath hashing. The most important one is the fact
that Zinc has been doing `hashCode` on a SHA-1 checksum, which doesn't
make sense. The second one is that we don't need a SHA-1 checksum for
the kind of checks we want to do. sbt#371
explains why. The third limitation with this check is that file hashes
are implemented internally as `int`s, which is not enough to represent
the richness of the checksum. My previous PR also tackles this problem,
which will be solved in the long term.

Therefore, this pull request only tackles these two things:

* Caching of classpath entry hashes.
* Parallelize this IO-bound task.

Results, on my local machine:

- No parallel hashing of the first 500 jars in my ivy cache: 133ms.
- Parallel hashing of the first 500 jars in my ivy cache: 770ms.
- Second parallel hashing of the first 500 jars in my ivy cache: 1ms.
jvican added a commit to scalacenter/zinc that referenced this issue Nov 9, 2017
And make it parallel!

This patch adds a cache that relies on filesystem metadata to cache
hashes for jars that have the same last modified time across different
compiler iterations. This is important because until now there was a
significant overhead when running `compile` on multi-module builds that
have gigantic classpaths. In this scenario, the previous algorithm
computed hashes for all jars transitively across all these projects.

This patch is conservative; there are several things that are wrong with
the status quo of classpath hashing. The most important one is the fact
that Zinc has been doing `hashCode` on a SHA-1 checksum, which doesn't
make sense. The second one is that we don't need a SHA-1 checksum for
the kind of checks we want to do. sbt#371
explains why. The third limitation with this check is that file hashes
are implemented internally as `int`s, which is not enough to represent
the richness of the checksum. My previous PR also tackles this problem,
which will be solved in the long term.

Therefore, this pull request only tackles these two things:

* Caching of classpath entry hashes.
* Parallelize this IO-bound task.

Results, on my local machine:

- No parallel hashing of the first 500 jars in my ivy cache: 1330ms.
- Parallel hashing of the first 500 jars in my ivy cache: 770ms.
- Second parallel hashing of the first 500 jars in my ivy cache: 1ms.
jvican added a commit to scalacenter/zinc that referenced this issue Nov 9, 2017
And make it parallel!

This patch adds a cache that relies on filesystem metadata to cache
hashes for jars that have the same last modified time across different
compiler iterations. This is important because until now there was a
significant overhead when running `compile` on multi-module builds that
have gigantic classpaths. In this scenario, the previous algorithm
computed hashes for all jars transitively across all these projects.

This patch is conservative; there are several things that are wrong with
the status quo of classpath hashing. The most important one is the fact
that Zinc has been doing `hashCode` on a SHA-1 checksum, which doesn't
make sense. The second one is that we don't need a SHA-1 checksum for
the kind of checks we want to do. sbt#371
explains why. The third limitation with this check is that file hashes
are implemented internally as `int`s, which is not enough to represent
the richness of the checksum. My previous PR also tackles this problem,
which will be solved in the long term.

Therefore, this pull request only tackles these two things:

* Caching of classpath entry hashes.
* Parallelize this IO-bound task.

Results, on my local machine:

- No parallel hashing of the first 500 jars in my ivy cache: 1330ms.
- Parallel hashing of the first 500 jars in my ivy cache: 770ms.
- Second parallel hashing of the first 500 jars in my ivy cache: 1ms.
jvican added a commit to scalacenter/zinc that referenced this issue Nov 9, 2017
And make it parallel!

This patch adds a cache that relies on filesystem metadata to cache
hashes for jars that have the same last modified time across different
compiler iterations. This is important because until now there was a
significant overhead when running `compile` on multi-module builds that
have gigantic classpaths. In this scenario, the previous algorithm
computed hashes for all jars transitively across all these projects.

This patch is conservative; there are several things that are wrong with
the status quo of classpath hashing. The most important one is the fact
that Zinc has been doing `hashCode` on a SHA-1 checksum, which doesn't
make sense. The second one is that we don't need a SHA-1 checksum for
the kind of checks we want to do. sbt#371
explains why. The third limitation with this check is that file hashes
are implemented internally as `int`s, which is not enough to represent
the richness of the checksum. My previous PR also tackles this problem,
which will be solved in the long term.

Therefore, this pull request only tackles these two things:

* Caching of classpath entry hashes.
* Parallelize this IO-bound task.

Results, on my local machine:

- No parallel hashing of the first 500 jars in my ivy cache: 1330ms.
- Parallel hashing of the first 500 jars in my ivy cache: 770ms.
- Second parallel hashing of the first 500 jars in my ivy cache: 1ms.

Fixes sbt#433.
jvican added a commit to scalacenter/zinc that referenced this issue Nov 9, 2017
And make it parallel!

This patch adds a cache that relies on filesystem metadata to cache
hashes for jars that have the same last modified time across different
compiler iterations. This is important because until now there was a
significant overhead when running `compile` on multi-module builds that
have gigantic classpaths. In this scenario, the previous algorithm
computed hashes for all jars transitively across all these projects.

This patch is conservative; there are several things that are wrong with
the status quo of classpath hashing. The most important one is the fact
that Zinc has been doing `hashCode` on a SHA-1 checksum, which doesn't
make sense. The second one is that we don't need a SHA-1 checksum for
the kind of checks we want to do. sbt#371
explains why. The third limitation with this check is that file hashes
are implemented internally as `int`s, which is not enough to represent
the richness of the checksum. My previous PR also tackles this problem,
which will be solved in the long term.

Therefore, this pull request only tackles these two things:

* Caching of classpath entry hashes.
* Parallelize this IO-bound task.

Results, on my local machine:

- No parallel hashing of the first 500 jars in my ivy cache: 1330ms.
- Parallel hashing of the first 500 jars in my ivy cache: 770ms.
- Second parallel hashing of the first 500 jars in my ivy cache: 1ms.

Fixes sbt#433.
jvican added a commit to scalacenter/zinc that referenced this issue Nov 9, 2017
And make it parallel!

This patch adds a cache that relies on filesystem metadata to cache
hashes for jars that have the same last modified time across different
compiler iterations. This is important because until now there was a
significant overhead when running `compile` on multi-module builds that
have gigantic classpaths. In this scenario, the previous algorithm
computed hashes for all jars transitively across all these projects.

This patch is conservative; there are several things that are wrong with
the status quo of classpath hashing. The most important one is the fact
that Zinc has been doing `hashCode` on a SHA-1 checksum, which doesn't
make sense. The second one is that we don't need a SHA-1 checksum for
the kind of checks we want to do. sbt#371
explains why. The third limitation with this check is that file hashes
are implemented internally as `int`s, which is not enough to represent
the richness of the checksum. My previous PR also tackles this problem,
which will be solved in the long term.

Therefore, this pull request only tackles these two things:

* Caching of classpath entry hashes.
* Parallelize this IO-bound task.

Results, on my local machine:

- No parallel hashing of the first 500 jars in my ivy cache: 1330ms.
- Parallel hashing of the first 500 jars in my ivy cache: 770ms.
- Second parallel hashing of the first 500 jars in my ivy cache: 1ms.

Fixes sbt#433.
jvican added a commit to scalacenter/zinc that referenced this issue Nov 10, 2017
And make it parallel!

This patch adds a cache that relies on filesystem metadata to cache
hashes for jars that have the same last modified time across different
compiler iterations. This is important because until now there was a
significant overhead when running `compile` on multi-module builds that
have gigantic classpaths. In this scenario, the previous algorithm
computed hashes for all jars transitively across all these projects.

This patch is conservative; there are several things that are wrong with
the status quo of classpath hashing. The most important one is the fact
that Zinc has been doing `hashCode` on a SHA-1 checksum, which doesn't
make sense. The second one is that we don't need a SHA-1 checksum for
the kind of checks we want to do. sbt#371
explains why. The third limitation with this check is that file hashes
are implemented internally as `int`s, which is not enough to represent
the richness of the checksum. My previous PR also tackles this problem,
which will be solved in the long term.

Therefore, this pull request only tackles these two things:

* Caching of classpath entry hashes.
* Parallelize this IO-bound task.

Results, on my local machine:

- No parallel hashing of the first 500 jars in my ivy cache: 1330ms.
- Parallel hashing of the first 500 jars in my ivy cache: 770ms.
- Second parallel hashing of the first 500 jars in my ivy cache: 1ms.

Fixes sbt#433.
Duhemm added a commit that referenced this issue Nov 14, 2017
Fix #433: Make classpath hashing more lightweight
@jvican jvican closed this as completed Nov 16, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants