Fix #433: Make classpath hashing more lightweight #452
Conversation
can you please validate against the demo project at https://github.com/cakesolutions/sbt-cake/tree/sbt-perf-regression ? I have no idea how to run a custom zinc, so I am unable to do so. Hence the monkey patch PRs to sbt.
@@ -162,6 +165,13 @@ final class MixedAnalyzingCompiler(
   * of cross Java-Scala compilation.
   */
 object MixedAnalyzingCompiler {
   private[this] val cacheMetadataJar = new ConcurrentHashMap[File, (FileTime, FileHash)]()
This will leak... you need to use something like a Guava weak cache.
Also, I recommend using `Path` instead of `File`, because it has a much smaller memory footprint. In fact, `File` (and `URL`) are the biggest heap hogs in sbt on large projects, mostly because the path prefix is duplicated. I'm talking GBs.
You're right that I should use a path here. I'll use a synchronized map wrapping a weak cache ref. Ironically, that was my first approach, but I didn't quite like it.
doesn't sbt have a synchronised weak cache already?
Not to the best of my knowledge...
Finally, I've decided not to use a synchronized weak map here. We could get fancy, but this cache is only for jars and it's going to have at most ~1000 entries, so I'll keep it as it is.
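The trade-off discussed in this thread (a bounded strong cache vs. a synchronized weak map) can be sketched with plain JDK types. This is an illustrative sketch, not the PR's code; the string keys and `lookup` helper stand in for jar paths and the hashing work.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.ConcurrentHashMap;

public class Main {
    // Option 1: ConcurrentHashMap. Lock-free reads, but entries are held
    // strongly forever. Acceptable when the key set is small and bounded
    // (classpath jars: at most ~1000 entries, as argued above).
    static final Map<String, Integer> strongCache = new ConcurrentHashMap<>();

    // Option 2: a synchronized weak map. Entries become collectable once
    // the key is otherwise unreferenced, at the cost of a coarse lock.
    static final Map<String, Integer> weakCache =
            Collections.synchronizedMap(new WeakHashMap<>());

    // computeIfAbsent stands in for "hash the jar on a cache miss".
    static int lookup(Map<String, Integer> cache, String jarName) {
        return cache.computeIfAbsent(jarName, k -> k.length());
    }

    public static void main(String[] args) {
        System.out.println(lookup(strongCache, "scala-library.jar"));
        System.out.println(lookup(weakCache, "scala-library.jar"));
    }
}
```

With a bounded key set, the leak concern disappears and the lock-free option wins; the weak map only pays off when keys churn.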
FileHash.of(x, Stamper.forHash(x).hashCode)
val parallelClasspathHashing = classpath.toParArray.map { file =>
  val attrs = Files.readAttributes(file.toPath, classOf[BasicFileAttributes])
  val currentFileTime = attrs.lastModifiedTime()
Using size as well as modified time is effectively free at this point (it's in attrs), so we could use it to be safe.
Sure.
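The point above can be shown with a minimal Java sketch: one `Files.readAttributes` call already carries both the modified time and the size, so comparing the pair instead of the time alone costs no extra syscall. The helper names are illustrative, not the PR's code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

public class Main {
    // A single readAttributes call yields both pieces of metadata, so
    // checking (mtime, size) instead of mtime alone is effectively free.
    static long[] metadataOf(Path jar) {
        try {
            BasicFileAttributes attrs = Files.readAttributes(jar, BasicFileAttributes.class);
            return new long[] { attrs.lastModifiedTime().toMillis(), attrs.size() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper: create a throwaway file with the given contents.
    static Path writeTemp(byte[] data) {
        try {
            Path tmp = Files.createTempFile("fake-jar", ".jar");
            return Files.write(tmp, data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path jar = writeTemp(new byte[] { 1, 2, 3 });
        System.out.println("size=" + metadataOf(jar)[1]); // prints "size=3"
    }
}
```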
@@ -181,9 +191,16 @@ object MixedAnalyzingCompiler {
    incrementalCompilerOptions: IncOptions,
    extra: List[(String, String)]
  ): CompileConfiguration = {
    val classpathHash = classpath map { x =>
      FileHash.of(x, Stamper.forHash(x).hashCode)
    val parallelClasspathHashing = classpath.toParArray.map { file =>
`.toParArray`? Is there a better way? The parallel collections rarely speed things up. Certainly hitting the cache in parallel will be slower.
Perhaps parallelisation can be done in a follow-up so its impact can be isolated?
That's not my experience. Parallel collections do speed things up, and they're reliable. I can tell you hitting the cache is not slower; I've done measurements locally.
Even for 10 jars, doing it in parallel is faster than doing it in serial.
It might be worth comparing the performance on an SSD versus a mechanical disk. It depends on a lot of factors (and I've no idea how modern technology mitigates them), but parallelizing IO operations on a mechanical disk might actually slow it down.
In my experience, IO-bound operations yield very good results when they're parallelized. I think there are old textbooks on parallel computing, from the pre-SSD era, that claim so too.
In any case, I benchmarked it just in case on a LAMP server. Results:
- Parallelized hashing: 549ms.
- Serial hashing: 1912ms.
I'm testing it on `/dev/vda`, which as you see is a rotational disk:
platform@scalagesrv2:~/tests/zinc$ lsblk -d -o name,rota
NAME ROTA
fd0 1
sr0 1
vda 1
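The parallel-vs-serial question debated above can be sketched with Java streams (a stand-in for Scala's `.toParArray`; the hashing function here is an invented placeholder, not Zinc's). The key property is that both variants produce the same ordered result, so parallelism is purely an implementation detail callers cannot observe.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class Main {
    // Stand-in for hashing one classpath entry; the real code reads the
    // jar and computes a checksum, here we just do deterministic work.
    static int hashEntry(String name) {
        int h = 17;
        for (int i = 0; i < 1000; i++) h = 31 * h + name.hashCode() + i;
        return h;
    }

    // Parallel and serial variants over the same classpath. Encounter
    // order is preserved by collect(toList()), so both produce the same
    // list regardless of which threads did the work.
    static List<Integer> hashAll(List<String> classpath, boolean parallel) {
        Stream<String> entries = parallel ? classpath.parallelStream() : classpath.stream();
        return entries.map(Main::hashEntry).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> cp = Arrays.asList("a.jar", "b.jar", "c.jar");
        System.out.println(hashAll(cp, true).equals(hashAll(cp, false))); // prints "true"
    }
}
```

Whether the parallel variant is actually faster depends on the disk and core count, as the benchmark numbers above show; correctness does not.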
@fommil I will not validate it myself. I already have a test that proves this works. If you're interested in checking this fix out, you can publish Zinc, change the Zinc version in sbt/sbt, publish sbt/sbt locally, and use that version in …
Force-pushed from e7db2f9 to 17d451b.
@@ -162,6 +165,17 @@ final class MixedAnalyzingCompiler(
   * of cross Java-Scala compilation.
   */
 object MixedAnalyzingCompiler {
   private type JarMetadata = (FileTime, Long)
   // Using paths instead of files as key because they have more lightweight memory consumption
   private[this] val cacheMetadataJar = new ConcurrentHashMap[File, (JarMetadata, FileHash)]()
Doesn't the code do something different from what the comment says? The key is a `File`.
and that is why comments SUCK
Yup, that's a stale comment, thanks for the catch.
FileHash.of(x, Stamper.forHash(x).hashCode)
// #433: Cache jars with their metadata to avoid recomputing hashes transitively in other projects
val parallelClasspathHashing = classpath.toParArray.map { file =>
  if (!file.exists()) emptyFileHash(file)
I'm pretty sure you can use the file attributes on a non-existent file and then use the response to detect existence... if that is the case, it would be best to use it here and avoid a stack frame (which is slow on Windows, so best to batch/minimise all I/O calls).
I was just getting IO errors because of this; it was unintuitive to me too.
@fommil I got a proof, `/caca/peluda` does not exist:
scala> val attrs = java.nio.file.Files.readAttributes((new java.io.File("/caca/peluda").toPath), classOf[BasicFileAttributes])
java.nio.file.NoSuchFileException: /caca/peluda
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
at sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
at java.nio.file.Files.readAttributes(Files.java:1737)
... 28 elided
ok cool, that's good to know.
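Since `readAttributes` throws `NoSuchFileException` for a missing path (as the REPL session above demonstrates), the existence check can be folded into the same call, saving the separate `exists()` syscall. A hedged Java sketch; the `-1` sentinel and method name are invented for illustration and play the role of `emptyFileHash`:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

public class Main {
    // A missing path surfaces as NoSuchFileException, so no separate
    // exists() syscall is needed; the catch branch handles that case.
    // -1 is a hypothetical sentinel, not Zinc's actual representation.
    static long sizeOrSentinel(Path p) {
        try {
            BasicFileAttributes attrs = Files.readAttributes(p, BasicFileAttributes.class);
            return attrs.size();
        } catch (NoSuchFileException missing) {
            return -1L;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sizeOrSentinel(Paths.get("definitely-missing-946213.jar")));
    }
}
```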
Looks fine from my perspective, although aligning more cleanly with #427 would be good.
@@ -162,6 +165,17 @@ final class MixedAnalyzingCompiler(
   * of cross Java-Scala compilation.
   */
 object MixedAnalyzingCompiler {
   // For more safety, store both the time and size
   private type JarMetadata = (FileTime, Long)
   private[this] val cacheMetadataJar = new ConcurrentHashMap[File, (JarMetadata, FileHash)]()
Given that #427 is in, it would seem that all of this code should be located on the companion object of the "default" implementation of hashing. Putting it at the top of `MixedAnalyzingCompiler` makes that class into more of a grab bag.
By the companion object of the default implementation of hashing, do you mean `FileHash` or `Stamper`? It's not clear to me we want to reuse this code across all the users of `IOHash` or `FileHash`.
If there isn't a relevant companion class, then I guess my suggestion would be to define a new one. Not a blocker. Just that logically I would expect this to look something like:
val hashingStrategy = if (flag) DefaultStrategy else UserDefinedStrategy(...)
???
But looking at it again, ExternalHooks doesn't make it possible to do this. So only the "grabbag" part is really relevant.
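The strategy-selection idea in the comment above could look like the following Java sketch. All names here (`ClasspathHasher`, `DEFAULT_STRATEGY`, `select`) are invented for illustration; as noted, `ExternalHooks` does not actually expose such a seam.

```java
public class Main {
    // Hypothetical seam: a default cached hashing strategy vs. a
    // user-supplied one, chosen by a flag. Not Zinc's actual API.
    interface ClasspathHasher {
        int hash(String entry);
    }

    static final ClasspathHasher DEFAULT_STRATEGY = entry -> entry.hashCode();

    static ClasspathHasher select(boolean useDefault, ClasspathHasher userDefined) {
        return useDefault ? DEFAULT_STRATEGY : userDefined;
    }

    public static void main(String[] args) {
        ClasspathHasher custom = entry -> 42;
        System.out.println(select(false, custom).hash("a.jar")); // prints "42"
    }
}
```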
I agree it looks like a grab bag, so it's better to move it somewhere else. Thanks for the suggestion 😄.
@stuhood I've implemented your suggestion 😉. The Zinc API has been historically such a mess that sometimes I forget it can and should be improved 😄.
Is this PR against the 1.0.x line or the 1.1 line? I think you might need to backport to 1.0.x if it's to be included in an sbt point release.
It's in the 1.0.x line.
I've just realized that #427 is in 1.x, and I need to backport it to 1.0.x before I apply my patch. I think it would be a good idea to revisit this whole 1.0.x vs. 1.x situation...
@jvican yeah, that's what I thought. But you don't need to backport 427 for this. You can just implement this as the default, and that won't impact the binary API for 1.0.x.
Right, that's better @fommil. This will be shipped in a minor release, so it's better to avoid breaking bincompat in 1.0.x, and then adapt @romanowski's approach to my PR.
And make it parallel!

This patch adds a cache that relies on filesystem metadata to cache hashes for jars that have the same last modified time across different compiler iterations. This is important because until now there was a significant overhead when running `compile` on multi-module builds that have gigantic classpaths. In this scenario, the previous algorithm computed hashes for all jars transitively across all these projects.

This patch is conservative; there are several things that are wrong with the status quo of classpath hashing. The most important one is the fact that Zinc has been doing `hashCode` on a SHA-1 checksum, which doesn't make sense. The second one is that we don't need a SHA-1 checksum for the kind of checks we want to do. sbt#371 explains why. The third limitation with this check is that file hashes are implemented internally as `int`s, which is not enough to represent the richness of the checksum. My previous PR also tackles this problem, which will be solved in the long term.

Therefore, this pull request only tackles these two things:

* Caching of classpath entry hashes.
* Parallelizing this IO-bound task.

Results, on my local machine:

- No parallel hashing of the first 500 jars in my ivy cache: 1330ms.
- Parallel hashing of the first 500 jars in my ivy cache: 770ms.
- Second parallel hashing of the first 500 jars in my ivy cache: 1ms.

Fixes sbt#433.
So I just pulled this PR and published a local Zinc snapshot along with an sbt snapshot that uses it. Not seeing any difference w.r.t. hashing time; here's a VisualVM screenshot (running …). Maybe this PR solves a different problem?
@godenji Are you sure you've correctly published local Zinc and sbt? I really recommend double-checking that, and changing the versions as much as possible to avoid problems with cached jars. Make sure you use an sbt version that hasn't been used before. Since I don't have access to your project, it would really help if you could instrument this PR to report data and identify which classpath entries are being cached and which are not, as well as other useful auxiliary information.
@jvican I considered that, but after blowing away … I'll dig around some more; the bottleneck may be elsewhere than what this PR resolves.
@godenji You also need to blow away the corresponding folder in …
@jvican that was it; noop … Since this is going into Zinc 1.0.x and Sbt 1.0.4 is slated for this coming week, is there any chance that this magnificent fix can make its way into Sbt? That would be most welcome; incremental builds are horrendous ATM.
I've benchmarked this in a personal project too and I confirm it's fixed. Thanks for volunteering to try this out @godenji. |
@eed3si9n Can you review this please 😄? |
LGTM
I like that the change is contained to the implementation side.
LGTM
* 1.0.x: (25 commits)
  Add yourkit acknoledgement in the README
  Add header to cached hashing spec
  Add headers to missing files
  Fix sbt#332: Add sbt-header back to the build
  Update sbt-scalafmt to 1.12
  Make classpath hashing more lightweight
  Fix sbt#442: Name hash of value class should include underlying type
  source-dependencies/value-class-underlying: fix test
  Ignore null in generic lambda tparams
  Improve and make scripted parallel
  Fix sbt#436: Remove annoying log4j scripted exception
  Fix sbt#127: Use `unexpanded` name instead of `name`
  Add pending test case for issue/127
  source-dependencies / patMat-scope workaround
  Fixes undercompilation on inheritance on same source
  Add real reproduction case for sbt#417
  Add trait-trait-212 for Scala 2.12.3
  Fix source-dependencies/sealed
  Import statement no longer needed
  Move mima exclusions to its own file
  ...

Conflicts:
  internal/zinc-apiinfo/src/main/scala/sbt/internal/inc/ClassToAPI.scala
  zinc/src/main/scala/sbt/internal/inc/MixedAnalyzingCompiler.scala

The ClassToAPI conflict is due to:
* sbt#393 (a 1.x PR), conflicting with
* sbt#446 (a 1.0.x PR).

The MixedAnalyzingCompiler conflict is due to:
* sbt#427 (a 1.x PR), conflicting with
* sbt#452 (a 1.0.x PR).
* 1.0.x: (28 commits)
  Split compiler bridge tests to another subproject
  Implement compiler bridge for 2.13.0-M2
  Add yourkit acknoledgement in the README
  "sbt '++ 2.13.0-M2!' compile" does not work with sbt 1.0.0
  Add header to cached hashing spec
  Add headers to missing files
  Fix sbt#332: Add sbt-header back to the build
  Update sbt-scalafmt to 1.12
  Make classpath hashing more lightweight
  Fix sbt#442: Name hash of value class should include underlying type
  source-dependencies/value-class-underlying: fix test
  Ignore null in generic lambda tparams
  Improve and make scripted parallel
  Fix sbt#436: Remove annoying log4j scripted exception
  Fix sbt#127: Use `unexpanded` name instead of `name`
  Add pending test case for issue/127
  source-dependencies / patMat-scope workaround
  Fixes undercompilation on inheritance on same source
  Add real reproduction case for sbt#417
  Add trait-trait-212 for Scala 2.12.3
  ...

Conflicts:
  internal/zinc-apiinfo/src/main/scala/sbt/internal/inc/ClassToAPI.scala
  project/build.properties
  zinc/src/main/scala/sbt/internal/inc/MixedAnalyzingCompiler.scala

The ClassToAPI conflict is due to:
* sbt#393 (a 1.x PR), conflicting with
* sbt#446 (a 1.0.x PR).

The build.properties conflict is due to different PRs bumping sbt.version from 1.0.0 to 1.0.2 to 1.0.3 (sbt#413, sbt#418, sbt#453).

The MixedAnalyzingCompiler conflict is due to:
* sbt#427 (a 1.x PR), conflicting with
* sbt#452 (a 1.0.x PR).
The validator has checked the following projects against Scala 2.12,
✅ The result is: SUCCESS