Build Tooling: Optimize build by caching results #15967
Comments
A complication here is that one source file will have multiple output files, so it's not quite straightforward to detect a cached result by the presence of a hash. A couple of ideas: …
Last week I started experimenting with an approach where we cache objects instead of files, by serialising the object as a JSON string. You can see where I got to over in the … It's not finished, as there were errors when running the build. Still, if anyone would like to pick this up: go for it! The assumption I made, which we will have to verify, is that doing these …
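For illustration, here is a minimal sketch of that idea, assuming a build step that produces a result object mapping output filenames to contents; the cache directory and helper names (`CACHE_DIR`, `getCached`, `setCached`) are placeholders rather than anything from the actual branch:

```js
const crypto = require( 'crypto' );
const fs = require( 'fs' );
const path = require( 'path' );

// Hypothetical cache location; see the implementation notes at the bottom of the issue.
const CACHE_DIR = path.join( 'node_modules', '.cache', 'gutenberg-build' );

// Key cache entries by a hash of the source file's contents.
function hashContents( contents ) {
	return crypto.createHash( 'md5' ).update( contents ).digest( 'hex' );
}

// The whole result object (all output files for one source file) is stored
// as a single JSON string per hash.
function getCached( hash ) {
	try {
		return JSON.parse(
			fs.readFileSync( path.join( CACHE_DIR, hash + '.json' ), 'utf8' )
		);
	} catch ( error ) {
		return null;
	}
}

function setCached( hash, result ) {
	fs.mkdirSync( CACHE_DIR, { recursive: true } );
	fs.writeFileSync(
		path.join( CACHE_DIR, hash + '.json' ),
		JSON.stringify( result )
	);
}
```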
I'm thinking that we just recursively traverse the cache directory and delete files that are more than 3 months old. If traversal itself is too slow then we could only perform this operation when …
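A rough sketch of what that traversal could look like, assuming the same hypothetical cache directory as above and treating a file's mtime as its last-used time:

```js
const fs = require( 'fs' );
const path = require( 'path' );

const THREE_MONTHS_MS = 1000 * 60 * 60 * 24 * 90;

// Walk the cache directory and delete anything not touched in ~3 months.
function cleanStaleCache( dir ) {
	for ( const entry of fs.readdirSync( dir, { withFileTypes: true } ) ) {
		const entryPath = path.join( dir, entry.name );
		if ( entry.isDirectory() ) {
			cleanStaleCache( entryPath );
		} else if ( Date.now() - fs.statSync( entryPath ).mtimeMs > THREE_MONTHS_MS ) {
			fs.unlinkSync( entryPath );
		}
	}
}
```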
As written, are you assuming that we'd have multiple identical source files which could benefit from the memoization? I could also imagine just writing this as a single file containing the stringified JSON object, as an alternative to what I considered in my previous comment of a directory of files, where the keys in the JSON object are equivalent to files in a directory. It's unclear whether this would make a performance difference, e.g. whether a file system would be faster at copying a file than at parsing JSON and writing a new file from the object contents.
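That performance question could be sanity-checked with a quick timing comparison along these lines; the file names are placeholders, and a real measurement would want many iterations and realistic file sizes:

```js
const fs = require( 'fs' );

// Option A: the cached artifact already exists as a file; just copy it.
console.time( 'copy file' );
fs.copyFileSync( '.cache/1a79a4d60de6718e8e5b326e338ae533.js', 'build/example.js' );
console.timeEnd( 'copy file' );

// Option B: the cached artifact lives inside one big JSON object; parse and rewrite it.
console.time( 'parse JSON + write' );
const cache = JSON.parse( fs.readFileSync( '.cache/manifest.json', 'utf8' ) );
fs.writeFileSync( 'build/example.js', cache[ 'build/example.js' ] );
console.timeEnd( 'parse JSON + write' );
```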
This makes sense, but I'm not sure when this would occur. For me, ideally it'd be some "fire and forget" (non-blocking) action which occurs after a build completes. There are various lifecycle steps like … If it's not possible to use these as "non-blocking" (i.e. assuming the container shuts down after the build completes), maybe instead it's just a separate task of the build to do the cleanup, since presumably it wouldn't take long and could be done in parallel with other tasks.
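One way the "fire and forget" variant could look, as a sketch: spawn the cleanup as a detached child process once the build finishes so the build itself doesn't wait on it (with the caveat from above that a container shutting down immediately would still kill it). The `bin/cleanup-cache.js` script name is hypothetical:

```js
const { spawn } = require( 'child_process' );

// Launch the cleanup script detached from the build process, ignoring its
// output, so the build can exit without waiting for it to finish.
const child = spawn( process.execPath, [ 'bin/cleanup-cache.js' ], {
	detached: true,
	stdio: 'ignore',
} );
child.unref();
```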
This isn't an explicit objective of what I wrote, no. We could add the path of the input file to the string that is hashed so that two identical source files would not end up using the same cache entry. I figured this isn't necessary, though, since if two source files are identical they will compile to the same result.
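If that behaviour were wanted after all, mixing the path into the hash is a small change, e.g. (illustrative only):

```js
const crypto = require( 'crypto' );

// Including the file path means identical contents at different paths
// produce distinct cache keys.
function hashFile( filePath, contents ) {
	return crypto
		.createHash( 'md5' )
		.update( filePath )
		.update( contents )
		.digest( 'hex' );
}
```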
Yeah, you're right that it's likely that a … Perhaps something like this, e.g.:

```json
{
"src/example.js": {
"hash": "1a79a4d60de6718e8e5b326e338ae533",
"output": {
"build/example.js": ".cache/1a79a4d60de6718e8e5b326e338ae533.js",
"build/example.js.map": ".cache/1a79a4d60de6718e8e5b326e338ae533.js.map"
}
}
}
```
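To make that shape concrete, here's a sketch of how a build step might consult such a manifest; reading the source directly, the `build()` callback, and the copy-on-hit behaviour are assumptions for illustration, not the actual tooling:

```js
const crypto = require( 'crypto' );
const fs = require( 'fs' );

function buildWithCache( sourcePath, manifest, build ) {
	const contents = fs.readFileSync( sourcePath, 'utf8' );
	const hash = crypto.createHash( 'md5' ).update( contents ).digest( 'hex' );
	const entry = manifest[ sourcePath ];

	if ( entry && entry.hash === hash ) {
		// Cache hit: restore each cached artifact to its output path.
		for ( const [ outputPath, cachePath ] of Object.entries( entry.output ) ) {
			fs.copyFileSync( cachePath, outputPath );
		}
		return;
	}

	// Cache miss: rebuild, and record the new hash and outputs in the manifest.
	manifest[ sourcePath ] = { hash, output: build( sourcePath ) };
}
```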
I spent some time experimenting with this on Friday, and made some decent (but incomplete/buggy) progress. The work is currently reflected in the … A few of my observations:
Previously: #15230, #13124
As explored previously in #13124 and discussed in #15230, we may be able to benefit from improved build times by leveraging a persisted cache of previously-built artifacts.
Implementation Notes:
- Cached results should be written to a gitignored location, e.g. node_modules/.cache/gutenberg-build or require( 'os' ).tmpdir() (see babel-loader as prior art); a sketch of resolving such a location follows below
- BABEL_CACHE_DIRECTORY override in Scripts: Optimize default Webpack configuration #14860
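As a minimal sketch of the first note, assuming the node_modules/.cache/gutenberg-build location named above with a temp-directory fallback (the fallback mirrors babel-loader's behaviour, but the function here is illustrative):

```js
const fs = require( 'fs' );
const os = require( 'os' );
const path = require( 'path' );

// Prefer a gitignored cache directory inside the project; fall back to the
// OS temp directory if it can't be created (e.g. read-only node_modules).
function getCacheDirectory() {
	const preferred = path.join(
		process.cwd(),
		'node_modules',
		'.cache',
		'gutenberg-build'
	);
	try {
		fs.mkdirSync( preferred, { recursive: true } );
		return preferred;
	} catch ( error ) {
		return fs.mkdtempSync( path.join( os.tmpdir(), 'gutenberg-build-' ) );
	}
}
```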