dependency-extraction-webpack-plugin: Calculate vendor hash from file output rather than Webpack internal state #34969

@anomiex anomiex commented Sep 20, 2021

Description

Webpack 5 recommends using the content hash over the old-style chunk
hash for cache invalidation, as the chunk hash can change even if the
output does not.

dependency-extraction-webpack-plugin will now include the content hash
for each content type in the asset file. The version field is still
present, now as a combined hash for all the content types reported, to
maintain backwards compatibility.

Fixes #34660

How has this been tested?

Tested locally, both in a trivial repo and in https://github.com/Automattic/jetpack/tree/master/projects/plugins/jetpack.

Screenshots

N/A

Types of changes

New feature.

Checklist:

  • My code is tested.
  • My code follows the WordPress code style.
  • My code follows the accessibility standards.
  • I've tested my changes with keyboard and screen readers.
  • My code has proper inline documentation.
  • I've included developer documentation if appropriate.
  • I've updated all React Native files affected by any refactorings/renamings in this PR (please manually search all *.native.js files for terms that need renaming or removal).

@anomiex anomiex requested a review from gziolo as a code owner September 20, 2021 17:35
@github-actions github-actions bot added the First-time Contributor Pull request opened by a first-time contributor to Gutenberg repository label Sep 20, 2021
@spacedmonkey
Member

Wouldn't this be a breaking change?

@anomiex
Contributor Author

anomiex commented Sep 20, 2021

What do you think it would break?

@gziolo gziolo added the [Tool] Dependency Extraction Webpack Plugin /packages/dependency-extraction-webpack-plugin label Oct 5, 2021
@gziolo
Member

gziolo commented Oct 18, 2021

@anomiex, thank you for opening this PR.

For backward compatibility, we should prioritize the existing usage of version. It's used in WordPress core here:

https://github.com/WordPress/wordpress-develop/blob/d802fecf979d90b568e3c84d17d521df200cab38/src/wp-includes/blocks.php#L109-L114

The current behavior is also documented in the following section in the Block Editor Handbook:

https://developer.wordpress.org/block-editor/reference-guides/block-api/block-metadata/#wpdefinedasset (source)

I would expect that many plugins depend on that, too.

What is the rationale to have a different value for version and contentHash.javaScript? Why would developers want to use the new version? I like that you now get a way to automate versioning for other resources. How could it be used in practice by plugins?

@anomiex
Contributor Author

anomiex commented Oct 18, 2021

> What is the rationale to have a different value for version and contentHash.javaScript?

The version encompasses all the assets for backwards compatibility. contentHash.javaScript covers only the JavaScript.

We could go to some extra work if we really wanted to detect the case where the only asset is JavaScript and make the two hashes be the same instead of calculating a combined hash. It doesn't seem worth the effort to me, though.

> Why would developers want to use the new version? I like that you now get a way to automate versioning for other resources. How could it be used in practice by plugins?

As noted in the patched readme, the JavaScript hash could be passed to wp_enqueue_script() while the CSS hash could be passed to wp_enqueue_style(), so if a change is made that only affects the CSS then the JS cache wouldn't have to be busted too.

On the other hand, they could just keep using version for simplicity. The main point of this for me is for the hashes to stop changing when none of the actual content changed, as detailed in #34660.

@gziolo
Member

gziolo commented Oct 18, 2021

From my perspective, it should be perfectly fine to avoid introducing contentHash.javaScript and change the behavior of version. The changes in the documentation favor contentHash.javaScript over version, so there wouldn't be too much value in keeping version other than for backward compatibility. It would also be confusing to have an old and a new way to invalidate the cache for JavaScript assets. How would plugin authors know which one to use?

@anomiex
Contributor Author

anomiex commented Oct 18, 2021

Are you requesting changes to the PR, or just musing?

@gziolo gziolo requested review from ocean90 and desrosj October 19, 2021 08:23
@gziolo
Member

gziolo commented Oct 19, 2021

I think that this PR needs some iterations to avoid a need for updating PHP code to use the updated hash for JS files.

@anomiex
Contributor Author

anomiex commented Oct 19, 2021

There's no need as it is now; it's fully backwards compatible. But if you want a revision to make it so there's no code update possible, that's fine with me.

On the other hand, as I implemented that and re-tested I found out my code wasn't working right. Webpack calculates the chunk contentHash fields when it first processes the chunk, but then when further updates happen (e.g. minification) it only updates the hash where it appears in the assets. It turns out that it worked with the code snippet from #34660 that only looked at the JavaScript asset, but fails when we combine all the assets' hashes.
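The staleness described above can be illustrated with a minimal sketch. It assumes a hypothetical `chunk` record that caches a hash when the source is first processed, plus a toy minification step; neither is Webpack's real API, and sha256 is an arbitrary choice for the illustration.

```javascript
const { createHash } = require( 'node:crypto' );

// Hash a string; the algorithm here is only for illustration.
const digest = ( text ) => createHash( 'sha256' ).update( text ).digest( 'hex' );

// Hypothetical stand-ins: an asset, and a chunk that records a hash
// at the moment the asset is first processed.
const asset = { source: 'const answer = 40 + 2;\n' };
const chunk = { cachedContentHash: digest( asset.source ) };

// Toy "minifier": drop whitespace around a few tokens.
// Note that the hash cached on the chunk is NOT refreshed afterwards.
asset.source = asset.source.replace( /\s*([=+;])\s*/g, '$1' );

// Hashing what is actually emitted gives a different value.
const hashOfEmittedFile = digest( asset.source );
const cachedHashIsStale = chunk.cachedContentHash !== hashOfEmittedFile;
```

Under these assumptions `cachedHashIsStale` is true, which is the reason for hashing the final asset sources rather than trusting a value computed earlier in the build.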

@anomiex
Contributor Author

anomiex commented Mar 22, 2022

@gziolo It has been five months since I made the change you requested. Care to re-review?

@gziolo
Member

gziolo commented Mar 22, 2022

> @gziolo It has been five months since I made the change you requested. Care to re-review?

I'll ask for help with the review in the parent issue. I don't think I fully understand the issue here.

@kraftbj

kraftbj commented May 4, 2022

Confirming that this PR resolves the issue that we're facing.

The full narrative of why it matters to us (Jetpack):

  • we use the Gutenberg-developed dependency-extraction-webpack-plugin package to help us manage WordPress script dependencies within Jetpack-developed blocks.

  • As expected, it creates a PHP file at build time to include the WordPress script as a dependency in a WP-friendly way. For example, https://github.com/Automattic/jetpack-production/blob/master/_inc/blocks/business-hours/view.asset.php

  • The problem is the version field there. The “bug” is that the version field can change even if the underlying code we’re shipping remains the same. TBH, this doesn’t matter to most people. They’ll just accept it and move on.

  • For us, though, we’re also trying to sync these blocks to the WP.com codebase, where the same source code is currently taken and built to meet WP.com’s needs. The impact is that the same files building the same results can end up with different hashes.

  • In an example, the only actual change originally was changing the stable tag in readme.txt. Nothing else changed. However, because of the way that version is constructed right now, it uses the full path (e.g. /home/users/kraft/code/something/something/) as part of the math to figure out the version hash, even though it isn’t needed. The original issue provides an easy reproduction example: @wordpress/dependency-extraction-webpack-plugin should use content hash #34660

  • So, since WP.com is building the files in different directories (depending on which TeamCity host is executing the build), we get into situations where a random diff will be forced to include a pointless version bump.

  • In and of itself, that probably isn’t a huge deal, but we run into a conflict when two diffs go in at the same time. Neither one of them requires any change to the end result of the build, but both trigger this. Now we run into merge conflicts after one lands, so we need to force WP.com to rebuild to avoid the merge conflict.

  • The solution here is to switch to what Webpack now suggests anyhow: contenthash, a version determined by the output of the build itself, not by something like the location of the files. This resolves the issue we're seeing.
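The effect described in the list above can be sketched as follows. This is only an illustration of the principle, not the way Webpack computes its chunk hash: the checkout directories, the file name `src/view.js`, and the choice of sha256 are all assumptions made for the example.

```javascript
const { createHash } = require( 'node:crypto' );

// Hash any number of string parts in order.
const sha = ( ...parts ) => {
	const hash = createHash( 'sha256' );
	parts.forEach( ( part ) => hash.update( part ) );
	return hash.digest( 'hex' );
};

// The same build output produced in two hypothetical checkout directories.
const output = 'console.log( "business hours" );';
const builds = [
	{ root: '/home/alice/checkout', output },
	{ root: '/ci/agent-7/work', output },
];

// Path-sensitive: the absolute path is mixed into the hash input.
const pathSensitive = builds.map( ( b ) =>
	sha( b.root + '/src/view.js', b.output )
);

// Content-only: depends solely on the emitted bytes.
const contentOnly = builds.map( ( b ) => sha( b.output ) );
```

The two `pathSensitive` values differ although the output is identical, while the two `contentOnly` values match, so a content-only version would not produce the spurious diffs described above.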

@gziolo
Member

gziolo commented May 4, 2022

> In an example, the only actual change originally was changing the stable tag in readme.txt. Nothing else changed. However, because of the way that version is constructed right now, it uses the full path (e.g. /home/users/kraft/code/something/something/) as part of the math to figure out the version hash, even though it isn’t needed. The original issue provides an easy reproduction example: #34660

We had exactly the same issue in WordPress core a year ago. @peterwilsoncc patched it by setting optimization.moduleIds to hashed (webpack v4; deterministic in v5). See https://core.trac.wordpress.org/ticket/53192. Anyway, I will look closer into the attached project and try it myself to see whether changing the webpack config has any effect.

@gziolo
Member

gziolo commented May 4, 2022

> In an example, the only actual change originally was changing the stable tag in readme.txt. Nothing else changed. However, because of the way that version is constructed right now, it uses the full path (e.g. /home/users/kraft/code/something/something/) as part of the math to figure out the version hash, even though it isn’t needed. The original issue provides an easy reproduction example: #34660
>
> We had exactly the same issue in WordPress core a year ago. @peterwilsoncc patched it by setting optimization.moduleIds to hashed (webpack v4; deterministic in v5). See https://core.trac.wordpress.org/ticket/53192. Anyway, I will look closer into the attached project and try it myself to see whether changing the webpack config has any effect.

Ok, it looks like deterministic is the default setting in webpack 5, so it won’t help here.

It looks like the hash (version) generated for the asset depends on the absolute path of the entry point. I was able to confirm that, before this patch, moving the reproduction project to a different folder changes the generated hash. When it gets moved back, the hash is restored.

// Go through the assets and hash the sources. We can't just use
// `entrypointChunk.contentHash` because that's not updated when
// assets are minified. Sigh.
// @todo Use `asset.info.contenthash` if we can make sure it's reliably set.
Member

Does it mean that it could get fixed on the webpack side? Is there an issue open that describes the same use case and could be included as a reference so this todo item can be revisited later?

Contributor Author

No, that's not what it means.

What this comment is referring to is that sometimes Webpack sets .info.contenthash on the asset object, but it only does so if it decides that it needs it (if I recall correctly, it depends on whether the filename template uses [contenthash]). If we could ensure that Webpack sets that every time, we could just use it instead of hashing the file contents ourselves.
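A minimal sketch of what "use it when it is set, otherwise hash the contents" could look like. This is not what the patch does (the patch always hashes the sources). The helper name `assetContentHash` is hypothetical, the asset objects are hand-made stand-ins for the `{ source, info }` shape, and the handling of an array value is an assumption about how `info.contenthash` may be represented.

```javascript
const { createHash } = require( 'node:crypto' );

// Hypothetical helper: prefer a content hash already recorded on the asset,
// and fall back to hashing the asset's bytes when none is recorded.
function assetContentHash( asset ) {
	const known = asset.info && asset.info.contenthash;
	if ( known ) {
		// Assumption: the value may be a single string or a list of strings.
		return [].concat( known ).join( '' );
	}
	return createHash( 'sha512' )
		.update( asset.source.buffer() )
		.digest( 'hex' );
}

// Hand-made stand-ins for illustration only.
const source = { buffer: () => Buffer.from( 'body{}' ) };
const withInfo = assetContentHash( { source, info: { contenthash: 'abc123' } } );
const withoutInfo = assetContentHash( { source, info: {} } );
```

Here `withInfo` is the recorded value, while `withoutInfo` is computed from the bytes, which is the case the @todo in the code comment is worried about.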

Member

Got it. I think the algorithm you proposed is a nice improvement, so let's rephrase the comment so it presents the benefits instead of the limitations of webpack. I can fully confirm that it's going to solve the issue that Jetpack struggles with, as explained by @kraftbj, with hashes changing depending on the absolute path of files. We can also link this PR so people can learn the full context.

Contributor Author

I have no objection to changing the comment, but I don't know what to rephrase it to. Care to make a suggestion?

Member

If I understand it correctly, ideally we would just change (line 208):

version: entrypointChunk.hash

to:

version: entrypointChunk.contenthash

and be done? But we can't do that because webpack doesn't always set it? And/or the value doesn't have the properties that we need?

I noticed that webpack has an optimization.realContentHash config option. Which means that there are multiple valid ways to compute contenthash, and that the default way is in some sense "not real."

Member

If we have the goal to use the same hash as the real [contenthash], and we are only forced to calculate it ourselves, they should match. I.e., if my webpack config uses a [name].[contenthash].js filename template, and also uses the extraction plugin, the [contenthash] and version should be the same. But they currently aren't. The differences I see are:

  1. webpack uses the md4 algorithm by default (configurable with the output.hashFunction option), while the plugin uses sha512.
  2. webpack uses only the assets' contents to update the hash, while we hash the filename, too.

It's weird that even after making these two changes, the hashes are still different for me.

It would be worthwhile to spend some timeboxed time trying to fix this before merging. Other than that, I think this patch is ready to land.

Member

md4 no longer works with Node 17+.

Replicating the same algorithm as webpack uses internally is rather complex. This is the initial commit where support for contentHash was added: webpack/webpack@b929d4c.

Maybe we should try to detect contentHash first, and fall back to the custom handler otherwise. Example contentHash objects:

{
  'css/mini-extract': '9dc6bf4629f53268df7c',
  javascript: 'bc28cb02479bf7449a77'
}
{ javascript: '684ed11ffa88cd017286' }
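The "detect first, fall back otherwise" idea could be sketched roughly like this. The helper names `combineContentHash` and `versionFor`, the `type=hash;` separator format, and the 20-character truncation are all assumptions for illustration. Keep in mind the caveat raised earlier in this thread that the chunk's contentHash values may not reflect later steps such as minification.

```javascript
const { createHash } = require( 'node:crypto' );

// Combine per-type hashes (shaped like the example objects above) into one
// version string. Keys are sorted so property order does not matter.
function combineContentHash( contentHash ) {
	const hash = createHash( 'sha512' );
	for ( const type of Object.keys( contentHash ).sort() ) {
		hash.update( `${ type }=${ contentHash[ type ] };` );
	}
	return hash.digest( 'hex' ).slice( 0, 20 );
}

// Hypothetical helper: use contentHash when the chunk has one,
// otherwise defer to a caller-supplied fallback.
function versionFor( chunk, fallback ) {
	const types = chunk.contentHash ? Object.keys( chunk.contentHash ) : [];
	return types.length > 0
		? combineContentHash( chunk.contentHash )
		: fallback();
}

const version = versionFor(
	{
		contentHash: {
			'css/mini-extract': '9dc6bf4629f53268df7c',
			javascript: 'bc28cb02479bf7449a77',
		},
	},
	() => 'hash-of-file-contents'
);
```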

Member

> md4 no longer works with Node 17+.

webpack ships its own WASM module that implements md4. Instead of importing createHash from Node's crypto module, we can use the webpack version:

const webpack = require( 'webpack' );
const hash = webpack.util.createHash( 'md4' );

> Replicating the same algorithm as webpack uses internally is rather complex.

The RealContentHashPlugin source is indeed very complex, and I'm not sure what set of possible use cases it covers, but I've been able to replicate the contenthash quite easily:

const hash = createHash( 'md4' );
// asset.source.updateHash( hash );
hash.update( asset.source.buffer() );
const version = hash.digest( 'hex' );

Here the non-obvious step is to not use asset.source.updateHash, because that hashes a string "RawSource" + source, instead of just source. That's all the difference and when computing a content hash, we only want the source.
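To make the difference concrete, here is a small sketch. `TaggedSource` is a hand-written stand-in that only mimics the tagging behaviour described in the comment above; it is not the real webpack-sources class, and sha256 is used instead of md4 so the example does not depend on md4 being available in Node.

```javascript
const { createHash } = require( 'node:crypto' );

// Stand-in for a source object whose updateHash() feeds an internal type tag
// into the hash before the content itself.
class TaggedSource {
	constructor( text ) {
		this.text = text;
	}
	buffer() {
		return Buffer.from( this.text, 'utf-8' );
	}
	updateHash( hash ) {
		hash.update( 'RawSource' ); // internal type tag
		hash.update( this.buffer() );
	}
}

const source = new TaggedSource( 'alert(1);' );

// Variant 1: let the source update the hash (tag + content).
const tagged = createHash( 'sha256' );
source.updateHash( tagged );
const taggedDigest = tagged.digest( 'hex' );

// Variant 2: hash only the bytes that end up in the file.
const plainDigest = createHash( 'sha256' )
	.update( source.buffer() )
	.digest( 'hex' );
```

Only `plainDigest` equals the hash of the emitted file contents; `taggedDigest` also depends on the tag string.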

Contributor Author

> Here the non-obvious step is to not use asset.source.updateHash, because that hashes a string "RawSource" + source, instead of just source. That's all the difference and when computing a content hash, we only want the source.

Looking at it another way, we don't actually care what exactly goes into the hash as long as changes to the output result in changes to the hash, and non-changes to the output do not result in changes to the hash. Whether or not "RawSource" is incorporated in the hash makes no difference from that perspective.

OTOH, using asset.source.updateHash may have a performance advantage if the source is something like ReplaceSource or ConcatSource that does non-trivial work in source() or buffer().

Member

> we don't actually care what exactly goes into the hash as long as changes to the output result in changes to the hash, and non-changes to the output do not result in changes to the hash.

I believe that my suggestions make actual, observable improvements to this behavior. I stopped hashing the file name, so renaming the file won't change the hash. In the end, the loading URL is going to be like script.js?ver=hash, where a name change doesn't require a hash change to load the right asset.

Also, incorporating internal structure like RawSource tags changes the hash when webpack internals change, even if the content remains the same. It seems to me that .updateHash is more fit for internal purposes, not for calculating contenthash.

I'm afraid the performance advantage will never materialize. At some point webpack will write the asset to a file, and will call source.buffer() to construct the buffer to write. When calculating contenthash we'll just construct the buffer a bit earlier, at a very late compilation stage where the source is unlikely to change further.
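The renaming argument can be checked with a short sketch. The helper `versionOf` and the file maps are hypothetical; the `includeNames` flag switches between hashing `filename: ` plus content, as in the snippet under review, and hashing content only.

```javascript
const { createHash } = require( 'node:crypto' );

// Hypothetical helper: hash a map of { filename: content }, optionally
// mixing each filename into the hash before its content.
function versionOf( files, includeNames ) {
	const hash = createHash( 'sha512' );
	for ( const filename of Object.keys( files ).sort() ) {
		if ( includeNames ) {
			hash.update( `${ filename }: ` );
		}
		hash.update( files[ filename ] );
	}
	return hash.digest( 'hex' );
}

// The same content emitted under two different file names.
const before = { 'index.js': 'console.log( 1 );' };
const renamed = { 'main.js': 'console.log( 1 );' };

const changesWithNames = versionOf( before, true ) !== versionOf( renamed, true );
const changesWithoutNames = versionOf( before, false ) !== versionOf( renamed, false );
```

With names included, the rename changes the version; with content only, it does not. For several files, a rename could still change the iteration order in this sketch, so it covers only the single-file case.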

hash.update( `${ filename }: ` );
asset.source.updateHash( hash );
}

const entrypointChunk = isWebpack4
Member

@gziolo gziolo May 5, 2022

We can now move entrypointChunk constant further in the code and closer to the usage. Disregard this one.

Member

Actually, we probably could get files from the entrypointChunk.files rather than from entrypoint.getFiles() to align with the handling in other places:

const entrypointChunk = isWebpack4
	? entrypoint.chunks.find( ( c ) => c.name === entrypointName )
	: entrypoint.getEntrypointChunk();

const entrypointChunkHash = createHash( 'sha512' );
for ( const filename of Array.from( entrypointChunk.files ).sort() ) {
	entrypointChunkHash.update( filename + ':' + compilation.getAsset( filename ).source.source() );
}

Member

@gziolo gziolo May 5, 2022

In @wordpress/scripts there is this logic:

splitChunks: {
	cacheGroups: {
		style: {
			type: 'css/mini-extract',
			test: /[\\/]style(\.module)?\.(sc|sa|c)ss$/,
			chunks: 'all',
			enforce: true,
			name( _, chunks, cacheGroupKey ) {
				const chunkName = chunks[ 0 ].name;
				return `${ dirname(
					chunkName
				) }/${ cacheGroupKey }-${ basename( chunkName ) }`;
			},
		},
		default: false,
	},
},

With the current implementation it processes 3 files:

[ './style-index.css', 'index.css', 'index.js' ]

Array.from( entrypointChunk.files ) would process only 2 files because ./style-index.css goes to its own chunk:

[ 'index.css', 'index.js' ]
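The difference between the two file lists can be modelled with plain objects. The `entrypoint` and chunk objects below are simplified stand-ins rather than real Webpack instances; they only assume that `getFiles()` gathers files from every chunk of the entrypoint, while `getEntrypointChunk().files` holds the files of a single chunk.

```javascript
// Simplified stand-ins mirroring the file lists quoted above.
const styleChunk = {
	name: 'style-index',
	files: new Set( [ './style-index.css' ] ),
};
const entryChunk = {
	name: 'index',
	files: new Set( [ 'index.css', 'index.js' ] ),
};

const entrypoint = {
	chunks: [ styleChunk, entryChunk ],
	// Collect the files of all chunks, without duplicates.
	getFiles() {
		return [
			...new Set( this.chunks.flatMap( ( c ) => [ ...c.files ] ) ),
		];
	},
	getEntrypointChunk() {
		return entryChunk;
	},
};

// Everything the entrypoint emits versus only the entry chunk's own files.
const allFiles = entrypoint.getFiles().sort();
const entryOnly = Array.from( entrypoint.getEntrypointChunk().files ).sort();
```

In this model `allFiles` contains three entries and `entryOnly` two, so a hash built from `entryOnly` would not change when only the split-out stylesheet changes.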

Contributor Author

Are you still suggesting to change the implementation here as in your second comment, or is the third comment convincing yourself not to?

It seems to me that your last comment is the argument not to make the change you suggested in your second comment. The extra file is part of the asset and so should be covered by the hash.

Member

My last comment provides the reasoning why the change included in #34969 (comment) would be a good improvement. We don't care about ./style-index.css, which goes into its own chunk, so it shouldn't matter when calculating the hash for the JS entry point.

Contributor Author

Would you care if I point out that someone using .optimization.runtimeChunk would find that the runtime.js or runtime~index.js isn't included in the version hash either?

If you really want this change I'm not going to fight it since our use case has neither of these sorts of extra files. But I do think if we're going to have one hash for the asset, it should cover the whole asset rather than excluding parts of it.

Contributor Author

There's also the possibility that someone is using splitChunks like you do there but on JS code chunks, or with Webpack's automatic vendor splitting.

Member

@gziolo gziolo left a comment

I have tested this PR extensively using the project scaffolded with npx @wordpress/create-block -t @wordpress/create-block-tutorial-template and all the changes were applied directly to the node_modules folder. I can confirm it works as expected and fixes the issue described by @kraftbj. I left some comments to discuss the current implementation, but overall we can land this change.

We should also update the title for this PR because it doesn't reflect the current implementation.

We maintain the CHANGELOG files for the package, and it would be great to list this change as well because it will change version in all generated assets files with this change included:
https://github.com/WordPress/gutenberg/blob/trunk/packages/dependency-extraction-webpack-plugin/CHANGELOG.md

@gziolo gziolo added the [Type] Enhancement A suggestion for improvement. label May 5, 2022
@anomiex anomiex changed the title from "Use contentHash in dependency-extraction-webpack-plugin" to "dependency-extraction-webpack-plugin: Calculate vendor hash from file output rather than Webpack internal state" May 5, 2022
Member

@gziolo gziolo left a comment

As mentioned in my previous comment #34969 (review), the proposed implementation works. Let's land it to remove the friction and explore improvements separately. @anomiex, thank you for providing the fix.

@gziolo gziolo merged commit 1397b15 into WordPress:trunk May 9, 2022
@github-actions github-actions bot added this to the Gutenberg 13.3 milestone May 9, 2022