Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

go_repository: read configuration from WORKSPACE #529

Closed
jayconrod opened this issue May 17, 2019 · 20 comments · Fixed by #550
Closed

go_repository: read configuration from WORKSPACE #529

jayconrod opened this issue May 17, 2019 · 20 comments · Fixed by #550

Comments

@jayconrod
Copy link
Contributor

When go_repository runs Gazelle in an external repository, Gazelle should configure itself using the WORKSPACE file in the main repository.

This would be primarily useful for learning names and import paths of other external repositories. It would reduce the need to go out to the network to find external repository root paths. This process is not only slow but error-prone in module mode. Module boundaries may shift across versions, and it's important to know the version that's actually being used. This is the root cause of #525.

If #12 is implemented, we'd probably use this as a starting point to discover other external repositories.

@blico
Copy link
Contributor

blico commented Jun 11, 2019

@jayconrod Any advice on a generic way to discover the main WORKSPACE file from a go_repository rule? So far the only way I can think of is using a relative path from outputBase, e.g. go_repository is executed from (outputBase)/external/<repo> so we can get to the main WORKSPACE with ../../execroot/<main_repo>/WORKSPACE, as long as we pass the main repo name to go_repository .

@jayconrod
Copy link
Contributor Author

I'm not sure what the best way to do this will be. I ran a quick test with a dummy repository rule, and the only environment variable that points to the main workspace was PWD; I don't think that's reliable. Need more research.

@asuffield
Copy link
Contributor

I have a workingish patch but it has some issues. Note to self: I owe one partially-broken example, get that uploaded tomorrow.

@asuffield
Copy link
Contributor

Okay, so here is what I've been using for the past few months: https://github.com/bazelbuild/bazel-gazelle/compare/master...asuffield:go-repository-workspace?expand=1

It has some flaws. Whenever WORKSPACE is touched for any reason, all go_repository rules will rerun. If rules are defined in a file other than WORKSPACE, this approach cannot work.

I think we need a bazel feature to make this really clean, but I haven't figured out precisely what. It'll be something like "expose the parsed repository rules programmatically", but that's not possible today because of the way WORKSPACE parsing works. I had an old email thread with Laurent about a way out of this....

@blico
Copy link
Contributor

blico commented Jun 12, 2019

Nice! I didn't know that Label("@//:WORKSPACE")) was valid. I agree that re-running all go_repository rules every time WORKSPACE changes is not ideal. This may be required for builds to remain hermetic though? Otherwise, the same build may return different results depending on when its go_repository rule was invoked. Maybe we should make this feature opt-in?

If rules are defined in a file other than WORKSPACE, this approach cannot work.

4f524f2 adds the repository_macro directive. Using this directive is all you need to do for your current change to work with rules outside of WORKSPACE. However, with the current approach when a rule outside of WORKSPACE is modified go_repository rules will not know to re-run. We will also need to declare all of the files referenced via repository_macro as inputs to the rule.

@jayconrod
Copy link
Contributor Author

Re-running go_repository rules when WORKSPACE is changed might not be catastrophic if the cache is intact. That only helps go_repository rules in module mode, but they should be the common case in the future. Perhaps non-module rules can ignore WORKSPACE.

@jayconrod
Copy link
Contributor Author

To be clear, I'm not sure what the overhead will be for re-evaluating all the rules; even without needing to download source it might not be acceptable.

That said, the original purpose of the cache was to be able to change the Gazelle version or go_repository implementation without needing to download module sources again. Changing configuration in WORKSPACE is only a small expansion of that scope.

@asuffield
Copy link
Contributor

Declaring files referenced via repository_macro would require that data being available to the repository rule, which is a nest of complexity (doable, but tough).

Rebuilding all go_repository rules has seemed to be tolerable if you have sha256 entries for every download, and frustrating if you have to redownload. I don't have a good solution for git repos (I'm doing them via http instead). The correct behaviour here would be to rebuild them if and only if some repo names or import paths have changed in WORKSPACE, which is a level of subtlety not currently possible without some new bazel features.

Otherwise, the same build may return different results depending on when its go_repository rule was invoked.

I'll just note that this is true in general: repo rules are not hermetic and change behaviour when things on the network give different results (canonical example: somebody moves a git tag that your rule references). Still, we would like to be cache-correct against local files.

@blico
Copy link
Contributor

blico commented Jun 14, 2019

Another option we have here is to use a copy of the main go.mod file within every go repository that uses modules. With this approach calls to go list will use the proper source of truth, which should fix the related issues. I tested this approach in our repo, and it fixed the related issues.

Using go.mod instead of WORKSPACE should make it so we only need to re-evaluate go_repository rules when it is actually necessary (the go.mod has changed). The downsides of this approach are we may run into weird issues if WORKSPACE and go.mod are out of sync. Also, we will still have many calls to go list which can possibly go out to the network.

@asuffield
Copy link
Contributor

asuffield commented Jun 14, 2019 via email

@blico
Copy link
Contributor

blico commented Jun 15, 2019

The problem of unnecessary rebuilds still exists if we depend on the go.mod file as the source of truth. We only care about changes within the go.mod file that affect the known module paths of the repository. Any other type of change (such as versioning and directives) ideally should not trigger a re-build.

Why don't we just add an option to pass in all of the known imports to every go_repository rule? It is possible for update-repos to generate a .bzl file which contains a list of all known imports. update-repos could then pass this list into the known_imports attribute for all the go_repository rules that it generates. The only downside I can see to this approach is that it will require checking in another auto-generated file into the repository. In my opinion, that is a fine price to pay if we can avoid triggering unnecessary rebuilds.

I will try to code this up when I get a chance and post it here.

@jayconrod
Copy link
Contributor Author

I did a quick experiment with github.com/gohugo/hugo. The repository uses modules which can be imported into WORKSPACE, and the main binary depends on packages in 70 modules.

I split go_repository, go_repository_tools, and go_repository_cache into three files. Making a change to go_repository.bzl thus invalidates go_repository rules without invalidating go_repository_tools and go_repository_cache. Rebuilding the main binary after such a change took 22.5s on my MBP. That's the time to copy modules out of the cache, then generate build files with an already-built copy of Gazelle.

22.5s doesn't seem totally unreasonable to me for a workspace change in a medium sized project. I think go_repository should just resolve the main workspace file, then recursively any files referenced by # gazelle:repository_macro directives. The path to the main WORKSPACE file can be passed to Gazelle, and that can be used for further configuration.

About WORKSPACE vs go.mod as the source of truth: sorry, it's got to be WORKSPACE, since it can contain directives and customizations on top of what's imported from go.mod. If the delay is too much of a problem, we can provide a way for users to point to a .bzl configuration file instead of WORKSPACE (for example, one that contains a macro with all the go_repository rules).

I'll try and hack this together today.

jayconrod pushed a commit to jayconrod/bazel-gazelle that referenced this issue Jun 16, 2019
'gazelle fix' and 'gazelle update' now accept -repo_config, the path
to a file where information about repositories can be loaded. By
default, this is WORKSPACE in the repository root directory.
'gazelle fix' and 'gazelle update-repos' still update the WORKSPACE
file in the repository root directory when this flag is set.

go_repository passes the path to @//:WORKSPACE to -repo_config.

go_repository resolves @//:WORKSPACE and any files mentioned in
'# gazelle:repository_macro' directives. When these files, all
go_repository rules will be invalidated. It should not be necessary to
download cached repositories (except vcs repositories; see bazel-contrib#549).
On a Macbook Pro, it takes about 22.5s to re-evaluate 70 cached,
invalidated go_repository rules for github.com/gohugoio/hugo. If this
becomes a project for large projects, we can provide a way to disable
or limit this behavior in the future.

go_repository_tools and go_repository_cache are moved to their own
.bzl files. Changes in go_repository.bzl should not invalidate these
in the future.

Fixes bazel-contrib#529
jayconrod pushed a commit that referenced this issue Jun 17, 2019
'gazelle fix' and 'gazelle update' now accept -repo_config, the path
to a file where information about repositories can be loaded. By
default, this is WORKSPACE in the repository root directory.
'gazelle fix' and 'gazelle update-repos' still update the WORKSPACE
file in the repository root directory when this flag is set.

go_repository passes the path to @//:WORKSPACE to -repo_config.

go_repository resolves @//:WORKSPACE and any files mentioned in
'# gazelle:repository_macro' directives. When these files, all
go_repository rules will be invalidated. It should not be necessary to
download cached repositories (except vcs repositories; see #549).
On a Macbook Pro, it takes about 22.5s to re-evaluate 70 cached,
invalidated go_repository rules for github.com/gohugoio/hugo. If this
becomes a project for large projects, we can provide a way to disable
or limit this behavior in the future.

go_repository_tools and go_repository_cache are moved to their own
.bzl files. Changes in go_repository.bzl should not invalidate these
in the future.

Fixes #529
@blico
Copy link
Contributor

blico commented Jun 18, 2019

If the delay is too much of a problem, we can provide a way for users to point to a .bzl configuration file instead of WORKSPACE (for example, one that contains a macro with all the go_repository rules).

My main concern with this change is the implications it has on our CI builds. Our CI runs a changed target calculation to figure out what it should build and test. So now a WORKSPACE change will mean changes to every go_repository rule, meaning anything that depends on a go_repository rule also has changed. Essentially resulting in a full rebuild whenever WORKSPACE is modified.

So far the best I optimization I can think of is if instead, we pass in all of the known imports to every go_repository rule. This works because the only configuration Gazelle actually uses from WORKSPACE is the import paths. So now we won't need to trigger unnecessary rebuilds for version changes and other irrelevant changes to WORKSPACE. I can see this not being an ideal solution if Gazelle ever changes to care about more than just import paths. Also, we still run into the same problem whenever a new repo is added or removed.

I am hoping there is a better solution available here because as it is right now I don't think we can downstream.

@jayconrod
Copy link
Contributor Author

Our CI runs a changed target calculation to figure out what it should build and test. So now a WORKSPACE change will mean changes to every go_repository rule, meaning anything that depends on a go_repository rule also has changed. Essentially resulting in a full rebuild whenever WORKSPACE is modified.

Could you elaborate on how this works? I'm expecting most people run something like bazel test //... (or some static subset). The go_repository rules may be re-evaluated, but if the contents and generated build files are identical, everything built from them will already be cached.

So far the best I optimization I can think of is if instead, we pass in all of the known imports to every go_repository rule. This works because the only configuration Gazelle actually uses from WORKSPACE is the import paths.

This doesn't seem like a good user experience. WORKSPACE would be O(n^2) in size with the number of go_repository rules.

Also, I expect there will be other reasons why WORKSPACE configuration will be necessary. #132 would still be useful for declaring a repository name and import path for repositories fetched with git_repository, http_archive, or some other rule. In #477, I'm also planning to look for go_repository rules fetched as modules (with the version attribute set) in order to implement minimal module compatibility.

I am hoping there is a better solution available here because as it is right now I don't think we can downstream.

I think this can be tweaked a little bit to point to a file other than WORKSPACE (or to disable this entirely by pointing to no file). I didn't fully implement that or expose it in the API because I was hoping it wouldn't be necessary. If you could point to a .bzl file (or a list of .bzl files), would that be workable? Or perhaps a file that just contained directives and nothing else? If #132 were implemented, I think that would be equivalent to passing a list of known repos to every rule.

@blico
Copy link
Contributor

blico commented Jun 18, 2019

Could you elaborate on how this works? I'm expecting most people run something like bazel test //... (or some static subset).

We run a Bazel query twice (query --output=proto --order_output=full "//external:all-targets + deps(//...:all-targets"), one including the new commit and the other on the tip of master. We then compute a hash for all of the targets outputted by both queries. If the target has any inputs/deps, the already calculated hashes for the inputs are included in the current target's hash computation. We then compare all of the target hashes from both commits, and if any differ they are considered changed targets. So with this change, because each go_repository rule will have a new hash, any target with a go_repository rule as an input will also have a new hash.

This doesn't seem like a good user experience. WORKSPACE would be O(n^2) in size with the number of go_repository rules.

During update-repos we could auto-generate a .bzl file with a list of all known imports and each go_repository rule could use that list.

If you could point to a .bzl file (or a list of .bzl files), would that be workable? Or perhaps a file that just contained directives and nothing else? If #132 were implemented, I think that would be equivalent to passing a list of known repos to every rule.

The main thing we need here is to eliminate the number of unnecessary re-builds. Pointing to a .bzl file with go_repository rules will still trigger rebuilds with version changes, so I don't think it is good enough. Implementing #132 and passing in a build file with all of the directives sounds fine with me. Ideally, we can autogenerate this file in a similar way to my idea above.

I still think we could do better... a go_repository rule really only needs to care about the imports that it has and maybe some extra configuration. If a go_repository rule could somehow declare its imports, then we could try to limit passing in only the imports that it cares about. I am still struggling to think about how this could be implemented, or if it is even possible.

@jayconrod
Copy link
Contributor Author

We run a Bazel query twice (query --output=proto --order_output=full "//external:all-targets + deps(//...:all-targets"), one including the new commit and the other on the tip of master. We then compute a hash for all of the targets outputted by both queries. If the target has any inputs/deps, the already calculated hashes for the inputs are included in the current target's hash computation. We then compare all of the target hashes from both commits, and if any differ they are considered changed targets. So with this change, because each go_repository rule will have a new hash, any target with a go_repository rule as an input will also have a new hash.

This seems like it should be fine, but I might be missing something. You're hashing the output of bazel query, right? If a go_repository rule is re-evaluated, but its content is identical to what it was before (which it usually will be), all the targets (and the hashes) should be the same, and nothing will need to be rebuilt. It would be like copying the workspace to a new directory (assuming a shared cache).

Implementing #132 and passing in a build file with all of the directives sounds fine with me. Ideally, we can autogenerate this file in a similar way to my idea above.

This seems like the most promising approach to me. I'll implement this soon when I have some spare bandwidth. I still think WORKSPACE will be the best option for most folks in the common case, but such a file could be generated from WORKSPACE with a small local modification or a script.

@blico
Copy link
Contributor

blico commented Jun 18, 2019

You are right... I was assuming the query output would list @//:WORKSPACE as an input to every go_repository rule, but it does not.

Thanks for your help here, Jay.

@asuffield
Copy link
Contributor

asuffield commented Jul 4, 2019

Okay, this is almost usable for me now. My current bugbear is that I have to list all the repos in the root repository somewhere. In particular, if I'm calling a deps() function from some other repo, there's no way to tell gazelle to go scrape that function. repository_macro appears to be single-use, and also can't reference other repositories.

This can probably be solved by letting me write things like:

# gazelle:repository-macro external/foo/deps.bzl%foo_deps
# gazelle:repository-macro external/bar/deps.bzl%bar_deps 

@jayconrod
Copy link
Contributor Author

@asuffield You should be able to use repository_macro multiple times within WORKSPACE. Please file an issue if not.

It will only be able to reference files within the main workspace though. Referencing files in other workspaces is hard, since they may or may not have been fetched and may be out of date. I'm not sure whether it's technically possible for Gazelle to recursively invoke Bazel to fetch those repositories. It used to be forbidden, but now I think Bazel releases the lock on the output directory before it starts running a command. In any case, it would add a lot of complexity and slow things down, so I'd rather avoid doing that.

#132 may be an adequate workaround for this if it were implemented.

@jayconrod
Copy link
Contributor Author

Closing since this was fixed as part of #564.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants