Autobuild: Mac legacy CI failure: hdiutil: create failed - No child processes #3207

Closed
hoffie opened this issue Dec 11, 2023 · 10 comments · Fixed by #3223
Labels
bug Something isn't working

Comments

@hoffie
Member

hoffie commented Dec 11, 2023

Describe the bug
CI Mac Legacy regularly fails with this error:

To Reproduce
Run CI

Expected behavior
Clean build

Screenshots

🍺  /usr/local/Cellar/create-dmg/1.2.1: 12 files, 67.9KB
Creating disk image...
hdiutil: create failed - No child processes
Error: Process completed with exit code 1.

Operating system
Mac legacy

Version of Jamulus
git

Additional context

@hoffie hoffie added the bug Something isn't working label Dec 11, 2023
@hoffie hoffie added this to the Release 3.11.0 milestone Dec 11, 2023
@hoffie hoffie self-assigned this Dec 11, 2023
@hoffie hoffie added this to Tracking Dec 11, 2023
@github-project-automation github-project-automation bot moved this to Triage in Tracking Dec 11, 2023
@hoffie
Member Author

hoffie commented Dec 11, 2023

Some first hints:

  • We don't pin create-dmg, so it is updated whenever its version is bumped on brew.
  • One of the last working autobuilds on main was 5fdd7073efae7165859d229eab706fb5be6eefc3 on 2023-10-05
  • One of the first failing autobuilds on main was f6e3bf6973eec459ec329c2c1b5fc2bca8c49768 on 2023-10-14
    • Comparing the build outputs does not yield differences in any of the following:
      • Mac version
      • Runner version
      • Runner image version
      • XCode version
      • Actions versions
      • create-dmg version
    • Pushing a branch based on a commit which once worked leads to a failure as well.
  • The failure is pretty frequent. However, retried builds sometimes succeed. I once managed to get a better success rate when adding debug code (e.g. set -x in the build script, bash -x create-dmg, or create-dmg --hdiutil-verbose).
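Since retried builds sometimes succeed, one stopgap (not a fix) would be to retry the DMG creation step a few times. A minimal bash sketch, assuming a hypothetical `retry` helper and a hypothetical `RETRY_DELAY` variable (neither exists in the Jamulus build scripts); the create-dmg arguments are left elided:

```shell
#!/usr/bin/env bash
# Hypothetical stopgap: retry a flaky command up to N times.
# This only masks the hdiutil failure; it does not address its cause.
retry() {
    local attempts="$1"
    shift
    local n=1
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "retry: '$*' failed after $n attempts" >&2
            return 1
        fi
        echo "retry: attempt $n of '$*' failed, retrying" >&2
        n=$((n + 1))
        sleep "${RETRY_DELAY:-5}"   # hypothetical knob: seconds between attempts
    done
}

# Example (arguments elided):
# retry 3 create-dmg ...
```

This trades a red build for a slower one and would also hide any future regression with the same symptom.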

All in all, I believe the cause lies outside the code and dependencies managed in our git repository.

Not the reason:

Things to try/rule out:

@hoffie
Member Author

hoffie commented Dec 13, 2023

After a huge amount of debugging, I have come to the conclusion that the issue stems from CodeQL, which seems to cause problems in a helper program called by hdiutil (/System/Library/PrivateFrameworks/DiskImages.framework/Resources/diskimages-helper).

  • Disabling CodeQL in the workflow makes the problem go away
  • Interactively (tmate/ssh) running sudo -u $RUNNER create-dmg ... works, probably because CodeQL is unable to (or its logic prevents it from) infecting sudo, which acts as a kind of privilege barrier for us (note: the example switches to the invoking user; we do not need to run create-dmg with elevated privileges such as root).
    • I initially tried unsetting the relevant macOS equivalents of LD_PRELOAD (SEMMLE_PRELOAD_libtrace and DYLD_INSERT_LIBRARIES), but they appear to be viral. See this example:
bash-3.2$ env - bash -c env | grep -i dylib
SEMMLE_PRELOAD_libtrace=/Users/runner/hostedtoolcache/CodeQL/2.15.4/x64/codeql/tools/osx64/libtrace.dylib
DYLD_INSERT_LIBRARIES=/Users/runner/hostedtoolcache/CodeQL/2.15.4/x64/codeql/tools/osx64/libtrace.dylib
  • Clearing the variables seems to be ineffective if the current process (bash) is instrumented by CodeQL. CodeQL seems to re-inject the environment variables upon exec*().
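To make such failures easier to attribute in CI logs, a build script could check whether it is running under the CodeQL tracer before calling create-dmg. A minimal bash sketch, assuming the tracer is recognisable by the two variables shown in the session above; the `codeql_traced` function name is hypothetical, and the commented `sudo -u` line mirrors the interactive workaround described above (same non-root user) rather than a tested CI change:

```shell
#!/usr/bin/env bash
# Hypothetical check: does this shell look instrumented by the CodeQL tracer?
# Relies only on the variables observed in the session above.
codeql_traced() {
    case "${DYLD_INSERT_LIBRARIES:-}" in
        *codeql*libtrace.dylib*) return 0 ;;
    esac
    [ -n "${SEMMLE_PRELOAD_libtrace:-}" ]
}

if codeql_traced; then
    echo "warning: CodeQL tracer detected; hdiutil may fail with 'No child processes'" >&2
    # Unsetting the variables here is not enough (they are re-injected on exec),
    # so the interactive workaround goes through sudo as the same, non-root user:
    # sudo -u "$(id -un)" create-dmg ...
fi
```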

I'm going to prepare a workaround PR for Jamulus tomorrow, will report my findings in the other issues around this topic, and will file a bug report with CodeQL. IMO there should be an easy way to exclude CodeQL selectively (e.g. using environment variables), but I haven't found one in the docs or in the dylib itself.

We might want to re-consider whether running CodeQL on ordinary builds is such a good idea.

hoffie added a commit to hoffie/jamulus that referenced this issue Dec 13, 2023
@hoffie
Member Author

hoffie commented Dec 13, 2023

@hoffie
Member Author

hoffie commented Dec 13, 2023

> • One of the last working autobuilds on main was 5fdd7073efae7165859d229eab706fb5be6eefc3 on 2023-10-05 (https://github.com/jamulussoftware/jamulus/actions/runs/6422492124/job/17471158018#step:10:20621)

This uses CodeQL v2.14.6.

> • One of the first failing autobuilds on main was f6e3bf6973eec459ec329c2c1b5fc2bca8c49768 on 2023-10-14 (https://github.com/jamulussoftware/jamulus/actions/runs/6519903696/job/17706941147#step:10:20621)

This uses CodeQL v2.15.0.

https://github.com/github/codeql-action/blob/main/CHANGELOG.md#2222---12-oct-2023
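Since the working and failing runs differ in their CodeQL version (2.14.6 vs. 2.15.0), it may help to print that version directly in the packaging step instead of digging through job logs. A minimal sketch, assuming the tracer library path follows the hostedtoolcache layout seen earlier in this thread (.../CodeQL/<version>/<arch>/codeql/...); the function name is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the CodeQL CLI version from a tracer library path,
# assuming a layout like .../CodeQL/2.15.4/x64/codeql/tools/osx64/libtrace.dylib
codeql_version_from_path() {
    printf '%s\n' "$1" | sed -n 's|.*/CodeQL/\([^/]*\)/.*|\1|p'
}

# Prints nothing when no CodeQL tracer library is injected.
codeql_version_from_path "${DYLD_INSERT_LIBRARIES:-}"
```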

@hoffie hoffie moved this from Triage to In Progress in Tracking Dec 16, 2023
@hoffie
Member Author

hoffie commented Dec 16, 2023

I've been working on this every now and then and can prove that CodeQL changes the behavior of hdiutil.
What I can't prove yet is why the problems started in October. Using earlier CodeQL actions and/or CodeQL CLI versions did not yield clear results yet.

https://github.com/hoffie/codeql-hdiutil-breakage/actions

@hoffie
Member Author

hoffie commented Dec 18, 2023

My test cases no longer reproduce the issue. CI on main is also green now. There has been another CodeQL actions update which changed the underlying CodeQL version. I suspect that might have fixed it (timing matches). I don't see anything relevant in the Changelog though.

github/codeql-action#2016
https://github.blog/changelog/2023-12-13-codeql-2-15-4-performance-improvements-and-updated-language-support/
https://codeql.github.com/docs/codeql-overview/codeql-changelog/codeql-cli-2.15.4/#improvements

@softins
Member

softins commented Jan 12, 2024

> We might want to re-consider whether running CodeQL on ordinary builds is such a good idea.

I agree. I found this issue independently just this week, and found disabling CodeQL fixed it.

I'm not sure what the usefulness is of having CodeQL run every time on almost every platform. I suppose it catches platform-specific code?

@softins
Member

softins commented Jan 12, 2024

Strangely, as of today, without any changes in that area, the Mac Legacy build has started working again. See https://github.com/jamulussoftware/jamulus/actions/runs/7506943410 and https://github.com/softins/jamulus/actions/runs/7505799220

@softins
Member

softins commented Jan 29, 2024

> Strangely, as of today, without any changes in that area, the Mac Legacy build has started working again.

And then it stopped working again. I think this commit dbe7286 should be raised as a PR sooner rather than later. It helped the Mac Legacy build run properly for some test builds on my fork, and did not upset the main Mac build either.

I am happy to raise the PR myself if it helps - I have cherry-picked the above commit into a branch of my own, and that carries the original author attribution over.

softins pushed a commit to softins/jamulus that referenced this issue Jan 29, 2024
@softins
Member

softins commented Jan 29, 2024

Having added this to my branch, I then moved the CodeQL step from the Legacy build back to the Main build. It completed without error: https://github.com/softins/jamulus/actions/runs/7699409889

Projects
Status: Done

2 participants