Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when spinning up a child process from inside of ruby #1083

Closed
david22swan opened this issue Jun 3, 2021 · 5 comments · Fixed by #1084
Closed

Error when spinning up a child process from inside of ruby #1083

david22swan opened this issue Jun 3, 2021 · 5 comments · Fixed by #1084
Assignees
Labels

Comments

@david22swan
Copy link
Member

david22swan commented Jun 3, 2021

When running the Puppet.Dsc acceptance tests on GHA we hit an error during the setup for the first test.
This error occurs whenever the pdk is called upon in order to create a module and states that there has been a ChildProcess::LaunchError due to our access having been denied.

The exact error that we have been getting can be seen here.
And the PR on which we were working when we discovered this bug can be shown here.

A conversation between @michaeltlombardi the inimitable @glennsarti, who went out of his way to try a minimal local repro using the GHA docker image, discovered the problem seems to be this line in the PDK, which sets process.leader:

@process.leader = true

In his manual testing, attempting to start a process in GHA with this flag breaks

p = ChildProcess.build('ls')
p.leader = true
p.start
p.wait
puts p.exit_code
# Access is denied

But with it removed, no error is returned:

p = ChildProcess.build('ls')
p.start
p.wait
puts p.exit_code

We had opened a PR on the pdk in order to try and implement two possible solutions, shown here.

The first of these solutions was to move the pin of the childprocess gem from v0.7.1 to v4.0.0 in order to bring in the patch shown here, however this did not resolve the issue as the error still occured.

The second of our attempted solutions was to comment out the above linked line:

@process.leader = true

But this did not work either as the change somehow caused a dependency error in the bundle install as shown here

A draft for the experimental branch that we used is currently open can can be found here: #1081

We are hoping for some help in finding a fix for this as it is a major blocker in moving our CI off of appveyor.

@david22swan david22swan added bug needs-triage Newly created issue that has not been reviewed by a PDK contributor labels Jun 3, 2021
@jpogran
Copy link
Contributor

jpogran commented Jun 4, 2021

Attempted to run locally, works fine.
Attempted to run in a windows container, works fine.
Attempted to run in a bolt remote command, works fine.

Did a little research, it seems likes others are having similar problems in Github: https://github.com/enkessler/childprocess/pull/178/checks?check_run_id=2725513511

@sanfrancrisko sanfrancrisko linked a pull request Jun 7, 2021 that will close this issue
@sanfrancrisko
Copy link
Contributor

Thanks to @da-ar for the investigation in to this, so far.

We use the childprocess to abstract the spawning / handling of exec threads across platforms. Setting the leader attribute will ensure any process trees created will die when the child process does.

On Windows platforms, this is done by setting the CREATE_BREAKAWAY_FROM_JOB flag when calling [AssignProcessToJobObject](https://docs.microsoft.com/en-us/windows/win32/api/jobapi2/nf-jobapi2-assignprocesstojobobject] in the Win32 API. According to [the Win32 API documentation]https://docs.microsoft.com/en-us/windows/win32/api/jobapi2/) when this flag is set:

The child process breaks away from every job in the job chain unless a job in the chain does not allow breakaway. In this case, the child process does not break away from that job or any job above it in the job chain

Jobs (and nested jobs) created through this API are created via the Task Scheduler. According to this bug write up from Mozilla, and the associated patch to fix on Windows < 8 / Server 2008 R2 ( NT < 6.2 ), the CREATE_BREAKAWAY_FROM_JOB flag required JOB_OBJECT_LIMIT_BREAKAWAY_OK flag to be set too (which childprocess does here, however, on Windows > Windows 8 / Server 2008 R2, the Win32 API will return an ACCESS_DENIED error if the JOB_OBJECT_LIMIT_BREAKAWAY_OK flag is set. On a Windows 10 system it was observed that Task Scheduler did not set this flag as nested jobs are supported.

For our scenario, we should conditionally set the @process.leader parameter to true if the OS is <= Windows 7 / Server 2003, and in other cases, leave it unset / explicitly set to false.

More investigation / understanding would be required to offer a fix upstream to enkessler/childprocess, as there's a lot of unknowns about how how these jobs are created (i.e. where Task Scheduler fits in, how the process / child processes are run - logon service vs user) that I'm simply not comfortable enough with at the moment.

david22swan added a commit to puppetlabs/Puppet.Dsc that referenced this issue Jun 8, 2021
Due to upstream issues with one of our dependencies we are currently unable to run the acceptance test's on GHA, as such for now we will move only the General and Unit tests while we search for a fix for this issue.
Further information on the issue can be found within the following link: puppetlabs/pdk#1083
david22swan added a commit to puppetlabs/Puppet.Dsc that referenced this issue Jun 8, 2021
Due to upstream issues with one of our dependencies we are currently unable to run the acceptance test's on GHA, as such for now we will move only the General and Unit tests while we search for a fix for this issue.
Further information on the issue can be found within the following link: puppetlabs/pdk#1083
@sanfrancrisko
Copy link
Contributor

Bumping childprocess to version 4.0.0 and ensuring @process.leader is unset has seemed to resolve the issue on Windows Server 2016 and 2019:

Using this test workflow, on a repo pulling in a version of PDK with the above changes, I was able to get bundle exec pdk validate to run successfully:

This run is with PDK 2.1.0, reproducing the original issue.

@jpogran jpogran added pdk-all and removed needs-triage Newly created issue that has not been reviewed by a PDK contributor labels Jun 9, 2021
@michaeltlombardi
Copy link
Contributor

I spiked a test PR on Windows based on @sanfrancrisko's work here:

And it works!

@sanfrancrisko
Copy link
Contributor

We have a number of AppVeyor tests that have been failing. We got some issues resolved with #1089, however, there are still failures that seem to be specific to AppVeyor. See @jpogran 's comment here

We have tested this manually on:

  • Github Actions Windows Server 2019 instance
  • Github Actions Windows Server 2016 instance
  • ABS Windows Server 2019 instance
  • ABS Windows Server 2016 instance
  • ABS Windows Server 2008 R2 instance
  • ABS Windows 10 instance

@michaeltlombardi also created a workflow with the config of the original issue, but this fixed pulled in: https://github.com/puppetlabs/Puppet.Dsc/pull/168/checks?check_run_id=2785882419#step:4:357

jpogran added a commit that referenced this issue Jun 11, 2021
…s_win

(GH-1083) Bump childprocess to '~> 4.0.0'; Disable @process.leader
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants