starting an existing collector service causes a non-0 exit with systemctl #141
Comments
If you could explain to me how this block works normally, I may be able to debug this further on my own. I'm not clear on why this block wouldn't throw an error during a normal run where the collector service is already running. In both cases (when chef-client runs with the collector service running succeeds and when it fails) |
This is not a chef problem: if a service start returns an error, that is a problem with the init script, regardless of which subsystem is involved. We could try working around this temporarily by switching the defaults from systemd to sysvinit, but that is a workaround, not a real solution. I would opt for allowing the init subsystem to be specified, but for very different reasons. Giving people the choice of init subsystem is important because not everyone is hopping on the systemd bandwagon, for reasons I will not go into here; my intention is not to start a systemd religious war, as they are rarely based on facts and are more often driven by emotions. |
Technically this is not our problem but it does not look like sumo is gonna solve it any time soon. I am gonna put together a PR this weekend and will probably want to get @dannenberg to validate that it fixes his scenario. |
relates to #141 We now allow using `node['sumologic']['init_style']` with any valid chef provider. Default is `nil` and will defer to ohai unless overridden. Note: it does not attempt any validation that you chose a valid provider. Signed-off-by: Ben Abrams <[email protected]>
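To illustrate the "default is `nil`, defer to ohai unless overridden" behavior described in the commit message, here is a minimal plain-Ruby sketch. It is not the cookbook's code: `resolve_init_style` is a hypothetical helper, `override` stands in for `node['sumologic']['init_style']`, `detected` stands in for whatever would be chosen automatically, and the string values are placeholders rather than real provider names.

```ruby
# Hypothetical helper, not part of the cookbook.
# override: stand-in for node['sumologic']['init_style'] (nil by default)
# detected: stand-in for the init subsystem that would be picked automatically
def resolve_init_style(override, detected)
  # nil means "no opinion", so defer to detection.
  return detected if override.nil?

  # No validation is attempted here, mirroring the note in the commit message.
  override
end

# Placeholder values for illustration only.
puts resolve_init_style(nil, 'systemd')        # no override: detection is used
puts resolve_init_style('sysvinit', 'systemd') # explicit override is used
```

Because nothing is validated, a typo in the override would only surface later, when the service is actually managed.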
I have merged this and am awaiting @duchatran to cut a release. This does not fix the issue reported so I am not sure if I want to leave this open as it's really not a chef issue. The "fix" allows someone to work around the problem but it is not really solved because we can't solve it since we do not manage the init scripts in this cookbook. |
Well, basically the first time it installs (assuming you have not pre-created this dir) this will not run: https://github.com/SumoLogic/sumologic-collector-chef-cookbook/blob/v1.2.22/recipes/default.rb#L31-L52 I think it is there to avoid reinstalling every time (and possibly modifying config, which does seem wrong; I would have to look deeper but am not terribly motivated at the moment and would likely just wait until I have time to refactor the whole cookbook), since they do not version the deb/rpm files, so there is no way to know if we need to upgrade. This cookbook does not attempt to upgrade collectors (something I find sorta annoying). Like I said though, this has nothing to do with the issue at hand: they have a bad systemd unit template, and until they fix it all I can say is use sysvinit. |
Won't this section of code just make that last PR not apply? If systemd is available, it's going to use it, even if you try to override it with the init_style attribute. |
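As a rough sketch of the guard described above, assuming it is a simple check on whether the install directory already exists (the linked recipe is the authority on the real logic), the branching looks something like this. `collector_step` and the returned symbols are hypothetical names used only for illustration.

```ruby
require 'tmpdir'

# Hypothetical stand-in for the recipe's guard; names are illustrative.
def collector_step(install_dir)
  if Dir.exist?(install_dir)
    # Directory already present: skip the install and just start the
    # existing collector, which is the step that fails in this issue.
    :start_existing_collector
  else
    # First run: nothing installed yet, so install the collector.
    :install_collector
  end
end

Dir.mktmpdir do |dir|
  puts collector_step(dir)                       # directory exists
  puts collector_step(File.join(dir, 'missing')) # directory does not exist
end
```

This matches the report that the first chef-client run succeeds and only later runs, where the directory already exists, hit the failing start.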
Hmm the short answer is sorta? When using the recipes it will indeed use the requested subsystem as the call is to I confirmed with the original user it allows them to work around their issue. |
I'm just using the sumologic_collector LWRP. This cookbook has sporadically broken my provisioning pipeline for weeks now. Have you got any tips for getting things working again? I understand we can point fingers at sumo, but at this point, I just need to find some hack to get things functional again. |
We can do a similar change to the lwrp to allow you to be unblocked. I can hopefully get to it tonight. |
this whole thing may be related to this chef-client change. |
I don't think that is the root cause; it may just cause it to favor systemd more often than not. That does not account for why sending a |
Could someone give me the output of: |
In my case, I was installing the collector and then enabling/starting the service (which would fail because the collector was already running). After that happened, checking the status of the service with systemctl returned this:
|
Hmm if you can hit me up in the sumo community slack maybe we can do some more interactive troubleshooting. I will be available after noon pacific time today. |
Hi guys, just wondering if this issue is fixed, as I am still running into the same errors. Any help or ideas on how to work around this problem are welcome. Thank you. |
I am not using sumologic at my current job but you can work around it by specifying the init style via an attribute: #145 |
I was running into this issue also. The systemd unit file given as an example doesn't work. Also the one installed with the deb file didn't work for me either. It fails with some opaque messages
The above comment gave me a hint. By adding a
|
@mbigras interesting I will ping some folks at sumologic to see about them patching their systemd unit file in their packages and documentation. |
I'm having this issue on Amazon Linux 2. The fix by @mbigras doesn't seem to fix the issue like I hoped it would. What's also odd is that in my kitchen.yml I have RHEL7, CentOS7, and Oracle Linux 7, and all of those work fine; I confirmed they are using systemd. It just seems to be an issue with Amazon Linux 2 when I use a systemd file like @mbigras suggested. |
That is strange, is it the same error as reported? Also can you run:
If you get a:
Then it means that the process is not able to launch, this is technically a different problem than the original report as the main report was that sending a start to an existing running service errors out because systemd is confused about where to find the main running process. If this is the case (which I saw at least once above) perhaps running:
might shed some additional light if the service shows as running, as that means that the systemd unit file is broken and does not know how to track the PID for the process. |
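To keep the two failure modes apart, the reasoning above can be written down as a small plain-Ruby triage sketch. Everything here is hypothetical: `diagnose_collector` and its symbols are illustrative names, and both inputs are booleans you would determine by hand (whether systemd reports the unit as active, and whether a collector process is actually present on the host).

```ruby
# Hypothetical triage helper; it runs no commands itself.
# unit_active:     does systemd report the collector unit as active?
# process_running: is a collector process actually present on the host?
def diagnose_collector(unit_active:, process_running:)
  # systemd already considers the unit active, so neither case applies.
  return :unit_active if unit_active

  if process_running
    # Original report: the process is up, but the unit file does not let
    # systemd track its PID, so a start request errors out.
    :unit_does_not_track_pid
  else
    # Different problem: the process is not able to launch at all.
    :process_fails_to_launch
  end
end

puts diagnose_collector(unit_active: false, process_running: true)
puts diagnose_collector(unit_active: false, process_running: false)
```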
Yeah I suppose this error might be slightly different then. Here is more info:
The issue seems to be that the systemd file for collector is missing on Amazon Linux 2. What's also weird is that all other systemd OSes don't have Regardless I suppose my issue is different than the original issue so you might want to disregard it. |
Any updates on the Amazon Linux 2 issue? I am facing the same problem. |
I do not; I do not run Amazon Linux 2. I suspect that in the case of Amazon Linux 2, if you are seeing a |
Environment Information
Please include the following:
Expected Behavior
What should have happened?
chef-client should run to completion regardless of the state of the collector service
Actual Behavior
What actually happened?
chef-client exits with an error due to this line https://github.com/SumoLogic/sumologic-collector-chef-cookbook/blob/master/recipes/default.rb#L44 which is attempting to start an existing, running collector. The error output is included below:
Steps to Reproduce
Please list the steps required to reproduce the issue.
We are using terraform to spin up our instances
The initial chef-client run completes successfully, but the next one throws that error.
NB: If we stop the collector service and run chef-client again it completes successfully from then on.
Important Factoids
Is there anything atypical about the setup that we should know? For example everything goes through an HTTP/HTTPS proxy or I do not use rabbitmq.
NB: If we stop the collector service and run chef-client again it completes successfully from then on.
References
Are there any other GitHub issues (open or closed) or Pull Requests that should be linked here? For example: