"Failed to parse bmc-info" when the system does not provide "System Firmware Version" #57

Closed
Supermathie opened this issue Sep 21, 2020 · 21 comments · Fixed by #67

@Supermathie

On my Supermicro servers, I'm getting the following error with the bmc collector enabled:

ERRO[0004] Failed to parse bmc-info data from [local]: Could not find value in output: Device ID             : 32
Device Revision       : 1
Device SDRs           : unsupported
Firmware Revision     : 1.52
Device Available      : yes (normal operation)
IPMI Version          : 2.0
Sensor Device         : supported
SDR Repository Device : supported
SEL Device            : supported
FRU Inventory Device  : supported
IPMB Event Receiver   : supported
IPMB Event Generator  : supported
Bridge                : unsupported
Chassis Device        : supported
Manufacturer ID       : Super Micro Computer Inc. (10876)
Product ID            : 6937
Auxiliary Firmware Revision Information : 00000000h

Device GUID : 00000000-0000-0000-0000-000000000000

System GUID : 00000000-d43a-46ef-ec3c-534d31303136
…

As you can see the BMC on SMCI doesn't always provide a System Firmware Version - can we make this optional?
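
A minimal Go sketch of what "optional" could look like when parsing this output. The helper names `requiredValue` and `optionalValue` and both regular expressions are hypothetical, chosen for illustration; they are not the exporter's actual code. The `System Firmware Version` label is assumed from the issue title.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical patterns for two bmc-info fields. The leading ^ keeps
// "Auxiliary Firmware Revision Information" from matching the first one.
var (
	firmwareRevisionRe      = regexp.MustCompile(`(?m)^Firmware Revision[ \t]*:[ \t]*(.+)$`)
	systemFirmwareVersionRe = regexp.MustCompile(`(?m)^System Firmware Version[ \t]*:[ \t]*(.+)$`)
)

// requiredValue returns the captured value, or an error if the field is absent.
func requiredValue(output string, re *regexp.Regexp) (string, error) {
	m := re.FindStringSubmatch(output)
	if m == nil {
		return "", fmt.Errorf("could not find value in output: %s", output)
	}
	return m[1], nil
}

// optionalValue returns the captured value, or fallback if the field is absent.
func optionalValue(output string, re *regexp.Regexp, fallback string) string {
	m := re.FindStringSubmatch(output)
	if m == nil {
		return fallback
	}
	return m[1]
}

func main() {
	// Shortened sample of the output quoted above (no System Firmware Version).
	output := "Device ID             : 32\nFirmware Revision     : 1.52\nIPMI Version          : 2.0\n"
	fw, err := requiredValue(output, firmwareRevisionRe)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(fw, optionalValue(output, systemFirmwareVersionRe, "N/A"))
}
```

With this split, a missing System Firmware Version yields a placeholder value instead of failing the whole parse.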

@dswarbrick
Member

I also noticed this a few days ago when trying out v1.3.0 for the first time. I initially thought it was just one extremely old SMC board, but it seems to occur on a wider range of systems than expected.

@bitfehler bitfehler self-assigned this Sep 24, 2020
@bitfehler
Contributor

Thanks for the report! This is certainly a bit problematic; I had not thought of this scenario. I will try to release a fixed version ASAP.

@Supermathie
Author

@dswarbrick FWIW, the boards for which bmc-info is not reporting a System Firmware Version are brand new.

@bitfehler
Contributor

Released 1.3.1 - sorry it took so long!

@dswarbrick
Member

dswarbrick commented Oct 22, 2020

Just because we love IPMI so much... I have an old SMC host that does not seem to like running bmc-info without the --get-device-id param, which was removed in 2c6a73b#diff-f1c80b360812fd52bef43d56f3c4ef40853f02711810857332ccedfd9f463f79L291

$ bmc-info -l user -u MONITOR -p $password -h tank-ipmi
Device ID             : 32
Device Revision       : 1
Device SDRs           : unsupported
Firmware Revision     : 3.29
Device Available      : yes (normal operation)
IPMI Version          : 2.0
Sensor Device         : supported
SDR Repository Device : supported
SEL Device            : supported
FRU Inventory Device  : supported
IPMB Event Receiver   : supported
IPMB Event Generator  : supported
Bridge                : unsupported
Chassis Device        : supported
Manufacturer ID       : Nedam ENG. Co., ltd. (47488)
Product ID            : 43025

Device GUID : 00000000-0000-0000-0000-000000000000

System GUID : 00000000-dfad-2990-2500-534d31303037

ipmi_cmd_get_system_info_parameters_system_firmware_version_first_set: privilege level insufficient

The fact that bmc-info exits with status code 1 means that the collector is treating it as a failed command, which is fair enough. But it results in the ipmi_bmc_info metric disappearing for that host. We don't really want to use a higher privilege level for monitoring.

Edit: Running bmc-info without --get-device-id, or running it explicitly with --get-system-info, will trigger the failure. Using a user with a higher privilege level also fails with the same error, e.g.

$ bmc-info -l ADMIN -u ADMIN -p $adminpw -h tank-ipmi --get-system-info
ipmi_cmd_get_system_info_parameters_system_firmware_version_first_set: privilege level insufficient

Versions of freeipmi tested: 1.6.3, 1.6.4.

@dswarbrick
Member

I discovered about 40 more systems which now fail the bmc-info command. I did not previously see them because the ipmi_exporter was running with --log.level=fatal.

In addition to the privilege level error in my previous comment, I've seen some fail with:

ipmi_cmd_get_system_info_parameters_system_firmware_version_first_set: command invalid or unsupported

@bitfehler
Contributor

Oh my. Thanks for the detailed report. Give me a moment to figure out how to solve this mess somewhat gracefully...

@bitfehler bitfehler reopened this Oct 26, 2020
@dswarbrick
Member

Any chance this could be resolved before debian bullseye freeze? I'd quite like to update the package there (which I co-maintain), but am hesitant to use 1.3.x versions, as they will require a bit of patching to work around some of the rough edges.

@bitfehler
Contributor

Just wanted to provide a quick update: I had to take a break from maintenance for a while for private matters, sorry for letting this sit for so long. I am tackling this now. Is before the hard freeze still sufficient, or is it already too late?

@dswarbrick
Member

@bitfehler So long as the diff isn't too huge, we can probably still get it into bullseye, since it's not a new package - it will just take a bit longer to transition. Give me a shout if you need any additional testing on a zoo of old(er) IPMI beasts.

bitfehler added a commit that referenced this issue Feb 15, 2021
When the `bmc` collector was extended to attempt to collect the system
firmware revision, the resulting IPMI request broke on some older
systems. The `bmc-device-id` collector can be used as a fallback for
these systems. For obvious reasons, the `bmc` and `bmc-device-id`
collectors cannot be used at the same time, which is checked at config
load time.

This should finally fix #57.
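
A rough Go sketch of the kind of config-load check the commit message describes. The function name `validateCollectors` and the representation of a module's collectors as a string slice are assumptions made for illustration; only the collector names `bmc` and `bmc-device-id` come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// validateCollectors is a hypothetical config-load check: it rejects a
// collector list that enables both bmc and bmc-device-id.
func validateCollectors(collectors []string) error {
	seen := make(map[string]bool, len(collectors))
	for _, c := range collectors {
		seen[c] = true
	}
	if seen["bmc"] && seen["bmc-device-id"] {
		return errors.New("collectors bmc and bmc-device-id cannot be enabled at the same time")
	}
	return nil
}

func main() {
	// "ipmi" is only an example of another collector name.
	fmt.Println(validateCollectors([]string{"bmc", "ipmi"}))
	fmt.Println(validateCollectors([]string{"bmc", "bmc-device-id"}))
}
```

Failing at load time means a conflicting config is reported once at startup rather than on every scrape.
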
@bitfehler
Contributor

@dswarbrick could you let me know if #65 seems like a reasonable solution that does not involve refactoring the heck out of the exporter?
If you think it would help, I would merge that and the (small) fix for #62 and then do another release, before doing a larger refactoring to account for all the things I have learned about the dark sides of IPMI from maintaining this exporter for a while 😇

@dswarbrick
Member

@bitfehler As I mentioned above, running bmc-info without --get-device-id results in a non-zero exit status for some devices, which makes the collector treat it as a failed command, despite the partially good output.

The hack that I ended up applying to our in-house packages was to explicitly run bmc-info <target> --get-device-id, which should generally be fine, and additionally run bmc-info <target> --get-system-info, which can potentially return non-zero. In such cases, the systemFirmwareVersion var is set to "N/A".
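
The two-call approach could be sketched in Go roughly as follows. This is not the actual in-house patch: the `runner` type, `collectBMCInfo`, and `fakeOldBMC` are hypothetical names, command execution is abstracted behind a function so the logic can be tried without a BMC, and the `System Firmware Version` output label is assumed from the issue title.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// runner abstracts "bmc-info <target> <args...>"; a real one would shell out.
type runner func(args ...string) (string, error)

// Assumed label of the field printed by --get-system-info.
var sysFwRe = regexp.MustCompile(`(?m)^System Firmware Version[ \t]*:[ \t]*(.+)$`)

// collectBMCInfo requires --get-device-id to succeed, but tolerates a
// failing --get-system-info by reporting "N/A" for the firmware version.
func collectBMCInfo(run runner) (deviceInfo, systemFirmwareVersion string, err error) {
	deviceInfo, err = run("--get-device-id")
	if err != nil {
		return "", "", fmt.Errorf("bmc-info --get-device-id failed: %w", err)
	}
	systemFirmwareVersion = "N/A"
	if out, sysErr := run("--get-system-info"); sysErr == nil {
		if m := sysFwRe.FindStringSubmatch(out); m != nil {
			systemFirmwareVersion = m[1]
		}
	}
	return deviceInfo, systemFirmwareVersion, nil
}

// fakeOldBMC simulates a host whose BMC rejects the system-info request.
func fakeOldBMC(args ...string) (string, error) {
	if len(args) > 0 && args[0] == "--get-system-info" {
		return "", errors.New("privilege level insufficient")
	}
	return "Firmware Revision     : 3.29\n", nil
}

func main() {
	info, version, err := collectBMCInfo(fakeOldBMC)
	fmt.Printf("%q %q %v\n", info, version, err)
}
```

The cost of this design is the second IPMI round trip on every scrape.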

@bitfehler
Contributor

The idea was that if you run into this issue, you just use the bmc-device-id collector for these machines, which will only do --get-device-id and set the system firmware version to "N/A". I would like to avoid doing two roundtrips for a single collector, as each one is pretty costly with IPMI.

@dswarbrick
Member

dswarbrick commented Feb 16, 2021

I agree that two IPMI roundtrips are not ideal; however, it's not always feasible to know in advance which devices are going to fail without --get-device-id. It also means that we would need to group the scrape targets or somehow label them with the ipmi_exporter module that they should refer to in the scrape.

Part of the problem is the fact that ipmi_exporter treats the non-zero exit status of the bmc-info command as a completely failed command, despite the partially good output, as I wrote in #57 (comment). Maybe if it didn't treat that as a completely failed command, it could parse the useful output of that command, and assume that bmc-info ... --get-system-info was going to fail, and thus preemptively just export the system firmware version as "N/A".

@bitfehler
Contributor

> And it means that we would need to group the scrape targets or somehow label them with the ipmi_exporter module that they should refer to in the scrape

Hrm, that is a good point indeed, seems quite impractical.

> Maybe if it didn't treat that as a completely failed command, it could parse the useful output of that command

I was hoping to avoid something like that, but it might actually be worth a try...

bitfehler added a commit that referenced this issue Feb 16, 2021
This is a workaround for an issue described in #57. The bmc-info command
might produce usable output minus the system firmware revision, but then
choke on that. Try to recover in that scenario by attempting to parse
the output even if the command failed. Since the system firmware
revision is already optional, this should at least produce all other
values.

It is not pretty, but it avoids both folks having to change their
configs as well as a second round-trip, which can be quite expensive in
IPMI.
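
The recovery idea in this commit message could be sketched in Go as follows. This is an illustration under assumptions, not the committed code: `parseFirmwareRevision`, its `degraded` flag, and the regular expression are hypothetical, and the function takes the command output and error as parameters so it can be tried without running bmc-info.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// Hypothetical pattern for one field that is printed before the failure.
var fwRevRe = regexp.MustCompile(`(?m)^Firmware Revision[ \t]*:[ \t]*(.+)$`)

// parseFirmwareRevision attempts to parse output even if the command failed.
// degraded is true when a value was recovered despite a non-nil cmdErr, so
// the caller can log a warning instead of dropping the metric.
func parseFirmwareRevision(output string, cmdErr error) (value string, degraded bool, err error) {
	m := fwRevRe.FindStringSubmatch(output)
	if m == nil {
		if cmdErr != nil {
			return "", false, fmt.Errorf("bmc-info failed and produced no usable output: %w", cmdErr)
		}
		return "", false, errors.New("could not find Firmware Revision in output")
	}
	return m[1], cmdErr != nil, nil
}

func main() {
	// Partial output followed by the error line, as quoted earlier in the thread.
	output := "Firmware Revision     : 3.29\n\n" +
		"ipmi_cmd_get_system_info_parameters_system_firmware_version_first_set: privilege level insufficient\n"
	value, degraded, err := parseFirmwareRevision(output, errors.New("exit status 1"))
	fmt.Println(value, degraded, err)
}
```

A command error is only fatal here when nothing usable could be parsed, which avoids both a config change and a second round trip.
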
@bitfehler
Contributor

@dswarbrick
Member

@bitfehler That looks pretty okay to me.

@bitfehler
Contributor

Cool. Could I get you to give that a spin? I don't have any such hosts available that would trigger this path...

@dswarbrick
Member

Looks mostly fine, however it's going to be a bit noisy at log.level=error.

Feb 16 21:03:42 ber-prom1 prometheus-ipmi-exporter[31853]: time="2021-02-16T21:03:42Z" level=error msg="Error while calling bmc-info for tank-ipmi: Device ID             : 32\nDevice Revision       : 1\nDevice SDRs           : unsupported\nFirmware Revision     : 3.29\nDevice Available      : yes (normal operation)\nIPMI Version          : 2.0\nSensor Device         : supported\nSDR Repository Device : supported\nSEL Device            : supported\nFRU Inventory Device  : supported\nIPMB Event Receiver   : supported\nIPMB Event Generator  : supported\nBridge                : unsupported\nChassis Device        : supported\nManufacturer ID       : Nedam ENG. Co., ltd. (47488)\nProduct ID            : 43025\n\nDevice GUID : 00000000-0000-0000-0000-000000000000\n\nSystem GUID : 00000000-dfad-2990-2500-534d31303037\n\nipmi_cmd_get_system_info_parameters_system_firmware_version_first_set: privilege level insufficient\n" source="collector.go:278"

However, we're already running most instances at log.level=fatal.

@bitfehler
Contributor

Thanks for the feedback, I demoted the severity of that log to warning. It was misguided to log an error at that point anyways. Current master should work ok for you now. I'll push one more fix and then make another release. The changes will be rather small compared to the next release, so I hope that works for you to get it into Bullseye.

@dswarbrick
Member

Super, thanks for your perseverance on this one!
