Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use px/agent_status_diagnostics script within px cli to detect missing kernel headers #2065

Conversation

ddelnano
Copy link
Member

@ddelnano ddelnano commented Dec 18, 2024

Summary: Use px/agent_status_diagnostics script within px cli to detect missing kernel headers

This PR leverages the script added in #2064 to detect missing kernel headers during cli deploys and px collect-logs commands. This solves 2/3 of the use cases I was hoping to identify for #2051 (the last being helm installs).

A recent example of this problem is #1986, where a Go TLS tracing bug went undiagnosed for months (August to December). Amazon Linux 2023's headers are different enough that it breaks Go TLS tracing when pixie's pre-packaged headers are used. The tooling in this PR would have provided a few opportunities for this to be caught.

Relevant Issues: #2051

Type of change: /kind feature

Test Plan: Verified the following scenarios

Test cases
  • px collect-logs works against a cloud that doesn't have a px/agent_status_diagnostics script
$ bazel run -c opt  --stamp src/pixie_cli:px -- collect-logs

WARN[0006] healthcheck script detected the following warnings:  error="Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes."
Logs written to pixie_logs_20241223165214.zip

# zip file contains px/agent_status output
$ cat px_agent_diagnostics.txt
{"_tableName_":"output","agent_id":"07fb4d26-3b53-4ba7-9bb7-f2cb10a1e63d","asid":79,"hostname":"gke-dev-ddelnano1-default-pool-b099382d-30mu","ip_address":"","agent_state":"AGENT_STATE_HEALTHY","create_time":"2024-12-18T12:43:44.41952403Z","last_heartbeat_ns":4303060450,"kernel_headers_installed":true}
  • px collect-logs works against a cloud that does have a px/agent_status_diagnostics script
$ bazel run  src/pixie_cli:px -- collect-logs
INFO: Analyzed target //src/pixie_cli:px (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //src/pixie_cli:px up-to-date:
  bazel-bin/src/pixie_cli/px_/px
INFO: Elapsed time: 4.240s, Critical Path: 3.89s
INFO: 3 processes: 1 internal, 2 linux-sandbox.
INFO: Build completed successfully, 3 total actions
INFO: Running command line: bazel-bin/src/pixie_cli/px_/px collect-logs
Pixie CLI
*******************************
* ENV VARS
*        PX_CLOUD_ADDR=testing.getcosmic.ai:443
*******************************
Logs written to pixie_logs_20241218164734.zip

$ cat px_agent_diagnostics.txt
{"_tableName_":"output","headers_installed_percent":1}
  • px collect-logs identifies when kernel headers are missing when px/agent_status_diagnostics present
$ Logs written to pixie_logs_20241223165214.zip
$ bazel run -c opt  --stamp src/pixie_cli:px -- --bundle https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json collect-logs
[ ... ]
WARN[0012] healthcheck script detected the following warnings:  error="Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes."

$ cat px_agent_diagnostics.txt
{"_tableName_":"output","headers_installed_percent":0.5}
  • Artificially forcing context deadline (timeout) results in an error
$ git diff
diff --git a/src/pixie_cli/pkg/vizier/script.go b/src/pixie_cli/pkg/vizier/script.go
index 7d3b7e008..c957b8943 100644
--- a/src/pixie_cli/pkg/vizier/script.go
+++ b/src/pixie_cli/pkg/vizier/script.go
@@ -317,7 +317,7 @@ func RunSimpleHealthCheckScript(br *script.BundleManager, cloudAddr string, clus
                execScript = br.MustGetScript(script.AgentStatusScript)
        }

-       ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+       ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)

$ bazel run  src/pixie_cli:px -- collect-logs

WARN[0012]src/pixie_cli/pkg/vizier/logs.go:135 px.dev/pixie/src/pixie_cli/pkg/vizier.(*LogCollector).CollectPixieLogs() failed to run health check script             error="context deadline exceeded"
Logs written to pixie_logs_20241218165033.zip
  • px collect-logs prompts auth flow when credentials don't match current cloud
$ PX_CLOUD_ADDR=new-cloud bazel run  src/pixie_cli:px -- collect-logs
*******************************
* ENV VARS
*        PX_CLOUD_ADDR=new-cloud
*******************************
Failed to authenticate. Please retry `px auth login`.
  • px deploy on pre v0.14.14 (older) vizier with existing bundle warns that kernel headers should be installed
# Additional flags provided to speed up vizier bootstrapping
$ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json
  • px deploy on pre v0.14.14 (older) vizier with latest bundle warns that kernel headers should be installed
# Additional flags provided to speed up vizier bootstrapping
$ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json

[ ... ]
Waiting for Pixie to pass healthcheck
 ✔    Wait for PEMs/Kelvin
 ✔    Wait for PEMs/Kelvin
 ✕    Wait for healthcheck  ERR: Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes.
Pixie healthcheck detected the following warnings: error=Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes.

[ ...]
  • px deploy on v0.14.14 vizier with latest bundle warns appropriate when kernel headers are missing
$ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key=<deploy key> --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json -v 0.14.14-pre-r1.0

[ ... ]
Waiting for Pixie to pass healthcheck
 ✔    Wait for PEMs/Kelvin
 ✕    Wait for healthcheck  ERR: Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes.
Pixie healthcheck detected the following warnings: error=Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes.

Changelog Message: Enhanced the px cli's deploy and collect-logs commands to surface when kernel headers aren't installed. This is a common source of bugs that can only be addressed by installing your distro's kernel headers.

@ddelnano ddelnano force-pushed the ddelnano/use-agent-diagnostics-px-deploy-collect-logs branch 4 times, most recently from 52994ea to d942389 Compare December 23, 2024 14:19
@ddelnano ddelnano force-pushed the ddelnano/use-agent-diagnostics-px-deploy-collect-logs branch from d942389 to b632cd4 Compare December 23, 2024 16:27
@ddelnano ddelnano marked this pull request as ready for review December 24, 2024 14:18
@ddelnano ddelnano requested a review from a team as a code owner December 24, 2024 14:18
@ddelnano
Copy link
Member Author

ddelnano commented Jan 2, 2025

@pixie-io/vizier please give this a look when you have the chance!

ddelnano added a commit that referenced this pull request Jan 6, 2025
Summary: Add px/agent_status_diagnostics pxl script

This PR adds a new pxl script that helps identify if linux headers are
missing on any PEMs. This can be extended in the future, but the first
use cause will be to execute the script during `px deploy` and `px
collect-logs`.

Relevant Issues: #2051

Type of change: /kind feature

Test Plan: Ran the script in the UI and tested it as part of the
validation for #2065

![Screen Shot 2024-12-18 at 10 51 15
AM](https://github.com/user-attachments/assets/bd4ab6de-ad92-4af3-8714-b044a7df7d65)


Changelog Message: Add `px/agent_status_diagnostics` pxl script for
checking common issues

Signed-off-by: Dom Del Nano <[email protected]>
@ddelnano ddelnano merged commit 3c9c4bd into pixie-io:main Jan 6, 2025
25 checks passed
@ddelnano ddelnano deleted the ddelnano/use-agent-diagnostics-px-deploy-collect-logs branch January 6, 2025 23:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants