Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 357: Integrating health checks #569

Merged
merged 11 commits into from
Aug 17, 2021

Conversation

SrishT
Copy link
Contributor

@SrishT SrishT commented Aug 2, 2021

Signed-off-by: SrishT [email protected]

Change log description

Integrates the apis exposed by the new healthcheck framework which is compatible with pravega version >= 0.10.0, while not breaking backward compatibility.

Purpose of the change

Fixes #357

What the code does

Uses the latest healthcheck apis for monitoring the health of the controller and segmentstore pods when any pravega version starting 0.10.0 is deployed. Also, uses the earlier liveness and readiness probes for the controller and segmentstore pods when pravega version lower than 0.10.0 has been deployed.

How to verify it

When pravega version is lesser than 0.10.0, the following healthchecks are invoked

for SegmentStore

 livenessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        -  netstat -ltn 2> /dev/null | grep 12345 || ss -ltn 2> /dev/null | grep 12345
     
   readinessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        -  netstat -ltn 2> /dev/null | grep 12345 || ss -ltn 2> /dev/null | grep 12345

for Controller

 livenessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - netstat -ltn 2> /dev/null | grep 9090 || ss -ltn 2> /dev/null | grep 9090
       
   readinessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - 'echo $JAVA_OPTS | grep ''controller.auth.tlsEnabled=true'' &&  curl -s
          -X GET ''https://localhost:10080/v1/scopes/'' -H ''accept: application/json''
          | grep ''_system''|| (echo $JAVA_OPTS | grep ''controller.auth.tlsEnabled=false''
          && curl -s -X GET ''http://localhost:10080/v1/scopes/'' -H ''accept: application/json''
          | grep ''_system'' ) || (echo $JAVA_OPTS | grep ''controller.security.tls.enable=true''
          && echo $JAVA_OPTS | grep -v ''controller.auth.tlsEnabled'' && curl -s -X
          GET ''https://localhost:10080/v1/scopes/'' -H ''accept: application/json''
          | grep ''_system'' ) || (curl -s -X GET ''http://localhost:10080/v1/scopes/''
          -H ''accept: application/json'' | grep ''_system'') '

However, when deployed pravega version is equal to 0.10.0, the latest healthcheck apis are invoked.

for SegmentStore

  livenessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - curl -s -X GET 'http://localhost:6061/v1/health/liveness'
     
  readinessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - curl -s -X GET 'http://localhost:6061/v1/health/readiness'

for Controller

  livenessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - curl -s -X GET 'http://localhost:10080/v1/health/liveness'
      
  readinessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - curl -s -X GET 'http://localhost:10080/v1/health/readiness'

Following scenarios have been tested (with Auth and TLS disabled, with only Auth enabled, and with both Auth and TLS enabled)

  • deploy pravega operator 0.5.3, pravega 0.10.0; upgrade pravega operator to latest
  • deploy latest pravega operator, pravega 0.9.0; upgrade pravega to 0.10.0

@SrishT SrishT marked this pull request as ready for review August 4, 2021 14:18
@SrishT SrishT requested a review from anishakj August 4, 2021 14:18
pkg/util/pravegacluster.go Outdated Show resolved Hide resolved
@anishakj anishakj requested a review from pbelgundi August 4, 2021 16:23
@codecov-commenter
Copy link

codecov-commenter commented Aug 5, 2021

Codecov Report

Merging #569 (0225c57) into master (8587324) will increase coverage by 0.12%.
The diff coverage is 95.34%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #569      +/-   ##
==========================================
+ Coverage   74.56%   74.68%   +0.12%     
==========================================
  Files          15       15              
  Lines        4144     4164      +20     
==========================================
+ Hits         3090     3110      +20     
  Misses        932      932              
  Partials      122      122              
Impacted Files Coverage Δ
pkg/apis/pravega/v1beta1/pravega.go 98.34% <ø> (ø)
pkg/apis/pravega/v1beta1/pravegacluster_types.go 26.73% <50.00%> (ø)
pkg/controller/pravega/pravega_controller.go 100.00% <100.00%> (ø)
pkg/controller/pravega/pravega_segmentstore.go 100.00% <100.00%> (ø)
...roller/pravegacluster/pravegacluster_controller.go 54.25% <100.00%> (ø)
pkg/controller/pravegacluster/upgrade.go 73.03% <100.00%> (ø)
pkg/util/pravegacluster.go 97.83% <100.00%> (+0.26%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8587324...0225c57. Read the comment docs.

@pbelgundi pbelgundi requested review from gaddamas and removed request for pbelgundi August 9, 2021 05:30
@SrishT SrishT force-pushed the issue-357-integrating-health-check-apis branch from 0cd62bb to 747ddec Compare August 9, 2021 15:18
@SrishT SrishT requested a review from jkhalack August 11, 2021 14:37
if IsVersionBelow(version, compareVersion) {
command = fmt.Sprintf("netstat -ltn 2> /dev/null | grep %d || ss -ltn 2> /dev/null | grep %d", port, port)
} else {
command = fmt.Sprintf("(netstat -ltn 2> /dev/null | grep %d || ss -ltn 2> /dev/null | grep %d) && (curl -s -X GET 'http://localhost:%d/v1/health/liveness' || curl -s -k -X GET 'https://localhost:%d/v1/health/liveness')", port, port, restport, restport)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it really required to check the port using netstat? Why not to send a REST request right away and check for return code?

It might happen that netstat will not be present in pravega containers starting from 0.10...

Copy link
Contributor

@anishakj anishakj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@anishakj
Copy link
Contributor

@SrishT Seeing e2e are failing, Please check

Signed-off-by: SrishT <[email protected]>
Copy link

@jkhalack jkhalack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: SrishT <[email protected]>
command = fmt.Sprintf("echo $JAVA_OPTS | grep 'controller.auth.tlsEnabled=true' && curl -s -X GET 'https://localhost:%d/v1/scopes/' -H 'accept: application/json' | grep '_system'|| (echo $JAVA_OPTS | grep 'controller.auth.tlsEnabled=false' && curl -s -X GET 'http://localhost:%d/v1/scopes/' -H 'accept: application/json' | grep '_system' ) || (echo $JAVA_OPTS | grep 'controller.security.tls.enable=true' && echo $JAVA_OPTS | grep -v 'controller.auth.tlsEnabled' && curl -s -X GET 'https://localhost:%d/v1/scopes/' -H 'accept: application/json' | grep '_system' ) || (curl -s -X GET 'http://localhost:%d/v1/scopes/' -H 'accept: application/json' | grep '_system') ", port, port, port, port)
}
} else {
command = fmt.Sprintf("curl -s -X GET 'http://localhost:%d/v1/health/readiness' || curl -s -k -X GET 'https://localhost:%d/v1/health/readiness'", port, port)
Copy link
Contributor

@sarlaccpit sarlaccpit Aug 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there something in the readiness call payload that indicates whether or not pravega controller is running with auth on (I am not suggesting to send bad credentials to test out a 401 in return :) rather, just a field that literally says "auth: enabled" in the health readiness response). If so, then if auth was requested in the java options, we should find in the readiness a confirmation that auth is on. And the opposite if auth was not set or set to off in the java options.

If that's not something that is in the readiness payload today; I think it should be added, but there is nothing for this PR to leverage and line 109 is the best it can be at this time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sarlaccpit the /liveness and /readiness endpoints are configured to return only a boolean response. There is another /getDetails endpoint which is exposed, which can be configured to return additional details like the one you mentioned, but this endpoint can only be accessed by authenticated users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I think it's fine the way it is then. It makes sense the auth/no auth (and probably many other details) are protected behind the get details endpoint.

SrishT added 2 commits August 17, 2021 09:07
Signed-off-by: SrishT <[email protected]>
Signed-off-by: SrishT <[email protected]>
@anishakj anishakj merged commit a554d95 into master Aug 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change health(readiness) checks for Controller and SegmentStore
7 participants