Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More metrics for the EL monitor #4707

Merged
merged 1 commit into from
Mar 9, 2023
Merged

Conversation

zah
Copy link
Contributor

@zah zah commented Mar 9, 2023

  • engine_api_response_time provides a histogram for the Engine API response times for each unique pair ot URL and request type.

  • All engine API requests are now tracked

Other changes:

The client will no longer exit on start-up if it fails to connect to a properly configured EL node.

* `engine_api_response_time` provides a histogram for the Engine API
  response times for each unique pair ot URL and request type.

* All engine API requests are now tracked

Other changes:

The client will no longer exit on start-up if it fails to connect to
a properly configured EL node.
@zah zah enabled auto-merge (squash) March 9, 2023 14:35
@jakubgs
Copy link
Member

jakubgs commented Mar 9, 2023

Looking good:

[email protected]:~ % c http://10.14.0.70:9202/metrics | grep engine_api_response_time | grep newPayload
engine_api_response_time_sum{url="http://localhost:8554",request="newPayload"} 6.565
engine_api_response_time_count{url="http://localhost:8554",request="newPayload"} 65.0
engine_api_response_time_created{url="http://localhost:8554",request="newPayload"} 1678377338.0
engine_api_response_time_bucket{url="http://localhost:8554",request="newPayload",le="0.1"} 36.0
engine_api_response_time_bucket{url="http://localhost:8554",request="newPayload",le="0.25"} 65.0
engine_api_response_time_bucket{url="http://localhost:8554",request="newPayload",le="0.5"} 65.0
engine_api_response_time_bucket{url="http://localhost:8554",request="newPayload",le="1.0"} 65.0
engine_api_response_time_bucket{url="http://localhost:8554",request="newPayload",le="2.0"} 65.0
engine_api_response_time_bucket{url="http://localhost:8554",request="newPayload",le="5.0"} 65.0
engine_api_response_time_bucket{url="http://localhost:8554",request="newPayload",le="+Inf"} 65.0

@jakubgs
Copy link
Member

jakubgs commented Mar 9, 2023

Seems to work, but I'm not getting values on the vertical scale.Weird.

image

@@ -259,6 +259,11 @@ declareCounter engine_api_responses,
"Number of successful requests to the newPayload Engine API end-point",
labels = ["url", "request", "status"]

declareHistogram engine_api_response_time,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more conventional naming for histograms is duration, not time. Also the name should include units used.
So maybe more like engine_api_request_duration_seconds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...should have a suffix describing the unit, in plural form. Note that an accumulating count has total as a suffix, in addition to the unit if applicable.

  • http_request_duration_seconds
  • node_memory_usage_bytes

https://prometheus.io/docs/practices/naming/#metric-names

@@ -259,6 +259,11 @@ declareCounter engine_api_responses,
"Number of successful requests to the newPayload Engine API end-point",
labels = ["url", "request", "status"]

declareHistogram engine_api_response_time,
"Time(s) used to generate signature usign remote signer",
buckets = [0.1, 0.25, 0.5, 1.0, 2.0, 5.0],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we maybe have a bit more fine grained buckets? Here's an example from Cortex:

[email protected]:~ % c 0:9092/metrics | grep cortex_request_duration_seconds_bucket | grep ingester_ring
cortex_request_duration_seconds_bucket{method="GET",route="ingester_ring",status_code="200",ws="false",le="0.005"} 2
cortex_request_duration_seconds_bucket{method="GET",route="ingester_ring",status_code="200",ws="false",le="0.01"} 2
cortex_request_duration_seconds_bucket{method="GET",route="ingester_ring",status_code="200",ws="false",le="0.025"} 2
cortex_request_duration_seconds_bucket{method="GET",route="ingester_ring",status_code="200",ws="false",le="0.05"} 2
cortex_request_duration_seconds_bucket{method="GET",route="ingester_ring",status_code="200",ws="false",le="0.1"} 2
cortex_request_duration_seconds_bucket{method="GET",route="ingester_ring",status_code="200",ws="false",le="0.25"} 2
cortex_request_duration_seconds_bucket{method="GET",route="ingester_ring",status_code="200",ws="false",le="0.5"} 2
cortex_request_duration_seconds_bucket{method="GET",route="ingester_ring",status_code="200",ws="false",le="1"} 2
cortex_request_duration_seconds_bucket{method="GET",route="ingester_ring",status_code="200",ws="false",le="2.5"} 2
cortex_request_duration_seconds_bucket{method="GET",route="ingester_ring",status_code="200",ws="false",le="5"} 2
cortex_request_duration_seconds_bucket{method="GET",route="ingester_ring",status_code="200",ws="false",le="10"} 2
cortex_request_duration_seconds_bucket{method="GET",route="ingester_ring",status_code="200",ws="false",le="25"} 2
cortex_request_duration_seconds_bucket{method="GET",route="ingester_ring",status_code="200",ws="false",le="50"} 2
cortex_request_duration_seconds_bucket{method="GET",route="ingester_ring",status_code="200",ws="false",le="100"} 2
cortex_request_duration_seconds_bucket{method="GET",route="ingester_ring",status_code="200",ws="false",le="+Inf"} 2

@jakubgs
Copy link
Member

jakubgs commented Mar 9, 2023

Ok, I see what was wrong. The Legend has to show only the le:

image

And now it looks correct:

image

But yeah, I'd appreciate some more granular buckets.

@zah zah disabled auto-merge March 9, 2023 17:28
@zah zah merged commit ef20e83 into unstable Mar 9, 2023
@zah zah deleted the el-manager-tweaks-2023-03-09 branch March 9, 2023 17:29
@zah
Copy link
Contributor Author

zah commented Mar 9, 2023

The suggested changes have been pushed to unstable in e808fda. The new metric name is engine_api_request_duration_seconds

@jakubgs
Copy link
Member

jakubgs commented Mar 9, 2023

Broken tho:

/data/beacon-node-prater-unstable/repo/beacon_chain/eth1/eth1_monitor.nim(304, 7) Error: undeclared identifier: 'engine_api_response_time'
candidates (edit distance, scope distance); see '--spellSuggest':
 (4, 6): 'engine_api_responses' [var declared in /data/beacon-node-prater-unstable/repo/beacon_chain/eth1/eth1_monitor.nim(258, 16)]

@jakubgs
Copy link
Member

jakubgs commented Mar 9, 2023

Pretty cool:

linux-04.he-eu-hel1.nimbus.prater

image

linux-05.he-eu-hel1.nimbus.prater

image

linux-06.he-eu-hel1.nimbus.prater

image


Here we can clearly see that the more validators you have the more strained the EL node is.

@zah
Copy link
Contributor Author

zah commented Mar 10, 2023

That's expected. The earlier version of the code didn't track all requests.

etan-status added a commit that referenced this pull request Feb 6, 2024
Since #4465, compilation with `-d:has_deposit_root_checks` fails. #4707
further built on top of it but the additions also don't compile. Fix it.
etan-status added a commit that referenced this pull request Feb 6, 2024
Since #4465, compilation with `-d:has_deposit_root_checks` fails. #4707
further built on top of it but the additions also don't compile. Fix it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants