Add new module to convert cp stats into prometheus format suited for CMS monitoring #9940

Merged: 1 commit, merged Nov 30, 2022

vkuznet
Contributor

@vkuznet vkuznet commented Sep 25, 2020

Fixes #9939

Status

ready

Description

I provide a new module which can be integrated into the WMCore stack. It offers a set of functions to flatten CherryPy stats, build a static schema, and expose them in a format suitable for a Prometheus server.
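To make the flattening step concrete, here is a minimal sketch of how a nested stats dictionary (such as the one returned by cpstats.StatsPage().data()) could be flattened into single-level metric names. It is not the code from Utils/CPMetrics.py: the names normalizeKey and flattenStatsSketch, the key-normalization rule, and the choice to skip non-numeric leaves are assumptions made for illustration.

```python
import re


def normalizeKey(key):
    """Lower-case a stats key and collapse every run of characters that is
    not a letter or digit into one underscore (assumed naming rule)."""
    return re.sub(r"[^a-z0-9]+", "_", str(key).lower()).strip("_")


def flattenStatsSketch(stats, prefix=""):
    """Flatten a nested dict of stats into {flat_name: value}.

    Nested dicts are walked recursively and their keys are joined with "_".
    Only int/float leaves are kept, because Prometheus samples are numeric;
    booleans, strings, None and any other leaf types are skipped.
    """
    flat = {}
    for key, value in stats.items():
        name = normalizeKey(key)
        fullName = f"{prefix}_{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flattenStatsSketch(value, fullName))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[fullName] = value
    return flat
```

With a made-up input such as {"CherryPy HTTPServer": {"Accepts": 0, "Accepts/sec": 0.0, "Bind Address": "127.0.0.1:8080"}}, this returns {"cherrypy_httpserver_accepts": 0, "cherrypy_httpserver_accepts_sec": 0.0}; the string-valued address is dropped because it cannot be a Prometheus sample.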

I suggest adding a metrics endpoint to complement the stats one shown here:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/REST/Server.py#L700
using the following:

@expose
def metrics(self):
    name = self.__class__.__name__
    return prom_metrics(name)

This will allow Prometheus to scrape all CherryPy metrics from any DMWM/REST server. Please refer to the docstring of the provided module for a description and example of the format.
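For readers unfamiliar with the target format, the sketch below renders an already-flattened dictionary in the Prometheus text exposition format, with a # HELP and a # TYPE line before each sample, similar to the output shown later in this thread. The function name renderPrometheusSketch and the rule "integers are counters, floats are gauges" are assumptions; the merged module may classify metrics differently.

```python
def renderPrometheusSketch(flatStats, prefix):
    """Render {name: number} in the Prometheus text exposition format.

    Assumption for this sketch: int values are typed as counters and float
    values as gauges. Metrics are emitted in sorted order for stable output.
    """
    lines = []
    for name in sorted(flatStats):
        value = flatStats[name]
        metric = f"{prefix}_{name}"
        metricType = "counter" if isinstance(value, int) else "gauge"
        lines.append(f"# HELP {metric} CherryPy statistic {name}")
        lines.append(f"# TYPE {metric} {metricType}")
        lines.append(f"{metric} {value}")
    # the text format expects every line, including the last, to end with \n
    return "\n".join(lines) + "\n"
```

A /metrics handler would return this text, typically with the content type text/plain; version=0.0.4 that Prometheus expects for this format.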

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

External dependencies / deployment changes

This module uses the cherrypy library to obtain CherryPy metrics.

@vkuznet vkuznet requested a review from amaltaro September 25, 2020 15:00
Contributor

@amaltaro amaltaro left a comment

Not really a review of the code you proposed, but a couple of things caught my attention:

  • please use lower camelCase for the variable and function names (prom_metrics --> promMetrics)
  • given that this is going to be a common Utils module, please provide docstrings as complete as you can (there is no model defined yet, but it looks like we are adopting reStructuredText)
  • can you also please create unit test(s) for this module

Can you please take care of the comments above? Thanks
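As an illustration of the two style requests above, here is a hypothetical helper written with a lower camelCase name and a reStructuredText docstring using Sphinx-style :param:/:returns: fields. The function promMetricsSketch is made up for this example and is not the module's actual API.

```python
def promMetricsSketch(stats, serviceName):
    """
    Convert a flat dictionary of numeric stats into Prometheus sample lines.

    :param stats: flat dictionary mapping metric suffixes to numeric values
    :type stats: dict
    :param serviceName: prefix prepended to every metric name
    :type serviceName: str
    :returns: one ``<serviceName>_<key> <value>`` line per entry, sorted by key
    :rtype: str
    """
    return "\n".join(f"{serviceName}_{key} {value}"
                     for key, value in sorted(stats.items()))
```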

@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: succeeded
    • 10 comments to review
  • Pycodestyle check: succeeded
    • 12 comments to review
  • Python3 compatibility checks: failed
    • fails python3 compatibility test

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10458/artifact/artifacts/PullRequestReport.html

@vkuznet vkuznet force-pushed the cp_stats branch 2 times, most recently from 29fa5d7 to 3984c29 on September 25, 2020 16:17
@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: succeeded
    • 5 comments to review
  • Pycodestyle check: succeeded
    • 2 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10459/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Sep 25, 2020

Alan, I added the requested changes; please review at your convenience.

@vkuznet vkuznet requested a review from amaltaro September 25, 2020 16:30
@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
    • 2 tests added
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 7 comments to review
  • Pycodestyle check: succeeded
    • 9 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10460/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
    • 2 tests added
  • Pylint check: succeeded
    • 6 comments to review
  • Pycodestyle check: succeeded
    • 4 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10465/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Sep 28, 2020

@todor-ivanov please review this PR, which provides metrics. As I explained in the description, once this PR is in place we will need a new endpoint, similar to /stats, in the WMCore/REST code to serve these metrics (the code is shown in the description and will be required for monitoring).

Contributor

@todor-ivanov todor-ivanov left a comment

Valya, I made a few minor comments in the code. If you think they are too picky, just ignore them; we can definitely live with the code as it is.
All the rest looks good to me!

src/python/Utils/CPMetrics.py: 4 review comments (outdated, resolved)
@vkuznet
Contributor Author

vkuznet commented Sep 29, 2020

@todor-ivanov please review again, as I have addressed all the requested suggestions.

@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
    • 2 tests added
  • Pylint check: succeeded
    • 6 comments to review
  • Pycodestyle check: succeeded
    • 4 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10476/artifact/artifacts/PullRequestReport.html

Contributor

@todor-ivanov todor-ivanov left a comment

Hi Valya, I think it is looking good now. Just to clarify my own understanding: from your comment in the PR, I read that, as provided, the code is not referenced anywhere else. So we should make an explicit call to it from here: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/REST/Server.py#L700

@expose
def metrics(self):
    name = self.__class__.__name__
    return prom_metrics(name)

If I am correct, it would be nice to have it called in this same PR so that we avoid making too many PRs for the same functionality. @amaltaro, please take a look and tell me whether you would like to have the code merged with or without it being referenced in this PR.

@vkuznet
Contributor Author

vkuznet commented Sep 30, 2020

Todor,
it is fine from my side to add the REST Server changes to this PR, but I think either you or Alan should decide on it. Here are the proposed changes (in diff format):

diff --git a/src/python/WMCore/REST/Server.py b/src/python/WMCore/REST/Server.py
index b3a3b8a73..450040e49 100644
--- a/src/python/WMCore/REST/Server.py
+++ b/src/python/WMCore/REST/Server.py
@@ -15,6 +15,7 @@ from cherrypy.lib import cpstats
 from WMCore.REST.Error import *
 from WMCore.REST.Format import *
 from WMCore.REST.Validation import validate_no_more_input
+from Utils.CPMetrics import promMetrics

 try:
     from cherrypy.lib import httputil
@@ -360,6 +361,11 @@ class RESTFrontPage:
         "Return CherryPy stats dict about underlying service activities"
         return cpstats.StatsPage().data()

+    @expose
+    def metrics(self):
+        "Return CherryPy stats metrics in Prometheus format"
+        name = self.__class__.__name__
+        return promMetrics(name)


 ######################################################################

@amaltaro
Contributor

amaltaro commented Oct 1, 2020

Valentin, just to let you know that this PR is on my todo list for the coming days. I still think there is either something missing or a misunderstanding with this proposal, so I'll have to test a few things myself first.

@vkuznet
Contributor Author

vkuznet commented Oct 2, 2020

@amaltaro, I incorporated your changes from #9961 here. I removed the cherrypy dependency and TestMetricsServer from the CPMetrics.py code. However, it would be nice to keep TestMetricsServer somewhere users can easily see how to construct the required endpoint and test it. I created this file but did not include it in this PR. If you think it is relevant, let me know where to put it; otherwise we may just skip it.

#!/usr/bin/env python
"""
File       : cp_metrics.py
Author     : Valentin Kuznetsov <vkuznet AT gmail dot com>
Description: Example of Python REST Server to provide cherrypy stats as prometheus metrics
"""



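import os

import cherrypy
from cherrypy.lib import cpstats

# promMetrics comes from the module added in this PR
from Utils.CPMetrics import promMetrics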

class MetricsServer(object):
    """
    Example of metrics server which can serve cherrypy metrics
    in Prometheus format
    """
    def metrics(self):
        "metrics end-point for prometheus to scrape"
        data = cpstats.StatsPage().data()
        pdata = promMetrics(data, 'test')
        return pdata
    metrics.exposed = True

    def index(self):
        "default index end-point"
        return "Hello World"
    index.exposed = True


def TestMetricsServer():
    "Test MetricsServer function"
    cherrypy.config.update({'server.socket_port': 8080,
                            'server.thread_pool': 20,
                            'environment': 'production',
                            'log.screen': True,
                            'log.error_file': "crab.log"})
    conf = {'/': {'tools.staticdir.root': os.getcwd()}}
    cherrypy.quickstart(MetricsServer(), '/', config=conf)

if __name__ == '__main__':
    TestMetricsServer()

@vkuznet
Contributor Author

vkuznet commented Oct 2, 2020

@amaltaro, I found where stats and metrics can be plugged into the WebTools server. Looking at the DBS code, it inherits from the WebTools RESTModel, which itself inherits from WebAPI.py. The whole chain of dependencies is the following:

So I added stats and metrics to WebAPI and exposed them. This way, all DBS servers (which inherit from WebTools) will get these endpoints.
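The reason a single change in WebAPI reaches every DBS server is plain class inheritance: methods defined (and exposed) on a base class are available on every subclass. The sketch below models the chain with simplified stand-in classes; WebAPI and RESTModel mirror the names mentioned above, while the expose decorator, ExampleDBSModel and exposedMethods are hypothetical simplifications, not the WebTools code.

```python
def expose(func):
    """Hypothetical stand-in for an expose decorator: marks the method."""
    func.exposed = True
    return func


class WebAPI:
    """Simplified stand-in for the WebTools WebAPI base class."""

    @expose
    def stats(self):
        """Return a placeholder stats dictionary."""
        return {"service": self.__class__.__name__}

    @expose
    def metrics(self):
        """Return placeholder metrics text."""
        return f"# metrics for {self.__class__.__name__}\n"


class RESTModel(WebAPI):
    """Stand-in for RESTModel, which inherits from WebAPI."""


class ExampleDBSModel(RESTModel):
    """Hypothetical DBS server model built on RESTModel."""


def exposedMethods(obj):
    """Names of public attributes of obj that carry exposed=True."""
    return sorted(name for name in dir(obj)
                  if not name.startswith("_")
                  and getattr(getattr(obj, name), "exposed", False))
```

Neither RESTModel nor ExampleDBSModel defines stats or metrics, yet exposedMethods(ExampleDBSModel()) finds both, inherited from WebAPI.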

Please review.

@vkuznet vkuznet requested a review from todor-ivanov October 2, 2020 15:45
@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
    • 2 tests added
  • Pylint check: failed
    • 48 warnings and errors that must be fixed
    • 20 warnings
    • 92 comments to review
  • Pycodestyle check: succeeded
    • 9 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10491/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
    • 2 tests added
  • Pylint check: failed
    • 61 warnings and errors that must be fixed
    • 24 warnings
    • 114 comments to review
  • Pycodestyle check: succeeded
    • 37 comments to review
  • Python3 compatibility checks: succeeded
    • there are suggested fixes for newer python3 idioms

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10493/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Oct 2, 2020

I had a look at the pylint report, and all of the failures relate to the WebAPI.py codebase, which I didn't touch. I don't think it is my responsibility to go through them and fix them in this PR. Please let me know how you would like to treat it.

@vkuznet
Contributor Author

vkuznet commented Oct 2, 2020

I had a look at Alan's VM, and even though I can access the provided endpoints via HTTPS calls, they are not accessible from localhost over plain HTTP:

# HTTPS request works fine
https://alancc7-cloud1.cern.ch/reqmgr2/data/metrics

# plain HTTP request from localhost is not allowed and returns a 403 Forbidden page
http://localhost:8246/reqmgr2/data/metrics

Such strict constraints make the entire work pointless, since Prometheus does not care about the cmsweb x509 authentication scheme, and all metrics should be accessible through a plain localhost request. Therefore, unless I am missing something, or unless you tell me where to change the REST server code to allow HTTP requests from localhost, I don't see how this can be used, because the REST server requires cmsweb x509 authentication for all exposed methods.
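One possible way to address this, sketched under clear assumptions, is a policy check that lets loopback clients fetch /metrics without a certificate while every other path keeps requiring x509 authentication. The helpers isLoopback and metricsAllowed are hypothetical; in CherryPy the client address of the current request is available as cherrypy.request.remote.ip, but where such a check would hook into the WMCore REST authentication code is not shown here.

```python
import ipaddress


def isLoopback(remoteAddr):
    """True if remoteAddr is a loopback IP (127.0.0.0/8 or ::1)."""
    try:
        return ipaddress.ip_address(remoteAddr).is_loopback
    except ValueError:
        # empty or malformed address: treat it as a remote client
        return False


def metricsAllowed(path, remoteAddr, hasValidCert):
    """Hypothetical access policy.

    /metrics may be served to loopback clients without an x509 certificate;
    every other request (or any non-loopback client) still needs a valid cert.
    """
    if path.rstrip("/").endswith("/metrics") and isLoopback(remoteAddr):
        return True
    return hasValidCert
```

One caveat with this design: if a frontend proxy runs on the same host and forwards external requests to the backend, those requests also arrive from a loopback address, so a loopback exemption alone could expose /metrics to outside clients.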

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 2 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 52 warnings and errors that must be fixed
    • 25 warnings
    • 175 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 44 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13743/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 2 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 52 warnings and errors that must be fixed
    • 25 warnings
    • 174 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 38 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13744/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Nov 28, 2022

Alan, the pylint score is 10 for the new code and its unit test. All other reports concern old code, which I don't want to touch in this PR. The unit tests are unstable. Please review when you get a chance.

@vkuznet vkuznet requested a review from amaltaro November 28, 2022 18:34
Contributor

@amaltaro amaltaro left a comment

This looks good to me.
Before we get it merged, can you please deploy these changes to your test9 cluster and run a basic curl test of the new REST API?

@vkuznet
Contributor Author

vkuznet commented Nov 30, 2022

@amaltaro, after testing on test9 with curl I found that additional changes to support bytes I/O are required (the new Python CherryPy server should return bytes from its endpoints). So I added a separate commit, 3813d7e. Please review this commit and then I can squash it.
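A minimal sketch of the kind of bytes handling described here: the endpoint builds the metrics text as str and converts it to bytes before returning it. The names toBytes and metricsHandler are hypothetical, and this sketch does not reproduce WMCore's own conversion helper.

```python
def toBytes(value, encoding="utf-8"):
    """Return value as bytes: str is encoded, bytes are passed through."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode(encoding)
    raise TypeError(f"expected str or bytes, got {type(value).__name__}")


def metricsHandler(renderText):
    """Hypothetical endpoint body: build the metrics text, return it as bytes."""
    return toBytes(renderText())
```

Passing bytes through unchanged keeps the helper safe to call on output that has already been encoded further up the stack.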

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 2 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 56 warnings and errors that must be fixed
    • 25 warnings
    • 174 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 38 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13774/artifact/artifacts/PullRequestReport.html

Contributor

@amaltaro amaltaro left a comment

Valentin, please see my partial review on your latest changes.

src/python/Utils/CPMetrics.py (outdated, resolved)
src/python/WMCore/REST/Server.py (outdated, resolved)
@vkuznet
Contributor Author

vkuznet commented Nov 30, 2022

Alan, I made the necessary changes; please let me know.

@vkuznet
Contributor Author

vkuznet commented Nov 30, 2022

The commits are now squashed too.

@vkuznet vkuznet requested a review from amaltaro November 30, 2022 14:41
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 tests no longer failing
    • 2 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 52 warnings and errors that must be fixed
    • 25 warnings
    • 176 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 38 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13776/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests no longer failing
    • 2 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 52 warnings and errors that must be fixed
    • 25 warnings
    • 176 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 38 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13775/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor

The Jenkins CI report is your friend, please take advantage of it ;)

Unit tests added
    Utils_t.CPMetrics_t.CPMetricsTests:testFlattenStats was added with status error. Must be fixed
    Utils_t.CPMetrics_t.CPMetricsTests:testPromMetrics was added with status error. Must be fixed

Once you are done with the code changes, please run another test in your central services setup, testing reqmgr2 endpoints such as stats, metrics and info. Please let me know how that goes.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 2 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 52 warnings and errors that must be fixed
    • 25 warnings
    • 177 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 38 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13777/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Nov 30, 2022

Alan, the test is fixed (adding encodeUnicodetoBytes still requires a check) and the code is tested on test9, e.g.:

scurl -s https://cmsweb-test9.cern.ch/ms-output/data/metrics | head
# HELP ms-output_cherrypy_http_server_accepts
# TYPE ms-output_cherrypy_http_server_accepts counter
ms-output_cherrypy_http_server_accepts 0
# HELP ms-output_cherrypy_http_server_accepts_sec
# TYPE ms-output_cherrypy_http_server_accepts_sec gauge
ms-output_cherrypy_http_server_accepts_sec 0.0
# HELP ms-output_cherrypy_http_server_bytes_read
# TYPE ms-output_cherrypy_http_server_bytes_read counter
ms-output_cherrypy_http_server_bytes_read -1
# HELP ms-output_cherrypy_http_server_bytes_written
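One detail worth checking in this output: the service name ms-output is used verbatim as the metric prefix. The Prometheus data model requires metric names to match [a-zA-Z_:][a-zA-Z0-9_:]*, so the hyphen is not a valid character and a strict parser is likely to reject these lines. A minimal sketch of a sanitizing step follows; sanitizeMetricName is a hypothetical name used only for this example.

```python
import re

# Prometheus data model: metric names must match this pattern
VALID_METRIC_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
INVALID_CHARS = re.compile(r"[^a-zA-Z0-9_:]")


def sanitizeMetricName(name):
    """Replace each character not allowed in a Prometheus metric name with
    "_", and prepend "_" if the result would start with a digit."""
    clean = INVALID_CHARS.sub("_", name)
    if clean[:1].isdigit():
        clean = "_" + clean
    return clean
```

Applied to the first sample above, ms-output_cherrypy_http_server_accepts becomes ms_output_cherrypy_http_server_accepts, which matches the allowed pattern.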

Contributor

@amaltaro amaltaro left a comment

Looks good to me, thanks!

@amaltaro amaltaro merged commit efd8db5 into dmwm:master Nov 30, 2022
@mapellidario
Member

From a first quick peek, this is beautiful! We had the idea of doing something similar and will likely steal from this in the future :)
FYI: @novicecpp

@vkuznet
Contributor Author

vkuznet commented Nov 30, 2022

FYI @amaltaro, @mapellidario, @novicecpp: I opened a corresponding monitoring ticket, https://its.cern.ch/jira/browse/CMSMONIT-514, to request a CherryPy dashboard. Please follow up on the monitoring ticket and request a dashboard for your Python-based service.

Linked issue: Provide proper metrics for web based services
5 participants