HT-3211: push job metrics to pushgateway #145
I'd suggest reviewing the two commits individually: the second one is a bit noisy, and parts of it may be hard to follow. We can discuss if needed.
This uses the pushgateway via a decorator for waypoint, which means individual jobs don't need to change. It includes a 'success interval' metric, which can be used to build a generic prometheus alert: we can alert on jobs where current time - job_last_success > job_expected_success_interval
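The alert condition described above can be sketched in Ruby as a simple staleness check. This only illustrates the comparison and is not the actual prometheus alert rule; the method name job_overdue? and its keyword arguments are hypothetical, and all times are assumed to be in seconds.

```ruby
# Hypothetical helper illustrating the alert condition; not part of this PR.
# now and last_success are Unix timestamps; expected_interval is in seconds.
def job_overdue?(now:, last_success:, expected_interval:)
  (now - last_success) > expected_interval
end

day = 24 * 60 * 60
now = Time.now.to_i

# A daily job whose last success was two days ago counts as overdue.
puts job_overdue?(now: now, last_success: now - 2 * day, expected_interval: day)
# A daily job that succeeded an hour ago does not.
puts job_overdue?(now: now, last_success: now - 3600, expected_interval: day)
```

Because the expected interval is reported per job, one rule of this shape could cover every job rather than needing a threshold per job.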
This fixes issues with the logger getting set to STDERR (thus polluting test output) as well as with unexpected external HTTP calls.
- Consistently wrap the code in bin scripts in a "main" method to avoid side effects when they are loaded in specs (in particular, to avoid changing anything in Services). This then leads to rubocop complaints about the contents of main, which are probably reasonable but not really caused by this commit, hence added to .rubocop_todo.yml.
- Avoid use of the leaky constant BATCH_SIZE.
- Avoid work that happened when autoscrub classes or specs were loaded rather than when a specific spec is run or a class is instantiated, in particular fetching the max OCN.
- Use webmock to stub logging calls to Slack and to the pushgateway.
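As a rough illustration of the "main" method pattern mentioned above, a bin script can keep all of its work inside a method and only call it when the file is executed directly. The script contents below are hypothetical (the argument handling and output are made up for the example); only the guard at the bottom is the point.

```ruby
# Hypothetical bin script; the names and behavior are illustrative only.
# Loading this file from a spec defines main without running anything.
def main(args = ARGV, out: $stdout)
  out.puts "processing #{args.length} file(s)"
  args.length
end

# True only when the file is run as a script, not when a spec loads it.
main if __FILE__ == $PROGRAM_NAME
```

A spec can then load the file and call main with its own arguments and a StringIO in place of $stdout, so nothing is written to the test output.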
This is ready for review now, including tracking the expected interval between completions. See the descriptions on the individual commits.
Actual changes are pretty limited. No objections.
Great. @mwarin if this looks good to you too (especially see the cleanup in autoscrub) I'll go ahead and merge, and (hopefully) we can see those metrics start showing up in prometheus. We'll need to set the expected completion interval via environment variable for each job (over in ht_tanka) before we start getting alerts. Down the road we can look at routing alerts for holdings jobs as necessary.
@mwarin I went ahead and merged this, but let us know if you have any feedback on the autoscrub stuff.
- Uses the pushgateway via waypoint. This is probably not the final shape we want, but it's a convenient entry point for now since both are concerned with tracking progress. A better idea might be a progress tracker object that delegates to waypoint but also reports to the pushgateway.
- No specs for the pushgateway integration yet.
- We probably want to label the job completion time with the expected job frequency.
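The "progress tracker object" idea from the first point could look roughly like the sketch below. This is not the code in this PR: FakeWaypoint and RecordingReporter are hypothetical stand-ins (the real waypoint API and the real pushgateway client may differ), and the job_records_processed metric name is made up for the example. Only Ruby's standard SimpleDelegator is used.

```ruby
require "delegate"

# Hypothetical stand-in for a waypoint-like object; the real API may differ.
class FakeWaypoint
  attr_reader :count

  def initialize
    @count = 0
  end

  def incr(n = 1)
    @count += n
  end
end

# Hypothetical reporter; a real one would push to the pushgateway over HTTP.
class RecordingReporter
  attr_reader :pushed

  def initialize
    @pushed = []
  end

  def push(metrics)
    @pushed << metrics
  end
end

# Forwards every call to the wrapped object and reports metrics on completion.
class ProgressTracker < SimpleDelegator
  def initialize(waypoint, reporter:, expected_interval:, clock: -> { Time.now.to_i })
    super(waypoint)
    @reporter = reporter
    @expected_interval = expected_interval
    @clock = clock
  end

  # Called by the job once it has finished successfully.
  def finish
    @reporter.push({
      job_last_success: @clock.call,
      job_expected_success_interval: @expected_interval,
      job_records_processed: __getobj__.count # hypothetical metric name
    })
  end
end

reporter = RecordingReporter.new
tracker = ProgressTracker.new(FakeWaypoint.new, reporter: reporter, expected_interval: 86_400)
tracker.incr(10) # delegated to the wrapped object
tracker.finish   # one set of metrics handed to the reporter
```

Injecting the reporter and the clock keeps the tracker testable without any HTTP stubbing, which may help with the missing specs noted above.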