Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(tf): add log metric and alert for detecting signup throttling #983
feat(tf): add log metric and alert for detecting signup throttling #983
Changes from 1 commit
de5b5b2
9fd700b
5914363
882f0c5
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably want to make this similar to the SQL alerting failure alignment period or this will auto resolve after 5mins (unless people keep trying to make accounts) and we'll probably ignore it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SQL alerting is where I copied the value from
wbaas-deploy/tf/modules/monitoring/replication-lag-alert.tf
Line 31 in 0a5b6f6
I also think a shorter value might be better as it will bring us closer to telling us how many users are experiencing throttling, instead of telling us "it's broken right now, not sure how many people are affected".
That being said, I gotta admit I wonder if those alerts are the right primitive to use for such a thing to begin with. Is there anything in GCP monitoring that is designed for being notified of point-in-time events that cannot be "resolved" in an automatically measurable fashion? I took a brief look, but couldn't find anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we want is something that works in the way "Sentry" would work, which seems to be close to this https://cloud.google.com/error-reporting/docs/
However, it inverses control and the application would need to push the error to the service. Not sure right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks as if we already have Error Reporting enabled via Stack Driver, we just never use it for anything application level: https://console.cloud.google.com/errors?referrer=search&project=wikibase-cloud
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately it seems terraform doesn't support creating notifications from the Error Reporting service yet hashicorp/terraform-provider-google#12068
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I was unclear. I was thinking about
wbaas-deploy/tf/modules/monitoring/sql-backup-failure-alert.tf
Line 26 in 0a5b6f6
I think we should try using this error reporting thing; I agree that log based metrics exceeding a count doesn't feel like the right concept for this type of alerting (if it happens even once we would like to know).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So how would we do this? Close this PR, update
api
toreport
and see what it brings us?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes sense; let's try to not totally lose these commits though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR in
api
wbstack/api#620