-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(tf): add log metric and alert for detecting signup throttling #983
Conversation
Co-authored-by: Thomas Arrow <[email protected]>
Co-authored-by: Thomas Arrow <[email protected]>
duration = "0s" | ||
comparison = "COMPARISON_GT" | ||
aggregations { | ||
alignment_period = "300s" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alignment_period = "300s" | |
alignment_period = "88200s" # 24.5 hours |
Probably want to make this similar to the SQL alerting failure alignment period or this will auto resolve after 5mins (unless people keep trying to make accounts) and we'll probably ignore it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably want to make this similar to the SQL alerting failure
SQL alerting is where I copied the value from
alignment_period = "300s" |
I also think a shorter value might be better as it will bring us closer to telling us how many users are experiencing throttling, instead of telling us "it's broken right now, not sure how many people are affected".
That being said, I gotta admit I wonder if those alerts are the right primitive to use for such a thing to begin with. Is there anything in GCP monitoring that is designed for being notified of point-in-time events that cannot be "resolved" in an automatically measurable fashion? I took a brief look, but couldn't find anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we want is something that works in the way "Sentry" would work, which seems to be close to this https://cloud.google.com/error-reporting/docs/
However, it inverses control and the application would need to push the error to the service. Not sure right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks as if we already have Error Reporting enabled via Stack Driver, we just never use it for anything application level: https://console.cloud.google.com/errors?referrer=search&project=wikibase-cloud
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately it seems terraform doesn't support creating notifications from the Error Reporting service yet hashicorp/terraform-provider-google#12068
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SQL alerting is where I copied the value from
Ah, I was unclear. I was thinking about
duration = "88200s" |
I think we should try using this error reporting thing; I agree that log based metrics exceeding a count doesn't feel like the right concept for this type of alerting (if it happens even once we would like to know).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So how would we do this? Close this PR, update api
to report
and see what it brings us?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes sense; let's try to not totally lose these commits though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR in api
wbstack/api#620
Co-authored-by: Thomas Arrow <[email protected]>
Ticket: https://phabricator.wikimedia.org/T335963
This PR just adds the changes to the module, which will then need to be consumed in a second PR.