Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(tf): add log metric and alert for detecting signup throttling #983

Closed
wants to merge 4 commits into from

Conversation

m90
Copy link
Contributor

@m90 m90 commented Jul 10, 2023

Ticket: https://phabricator.wikimedia.org/T335963

This PR just adds the changes to the module, which will then need to be consumed in a second PR.

Co-authored-by: Thomas Arrow <[email protected]>
@m90 m90 requested a review from tarrow July 17, 2023 11:14
duration = "0s"
comparison = "COMPARISON_GT"
aggregations {
alignment_period = "300s"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
alignment_period = "300s"
alignment_period = "88200s" # 24.5 hours

Probably want to make this similar to the SQL alerting failure alignment period or this will auto resolve after 5mins (unless people keep trying to make accounts) and we'll probably ignore it

Copy link
Contributor Author

@m90 m90 Jul 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably want to make this similar to the SQL alerting failure

SQL alerting is where I copied the value from

I also think a shorter value might be better as it will bring us closer to telling us how many users are experiencing throttling, instead of telling us "it's broken right now, not sure how many people are affected".

That being said, I gotta admit I wonder if those alerts are the right primitive to use for such a thing to begin with. Is there anything in GCP monitoring that is designed for being notified of point-in-time events that cannot be "resolved" in an automatically measurable fashion? I took a brief look, but couldn't find anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we want is something that works in the way "Sentry" would work, which seems to be close to this https://cloud.google.com/error-reporting/docs/

However, it inverses control and the application would need to push the error to the service. Not sure right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks as if we already have Error Reporting enabled via Stack Driver, we just never use it for anything application level: https://console.cloud.google.com/errors?referrer=search&project=wikibase-cloud

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately it seems terraform doesn't support creating notifications from the Error Reporting service yet hashicorp/terraform-provider-google#12068

Copy link
Contributor

@tarrow tarrow Jul 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQL alerting is where I copied the value from

Ah, I was unclear. I was thinking about

which is another (and functionally different) sql alert.

I think we should try using this error reporting thing; I agree that log based metrics exceeding a count doesn't feel like the right concept for this type of alerting (if it happens even once we would like to know).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So how would we do this? Close this PR, update api to report and see what it brings us?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes sense; let's try to not totally lose these commits though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR in api wbstack/api#620

@m90 m90 closed this Jul 20, 2023
@m90 m90 deleted the fr/signup-throttle branch August 15, 2024 07:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants