feat(tf): add log metric and alert for detecting signup throttling #983
Closed · +39 −0

Commits (4):
de5b5b2 feat(tf): add log metric and alert for detecting signup throttling (m90)
9fd700b fix: reference name (m90)
5914363 fix: alert filter is looking at k8s logs instead of application ones (m90)
882f0c5 fix: metric needs to look at container level logs (m90)
@@ -0,0 +1,39 @@
resource "google_logging_metric" "api_signup_throttling_applied" { | ||
name = "${var.cluster_name}-api_signup_throttling_applied" | ||
description = "A log based metric that detects when api was throttling sign up attempts" | ||
|
||
# Filter for the secondary pod and look for readiness probe failures | ||
filter = "resource.labels.cluster_name=\"${var.cluster_name}\" AND resource.labels.container_name:\"api-\" AND severity=ERROR AND resource.type=\"k8s_container\" AND textPayload:\"WARN_SIGNUP_THROTTLED\"" | ||
|
||
metric_descriptor { | ||
metric_kind = "DELTA" | ||
value_type = "INT64" | ||
} | ||
} | ||
|
||
resource "google_monitoring_alert_policy" "alert_api_signup_throttling_applied" { | ||
display_name = "(${var.cluster_name}): Signup was throttled" | ||
combiner = "OR" | ||
notification_channels = [ | ||
"${var.monitoring_email_group_name}" | ||
] | ||
documentation { | ||
content = "Alert triggers when a user tried to create a wikibase.cloud account, but the configured limits were already exceeded." | ||
} | ||
conditions { | ||
display_name = "(${var.cluster_name}): Signup was throttled" | ||
condition_threshold { | ||
filter = "metric.type=\"logging.googleapis.com/user/${google_logging_metric.api_signup_throttling_applied.name}\" AND resource.type=\"k8s_container\"" | ||
duration = "0s" | ||
comparison = "COMPARISON_GT" | ||
aggregations { | ||
alignment_period = "300s" | ||
per_series_aligner = "ALIGN_SUM" | ||
cross_series_reducer = "REDUCE_SUM" | ||
} | ||
trigger { | ||
count = 1 | ||
} | ||
} | ||
} | ||
} |
Probably want to make this similar to the SQL alerting failure alignment period, or this will auto-resolve after 5 minutes (unless people keep trying to create accounts) and we'll probably ignore it.
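For reference, a minimal sketch of what that tuning could look like in the condition above; the 3600s window is only a placeholder, not the value the SQL alerts actually use:

# Sketch only: a wider alignment window keeps the incident open longer,
# so a burst of throttled signups does not auto-resolve after 5 minutes.
# The 3600s value is a placeholder, not the period used by the SQL alerts.
condition_threshold {
  filter     = "metric.type=\"logging.googleapis.com/user/${google_logging_metric.api_signup_throttling_applied.name}\" AND resource.type=\"k8s_container\""
  duration   = "0s"
  comparison = "COMPARISON_GT"
  aggregations {
    alignment_period     = "3600s"
    per_series_aligner   = "ALIGN_SUM"
    cross_series_reducer = "REDUCE_SUM"
  }
  trigger {
    count = 1
  }
}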
SQL alerting is where I copied the value from
wbaas-deploy/tf/modules/monitoring/replication-lag-alert.tf, line 31 in 0a5b6f6
I also think a shorter value might be better as it will bring us closer to telling us how many users are experiencing throttling, instead of telling us "it's broken right now, not sure how many people are affected".
That being said, I gotta admit I wonder if those alerts are the right primitive to use for such a thing to begin with. Is there anything in GCP monitoring that is designed for being notified of point-in-time events that cannot be "resolved" in an automatically measurable fashion? I took a brief look, but couldn't find anything.
I think what we want is something that works the way Sentry works, which seems close to this: https://cloud.google.com/error-reporting/docs/
However, it inverts control: the application would need to push the error to the service. Not sure right now.
It looks as if we already have Error Reporting enabled via Stackdriver; we just never use it for anything application-level: https://console.cloud.google.com/errors?referrer=search&project=wikibase-cloud
Unfortunately it seems Terraform doesn't support creating notifications from the Error Reporting service yet: hashicorp/terraform-provider-google#12068
Ah, I was unclear. I was thinking about wbaas-deploy/tf/modules/monitoring/sql-backup-failure-alert.tf, line 26 in 0a5b6f6.
I think we should try using this error reporting thing; I agree that log based metrics exceeding a count doesn't feel like the right concept for this type of alerting (if it happens even once we would like to know).
So how would we do this? Close this PR, update api to report and see what it brings us?
I think that makes sense; let's try to not totally lose these commits though
PR in api: wbstack/api#620