Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(tf): add log metric and alert for detecting signup throttling #983

Closed
wants to merge 4 commits into from
Closed
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
39 changes: 39 additions & 0 deletions tf/modules/monitoring/signup-throttling-alert.tf
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
resource "google_logging_metric" "api_signup_throttling_applied" {
name = "${var.cluster_name}-api_signup_throttling_applied"
description = "A log based metric that detects when api was throttling sign up attempts"

# Filter for the secondary pod and look for readiness probe failures
filter = "resource.labels.cluster_name=\"${var.cluster_name}\" AND resource.labels.pod_name:\"api-app-\" AND resource.type=\"k8s_pod\" AND severity=WARNING AND jsonPayload.message:\"WARN_SIGNUP_THROTTLED\""
m90 marked this conversation as resolved.
Show resolved Hide resolved

metric_descriptor {
metric_kind = "DELTA"
value_type = "INT64"
}
}

resource "google_monitoring_alert_policy" "alert_api_signup_throttling_applied" {
display_name = "(${var.cluster_name}): Signup was throttled"
combiner = "OR"
notification_channels = [
"${var.monitoring_email_group_name}"
]
documentation {
content = "Alert triggers when a user tried to create a wikibase.cloud account, but the configured limits were already exceeded."
}
conditions {
display_name = "(${var.cluster_name}): Signup was throttled"
condition_threshold {
filter = "metric.type=\"logging.googleapis.com/user/${google_logging_metric.api_signup_throttling_applied}\" AND resource.type=\"k8s_pod\""
m90 marked this conversation as resolved.
Show resolved Hide resolved
duration = "0s"
comparison = "COMPARISON_GT"
aggregations {
alignment_period = "300s"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
alignment_period = "300s"
alignment_period = "88200s" # 24.5 hours

Probably want to make this similar to the SQL alerting failure alignment period or this will auto resolve after 5mins (unless people keep trying to make accounts) and we'll probably ignore it

Copy link
Contributor Author

@m90 m90 Jul 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably want to make this similar to the SQL alerting failure

SQL alerting is where I copied the value from

I also think a shorter value might be better as it will bring us closer to telling us how many users are experiencing throttling, instead of telling us "it's broken right now, not sure how many people are affected".

That being said, I gotta admit I wonder if those alerts are the right primitive to use for such a thing to begin with. Is there anything in GCP monitoring that is designed for being notified of point-in-time events that cannot be "resolved" in an automatically measurable fashion? I took a brief look, but couldn't find anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we want is something that works in the way "Sentry" would work, which seems to be close to this https://cloud.google.com/error-reporting/docs/

However, it inverses control and the application would need to push the error to the service. Not sure right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks as if we already have Error Reporting enabled via Stack Driver, we just never use it for anything application level: https://console.cloud.google.com/errors?referrer=search&project=wikibase-cloud

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately it seems terraform doesn't support creating notifications from the Error Reporting service yet hashicorp/terraform-provider-google#12068

Copy link
Contributor

@tarrow tarrow Jul 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQL alerting is where I copied the value from

Ah, I was unclear. I was thinking about

which is another (and functionally different) sql alert.

I think we should try using this error reporting thing; I agree that log based metrics exceeding a count doesn't feel like the right concept for this type of alerting (if it happens even once we would like to know).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So how would we do this? Close this PR, update api to report and see what it brings us?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes sense; let's try to not totally lose these commits though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR in api wbstack/api#620

per_series_aligner = "ALIGN_SUM"
cross_series_reducer = "REDUCE_SUM"
}
trigger {
count = 1
}
}
}
}