Document how to create DD monitor like New Relic's "alert if X is over Y for entire time period" #691

Closed
3 tasks
robrap opened this issue Jun 17, 2024 · 12 comments
Assignees

Comments

@robrap
Contributor

robrap commented Jun 17, 2024

A/C:

Proposal is to create a doc that shares what we know, and link to it from our how-to docs.

Datadog Support ticket asking for ideas, as well as to submit a feature request: https://help.datadoghq.com/hc/en-us/requests/1994936 (private link)

@robrap robrap added this to Arch-BOM Jun 17, 2024
@robrap robrap converted this from a draft issue Jun 17, 2024
@robrap
Contributor Author

robrap commented Jun 18, 2024

[inform] Copying a Slack request to @julianajlk:

Could you:

  • Add me to your support ticket, and
  • Have them convert it to a feature request (if you haven't already)?
    • We should note that what was super simple in NR has taken lots of discussion, with still no agreement on the best way to handle this situation.

@timmc-edx
Member

Ideas posted in Slack:

Troy:

maybe try smoothing the metric first by applying a rolling average window of 5 minutes, then calculate the min() of that over 1 hour?

if the threshold is crossed, then that means the rolling average 5-minute window was consistently crossing the threshold for the entire hour

only tricky thing about my approach is that the moving average functions only operate on # of intervals, so if you use a metric with a 10 second interval, 20 intervals (via ewma_20) would be about 3.3 minutes.
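
A rough way to check Troy's idea offline is to simulate it on a list of samples. The sketch below is plain Python, not a Datadog query. It uses a simple moving average as a stand-in for the smoothing step (Datadog's `ewma_20` is exponentially weighted, so real numbers would differ), and the function names and sample values are made up for illustration.

```python
from collections import deque

def rolling_mean(values, n):
    """Simple moving average over the last n points (fewer at the start)."""
    window = deque(maxlen=n)
    out = []
    for v in values:
        window.append(v)
        out.append(sum(window) / len(window))
    return out

def sustained_over(values, threshold, smooth_points):
    """True if the smoothed series stays above threshold for the whole window."""
    return min(rolling_mean(values, smooth_points)) > threshold

# 20 points at a 10-second interval cover 200 seconds, about 3.3 minutes.
smoothing_seconds = 20 * 10

brief_dip = [120, 130, 95, 125, 130, 128]   # one sample dips below 100
real_drop = [120, 130, 60, 50, 55, 125]     # several samples well below 100

print(sustained_over(brief_dip, 100, 3))  # True: smoothing absorbs the single dip
print(sustained_over(real_drop, 100, 3))  # False: the smoothed minimum falls below 100
```

The smoothing step is what lets a single noisy sample pass without resetting the "entire period" condition. Without it, `min()` on the raw series would fail on the first dip.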

Ray:

I think if you clampMax(), you can trigger if the average over that time is that max value.

Because average can't go above that threshold value, you ensure you hit that threshold for the entire period if average is at the threshold

Troy notes that this requires coordinating the same number in two places. (Might be OK for Terraform-managed metrics.)
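
Ray's reasoning can be illustrated with a small simulation. This is a sketch in plain Python rather than a Datadog query; `clamp_max` here is a local helper that mimics capping values, and the sample data is hypothetical. It uses integer samples so the equality comparison is exact.

```python
def clamp_max(values, cap):
    """Cap every value at `cap` (local stand-in for a clamp function)."""
    return [min(v, cap) for v in values]

def at_threshold_for_entire_window(values, threshold):
    """True only if every sample is at or above the threshold.

    After clamping, no sample exceeds the threshold, so the average can
    equal the threshold only when every sample reached it.
    """
    clamped = clamp_max(values, threshold)
    return sum(clamped) / len(clamped) >= threshold

print(at_threshold_for_entire_window([150, 101, 100, 300], 100))  # True
print(at_threshold_for_entire_window([150, 99, 300, 400], 100))   # False: one sample at 99
```

In this sketch the threshold is passed once and used for both the clamp and the comparison. In a monitor definition the same number would appear in both the clamp and the alert condition, which is the coordination cost Troy points out.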

Juliana got some advice from DD support:

In this case a rollup might be the best solution. Rollups should usually be avoided in monitor queries, because of the possibility of misalignment between the rollup interval and the evaluation window of the monitor. The start and end of rollup intervals are aligned to UNIX time, not to the start and end of monitor queries. Therefore, a monitor may evaluate (and trigger on) an incomplete rollup interval containing only a small sample of data. To avoid this issue, delay the evaluation of your monitor by (at least) the length of the setup rollup interval. See Rollups in monitors.
For your monitors to evaluate as expected, you need to add the following configurations:

  • 15-minute (900 seconds) avg rollup function
  • Evaluation window of 1 hour
  • Monitor evaluation delay of 900 seconds

For example, if it's currently 9:30 am, the monitor will evaluate from 8:15 to 9:15; it'll evaluate the average of four 15-minute segments. The average of those 4 segments needs to be below the threshold for the monitor to trigger.
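
The window arithmetic in that example can be reproduced with a short Python sketch. It only computes which rollup segments fall inside the delayed evaluation window, and it assumes the current time lands on a 15-minute boundary, as 9:30 does. The date and the function name are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone

ROLLUP = timedelta(seconds=900)   # 15-minute avg rollup
WINDOW = timedelta(hours=1)       # evaluation window
DELAY = timedelta(seconds=900)    # monitor evaluation delay

def evaluation_buckets(now):
    """Return (start, end) pairs for each rollup segment in the delayed window."""
    end = now - DELAY
    start = end - WINDOW
    buckets = []
    t = start
    while t < end:
        buckets.append((t, t + ROLLUP))
        t += ROLLUP
    return buckets

now = datetime(2024, 6, 18, 9, 30, tzinfo=timezone.utc)  # arbitrary date, 9:30 am
for b_start, b_end in evaluation_buckets(now):
    # prints 08:15 to 08:30, 08:30 to 08:45, 08:45 to 09:00, 09:00 to 09:15
    print(b_start.strftime("%H:%M"), "to", b_end.strftime("%H:%M"))
```

Because the delay equals the rollup length, the most recent segment in the window (9:00 to 9:15) has already closed by 9:30, so no partially filled segment is evaluated.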

@robrap robrap mentioned this issue Jun 24, 2024
7 tasks
@robrap
Contributor Author

robrap commented Jun 28, 2024

[inform] @julianajlk also initiated a feature request with DD on https://help.datadoghq.com/hc/en-us/requests/1754847 (which you probably won't be able to see).

@jristau1984 jristau1984 moved this to Backlog in Arch-BOM Jul 1, 2024
@timmc-edx
Member

Datadog Support notes that Metrics monitors support evaluation over an entire time period. So this capability may be present for some monitor types and not others.

@julianajlk
Member

Linking some Datadog metrics monitor docs here, specifically the "Definitions" chart, and highlighting the following:

"any single value" in max aggregation method vs. "all points in the evaluation window" in min method, specifically "For monitors that alert when below the threshold, the max and min behavior is reversed."
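
One way to read those definitions is as plain comparisons on the points in the evaluation window. The Python sketch below is an interpretation of the quoted text, not Datadog's implementation; the function names and data are hypothetical.

```python
def any_point_above(points, threshold):
    """'max' aggregation, alerting above: a single value over the threshold is enough."""
    return max(points) > threshold

def all_points_above(points, threshold):
    """'min' aggregation, alerting above: every point must be over the threshold."""
    return min(points) > threshold

def all_points_below(points, threshold):
    """Alerting below reverses the roles: 'max' now means every point is under."""
    return max(points) < threshold

window = [105, 130, 98, 140]
print(any_point_above(window, 100))   # True: 105, 130 and 140 exceed 100
print(all_points_above(window, 100))  # False: 98 does not exceed 100
print(all_points_below(window, 150))  # True: every point is under 150
```

Under this reading, the New Relic style "over Y for the entire time period" condition corresponds to the `min` aggregation for above-threshold monitors and the `max` aggregation for below-threshold monitors.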

@robrap robrap moved this from Backlog to Ready For Development in Arch-BOM Aug 21, 2024
@robrap
Contributor Author

robrap commented Aug 21, 2024

@timmc-edx:

  1. I'm moving this to "Ready for Development" to get it on our board, and I'm assigning it to you, because I think you have the most experience, and would be able to complete this the fastest.
  2. I adjusted the AC to have this more about moving what we know today to a doc, so we can close out this ticket.

@timmc-edx timmc-edx removed their assignment Sep 6, 2024
@timmc-edx timmc-edx self-assigned this Dec 12, 2024
@timmc-edx timmc-edx moved this from Ready For Development to In Progress in Arch-BOM Dec 12, 2024
@timmc-edx
Member

I created a wiki page with the results of my experiments: https://2u-internal.atlassian.net/wiki/spaces/ENG/pages/1581023295/Options+for+Datadog+time-period+APM+monitors

Unfortunately, none of the suggestions seem to work for APM monitors.

@timmc-edx timmc-edx moved this from In Progress to Blocked in Arch-BOM Dec 16, 2024
@timmc-edx
Member

Blocked until after winter break; revisit Jan 6 by opening a ticket with DD to ask for a workable option.

@robrap
Contributor Author

robrap commented Jan 14, 2025

We weren't able to find a solution for sustained alerts on APM data.

  • One idea is creating a metric from the APM data (https://app.datadoghq.com/apm/traces/generate-metrics) and then using the metric-based solution for sustained alerts.
  • Additionally, we should open a feature request around supporting sustained alert capabilities for APM data.

@timmc-edx timmc-edx moved this from Ready For Development to In Progress in Arch-BOM Jan 15, 2025
@timmc-edx
Member

I've filed a Datadog Support ticket asking for ideas, as well as to submit a feature request: https://help.datadoghq.com/hc/en-us/requests/1994936

@timmc-edx timmc-edx moved this from In Progress to Backlog in Arch-BOM Jan 16, 2025
@timmc-edx timmc-edx moved this from Backlog to Blocked in Arch-BOM Jan 16, 2025
@robrap robrap removed the status in Arch-BOM Jan 27, 2025
@robrap
Contributor Author

robrap commented Jan 27, 2025

Reminder: @timmc-edx will fold this ticket into the following issue(s), and pre-groom them so anyone (including Ray) could pick them up.

@timmc-edx: Since there will be a relationship between docs, experimentation, and closing all the tickets, I propose you consolidate down to a single ticket and close this one and one of the others as duplicates. Note: You can now "Close as duplicate" in GitHub.

@timmc-edx
Member

We've documented some options, but experimenting with them and communicating the results will now be part of #830 -- A/C have been copied over and expanded.

@github-project-automation github-project-automation bot moved this to Done in Arch-BOM Jan 28, 2025
@jristau1984 jristau1984 moved this from Done to Done - Long Term Storage in Arch-BOM Feb 12, 2025
Labels
None yet
Projects
Status: Done - Long Term Storage
Development

No branches or pull requests

4 participants