-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RUM-2925 feat: Report non-fatal App Hangs as RUM errors #1685
RUM-2925 feat: Report non-fatal App Hangs as RUM errors #1685
Conversation
Datadog ReportBranch report: ✅ 0 Failed, 2752 Passed, 0 Skipped, 12m 1.64s Wall Time 🔻 Code Coverage Decreases vs Default Branch (4) |
3fa1377
to
a091990
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Integration-unit test.
continue // ignore likely false-positive | ||
} | ||
|
||
// TODO: RUM-2925 Capture stack trace of the pending App Hang |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be done in next PRs, but the actual stack trace generation will start from this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff! Reads really well, and it was quite interesting to go through tests. 💪
Left some minor comments 🤓
private var rumConfig = RUM.Configuration(applicationID: .mockAny()) | ||
|
||
override func setUp() { | ||
super.setUp() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super minor, but I've learnt that recently.
There's no need to call super in setUp/tearDown:
https://forums.developer.apple.com/forums/thread/113314
Apple tweaked the docs:
https://developer.apple.com/documentation/xctest/xctestcase/set_up_and_tear_down_state_in_your_tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, thanks a lot for bringing this up 🏅. I always wondered if this is needed or not and didn't know that Apple finally updated their docs to make it clear from their examples. Fixed ✅
dateProvider: dateProvider, | ||
telemetry: telemetry | ||
) | ||
watchdogThread.onHangEnded = { [weak self] appHang in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor, but it looks like onHangEnded
could be let
and part of the init.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We access self
(AppHangsObserver
) in this block and it doesn't yet exist until AppHangsWatchdogThread
's .init()
returns. There are ways to workaround it, but all come with trade-offs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super clean code and nice documentation 👏
Easy to read this crucial piece!
} | ||
|
||
override func main() { | ||
let mainThreadTask = self.mainThreadTask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the benefit of doing that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to avoid capturing [weak self]
as part of the frequently scheduled task on mainQueue
. There's no gain, just practical choice (semaphore is a class, so this is correct).
} | ||
|
||
internal final class AppHangsWatchdogThread: Thread { | ||
struct Constants { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor - we could use enum
to remove ability to init.
Also probably can be private / fileprivate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enum makes sense, fixed ✅. This constant is referenced from unit tests, keeping internal
.
// The hang has finished. | ||
let hangDuration = interval(from: waitStart, to: .now()) | ||
|
||
if hangDuration > 30 { // sanity-check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since WD termination is roughly 10s, Maybe lower to > 10s to gather more insights, so we can establish better baseline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it's "roughly" 10s we may cause false-negatives and actually lost some real hangs. I really want to see some false-positive examples in telemetry, so we can adjust it better, based on real data.
private func interval(from t1: DispatchTime, to t2: DispatchTime) -> TimeInterval { | ||
TimeInterval(t2.uptimeNanoseconds - t1.uptimeNanoseconds) / 1_000_000_000 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we could export this to some internal extension?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer adding convenience APIs after the time - if we see this pattern is repeated over and over, then it could be a moment for refactor and adding convenience.
@@ -75,10 +75,10 @@ attributes: | |||
custom_rules: | |||
todo_without_jira: # enforces that all TODO comments must be followed by JIRA reference | |||
name: "TODO without JIRA" | |||
regex: "(TODO|TO DO|FIX|FIXME|FIX ME|todo)(?!:? RUMM-[0-9]{2,})" # "TODO: RUMM-123", "TODO RUMM-123", "FIX RUMM-123", etc. | |||
regex: "(TODO|TO DO|FIX|FIXME|FIX ME|todo)(?!:? (RUMM|RUM)-[0-9]{2,})" # "TODO: RUM-123", "TODO RUM-123", "FIX RUM-123", etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch!
DatadogRUM/Sources/Instrumentation/AppHangs/AppHangsWatchdogThread.swift
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! very nice work with the custom Thread
!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor comments, overall looks great.
Thread.sleep(forTimeInterval: idleInterval) | ||
} | ||
|
||
let waitStart: DispatchTime = .now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps
let waitStart: DispatchTime = .now() | |
let waitStart: DispatchTime = self.dateProvider.now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is intentional. The dateProvider.now()
returns Date
, but we need DispatchTime
to later await on the semaphore. We don't need true Date
to measure the hang, CPU time is enough.
|
||
internal struct AppHang { | ||
/// The date of hang end. | ||
let date: Date |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let date: Date | |
let endTime: Date |
I kind of prefer having startTime and duration in general, endTime and duration seems odd from API point of view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// SDK date provider. | ||
private let dateProvider: DateProvider | ||
/// Telemetry interface. | ||
private let telemetry: Telemetry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious, having strong reference to Telemetry, can it cause reference cycle and in the end memory leak here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is implemented by CoreTelemetry
which weakifies the underlying core
reference.
|
||
// Enable configured instrumentations: | ||
self.viewControllerSwizzler?.swizzle() | ||
self.uiApplicationSwizzler?.swizzle() | ||
self.longTasks?.start() | ||
self.appHangs?.start() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enabled by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is TBD with the product, see this ADR: Decision #2 (internal). This PR is against a feature branch, the final API will be added before it lands on develop
.
What and why?
📦 First part of non-fatal App Hangs monitoring as RUM errors. App Hangs lasting more than 2s (TBD) will be reported as RUM errors.
Before this change, App Hangs were only tracked as RUM long tasks through
RUM.LongTaskObserver
, using the default threshold of 0.1s. Because long tasks are tracked on the main thread (usingCFRunLoopObserver
) it is impossible to capture their backtrace during the hang. Lack of stack traces makes it harder to troubleshoot, hence we implement additional layer of reporting hangs as RUM errors using secondary watchdog thread.Next PRs will add logic for capturing the stack trace of the hanged main thread. In this PR we only send RUM error
"App Hang"
when the hang ends.How?
The solution is using secondary thread (
com.datadoghq.app-hang-watchdog
, called "watchdog thread") to schedule periodic tasks on the main queue, await their completion and determine "hang" or "no hang" situation accordingly:threshold
, there is no hang;threshold
, SDK considers it a hang.By using secondary thread, the SDK will capture the stack trace of the main thread between
treshold
and "task completion" to associate it with RUM error. This work will be done in next PRs.To not cause excessive use of CPU, the "watchdog thread" implements "IDLE" periods, which prevent it from scheduling main queue tasks too often. It is defined as
Constants.tolerance
value in watchdog thread implementation. The trade-off is adding maring of error in hangs monitoring, so the whole solution requires tuning to avoid to many false-negatives:"IDLE" periods make the impact of watchdog thread negligible:
Review checklist
Custom CI job configuration (optional)
tools/