Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RUM-2925 feat: Report non-fatal App Hangs as RUM errors #1685

Merged

Conversation

ncreated
Copy link
Member

@ncreated ncreated commented Feb 19, 2024

What and why?

📦 First part of non-fatal App Hangs monitoring as RUM errors. App Hangs lasting more than 2s (TBD) will be reported as RUM errors.

Before this change, App Hangs were only tracked as RUM long tasks through RUM.LongTaskObserver, using the default threshold of 0.1s. Because long tasks are tracked on the main thread (using CFRunLoopObserver) it is impossible to capture their backtrace during the hang. Lack of stack traces makes it harder to troubleshoot, hence we implement additional layer of reporting hangs as RUM errors using secondary watchdog thread.

Next PRs will add logic for capturing the stack trace of the hanged main thread. In this PR we only send RUM error "App Hang" when the hang ends.

How?

The solution is using secondary thread (com.datadoghq.app-hang-watchdog, called "watchdog thread") to schedule periodic tasks on the main queue, await their completion and determine "hang" or "no hang" situation accordingly:

Screenshot 2024-02-19 at 09 23 33
  • at the beginning of watchdog thread loop, main thread task is scheduled;
  • if task is executed under hang threshold, there is no hang;
  • if task completion arrives after threshold, SDK considers it a hang.

By using secondary thread, the SDK will capture the stack trace of the main thread between treshold and "task completion" to associate it with RUM error. This work will be done in next PRs.

To not cause excessive use of CPU, the "watchdog thread" implements "IDLE" periods, which prevent it from scheduling main queue tasks too often. It is defined as Constants.tolerance value in watchdog thread implementation. The trade-off is adding maring of error in hangs monitoring, so the whole solution requires tuning to avoid to many false-negatives:

Screenshot 2024-02-19 at 09 32 10

"IDLE" periods make the impact of watchdog thread negligible:

Screenshot 2024-02-19 at 09 28 26

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration)
  • Make sure each commit and the PR mention the Issue number or JIRA reference
  • Add CHANGELOG entry for user facing changes

Custom CI job configuration (optional)

  • Run unit tests for Core, RUM, Trace, Logs, CR and WVT
  • Run unit tests for Session Replay
  • Run integration tests
  • Run smoke tests
  • Run tests for tools/

@ncreated ncreated self-assigned this Feb 19, 2024
@datadog-datadog-prod-us1
Copy link

datadog-datadog-prod-us1 bot commented Feb 19, 2024

Datadog Report

Branch report: ncreated/RUM-2925/report-non-fatal-app-hangs-in-RUM
Commit report: 97f4286
Test service: dd-sdk-ios

✅ 0 Failed, 2752 Passed, 0 Skipped, 12m 1.64s Wall Time
🔻 Test Sessions change in coverage: 4 decreased, 4 increased, 6 no change

🔻 Code Coverage Decreases vs Default Branch (4)

  • test DatadogRUMTests iOS 79.45% (-0.09%) - Details
  • test DatadogRUMTests tvOS 79.60% (-0.05%) - Details
  • test DatadogInternalTests tvOS 81.24% (-0.03%) - Details
  • test DatadogLogsTests iOS 44.51% (-0.01%) - Details

@ncreated ncreated force-pushed the ncreated/RUM-2925/report-non-fatal-app-hangs-in-RUM branch from 3fa1377 to a091990 Compare February 19, 2024 08:12
@ncreated ncreated changed the title RUM-2925 feat: Report non-fatal App Hangs in RUM RUM-2925 feat: Report non-fatal App Hangs as RUM errors Feb 19, 2024
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integration-unit test.

continue // ignore likely false-positive
}

// TODO: RUM-2925 Capture stack trace of the pending App Hang
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be done in next PRs, but the actual stack trace generation will start from this line.

@ncreated ncreated marked this pull request as ready for review February 19, 2024 08:35
@ncreated ncreated requested review from a team as code owners February 19, 2024 08:35
Copy link
Member

@maciejburda maciejburda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff! Reads really well, and it was quite interesting to go through tests. 💪

Left some minor comments 🤓

private var rumConfig = RUM.Configuration(applicationID: .mockAny())

override func setUp() {
super.setUp()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super minor, but I've learnt that recently.

There's no need to call super in setUp/tearDown:
https://forums.developer.apple.com/forums/thread/113314

Apple tweaked the docs:
https://developer.apple.com/documentation/xctest/xctestcase/set_up_and_tear_down_state_in_your_tests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, thanks a lot for bringing this up 🏅. I always wondered if this is needed or not and didn't know that Apple finally updated their docs to make it clear from their examples. Fixed ✅

dateProvider: dateProvider,
telemetry: telemetry
)
watchdogThread.onHangEnded = { [weak self] appHang in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor, but it looks like onHangEnded could be let and part of the init.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We access self (AppHangsObserver) in this block and it doesn't yet exist until AppHangsWatchdogThread's .init() returns. There are ways to workaround it, but all come with trade-offs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super clean code and nice documentation 👏

Easy to read this crucial piece!

}

override func main() {
let mainThreadTask = self.mainThreadTask
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the benefit of doing that here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to avoid capturing [weak self] as part of the frequently scheduled task on mainQueue. There's no gain, just practical choice (semaphore is a class, so this is correct).

}

internal final class AppHangsWatchdogThread: Thread {
struct Constants {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor - we could use enum to remove ability to init.

Also probably can be private / fileprivate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enum makes sense, fixed ✅. This constant is referenced from unit tests, keeping internal.

// The hang has finished.
let hangDuration = interval(from: waitStart, to: .now())

if hangDuration > 30 { // sanity-check
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since WD termination is roughly 10s, Maybe lower to > 10s to gather more insights, so we can establish better baseline?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's "roughly" 10s we may cause false-negatives and actually lost some real hangs. I really want to see some false-positive examples in telemetry, so we can adjust it better, based on real data.

Comment on lines +128 to +130
private func interval(from t1: DispatchTime, to t2: DispatchTime) -> TimeInterval {
TimeInterval(t2.uptimeNanoseconds - t1.uptimeNanoseconds) / 1_000_000_000
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we could export this to some internal extension?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer adding convenience APIs after the time - if we see this pattern is repeated over and over, then it could be a moment for refactor and adding convenience.

@@ -75,10 +75,10 @@ attributes:
custom_rules:
todo_without_jira: # enforces that all TODO comments must be followed by JIRA reference
name: "TODO without JIRA"
regex: "(TODO|TO DO|FIX|FIXME|FIX ME|todo)(?!:? RUMM-[0-9]{2,})" # "TODO: RUMM-123", "TODO RUMM-123", "FIX RUMM-123", etc.
regex: "(TODO|TO DO|FIX|FIXME|FIX ME|todo)(?!:? (RUMM|RUM)-[0-9]{2,})" # "TODO: RUM-123", "TODO RUM-123", "FIX RUM-123", etc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch!

Copy link
Member

@maxep maxep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! very nice work with the custom Thread!!

Copy link
Contributor

@ganeshnj ganeshnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments, overall looks great.

Thread.sleep(forTimeInterval: idleInterval)
}

let waitStart: DispatchTime = .now()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps

Suggested change
let waitStart: DispatchTime = .now()
let waitStart: DispatchTime = self.dateProvider.now()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is intentional. The dateProvider.now() returns Date, but we need DispatchTime to later await on the semaphore. We don't need true Date to measure the hang, CPU time is enough.


internal struct AppHang {
/// The date of hang end.
let date: Date
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let date: Date
let endTime: Date

I kind of prefer having startTime and duration in general, endTime and duration seems odd from API point of view.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not blocking, I'd keep it as it is. Hangs are reported as RUM Errors, so they have no duration on view's timeline and we want to have them reported at the time of hang end.

Hang error will follow the Long Task that visualizes the duration:
Screenshot 2024-02-23 at 12 33 59

/// SDK date provider.
private let dateProvider: DateProvider
/// Telemetry interface.
private let telemetry: Telemetry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious, having strong reference to Telemetry, can it cause reference cycle and in the end memory leak here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is implemented by CoreTelemetry which weakifies the underlying core reference.


// Enable configured instrumentations:
self.viewControllerSwizzler?.swizzle()
self.uiApplicationSwizzler?.swizzle()
self.longTasks?.start()
self.appHangs?.start()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enabled by default?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is TBD with the product, see this ADR: Decision #2 (internal). This PR is against a feature branch, the final API will be added before it lands on develop.

@ncreated ncreated merged commit c85902b into app-hangs Feb 23, 2024
8 checks passed
@ncreated ncreated deleted the ncreated/RUM-2925/report-non-fatal-app-hangs-in-RUM branch February 23, 2024 14:05
@maciejburda maciejburda mentioned this pull request Mar 19, 2024
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants