Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RUMM-1490 Migrate to Intake API v2 #562

Merged
merged 9 commits into from
Aug 23, 2021

Conversation

ncreated
Copy link
Member

@ncreated ncreated commented Aug 12, 2021

What and why?

🚚 This PR updates the SDK's upload layer for Intake API v2, by:

  • changing the path from /v1/input/<token> to /api/v2/<track-type> where track-type is rum, logs or spans;
  • adding new HTTP headers:
    • DD-API-KEY - the client token,
    • DD-EVP-ORIGIN - the origin sdk tag (e.g. ios or react-native),
    • DD-EVP-ORIGIN-VERSION: the SDK version,
    • DD-REQUEST-ID: a debug Intake request ID;
  • adding support to v2 HTTP response codes: 202, 400, 401, 403, 408, 413, 429, 500 and 503.

It also adds more verbosity to the console logs produced by DataUploadWorker, examples:

[DATADOG SDK] 🐶 → 10:13:05.933 [DEBUG]    → (rum) accepted, won't be retransmitted: [response code: 202 (accepted), request ID: 74293D75-FFEE-43A9-B328-70A71966AA69]
[DATADOG SDK] 🐶 → 10:15:17.138 [DEBUG]    → (logging) not delivered, will be retransmitted: [error: The Internet connection appears to be offline.]
[DATADOG SDK] 🐶 → 10:17:02.941 [DEBUG]    → (logging) accepted, won't be retransmitted: [response code: 403 (forbidden), request ID: 4DC9965B-357E-426D-B21C-4B61450C7634]

Last, it adds few improvements to the internal monitoring we do around the upload stack:

  • certain NSURLError codes are now ignored to not send false positives (e.g. when device goes offline it's not an error);
  • it handles new HTTP response codes to send errors on certain ones (e.g. 413 Payload Too Large).

How?

The upload architecture remains the same, but I made few changes to make testing easier with new HTTP status codes:

  • The "upload needs retry" consideration was moved from DataUploadWorker to the DataUploadStatus type. With that change, I was able to write simple tests for DataUploadStatus and only test two scenarios in DataUploadWorker: needs retry or does not.
  • As console log output highly depends on the HTTP status code (or network error), it is now part of the DataUploadStatus, which enables the same easy testing as for .needsRetry.

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration)
  • Make sure each commit and the PR mention the Issue number or JIRA reference

@ncreated ncreated self-assigned this Aug 12, 2021
Comment on lines +130 to +139
private let ignoredNSURLErrorCodes = Set([
NSURLErrorNetworkConnectionLost, // -1005
NSURLErrorTimedOut, // -1001
NSURLErrorCannotParseResponse, // - 1017
NSURLErrorNotConnectedToInternet, // -1009
NSURLErrorCannotFindHost, // -1003
NSURLErrorSecureConnectionFailed, // -1200
NSURLErrorDataNotAllowed, // -1020
NSURLErrorCannotConnectToHost, // -1004
])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this list by looking at Datadog iOS app telemetry. None of these codes looks like SDK fault:
Screenshot 2021-08-12 at 14 35 40

Comment on lines 113 to 119
case .badRequest, .payloadTooLarge, .requestTimeout:
// These codes mean that something wrong is happening either in the SDK or on the server - produce an error.
return (
message: "Data upload finished with status code: \(statusCode)",
error: nil,
attributes: ["dd_request_id": requestID ?? "(???)"]
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, I'm not sure if 408 Request Timeout is worth sending error for Internal Monitoring. On one side this could indicate a slow response from Intake (monitored by other means), but on the other the SDK might be lagging on uploads. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normally this case should be handled by URLSession, right? if the connection stays idle, it produces connection timed out error.
so i didn't fully understand what 408 means but i guess we can keep internal-monitor it anyway, if we can filter those errors out in Datadog UI.

maybe we can internal-monitor tooManyRequests too. as this shouldn't normally happen as we batch the requests, it can indicate a configuration problem in the client's side. wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can merge once this discussion is resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's distinction between client (URLSession) timeout and server timeout - 429 means the server.

maybe we can internal-monitor tooManyRequests too.

Good idea, I added .tooManyRequests to the list for alerting 👌.

.ddAPIKeyHeader(clientToken: configuration.clientToken),
.ddEVPOriginHeader(source: configuration.common.source),
.ddEVPOriginVersionHeader(),
.ddRequestIDHeader(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an open question from @0xnm (in DataDog/dd-sdk-android#671) if the DD-REQUEST-ID should be included all the time or only conditionally with a user setting.

@ncreated ncreated marked this pull request as ready for review August 12, 2021 12:55
@ncreated ncreated requested a review from a team as a code owner August 12, 2021 12:55
@ncreated ncreated force-pushed the ncreated/RUMM-1490-migrate-to-intake-api-v2 branch from f6491ec to c4cdb66 Compare August 12, 2021 16:49
Copy link
Contributor

@buranmert buranmert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i left a few comments, not blocking but i wanted to discuss before approving

Comment on lines 113 to 119
case .badRequest, .payloadTooLarge, .requestTimeout:
// These codes mean that something wrong is happening either in the SDK or on the server - produce an error.
return (
message: "Data upload finished with status code: \(statusCode)",
error: nil,
attributes: ["dd_request_id": requestID ?? "(???)"]
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normally this case should be handled by URLSession, right? if the connection stays idle, it produces connection timed out error.
so i didn't fully understand what 408 means but i guess we can keep internal-monitor it anyway, if we can filter those errors out in Datadog UI.

maybe we can internal-monitor tooManyRequests too. as this shouldn't normally happen as we batch the requests, it can indicate a configuration problem in the client's side. wdyt?

}

semaphore.signal()
}

_ = semaphore.wait(timeout: .distantFuture)

return uploadStatus ?? .unknown
return uploadStatus ?? DataUploader.unreachableUploadStatus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can see this case as completion not called error or something along those lines. DataUploader.unreachableUploadStatus isn't very intiutive at the first sight.

example, feel free to change the details:

private static let nonCalledCompletionStatus = DataUploadStatus(needsRetry: false, userDebugDescription: "An internal error occurred", userErrorMessage: nil, internalMonitoringError: "HTTPClient Completion not called")

i mean, at least the status value should try to explain the situation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ?? DataUploader.unreachableUploadStatus here is to only satisfy the compiler - the right side of ?? operator will never be evaluated as enforced by the previous logic in this function. That's why the name and comment:

/// An unreachable upload status - only meant to satisfy the compiler.
private static let unreachableUploadStatus = DataUploadStatus(...)

Alternative to this would be:

guard let uploadStatus = uploadStatus else {
   fatalError()
}
return uploadStatus

but as we avoid using fatalError() in the SDK codebase, I adopted the solution which is already used in other places: returning known default as fallback.

Hence I don't think we should improve the error value here - this would give a false sense of reachability / usability.

Sources/Datadog/Core/Feature.swift Outdated Show resolved Hide resolved
// MARK: - Datadog Headers

/// Datadog request authentication header.
static func ddAPIKeyHeader(clientToken: String) -> HTTPHeader {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might be confusing for the outsiders/newcomers as we have client tokens and API keys

what do you think of renaming this function to ddClientTokenHeader? (of course we will keep "DD-API-KEY" as value)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I used consistent convention for 3 new headers:

static func ddAPIKeyHeader(clientToken: String) -> HTTPHeader
static func ddEVPOriginHeader(source: String) -> HTTPHeader
static func ddEVPOriginVersionHeader() -> HTTPHeader

to make it clearly visible that we follow the spec for all features:

// in LoggingFeature, TracingFeature, RUMFeature and InternalMonitoringFeature:
requestBuilder: RequestBuilder(
   // ...
   headers: [
      // ...
      .ddAPIKeyHeader(clientToken: configuration.clientToken),
      .ddEVPOriginHeader(source: configuration.common.source),
      .ddEVPOriginVersionHeader(),
   ]
)

The spec says we must send DD-API-KEY, DD-EVP-ORIGIN and DD-EVP-ORIGIN-VERSION headers - and I shaped the above code to tell exactly that story.

IMO if we reverse it by naming the function ddClientTokenHeader(), the DD-API-KEY value used internally would be equally puzzling to the reader.

Copy link
Member

@xgouchet xgouchet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ncreated ncreated merged commit c1b594e into master Aug 23, 2021
@ncreated ncreated deleted the ncreated/RUMM-1490-migrate-to-intake-api-v2 branch August 23, 2021 09:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants