Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(DataStore): retry initial sync network failures from RemoteSyncEngine #1773

Merged
merged 4 commits into from
May 8, 2022

Conversation

lawmicha
Copy link
Contributor

@lawmicha lawmicha commented May 3, 2022

Description:

This PR fixes RemoteSyncEngine to accurately determine whether it should retry based on the URLError. Previously, when errors were handled, they were ignored and a hardcoded URLError.notConnected is passed to the retryRequestAdvice method to get the advice to retry. This is a bug when the error is a non-retryable error, since the RemoteSyncEngine's retry attempt will immediately fall into the same errored state. For example, data decoding issues which should be non-retryable but the bug caused the sync engine to restart a number of times up to the limit determine by the algorithm in RequestRetryablePolicy.

Changes made

  • RemoteSyncEngine's error handling path to stop using the hardcoded URLError and start checking against a real URLError, retrieved from the underlying error. All components, specifically the sync and subscription components, are responsible for populating the underlying error that it receives from Amplify.API.
  • AWSInitialSyncOrchestrator: If any of the syncQuery requests fail with a network error, pick the first one to be the underlying error. This is returned as a DataStoreError.sync error with the URLError as the underlying error.

Testing Initial Sync - AWSInitialSyncOrchestrator

An end-to-end flow is when the component propagates the URLError back up to RemoteSyncEngine and RemoteSyncEngine accurately determines that it should retry.

  1. network on, DataStore.start(), then breakpoint into InitialSyncOperation's query(lastSyncTime:, nextToken:)
  2. network off, See that API returns a failure and AWSInitialSyncOrchestrator pulls it out and into the DataStoreError.sync error.

Testing data decoding issue

  1. A schema with a model containing enum cases like
type Todo @model {
  id: ID!
  name: String!
  description: String
  status: Status!
}

enum Status {
  notStarted
  inProgress
  completed
  blocked
}
  1. save some data, and then update the Status enum to remove one of the cases
public enum Status: String, EnumPersistable {
  case notStarted
  case inProgress
  case completed
  //case blocked
}
  1. DataStore.start() should sync the data and receive the error
dataCorrupted(Swift.DecodingError.Context(codingPath: [CodingKeys(stringValue: "items", intValue: nil), _JSONKey(stringValue: "Index 0", intValue: 0), CodingKeys(stringValue: "status", intValue: nil)], debugDescription: "Cannot initialize Status from invalid String value blocked", underlyingError: nil))
  1. Observe that no retries are made.

Subscriptions

Amplify.API.subscription API will use the AppSyncRealTimeClient library for subscriptions. When there is a network status changes, it will internally disconnect and reconnect with exponential back-ff. From Amplify.API's perspective, it doesn't receive an error until all retry attempts have been exhausted, and the error is an .operationError, not a URLError. See aws-amplify/aws-appsync-realtime-client-ios#58 for more details

Sync to Cloud Mutations

OutgoingMutationQueue will create a SyncMutationtoCloudOperation for the mutation event, which contains retry logic for network failures. See #1722 for more details

Check points: (check or cross out if not relevant)

  • Added new tests to cover change, if needed
  • Build succeeds with all target using Swift Package Manager
  • All unit tests pass
  • All integration tests pass
  • Security oriented best practices and standards are followed (e.g. using input sanitization, principle of least privilege, etc)
  • Documentation update for the change if required
  • PR title conforms to conventional commit style
  • If breaking change, documentation/changelog update with migration instructions

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@lawmicha lawmicha requested a review from a team May 3, 2022 14:07
@lawmicha lawmicha force-pushed the fix/retry-initial-sync branch from a230ac3 to 9161e3f Compare May 3, 2022 14:08
@@ -158,10 +158,16 @@ final class AWSInitialSyncOrchestrator: InitialSyncOrchestrator {
return .successfulVoid
}

var underlyingError: Error?
if syncErrors.contains(where: isNetworkError) {
underlyingError = getFirstUnderlyingNetworkError(errors: syncErrors)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any potential side effects on using the first network error? (ie. can different network errors have different retry policies?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we use .first(where predicate:) and check if syncErrors has any network error and grab the first in a single operation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, it is not ideal to have RemoteSyncEngine contain URLError based retry logic, and this is why it's forcing us to propagate one of the network errors up to the RemoteSyncEngine. Ideally, InitialSyncOperation should be responsible for checking the network error, determining that the error is retryable, and retry with the exponential back-off. This means it will cycle on the single InitialSyncOperation trying to sync that one model. Once the network is back up, then the operation will finish, and move onto the next InitialSyncOperation.

This code current assumes that the first network error is probably similar to other network errors if they were run approximately the same time. Of course this assumption could be wrong, but the side effect of it being wrong and pick up a network error that RemoteSyncEngine decides is non-retryable is that the sync engine will move to an errored state, as with all other non-retryable errors.

The retry policy i believe is determined by the implementation in RequestRetryablePolicy which filters down to

case .notConnectedToInternet,
             .dnsLookupFailed,
             .cannotConnectToHost,
             .cannotFindHost,
             .timedOut:

@@ -219,4 +225,39 @@ extension AWSInitialSyncOrchestrator {
}
return false
}

private func isNetworkError(_ error: DataStoreError) -> Bool {
guard case let .sync(_, _, underlyingError) = error,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this check required if we are only looking for network error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required because the InitialSyncOrchestrator wraps the error from the operations in a DataStoreError.sync at line 120, stores in syncErrors, and moves onto the next operation. Comes to think of it, i'm not really sure why it needs to be this way, since syncErrors is just an array of [DataStoreError]. But as it is currently, it needs to unwrap the .sync case like it's done above in isUnauthorizedError

if case .failure(let dataStoreError) = result {
                    let syncError = DataStoreError.sync(
                        "An error occurred syncing \(modelSchema.name)",
                        "",
                        dataStoreError)
                    self.syncErrors.append(syncError)


private func getFirstUnderlyingNetworkError(errors: [DataStoreError]) -> Error? {
for error in errors {
guard case let .sync(_, _, underlyingError) = error,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as previous

@lawmicha lawmicha force-pushed the fix/retry-initial-sync branch from 2e581ac to 79532fe Compare May 4, 2022 04:00
@codecov-commenter
Copy link

codecov-commenter commented May 4, 2022

Codecov Report

Merging #1773 (a8fb911) into main (8f635d0) will decrease coverage by 0.05%.
The diff coverage is 18.75%.

@@            Coverage Diff             @@
##             main    #1773      +/-   ##
==========================================
- Coverage   59.23%   59.18%   -0.06%     
==========================================
  Files         716      716              
  Lines       21916    21943      +27     
==========================================
+ Hits        12982    12987       +5     
- Misses       8934     8956      +22     
Flag Coverage Δ
API_plugin_unit_test 62.51% <ø> (ø)
AWSPluginsCore 71.37% <ø> (+0.07%) ⬆️
Amplify 46.83% <ø> (+0.01%) ⬆️
Analytics_plugin_unit_test 71.09% <ø> (ø)
Auth_plugin_unit_test 78.74% <ø> (ø)
DataStore_plugin_unit_test 75.80% <18.75%> (-0.31%) ⬇️
Geo_plugin_unit_test 52.12% <ø> (ø)
Predictions_plugin_unit_test 26.81% <ø> (ø)
Storage_plugin_unit_test 58.04% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...gin/Sync/InitialSync/InitialSyncOrchestrator.swift 50.76% <0.00%> (-11.50%) ⬇️
...tegoryPlugin/Sync/RemoteSyncEngine+Retryable.swift 77.50% <85.71%> (-0.88%) ⬇️
...gin/Sync/InitialSync/ModelSyncedEventEmitter.swift 89.58% <0.00%> (-1.05%) ⬇️
...ataStoreCategoryPlugin/Sync/RemoteSyncEngine.swift 96.00% <0.00%> (+0.79%) ⬆️
.../DataStore/Model/Internal/Schema/ModelSchema.swift 66.00% <0.00%> (+2.00%) ⬆️
...e/AWSPluginsCore/Auth/AWSAuthServiceBehavior.swift 87.50% <0.00%> (+12.50%) ⬆️

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@lawmicha lawmicha self-assigned this May 4, 2022
if syncErrors.contains(where: isNetworkError) {
underlyingError = getFirstUnderlyingNetworkError(errors: syncErrors)
if let error = syncErrors.first(where: isNetworkError(_:)) {
underlyingError = getFirstUnderlyingNetworkError(error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: getFirstUnderlyingNetworkError -> getUnderlyingNetworkError

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants