Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(datastore): retry on subscription connection error #2571

Merged
merged 3 commits into from
Nov 17, 2022

Conversation

lawmicha
Copy link
Contributor

@lawmicha lawmicha commented Nov 15, 2022

Issue #, if available:
#2152

Description of changes:
Once DataStore is in an active state, then the network goes down, then recovers, the subscription websocket will send an error on network recovery. This error propagated back to API/DataStore which handles the error by moving DataStore to a stopped state. It is not processed as retryable. This delays the recoverability of DataStore until the next DataStore operation is called, since it remains in a stopped state. The scenario is reproducible by running an iOS app on a device or running a macOS app on the macbook. Running on an iOS simulator does not have the same effect and the websockets do not disconnect, rather the subscription event is received with a delay (~5-10 seconds) after network recovers.

The AppSyncRealTime websocket client, used by Amplify.API (API plugin) does have its own retry logic on classification of errors coming from the underlying websocket, but in certain scenarios, whether it's an unclassifed error or retry has been exhausted, will eventually propagate a ConnectionProviderError.connection error back to the caller and close the websocket connection. In this case, the API Plugin receives this terminating error and sends back API.operationError to DataStore.

DataStore’s sync engine used to be much more aggressive in retrying the sync process. This caused retry storms on AppSync and was fixed in #1901 It was a code bug that retried on all errors, regardless of the underlying error. By fixing this, it has created another issue where in this scenario where retry would have fixed the problem, was not initiated until the next explicit DataStore operation (start/save/query/delete/etc..) was called, because it has transitioned to the stopped state. When the network is turned back on, DataStore gets the API.operationError and stops. We were able to observe this on multiple attempts, #2152 (comment) and #2152 (comment).

  1. API plugin should classify ConnectionProvider.connection errors from AppSyncRealTimeClient as an APIError.networkError. Convert ConnectionProvider.connection to some Foundation’s URLError case that makes sense like URLError.networkConnectionLost
  2. DataStore should not handle anything related to AppSyncRealTimeClient errors and should only act on APIError and extract out the URLError when it is a APIError.networkError as an indication to retry the sync process. It passes the URLError to RequestRetryablePolicy, which now classifies .networkConnectionLost as retryable.

With this change, when the wifi is turned back on from off, DataStore receives the websocket error and restarts the sync process, and transitions to the active state successfully.

Check points: (check or cross out if not relevant)

  • Added new tests to cover change, if needed
  • Build succeeds with all target using Swift Package Manager
  • All unit tests pass
  • All integration tests pass
  • Security oriented best practices and standards are followed (e.g. using input sanitization, principle of least privilege, etc)
  • Documentation update for the change if required
  • PR title conforms to conventional commit style
  • If breaking change, documentation/changelog update with migration instructions

DataStore checkpoints (check when completed)

  • Ran AWSDataStorePluginIntegrationTests
  • Ran AWSDataStorePluginV2Tests
  • Ran AWSDataStorePluginMultiAuthTests
  • Ran AWSDataStorePluginCPKTests
  • Ran AWSDataStorePluginAuthCognitoTests
  • Ran AWSDataStorePluginAuthIAMTests

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@lawmicha lawmicha marked this pull request as ready for review November 15, 2022 14:52
@lawmicha lawmicha requested review from a team as code owners November 15, 2022 14:52
@sebsto
Copy link

sebsto commented Nov 16, 2022

Tested on a macOS and iOS (on device) application - looks like the fix enables the syncing to resume without having to write a single line of code.

atierian
atierian previously approved these changes Nov 16, 2022
Copy link
Member

@atierian atierian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants