
Fix re-sync issue #35

Closed
wants to merge 3 commits

Conversation

0xpetersatoshi

Description of change

This fix removes the should_sync_stream function and its associated logic, which prevented the tap from re-syncing all streams after a previous run exited with a failure. When a state file containing the currently_syncing key was passed, should_sync_stream skipped every stream before the one matching currently_syncing and processed only that stream and the streams after it. The sync logic has been refactored to start from the currently_syncing stream and then process all remaining streams that need to be synced (see the sketch below). Additionally, the check that decides whether a record should be written was changed to compare against the previous bookmark timestamp and write the record only if its timestamp is strictly greater.
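
For illustration, a minimal sketch in Python of the stream reordering described above (the helper name get_ordered_streams and the selected_streams list are hypothetical, not identifiers from the tap):

    def get_ordered_streams(selected_streams, state):
        """Return all selected streams, starting from currently_syncing."""
        currently_syncing = state.get("currently_syncing")
        if currently_syncing in selected_streams:
            idx = selected_streams.index(currently_syncing)
            # Rotate the list: the interrupted stream runs first, and the
            # streams that preceded it wrap around to the end instead of
            # being skipped entirely.
            return selected_streams[idx:] + selected_streams[:idx]
        return selected_streams

For example, with selected streams ["a", "b", "c"] and state {"currently_syncing": "b"}, this yields ["b", "c", "a"], so no stream is dropped after a failed run.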

Manual QA steps

  • Run the tap with a state file containing a currently_syncing stream. Ensure that all streams are still processed.

Risks

  • N/A

Rollback steps

  • Revert this branch.

@@ -105,7 +105,7 @@ def process_records(catalog, #pylint: disable=too-many-branches
     last_dttm = transform_datetime(last_datetime)
     bookmark_dttm = transform_datetime(transformed_record[bookmark_field])
     # Keep only records whose bookmark is after the last_datetime
-    if bookmark_dttm >= last_dttm:
+    if bookmark_dttm > last_dttm:
Contributor
I think we'll want to keep this as >= to account for multiple records being synced at the same timestamp.

e.g. if there are 10 records at timestamp t=2 and, on the first sync, we sync only the first 4 of them, we want to ensure that the remaining 6 records aren't skipped over, at the cost of having to redundantly re-sync the first 4.
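
A toy sketch of that scenario (illustrative data only, not the tap's code):

    # 10 records share timestamp 2; the bookmark was persisted as 2 after
    # only 4 of them were emitted by the interrupted first sync.
    records = [{"id": i, "updated_at": 2} for i in range(10)]
    bookmark = 2

    kept_gt = [r for r in records if r["updated_at"] > bookmark]
    kept_ge = [r for r in records if r["updated_at"] >= bookmark]

    print(len(kept_gt))  # 0  -> the 6 not-yet-synced records would be lost
    print(len(kept_ge))  # 10 -> all kept; the first 4 are redundantly re-synced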

Author

Good catch! My intent was to avoid re-syncing previously synced records on subsequent runs of the tap. But I missed that the write_bookmark function was updating the bookmark inside the while loop that batch-processes records. What do you think about moving the write_bookmark call to the end of sync_endpoint, after the batch processing has finished? I just moved it and tested it out, and it appears to work just fine that way as well. I updated the branch with the change. Let me know if that is acceptable.
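
A rough sketch of the proposed restructuring (the pagination helper and the write_bookmark signature are assumptions, not the tap's exact code):

    import singer

    def sync_endpoint(client, state, stream_name, last_datetime):
        max_bookmark = last_datetime
        while True:
            records = client.get_next_page(stream_name)  # hypothetical pagination helper
            if not records:
                break
            for record in records:
                # Track the maximum bookmark seen while batch-processing pages.
                if record["updated_at"] > max_bookmark:
                    max_bookmark = record["updated_at"]
                singer.write_record(stream_name, record)
        # write_bookmark (signature assumed) now runs once, after the loop, so an
        # interrupted run can no longer advance the state past unsynced records.
        write_bookmark(state, stream_name, max_bookmark)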

Contributor

Yep, I like that.

👍 sounds good to me.

@jacobrobertbaca
Contributor

included in #38
