-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix re sync issue #35
Conversation
@@ -105,7 +105,7 @@ def process_records(catalog, #pylint: disable=too-many-branches | |||
last_dttm = transform_datetime(last_datetime) | |||
bookmark_dttm = transform_datetime(transformed_record[bookmark_field]) | |||
# Keep only records whose bookmark is after the last_datetime | |||
if bookmark_dttm >= last_dttm: | |||
if bookmark_dttm > last_dttm: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll want to keep this as >=
to account for multiple records being synced at the same timestamp.
e.g. if there are 10 records r
at timestamp t=2
and on the first sync, we sync through only the first 4 of them, we want to ensure that those remaining 6 records aren't skipped over at the cost of us having to redundantly sync the first 4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! My intent was to avoid re-syncing previously synced records on subsequent runs of the tap. But I missed that the write_bookmark
function was updating the bookmark within the while loop that is batch processing records. What do you think about moving the write_bookmark
function to the end of sync_endpoint
after the batch processing has finished? I just moved it and tested it out it and it looks to work just fine that way as well. I updated the branch with the change. Let me know if that is acceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I like that.
👍 sounds good to me.
236d028
to
e9bf13a
Compare
included in #38 |
Description of change
This fix removes the
should_sync_stream
function and associated logic which prevents the tap from re-syncing all the streams in the case where the previous run has exited with a failure. In the case where a state file is passed containing thecurrently_syncing
key, theshould_sync_stream
would skip all the streams before reaching the one that matchedcurrently_syncing
and would only process that stream and any streams after it. Refactored to start syncing from thecurrently_syncing
stream first and process all streams that also need to be synced. Additionally, modified logic that checks if a record should be written by comparing to the previous timestamp to only write the record if the timestamp is greater than the previous timestamp.Manual QA steps
currently_syncing
stream. Ensure that all streams are still processed.Risks
Rollback steps