Deduplicate before committing and emit a log warning #212

Merged
fqtab merged 11 commits into main from dedupe_before_committing
Mar 15, 2024

Conversation

fqtab
Contributor

@fqtab fqtab commented Mar 14, 2024

What?

  • We've seen a single instance in production where a couple of data files were committed to a table twice in the same append operation.
  • While this appears to be a relatively isolated incident, it is not a good sign with respect to exactly-once guarantees.
  • As a result, this PR introduces a short-term fix that avoids committing the same file twice in the same operation, and adds logging to help detect such duplicates more quickly and with more contextual information.

How?

  • Deduplicates data and delete files received in a batch of messages from Kafka before committing to the table.
    • Note that this does NOT eliminate data/delete files duplicated across batches of messages read from Kafka. This PR is strictly concerned with duplication of data and delete files within a given batch of messages.
  • Adds logging to help identify where the duplicates may be coming from. Duplication of data files in a batch of messages will generally manifest in one of three ways:
    • same file appears in 2 equivalent envelopes e.g. if the Coordinator read the same message twice from Kafka
      In this case, you should see a log message similar to Deduplicated 2 data files with the same path=data.parquet for table=db.tbl during commit-id=cf602430-0f4d-41d8-a3e9-171848d89832 from the following events=[2x(SimpleEnvelope{...})]
    • same file appears in 2 different envelopes e.g. if a Worker sent the same message twice to Kafka
      In this case, you should see a log message similar to Deduplicated 2 data files with the same path=data.parquet for table=db.tbl during commit-id=cf602430-0f4d-41d8-a3e9-171848d89832 from the following events=[1x(SimpleEnvelope{...}), 1x(SimpleEnvelope{...})]
    • same file appears in a single envelope twice e.g. if a Worker included the same file twice in a single message sent to Kafka
      In this case, you should see a log message similar to Deduplicated 2 data files with the same path=data.parquet in the same event=SimpleEnvelope{...} for table=db.tbl during commit-id=cf602430-0f4d-41d8-a3e9-171848d89832
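
For illustration only, here is a minimal sketch of the per-batch deduplication and the three logging cases described above. It is not the code in this PR: `FileRef`, `Envelope`, `Result`, and `dedupe` are hypothetical stand-ins, files are compared by path only, and warnings are collected as strings rather than written to a logger. The message wording only approximates the examples above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DedupeSketch {

  /** Hypothetical stand-in for a data or delete file; only the path matters here. */
  record FileRef(String path) {}

  /** Hypothetical stand-in for an envelope read from Kafka. Equal fields mean "equivalent". */
  record Envelope(String id, List<FileRef> files) {}

  /** Unique files in first-seen order, plus one warning per detected duplication. */
  record Result(List<FileRef> files, List<String> warnings) {}

  static Result dedupe(List<Envelope> batch, String table, String commitId) {
    Map<String, FileRef> unique = new LinkedHashMap<>();
    // path -> (envelope -> how many times that envelope appeared in the batch with this path)
    Map<String, Map<Envelope, Integer>> sources = new LinkedHashMap<>();
    List<String> warnings = new ArrayList<>();

    for (Envelope env : batch) {
      // Count how often each path occurs inside this one envelope.
      Map<String, Integer> countsInEnvelope = new LinkedHashMap<>();
      for (FileRef file : env.files()) {
        countsInEnvelope.merge(file.path(), 1, Integer::sum);
        unique.putIfAbsent(file.path(), file);
      }
      countsInEnvelope.forEach((path, count) -> {
        if (count > 1) { // case 3: same file twice in a single envelope
          warnings.add(String.format(
              "Deduplicated %d data files with the same path=%s in the same event=%s"
                  + " for table=%s during commit-id=%s",
              count, path, env.id(), table, commitId));
        }
        sources.computeIfAbsent(path, p -> new LinkedHashMap<>()).merge(env, 1, Integer::sum);
      });
    }

    sources.forEach((path, byEnvelope) -> {
      int total = byEnvelope.values().stream().mapToInt(Integer::intValue).sum();
      if (total > 1) { // case 1 (one entry, 2x) or case 2 (several entries, 1x each)
        String events = byEnvelope.entrySet().stream()
            .map(e -> e.getValue() + "x(" + e.getKey().id() + ")")
            .collect(Collectors.joining(", "));
        warnings.add(String.format(
            "Deduplicated %d data files with the same path=%s for table=%s"
                + " during commit-id=%s from the following events=[%s]",
            total, path, table, commitId, events));
      }
    });

    return new Result(new ArrayList<>(unique.values()), warnings);
  }

  public static void main(String[] args) {
    FileRef file = new FileRef("data.parquet");
    Envelope fromWorker = new Envelope("w1", List.of(file));
    // The same envelope read twice by the Coordinator (first case above).
    Result result = dedupe(List.of(fromWorker, fromWorker), "db.tbl", "example-commit");
    result.warnings().forEach(System.out::println);
  }
}
```

Grouping the sources by envelope equality is what lets the warning distinguish one envelope read twice from two different envelopes carrying the same file. As in the PR, this only looks within one batch; duplicates across batches would pass through untouched.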

@fqtab fqtab force-pushed the dedupe_before_committing branch from 18002f2 to 0eb22c2 Compare March 14, 2024 13:26
@fqtab fqtab marked this pull request as ready for review March 14, 2024 13:26
@fqtab fqtab force-pushed the dedupe_before_committing branch from 193b1c5 to 0257c6c Compare March 14, 2024 14:08
@fqtab fqtab force-pushed the dedupe_before_committing branch 2 times, most recently from a33da6d to 39eaa8d Compare March 14, 2024 23:07
Contributor

@tabmatfournier tabmatfournier left a comment

Verbose due to the logging cases, which we may be able to clean up in a later PR once this has been out for a while.

Minor comments only. I think you can use Pair from the Iceberg utils.

@fqtab fqtab force-pushed the dedupe_before_committing branch from 6d1a1e7 to 09e1e94 Compare March 15, 2024 13:06
@fqtab fqtab force-pushed the dedupe_before_committing branch from 2e0b959 to 7e1266d Compare March 15, 2024 17:19
@fqtab fqtab merged commit ae72973 into main Mar 15, 2024
1 check passed
@fqtab fqtab deleted the dedupe_before_committing branch March 15, 2024 18:20

4 participants