feat: add dataset schema versions #2763

@davidjgoss davidjgoss commented Mar 9, 2024

Problem

This PR is the first step towards implementing the proposal from https://github.com/MarquezProject/marquez/blob/main/proposals/2676-version-dataset-schemas-separately.md.

Solution

The idea is that, without changing what already gets written to the database, we start writing to the new dataset_schema_versions and dataset_schema_versions_field_mapping tables.
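
A minimal sketch of that additive write path, not the PR's actual code: only the two table names come from this PR. The column names, the `schema_version_uuid` derivation, and the use of SQLite in place of the real database are assumptions made so the example is self-contained.

```python
import sqlite3
import uuid

# Hypothetical, simplified columns: only the two table names come from the PR.
DDL = """
CREATE TABLE dataset_schema_versions (
    uuid         TEXT PRIMARY KEY,
    dataset_uuid TEXT NOT NULL
);
CREATE TABLE dataset_schema_versions_field_mapping (
    dataset_schema_version_uuid TEXT NOT NULL
        REFERENCES dataset_schema_versions (uuid),
    dataset_field_uuid          TEXT NOT NULL,
    PRIMARY KEY (dataset_schema_version_uuid, dataset_field_uuid)
);
"""


def schema_version_uuid(dataset_uuid, fields):
    """Derive a deterministic ID from the dataset and its (name, type) pairs.

    Sorting makes the ID independent of field order (an assumption of this sketch).
    """
    pairs = sorted((name, ftype) for _, name, ftype in fields)
    canonical = dataset_uuid + "|" + ",".join(f"{n}:{t}" for n, t in pairs)
    return str(uuid.uuid5(uuid.NAMESPACE_OID, canonical))


def upsert_schema_version(conn, dataset_uuid, fields):
    """Insert the schema version and its field mappings if not already present.

    `fields` is a list of (field_uuid, name, type) tuples.
    """
    version = schema_version_uuid(dataset_uuid, fields)
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT OR IGNORE INTO dataset_schema_versions (uuid, dataset_uuid) "
            "VALUES (?, ?)",
            (version, dataset_uuid),
        )
        conn.executemany(
            "INSERT OR IGNORE INTO dataset_schema_versions_field_mapping "
            "VALUES (?, ?)",
            [(version, field_uuid) for field_uuid, _, _ in fields],
        )
    return version
```

Because the ID is derived from the schema itself, writing the same schema twice is a no-op, so existing writes can stay untouched while the new tables fill up. On PostgreSQL the equivalent of `INSERT OR IGNORE` would be `INSERT ... ON CONFLICT DO NOTHING`.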

Todo

  • Create new database tables
  • Upsert dataset schema version when creating a new dataset version
    • OpenLineage code path
    • Legacy DatasetDao code path
    • Legacy RunDao code path (not viable to tackle in this PR)
  • Handle in retention cleanup
    • Ensure schema versions and fields get swept up in the cascade when a dataset is deleted
  • Handle input datasets whose schema doesn't match the current version: for now, the schema version will not change even when the input dataset's schema has drifted. To be discussed in "Handling of input datasets where schema different from current version" #2764

One-line summary: Add dataset schema versions to the model and start writing to them

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've included a one-line summary of your change for the CHANGELOG.md (Depending on the change, this may not be necessary).
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg bot added the api API layer changes label Mar 9, 2024

netlify bot commented Mar 9, 2024

Deploy Preview for peppy-sprite-186812 ready!

Latest commit: fb182cf
Latest deploy log: https://app.netlify.com/sites/peppy-sprite-186812/deploys/665fde0ff4770500084bef13
Deploy Preview: https://deploy-preview-2763--peppy-sprite-186812.netlify.app

@davidjgoss davidjgoss force-pushed the feature/dataset-schema-versions-part-1 branch from 2d26f95 to 7961d23 Compare March 9, 2024 12:04

codecov bot commented Mar 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.74%. Comparing base (1f00c9b) to head (fb182cf).

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2763      +/-   ##
============================================
+ Coverage     84.56%   84.74%   +0.18%     
- Complexity     1441     1456      +15     
============================================
  Files           251      253       +2     
  Lines          6504     6562      +58     
  Branches        303      305       +2     
============================================
+ Hits           5500     5561      +61     
+ Misses          851      850       -1     
+ Partials        153      151       -2     


@davidjgoss davidjgoss force-pushed the feature/dataset-schema-versions-part-1 branch from 7961d23 to d70a405 Compare March 13, 2024 13:14
@wslulciuc wslulciuc added this to the Roadmap milestone Apr 16, 2024
@wslulciuc wslulciuc added the db.perf This issue or pull request improves DB performance label Apr 16, 2024
@davidjgoss davidjgoss force-pushed the feature/dataset-schema-versions-part-1 branch from d70a405 to a176fc5 Compare April 17, 2024 07:05
@davidjgoss davidjgoss force-pushed the feature/dataset-schema-versions-part-1 branch 3 times, most recently from df73ec0 to ffbe1aa Compare April 25, 2024 14:54
@davidjgoss

The next step is to look at the deprecated DatasetResource path for ingestion. I will timebox this and abandon it if it takes too much effort.

@davidjgoss davidjgoss force-pushed the feature/dataset-schema-versions-part-1 branch from ffbe1aa to b6bcdac Compare May 4, 2024 11:12
@davidjgoss davidjgoss force-pushed the feature/dataset-schema-versions-part-1 branch from 242d636 to fe9c905 Compare May 24, 2024 09:42
@davidjgoss davidjgoss marked this pull request as ready for review May 24, 2024 10:28
@davidjgoss

This should be suitable to merge and release, so that we start writing data into the new tables now, which should ease the migration later.

The next steps in this project (in no particular order; they can proceed in parallel) are:

  • Update DbRetention so it will clean up any orphaned schema versions after it cleans up dataset versions
  • Develop a migration script to backfill dataset schema versions for pre-existing dataset versions
  • Start to change some code paths to read from the new tables where possible
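
As a rough illustration of the first step only, the orphan cleanup could look like the sketch below. This is not DbRetention's actual code: the column `dataset_versions.dataset_schema_version_uuid`, the simplified table definitions, the function name, and the use of SQLite are all assumptions made for a self-contained example.

```python
import sqlite3

# Hypothetical minimal schema. The link column on dataset_versions is assumed.
DDL = """
CREATE TABLE dataset_schema_versions (uuid TEXT PRIMARY KEY);
CREATE TABLE dataset_schema_versions_field_mapping (
    dataset_schema_version_uuid TEXT NOT NULL,
    dataset_field_uuid          TEXT NOT NULL
);
CREATE TABLE dataset_versions (
    uuid                        TEXT PRIMARY KEY,
    dataset_schema_version_uuid TEXT
);
"""

# Schema versions still referenced by at least one dataset version.
# NULLs are filtered out so that NOT IN behaves as expected.
REFERENCED = """
SELECT dataset_schema_version_uuid
FROM dataset_versions
WHERE dataset_schema_version_uuid IS NOT NULL
"""


def delete_orphaned_schema_versions(conn):
    """Delete schema versions no dataset version points at; return how many."""
    with conn:  # single transaction for both deletes
        # Remove field mappings first so no mapping outlives its schema version.
        conn.execute(
            "DELETE FROM dataset_schema_versions_field_mapping "
            f"WHERE dataset_schema_version_uuid NOT IN ({REFERENCED})"
        )
        cur = conn.execute(
            f"DELETE FROM dataset_schema_versions WHERE uuid NOT IN ({REFERENCED})"
        )
    return cur.rowcount
```

Running this after dataset versions have been swept means a schema version is removed only once nothing refers to it any more.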

@wslulciuc wslulciuc modified the milestones: Roadmap, 0.48.0 May 28, 2024

@wslulciuc wslulciuc left a comment


💯 💯 💯

@wslulciuc wslulciuc enabled auto-merge (squash) June 4, 2024 23:18
@wslulciuc wslulciuc disabled auto-merge June 5, 2024 03:49
@wslulciuc wslulciuc merged commit 635ad9b into MarquezProject:main Jun 5, 2024
15 checks passed
@davidjgoss davidjgoss deleted the feature/dataset-schema-versions-part-1 branch June 5, 2024 05:35