(docs) add RFC file to introduce Notebook entity data model #4237

Merged (5 commits) on Mar 18, 2022

tc350981
Contributor

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

@shirshanka
Contributor

Rendered : .md

DataDoc could be viewed as a subset of Notebook. We could model Notebook
instead and make DataDoc a subtype of Notebook
@tc350981 changed the title from "(docs) add RFC file to introduce DataDoc entity data model" to "(docs) add RFC file to introduce Notebook entity data model" on Mar 15, 2022
docs/rfc/active/000-datadoc-entity/datadoc-entity-rfc.md (outdated)
Notebook. DataDoc would be viewed as a subset of Notebook. Therefore we are going to model Notebook rather than DataDoc.
We will include the "subTypes" aspect to differentiate Notebook and DataDoc.
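
As a rough illustration of the subtype idea in the lines above, here is a minimal Python sketch. It is not taken from the RFC: the class names, field names, and URN strings are hypothetical stand-ins, and the real aspect schema may look different.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubTypes:
    """Hypothetical stand-in for the "subTypes" aspect."""
    type_names: List[str] = field(default_factory=list)


@dataclass
class Notebook:
    """Hypothetical Notebook entity; the URN format here is illustrative only."""
    urn: str
    sub_types: SubTypes = field(default_factory=SubTypes)


def is_datadoc(notebook: Notebook) -> bool:
    # A DataDoc is modeled as a Notebook whose subTypes aspect lists "DataDoc".
    return "DataDoc" in notebook.sub_types.type_names


# Made-up example entities: one DataDoc and one plain notebook.
querybook_doc = Notebook(urn="example:notebook:1", sub_types=SubTypes(["DataDoc"]))
plain_notebook = Notebook(urn="example:notebook:2")
```

With this shape, anything that handles Notebook entities also handles DataDocs, and DataDoc-specific behavior only needs to check the subtype.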

### Notebook Data Model
Contributor

One thing that seems to be missing is lineage information. Though storing information about notebook content may help with discovery among DataDocs, I think we need some level of lineage between docs and datasets. Regarding this: 1) Is this something we can fetch from Querybook or Jupyter notebooks? 2) How would we store the information? Ideally, I think it should be per cell (specifically query cells and chart cells), but that is up for debate.

Contributor Author

I think it makes sense to include lineage information, like linking the notebook to some datasets.

For Querybook and Jupyter notebooks, I feel we could only extract the lineage information from the queries.

As for whether lineage should be per cell or per notebook, I feel per cell fits reality better. But we model the cells as one aspect of a notebook entity, so I think it will be better to attach lineage per data entity, which means per notebook entity.
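
To make the per-cell versus per-notebook trade-off concrete, here is a minimal Python sketch of one way it could work: cells keep the datasets their query reads, and notebook-level lineage is the de-duplicated union over query cells. This is only an illustration under stated assumptions. `Cell`, `NotebookContent`, `input_datasets`, and `notebook_level_lineage` are hypothetical names, and the sketch assumes dataset names have already been extracted from each query by some upstream step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cell:
    """Hypothetical cell record; cell_type values are illustrative."""
    cell_type: str  # e.g. "TEXT", "QUERY", "CHART"
    # Assumed to be filled in by an upstream query-parsing step (not shown).
    input_datasets: List[str] = field(default_factory=list)


@dataclass
class NotebookContent:
    """Hypothetical content aspect: all cells live inside one notebook aspect."""
    cells: List[Cell] = field(default_factory=list)


def notebook_level_lineage(content: NotebookContent) -> List[str]:
    """Union of datasets read by query cells, in first-seen order."""
    seen = set()
    ordered: List[str] = []
    for cell in content.cells:
        if cell.cell_type != "QUERY":
            continue  # only queries are assumed to carry extractable lineage
        for dataset in cell.input_datasets:
            if dataset not in seen:
                seen.add(dataset)
                ordered.append(dataset)
    return ordered


# Made-up example notebook with two query cells sharing one dataset.
example_content = NotebookContent(cells=[
    Cell("TEXT"),
    Cell("QUERY", ["db.orders", "db.users"]),
    Cell("CHART"),
    Cell("QUERY", ["db.users", "db.events"]),
])
```

Aggregating this way keeps lineage attached to the notebook entity while still leaving room to record per-cell detail inside the content aspect later.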

docs/rfc/active/000-datadoc-entity/datadoc-entity-rfc.md (outdated)
Contributor

@dexter-mh-lee left a comment

This looks great as a first step. Let's iterate on it as we build more support.

@dexter-mh-lee merged commit 89a8fa0 into datahub-project:master on Mar 18, 2022
maggiehays pushed a commit to maggiehays/datahub that referenced this pull request Aug 1, 2022
…project#4237)

* add RFC file to introduce DataDoc entity

* add PR link

* Model Notebook instead of DataDoc

DataDoc could be viewed as a subset of Notebook. We could model Notebook
instead and make DataDoc a subtype of Notebook

* update picture file name

* Put rfc number and resolve pr comments

Co-authored-by: Xu Wang <[email protected]>

4 participants