(docs) add RFC file to introduce Notebook entity data model #4237

Merged
merged 5 commits into from
Mar 18, 2022
74 changes: 74 additions & 0 deletions docs/rfc/active/4237-datadoc-entity/datadoc-entity-rfc.md
@@ -0,0 +1,74 @@
- Start Date: 2022-02-22
- RFC PR: https://github.com/linkedin/datahub/pull/4237
- Discussion Issue: (GitHub issue this was discussed in before the RFC, if any)
- Implementation PR(s): (leave this empty)

# Extend data model to model Notebook entity

## Background
[Querybook](https://www.querybook.org/) is Pinterest’s open-source big data IDE with a notebook interface.
We (Included Health) use it as our main querying tool. Its DataDoc feature organizes rich text,
queries, and charts into a notebook so that analyses are easy to document. Users can collaborate in a
DataDoc and see each other's changes in real time. We believe it would be valuable to ingest DataDoc metadata into DataHub and make
it easily searchable and discoverable by others.

## Summary
This RFC proposes a data model for the DataDoc entity. It does not cover architecture, APIs, or other
implementation details. It includes only the minimum data model needed to meet our initial goal. If the community
decides to adopt this new entity, further work will be needed.

## Detailed design

### DataDoc Model
![DataDoc High Level Model](DataDoc-high-level-model.png)

As shown in the diagram above, a DataDoc is a document that contains a list of DataDoc cells. It organizes rich text,
queries, and charts into a notebook to easily document analyses. The DataDoc model is very similar to that of a
Notebook, and a DataDoc can be viewed as a subset of Notebook. We will therefore model Notebook rather than DataDoc,
and include the "subTypes" aspect to differentiate Notebook and DataDoc.

### Notebook Data Model
This section describes the minimum Notebook data model that meets our needs.
- notebookKey (keyAspect)
  - notebookTool: The name of the DataDoc tool, such as QueryBook or Notebook
  - notebookId: Unique id for the DataDoc
- notebookInfo
  - title (Searchable): The title of this DataDoc
  - description (Searchable): Detailed description of the DataDoc
  - lastModified: Captures information about who created/last modified/deleted this DataDoc and when
- notebookContent
  - content: The content of a DataDoc, which is composed of a list of DataDocCell
- editableDataDocProperties
- ownership
- status
- globalTags
- institutionalMemory
- browsePaths
- domains
- subTypes
- dataPlatformInstance
- glossaryTerms
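
The key and info aspects listed above can be sketched as plain Python dataclasses. This is only an illustration of the shape of the model, not the DataHub schema: the class names, the snake_case field names, the `AuditStamp` simplification (actor plus epoch milliseconds), and the sample values are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class NotebookKey:
    """Key aspect: identifies a notebook by its tool and a tool-local id."""
    notebook_tool: str  # e.g. "QueryBook"
    notebook_id: str    # unique id within that tool


@dataclass
class AuditStamp:
    """Hypothetical simplification of 'who and when'."""
    actor: str
    time: int  # epoch milliseconds (assumed unit)


@dataclass
class NotebookInfo:
    title: str                                   # searchable
    description: Optional[str] = None            # searchable
    last_modified: Optional[AuditStamp] = None


@dataclass
class Notebook:
    """A notebook entity holding a few of the aspects from the list."""
    key: NotebookKey
    info: NotebookInfo
    # subTypes distinguishes a DataDoc from other kinds of notebook.
    sub_types: List[str] = field(default_factory=list)


# Illustrative values only.
doc = Notebook(
    key=NotebookKey(notebook_tool="QueryBook", notebook_id="123"),
    info=NotebookInfo(title="Weekly retention analysis"),
    sub_types=["DataDoc"],
)
```

Making the key frozen keeps it hashable, so it can serve as a dictionary key when indexing notebooks by identity.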

### Notebook Cells
A Notebook cell is the unit that composes a Notebook. There are three types of cells: Text Cell, Query Cell, and Chart Cell. Each
type of cell has its own metadata. Since a cell only lives within a Notebook, we model cells as an aspect of Notebook
rather than as a separate entity. Here is the metadata for each type of cell:
- TextCell
  - cellTitle: Title of the cell
  - cellId: Unique id for the cell.
  - lastModified: Captures information about who created/last modified/deleted this Notebook cell and when
  - text: The actual text in a TextCell in a Notebook
- QueryCell
  - cellTitle: Title of the cell
  - cellId: Unique id for the cell.
  - lastModified: Captures information about who created/last modified/deleted this Notebook cell and when
  - rawQuery: Raw query to explain some specific logic in a Notebook
  - lastExecuted: Captures information about who last executed this query cell and when
- ChartCell
  - cellTitle: Title of the cell
  - cellId: Unique id for the cell.
  - lastModified: Captures information about who created/last modified/deleted this Notebook cell and when
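
As a rough sketch of how the three cell types share common metadata and live inside the notebook's content aspect, the following Python dataclasses mirror the fields above. The `BaseCell` class, the `AuditStamp` simplification, the snake_case names, and the sample values are assumptions for illustration and are not part of the proposal.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class AuditStamp:
    """Hypothetical simplification of 'who and when'."""
    actor: str
    time: int  # epoch milliseconds (assumed unit)


@dataclass
class BaseCell:
    """Metadata shared by every cell type (hypothetical base class)."""
    cell_id: str
    cell_title: Optional[str] = None
    last_modified: Optional[AuditStamp] = None


@dataclass
class TextCell(BaseCell):
    text: str = ""


@dataclass
class QueryCell(BaseCell):
    raw_query: str = ""
    last_executed: Optional[AuditStamp] = None


@dataclass
class ChartCell(BaseCell):
    """Carries only the shared metadata in the minimum model."""


NotebookCell = Union[TextCell, QueryCell, ChartCell]


@dataclass
class NotebookContent:
    """Cells are stored on the notebook, not as separate entities."""
    cells: List[NotebookCell]


# Illustrative values only.
content = NotebookContent(cells=[
    TextCell(cell_id="c1", text="Intro"),
    QueryCell(cell_id="c2", raw_query="SELECT 1"),
    ChartCell(cell_id="c3"),
])
kinds = [type(cell).__name__ for cell in content.cells]
```

Keeping the cells in an ordered list on the content aspect preserves their position in the notebook without needing a separate relationship per cell.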

## Future Work
Querybook provides an embedding feature. We could use it to embed a query tab in DataHub,
giving users a search-and-explore experience.