Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(platform): add support for via nodes #9733

Merged
merged 5 commits into from
Jan 30, 2024

Conversation

shirshanka
Copy link
Contributor

What are 'Via Nodes'

Via nodes are entities designed to represent relationships between two distinct entities in the graph model. They typically represent processes like queries or data jobs that generate column-level lineage between fields, enhancing the model's capabilities in lineage representation. Via nodes are convenient performance and metadata representation shortcuts in cases where paths between two entities transit through a common node that is an intersection point for other paths.
e.g. consider fine-grained lineage produced by a job J that looks like

T1 col c11, c12 --> (job J) --> T2 col c21, c22

and

T1 col c13, c14 --> (job J) --> T2 col 23

via nodes allow us to concisely represent and query these paths, without needing to create edge specific nodes (e.g. in this case, we do NOT need to mint fake ColumnProcess nodes J-1 , and J-2 to keep the paths separated unlike other metadata systems like Apache Atlas)

Caveat emptor: via nodes are experimental - and we'll revisit this decision in case this optimization doesn't yield as many benefits as we think it will.

Summary

This pull request implements 'via nodes' within the graph model and updates the annotation language for compatibility, aiming to improve the model's data representation and querying capabilities.

Model Changes

  • Query Support: Added support for query nodes in both fine-grained and regular lineage models.

Implementation Changes in GMS

  • Graph Storage and Querying: Implemented storing and querying of 'via nodes' in the graph, specific to the Elastic search implementation.
  • Enhanced Search-across-Lineage: Updated the search-across-lineage feature to support multi-path walks, extending beyond shortest path walks.
  • GraphQL Changes: Made adjustments in GraphQL to accommodate changes in lineage results and search flags due to the new 'via nodes' functionality.
  • Bug Fix - Dataset and DataJob Deletions: Addressed a bug related to deletions of datasets or dataJobs where relationships involving schemaFields were not properly deleted. This is resolved through the introduction of 'edgeLifecycleOwner' in the graph store.

Testing

  • Lineage Query Pathway Tests: Added unit and integration (smoke) tests for the new lineage query pathways to ensure functionality

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata product PR or Issue related to the DataHub UI/UX devops PR or Issue related to DataHub backend & deployment smoke_test Contains changes related to smoke tests labels Jan 29, 2024
* The type of the entity to be grouped.
* e.g. schemaField
*/
rawEntityType: string
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one optimization on model:

make this optional, which means "applies to all entity types"

A grouping specification for search results
"""
input GroupingSpec {

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we copy the PDL comment here

@@ -143,6 +143,11 @@ input SearchFlags {
Whether to request for search suggestions on the _entityName virtualized field
"""
getSuggestions: Boolean

"""
Additional grouping specifications to apply to the search results
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add maybe note:
Notice: This API is experimental and subject to change.

"""
The raw entity type that needs to be grouped
"""
rawEntityType: String!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -45,7 +45,8 @@ public void testDefaultSearchFlags() throws Exception {
.setSkipAggregates(false)
.setSkipHighlighting(true) // empty/wildcard
.setMaxAggValues(20)
.setSkipCache(false));
.setSkipCache(false)
.setConvertSchemaFieldsToDatasets(true));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this flag intentional

@@ -7,16 +7,23 @@

@Slf4j
public class GraphRelationshipMappingsBuilder {
public static final String EDGE_FIELD_SOURCE = "source";
public static final String EDGE_FIELD_DESTINATION = "destination";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ty!

final InputFields inputFields = new InputFields(aspect.data());
updateInputFieldEdgesAndRelationships(
urn, inputFields, edgesToAdd, urnToRelationshipTypesBeingAdded);
} else if (aspectSpec.getName().equals(Constants.DATA_JOB_INPUT_OUTPUT_ASPECT_NAME)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

special cases... we'll need to come back on all of these

Copy link
Collaborator

@jjoyce0510 jjoyce0510 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor naming comments but otherwise that looks good

Copy link
Collaborator

@jjoyce0510 jjoyce0510 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once Ci is green, I think we are good to ship.

Copy link
Collaborator

@david-leifker david-leifker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending fixing the python smoke test failures related to the graphql api errors related to the changes here.

@shirshanka shirshanka merged commit 1d06d38 into datahub-project:master Jan 30, 2024
60 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
devops PR or Issue related to DataHub backend & deployment ingestion PR or Issue related to the ingestion of metadata product PR or Issue related to the DataHub UI/UX smoke_test Contains changes related to smoke tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants