feat(datajob): Datajob graphql query #2242

Merged (12 commits) on Apr 8, 2021

Contributor

@frsann frsann commented Mar 16, 2021

We add GraphQL Query functionality for the DataJob and DataFlow entities.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

@frsann frsann marked this pull request as ready for review March 17, 2021 16:33
Comment on lines 345 to 354
private static void configureDataFlowResolvers(final RuntimeWiring.Builder builder) {
    builder
        .type("Owner", typeWiring -> typeWiring
            .dataFetcher("owner", new AuthenticatedResolver<>(
                new LoadableTypeResolver<>(
                    CORP_USER_TYPE,
                    (env) -> ((Owner) env.getSource()).getOwner().getUrn()))
            )
        );
}
Contributor

@gabe-lyons gabe-lyons Mar 17, 2021

You shouldn't need this wiring: the owner: Owner type is already configured in the dataset's configure-resolvers method, and dataflow + datajobs are able to reuse that.

Contributor Author

@frsann frsann Mar 17, 2021

Thanks, I'll remove it!

So we basically don't need a configureDataFlowResolvers at all then?

I got confused as I was following the configureMlModelResolvers example, where it was configured as well. Should I remove it?

Contributor

@frsann I would double check using graphiql ui before making that change but I believe that should be fine

cc @jjoyce0510 to confirm

Contributor Author

Got it, I'll check with the IDE

Collaborator

Gabe's correct - we should be able to remove this. We should likely refactor the configureDatasetResolvers method to extract configuration of the "Owner" type into a dedicated method. For now, feel free to remove it here and with the ML models, I just ask that you issue a test query to verify that we can still traverse to the Owner type!
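
The refactor suggested above (moving the "Owner" configuration into a dedicated method that every entity reuses) might look roughly like the following. This is only a sketch under stated assumptions: `Wiring` is a hypothetical stand-in written for this example, not the graphql-java `RuntimeWiring.Builder` API, and the resolver bodies and the URN are placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class OwnerWiringSketch {

    /** Hypothetical stand-in for a runtime-wiring builder. NOT the graphql-java API. */
    static final class Wiring {
        private final Map<String, Function<Object, Object>> fetchers = new LinkedHashMap<>();

        Wiring dataFetcher(String type, String field, Function<Object, Object> fetcher) {
            fetchers.put(type + "." + field, fetcher);
            return this;
        }

        boolean has(String type, String field) {
            return fetchers.containsKey(type + "." + field);
        }

        Object resolve(String type, String field, Object source) {
            return fetchers.get(type + "." + field).apply(source);
        }
    }

    /** Shared: wires Owner.owner once for every entity that exposes owners. */
    static void configureOwnerResolvers(Wiring wiring) {
        // Placeholder resolver; the real one loads a corp user by URN.
        wiring.dataFetcher("Owner", "owner", ownerUrn -> "CorpUser(" + ownerUrn + ")");
    }

    /** Entity-specific: only fields unique to DataJob are wired here. */
    static void configureDataJobResolvers(Wiring wiring) {
        // Placeholder resolver for the DataJob.dataFlow field from this PR's schema.
        wiring.dataFetcher("DataJob", "dataFlow", flowUrn -> "DataFlow(" + flowUrn + ")");
    }

    static Wiring buildWiring() {
        Wiring wiring = new Wiring();
        configureOwnerResolvers(wiring);   // registered once, reused by all entities
        configureDataJobResolvers(wiring); // no Owner wiring repeated here
        return wiring;
    }

    public static void main(String[] args) {
        Wiring wiring = buildWiring();
        // The URN below is a placeholder value for illustration.
        System.out.println(wiring.resolve("Owner", "owner", "urn:li:corpuser:example"));
    }
}
```

With this shape, a per-entity method such as the removed configureDataFlowResolvers has nothing left to do for owners, which is why a test query that traverses to the Owner type is a sufficient check.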

Contributor Author

Thanks, will do!

Contributor

@gabe-lyons gabe-lyons left a comment

Did you have a chance to test this using graphiql? (https://github.com/graphql/graphiql)

This would let you issue graphql queries easily to your endpoint and verify things are wired up correctly.

Collaborator

@jjoyce0510 jjoyce0510 left a comment

Overall looks really solid.

Few top-level things:

  • Can you run GQL queries against DataFlow / DataJob and paste them in the PR description under a "Validation" header? (I use Postman) -- Let me know if you need assistance with this.
  • Should our modeling be so tightly coupled with Azkaban concepts? AFAIK Azkaban is not the only orchestrator / scheduler in wide use... Airflow among others exist as alternatives
  • Will DataFlows / DataJobs become Browsable? If so, when?
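
For anyone validating without Postman or GraphiQL, a validation query can also be posted from plain Java. The sketch below is illustrative only: the root field name `dataJob`, its `urn: String!` argument, the endpoint URL, and the URN value are all assumptions, and the selection set is limited to fields that appear in this PR's schema snippets.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class DataJobQuerySketch {

    // Assumed root field `dataJob(urn: String!)`; adjust to the actual schema.
    static final String QUERY =
        "query getDataJob($urn: String!) { dataJob(urn: $urn) { type jobId dataFlow { urn flowId } } }";

    /** Escapes a string for embedding inside a JSON string literal. */
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                    else out.append(c);
            }
        }
        return out.toString();
    }

    /** Wraps the query and its $urn variable in the usual GraphQL-over-HTTP JSON envelope. */
    static String toRequestBody(String query, String urn) {
        return "{\"query\":\"" + jsonEscape(query) + "\","
            + "\"variables\":{\"urn\":\"" + jsonEscape(urn) + "\"}}";
    }

    /** Builds (but does not send) the POST request. */
    static HttpRequest buildRequest(String endpoint, String body) {
        return HttpRequest.newBuilder(URI.create(endpoint))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        // Placeholder URN and endpoint; both depend on the local deployment.
        String body = toRequestBody(QUERY, "urn:li:dataJob:placeholder");
        HttpRequest request = buildRequest("http://localhost:8080/graphql", body);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body);
        // To send: HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    }
}
```

Passing the URN as a variable rather than splicing it into the query text keeps the GraphQL document constant, so the same body can be pasted into a "Validation" section with only the variable changed.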

datahub-graphql-core/src/main/resources/gms.graphql (outdated)
"""
The associated data flow
"""
dataFlow: DataFlow!
Collaborator

Modeling question: What if a DataJob is "orphaned", that is, it has no parent DataFlow because it is run in an ad-hoc manner?

Would someone want to be able to model this? If so, how would we advise they do so?

Contributor Author

@frsann frsann Mar 23, 2021

Well, we have the DataProcess, which has the inputs/outputs of the DataJob and the orchestrator of the DataFlow. I think that would be the best option in this case.

Collaborator

Not familiar with DataProcess - taking a look

"""
Datajob type
"""
type: AzkabanJobType
Collaborator

Always an AzkabanJobType?? What if a DataJob is being run on Airflow?

Collaborator

This modeling seems too Azkaban-specific

Contributor Author

Agree. Maybe something for a separate refactor PR?

Contributor

We would be forced to have backwards compatibility once we accept this one.
So can we think this through before checking in?

outputDatasets: [Dataset!]
}

enum AzkabanJobType {
Collaborator

What if we just made this "JobType" and tried to generify it further? Alternatively we could have a union of complex objects, one per type.

Contributor Author

frsann commented Mar 23, 2021

  • Can you run GQL queries against DataFlow / DataJob and paste them in the PR description under a "Validation" header? (I use Postman) -- Let me know if you need assistance with this.

Thanks, I'll get back to you on this.

  • Should our modeling be so tightly coupled with Azkaban concepts? AFAIK Azkaban is not the only orchestrator / scheduler in wide use... Airflow among others exist as alternatives

I see no reason for it. Would it make sense to refactor the JobType in a separate PR, though?

  • Will DataFlows / DataJobs become Browsable? If so, when?

Preferably yes, but we currently don't have an ETA.

@frsann frsann force-pushed the datajob-frontend branch 3 times, most recently from 58984f9 to c4026b5 Compare March 24, 2021 04:07
"""
The DATA_FLOW Entity
"""
DATA_FLOW
Collaborator

Thank you :)

Collaborator

@jjoyce0510 jjoyce0510 left a comment

Overall this is looking good to me, so long as you've tested using real GQL queries. You can see some samples here: https://github.com/linkedin/datahub/blob/master/datahub-gms-graphql-service/README.md

@jjoyce0510
Collaborator

@frsann Can you confirm you've validated the queries locally?

Contributor Author

frsann commented Mar 30, 2021

Not yet. Focused on the es7 migration last week. Will try to get this done this week.

@frsann frsann force-pushed the datajob-frontend branch from 6c12d21 to f4f9944 Compare April 6, 2021 17:45
Contributor Author

frsann commented Apr 7, 2021

@jjoyce0510 I can now confirm that these queries work, and I've added some sample queries to the README.

Contributor

@gabe-lyons gabe-lyons left a comment

Thank you, looks good to me

type
jobId
dataFlow {
urn
Collaborator

Hopefully fetching all the fields of dataFlow will work :)

Contributor Author

@frsann frsann Apr 8, 2021

yes, I just kept the example short

urn
flowId
}
inputOutput {
Collaborator

would this better be named just "lineage"?

if you agree, we can attend to it in a followup

Contributor Author

Not sure, I was basically trying to map these fields as closely to the aspect names as possible. But I don't see a problem changing this if lineage becomes an established term.

Collaborator

@jjoyce0510 jjoyce0510 left a comment

Overall LGTM.

Quick Question - Do we intend to add DataProcess in a future PR?

Great work. Thanks so much Fredrik!

Contributor Author

frsann commented Apr 8, 2021

Quick Question - Do we intend to add DataProcess in a future PR?

At the latest when we start ingesting some DataProcesses of our own 😉

Contributor Author

frsann commented Apr 8, 2021

@shirshanka ok to merge?

Contributor

@shirshanka shirshanka left a comment

LGTM!

@shirshanka shirshanka merged commit fd0923c into datahub-project:master Apr 8, 2021