Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add ML models #1721

Merged
merged 164 commits into from
Sep 10, 2020
Merged

feat: add ML models #1721

merged 164 commits into from
Sep 10, 2020

Conversation

arunvasudevan
Copy link
Collaborator

@arunvasudevan arunvasudevan commented Jul 6, 2020

ML Models schema is inspired from Google's Model cards Paper- https://arxiv.org/pdf/1810.03993.pdf

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

@arunvasudevan arunvasudevan changed the title ML models Initial Version for feedback ML models - Initial Version for feedback Jul 7, 2020
@arunvasudevan arunvasudevan changed the title ML models - Initial Version for feedback feat: add ML models Jul 9, 2020
Copy link
Contributor

@hshahoss hshahoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Arun for the extensive PR. Left a few overall comments on the models.

@hshahoss hshahoss self-assigned this Jul 13, 2020
@hshahoss hshahoss added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Jul 13, 2020
Copy link
Contributor

@jywadhwani jywadhwani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the initiative @arunvasudevan

@arunvasudevan
Copy link
Collaborator Author

@hshahoss @jywadhwani Thanks for your detailed review. I have addressed your comments.
Kindly take a look at the same.

/*
* Cost Details for an Entity
*/
record Cost {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One option might be to define an enum for type of Cost called something like CostType. It could be an enum to identify the different type of costs that can be attached. To start with, we can have ORG_COST_CODE as one of the types. Then, we can define cost value as a object which can be a union of primitive/complex cost objects types. Something similar to a model like below:

/*
* Cost Details for an Entity
*/
record Cost {

    costType: CostType

    cost: CostValue
}
enum CostType { 

  /** 
   * Org Cost Code to which the Cost of this entity should be attributed to
   */ 
  ORG_COST_CODE 

// additional cost types here
} 
typeref CostValue = union[ 
  numericalCost: double
  stringCost: string
  customCost: customCostModel
  ....
]

Referencing the above model, we can now have an array of Cost aspect associated with the Model entity. The above model will provide an option for others to expand it to include their own cost definitions.

I'd encourage for us to think of all common models in this way so that they are extensible to other use cases. Let me know what you think.

/**
* Fabric type where ML Model belongs to or where it was generated.
*/
origin: FabricType
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this Fabric refer to data centers? Just for my understanding, would models be trained per fabric?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FabricType refers to the origin or environment.
In most cases i don't think someone would train in every environment but they can attach the trained datasets (DatasetUrn) per platform, Model and Environment.

@arunvasudevan
Copy link
Collaborator Author

arunvasudevan commented Aug 4, 2020

@hshahoss @zimengy @jywadhwani Thanks a lot for your review comments.
Based on the last round of review it made sense for me to make MlFeature a top level entity.
In addition to that i have also made other changes as per review.
Kindly take a look and let me know what you think.
Thanks!

@arunvasudevan
Copy link
Collaborator Author

arunvasudevan commented Aug 17, 2020

@jywadhwani Updated all model and feature references to MLModel and MLFeature.

Copy link
Contributor

@hshahoss hshahoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, Looks good to me. Let's try to close on the open questions and get this in?

@hshahoss hshahoss requested a review from jywadhwani August 17, 2020 22:46
Copy link
Contributor

@jywadhwani jywadhwani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While working on the RFC, came across some questions that I have added in my review comments.

@arunvasudevan
Copy link
Collaborator Author

@jywadhwani @hshahoss Appreciate your time in the review process.
Kindly take a look and let me know your comments.

mars-lan and others added 24 commits September 10, 2020 15:01
fix: Update defaults of aspectNames params.

The last PR to sync internal code broke the external GMS, as code was now expected aspectNames to be null rather than empty by default. This preventing me logging into DataHub as the corp user request would fail (it assumed I asked for no aspects rather than all aspects).

TESTED: Built locally, launched with docker/dev.sh (so used latest frontend, but whatever). Verified I can now log into DataHub, browse and search for datasets, and view my profile.
…atahub-project#1782)

* 1747 | remove obsolete yaml files

* 1747 | remove configmap and its hardcoded references

* 1747 | add missing input parameter of neo4j.host

* 1747 | remove obsolete secrets and parameterize the rest

* 1747 | auto-generate gms secret

* 1747 | remove fullName overrides

* 1747 | fix parameters in subchart's values.yaml

* 1747 | remove hardcoding from parameters for gms host and port

* 1747 | upgrade chart version

* 1747 | update helm docs

* 1747 | add extraEnv, extraVolume and extraMounts

* 1747 | Alters pull policy of images to 'always' for ldh

Co-authored-by: shakti-garg <[email protected]>
…-tier support (datahub-project#1817)

* feat(data-platforms): Adding rest resource for /dataPlatforms and mid-tier support

* Removed data platforms which are Linkedin internal
* Copy NOTICE from wherehows

Copies the file from the wherehows branch.

* Update notice.
* feature(dashboards): RFC for dashboards

* Change directory structure

* Create goals & non-goals sections

* Removing alternatives section
Also list apache projects specifically.
* Releases updated version of datahub-web client UI code

* Fix typo in yarn lock

* Change yarn lock to match yarn registry directories

* Previous commit missed some paths

* Even more changes to yarnlock missing in previous commit

* Include codegen file for typings

* Add files to get parity for datahub-web and current OS datahub-midtier

* Add in typo fix from previous commit - change to proper license

* Implement proper OS fix for person entity picture url

* Workarounds for open source DH issues

* Fixes institutional memory api and removes unopensourced tabs for datasets

* Fixes search dataset deprecation and user search issue as a result of changes

* Remove internal only options in the avatar menu
* Update README.md

* Update links.md
Copy link
Contributor

@jywadhwani jywadhwani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hard work! Excited to see this come to fruition.

@mars-lan
Copy link
Contributor

Great work @arunvasudevan! Thanks for the unyielding will to see this through.

@mars-lan mars-lan merged commit 66dd008 into datahub-project:master Sep 10, 2020
@arunvasudevan arunvasudevan deleted the ml_models branch March 4, 2022 19:32
chriscollins3456 added a commit to chriscollins3456/datahub that referenced this pull request Sep 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.