What does an ideal metadata setup look like for an SDK-based Singer (or EDK) repo + MeltanoHub? #1205

tayloramurphy · 2023-03-22T16:37:23Z

tayloramurphy
Mar 22, 2023
Maintainer

Problem: The problem we're trying to solve generally is that Meltano doesn't control all of the Python packages in the larger Singer ecosystem. The Hub applies metadata about a specific package, but this is often out of sync with the repository itself. This discussion is about defining an ideal code+metadata state between the repo, Hub, and SDK.

Prior Topics:

Hub SSOT - Do we want to support package + metadata versioning? #219 (points to https://gitlab.com/meltano/hub/-/issues/223)

Miro Board where I'm trying to visualize some of this: https://miro.com/app/board/uXjVMQEGFps=/?share_link_id=882576162622

Proposed Tenets

- The connector repository should be able to contain all of the information about itself. This includes:
  - Settings, settings metadata (group validation), logos, labels, executables, etc.
- The connector repo should be able to contain any necessary data for inclusion on the Hub in a separate hub.yml file.
  - This would contain things like keywords, additional instructions, etc.
- On MeltanoHub, ideally we would have as minimal a setup as possible. Perhaps just a pointer to the repo.
  - If we needed to do overrides (which is basically what we do today) they could be stored in the Hub repository.
  - Likely we would need to add some metadata around Cloud support / requirements.

Ideal Scenario

Connector developer updates metadata and code using Semver.
- non-breaking and style changes to metadata are patch releases
- minor releases are used for things related to settings
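
A minimal sketch of how the two rules above could be checked automatically. The metadata layout (a dict with a `settings` key) and the function name `suggest_bump` are assumptions for illustration, and the sketch does not attempt to detect breaking changes, since no rule for major releases is stated here.

```python
from typing import Optional


def suggest_bump(old: dict, new: dict) -> Optional[str]:
    """Suggest a release type for a metadata change.

    Settings-related changes map to a minor release; any other
    metadata change (labels, wording, style) maps to a patch release.
    """
    if old == new:
        return None  # nothing changed, so no release is needed
    if old.get("settings") != new.get("settings"):
        return "minor"
    return "patch"


# Hypothetical metadata, for illustration only.
before = {"label": "Example", "settings": [{"name": "api_key"}]}
relabelled = {"label": "Example API", "settings": [{"name": "api_key"}]}
new_setting = {
    "label": "Example",
    "settings": [{"name": "api_key"}, {"name": "start_date"}],
}

print(suggest_bump(before, relabelled))   # patch
print(suggest_bump(before, new_setting))  # minor
```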

pnadolny13 · 2023-05-03T16:18:50Z

pnadolny13
May 3, 2023
Maintainer

@tayloramurphy this is great! I have tons of thoughts so I'm going to split them out as separate comments so we can thread off of them if needed. Overall I agree with the direction though!

1 reply

tayloramurphy May 3, 2023
Maintainer Author

@pnadolny13 thanks for your thoughts on this. Can you bring this to the Singer Guild?

pnadolny13 · 2023-05-03T16:40:08Z

pnadolny13
May 3, 2023
Maintainer

Additional Proposed Tenets

Connector maintainers shouldn't have to make any hub contributions to participate.

From the beginning we wanted the hub to be a place where users and connector maintainers could contribute to keep the ecosystem up to date, but it hasn't really played out that way. It's too difficult for contributors to add new plugins or to differentiate between a tap bug and a metadata bug, so we don't get very many contributions. I also don't think a tap developer should have to think about updating metadata on the hub when they make changes. Ideally they would make one contribution to the hub, which is an issue to request that the connector gets added. That's it. We could have dedicated hub contributors that help add, update, etc., but that's the responsibility of a few vs. expecting the entire ecosystem of tap developers to get on board.

Over time we came to terms with the reality that unless we put in the work to maintain the hub metadata, it won't get done. No contributors are coming along to do the hard work of scraping GitHub for new connectors or adding the new setting they implemented, so we did it ourselves.

At this point we have automation and assisted processes in place that make it significantly easier for us to maintain the hub, but those tools and know-how aren't really available to community members. They have to hand-write metadata definitions and go through rounds of feedback to get them up to the standard of the automated processes. In most cases it ends up being easier for everyone if we just get an issue requesting that a plugin be added or updated.

Proposal:

Update https://github.com/meltano/hub/blob/main/.github/ISSUE_TEMPLATE/new_tap.yml to be a request form. Tell the user we'll manage adding it.

1 reply

tayloramurphy May 3, 2023
Maintainer Author

> Ideally they would make one contribution to the hub, which is an issue to request that the connector gets added. That's it.

I like that 👍

pnadolny13 · 2023-05-03T17:19:37Z

pnadolny13
May 3, 2023
Maintainer

Connector Repo to be SSOT for metadata

> The connector repository should be able to contain all of the information about itself. This includes:
>
> - Settings, settings metadata (group validation), logos, labels, executables, etc.
>
> The connector repo should be able to contain any necessary data for inclusion on the Hub in a separate hub.yml file
>
> - This would contain things like keywords, additional instructions, etc.

I completely agree with this!

One challenge is that we would have to go to each individual repo to suggest docs changes, vs. how we have it now where we could theoretically update all definitions in a single PR. Having an override mechanism on the hub side would make some of this simpler, though.

Thoughts:

- I think if we want to push this metadata down to the connector maintainer then it should be beneficial to both of us somehow. If it's extra work for the developer to add hub docs to their repo and they get little benefit, then I think a lot will skip it.
- We should avoid putting anything in the connector repo that's custom formatted for the hub, or info that's really only useful for the hub. Stuff like the setting preamble, keywords, and maintenance status sort of feels like it could be hub specific unless we find a use for it on the connector side. The metadata in the connector repo should be agnostic to the system consuming it. Right now we're designing it to be used by the hub, but it should be useful for any system that wants to understand details about the connector. Outputting everything as JSON seems like a good way to do this, though I'm not sure how we handle markdown in that case.
- We could implement a command in the SDK to output the hub definition file, but that is hard to maintain because any change we might need would require the whole community to upgrade. Making the hub smarter, so that it accepts the agnostic output that all SDK connectors already produce, seems more sustainable.
- The hub should get smarter and accept JSON Schema vs. reformatted settings and group validation.
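
As a rough illustration of the last point, here is a sketch of the hub deriving settings and group validation from a JSON Schema on its own side. The type-to-kind mapping, the `secret` flag, and the choice to collapse all `required` properties into a single validation group are assumptions made for this example, not an existing hub feature.

```python
# Assumed mapping from JSON Schema types to hub setting kinds.
KIND_BY_TYPE = {
    "string": "string",
    "integer": "integer",
    "boolean": "boolean",
    "array": "array",
    "object": "object",
}


def schema_to_settings(schema: dict) -> tuple:
    """Return (settings, settings_group_validation) for a config schema."""
    settings = []
    for name, prop in schema.get("properties", {}).items():
        json_type = prop.get("type", "string")
        if isinstance(json_type, list):
            # e.g. ["string", "null"]: use the first non-null type
            json_type = next((t for t in json_type if t != "null"), "string")
        if prop.get("secret"):
            kind = "password"
        elif prop.get("format") == "date-time":
            kind = "date_iso8601"
        else:
            kind = KIND_BY_TYPE.get(json_type, "string")
        setting = {"name": name, "kind": kind}
        if "description" in prop:
            setting["description"] = prop["description"]
        settings.append(setting)
    required = sorted(schema.get("required", []))
    groups = [required] if required else []
    return settings, groups


# Hypothetical schema, for illustration only.
example_schema = {
    "type": "object",
    "properties": {
        "api_key": {"type": ["string"], "secret": True, "description": "API key"},
        "start_date": {"type": ["string", "null"], "format": "date-time"},
        "page_size": {"type": ["integer", "null"]},
    },
    "required": ["api_key"],
}

settings, groups = schema_to_settings(example_schema)
print(settings)
print(groups)
```

The point of the sketch is that the translation lives in the hub, so connector maintainers would not need to upgrade anything when the hub's format changes.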

An Idea:

I've noticed that connector READMEs are usually a bit sparse, and sometimes I have a hard time figuring out what the source actually is (especially if the service is something vague like exact). If we use the cookiecutter to include a file with a standard set of required inputs (domain URL, description, label, logo image, etc.) we could do more to help auto-generate a better README for them. It's a win-win: they add a few pieces of metadata that we need and they get a better auto-generated README. Also, none of that info is specific to the hub.

Building on that idea, we need a way for them to include markdown text for things like general info or advanced settings that need more than a single-line description from the tap.py file. We could include something like the hub.yml (maybe call it docs.yml instead) to do this, as long as it's easy to include markdown in there. Then we'd be able to auto-generate their README using that file as input, and also use that file as input to the hub, so we avoid the issue of the hub and the README having different info.
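
A small sketch of what generating a README from such a file might look like. The field names (`label`, `description`, `domain_url`, `settings`, `sections`) are hypothetical, and the metadata is shown as a plain dict, as if it had already been loaded from docs.yml.

```python
def render_readme(meta: dict) -> str:
    """Render a README from connector metadata (hypothetical layout)."""
    lines = [f"# {meta['label']}", "", meta["description"], ""]
    if meta.get("domain_url"):
        lines += [f"Source: {meta['domain_url']}", ""]
    lines += ["## Settings", ""]
    for setting in meta.get("settings", []):
        lines.append(f"- `{setting['name']}`: {setting.get('description', '')}")
    # Free-form markdown sections, e.g. general info or advanced settings.
    for title, body in meta.get("sections", {}).items():
        lines += ["", f"## {title}", "", body.strip()]
    return "\n".join(lines) + "\n"


# Hypothetical metadata, for illustration only.
meta = {
    "label": "tap-example",
    "description": "Extracts data from the Example service.",
    "domain_url": "https://example.com",
    "settings": [
        {"name": "api_key", "description": "API key for the service"},
    ],
    "sections": {
        "Advanced settings": "Use **page_size** to tune request volume.\n",
    },
}

readme = render_readme(meta)
print(readme)
```

Because the same dict would feed both the README and the hub listing, the two could not drift apart.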

It sort of starts to remind me of the dbt docs blocks approach https://docs.getdbt.com/docs/collaborate/documentation#using-docs-blocks.

2 replies

tayloramurphy May 3, 2023
Maintainer Author

I really like the framing of helping them have an up to date README. Calling it docs.yml (or md or whatever) makes sense to me.

If we drink our own champagne here we can make our own READMEs really nice and have a goal for us to not need an override on the Hub for our first-party connectors.

But I think we'll always need a mechanism to have an override, plus some sort of automated reconciliation check that we run when we see updates to docs.yml.

tayloramurphy May 24, 2023
Maintainer Author

Stitch template https://gist.github.com/erinkcochran87/8280f0779e0e6a75314bfe0e80d53593

pnadolny13 · 2023-05-03T18:59:18Z

pnadolny13
May 3, 2023
Maintainer

Version Constraints

I agree that we need some sort of versioning mechanism so that users get the appropriate metadata based on their pinned package version. This has started to feel less pressing for me over time, but it's still important. Last time I brought it up I thought it was going to cause more problems than it did. I think that's because the life cycle of most taps is heavy development in the beginning, and after that the config/metadata structure doesn't change all that much even if the internals of the tap change a lot. On major version bumps, though, it does become a problem; for example, target-s3 refactored its whole config to support multi-cloud config options.

From my perspective we have 3 options:

1. Status quo, no versioning. The hub tries to list and serve the latest metadata.
2. Support versioning only for SDK metadata, with the Python package as source of truth. Non-SDK plugins don't get metadata versioning. The hub continues to try to display and serve only the latest metadata.
3. Support versioning for all metadata, with the hub as source of truth.

Option 1

Like I said above, it's feeling like less of an issue, especially with lock files being the recommendation, but for breaking changes it's painful. It's also hard to evaluate how bad of an issue this currently is; mostly these would cause failures that lead the user to think it's a tap issue.

Option 2

I've been noodling on this idea of solving versioning by bypassing the hub altogether. It's not a fully thought-out idea yet, but the concept is that a Meltano project pins a package version, and the package has the metadata for that exact version of the code (e.g. v0.3.2), so we can avoid the version issue completely by never relying on metadata that's tracked externally to the Python package. When someone goes to add an SDK plugin to their project, Meltano and the hub talk and decide that the metadata should come from the source of truth, aka the package itself, instead of the hub, which only represents the latest metadata (as of today). Maybe a lock file gets generated by Meltano directly calling tap-x --about --format=json and translating the output, or it creates some new artifact that keeps the data raw and doesn't translate it. Here's a related issue: meltano/meltano#7156. All non-SDK variants still use lock files as they do today, and versioned metadata for stability becomes yet another benefit of being on the SDK. If we care enough about non-SDK plugins then we'd still need to build a versioning mechanism in the hub. Personally I don't think legacy taps change enough for this to be worth it.
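
To make the "keep it raw" variant concrete, here is a minimal sketch. The `tap-x --about --format=json` call is the one mentioned above; the lock artifact layout, the function names, and the `name` and `version` keys assumed in the about output are illustrative assumptions only.

```python
import json
import subprocess


def fetch_about(executable: str) -> dict:
    """Run `<executable> --about --format=json` and parse the JSON output."""
    result = subprocess.run(
        [executable, "--about", "--format=json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def build_lock(about: dict, pip_url: str) -> dict:
    """Wrap the raw about output in a hypothetical lock artifact.

    Nothing is translated: the package's own metadata is stored as-is,
    so it always matches the installed version.
    """
    return {
        "plugin_name": about["name"],
        "version": about.get("version"),
        "pip_url": pip_url,
        "metadata_source": "package",
        "about": about,
    }


# Hand-written stand-in for real output, so the sketch runs without a tap.
sample_about = {
    "name": "tap-x",
    "version": "0.3.2",
    "settings": {"type": "object", "properties": {"api_key": {"type": "string"}}},
}

lock = build_lock(sample_about, "tap-x==0.3.2")
print(json.dumps(lock, indent=2))
```

In a real project the dict would come from `fetch_about("tap-x")` after the plugin is installed.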

The hub starts to go back to its roots as a place to discover plugins and their settings but doesn't try to be the source of truth for all connector metadata, in line with the changes we're discussing in #1205 (comment).

Option 3

Build a versioning mechanism within the hub. Meltano would need to have a more explicit understanding of version, because it's going to have a hard time requesting metadata for a version if the version needs to be parsed out of a long pip_url string blob. Once Meltano knows the exact version that is pinned, it can call the hub API with that as a parameter, and the hub would need to resolve that version to a version range associated with metadata files. I think we could use versioning syntax like Poetry does (>=0.4.0,<1.0.0), and behind the scenes the hub would have multiple metadata files for version ranges. This ends up requiring the community to be good about semantic versioning, so that breaking changes only happen in major releases and the hub can detect when to split out a new metadata file. If the community needs to be good about this, then we need really well-documented processes and recommendations that explain how to version correctly (I know there are semantic versioning docs already, but not as they relate to tap metadata specifically). Maybe we'd want to define a standard for the hub so packages can't be listed until they have real releases, although it would be hard to bring the existing ecosystem up to this standard.

My opinion right now would be for Option 2. It bypasses a lot of complex work in the hub and will likely be more accurate. It does require more complexity on the Meltano side to understand raw SDK output, but I don't think that's the worst.
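
A sketch of the resolution step in Option 3, assuming plain numeric versions and comma-separated constraints in the >=0.4.0,<1.0.0 style. The file names and helper functions are hypothetical, and a real implementation would more likely use a proper PEP 440 parser than this hand-rolled comparison.

```python
import operator
import re
from typing import Optional

OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text: str) -> tuple:
    """Turn '0.4.0' into a comparable tuple, ignoring trailing zeros."""
    parts = [int(p) for p in text.strip().split(".")]
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def pinned_version(pip_url: str) -> Optional[str]:
    """Extract an exact '==' pin from a pip_url, if there is one."""
    match = re.search(r"==\s*(\d+(?:\.\d+)*)", pip_url)
    return match.group(1) if match else None


def satisfies(version: str, constraint: str) -> bool:
    """Check a version against a constraint such as '>=0.4.0,<1.0.0'."""
    for clause in constraint.split(","):
        match = re.fullmatch(r"\s*(>=|<=|==|>|<)\s*(\d+(?:\.\d+)*)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported constraint clause: {clause!r}")
        op, bound = match.groups()
        if not OPS[op](parse_version(version), parse_version(bound)):
            return False
    return True


def resolve_metadata(version: str, files: dict) -> Optional[str]:
    """Return the first metadata file whose version range matches."""
    for constraint, filename in files.items():
        if satisfies(version, constraint):
            return filename
    return None


# Hypothetical metadata files keyed by version range.
metadata_files = {
    ">=0.1.0,<0.4.0": "tap-x--v0.1.json",
    ">=0.4.0,<1.0.0": "tap-x--v0.4.json",
    ">=1.0.0": "tap-x--v1.json",
}

print(pinned_version("tap-x==0.4.2"))                           # 0.4.2
print(pinned_version("git+https://example.com/tap-x.git@main"))  # None
print(resolve_metadata("0.4.2", metadata_files))                # tap-x--v0.4.json
```

The second call shows the parsing problem described above: a pip_url without an exact pin gives Meltano no version to send to the hub.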

1 reply

tayloramurphy May 3, 2023
Maintainer Author

Something like Option 2 seems like a happy path here. We rely on the SDK to bring a ton of functionality to the connectors.

For non-SDK connectors it's likely easier to re-write it on the SDK than it would be to do all of the versioning on the Hub side.

pnadolny13 · 2023-05-04T19:53:19Z

pnadolny13
May 4, 2023
Maintainer

@kgpayne @edgarrmondragon I mentioned this in the guild meeting today but I'm curious what your thoughts are.

0 replies
