Metadata service, client, UI overview #4
Another big question that is coming up is how to treat code snippets. While I list them above as metadata, we may want to have a separate code snippet extension in JupyterLab.
Some notes from our meeting: The primary initial use case is one JupyterLab server that has multiple simultaneous users. One person opens a CSV file and adds some metadata about it, like that it has an author or something, and then another user who has the file open sees this field when they look at the metadata for that file. We discussed using files instead of GraphQL to store this, i.e. one file per metadata object, but then discussed how we couldn't have two users edit the same object at the same time; one would get clobbered. Resources:
Let me just say I'm really happy about the direction with GraphQL in Jupyter. As far as tooling is concerned, I'd prefer to use the really well supported node.js backend tooling for GraphQL. It also helps that you can share types from backend to frontend. This likely means having to use something like nbserverproxy though.
Me too. Do you have any suggestions for key packages that would be helpful to look at? Subscriptions seem pretty useful, and I was looking at this: https://github.com/prisma/prisma-examples/tree/master/typescript/graphql-subscriptions EDIT: looks like the recommendation is to move to Apollo Server 2 instead of using yoga dotansimha/graphql-yoga#449 (comment)
Yeah, I'd recommend Apollo Server. @captainsafia and I have been working on a new server that provides a GraphQL API for managing communication between a Jupyter kernel and clients.
You can find the code for this work at https://github.com/nteract/nteract/tree/master/packages/kernel-relay. It's designed to provide more interaction-based endpoints as opposed to resource-based endpoints in REST APIs. For example, I want to launch a kernel, I want to subscribe to the status of a kernel, I want to execute this code snippet, etc.
💯 to apollo's front end stack.
+0 on hub deployments, as we've already got configurable-http-proxy there. 👎 to nodejs on the single-user server. I'll toss out this prototype based on graphene, which, while having some growth challenges, in 2019 is still a lot more supportable on end-user machines than nodejs. Also, if we get too far down the "reference implementations of frontend and backend can share code" path, there's a pretty good chance there will never be another implementation of either. Stepping back from either implementation, the alternative would be to have a canonical:
Given an (opinionated) GraphQL schema derived from a json-ld context, I think this gives us the most robustness. There isn't a lot of precedent (or support) for combining un-coordinated GraphQL type libraries at present, and I'd really hate it if this feature set regressed on the hackability of our other tools: if a community wants to add and search by their microscope metadata schema, I don't want the answer to be "fork" because it doesn't "fit" in "schema.org" or "Jupyter's GraphQL thing".
One thing to keep in mind is that I believe we would need graphql subscriptions support. In my last chat with Brian, one of the big drivers of using graphql is the ability for two users to be editing the same metadata at the same time and to have the changes mirrored between the two. Subscriptions would allow each client to get notified with the updated data when another client makes a change, I believe (I am not a GraphQL expert, someone correct me if you don't need subscriptions for this use case). They aren't included in Graphene at the moment (graphql-python/graphene#430 (comment), graphql-python/graphene#393), but a few people have added custom support to use Django's channels with graphene: https://github.com/eamigo86/graphene-django-subscriptions, https://github.com/datadvance/DjangoChannelsGraphqlWs
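For concreteness, here's a minimal sketch of what that could look like with Apollo Server 2's built-in PubSub. The type, field, and topic names here are made up for illustration, not a proposed schema:

```typescript
import { ApolloServer, gql, PubSub } from "apollo-server";

const pubsub = new PubSub();
const METADATA_CHANGED = "METADATA_CHANGED";

const typeDefs = gql`
  type Metadata {
    path: String!
    body: String!
  }
  type Query {
    metadata(path: String!): Metadata
  }
  type Mutation {
    setMetadata(path: String!, body: String!): Metadata!
  }
  type Subscription {
    metadataChanged: Metadata!
  }
`;

const resolvers = {
  Query: {
    metadata: () => null, // persistence elided in this sketch
  },
  Mutation: {
    setMetadata: (_: unknown, args: { path: string; body: string }) => {
      // Publishing here is what pushes the change to every subscribed client.
      pubsub.publish(METADATA_CHANGED, { metadataChanged: args });
      return args;
    },
  },
  Subscription: {
    metadataChanged: {
      subscribe: () => pubsub.asyncIterator(METADATA_CHANGED),
    },
  },
};

new ApolloServer({ typeDefs, resolvers }).listen();
```

Each client that runs `subscription { metadataChanged { path body } }` over a websocket would get pushed the new value whenever any other client runs the mutation.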
TBH I am not familiar with what a JSON-LD context really is. I will have to investigate that. I get scared whenever I see RDF/SPARQL because it's a whole other world that I don't know much about. Do we keep a bunch of RDF files on disk? Run a SPARQL server? AFAIK the initial work on the metadata service isn't about using graphql or schema.org for notebooks themselves. Instead, my understanding is that it's more like a separate context microservice. You might have comments on a notebook, and those would be stored persistently with some reference to the notebook, like its file path and the cell ID of the comment.
Thanks everyone. One of the benefits of going with GraphQL is that in the long term, the details of the server implementation, persistence mechanism, python/node.js, etc. are less important than the protocol, GraphQL schema, etc. I would love to be able to use Python for this, but I think that today it makes sense to use the approach that enables us to explore and iterate quickly on the schema and queries, with a solid GraphQL implementation that won't get in our way. To me that suggests starting with Apollo. I do think that jsonld will be important in the long term, but I don't think we need to tackle that starting out.
Sure, but we have a lot of shipped stuff that looks like complex, potentially evented graph data models! Also, if we're talking about putting any of this stuff into nbformat itself, it's worth thinking about how these things might play.
GraphQL isn't going to solve the conflict-resolution problem, but it's great that it has a spec for delivering changes and multiple implementations. On that note, I added subscriptions to contents on that prototype based on this PR:
Like it or lump it: if we're buying schema.org and web annotation, we're getting JSON-LD.
Those two contexts are how an SEO consortium and a standards body, respectively, see parts of the world and name things, and they do have some incompatibilities (17, but some are spurious). A key, relevant distinction is that both define...
Like any other JSON, you can treat JSON-LD as a serialization format, (de-)normalize it, and then store it however you want, or accept its graph nature (and maybe also normalize it, unless your store speaks JSON-LD directly). Taking the former route, flat files first is great (see FileContentsManager)! However, once you have multiple writers, it's hard to ignore Postgres (BSD-like), with LISTEN for events and the ability to store and query JSON. couchdb (APL) was super fun back in the day, and would probably still work great. Taking the latter route, the most cross-platform graph database I've used is virtuoso (GPL). RedisGraph (APL+CrazyClause), edgedb (APL), and gundb (APL) are new and fast and cool. All of them let you do things that can be pretty tricky with an RDBMS, but no two of those suggestions use the same query language natively. But yeah, the major languages with multiple implementations for arbitrary, potentially circular graph traversal are probably SPARQL and Gremlin.
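To make the Postgres route concrete, a sketch using the node `pg` client with LISTEN/NOTIFY; the channel name and payload shape are made up for illustration:

```typescript
import { Client } from "pg";

async function main() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  // Subscribe to a channel that writers NOTIFY on after each metadata change.
  await client.query("LISTEN metadata_changed");
  client.on("notification", (msg) => {
    // NOTIFY payloads are plain strings; we assume writers send JSON here.
    const change = msg.payload ? JSON.parse(msg.payload) : null;
    console.log("metadata changed:", change);
  });

  // A writer elsewhere would run something like:
  //   NOTIFY metadata_changed, '{"path": "data.csv", "field": "author"}'
}

main().catch(console.error);
```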
So should we create a separate server for metadata, with each user specifying the metadata server manually (like in a settings page)?
In response to @saulshanabrook regarding subscriptions not being available in Graphene at the moment, I have had some success using GraphQL_WS in conjunction with AIOHTTP and Graphene for subscriptions. The support from that project seems seriously finicky in some areas, with the Flask implementation seemingly leaking all sessions, but the AIOHTTP variant seems to work acceptably in my testing.
Right, my finding was that GraphQL_ws is pretty much good to go. Thanks for the corroboration!

Luckily, tornado's model almost exactly matches other async models, without any of the gevent black (green?) magic.
While I'll agree the GraphQL implementation doesn't really matter, I'm just going to highlight the challenges end (science, business, education) users have been having in even installing a sane environment for JupyterLab's hidden node dependency, much less running its sole function (webpack). If a node-based deployment can be gotten down to a yarn-like single file, such as the one bundled with lab, never even touches npm, and is installable with pip, it _could_ be viable for the average user.
So back to my original point: starting with a versioned...

- GraphQL SDL
- serialization JSON schema
- black-box conformance suite

...will concentrate the discussion on the types, and not on language/vendor-specifics. If the reference implementation (and test suite) is node-based, so be it!
@xmnlab Yeah, I think that would be a fine way to do it. As a default, it might make sense to proxy a metadata server through the notebook server (using something like https://github.com/jupyterhub/jupyter-server-proxy) and start it with the notebook. I see the ideal default workflow being:
I think starting out with the metadata server running separately and having the user input the address in the client makes sense. Then we could always add the proxy later and add a default. |
@xmnlab It looks like with the Apollo server you define your mutations and then provide a custom javascript function for each definition. So at first, it seems like we could just store everything in memory if that's easiest in the server (https://blog.apollographql.com/react-graphql-tutorial-mutations-764d7ec23c15#f370)
@saulshanabrook very nice! I think in-memory should work for now! I will take a look into that, thanks!
Yeah in-memory is great as you can focus on the API you're exposing via the schema for the types, queries, mutations, and subscriptions. |
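As a sketch of what that in-memory starting point might look like (the `Metadata` type and its fields are placeholders, not a settled schema):

```typescript
import { ApolloServer, gql } from "apollo-server";

// "Persistence" is just a map from file path to a metadata object.
const store = new Map<string, { path: string; author: string }>();

const typeDefs = gql`
  type Metadata {
    path: String!
    author: String
  }
  type Query {
    metadata(path: String!): Metadata
  }
  type Mutation {
    setAuthor(path: String!, author: String!): Metadata!
  }
`;

const resolvers = {
  Query: {
    metadata: (_: unknown, { path }: { path: string }) => store.get(path) ?? null,
  },
  Mutation: {
    setAuthor: (_: unknown, { path, author }: { path: string; author: string }) => {
      const entry = { path, author };
      store.set(path, entry);
      return entry;
    },
  },
};

new ApolloServer({ typeDefs, resolvers })
  .listen()
  .then(({ url }) => console.log(`metadata server ready at ${url}`));
```

Swapping the map out for files or a database later wouldn't change the schema at all, which is the point of starting this way.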
I am trying to connect the Apollo GraphQL server with the JupyterLab extension using jupyterlab-server-proxy... but something is not clear to me. The example in its documentation assumes that the user already has the server installed; in our case we need to install the Apollo GraphQL server (on nodejs). What is the recommended approach?
Let's not have any more runtime npm installation hijinks, please! As a user (and admin) I'd like to be able to "pip" or "conda" install "jupyter-metadata-server", get on an airplane, start the thing, and do some annotating or local dataset browsing.

As mentioned earlier, the portable, single-file yarn redistributed with jlab is a great approach, and possibly the only sane one. This has the nice side effect of ferreting out huge, hidden, barely-managed binary dependencies (I'm looking at you, puppeteer).

I think once that happens, the server extension I pip installed above would take care of depending on and setting up the proxy, as well as starting the server.

Once you add one or more federated endpoints, it should just be a single jupyter_notebook_config.json change, e.g.

```json
{
  "MetadataManager": {
    "metadata_providers": {
      "<local>": {"enabled": true},
      "https://hub.example.com/graphql": {"enabled": true}
    }
  }
}
```

(with `<local>` being the default provider, which can be disabled here). Of course, if it's one and done, local OR a single remote, it can be easier, but I'd still like us to try to support the n-providers case, as it's one of the things GraphQL can do very well.
In the meeting today, Brian was articulating a metadata explorer UI that exposes the links between objects. For example, if you have a dataset, you should be able to click on its author and be taken to a person page that describes that author and also shows all the datasets that list them as an author. So we need to implement a generalized linked-data display and edit mechanism. This is pretty huge; it's like a generic ORM that is flexible enough to handle many different types of linked data. Does anyone have examples of existing tools that do this? With schema.org types of data? My experience with this sort of thing has been with the auto-generated admin in the Django web framework. You specify your models and it will create custom edit pages that work with the relational data. In that framework, it isn't all auto-generated though; you have to do a lot of manual work to tell the admin the best way to show the fields. Of course, for an MVP we could hard code in some set of fields and relations and hard code in the UI and all the proper editing capability. But it seems like in the end we need to support all the types and relationships in schema.org.
One approach would be very JSON-forward, using something like rjsf to drive the rendering in the browser. The nice thing about this approach is it would scale to vocabularies other than schema.org. The shortcoming of this approach is that you can end up with the "Jenkins" problem. The other tack to take would be to repurpose the data explorer itself to do this kind of display.
@ian-r-rose had started a demo of using rjsf over on #5892. Note there are some current limitations that conflict with some stylistic choices made in our schema implementations. Adopting it, and building some momentum behind modelling schema that's not only rigorous but also captures the user value of the data at hand, seems very powerful... as an extension author, if I can delegate some complex but predictable UI over to a core feature, I'd do it in a heartbeat vs writing a bunch of form stuff myself.
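For anyone who hasn't tried it, this is roughly all it takes to get a form out of rjsf; the schema below is a stand-in for illustration, not the one from that demo:

```typescript
import * as React from "react";
import * as ReactDOM from "react-dom";
import Form from "react-jsonschema-form";

// rjsf generates the inputs, labels, and validation from the JSON Schema alone.
// Typed loosely here to keep the sketch short.
const schema: any = {
  title: "Dataset",
  type: "object",
  required: ["name"],
  properties: {
    name: { type: "string", title: "Name" },
    author: { type: "string", title: "Author" },
  },
};

ReactDOM.render(
  React.createElement(Form, {
    schema,
    onSubmit: ({ formData }: { formData: unknown }) =>
      console.log("submitted:", formData),
  }),
  document.getElementById("root")
);
```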
How do schema.org and JSON schema compare in expressiveness? Is one a superset of the other? schema.org Data Model:
This is most clear to me in their JSON LD representation. It also helped me to reference the "meta" schema. Maybe the better question is: "How should JSON LD and JSON Schema relate to one another?" Because it seems that JSON LD is able to represent schema.org well, but also other schemas. Here is a link to a discussion about JSON LD and JSON Schema, not sure about the conclusion: https://github.com/json-schema-org/json-schema-spec/issues/309
The two are orthogonal. They share the property that they are both expressed, unsurprisingly, in JSON. Both can be applied to JSON documents that are unaware of the schema/context. Both have implementations in many languages, but given there are multiple versions of each of them, there can be implementation differences.

A JSON Schema describes what a document must/can look like: e.g. the top-level document must specify a property "type", and the value must be "Dataset".

A JSON-LD context describes what a document's values actually _mean_: e.g. "type" means "http://www.w3.org/1999/02/22-rdf-schema#type", and "Dataset" means "https://schema.org/Dataset".

LD generally doesn't care about the actual structure unless explicitly told to: for example, type can be a list or a single value. It used to be that you couldn't have multiple meanings of the same term in the same document (e.g. blood type and MHC type in one medical record), but as of 1.1, if you _do_ know something about the structure, you can handle that without changing the legacy schema.

Where the two intersect is framing: if done properly, you can take arbitrarily-shaped JSON, get its meaning out, and put it back into a shape that your UI/algorithm/database needs.
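A tiny side-by-side might make the distinction concrete (both snippets are illustrative, not from any real schema):

```typescript
// JSON Schema: constrains the *shape* of a document.
const datasetSchema = {
  type: "object",
  required: ["type"],
  properties: {
    type: { const: "Dataset" }, // the top-level "type" property must be "Dataset"
    name: { type: "string" },
  },
};

// JSON-LD context: assigns *meaning* to the same keys.
const datasetContext = {
  "@context": {
    type: "@type", // "type" plays the role of rdf:type
    Dataset: "https://schema.org/Dataset",
    name: "https://schema.org/name",
  },
};

// The same plain document can satisfy both: the schema validates its shape,
// while the context tells a JSON-LD processor what its values denote.
const doc = { type: "Dataset", name: "Sea surface temperatures" };
```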
Also found this: https://github.com/vazco/uniforms, which can infer forms from JSON _and_ GraphQL schemas. No doubt it lacks support for some of the more weird edge cases that _either_ schema system has, but it seems nicely put together.
Well, this seems freakishly useful: https://github.com/google/react-schemaorg

Based on: https://github.com/google/schema-dts

You're not going to be putting any ANNO terms (or any other vocabularies) on there out of the box, and it's a bit odd that it's driven off the N-Triples representation rather than the JSON-LD one (not a fun parser to get right), but heck, they did the work, and it looks great!

Of course, one of the big ideas is that the physical data representation doesn't have to know about any @-this and @-that: plain, but clean and canonical, stuff is great.
My bad: you can generate with whatever context you want: https://github.com/google/schema-dts/blob/master/src/cli/args.ts

But it still only parses the vocabulary from s.d.o: https://github.com/google/schema-dts/blob/master/src/triples/reader.ts

This suggests it might be possible to make a jupyter-dts from our context, and however we reconcile the differences between s.d.o and anno, but it might require a fork.
We are working on this WIP PR: #6. We added a jupyter-server-proxy lib that ships all the js files and installs the graphql server into jupyterlab using pip.

It seems interesting. Do you have a suggestion (or reference) about how to change the jupyterlab config file? Or should it be something like: just read the file, add/update the data, and write it back?
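In the absence of a dedicated API, a plain read-modify-write sketch would look something like this (the path and the `MetadataManager` key follow the config example earlier in the thread; a real server extension may well have better hooks for this):

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

const configPath = path.join(
  os.homedir(),
  ".jupyter",
  "jupyter_notebook_config.json"
);

// Read the existing config (or start fresh if the file doesn't exist yet).
const config = fs.existsSync(configPath)
  ? JSON.parse(fs.readFileSync(configPath, "utf8"))
  : {};

// Merge in our settings without clobbering unrelated keys.
config.MetadataManager = {
  ...(config.MetadataManager ?? {}),
  metadata_providers: { "<local>": { enabled: true } },
};

fs.writeFileSync(configPath, JSON.stringify(config, null, 2));
```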
I actually just submitted google/schema-dts#14, which was requested by a few people, to support a totally custom vocabulary altogether, so this might help. It has a few limitations described in the PR (takes a single URL rather than a set of layers, expects schema.org DataTypes, and, probably the most limiting, still expects a "Thing" type to be defined). Depending on the use cases people envision for this, however, I'm happy to accept a PR or just a specific feature request.
Lots of action here! @Eyas That's very cool stuff! I hope that PR makes it through! A brief introduction: we want to use schema.org. We're also likely going to need some novel terms. We also need the rigor of the W3C Annotation Vocabulary to provide rich commenting (all the selectors, etc.)... and probably need to extend them, too. This will be trickier, as that's OWL-based. Tying these together, we have a few known...
I'll think a bit about those limitations, as I guess we'd need to think about it some...
Given this use case, one improvement that might work for your use case is extending the CLI to request .nt files from two URLs, layering the triples of each on top of one another. That actually has general use cases in the Schema.org realm, where you might want to use the "basic" schema.org definitions along with only the life-sciences extension. Right now the CLI allows you to either pick the pre-flattened all-layers file or the basic file, but not individual layers.
Right, this sounds about like what I was imagining. We'd probably want a configurable set of layers. I guess I do have to ask the question of why parse triples instead of doing JSON-LD directly? Further, we'd want everything checked in and static (or at least submoduled). Could the CLI support a config file? Ours would end up being something like:

```json
{
  "reader": [
    {
      "url": "schema:version/3.4/all-layers.jsonld"
      # some more stuff
    },
    {
      "url": "https://dvcs.w3.org/hg/prov/raw-file/tip/ontology/prov-o.ttl"
    },
    {
      "url": "github:w3c/web-annotation/raw/gh-pages/vocab/wd/ontology/oa.jsonld",
      # some more stuff
    }
  ],
  "generator": {
    "output": "dist"
  }
}
```
Makes sense. Biggest gaps seem like rdfs:range, rdfs:domain? Should be straightforward.
A config file makes sense. The CLI npm package can also be included, by the way, and individual functions (e.g. WriteDeclarations) can be imported and called in whatever custom way you want. I agree a config file for the CLI makes sense though.
I had two implementations for this and ended up sticking with .nt for a few reasons. Parsing triples is weird, but it's pretty straightforward from there. The nice thing about triples once they're parsed is that they're very composable and pretty close to the metal as far as what relations they're describing. While it's nice to take JSON-LD in, it's not particularly more ergonomic to handle its "@graph" definitions (you could iterate over keys, at which point you might as well have parsed triples). You'll also need to resolve "@context", etc. Nothing is inherently hard; it just didn't seem worthwhile in terms of trade-offs. Happy to reconsider if things change.
A few critiques about points made above. There seems to be confusion between metadata vocabulary and format. Schema.org and others related to it (e.g., VoID, SAGE, etc.) are controlled vocabularies that help define entities and relationships for representing metadata about datasets and their use cases. Generally a project will define a local vocabulary which links to Schema.org or others, and so by their nature these are extensible -- you really never need to just pick one. Historically, there are several formats used and many of the popular open source libraries are quite good at converting between formats. JSON-LD is a good format for use cases that need to be machine readable, while Turtle is arguably the most compact format for use cases that need to be human readable, although it's trivial (~2 lines of Py) to convert between them. The popular upper ontologies such as Library of Congress, DBpedia, etc., will tend to use SKOS for organizing what's being represented -- although ultimately SKOS is built atop OWL, which is built atop RDF, so again these all interoperate well.
It's curious why GraphQL is being used here for metadata services -- other than perhaps that it already has use elsewhere in Jupyter? While there's a notion of a "knowledge graph about datasets" in the long-term roadmap here, that's definitely not what the first part of the "GraphQL" name implies :) It's a protocol for services, as an alternative to REST, gRPC, etc., and especially good when, hypothetically, the same corporate entity controls the release cycles for both client and server and they're interested in optimizing API overhead for gazillions of ads served daily. However, GraphQL seems more about data served as trees and lists than about graphs; do you see any graphs in its examples? Also, it's not particularly good for serving metadata. For machine-readable metadata, JSON-LD would be the most likely choice, and there is already precedent, e.g., JSON-LD markup in web pages used for metadata that search engines need. It would be much simpler for the consumers of this metadata service if it were a JSON-based service.
Thanks Paco for explaining this. I started to sketch out a client-side API that would be vocabulary agnostic here. A "Metadata Provider" would have to implement a small interface (see the sketch below). In that proposal, GraphQL would not be part of the core API, although you would be free to implement a provider that uses GraphQL.
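As a rough illustration of the shape such a provider might take (the names and signatures below are hypothetical, not the linked proposal's actual API):

```typescript
interface MetadataProvider {
  /** Return the metadata objects this provider knows about for a resource URL. */
  get(url: string): Promise<object[]>;

  /** Create or update a metadata object attached to a resource URL. */
  put(url: string, metadata: object): Promise<void>;

  /** Optionally push changes made by other clients; returns an unsubscribe function. */
  subscribe?(url: string, onChange: (metadata: object[]) => void): () => void;
}
```

A GraphQL-backed provider would implement this with queries, mutations, and subscriptions under the hood, while a flat-file provider could implement the same three methods against the contents API.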
In terms of open standards for metadata services, it's odd not to find any mention here of Egeria https://egeria.odpi.org/ and the ASF + ODPi work on open source and open standards for that. Has there been any discussion of having an adapter? https://egeria.odpi.org/open-metadata-publication/website/open-metadata-integration-patterns/ |
Thank you @saulshanabrook I'll add comments over there regarding use cases for vocabularies. |
The other general observation here is that there's been a lot of discussion about using standards and tools that come out of open data, as a guide for how to structure this metadata service for JupyterLab. Those are good to leverage, but they aren't definitive. One caveat is that so many of the use cases for Jupyter will not use open data. Instead, it's probably best to go into this with a notion of "tiered access":
So it's important to keep that distinction between open metadata and open data. |
I am going to close this for now, since we have a solid base for read-only viewing of metadata.
This is an issue that provides an overview of the proposed metadata service, client, and UI being developed in this repo.
Background
Entities in the JupyterLab universe (notebooks, text files, datasets, models, visualizations, etc.) often have rich context and metadata associated with them. Examples include:
This rich context is incredibly useful to groups of people working with code and data. The goal of this work is to build a metadata architecture that enables Jupyter users to collaboratively create and explore metadata for any entity in the Jupyter universe.
What metadata standard
We have considered a number of existing metadata standards, and the one that is emerging as a top candidate is that of https://schema.org/. It appears to be rich enough to describe the different types of metadata we encounter in the Jupyter universe. In talking with potential users of this system, that flexibility seems to be important.
See jupyterlab/jupyterlab#5733 for additional discussion about metadata schema.
Implementation
The current proposal is to create a Jupyter notebook server extension that is a GraphQL service for the relevant subset of the schema.org metadata schema. We haven't worked through what subset of the schema is relevant for Jupyter, but it will probably be the document and data related ones (probably won't start with things like https://schema.org/FlightReservation).
The usage of GraphQL is important because we imagine a wide range of complex UIs being built to display and edit this highly nested metadata. Being able to get back and edit rich data in single queries will be really helpful on the frontend.
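For example, a single query could fetch a dataset together with its nested author metadata, where a REST client would need several round trips. The schema here is hypothetical, not settled:

```typescript
import gql from "graphql-tag";

// One round trip returns the dataset and the linked author record together.
const DATASET_WITH_AUTHOR = gql`
  query DatasetWithAuthor($id: ID!) {
    dataset(id: $id) {
      name
      description
      author {
        name
        affiliation
      }
    }
  }
`;
```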
We haven't decided if this notebook server extension will be written in Python or node.js (or both), but it shouldn't really matter.
For a client, we are imagining a TypeScript-based library that provides a thin, well-typed API for talking to the service.
The notebook server and TypeScript client library should be entirely independent of JupyterLab and useful outside of it.
Finally, we plan on creating a JupyterLab extension that offers a user experience for editing and viewing metadata for entities in JupyterLab. Initial work will focus on notebooks, datasets, and text documents.
Initially, this repo will contain our explorations of the notebook server, TypeScript client, and JupyterLab UI extension, but these may be separated out over time.