LODVectors
- Toshiaki Katayama (D3SPARQL, lod4ml, UI)
- Shuichi Kawashima (lod4ml)
- Jerven Bollemann (UniProt querying)
- Michel Dumontier (SMART API, Elasticsearch index/vector search)
- Robert Hoehndorf (backend)
Determine if Knowledge Graph Embeddings can enable novel, useful applications in the Semantic Web.
https://github.com/bio-ontology-research-group/lodvectors/
http://biohackathon.org/d3sparql/d3lod4ml.html
(Please do not run large queries on this; please don't use during presentation)
Use some high-level Gene Ontology biological process (BP) classes as labels (sensory perception, glycosylation, apoptosis, behavior):
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl:<http://www.w3.org/2002/07/owl#>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX go:<http://purl.obolibrary.org/obo/GO_>
PREFIX eco:<http://purl.obolibrary.org/obo/ECO_0000>
SELECT DISTINCT ?protein ?golabel ?proteinlabel
WHERE
{
  ?protein a up:Protein ;
           up:organism taxon:9606 ;
           rdfs:label ?proteinlabel .
  {
    {
      SELECT ?protein ?golabel
      WHERE {
        ?protein a up:Protein ;
                 up:organism taxon:9606 .
        VALUES ?go { go:0006915 go:0070085 go:0007600 go:0007610 }
        {
          ?protein up:classifiedWith ?go .
          ?attribution rdf:object ?go .
        } UNION {
          ?protein up:classifiedWith ?term .
          ?term rdfs:subClassOf ?go .
          ?attribution rdf:object ?term .
        }
        ?attribution rdf:subject ?protein ;
                     up:attribution/up:evidence ?evidence .
        FILTER(?evidence IN (eco:269, eco:314, eco:353, eco:315, eco:316, eco:270, eco:250))
        ?go rdfs:label ?golabel .
      } LIMIT 1000
    }
  } UNION {
    {
      SELECT ?protein ("UNKNOWN" AS ?golabel)
      WHERE {
        ?protein a up:Protein ;
                 up:organism taxon:9606 .
        FILTER(NOT EXISTS {
          {
            ?protein up:classifiedWith ?go .
            ?attribution rdf:object ?go .
          } UNION {
            ?protein up:classifiedWith ?term .
            ?term rdfs:subClassOf ?go .
            ?attribution rdf:object ?term .
          }
          ?attribution rdf:subject ?protein ;
                       up:attribution/up:evidence ?evidence .
          FILTER(?evidence IN (eco:269, eco:314, eco:353, eco:315, eco:316, eco:270, eco:250))
        })
      } LIMIT 50
    }
  }
}
API Documentation is available here
Limitations: many (it's a hackathon)!
Known bugs include:
- the head section of the SPARQL JSON results is incomplete
- datatypes are not properly typed in SPARQL results
- very slow for large result sets (>1500 rows)
- still slow for smaller result sets
At last year's Hackathon, a group of biohackers started to build walking-rdf-and-owl (https://github.com/bio-ontology-research-group/walking-rdf-and-owl) to generate vector representations for nodes (and properties) in RDF/OWL knowledge graphs. These vectors represent nodes and their context, and can be used in machine learning models for edge prediction and other applications. However, in the Linked Open Data cloud, there are many more potential applications for such vectors. For example, they may allow us to expand query results by including not only the nodes that directly match a (SPARQL) query, but also nodes that are similar (within a certain threshold) to the query results. We may also use dimensionality reduction techniques to project result sets onto a 2D space and use this projection to visualize a set of nodes.
Our aim was to explore some of the novel applications that vector-based representations of nodes in Linked Data graphs might enable. We focus on visualization of dataset characteristics and similarity-based search.
We used the vectors generated from (human) UniProt entries by the lod4ml project.
We use the t-SNE method to project a set of vectors into 2D space so that they can be plotted. If labels are provided for the nodes (and their corresponding vectors), the labels can be visualized through node colors.
For visualization, we use the D3SPARQL Scatterplot package.
Similarity is computed using cosine similarity between vectors.
We hold vectors in an Elasticsearch instance with the Elastic Vector Scoring plugin. Cosine similarity and retrieval of most similar vectors is performed through an Elasticsearch query.
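To make the similarity search concrete, here is a minimal pure-Python sketch of cosine similarity and a top-k lookup with a threshold. It only illustrates the computation; it is not the project's Groovy code or the Elasticsearch plugin query. The function names, the `threshold` parameter, and the toy identifiers and vectors are assumptions made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def most_similar(query_id, vectors, k=3, threshold=0.0):
    """Return up to k (node_id, score) pairs, best first, scoring at least threshold."""
    query = vectors[query_id]
    scored = [
        (node_id, cosine_similarity(query, vec))
        for node_id, vec in vectors.items()
        if node_id != query_id  # do not return the query node itself
    ]
    scored = [(node_id, score) for node_id, score in scored if score >= threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional toy vectors; real embeddings typically have many more dimensions.
toy_vectors = {
    "protein:A": [1.0, 0.0, 0.0],
    "protein:B": [0.9, 0.1, 0.0],
    "protein:C": [0.0, 1.0, 0.0],
    "protein:D": [-1.0, 0.0, 0.0],
}

neighbours = most_similar("protein:A", toy_vectors, k=2, threshold=0.5)
print(neighbours)
```

With these toy vectors, only `protein:B` passes the 0.5 threshold: `protein:C` is orthogonal to the query and `protein:D` points in the opposite direction. A linear scan like this is fine for a demonstration; delegating the scoring to an index, as described above, avoids shipping every vector to the client.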
We used the smartAPI specification to generate an OpenAPI v3 compatible description file and added this file to the v3 compatible smartAPI registry. The lodvectors API can now be opened and tested with the Swagger UI. Try out the example query or just see the response.
The response is structured in two parts: meta and results. The meta field contains six elements:
"api:meta": {
  "api:URLcalled": "http://...",
  "prov:wasGeneratedBy": "https://github.com/bio-ontology-research-group/lodvectors/blob/master/tsne.groovy",
  "prov:generatedAt": "2017-09-15T02:15:04+0000",
  "api:errors": [],
  "api:warnings": ["No results found."],
  "api:resultCount": 0
}
Additionally, we developed a JSON-LD context file in order to expand this JSON response into triples.
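As a rough illustration of how a client might inspect the meta block shown above before touching the results, the following Python sketch checks for errors and warnings and reads the result count. The helper name `summarize_meta` and the choice to treat a response as usable when `api:errors` is empty are assumptions for this example, not part of the API description. The payload copies the meta block above and wraps it in braces so that it parses as JSON.

```python
import json


def summarize_meta(response):
    """Return (ok, result_count, messages) from a response's "api:meta" block."""
    meta = response.get("api:meta", {})
    errors = meta.get("api:errors", [])
    warnings = meta.get("api:warnings", [])
    count = meta.get("api:resultCount", 0)
    ok = not errors  # assumption: an empty error list means the call succeeded
    messages = [f"error: {e}" for e in errors] + [f"warning: {w}" for w in warnings]
    return ok, count, messages


# Example payload mirroring the meta block above.
raw = """
{
  "api:meta": {
    "api:URLcalled": "http://...",
    "prov:wasGeneratedBy": "https://github.com/bio-ontology-research-group/lodvectors/blob/master/tsne.groovy",
    "prov:generatedAt": "2017-09-15T02:15:04+0000",
    "api:errors": [],
    "api:warnings": ["No results found."],
    "api:resultCount": 0
  }
}
"""

ok, count, messages = summarize_meta(json.loads(raw))
print(ok, count, messages)
```

Checking `api:resultCount` and `api:warnings` first lets a caller distinguish a successful but empty answer from a failed request.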
Most of the code is implemented in Groovy and is available at https://github.com/bio-ontology-research-group/lodvectors under the 2-clause BSD license. The code relies on the GroovySparql package, Elasticsearch and Elastic Vector Search, t-SNE-JAVA, and Jetty as a container for servlets.
The API endpoints are