Robert Hoehndorf edited this page Sep 15, 2017 · 26 revisions

What to do with LOD vectors?

Participants

  • Toshiaki Katayama (D3SPARQL, lod4ml, UI)
  • Shuichi Kawashima (lod4ml)
  • Jerven Bollemann (UniProt querying)
  • Michel Dumontier (SMART API, Elasticsearch index/vector search)
  • Robert Hoehndorf (backend)

Aim

Determine if Knowledge Graph Embeddings can enable novel, useful applications in the Semantic Web.

Code

https://github.com/bio-ontology-research-group/lodvectors/

Demo

http://biohackathon.org/d3sparql/d3lod4ml.html

(Please do not run large queries on this; please don't use during presentation)

Queries to run

Use some high-level GO biological process (BP) classes as labels (sensory perception, glycosylation, apoptosis, behavior):

PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
PREFIX owl:<http://www.w3.org/2002/07/owl#> 
PREFIX taxon:<http://purl.uniprot.org/taxonomy/> 
PREFIX go:<http://purl.obolibrary.org/obo/GO_>
PREFIX eco:<http://purl.obolibrary.org/obo/ECO_0000> 
SELECT DISTINCT ?protein ?golabel ?proteinlabel
WHERE
{
  ?protein a up:Protein ;
           up:organism taxon:9606 ;
           rdfs:label ?proteinlabel .

  {
    {
      SELECT ?protein ?golabel
      WHERE {
        ?protein a up:Protein ;
                 up:organism taxon:9606 .
        VALUES ?go { go:0006915 go:0070085 go:0007600 go:0007610 }
        {
          ?protein up:classifiedWith ?go .
          ?attribution rdf:object ?go .
        } UNION {
          ?protein up:classifiedWith ?term .
          ?term rdfs:subClassOf ?go .
          ?attribution rdf:object ?term .
        }
        ?attribution rdf:subject ?protein ;
                     up:attribution/up:evidence ?evidence .
        FILTER(?evidence IN (eco:269, eco:314, eco:353, eco:315, eco:316, eco:270, eco:250))
        ?go rdfs:label ?golabel .
      } LIMIT 1000
    }
  } UNION {
    {
      SELECT ?protein ("UNKNOWN" AS ?golabel)
      WHERE {
        ?protein a up:Protein ;
                 up:organism taxon:9606 .
        FILTER(NOT EXISTS {
          {
            ?protein up:classifiedWith ?go .
            ?attribution rdf:object ?go .
          } UNION {
            ?protein up:classifiedWith ?term .
            ?term rdfs:subClassOf ?go .
            ?attribution rdf:object ?term .
          }
          ?attribution rdf:subject ?protein ;
                       up:attribution/up:evidence ?evidence .
          FILTER(?evidence IN (eco:269, eco:314, eco:353, eco:315, eco:316, eco:270, eco:250))
        })
      } LIMIT 50
    }
  }
}
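
To run a query like the one above outside the demo page, it can be sent to a SPARQL endpoint over the standard SPARQL protocol. The following is a minimal sketch, not part of the lodvectors code: it assumes the public UniProt endpoint at https://sparql.uniprot.org/sparql, and the helper name `build_sparql_request` is made up for illustration. It only builds the request; sending it is left to the caller.

```python
import urllib.parse
import urllib.request

# Assumption: the public UniProt SPARQL endpoint; any SPARQL 1.1 endpoint works the same way.
ENDPOINT = "https://sparql.uniprot.org/sparql"


def build_sparql_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build (but do not send) a SPARQL-protocol GET request that asks for JSON results."""
    # The SPARQL protocol passes the query text in the URL-encoded "query" parameter.
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    # The Accept header selects the SPARQL JSON results serialization.
    return urllib.request.Request(url, headers={"Accept": "application/sparql-results+json"})
```

Passing the returned object to `urllib.request.urlopen` and decoding the body with `json.load` gives the result document. Keep the `LIMIT` clauses in place, since large result sets are slow to process.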

Documentation

API Documentation is available here

Example API call

Bugs

Many (it's a hackathon)!

Known bugs include:

  • head of SPARQL JSON not complete
  • datatypes not properly typed in SPARQL results
  • very slow for large result sets (>1500 rows)
  • still slow for smaller result sets
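
The first two bugs concern the shape of the SPARQL JSON output. As a rough aid for checking them, here is a small sketch of a validator based on the W3C SPARQL 1.1 Query Results JSON Format, where `head.vars` lists the variables and every bound term is an object with a `type` of `uri`, `literal`, or `bnode`, a `value`, and an optional `datatype`. The function name `check_sparql_json` is hypothetical and the checks are deliberately incomplete.

```python
from typing import Any, Dict, List

# Term types allowed by the SPARQL 1.1 Query Results JSON Format.
ALLOWED_TERM_TYPES = {"uri", "literal", "bnode"}


def check_sparql_json(doc: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a SPARQL SELECT JSON result document."""
    problems: List[str] = []
    variables = doc.get("head", {}).get("vars")
    if not isinstance(variables, list):
        problems.append("head.vars is missing")
        variables = []
    bindings = doc.get("results", {}).get("bindings")
    if not isinstance(bindings, list):
        problems.append("results.bindings is missing")
        bindings = []
    for row_number, row in enumerate(bindings):
        for name, term in row.items():
            if name not in variables:
                problems.append(f"row {row_number}: ?{name} is not declared in head.vars")
            if term.get("type") not in ALLOWED_TERM_TYPES:
                problems.append(f"row {row_number}: ?{name} has invalid type {term.get('type')!r}")
            if "value" not in term:
                problems.append(f"row {row_number}: ?{name} has no value")
            # A datatype IRI is only meaningful on literals.
            if "datatype" in term and term.get("type") != "literal":
                problems.append(f"row {row_number}: ?{name} has a datatype but is not a literal")
    return problems
```

An empty list means the document passed these checks; it does not prove full conformance.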

Background

At last year's Hackathon, a group of biohackers started to build https://github.com/bio-ontology-research-group/walking-rdf-and-owl to generate vector representations for nodes (and properties) in RDF/OWL knowledge graphs. These vectors represent nodes and their context, and can be used in machine learning models for edge prediction and other applications. However, in the Linked Open Data cloud, there are many more potential applications for such vectors. For example, they may allow us to expand query results by including not only the nodes that directly match a (SPARQL) query, but also nodes that are similar (within a certain threshold) to the query results. We may also use dimensionality reduction techniques to project result sets on a 2D space and use this to visualize a set of nodes.

Our aim was to explore some of the novel applications that vector-based representations of nodes in Linked Data graphs might enable. We focus on visualization of dataset characteristics and similarity-based search.

Methods

Vectors

We used the vectors generated from (human) UniProt entries by the lod4ml project.

Visualization

We use the t-SNE method to project a set of vectors into 2D space so that they can be plotted. If labels are provided for the nodes (and their corresponding vectors), the labels can be visualized through node colors.
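
The projection step can be sketched as follows. This is an illustration only: the project itself uses t-SNE-JAVA from Groovy, whereas this sketch swaps in scikit-learn's `TSNE`, and the toy vectors, labels, perplexity, and the `project_to_2d` helper are all assumptions made for the example.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_to_2d(vectors: np.ndarray, perplexity: float = 5.0, seed: int = 0) -> np.ndarray:
    """Project an (n_samples, n_dims) array to (n_samples, 2) with t-SNE."""
    # perplexity must be smaller than the number of samples.
    tsne = TSNE(n_components=2, perplexity=perplexity, init="random", random_state=seed)
    return tsne.fit_transform(vectors)


# Toy stand-in for embedding vectors: 30 random 16-dimensional points.
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(30, 16))
toy_labels = ["apoptosis"] * 15 + ["UNKNOWN"] * 15

coords = project_to_2d(toy_vectors)
# One record per node; a scatterplot can map "label" to a color.
points = [
    {"x": float(x), "y": float(y), "label": label}
    for (x, y), label in zip(coords, toy_labels)
]
```

With real embeddings, nodes sharing a label would ideally form visible clusters; random toy data like this will not.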

For visualization, we use the D3SPARQL Scatterplot package.

Similarity

Similarity is computed using cosine similarity between vectors.
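
For reference, the standard definition of cosine similarity for two d-dimensional vectors u and v (the symbols are chosen here for illustration) is:

```latex
\[
\operatorname{sim}(u, v)
  = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
  = \frac{\sum_{i=1}^{d} u_i v_i}
         {\sqrt{\sum_{i=1}^{d} u_i^{2}} \; \sqrt{\sum_{i=1}^{d} v_i^{2}}}
\]
```

The value ranges from -1 (opposite direction) through 0 (orthogonal) to 1 (same direction), and is independent of vector length.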

Indexing and computation of similarity

We hold the vectors in an Elasticsearch instance with the Elastic Vector Scoring plugin. Cosine similarity scoring and retrieval of the most similar vectors are performed through an Elasticsearch query.
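
To make the retrieval step concrete, here is a brute-force, in-memory stand-in written in plain Python. It is not the Elasticsearch plugin's API and would not scale to a full dataset; the function names, the node identifiers in the usage note, and the optional `threshold` parameter are assumptions for illustration.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(
    query_id: str,
    vectors: Dict[str, Sequence[float]],
    k: int = 10,
    threshold: float = -1.0,
) -> List[Tuple[str, float]]:
    """Return up to k (node_id, score) pairs most similar to query_id, best first."""
    query = vectors[query_id]
    scored = (
        (cosine_similarity(query, vec), node_id)
        for node_id, vec in vectors.items()
        if node_id != query_id  # do not return the query node itself
    )
    # Keep only candidates at or above the similarity threshold, then take the top k.
    top = heapq.nlargest(k, (pair for pair in scored if pair[0] >= threshold))
    return [(node_id, score) for score, node_id in top]
```

For example, with `vectors = {"a": [1, 0], "b": [1, 0.1], "c": [0, 1]}`, calling `most_similar("a", vectors, k=1)` returns `"b"` as the closest node. Raising `threshold` restricts results to nodes above a chosen similarity.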

Generation of meta-data

We used the smartAPI specification to generate an OpenAPI v3 compatible description file. We added this file to the v3-compatible smartAPI registry. The lodvector API can now be opened and tested with the Swagger UI, where you can try out the example query or just inspect the response.

The response is structured in two parts: meta and results. The meta field contains six elements:

 "api:meta": {
  "api:URLcalled": "http://...",
  "prov:wasGeneratedBy": "https://github.com/bio-ontology-research-group/lodvectors/blob/master/tsne.groovy",
  "prov:generatedAt": "2017-09-15T02:15:04+0000",
  "api:errors": [],
  "api:warnings": ["No results found."],
  "api:resultCount": 0
 }
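
A block of this shape could be assembled as in the sketch below. The six keys and the generator URL are taken from the example above; the `build_meta` helper and its parameters are hypothetical, and the actual Groovy implementation may differ.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

# Taken from the example response above.
GENERATOR = "https://github.com/bio-ontology-research-group/lodvectors/blob/master/tsne.groovy"


def build_meta(
    url_called: str,
    result_count: int,
    errors: Optional[List[str]] = None,
    warnings: Optional[List[str]] = None,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Assemble the "api:meta" part of a response (illustrative helper)."""
    # Use a timezone-aware UTC timestamp so that %z renders as "+0000".
    now = now or datetime.now(timezone.utc)
    return {
        "api:meta": {
            "api:URLcalled": url_called,
            "prov:wasGeneratedBy": GENERATOR,
            "prov:generatedAt": now.strftime("%Y-%m-%dT%H:%M:%S%z"),
            "api:errors": list(errors or []),
            "api:warnings": list(warnings or []),
            "api:resultCount": result_count,
        }
    }
```

Serializing the returned dictionary with `json.dumps` yields JSON in the same layout as the example.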

Additionally, we developed a JSON-LD context file in order to expand this JSON response into triples.

Implementation and availability

Most of the code is implemented in Groovy and is available at https://github.com/bio-ontology-research-group/lodvectors under the 2-clause BSD license. The code relies on the GroovySparql package, Elasticsearch and Elastic Vector Search, t-SNE-JAVA, and Jetty as the servlet container.

The API endpoints are

Results

Visualization of SPARQL query results: visual classification of proteins

Similarity-based search

A FAIR and SMART API

Discussion