Wiki Page: KG-Construction
Jupyter Notebook: Data_Preparation.ipynb
Required Input: resources/construction_approach/subclass_construction_map.pkl
New data can be added to the knowledge graph using two construction approaches: (1) instance-based or (2) subclass-based. Each approach is described below.
In the instance-based approach, each new edge is added as an instance of an existing class (via `rdf:type`) in the knowledge graph.
EXAMPLE: Adding the edge: `Morphine` ➞ `isSubstanceThatTreats` ➞ `Migraine`
`isSubstanceThatTreats`(`Morphine`, `x1`)
`Type`(`x1`, `Migraine`)
In this example, `Morphine` is an ontology data node from ChEBI and `Migraine` is a Human Phenotype Ontology term. This would result in the following triples:
UUID1 = MD5(Morphine + isSubstanceThatTreats + Migraine + "subject")
UUID2 = MD5(Morphine + isSubstanceThatTreats + Migraine + "object")
UUID1, rdf:type, Morphine
UUID1, rdf:type, owl:NamedIndividual
UUID2, rdf:type, Migraine
UUID2, rdf:type, owl:NamedIndividual
UUID1, isSubstanceThatTreats, UUID2
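
As a rough illustration, the Morphine example can be reproduced with a short Python sketch. It assumes that the labels stand in for full URIs, that the pieces are concatenated as plain strings and hashed as UTF-8, and that the hex digest serves as the identifier. The helper names `md5_hex` and `instance_based_triples` are hypothetical and are not part of the PheKnowLator API.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 hex digest of a string (UTF-8 encoding is an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def instance_based_triples(subj, rel, obj):
    """Build the five instance-based triples for one subject-relation-object edge."""
    # Both identifiers hash the same edge; only the trailing role label differs.
    uuid1 = md5_hex(subj + rel + obj + "subject")
    uuid2 = md5_hex(subj + rel + obj + "object")
    return [
        (uuid1, "rdf:type", subj),
        (uuid1, "rdf:type", "owl:NamedIndividual"),
        (uuid2, "rdf:type", obj),
        (uuid2, "rdf:type", "owl:NamedIndividual"),
        (uuid1, rel, uuid2),
    ]


# Labels are used here in place of the real ChEBI and HPO URIs.
triples = instance_based_triples("Morphine", "isSubstanceThatTreats", "Migraine")
for triple in triples:
    print(triple)
```

Because the identifiers are derived from the edge itself, rebuilding the same edge yields the same two nodes.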
The table below shows the triples that are added as a function of edge type (i.e., `class`-`class` vs. `class`-`instance` vs. `instance`-`instance`) and relation strategy (i.e., relations only or relations + inverse relations).
Edge Type | Relations | Needed Triples |
---|---|---|
Class-Class | Relations Only | `GO_1234567, REL, DOID_1234567`<br>`UUID1 = pkt + MD5(GO_1234567 + <<REL>> + DOID_1234567 + "subject")`<br>`UUID2 = pkt + MD5(GO_1234567 + <<REL>> + DOID_1234567 + "object")`<br>`UUID1, rdf:type, GO_1234567`<br>`UUID1, rdf:type, owl:NamedIndividual`<br>`UUID2, rdf:type, DOID_1234567`<br>`UUID2, rdf:type, owl:NamedIndividual`<br>`UUID1, REL, UUID2` |
Class-Class | Relations + Inverse Relations | `GO_1234567, REL, DOID_1234567`<br>`DOID_1234567, INV_REL, GO_1234567`<br>`UUID1 = pkt + MD5(GO_1234567 + <<REL>> + DOID_1234567 + "subject")`<br>`UUID2 = pkt + MD5(GO_1234567 + <<REL>> + DOID_1234567 + "object")`<br>`UUID1, rdf:type, GO_1234567`<br>`UUID1, rdf:type, owl:NamedIndividual`<br>`UUID2, rdf:type, DOID_1234567`<br>`UUID2, rdf:type, owl:NamedIndividual`<br>`UUID1, REL, UUID2`<br>`UUID2, INV_REL, UUID1` |
Class-Instance | Relations Only | `GO_1234567, REL, HGNC_1234567`<br>`UUID1 = pkt + MD5(GO_1234567 + <<REL>> + HGNC_1234567 + "subject")`<br>`UUID2 = pkt + MD5(GO_1234567 + <<REL>> + HGNC_1234567 + "object")`<br>`UUID1, rdf:type, GO_1234567`<br>`UUID1, rdf:type, owl:NamedIndividual`<br>`HGNC_1234567, rdfs:subClassOf, subclass_dict[HGNC_1234567]`<br>`HGNC_1234567, rdf:type, owl:Class`<br>`UUID2, rdf:type, HGNC_1234567`<br>`UUID2, rdf:type, owl:NamedIndividual`<br>`UUID1, REL, UUID2` |
Class-Instance | Relations + Inverse Relations | `GO_1234567, REL, HGNC_1234567`<br>`HGNC_1234567, INV_REL, GO_1234567`<br>`UUID1 = pkt + MD5(GO_1234567 + <<REL>> + HGNC_1234567 + "subject")`<br>`UUID2 = pkt + MD5(GO_1234567 + <<REL>> + HGNC_1234567 + "object")`<br>`UUID1, rdf:type, GO_1234567`<br>`UUID1, rdf:type, owl:NamedIndividual`<br>`HGNC_1234567, rdfs:subClassOf, subclass_dict[HGNC_1234567]`<br>`HGNC_1234567, rdf:type, owl:Class`<br>`UUID2, rdf:type, HGNC_1234567`<br>`UUID2, rdf:type, owl:NamedIndividual`<br>`UUID1, REL, UUID2`<br>`UUID2, INV_REL, UUID1` |
Instance-Instance | Relations Only | `HGNC_1234567, REL, HGNC_7654321`<br>`UUID1 = pkt + MD5(HGNC_1234567 + <<REL>> + HGNC_7654321 + "subject")`<br>`UUID2 = pkt + MD5(HGNC_1234567 + <<REL>> + HGNC_7654321 + "object")`<br>`HGNC_1234567, rdfs:subClassOf, subclass_dict[HGNC_1234567]`<br>`HGNC_1234567, rdf:type, owl:Class`<br>`UUID1, rdf:type, HGNC_1234567`<br>`UUID1, rdf:type, owl:NamedIndividual`<br>`HGNC_7654321, rdfs:subClassOf, subclass_dict[HGNC_7654321]`<br>`HGNC_7654321, rdf:type, owl:Class`<br>`UUID2, rdf:type, HGNC_7654321`<br>`UUID2, rdf:type, owl:NamedIndividual`<br>`UUID1, REL, UUID2` |
Instance-Instance | Relations + Inverse Relations | `HGNC_1234567, REL, HGNC_7654321`<br>`HGNC_7654321, INV_REL, HGNC_1234567`<br>`UUID1 = pkt + MD5(HGNC_1234567 + <<REL>> + HGNC_7654321 + "subject")`<br>`UUID2 = pkt + MD5(HGNC_1234567 + <<REL>> + HGNC_7654321 + "object")`<br>`HGNC_1234567, rdfs:subClassOf, subclass_dict[HGNC_1234567]`<br>`HGNC_1234567, rdf:type, owl:Class`<br>`UUID1, rdf:type, HGNC_1234567`<br>`UUID1, rdf:type, owl:NamedIndividual`<br>`HGNC_7654321, rdfs:subClassOf, subclass_dict[HGNC_7654321]`<br>`HGNC_7654321, rdf:type, owl:Class`<br>`UUID2, rdf:type, HGNC_7654321`<br>`UUID2, rdf:type, owl:NamedIndividual`<br>`UUID1, REL, UUID2`<br>`UUID2, INV_REL, UUID1` |
Note. `UUID` is a `BNode` that is created from an MD5 hash of concatenated URIs. The hash string concatenates the subject URI, a relation (`<<REL>>`), and the object URI, followed by the literal "subject" (for `UUID1`) or "object" (for `UUID2`). To account for future use cases, we have devised a heuristic to determine what is used for `<<REL>>`: (1) for a given relation, determine if it has an inverse (via `owl:inverseOf`); (2) sort the relations alphabetically; and (3) select the first relation. The selected relation is then used for creating both `UUID` `BNodes` (i.e., `UUID1` and `UUID2` in all examples in the table above). For example, if the relations were `causes` and `caused_by`, both `UUID` `BNodes` would be created using `caused_by`. Please note that all `UUID` `BNodes` created during the construction process are explicitly defined within the `pkt` namespace (`https://github.com/callahantiff/PheKnowLator/pkt/`).
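
The relation-selection heuristic in the note above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the project's implementation: `select_hash_relation` and `uuid_bnodes` are hypothetical names, the inverse is passed in directly instead of being looked up via `owl:inverseOf`, "alphabetically" is taken to mean Python's default string ordering, and the namespace is simply prepended to the hex digest.

```python
import hashlib

# Namespace quoted from the note above.
PKT = "https://github.com/callahantiff/PheKnowLator/pkt/"


def select_hash_relation(relation, inverse=None):
    """Pick the relation used in the hash: the first of the sorted pair."""
    candidates = [relation] if inverse is None else [relation, inverse]
    return sorted(candidates)[0]


def uuid_bnodes(subj, relation, obj, inverse=None):
    """Return (UUID1, UUID2) identifiers for an instance-based edge."""
    rel = select_hash_relation(relation, inverse)

    def digest(role):
        text = subj + rel + obj + role
        return hashlib.md5(text.encode("utf-8")).hexdigest()

    return PKT + digest("subject"), PKT + digest("object")


# With an inverse available, the alphabetically first relation is hashed.
chosen = select_hash_relation("causes", "caused_by")
uuid1, uuid2 = uuid_bnodes("GO_1234567", "causes", "DOID_1234567", "caused_by")
```

Under these assumptions, the same subject and object yield the same pair of identifiers whether the edge is supplied with `causes` or with `caused_by`.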
In the subclass-based approach, each new edge is added as a subclass of an existing ontology class (via `rdfs:subClassOf`) in the knowledge graph.
EXAMPLE: Adding the edge: `TGFB1` ➞ `participatesIn` ➞ `Influenza Virus Induced Apoptosis`
`participatesIn`(`TGFB1`, `Influenza Virus Induced Apoptosis`)
`subClassOf`(`Influenza Virus Induced Apoptosis`, `Influenza A Pathway`)
`Type`(`Influenza Virus Induced Apoptosis`, `owl:Class`)
Here, `TGFB1` is a Protein Ontology term and `Influenza Virus Induced Apoptosis` is a non-ontology data node from Reactome. In this example, `Influenza A Pathway` is an existing Pathway Ontology class. This would result in the following triples:
UUID1 = MD5(TGFB1 + participatesIn + Influenza Virus Induced Apoptosis)
UUID2 = MD5(TGFB1 + participatesIn + Influenza Virus Induced Apoptosis + owl:Restriction)
Influenza Virus Induced Apoptosis, rdfs:subClassOf, Influenza A Pathway
Influenza Virus Induced Apoptosis, rdf:type, owl:Class
UUID1, rdfs:subClassOf, TGFB1
UUID1, rdfs:subClassOf, UUID2
UUID2, rdf:type, owl:Restriction
UUID2, owl:someValuesFrom, Influenza Virus Induced Apoptosis
UUID2, owl:onProperty, participatesIn
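
The TGFB1 example can be sketched in Python as follows. The sketch assumes that labels stand in for full URIs, that `owl:Restriction` is appended to the hash input as a literal string, and that strings are hashed as UTF-8. The function `subclass_based_triples` and its `subclass_map` argument (a stand-in for the pickled mapping dictionary) are hypothetical names used only for illustration.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 hex digest of a string (UTF-8 encoding is an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def subclass_based_triples(subj, rel, obj, subclass_map):
    """Build the subclass-based triples for one subject-relation-object edge."""
    triples = []
    # Non-ontology nodes are first anchored to their existing ontology classes.
    for node in (subj, obj):
        if node in subclass_map:
            for parent in subclass_map[node]:
                triples.append((node, "rdfs:subClassOf", parent))
            triples.append((node, "rdf:type", "owl:Class"))
    uuid1 = md5_hex(subj + rel + obj)
    uuid2 = md5_hex(subj + rel + obj + "owl:Restriction")
    # UUID1 is a subclass of the subject and of an existential restriction.
    triples += [
        (uuid1, "rdfs:subClassOf", subj),
        (uuid1, "rdfs:subClassOf", uuid2),
        (uuid2, "rdf:type", "owl:Restriction"),
        (uuid2, "owl:someValuesFrom", obj),
        (uuid2, "owl:onProperty", rel),
    ]
    return triples


pathway = "Influenza Virus Induced Apoptosis"
triples = subclass_based_triples(
    "TGFB1", "participatesIn", pathway, {pathway: ["Influenza A Pathway"]}
)
```

Only the Reactome pathway appears in `subclass_map` here, so the ontology term `TGFB1` receives no extra class triples.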
The table below shows the triples that are added as a function of edge type (i.e., `class`-`class` vs. `class`-`instance` vs. `instance`-`instance`) and relation strategy (i.e., relations only or relations + inverse relations).
Edge Type | Relations | Needed Triples |
---|---|---|
Class-Class | Relations Only | `GO_1234567, REL, DOID_1234567`<br>`UUID1 = pkt + MD5(GO_1234567 + REL + DOID_1234567)`<br>`UUID2 = MD5(GO_1234567 + REL + DOID_1234567 + owl:Restriction)`<br>`UUID1, rdfs:subClassOf, GO_1234567`<br>`UUID1, rdfs:subClassOf, UUID2`<br>`UUID2, rdf:type, owl:Restriction`<br>`UUID2, owl:someValuesFrom, DOID_1234567`<br>`UUID2, owl:onProperty, REL` |
Class-Class | Relations + Inverse Relations | `GO_1234567, REL, DOID_1234567`<br>`DOID_1234567, INV_REL, GO_1234567`<br>`UUID1 = pkt + MD5(GO_1234567 + REL + DOID_1234567)`<br>`UUID2 = MD5(GO_1234567 + REL + DOID_1234567 + owl:Restriction)`<br>`UUID3 = pkt + MD5(DOID_1234567 + INV_REL + GO_1234567)`<br>`UUID4 = MD5(DOID_1234567 + INV_REL + GO_1234567 + owl:Restriction)`<br>`UUID1, rdfs:subClassOf, GO_1234567`<br>`UUID1, rdfs:subClassOf, UUID2`<br>`UUID2, rdf:type, owl:Restriction`<br>`UUID2, owl:someValuesFrom, DOID_1234567`<br>`UUID2, owl:onProperty, REL`<br>`UUID3, rdfs:subClassOf, DOID_1234567`<br>`UUID3, rdfs:subClassOf, UUID4`<br>`UUID4, rdf:type, owl:Restriction`<br>`UUID4, owl:someValuesFrom, GO_1234567`<br>`UUID4, owl:onProperty, INV_REL` |
Class-Instance | Relations Only | `GO_1234567, REL, HGNC_1234567`<br>`UUID1 = pkt + MD5(GO_1234567 + REL + HGNC_1234567)`<br>`UUID2 = MD5(GO_1234567 + REL + HGNC_1234567 + owl:Restriction)`<br>`HGNC_1234567, rdfs:subClassOf, subclass_dict[HGNC_1234567]`<br>`HGNC_1234567, rdf:type, owl:Class`<br>`UUID1, rdfs:subClassOf, GO_1234567`<br>`UUID1, rdfs:subClassOf, UUID2`<br>`UUID2, rdf:type, owl:Restriction`<br>`UUID2, owl:someValuesFrom, HGNC_1234567`<br>`UUID2, owl:onProperty, REL` |
Class-Instance | Relations + Inverse Relations | `GO_1234567, REL, HGNC_1234567`<br>`HGNC_1234567, INV_REL, GO_1234567`<br>`UUID1 = pkt + MD5(GO_1234567 + REL + HGNC_1234567)`<br>`UUID2 = MD5(GO_1234567 + REL + HGNC_1234567 + owl:Restriction)`<br>`UUID3 = pkt + MD5(HGNC_1234567 + INV_REL + GO_1234567)`<br>`UUID4 = MD5(HGNC_1234567 + INV_REL + GO_1234567 + owl:Restriction)`<br>`HGNC_1234567, rdfs:subClassOf, subclass_dict[HGNC_1234567]`<br>`HGNC_1234567, rdf:type, owl:Class`<br>`UUID1, rdfs:subClassOf, GO_1234567`<br>`UUID1, rdfs:subClassOf, UUID2`<br>`UUID2, rdf:type, owl:Restriction`<br>`UUID2, owl:someValuesFrom, HGNC_1234567`<br>`UUID2, owl:onProperty, REL`<br>`UUID3, rdfs:subClassOf, HGNC_1234567`<br>`UUID3, rdfs:subClassOf, UUID4`<br>`UUID4, rdf:type, owl:Restriction`<br>`UUID4, owl:someValuesFrom, GO_1234567`<br>`UUID4, owl:onProperty, INV_REL` |
Instance-Instance | Relations Only | `HGNC_1234567, REL, HGNC_7654321`<br>`UUID1 = pkt + MD5(HGNC_1234567 + REL + HGNC_7654321)`<br>`UUID2 = MD5(HGNC_1234567 + REL + HGNC_7654321 + owl:Restriction)`<br>`HGNC_1234567, rdfs:subClassOf, subclass_dict[HGNC_1234567]`<br>`HGNC_1234567, rdf:type, owl:Class`<br>`HGNC_7654321, rdfs:subClassOf, subclass_dict[HGNC_7654321]`<br>`HGNC_7654321, rdf:type, owl:Class`<br>`UUID1, rdfs:subClassOf, HGNC_1234567`<br>`UUID1, rdfs:subClassOf, UUID2`<br>`UUID2, rdf:type, owl:Restriction`<br>`UUID2, owl:someValuesFrom, HGNC_7654321`<br>`UUID2, owl:onProperty, REL` |
Instance-Instance | Relations + Inverse Relations | `HGNC_1234567, REL, HGNC_7654321`<br>`HGNC_7654321, INV_REL, HGNC_1234567`<br>`UUID1 = pkt + MD5(HGNC_1234567 + REL + HGNC_7654321)`<br>`UUID2 = MD5(HGNC_1234567 + REL + HGNC_7654321 + owl:Restriction)`<br>`UUID3 = pkt + MD5(HGNC_7654321 + INV_REL + HGNC_1234567)`<br>`UUID4 = MD5(HGNC_7654321 + INV_REL + HGNC_1234567 + owl:Restriction)`<br>`HGNC_1234567, rdfs:subClassOf, subclass_dict[HGNC_1234567]`<br>`HGNC_1234567, rdf:type, owl:Class`<br>`HGNC_7654321, rdfs:subClassOf, subclass_dict[HGNC_7654321]`<br>`HGNC_7654321, rdf:type, owl:Class`<br>`UUID1, rdfs:subClassOf, HGNC_1234567`<br>`UUID1, rdfs:subClassOf, UUID2`<br>`UUID2, rdf:type, owl:Restriction`<br>`UUID2, owl:someValuesFrom, HGNC_7654321`<br>`UUID2, owl:onProperty, REL`<br>`UUID3, rdfs:subClassOf, HGNC_7654321`<br>`UUID3, rdfs:subClassOf, UUID4`<br>`UUID4, rdf:type, owl:Restriction`<br>`UUID4, owl:someValuesFrom, HGNC_1234567`<br>`UUID4, owl:onProperty, INV_REL` |
Note. When `UUID` is used within the subclass-based construction approach, it is a `BNode` that is created from an MD5 hash of concatenated URIs. See each table row for a sample of what this looks like. Please note that the primary `UUID` `BNodes` created during the construction process (`UUID1` and `UUID3` in the table above) are explicitly defined within the `pkt` namespace (`https://github.com/callahantiff/PheKnowLator/pkt/`).
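
For the "Relations + Inverse Relations" rows of the table above, one possible reading is that the same restriction pattern is applied twice, once per direction, with only the primary nodes placed in the `pkt` namespace. The Python sketch below follows that reading. The names `restriction_triples` and `with_inverse` are hypothetical, and treating `owl:Restriction` as a literal string in the hash input is an assumption.

```python
import hashlib

# Namespace quoted from the note above.
PKT = "https://github.com/callahantiff/PheKnowLator/pkt/"


def md5_hex(text):
    """Return the MD5 hex digest of a string (UTF-8 encoding is an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def restriction_triples(subj, rel, obj):
    """Five triples for one direction of a subclass-based edge."""
    primary = PKT + md5_hex(subj + rel + obj)  # UUID1 or UUID3
    restriction = md5_hex(subj + rel + obj + "owl:Restriction")  # UUID2 or UUID4
    return [
        (primary, "rdfs:subClassOf", subj),
        (primary, "rdfs:subClassOf", restriction),
        (restriction, "rdf:type", "owl:Restriction"),
        (restriction, "owl:someValuesFrom", obj),
        (restriction, "owl:onProperty", rel),
    ]


def with_inverse(subj, rel, obj, inv_rel):
    """Triples for an edge plus its inverse (subject and object swapped)."""
    return restriction_triples(subj, rel, obj) + restriction_triples(obj, inv_rel, subj)


triples = with_inverse("GO_1234567", "REL", "DOID_1234567", "INV_REL")
```

The result holds ten triples: the first five describe the forward edge and the last five describe the inverse edge.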
🛑 ASSUMPTIONS 🛑
Data: subclass_construction_map.pkl
The algorithm makes the following assumption:
- The dictionary that maps non-ontology node data to ontology classes (described below) has been created and saved as a `.pkl` file in the `./resources/construction_approach/` directory.
Input requirements for both approaches: a pickled dictionary, added to the `./resources/construction_approach/` directory, whose keys are node identifiers (non-ontology node data) and whose values are lists of ontology class identifiers. An example of this dictionary is shown below:
{
'R-HSA-168277' : ['http://purl.obolibrary.org/obo/PW_0001054',
'http://purl.obolibrary.org/obo/GO_0046730'],
'R-HSA-9026286' : ['http://purl.obolibrary.org/obo/PW_000000001',
'http://purl.obolibrary.org/obo/GO_0019372'],
'100129357' : ['SO_0000043'],
'100129358' : ['SO_0000336'],
}
Please see the `Reactome Pathways - Pathway Ontology` and `Genomic Identifiers - Sequence Ontology` sections of the `Data_Preparation.ipynb` Jupyter Notebook for examples of how to construct this dictionary.
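
As a minimal sketch of the pickling step itself, the snippet below writes a small mapping with Python's standard `pickle` module and reads it back. It is not taken from the notebook. The helper names `save_map` and `load_map` are hypothetical, and a temporary directory stands in for `./resources/construction_approach/` so the example leaves no files behind.

```python
import pickle
import tempfile
from pathlib import Path

# A two-entry excerpt of the example dictionary shown above.
subclass_map = {
    "R-HSA-168277": [
        "http://purl.obolibrary.org/obo/PW_0001054",
        "http://purl.obolibrary.org/obo/GO_0046730",
    ],
    "100129357": ["SO_0000043"],
}


def save_map(mapping, directory):
    """Pickle the mapping as subclass_construction_map.pkl inside directory."""
    path = Path(directory) / "subclass_construction_map.pkl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as handle:
        pickle.dump(mapping, handle)
    return path


def load_map(path):
    """Load a pickled mapping back into a dictionary."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# The temporary directory is removed once the block exits.
with tempfile.TemporaryDirectory() as tmp:
    target_dir = Path(tmp) / "resources" / "construction_approach"
    pkl_path = save_map(subclass_map, target_dir)
    reloaded = load_map(pkl_path)
```

In a real run, `target_dir` would point at the repository's `./resources/construction_approach/` directory instead.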