Please use http://purl.uniprot.org/uniprot or http://purl.uniprot.org/isoform/ IRIs for UniProt concepts #34
Comments
What does the uniprot PURL denote? If other graphs assert it's an IAO ICE we end up with incoherency. We need to treat it as a material entity (not that identifiers.org is clear on this) @nataled are you using uniprot PURLs as ICEs in PRO? |
At the moment we don't use them for anything other than what they are: database entries. In PRO they are only used for cross-references and evidences. However, going beyond that, they would be considered ICEs. Think of the distinction between SO and MSO: UniProtKB would be akin to SO, while PRO would be akin to MSO. |
Thanks! I'm most interested in what they are asserted or entailed to be in OWL. If you have axioms that cause a uniprot PURL (as in an actual purl.uniprot.org PURL) to be entailed as an ICE (for example, through use of an object property with domain/range constraints) then the combined knowledge graph with GO-CAM will have inconsistencies. I believe this is the case. I believe also that @JervenBolleman who is the authority on what the purls mean would say that these denote database records, not proteins. Both of these facts indicate that we should not use these purls for the neo classes (funnily enough, the PRO class has the intended semantics, but GO annotators want identifiers with UniProtKB prefixes, and we need all of at least swiss-prot materialized, which means we can't use PRO). Note that SO classes are not subclasses of ICEs, many SO classes have instances that exist independently of database records. |
Any Swiss-Prot entry can be trivially materialized in PRO. Some will need special treatment, of course, but we know how to deal with those. The only thing that stops us from doing it is that we've not had a request to do so. But, go ahead and try it. Take any Swiss-Prot accession that doesn't have a corresponding PRO, and prefix it with a PRO PURL (purl.obolibrary.org/obo/PR_). Works for TrEMBL too. |
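The prefixing rule described in this comment can be sketched in a few lines. This is only an illustration: the function name `pro_purl_for_accession` is hypothetical, the `http://` scheme is added to the `purl.obolibrary.org/obo/PR_` prefix quoted above, and nothing here checks that the accession exists in UniProtKB or that the resulting PURL resolves.

```python
# Prefix quoted in the comment, with the http:// scheme added (assumption).
PRO_PREFIX = "http://purl.obolibrary.org/obo/PR_"


def pro_purl_for_accession(accession: str) -> str:
    """Build a PRO PURL by prefixing a UniProtKB accession (hypothetical helper)."""
    accession = accession.strip()
    if not accession:
        raise ValueError("empty accession")
    return PRO_PREFIX + accession


if __name__ == "__main__":
    # P05067 is the accession used elsewhere in this thread.
    print(pro_purl_for_accession("P05067"))
```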
Speaking in OWL-DL terms: identifiers.org states <http://identifiers.org/uniprot/P05067> owl:sameAs <http://purl.uniprot.org/uniprot/P05067>, so this is not a modelling change. However, for ease of use in federated SPARQL queries, it helps the practical adoption of Noctua models a lot if we can cross-query them. IRI conversion in SPARQL queries is possible, but it is a pain that we would rather not have. When I talk about a DataRecord, I mean an owl:Class. That is, a single UniProt record/class represents between 0 and a practically infinite number of molecules, similar to a PRO class. Changing from rdf:type uniprot:Protein to rdfs:subClassOf uniprot:Protein is still on our todo list. However, with the current state of reasoners our users would have serious problems with the billions of axioms. @cmungall @nataled using PRO or UniProt is a separate discussion from this bug report, and I suggest you open a separate issue to discuss it. IMHO the more flexible semantics of UniProt is actually key for Noctua's success, as PRO's semantic limits make it invalid to express some desired annotations (especially regarding the function of secreted proteins). |
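To make the cost of IRI conversion concrete, here is a minimal sketch of the rewrite that would otherwise have to happen somewhere (in a query or in a client). It only uses the two IRI prefixes that appear in the comment above; the function name `to_uniprot_purl` is hypothetical, and the sketch does not say anything about how Noctua or UniProt actually handle this.

```python
# The two IRI prefixes that appear in the comment above.
IDENTIFIERS_ORG = "http://identifiers.org/uniprot/"
UNIPROT_PURL = "http://purl.uniprot.org/uniprot/"


def to_uniprot_purl(iri: str) -> str:
    """Rewrite an identifiers.org UniProt IRI to its purl.uniprot.org form.

    IRIs that do not start with the identifiers.org prefix are returned unchanged.
    """
    if iri.startswith(IDENTIFIERS_ORG):
        return UNIPROT_PURL + iri[len(IDENTIFIERS_ORG):]
    return iri


if __name__ == "__main__":
    print(to_uniprot_purl("http://identifiers.org/uniprot/P05067"))
```

Doing this once, when data is loaded or published, avoids repeating the string manipulation inside every federated query, which appears to be the point of the request.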
Where does this axiom come from?
doesn't return anything. If there is a sameAs axiom, formally it doesn't affect us, since we're in OWL-DL and protected by punning (sameAs only applies to individuals, and we're using classes).
OK, I will try and mentally translate but this extra layer confuses me. Would you apply this to GO too? To me, every GO class represents a process or cellular entity type. Yes, the class is also an information entity but this is implicit. It's most parsimonious to leave out talk of information entities when modeling unless one explicitly wants to talk about information entities.
Actually flexible semantics is not good for us, and much as I want easy federated querying, if we don't have logically consistent models, reasoning doesn't work and we rely on reasoning for everything. We need precise semantics. Can you explain what you mean about secreted proteins? I don't see any challenges representing this as a GO-CAM (in fact we have axiomatized classes like renin secretion in the ontology using PRO semantics). To summarize, we need pro-like semantics (proteins like 'human shh' as classes), but uniprotkb prefixes, as the community wants to annotate to uniprot. It sounds like you might be open to providing the semantics we need, but are blocked by this:
What reasoners are you using? This seems like a fairly tractable technical challenge. And there may be options like using a tbox shadowed in the abox for internal reasoning but publishing as a tbox. |
It seems clear that the UniProt concept and the PRO class are different sorts. Why can't the interoperability be handled in NEO's interface? E.g. Accept either ID as input. Use PRO ids internally, display ids according to preference for one or the other. Generate RDF/OWL suitable for integration into UniProt's SPARQL endpoint that matches UniProt's policy for how a PRO ID maps to a UniProt record. There will be issues to address since PRO isn't strictly one to one with UniProt even at the organism-gene level, but those issues won't be addressed by simply equating the two. Exposing those assumptions clearly, and having the tool users understand what they are accepting by choosing one or the other identifiers would be quite a good thing insofar as making clearer the relation of UniProt to PRO. Using the UniProt ids for protein classes also has the consequence that we no longer have an identifier for the information content entity that is the UniProt record, which we could otherwise use in different ways. For example, the canonical sequence (as information artifact) is part of the UniProt record, but it isn't the sequence of all the proteins in the class. As an example of an issue consider the relation of the isoform to the organism-gene level. We use a subclass relation, but as far as I can tell, UniProt does not. I think it would be hard, and require substantial commitment, to coordinate the RDF/OWL in the sense of being able to simply add a piece of OBO RDF/XML to UniProt RDF/XML and expect the result to make sense. If we're not going to be able to do that it isn't clear what benefit there is to using the same identifier. |
BTW, I'm happy to chat and discuss the issues, if you are interested. |
I would like to explore this further, as it's not totally clear to me that they are. Sorry I missed the call.
I would like to do this, but this would involve multiple exceptions into the code at different points, increasing overall fragility. On top of that, there are member groups of the GOC who have expressed that they want to annotate to UniProtKB IDs (including prefix) and I have to respect that.
Let's take the case of a GCRP swissprot entry and the corresponding entry. There are definitely issues to address here (e.g. sometimes GCRP will include trembl, but at least for human we should be 99% in agreement), but I think these are separable (and they are already being discussed elsewhere). What would it mean to expose the different assumptions between To a biologist and the users of Noctua these seem to indicate the same thing. And to me as well: I believe they are intended to denote the same thing, the PR purl is just clearer and more explicit about OWL commitments and relationships to other OBO entities.
I'm not convinced you need this level of meta-representation, but in any case I believe you want to use a PURL with the sequence version embedded for this use case, the sequence in the db may change over time. E.g These differ by one residue: https://www.uniprot.org/uniprot/Q9FXT6.fasta?version=1 So if you want to explicitly and logically represent an alignment relative to a sequence you'd need to use the version IRIs, or just encode the string directly.
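As a rough illustration of the versioning point, the sketch below builds a versioned FASTA URL following the pattern of the single URL quoted above, and reports the positions at which two sequence versions differ. Both function names are hypothetical, the URL pattern is generalized from that one example (an assumption), and the two short sequences are made-up toy strings, not real UniProt data.

```python
from typing import List


def versioned_fasta_url(accession: str, version: int) -> str:
    """Build a versioned FASTA URL, following the pattern of the URL quoted above."""
    return f"https://www.uniprot.org/uniprot/{accession}.fasta?version={version}"


def differing_positions(seq_a: str, seq_b: str) -> List[int]:
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1) if a != b]


if __name__ == "__main__":
    print(versioned_fasta_url("Q9FXT6", 1))
    # Toy sequences (not real data) that differ by a single residue.
    print(differing_positions("MKTAYIAK", "MKTAYLAK"))
```

An alignment recorded against the unversioned entry would silently point at a different residue after such a change, which is why the version has to be part of whatever identifier is used.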
Yes, this could cause big problems, if asserting a subclass introduces inconsistency.
I think this is the crux of the issue. I agree that if the result of doing the combination is incoherency then it won't work (see the first comment from me in this ticket). At the moment there is a certain amount of shielding due to the punning, but that's not quite satisfactory (although that is a potential long-term strategy here). We need to know more about plans for OWL commitments on the uniprot PURLs from their maintainers. Comments above from Jerven like "Changing from rdf:type uniprot:Protein to rdfs:subClassOf uniprot:Protein is still on our todo list." suggest things are moving in the direction of compatibility, so I am hopeful. |
I come to my conclusion about them being distinct sorts from two directions. First, as you say, PRO is very clear about what their entities denote. UniProt is not. Not because they can't or don't want to, but because they view their resource as a database, not an ontology. Without understanding exactly what their entities denote (and verifying that their logical assertions regarding them concord), we can't adequately compare them to PRO. Second, where I have looked for implicit commitments as evidenced in assertions in their RDF, I find incompatibilities. We agree that combining our and their RDF will be incoherent.
My presumption was that UniProt's RDF gave distinct sequences distinct PURLs. If so, then those would be adequate. If not, we would do whatever we have to in order to properly record sequence, but that would also expose another way in which the commitments of the two resources differ. On the matter of respecting your users, I understand that need, but that seems to be something that you need to address within the tool, not necessarily in the ontologies. I haven't really looked at Noctua/NEO other than what I've seen in a couple of presentations, so at the moment I don't understand its model and logical commitments. Because of that I can't speak to the use of UniProt IRIs there. What I do know is that, insofar as OBO ontologies go, these IRIs represent different things.
SMOP. I have trouble sympathizing with the idea that in order to alleviate some bit of programming we should introduce substantial confusion about ontology. From my point of view, there is a perfectly coherent view of UniProt as a database consisting of ICEs, and PRO as an ontology, a view which is in concordance with how the developers of each resource view it. Regarding the multiple exceptions, if you are interested we could look at the code together and brainstorm to find a way to handle the interconversion in a clean and minimally disruptive manner. -- If, at some point, UniProt were to decide that they want the resource to be understood as an OBO ontology, something I would love them to do (I've said so in the past), then that would reopen the question for me. A good collaboration between UniProt and PRO might be to undertake that effort, assuming all parties were interested and committed, and that the effort could be funded. |
No, the requirement is that uniprotkb is used, regardless of tooling. |
This issue was very specific regarding IRIs for UniProt resources. I
have a strong preference for
using a resource's own IRIs directly if it has an RDF form. If a different
concept is required for logical reasons,
my preference is to have a new IRI that relates to our IRI with semantics
that are as clear as possible.
e.g. something like this
<http://example.org/noctea-(re)interperation-of-uniprot/P05067>
skos:closeMatch <http://purl.uniprot.org/uniprot/P05067>
I also don't mind the axioms added by Noctua to a UniProt IRI; I think all
the ones I have seen are valid.
But they might not apply if it is about a synthetic peptide, so maybe move up
one more level to CHEBI:33695.
I am not sure how that is/should be curated in Noctua models and if those
special cases need exact treatment.
My current belief is that it would always be valid to state that PRO:PAAAA
rdfs:subClassOf uniprot:PAAAA but not always uniprot:PAAAA
rdfs:subClassOf PRO:1. Mostly, I worry (too much) about the biological
exceptions that are rather interesting and
(lethally, considering many of them are about toxins) hard to represent
accurately (I might have a fear of ontological over commitment).
@alanruttenberg we would love to work on
formalizing aspects of UniProt curation, but funding for this is so hard to
get :(
I suspect it would end up looking a bit different from PRO but inspired by it,
and have so many, many classes and axioms.
@cmungall I also agree with @alanruttenberg
that I would love to attend a Noctua modelling
and logic presentation.
This is really off topic for this bug report, but do you have a pointer to
good intro material?
…On Wed, Nov 7, 2018 at 8:56 PM Chris Mungall ***@***.***> wrote:
On the matter of respecting your users, I understand that need, but that
seems to be something that you need to address with in the tool, not
necessarily in the ontologies
No, the requirement is that uniprotkb is used, regardless of tooling.
--
Jerven Bolleman
[email protected]
|
My current belief is that it would always be valid to state that PRO:PAAAA
rdfs:subClassOf uniprot:PAAAA but not always uniprot:PAAAA
rdfs:subClassOf PRO:1. Mostly, I worry (too much) about the biological
exceptions that are rather interesting and
(lethally, considering many of them are about toxins) hard to represent
accurately (I might have a fear of ontological over commitment).
Sorry, I am not following this part
|
synthetic peptides: we don't annotate to these, only gene products of genes, so these would not be in neo
|
I think that using the PRO URIs in combination with skos:closeMatch is the best of both worlds. PRO terms have clear semantics and are already mapped, where appropriate, to UniProt. Using skos:closeMatch is a good bridge between OBO ontology terms and a more RDF-oriented view. What do you think, @nataled |
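A bridging statement of this kind could be emitted as a single N-Triples line, as in the sketch below. It assumes that the PRO term reuses the UniProtKB accession after the `PR_` prefix (as described earlier in the thread), which will not hold for every PRO term; the function name `close_match_triple` is hypothetical.

```python
# Assumed IRI prefixes; the PRO one presumes the accession is reused after "PR_".
PRO_PREFIX = "http://purl.obolibrary.org/obo/PR_"
UNIPROT_PREFIX = "http://purl.uniprot.org/uniprot/"
SKOS_CLOSE_MATCH = "http://www.w3.org/2004/02/skos/core#closeMatch"


def close_match_triple(accession: str) -> str:
    """Return one N-Triples line linking a PRO class to a UniProt IRI."""
    subject = PRO_PREFIX + accession
    obj = UNIPROT_PREFIX + accession
    return f"<{subject}> <{SKOS_CLOSE_MATCH}> <{obj}> ."


if __name__ == "__main__":
    print(close_match_triple("P05067"))
```

Because skos:closeMatch carries no OWL entailments about class or individual identity, a mapping like this links the two IRIs without asserting that they denote the same thing.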
After further rumination and discussion, I come to the conclusion that the main problem (for PRO) is that the scientific community uses UniProtKB identifiers to mean two different things. One, exemplified by GOA, is that they are basically the same as PRO, that is, that they represent actual proteins that can be annotated with functions, etc. The other, exemplified by Pfam and other protein classification projects, is that they represent the sequences of those proteins. My concern about usage of UniProt vs PRO centers on the need (by PRO) for that latter interpretation, and that imposing the former interpretation on the uniprot purls would leave us without a way to talk about the sequences themselves. So, a question to @JervenBolleman: assuming that http://purl.uniprot.org/uniprot/P05067 refers to a class of proteins, how would you refer to, say, the canonical sequence of that class? If there is a way to separate the two interpretations that solves the immediate problem. Bear in mind the following:
|
In the uniprot triplestore there is a |
Note that the isoform entries are isoforms in name (url) only. The actual
type is up:Sequence documented as "An amino acid sequence".
…On Wed, Nov 21, 2018 at 12:32 PM Chris Mungall ***@***.***> wrote:
In the uniprot triplestore there is a up:sequence property that connects
an entry to isoform entries. But I think what is required is a PURL for the
sequence specifically, e.g. having a PURL for
https://www.uniprot.org/uniprot/Q9FXT6.fasta?version=1
|
My interpretation of this is that http://purl.uniprot.org/uniprot/P10403 can (does?) refer to the protein entity (in the PRO sense) while http://purl.uniprot.org/isoforms/P10403-1 refers to (what happens to be) the canonical sequence of that protein. @JervenBolleman can you confirm? I also note the following:
An open question involves whether or not http://purl.uniprot.org/uniprot/P10403-1 is a valid PURL for the protein (material) entity that refers to that specific isoform. |
Mainly repinging folks working on this thread. Wondering if we could try again for a consensus decision, as it's impacting GO work in multiple projects. For what it's worth, after reading through the above it seems that there is a consensus that PRO OWL semantics are a better match for the Noctua use case than what we get from UniProt (RDF) now. I see two things stopping us from switching over. 1) PRO would need to add all of the proteins needed by GOC annotators. According to @nataled above (regarding trembl) it sounds like this would be possible. 2) Either GOC folks are convinced to use the PRO ids (sounds unlikely) or through a SMOP they see what they want to see in the Noctua UI (for selecting genes) and in the Noctua output (especially the flatfile GPAD output). The SMOP would be greatly enabled if PRO maintained a clear semantic structure mapping from PRO classes to UniProt records. (xref is not sufficiently clear in meaning). ? |
On Sun, Sep 29, 2019 at 10:58 PM goodb ***@***.***> wrote:
Mainly repinging folks working on this thread. Wondering if we could try
again for a consensus decision, as it's impacting GO work in multiple
projects.
For what it's worth, after reading through the above it seems that there is
a consensus that PRO OWL semantics are a better match for the Noctua use
case than what we get from UniProt (RDF) now.
I would put it slightly differently: we know the PRO OWL semantics work,
but we don't know enough about the uniprot semantics to know if we can
treat them as equivalent or as something else (but I see you address this
below)
Additionally, we can't entirely put aside sociotechnological constraints of
one set of IDs/URIs vs another...
I see two things stopping us from switching over. 1) PRO would need to add
all of the proteins needed by GOC annotators. According to @nataled
<https://github.com/nataled> above (regarding trembl) it sounds like this
would be possible.
What are the semantics of a non-GCRP trembl ID according to PRO?
But it's not just trembl. It would need to be the whole protein universe.
The consequence of PRO going up to 260m+ gene-level entries in a single OWL
file would need to be determined. At the least PRO needs to start
distributing more ready-cut modules (which I've requested for a while)
2) Either GOC folks are convinced to use the PRO ids (sounds unlikely) or
through a SMOP
SMOP?
they see what they want to see in the Noctua UI (for selecting genes) and
in the Noctua output (especially the flatfile GPAD output). The SMOP would
be greatly enabled if PRO maintained a clear semantic structure mapping
from PRO classes to UniProt records. (xref is not sufficiently clear in
meaning).
+1
Darren and Jerven is there anything we can do to facilitate this?
… ?
|
Sorry, I saw @alanruttenberg's use of SMOP (small matter of programming) above and liked its connotations. If we have the mappings from PRO to uniprot up front, I don't think it's terrible to handle the translations in the Noctua code. I have a cut at doing this for the reactome entities -> uniprot for GPAD working in noctua-dev now. I don't see how you avoid loading the whole protein universe without a Noctua stack architecture change. Whether it's a PRO expansion or UniProt being ingested into neo, we still end up with a gigantic OWL file. For others' information, as it stands now, Noctua is driven from a 1.45 GB merged OWL file (go-lego), of which 1.12 GB is neo. This contains all of the classes that can be used to type the instances in the go-cam models, with neo containing the gene product classes. Although it introduces some technical hassle (e.g. that the entire file is loaded by default when attempting to load a GO-CAM OWL model into Protégé or another tool), it actually works just fine for the Noctua application right now. It's probably drifting off topic here, but if there was a way to grow neo based on curator demand (e.g. one protein at a time as they needed it), we might be able to solve the giant OWL file problem. |
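The "mappings up front" idea can be pictured as a plain lookup applied at output time. The sketch below is not the Noctua code; the function name, the CURIE forms and the one-entry mapping table are all hypothetical, and a real mapping would have to come from a curated PRO-to-UniProt resource.

```python
from typing import Dict


def translate_for_output(pro_id: str, pro_to_uniprot: Dict[str, str]) -> str:
    """Return the mapped UniProtKB identifier, or the PRO identifier if unmapped."""
    return pro_to_uniprot.get(pro_id, pro_id)


if __name__ == "__main__":
    # Hypothetical one-entry mapping table, loaded once before any output is written.
    mapping = {"PR:P05067": "UniProtKB:P05067"}
    print(translate_for_output("PR:P05067", mapping))
    print(translate_for_output("PR:000000001", mapping))
```

Falling back to the original identifier keeps unmapped classes visible in the output instead of dropping them, which makes gaps in the mapping easy to spot.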
@goodb and @cmungall could you please open separate issues for separate concerns? This issue was quite focused in its request and now asks a zillion different things in your discussions. Basically, my request is -> if you annotate UniProt entries use UniProt purls. If you are annotating something else, use something else. Don't have users annotate UniProt but use PRO, nor have users annotate PRO and use UniProt. Not every UniProt case can be represented in PRO (or the other way around), nor are these the only two databases that users of noctua might wish to use. e.g. nextprot and ensembl proteins are valid IRI targets for GO-CAM annotation as well. |
ok shall we do this on the pro tracker since this question (semantic rel
between up and pro purls) isn't really a go issue per se
…On Mon, Sep 30, 2019 at 11:23 AM JervenBolleman ***@***.***> wrote:
@goodb <https://github.com/goodb> and @cmungall
<https://github.com/cmungall> could you please open separate issues for
separate concerns? This issue was quite focused in its request and now
asks a zillion different things in your discussions.
Basically, my request is -> if you annotate UniProt entries use UniProt
purls. If you are annotating something else, use something else.
Don't have users annotate UniProt but use PRO, nor have users annotate PRO
and use UniProt. Not every UniProt case can be represented in PRO (or the
other way around), nor are these the only two databases that users of
noctua might wish to use. e.g. nextprot and ensembl proteins are valid IRI
targets for GO-CAM annotation as well.
|
I'm fine with using the PRO tracker, even though pretty much all the unanswered questions are about UniProt. Here are the topics discussed (probably missed a few):
Topics 1 and 2 will be further addressed here: PROconsortium/PRoteinOntology#165 Finally, one point of clarification:
Actually, every UniProt case CAN be represented in PRO. It's just that a small subset has to be done manually. |
Thanks Darren! I'm following the issue in the PRO tracker. We will hold up
all discussion on this issue on this tracker for now
…On Wed, Oct 2, 2019 at 5:30 AM Darren A. Natale ***@***.***> wrote:
I'm fine with using the PRO tracker, even though pretty much all the
unanswered questions are about UniProt. Here are the topics discussed
(probably missed a few):
1. What do the UniProt PURLs denote: database entry, protein class, or
sequence?
2. How does PRO relate to UniProt?
3. User needs: a SMOP, or address ontologically?
Topics 1 and 2 will be further addressed here:
PROconsortium/PRoteinOntology#165
<PROconsortium/PRoteinOntology#165>
Finally, one point of clarification:
Don't have users annotate UniProt but use PRO, nor have users annotate PRO
and use UniProt. Not every UniProt case can be represented in PRO (or the
other way around), nor are these the only two databases that users of
noctua might wish to use. e.g. nextprot and ensembl protein's are valid IRI
targets for GO-CAM annotation as well.
Actually, every UniProt case CAN be represented in PRO. It's just that a
small subset has to be done manually.
|
What's the status of this? |
No progress. Still open. I prefer that Neo uses |
This will make it easier to link the UniProt data with the GO (A) data at the RDF and OWL level.
Mostly, it will make it easier for us to introduce Noctua-compatible modelling for UniProt->GO term relations, with the benefit that users loading both datasets do not get duplicate triples just because we don't use the same IRIs.