ChEBI : limit load to ChEBI that have 'UniProt' synonyms #78
Comments
It looks like there is already a limited set from chebi coming in to the "NEO" build (~23k) versus the regular GO release build (~177k). Maybe that's what's already coming in on imports? I'm not sure there is actually anything in the current build process to shave that down more, as we're only examining GPIs and GAFs to produce this. |
The correct place to handle this is upstream. go-lego.owl imports go-plus, which uses a chebi_import determined by the editors file. This is likely both too small (doesn't have any terms that have not been used in the ontology) and too large (includes protonation variants). Unfortunately, simply limiting to the pH 7.3 forms will have issues, since the hierarchy for any one protonation form is often incomplete, and you need all branches with the GCIs to get a complete hierarchy (if that sounds strange and complex, that is because it is). My preference would be to first scope out more complete requirements for what we want and don't want in ChEBI and then prioritize a project based on this. For example, in addition to having a canonical protonation state, we want the labels to be intuitive and searchable, we want to ensure that curators are consistent in the level they choose (e.g. L vs D form), and we want to simplify the process of using ChEBI in the ontology, and simplify things for users who might want to use ChEBI and GO together. We can explore a hack in go-lego that subtracts from the ChEBI terms in go-plus, but I think this will lead to marginal gain at high complexity cost. |
This is the file that RHEA uses: It would be useful to know how many chemicals we'd be missing if we used this. Thanks, Pascale |
Once I figure out how to do it, I will check the RHEA list against all the ChEBI ID's in Reactome. (If someone reading this knows how, that would be great!) |
@deustp01 Is there a good source for that information? If I just munge through reacto.owl
|
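For anyone wanting to try the same kind of count, here is a minimal sketch of pulling ChEBI identifiers out of an OWL file by pattern matching. It assumes the file references ChEBI classes through the standard OBO PURL form (`http://purl.obolibrary.org/obo/CHEBI_<number>`); the function name is hypothetical, and this is not necessarily how the numbers in this thread were produced.

```python
import re

# Assumption: ChEBI classes appear as OBO PURLs, e.g.
# http://purl.obolibrary.org/obo/CHEBI_15377
CHEBI_IRI = re.compile(r"http://purl\.obolibrary\.org/obo/CHEBI_(\d+)")


def chebi_ids_in_owl_text(text):
    """Return the set of distinct ChEBI CURIEs mentioned in OWL text."""
    return {f"CHEBI:{number}" for number in CHEBI_IRI.findall(text)}


if __name__ == "__main__":
    sample = (
        '<owl:Class rdf:about="http://purl.obolibrary.org/obo/CHEBI_15377"/>\n'
        '<owl:Class rdf:about="http://purl.obolibrary.org/obo/GO_0008150"/>\n'
        '<owl:Class rdf:about="http://purl.obolibrary.org/obo/CHEBI_15377"/>\n'
    )
    print(sorted(chebi_ids_in_owl_text(sample)))
```

A regex pass like this only counts mentions; it does not distinguish declared classes from IRIs that merely appear in axioms, so a proper OWL parser would be the safer choice for anything beyond a rough count.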
@kltm The attached tab-delimited text file contains entries for the reference form of every chemical known to Reactome (including un-released ones), one row for each chemical. ("Reference" means the information we get from an external reference resource, almost always ChEBI, and which we use to construct "working" instances by adding subcellular location information - so there's only one water reference but many working forms differing by location.) The first entry in each row is the chemical's name; the second is its identifier in the reference resource. If you just omitted all the rows whose identifier does NOT start with ChEBI, that would be OK - there aren't many, and basically if we can't specify something well enough to get a ChEBI identifier for it, it's not well enough specified for GO-CAM either. Adding @ukemi for a sanity check. |
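The filtering described above (keep only rows whose identifier starts with ChEBI, then see what is missing from another list) could be sketched roughly as follows. The column layout follows the description in this comment; the exact identifier spelling (here assumed to be something like `ChEBI:15377`), the function names, and the sample data are assumptions for illustration only.

```python
import csv
import io
import re


def chebi_ids_from_tsv(handle):
    """Collect ChEBI IDs from a tab-delimited file: name in column 1, identifier in column 2."""
    ids = set()
    # QUOTE_NONE: chemical names may contain quote characters.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue
        identifier = row[1].strip()
        # Omit rows whose identifier does not start with ChEBI (case-insensitive).
        if not identifier.lower().startswith("chebi"):
            continue
        match = re.search(r"(\d+)$", identifier)
        if match:
            # Normalize to a single CURIE style so sets compare cleanly.
            ids.add(f"CHEBI:{match.group(1)}")
    return ids


def missing_from(wanted_ids, available_ids):
    """IDs in wanted_ids that are absent from available_ids."""
    return set(wanted_ids) - set(available_ids)


if __name__ == "__main__":
    sample = "water\tChEBI:15377\nATP\tChEBI:30616\nunmapped thing\tOTHER:1\n"
    reactome_ids = chebi_ids_from_tsv(io.StringIO(sample))
    rhea_ids = {"CHEBI:15377"}  # hypothetical stand-in for the Rhea list
    print(sorted(missing_from(reactome_ids, rhea_ids)))
```

Normalizing both sides to the same `CHEBI:<number>` form before the set difference avoids spurious mismatches caused by capitalization or prefix style.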
@deustp01 Processing that file in a similar way:
So, like 3k short. |
I think one question that remains is how to handle entities from imported sources like this and build a robust and complete entity ontology for use in models. In this case Reactome is the straw man, but there have been proposals to do this with other resources as well. I think (correct me if I am wrong) that the plan for Reactome proteoforms and complexes is to move towards using PRO. So there is an ontology for that. We should be able to distinguish location for the Reactome entities using the PRO ids, existing relations and GO cellular components. I would think this could be extended to ChEBI entities, existing relations and GO cellular components. The question that I still have with respect to this exact ticket is whether Reactome expects all the mismatches to eventually be mapped to Rhea and be incorporated into ChEBI and get blessed in the usable set. |
Yes, as above, that is the hope: "if we can't specify something well enough to get a ChEBI identifier for it, it's not well enough specified for GO-CAM either." I'm expecting / hoping / guessing from the work with Rhea and ChEBI over the past few years that we are not going to run into the issue of chemicals important to annotate human (patho)physiology that are a priori out of scope for these other resources. Also, there are generic terms, items like "polypeptide" or "nucleotide" that we can continue to use to ensure that all Reactome physical entities can be mapped to something in ChEBI to enable conversion to GO-CAM to proceed. |
I am confident we can get a simple, biologist-friendly subset that satisfies all our requirements IF ChEBI can fix one thing. Right now it is impossible to make a subset of ChEBI that excludes non-pH 7.3 non-protonated forms without losing large numbers of important classifications. I finally got around to making a comprehensive report for ChEBI: From a GO perspective, this is one of the most important things ChEBI could work on. I suspect this will be high priority for Rhea too. I know it is a priority for multiple other ontologies that use ChEBI. Note that we would be interested in seeing a systematic approach to this: manually synchronizing the different branches for the different protonated forms is not scalable. I am willing to spend lots of time with the ChEBI team to explain how OWL can help solve this in a systematic way. |
This would vastly reduce the number of ChEBI terms to choose from, and would make sure we use the 7.3 forms.
Thanks, Pascale
@kltm