GPAD/GPI 2.0 Specifications - Request for comments #2864

Closed
vanaukenk opened this issue Mar 9, 2020 · 59 comments

@vanaukenk
Contributor

vanaukenk commented Mar 9, 2020

Starting on Tuesday, March 10th, we will be requesting review and comments on the proposed GPAD/GPI 2.0 file format specifications.

https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md

Please add any comments or questions you have about the specs to this ticket by Tuesday, March 31st.

Thank you.

@RLovering

Hi
I currently use GAF files in my Cytoscape analysis because the isoform information associated with a UniProt ID is in column 17, not column 2. As a result, all annotations associated with the isoforms (for example P15692-4 (VEGFA)) are associated with the P15692 node in the Cytoscape network. If P15692-4 were in column 2, only the P15692-4 node in the network would be associated with these annotations, not P15692.

I have just done a small test and I can't get the isoform gpa file from EBI to load into a Cytoscape analysis, whereas the isoform gaf file works fine, with all GO annotations that are specific to isoform 4 associated with the P15692 node. For example, the GO term basophil chemotaxis has no child terms and is associated with P15692-4, but in the Cytoscape analysis it is associated with the P15692 node, not P15692-4.

Please can you ensure that Cytoscape is able to use the GPAD file before you get rid of the GAF file, and also consider these isoform implications? Alternatively, please provide very clear information for non-bioinformaticians on how to use the GPAD file in Cytoscape.
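
One way a script could reproduce the GAF behaviour, when a file puts the isoform accession in the object ID column, is to collapse isoform accessions to their parent accession before building the network. This is a minimal sketch, assuming isoform accessions take the form of the parent accession followed by a hyphen and a number, as in P15692-4. The function name is hypothetical, and this is a pre-processing idea rather than a Cytoscape feature.

```python
import re

# Assumption: an isoform accession is the parent accession plus "-<digits>",
# e.g. P15692-4 for isoform 4 of P15692 (VEGFA).
_ISOFORM_SUFFIX = re.compile(r"-\d+$")


def parent_accession(accession: str) -> str:
    """Return the accession with any trailing isoform suffix removed."""
    return _ISOFORM_SUFFIX.sub("", accession)


print(parent_accession("P15692-4"))  # P15692
print(parent_accession("P15692"))    # P15692
```

Applying this to the object ID column would attach isoform-specific annotations to the P15692 node, which is the behaviour described above for the GAF file.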

Thanks

Ruth

@hattrill

Is the plan to fully retire the GAF? It would be good if we could maintain GAFs (even if it is just computed from GPAD/GPIs we submit - it's just such a wonderfully simple format that most biologists can manipulate it to their satisfaction and joy).

@vanaukenk
Contributor Author

@RLovering @hattrill

We will definitely be continuing to maintain and provide GAF for our users. We're well aware that many applications and tools use GAF and it will take some time for these tools to transition, if they transition at all.

However, internally, we would like to move towards GPAD/GPI as our exchange file formats, as these new formats are more robust (i.e. IDs, not text) and provide us with a mechanism to exchange additional metadata that will be critical for importing annotations into Noctua.

@tberardini

GPAD

  1. Column 3: Relation: What is the default for BP, if there is a default?
  2. Column 5: Reference: Are DOIs an acceptable reference entry? DOI:id
  3. ORCIDs: Are these meant to be used only in Col 12 (Annotation Properties), or can they also be used in Col 10 (Assigned by)?

GPI

  1. Column 9: MODs: Must associate gene ids with UniProtKB gene-centric reference protein accessions
    ---Do you mean: "Must associate gene ids for protein-coding genes SO:0001217..."?
    --- Also, is this generating a reciprocal set MOD-UniProt/ UniProt-MOD mapping files for the protein-coding entities?

@vanaukenk
Contributor Author

Noting this exchange here: geneontology/helpdesk#252
about use of underscores in term relation labels.

@kltm - I want to confirm what we will need in GPAD. Currently the specs are not using underscores in term relation labels.

@vanaukenk
Contributor Author

@tberardini - thanks for taking a careful look. Answers in-line below.

GPAD

  1. Column 3: Relation: What is the default for BP, if there is a default?

We are leaving selection of the default BP up to each individual curation group, as curation practices may differ by group.

Ontologically speaking, the default gp2BP relation would be 'acts upstream of or within'

  2. Column 5: Reference: Are DOIs an acceptable reference entry? DOI:id

Yes, DOIs are definitely an acceptable reference entry.

  3. ORCIDs: Are these meant to be used only in Col 12 (Annotation Properties), or can they also be used in Col 10 (Assigned by)?

ORCIDs are only to be used in the Annotation Property field. The Assigned by field will use an entry from the groups.yaml

GPI

  1. Column 9: MODs: Must associate gene ids with UniProtKB gene-centric reference protein accessions
    ---Do you mean: "Must associate gene ids for protein-coding genes SO:0001217..."?

Yes, I'll add that clarification.

--- Also, is this generating a reciprocal set MOD-UniProt/ UniProt-MOD mapping files for the protein-coding entities?

Yes, it will.

@tberardini

Thanks for the clarification, @vanaukenk.

@kltm
Member

kltm commented Mar 27, 2020

@vanaukenk Re: #2864 (comment)
I'm not sure it matters, as there are no string values in the spec, only CURIEs. Technically, a human-readable label does not need an underscore. What is going on with the GAF is something different.

@vanaukenk
Contributor Author

@kltm - Yes, apologies, mixing two things here.

I'll make a separate ticket for the proposal to add the full set of gp2term relations to the GAF and what we will use for that (CURIE vs string).

Thx.

@kltm
Member

kltm commented Mar 28, 2020

@vanaukenk If you want to touch base, we can do that--we ended up getting pretty confused on Friday as we tried to track through the helpdesk issue. I think we're all sorted, but if you have any questions on the GAF or GPAD spec it couldn't hurt to talk real fast.

@hattrill

Just noting, as I see that there is a comment under the SO table in the specs: in our current GAF, we have annotations to unmapped loci (these are super old and we don't make them now). We use the SO term 'gene' for these. We also output annotations for SO 'pseudogene'. We try to remove these annotations as we go, but sometimes there are a few present.

I don't see the benefit of releasing these annotations to unmapped/pseudogene. So perhaps the GPAD and future GAF specs should exclude these entities. (DBs could keep them, just not release them - they are sometimes useful for mapping genes)

@vanaukenk
Contributor Author

On the 2020-04-02 software call, we discussed two issues:

  1. Representation of identical protein sequences that can be encoded by more than one gene
  2. Clarifying the meaning of 'Parent Object ID'

For 1, we decided to represent the one:many relation by including all gene and/or protein names as synonyms, and including all genes as parents (according to the proposal for 2).

For 2, we felt that the meaning of 'Parent Object ID' could potentially be confusing depending upon what the entry represents, so we decided to split this column out into two: 'Encoded By' to capture a gene ID, and 'Parent Protein' to capture the gene-centric reference proteome accession for protein isoforms or peptides derived from proteolytically processed proteins.
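
The two decisions above can be pictured as a record with separate fields for the gene(s) and the parent protein. This is only a sketch: the class name, field names, and the `EXAMPLE:` identifiers are made up for illustration and are not taken from the spec.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GpiEntity:
    # Field names are illustrative, not the spec's column names.
    object_id: str
    synonyms: List[str] = field(default_factory=list)
    encoded_by: List[str] = field(default_factory=list)  # gene ID(s)
    parent_protein: Optional[str] = None  # reference proteome accession


# Case 1: one protein sequence encoded by two genes. Both genes are listed
# as parents and both gene names are kept as synonyms (IDs are placeholders).
shared = GpiEntity(
    object_id="EXAMPLE:protein1",
    synonyms=["geneA", "geneB"],
    encoded_by=["EXAMPLE:geneA", "EXAMPLE:geneB"],
)

# Case 2: an isoform records its gene and, separately, its parent protein.
isoform = GpiEntity(
    object_id="UniProtKB:P15692-4",
    encoded_by=["EXAMPLE:geneC"],
    parent_protein="UniProtKB:P15692",
)

print(shared.encoded_by)       # ['EXAMPLE:geneA', 'EXAMPLE:geneB']
print(isoform.parent_protein)  # UniProtKB:P15692
```

Keeping the two relationships in separate fields means a consumer never has to guess whether a parent ID refers to a gene or to a protein.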

@vanaukenk
Contributor Author

> Just noting - as I see that there is a comment under the SO table in the specs, that in our current GAF, we have annotations to unmapped loci (these are super old and we don't make them now). We use the SO term 'gene' for these. We also output annotations for SO 'pseudogene' - we try to remove these annotations as we go, but sometimes there are a few present.
>
> I don't see the benefit of releasing these annotations to unmapped/pseudogene. So perhaps the GPAD and future GAF specs should exclude these entities. (DBs could keep them, just not release them - they are sometimes useful for mapping genes)

@ukemi - can you provide examples of where MGI has used the other entity types for tomorrow's annotation conference call? Thx.

@hattrill

@vanaukenk - can I check that pipe separating PMIDs and MOD ref IDs in GPAD is ok? I thought that moves were afoot to just use the PMID.

@vanaukenk
Contributor Author

@hattrill

Yes, it is okay to pipe-separate a PMID and MOD paper id as long as they're referring to the same publication.

You're remembering correctly that at one point we talked about just using PMIDs, but since there is usually more information about the paper, especially wrt curation, at the MOD we decided to keep the MOD id, and link out from AmiGO.
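
A consumer of such a field could split it on the pipe and key each ID by its prefix. This is a minimal sketch; the function name is hypothetical and the IDs in the example are made up.

```python
def split_references(reference_field: str) -> dict:
    """Map each ID prefix to its local ID for one pipe-separated field."""
    refs = {}
    for curie in reference_field.split("|"):
        # Split on the first colon only, so local IDs may contain colons.
        prefix, _, local_id = curie.partition(":")
        refs[prefix] = local_id
    return refs


# Made-up IDs: a PMID and a MOD paper ID for the same publication.
print(split_references("PMID:12345|FB:FBrf0000001"))
# {'PMID': '12345', 'FB': 'FBrf0000001'}
```

Nothing in the sketch checks that the two IDs really refer to the same publication; that remains the submitter's responsibility.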

@hattrill

Sorry @vanaukenk another Q:
for the GPAD assigned by in col 10, should we use FB or FlyBase? (in the spec it says prefix). In GAF2.1 we have both DB (col 1) and assigned by (col 15), which are FB and FlyBase, respectively. (I think you have a similar thing for WormBase/WB :))

@vanaukenk
Contributor Author

I will double-check with @kltm
There are two yaml files with database name and abbreviation information:

https://github.com/geneontology/go-site/blob/master/metadata/groups.yaml
https://github.com/geneontology/go-site/blob/master/metadata/db-xrefs.yaml

One should be the definitive source of accepted db prefixes and names for the purposes of the annotation files, and we can add that information to the spec.

We actually use WB in both places for our WB GAF.
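
Until the definitive source is settled, a tool reading both conventions could normalize the abbreviation to the full name (or the reverse). The sketch below hard-codes only the two pairs mentioned in this thread; a real lookup would presumably be built from whichever yaml file is chosen, and the names here are hypothetical.

```python
# Pairs mentioned in this thread only; not a complete or authoritative list.
ABBREVIATION_TO_NAME = {"FB": "FlyBase", "WB": "WormBase"}


def full_name(value: str) -> str:
    """Return the full database name for a known abbreviation, else the input."""
    return ABBREVIATION_TO_NAME.get(value, value)


print(full_name("FB"))       # FlyBase
print(full_name("FlyBase"))  # FlyBase
```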

@hattrill

Thanks @vanaukenk I am guessing that we'd just stick with 'FlyBase' here.

@hattrill

hattrill commented Jun 4, 2020

> Noting this exchange here: geneontology/helpdesk#252
> about use of underscores in term relation labels.
>
> @kltm - I want to confirm what we will need in GPAD. Currently the specs are not using underscores in term relation labels.

@vanaukenk @kltm is this resolved? It seems to me that we should be using underscores in all gp2term rels, e.g. 'acts_upstream_of_or_within_positive_effect' rather than 'acts upstream of or within, positive effect' or 'acts upstream of or within positive effect', and 'contributes_to' rather than 'contributes to'.
Can you confirm this?

@vanaukenk
Contributor Author

@hattrill
For GPAD, we will only be using Relation Ontology IDs, no term labels at all.
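
A file checker could enforce this by rejecting any relation value that is not CURIE-shaped. This is a rough sketch: the regular expression is a loose approximation of CURIE syntax, and `RO:0002264` is used on the assumption that it is the Relation Ontology ID for 'acts upstream of or within' (worth verifying against RO before relying on it).

```python
import re

# Loose CURIE shape: a prefix, a colon, then a local ID with no whitespace.
CURIE = re.compile(r"^[A-Za-z][A-Za-z0-9_.]*:\S+$")


def is_curie(value: str) -> bool:
    """Return True if the value looks like prefix:local_id."""
    return CURIE.match(value) is not None


print(is_curie("RO:0002264"))                  # True
print(is_curie("acts upstream of or within"))  # False
print(is_curie("acts_upstream_of_or_within"))  # False
```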

@hattrill

hattrill commented Jun 4, 2020

....for GAF2.2?

@vanaukenk
Contributor Author

vanaukenk commented Jun 4, 2020

@hattrill
There is a separate ticket for the GAF2.2 relations which will use the underscore, since GAF uses text and not relation ids:

#2917

I've been trying to keep these two file issues separate, since it can be confusing which requirement applies where.
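
For the GAF side, the underscore form quoted earlier in this thread can be derived mechanically from the human-readable label. This is a sketch based only on the examples given here; the function name is hypothetical, and the authoritative list of GAF 2.2 labels is whatever #2917 settles on.

```python
def to_gaf_relation_label(label: str) -> str:
    """Drop commas and join the remaining words with underscores."""
    return "_".join(label.replace(",", " ").split())


print(to_gaf_relation_label("acts upstream of or within, positive effect"))
# acts_upstream_of_or_within_positive_effect
print(to_gaf_relation_label("contributes to"))
# contributes_to
```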

@vanaukenk
Contributor Author

From the 2020-07-28 annotation call, we are asking people who submit annotation files to the GOC to please sign off on the GPAD/GPI 2.0 specs on this ticket.

https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md

Signing off means that you've reviewed the specs and have raised and resolved any questions you might have.

The deadline for signing off on the specs is Tuesday, September 1st.

@mah11
@magrane
@Achchuthan
@hattrill
@ukemi
@vanaukenk
@pfey03
@gthayman
@suzialeksander
@sabrinatoro
@dustine32
@dsiegele
@deustp01
@tberardini
@malcolmfisher103

@kltm
Member

kltm commented Sep 1, 2020

At this point, I think all annotations are required to be traceable to an entity in users.yaml. I think a good first pass would be to populate users.yaml with all current and historical curators, using a GOC:xyz identifier for those that pre-date ORCID. For things like bots or annotations that really have an unknown history, I suppose a grouping entity could be created that still marks the annotation or alteration as automated or unknown from SGD.
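
The traceability requirement amounts to a set-membership check. The sketch below assumes the contributor IDs have already been extracted from the annotations and that the known IDs have already been read from users.yaml; the function name and the placeholder IDs are hypothetical, and no particular users.yaml layout is assumed.

```python
def untraceable(contributor_ids, known_ids):
    """Return contributor IDs, in first-seen order, that are not in known_ids."""
    known = set(known_ids)
    missing = []
    for contributor in contributor_ids:
        if contributor not in known and contributor not in missing:
            missing.append(contributor)
    return missing


# Placeholder IDs: one ORCID-style ID and one GOC:xyz-style ID are known.
known = ["orcid:0000-0000-0000-0001", "GOC:xyz"]
seen = ["GOC:xyz", "orcid:0000-0000-0000-0002", "orcid:0000-0000-0000-0002"]
print(untraceable(seen, known))  # ['orcid:0000-0000-0000-0002']
```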

@pgaudet
Contributor

pgaudet commented Sep 1, 2020

Why don't we use GOC:curators, which already exists?

@RLovering

Hi, is there a list somewhere of the users with no ORCID accounts? Sorry if I missed it. I would like to check it in case someone in my group is listed and I can help add the ORCID.
Thanks
Ruth

@hattrill

hattrill commented Sep 1, 2020

Pre-P2GO, we had no mechanism for individual attribution. In the GOA DB, the generic FlyBase curator "FlyBase GOcur" is used for these annotations.

@vanaukenk
Contributor Author

@RLovering

You can check the users.yaml file to see if any of the UCL curators are listed but don't have an orcid.

@dsiegele

The E. coli group is signing off on the new specs for GAF 2.2 and GPAD/GPI 2.0.

@gthayman

RGD signs off on the specs.

@gthayman gthayman removed their assignment Sep 14, 2020
@RLovering

Thanks Kimberly, I just added a few ORCIDs to the file. I'm not sure that you want the MSc student names, as their annotations are all checked and approved by either me or Shirin, so I don't think their names will be listed with the annotations available via Protein2GO.
Best

@vanaukenk
Contributor Author

Thanks @RLovering

If the MSc students' names will never be associated with production annotations in Protein2GO, I don't see that it's necessary to have them in the users.yaml file.

That said, if the students ever start making annotations that don't need to be checked in Protein2GO, or if they'd ever like to make GO-CAMs, we'll need to add them to the users.yaml file.

@kimrutherford

> At this point, I think all annotations are required to be traceable to an entity in users.yaml.

Sorry for not commenting on this earlier.

Because of community curation, we have annotations from 360 users. This number increases by 40-50 each year. Keeping users.yaml coordinated with our database is going to be a bit of a maintenance hassle. Would it be possible for us to provide a "users-pombase.yaml" alongside our GPAD/GPI files? That would allow us to automate the updating of the users file.

@ValWood @mah11

@suzialeksander
Contributor

@vanaukenk @kltm Our GAF, and now our new GPAD, contain annotations from upstream sources (UniProt, RNAcentral, GOC, etc.) obtained from the EBI FTP. Will these annotations soon come to us with specific contributor-ids for GPAD column 12? Should we leave column 12 blank for our "outside sources" until we have this info?

Alternatively, I could see making and assigning a "UniProt Curators" "RNACurators", etc. id as discussed above, or would we default to GOC:curators?

@vanaukenk
Contributor Author

@suzialeksander
A column 12 annotation property for contributor is not required in GPAD 2.0, so you don't have to fill this in for upstream resources if it's not being made available to you.
Would SGD want to use upstream source contributor ids for anything (either internally or externally)?

@suzialeksander
Contributor

@vanaukenk, sounds like leaving it blank will be the solution. We do attribute outside annotations to their sources on our site, but we can get this from column 10. We won't need more detail than that. Thanks!

@kimrutherford

> We're still working on getting parsers and pipelines in place for testing, hopefully within the next month.

Has there been any progress on this? We'd like to test our GPAD/GPI files to make sure we're ready to switch away from GAF format. Is there a GitHub issue we could keep an eye on? Thanks.

@suzialeksander
Contributor

Note: this might be a closable ticket, as at least one source is already putting out files labelled !gpi-version: 2.0. In that file, col 5 at least isn't in the right format, the header lacks dates, and col 9 isn't used when it could be.

ftp://ftp.ebi.ac.uk/pub/contrib/goa/gp_information.559292_sgd.v2.gz

UniProtKB:Q9ZZW7	BI3	Cytochrome b mRNA maturase bI3	BI3|Q0115	protein	NCBITaxon:taxon:559292				SGD:S000007272	db_subset=Swiss-Prot|go_annotation_complete=20180405|go_annotation_summary=Mitochondrial RNA binding protein involved in mitochondrial mRNA processing via Group I intron splicing

			
ComplexPortal:CPX-1021	dnf1-lem3_yeast	DNF1-LEM3 P4-ATPase complex	DNF1:LEM3|DNF1-LEM3 phospholipid flippase complex|LEM3-DNF1 complex|Aminophospholipid translocase complex|APLT complex	protein_complex	NCBITaxon:taxon:559292		
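
The points raised about this file can be checked with a few lines of parsing. The sketch below rebuilds a row modeled on the first sample line from explicit fields rather than copying it, so the number of empty columns is an assumption based on how the sample reads, and the long go_annotation_summary property is left out. The function name is hypothetical.

```python
def parse_properties(properties_field: str) -> dict:
    """Parse 'key=value|key=value' into a dict; an empty field gives {}."""
    props = {}
    for pair in filter(None, properties_field.split("|")):
        key, _, value = pair.partition("=")
        props[key] = value
    return props


# Row modeled on the sample above (assumed layout: three empty columns
# between the taxon and the SGD ID).
fields = [
    "UniProtKB:Q9ZZW7", "BI3", "Cytochrome b mRNA maturase bI3", "BI3|Q0115",
    "protein", "NCBITaxon:taxon:559292", "", "", "", "SGD:S000007272",
    "db_subset=Swiss-Prot|go_annotation_complete=20180405",
]
row = "\t".join(fields)

columns = row.split("\t")
object_type = columns[4]  # column 5
props = parse_properties(columns[10])

print(":" in object_type)  # False: a text label rather than an ontology ID
print(columns[8] == "")    # True: column 9 is empty in this row
print(props["db_subset"])  # Swiss-Prot
```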

@pgaudet
Contributor

pgaudet commented Jun 30, 2022

I think this was more about making sure people knew what we output; looking at http://release.geneontology.org/2022-06-15/annotations/index.html it looks like we do export GAF2.2, looking for example at cgd.gaf.

Thanks for checking!

@pgaudet pgaudet closed this as completed Jun 30, 2022
@hattrill hattrill reopened this Jun 30, 2022
@hattrill

@pgaudet GAF2.2 work was finished but the GPAD/GPI updates have not been completed/finalized. @vanaukenk, I think that this was never signed off. We are waiting for the final spec to produce the FB GPAD/GPI.

@pgaudet
Contributor

pgaudet commented Mar 16, 2023

Actually the specs are missing one point:

Column 9 xrefs is missing recommendations for RNA and complexes:
For GOA to pick up annotations to complexes and RNAs, the MOD ID mappings to Complex Portal and RNAcentral IDs need to be provided. Otherwise GOA cannot pick them up.

@suzialeksander
Contributor

Since this is still open, there has been a request to have entities in the GPAD col 11 (annotation extension) match the database providing the GPAD, if applicable. See geneontology/helpdesk#440

@balhoff
Member

balhoff commented Jul 18, 2023

Closing in favor of new round of comments on updated draft in #4684.
