GPAD/GPI export from chado #276

Closed
pombase-admin opened this issue Dec 11, 2013 · 33 comments

Comments

@pombase-admin

Export GO annotations and supporting data in GPAD and GPI formats instead of GAF.

File format spec here:
http://wiki.geneontology.org/index.php/Final_GPAD_and_GPI_file_format

raise priority when GOC set a deadline - that's when we'll actually have to do it

Original comment by: mah11

@ValWood
Member

ValWood commented Apr 21, 2017

@hdrabkin

when we do this, we should also try to include PRO

Harold:
Hi Val

Our GPI file here will have samples of how we associate them.

Here are a couple of examples

They are treated as separate identifiers (hence the PR in column 1), but then map to the mgi gene (MGI:MGI:))

Here’s one for an isoform:

PR Q9ESQ8-2 mNPVF/iso:m2 pro-FMRFamide-related neuropeptide VF isoform m2 (mouse) protein taxon:10090 MGI:MGI:1926488 UniProtKB:Q9ESQ8-2

Here’s one for a modified form (not of the example above).

PR 000030074 mLNP/Phos:1 protein lunapark phosphorylated 1 (mouse) protein taxon:10090 MGI:MGI:1918115

@mah11
Member

mah11 commented Apr 12, 2019

upping to medium priority because GO has a timeline at last:

  • June 1st: finalize annotation properties and header
  • July 1st: review by contributing groups
  • October 1st: implementation

proposed spec: https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md

@mah11
Member

mah11 commented Mar 16, 2020

Update: GO now has a ticket angling for comments on the GPAD/GPI spec:

geneontology/go-annotation#2864

... so now seems like a good time for @kimrutherford to take a look and make sure there won't be any problems producing those files.

@kimrutherford
Member

take a look and make sure there won't be any problems producing those files.

I think it won't be a problem. The only part that will require a bit of care (if I'm reading it correctly) is the mapping from GO evidence codes to ECO.

@mah11
Member

mah11 commented Mar 17, 2020

Yeah. They've tried to include cross-references to GO codes in ECO, but IIRC there have been a few where mapping wasn't totally straightforward. Try looking at the "xref: GOECO:" lines in ECO and see how much that covers.
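A rough sketch of the check suggested above: scan ECO's OBO text for "xref: GOECO:" lines and list which GO evidence codes they cover. The sample stanzas are a small illustrative fragment rather than the real file, and the function names (`goeco_xrefs`, `uncovered`) are made up for this example.

```python
import re
from collections import defaultdict

# Tiny illustrative OBO fragment; the real input would be the ECO release file.
SAMPLE_OBO = """\
[Term]
id: ECO:0000314
name: direct assay evidence used in manual assertion
xref: GOECO:IDA "inferred from direct assay"

[Term]
id: ECO:0000315
name: mutant phenotype evidence used in manual assertion
xref: GOECO:IMP "inferred from mutant phenotype"

[Term]
id: ECO:0000006
name: experimental evidence
"""

XREF_RE = re.compile(r"^xref:\s*GOECO:(\S+)")


def goeco_xrefs(obo_text):
    """Return {GO evidence code: [ECO IDs]} built from 'xref: GOECO:' lines."""
    mapping = defaultdict(list)
    current_id = None
    for line in obo_text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current_id = None  # a new stanza starts; forget the previous ID
        elif line.startswith("id: "):
            current_id = line[len("id: "):]
        else:
            match = XREF_RE.match(line)
            if match and current_id:
                mapping[match.group(1)].append(current_id)
    return dict(mapping)


def uncovered(codes, mapping):
    """Evidence codes we use that have no GOECO xref in the parsed text."""
    return sorted(code for code in codes if code not in mapping)
```

Calling `uncovered(...)` with the set of evidence codes actually present in our annotations would show where the xrefs alone are not enough.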

@mah11
Member

mah11 commented Apr 3, 2020

It might be worth paying attention to this ticket - evidenceontology/evidenceontology#251

@mah11
Member

mah11 commented May 12, 2020

@mah11
Member

mah11 commented May 12, 2020

or the "derived" version might be better so we don't "have to walk up the ECO graph yourself when going from ECO -> GAF" ... although I think we're mostly concerned about going from GAF -> ECO

https://github.com/evidenceontology/evidenceontology/blob/master/gaf-eco-mapping-derived.txt
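For the GAF -> ECO direction, a lookup along these lines might be enough. It assumes a three-column, tab-separated mapping (GAF evidence code, a reference or `Default`, ECO ID) with `#` comment lines; that layout and the sample rows are assumptions to check against the actual mapping file, and the function names are hypothetical.

```python
# Illustrative rows only; the real rows would be read from the mapping file.
SAMPLE_MAPPING = (
    "# code\treference\tECO ID\n"
    "IDA\tDefault\tECO:0000314\n"
    "IMP\tDefault\tECO:0000315\n"
    "IEA\tDefault\tECO:0000501\n"
    "IEA\tGO_REF:0000002\tECO:0000256\n"
)


def load_gaf_eco_mapping(text):
    """Parse the mapping text into {(code, reference): ECO ID}."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        code, reference, eco_id = line.split("\t")
        mapping[(code, reference)] = eco_id
    return mapping


def eco_for(code, reference, mapping):
    """Prefer a reference-specific ECO ID, falling back to the code's default.

    Raises KeyError if the evidence code has no default row at all.
    """
    specific = mapping.get((code, reference))
    if specific is not None:
        return specific
    return mapping[(code, "Default")]
```

Letting an unmapped code raise `KeyError` is deliberate in this sketch, so that a code missing from the mapping is noticed rather than silently written out.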

@mah11
Member

mah11 commented Aug 4, 2020

Returning to this ticket again now that GO is seeking sign-off on the spec (linked above) ... there are two main points for us -- one is the relations between terms/IDs and gene products that GO now wants, and the other is submitting extensions with the small set of relations that GO now allows (where we've retained some more specific ones locally).

Term-gene product relations for GPAD column 3:

  • For molecular function (F), use RO:0002327 'enables' (hideous usage, but it's what they've settled on) unless we have explicitly used contributes_to (RO:0002326).
  • For biological process (P), ignore all the blather and just always use RO:0002331 'involved in'.
  • For cellular component (C) it could be a bit tricky.
    • If we have used colocalizes_with anywhere, leave it (i.e. RO:0002325) as the relation, but I don't think we do use it any more.
    • For the rest, GO's preference depends on the term ancestry, so if this is realistically feasible:
      • if_descendant_of GO:0032991, use BFO:0000050 'part of'
      • otherwise, use RO:0002432 'is active in' (updated 2021-09-07; this previously said RO:0001025 'located in')
    • If it would be a colossal pain in the ass to make this distinction, we could see if just using 'part of' (BFO:0000050) for all CC causes any ructions.
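The column 3 rules above could be sketched roughly as follows. This is only an illustration: the function name, the argument shapes (a set of qualifier strings and a precomputed set of ancestor term IDs) and the way ancestry is supplied are all assumptions, not how the export code necessarily works.

```python
ENABLES = "RO:0002327"
CONTRIBUTES_TO = "RO:0002326"
INVOLVED_IN = "RO:0002331"
COLOCALIZES_WITH = "RO:0002325"
PART_OF = "BFO:0000050"
IS_ACTIVE_IN = "RO:0002432"

# Root term used for the ancestry test in the cellular component rule.
COMPLEX_ROOT = "GO:0032991"


def gpad_relation(aspect, term_id, ancestors, qualifiers=frozenset()):
    """Pick the GPAD column 3 relation for one annotation.

    aspect     -- "F", "P" or "C"
    term_id    -- the annotated GO term ID
    ancestors  -- set of ancestor term IDs (assumed to be precomputed)
    qualifiers -- qualifiers explicitly used on the annotation
    """
    if aspect == "F":
        return CONTRIBUTES_TO if "contributes_to" in qualifiers else ENABLES
    if aspect == "P":
        return INVOLVED_IN
    if aspect == "C":
        if "colocalizes_with" in qualifiers:
            return COLOCALIZES_WITH
        if term_id == COMPLEX_ROOT or COMPLEX_ROOT in ancestors:
            return PART_OF
        return IS_ACTIVE_IN
    raise ValueError(f"unknown aspect: {aspect!r}")
```

The fallback of using 'part of' for every cellular component annotation would amount to dropping the ancestry test from the "C" branch.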

Extension relations (also noted for GAF in #744):

  • Delete all "negated" extensions (i.e. ones that use not_happens_during or not_exists_during).
  • Convert a whole bunch of relations to RO:0002233 (has input):
    • has_direct_input
    • regulates_activity_of
    • has_regulation_target
    • directly_negatively_regulates
    • directly_positively_regulates
  • Convert occurs_at to BFO:0000066 (occurs in)
  • I don't think we're using regulates_expression_of, regulates_transcription_of, or regulates_translation_of, but if we ever do, convert them to (wait for it) RO:0002233 (has input).
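As a sketch of the extension clean-up listed above, assuming each extension is held as a `(relation, target)` pair (a representation chosen just for this example) and that relations not mentioned in the list pass through unchanged:

```python
HAS_INPUT = "RO:0002233"
OCCURS_IN = "BFO:0000066"

# "Negated" relations: extensions using these are deleted outright.
NEGATED = {"not_happens_during", "not_exists_during"}

# Local relations that all collapse to 'has input' for export.
TO_HAS_INPUT = {
    "has_direct_input",
    "regulates_activity_of",
    "has_regulation_target",
    "directly_negatively_regulates",
    "directly_positively_regulates",
    "regulates_expression_of",
    "regulates_transcription_of",
    "regulates_translation_of",
}


def convert_extensions(extensions):
    """Return the export form of a list of (relation, target) pairs."""
    converted = []
    for relation, target in extensions:
        if relation in NEGATED:
            continue  # drop negated extensions entirely
        if relation in TO_HAS_INPUT:
            relation = HAS_INPUT
        elif relation == "occurs_at":
            relation = OCCURS_IN
        converted.append((relation, target))
    return converted
```

Several local relations collapsing to the same ID can leave duplicate pairs after conversion; whether to de-duplicate them is left open here.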

@kimrutherford
Member

Thanks Midori. I'm very glad you have a handle on that.

For the rest, GO's preference depends on the term ancestry, so if this is realistically feasible:
if_descendant_of GO:0032991, use BFO:0000050 'part of' otherwise, use RO:0001025 'located in'
If it would be a colossal pain in the ass to make this distinction, we could see if just using 'part of' (BFO:0000050) for all CC causes any ructions.

It should be OK. I don't recognise "if_descendant_of" though.

@mah11
Member

mah11 commented Aug 10, 2020

I don't recognise "if_descendant_of" though.

Don't worry, as long as it's clear how the file should come out. I cribbed "if_descendant_of" from website display config (it's not a relation that would end up in any output files).

@kimrutherford
Member

In the GPI file do we need to put anything in columns 7 to 11? They're all optional.

kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Aug 11, 2020
@kimrutherford
Member

As a first step I've added the code to write a GPI file during the nightly update. That's the easy part of GPI/GPAD.

Here's a sample of the output:
http://curation.pombase.org/misc/gene_product_information_taxonid_4896.tsv

In the GPI file do we need to put anything in columns 7 to 11?

I've left those columns empty for now.

I'll chip away at implementing GPAD writing.

@mah11
Member

mah11 commented Aug 11, 2020

Great start; thanks!

It does need more tweaking (sorry to bear bad news).

This is probably the most important point:
We definitely shouldn't put the gene product description in column 2. That column is for the gene symbol; it's set up that way to accommodate species where the gene name and gene symbol are different (e.g. all those mad Drosophila names, where the symbols are usually abbreviations).

We might as well just put the gene name in both columns 2 and 3. It's redundant for us (and SGD) but I think it's the best workaround we've got.

In the GPI file do we need to put anything in columns 7 to 11? They're all optional.

Unfortunately, it's not quite that simple.

  • For protein-coding genes, we do have to put the UniProtKB accession in column 10. This is kind of buried, and if you think it looks not-entirely-consistent with "cardinality 0 or greater", well, I'm right there with ya.

  • On a related note, I'm finding the spec a bit woolly with respect to whether we can leave SO:0000704 in column 5 for all rows, or if we're actually required to use SO:0001217 for protein-coding genes (our feature type = protein coding) and SO:0001263 for the rest. Unless it would be a big pain, it might be worth using the more specific SO IDs.

  • Since we can't put the gene product description in column 2, we could include it in column 11 with the "go-annotation-summary" tag (e.g. go-annotation-summary=M phase inhibitor protein kinase Wee1). Despite using go-annotation-summary as the tag text, the description of allowed content is "A textual gene or gene product description.", so they'd better not complain.
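Putting those points together, one GPI row might be assembled something like this. It is a sketch under assumptions: the 11-column layout with the ID in column 1 and the taxon in column 6 is my reading of the proposed spec, the function and parameter names are invented, and the ID, taxon string and accession in the usage below are placeholders rather than real values.

```python
PROTEIN_CODING_GENE = "SO:0001217"
NCRNA_GENE = "SO:0001263"


def gpi_row(gene_id, symbol, feature_type, taxon,
            description=None, uniprot_acc=None):
    """Build one tab-separated GPI line (11 columns, unused ones left empty)."""
    columns = [""] * 11
    columns[0] = gene_id   # column 1: object ID (assumed layout)
    columns[1] = symbol    # column 2: gene symbol
    columns[2] = symbol    # column 3: gene name, same as the symbol for us
    # column 5: the more specific SO ID instead of SO:0000704 everywhere
    if feature_type == "protein coding":
        columns[4] = PROTEIN_CODING_GENE
    else:
        columns[4] = NCRNA_GENE
    columns[5] = taxon     # column 6: taxon (assumed layout)
    if uniprot_acc:
        # column 10: UniProtKB accession, expected for protein-coding genes
        columns[9] = f"UniProtKB:{uniprot_acc}"
    if description:
        # column 11: product description under the go-annotation-summary tag
        columns[10] = f"go-annotation-summary={description}"
    return "\t".join(columns)


# Usage with placeholder ID, taxon string and accession:
example = gpi_row(
    "PomBase:EXAMPLE.01", "wee1", "protein coding", "taxon:4896",
    description="M phase inhibitor protein kinase Wee1",
    uniprot_acc="X00000",
)
```

Columns 4 and 7 to 9 are left empty here, matching what the first version of the file did for the optional columns.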

@kimrutherford
Member

Yes, I think it should include part_of, and then it'll be ready for prime time.

I've made that config and code change. It looks like there is only one term, with two annotations, where it makes a difference.

GO:0042788 polysomal ribosome: https://www.pombase.org/term/GO:0042788

kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Aug 14, 2020
kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Aug 14, 2020
Changed to interesting_isa_parents to be more accurate.  Also add a new
field "all_interesting_parents" which contains the interesting parent
and the relation to get to that parent.

Refs pombase/pombase-chado#276
kimrutherford added a commit to pombase/pombase-config that referenced this issue Aug 14, 2020
kimrutherford added a commit to pombase/website that referenced this issue Aug 14, 2020
Change to interesting_isa_parents to reflect changes in JSON generation code.

Refs pombase/pombase-chado#276
@kimrutherford
Member

So I guess ping them via the ticket they have open for comments?

Thanks for the suggestion. I've done that.

Perhaps we're a bit ahead of the game? I hope the GPAD/GPI spec doesn't change too much. :-)

kimrutherford added a commit to pombase/pombase-legacy that referenced this issue Aug 14, 2020
kimrutherford added a commit that referenced this issue Aug 16, 2020
kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Aug 16, 2020
kimrutherford added a commit to pombase/website that referenced this issue Aug 16, 2020
kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Aug 16, 2020
kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Aug 16, 2020
@kimrutherford
Member

I was thinking about closing this issue then I noticed this comment:

when we do this, we should also try to include PRO

What does that involve?

@kimrutherford
Member

While writing this essay, #744 (comment), I realised that the GPAD output has "binds(...)" where it should have "with" entries due to: pombase/website#108

That needs fixing.

@mah11
Member

mah11 commented Aug 18, 2020

when we do this, we should also try to include PRO

What does that involve?

I don't know, but I suspect it might be something we could hive off into its own ticket, to be got round to in a while rather than making the rest of the GPAD/GPI export wait for it.

@kimrutherford
Member

I suspect it might be something we could hive off into its own ticket,

I think so too. I'll leave that for Val to summarise in a new issue when she's back from her world tour.

@ValWood
Member

ValWood commented Aug 24, 2020

I'm not sure that anything needs doing especially for PRO. If we use PRO for modified forms, that should be handled automatically? I would close. If anything is required we can open a new ticket once we know what it is.
Val

@kimrutherford
Member

OK, thanks Val.

I'll do this then close the issue: #276 (comment)

Once GO start accepting or validating GPAD/GPI files I'll open new issues for any problems.

kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Aug 25, 2020
Previously we were moving "with" values into "binds" extensions for display
in advance.  We now move the values only in the term/gene etc. details
when they are requested.  This allows us to write out the GAF and GPAD
files with the "with" value in the conventional column.

See: pombase/pombase-chado#276 (comment)

Refs pombase/pombase-chado#276
@kimrutherford
Member

I realised that the GPAD output has "binds(...)" where it should have "with" entries due to: pombase/website#108

Fixed! (After tonight's load)

So I'll close this issue and wait until GO are ready for GPAD/GPI files.

For reference, the files are here for now:
https://curation.pombase.org/dumps/latest_build/misc/gene_product_annotation_data_taxonid_4896.tsv
https://curation.pombase.org/dumps/latest_build/misc/gene_product_information_taxonid_4896.tsv
and are generated every night.

@mah11 mah11 added GO export GAF, GPAD, GPI and removed export files labels Sep 15, 2020
kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Sep 6, 2021
Change "sgf73 (PomBase:SPCC126.04c)" to "PomBase:SPCC126.04c"

Refs pombase/pombase-chado#276
Refs pombase/pombase-chado#848
@kimrutherford kimrutherford mentioned this issue Apr 14, 2022
kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Jul 8, 2022
They're now in alphabetical order for consistency between nightly loads.

Refs pombase/pombase-chado#276