Specifications for Gene Ontology Consortium GPAD and GPI tabular formats version 2.0

This document specifies the syntax of Gene Product Annotation Data (GPAD) and Gene Product Information (GPI) formats. GPAD describes the relationships between biological entities (such as gene products) and biological descriptors (such as GO terms). GPI describes the biological entities.

Status

This specification has been approved as version 2.0.

Summary of changes relative to 1.1

  • GPAD and GPI: columns 1 and 2 are now combined in a single column containing an id in CURIE syntax, e.g. UniProtKB:P56704.
  • GPAD: negation is captured in a separate column, column 2, using the text string 'NOT'.
  • GPAD: gene product-to-term relations captured in column 3 use a Relations Ontology (RO) identifier instead of a text string.
  • GPAD: the With/From column, column 7, may contain identifiers separated by commas as well as pipes.
  • GPAD and GPI: NCBI taxon ids are prefixed with 'NCBITaxon:' to indicate the source of the id, e.g. NCBITaxon:6239
  • GPAD: Annotation Extensions in column 11 will use a Relation_ID, rather than a Relation_Symbol, in the Relational_Expression, e.g. RO:0002233(UniProtKB:Q00362)
  • GPAD and GPI: dates follow the ISO-8601 format, e.g. YYYY-MM-DD; time may be included as YYYY-MM-DDTHH:MM:SS
  • GPI: the entity type in column 5 is captured using an ID from the Sequence Ontology, Protein Ontology, or Gene Ontology.
  • GPI: the parent object id in column 7 refers to the gene-centric parent, e.g. the UniProtKB Gene-Centric Reference Proteome accession or a Model Organism Database gene identifier
  • Characters allowed in all fields have been explicitly specified
  • Extensions in file names are: *.gpad and *.gpi
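
To make the first and fifth changes concrete, the sketch below joins a 1.1-style database name and local identifier into a single CURIE, and rewrites a taxon id so that it carries the 'NCBITaxon:' prefix. Both function names are hypothetical, and the assumption that an older taxon id is either a bare number or of the form 'taxon:<number>' is ours rather than something stated in this document.

```python
def to_curie(db: str, local_id: str) -> str:
    """Join a 1.1-style database name and local id into one CURIE."""
    return f"{db}:{local_id}"


def to_ncbitaxon(taxon_id: str) -> str:
    """Rewrite a taxon id so that it carries the 'NCBITaxon:' prefix.

    Assumption: the input is a bare number or '<prefix>:<number>'.
    """
    number = taxon_id.split(":", 1)[-1]
    if not number.isdigit():
        raise ValueError(f"unexpected taxon id: {taxon_id!r}")
    return f"NCBITaxon:{number}"


print(to_curie("UniProtKB", "P56704"))
print(to_ncbitaxon("taxon:6239"))
```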

BNF Notation

GPAD and GPI document structures are defined using a BNF notation similar to W3C specs, which is summarized below.

  • terminal symbols are single quoted
  • non-terminal symbols are unquoted
  • zero or more symbols are indicated by following the symbol with a star; e.g. Annotation*
  • one or more symbols are indicated by following the symbol with a plus; e.g. Digit+
  • zero or one symbols are indicated by following the symbol with a question mark; e.g. Extension_Conj?
  • alternative symbols are written using vertical bars
  • groupings are written using parentheses

GPI and GPAD documents consist of sequences of ASCII characters.

GPAD-GPI Full Grammar

Document structure

Production Grammar Comments
Doc GPAD_Doc | GPI_Doc
GPAD_Doc GPAD_Header Annotation*
GPI_Doc GPI_Header Entity*
GPAD_Header '!gpad-version: 2.0' \n '!generated-by: ' Prefix \n '!date-generated: ' Date_Or_Date_Time \n Header_Line* Groups may include optional additional header properties
GPI_Header '!gpi-version: 2.0' \n '!generated-by: ' Prefix \n '!date-generated: ' Date_Or_Date_Time \n Header_Line* Groups may include optional additional header properties
Annotation DB_Object_ID \t Negation \t Relation \t Ontology_Class_ID \t Reference \t Evidence_Type \t With_Or_From \t Interacting_Taxon_ID \t Annotation_Date \t Assigned_By \t Annotation_Extensions \t Annotation_Properties \n
Entity DB_Object_ID \t DB_Object_Symbol \t DB_Object_Name \t DB_Object_Synonyms \t DB_Object_Type \t DB_Object_Taxon \t Encoded_By \t Parent_Protein \t Protein_Containing_Complex_Members \t DB_Xrefs \t Gene_Product_Properties \n
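
The GPAD_Doc production can be read as "three required header lines, any further header lines, then tab-separated annotation rows of exactly 12 columns". A minimal reader along those lines is sketched below. The function name `parse_gpad` is hypothetical, the sketch checks only the document structure (not the per-column grammars), and the example row is assembled from the example values in the GPAD columns table rather than taken from a real annotation file.

```python
GPAD_COLUMNS = [
    "DB_Object_ID", "Negation", "Relation", "Ontology_Class_ID", "Reference",
    "Evidence_Type", "With_Or_From", "Interacting_Taxon_ID", "Annotation_Date",
    "Assigned_By", "Annotation_Extensions", "Annotation_Properties",
]
REQUIRED_HEADERS = ("!gpad-version: 2.0", "!generated-by: ", "!date-generated: ")


def parse_gpad(text):
    """Split a GPAD 2.0 document into (header_lines, annotations)."""
    lines = text.splitlines()
    if len(lines) < 3 or lines[0] != REQUIRED_HEADERS[0]:
        raise ValueError("document must start with '!gpad-version: 2.0'")
    for prefix, line in zip(REQUIRED_HEADERS[1:], lines[1:3]):
        if not line.startswith(prefix):
            raise ValueError(f"expected a line starting with {prefix!r}")
    headers, annotations = [], []
    for number, line in enumerate(lines, start=1):
        if line.startswith("!"):
            if annotations:
                raise ValueError(f"line {number}: header line after annotations")
            headers.append(line)
        else:
            fields = line.split("\t")
            if len(fields) != len(GPAD_COLUMNS):
                raise ValueError(
                    f"line {number}: expected 12 columns, got {len(fields)}"
                )
            annotations.append(dict(zip(GPAD_COLUMNS, fields)))
    return headers, annotations


example = (
    "!gpad-version: 2.0\n"
    "!generated-by: MGI\n"
    "!date-generated: 2019-01-30\n"
    "UniProtKB:P11678\t\tRO:0002263\tGO:0050803\tPMID:30695063\tECO:0000315\t"
    "WB:WBVar00000510\t\t2019-01-30\tMGI\t\t\n"
)
headers, annotations = parse_gpad(example)
print(len(headers), annotations[0]["Relation"])
```

Empty optional columns (Negation, Interacting_Taxon_ID, and so on) are kept as empty strings so that the column count stays fixed at 12.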

Header properties

In addition to the three required header properties specified in the grammars for GPAD and GPI, groups may decide to include optional additional information in header lines, either unstructured or using custom header properties. Examples include:

Header property Example value Comment
url http://www.yeastgenome.org/
project-release WS275
funding NHGRI
columns file format written out
go-version http://purl.obolibrary.org/obo/go/releases/2023-10-09/go.owl
ro-version http://purl.obolibrary.org/obo/ro/releases/2023-08-18/ro.owl
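
Header lines come in two shapes under the Header_Line production: a tag-value form ('!' Property ':' Header_Value) and an unstructured form starting with '!!'. The sketch below tells the two apart. The function name `parse_header_line` and the unstructured example text are made up for illustration.

```python
import re

# '!' Property ':' Space* Header_Value, with Property = (Alpha_Char | Digit | '-')+
TAG_VALUE_RE = re.compile(r"!([A-Za-z0-9-]+): *(.+)")


def parse_header_line(line: str):
    """Return (property, value); property is None for an unstructured header."""
    if line.startswith("!!"):
        return None, line[2:]
    match = TAG_VALUE_RE.fullmatch(line)
    if match is None:
        raise ValueError(f"not a valid header line: {line!r}")
    return match.group(1), match.group(2)


print(parse_header_line("!url: http://www.yeastgenome.org/"))
print(parse_header_line("!project-release: WS275"))
print(parse_header_line("!!free text supplied by the group"))
```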

GPAD columns

Column Production Grammar Example Comments
1 DB_Object_ID ID UniProtKB:P11678
2 Negation 'NOT'? NOT
3 Relation ID RO:0002263 The relation used SHOULD come from the allowed gene-product-to-term relations
4 Ontology_Class_ID ID GO:0050803 The identifier MUST be a term from the GO ontology
5 Reference ID ( '|' ID )* PMID:30695063 Different IDs, e.g. PMID and MOD paper ID, MUST correspond to the same publication or reference
6 Evidence_Type ID ECO:0000315 The evidence identifier MUST be a term from the ECO ontology. GO evidence-ECO mapping file
7 With_Or_From ( ID ( [|,] ID )* )? WB:WBVar00000510 Pipe-separated entries represent independent evidence; comma-separated entries represent grouped evidence, e.g. two of three genes in a triply mutant organism
8 Interacting_Taxon_ID ( ID ( '|' ID )* )? NCBITaxon:5476 The taxon MUST be a term from the NCBITaxon ontology
9 Annotation_Date Date_Or_Date_Time 2019-01-30
10 Assigned_By Prefix MGI
11 Annotation_Extensions ( Extension_Conj ( '|' Extension_Conj )* )? BFO:0000066(GO:0005829)
12 Annotation_Properties ( Property_Value_Pair ( '|' Property_Value_Pair )* )? contributor-id=orcid:0000-0002-1478-7671 Properties and values MUST conform to the list in GPAD annotation properties
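
The two separators in the With_Or_From column carry different meanings, so a reader should keep them apart: pipes split independent pieces of evidence, and commas split the members of one grouped piece of evidence. A minimal sketch follows; the function name is hypothetical, and the identifiers with the 'EX' prefix are placeholders rather than real database ids.

```python
def parse_with_or_from(value: str) -> list[list[str]]:
    """Return a list of evidence groups; each group is a list of IDs.

    Outer list: pipe-separated, independent evidence.
    Inner list: comma-separated, grouped evidence.
    """
    if not value:
        return []
    return [group.split(",") for group in value.split("|")]


print(parse_with_or_from("WB:WBVar00000510"))
print(parse_with_or_from("EX:gene1,EX:gene2|EX:gene3"))
```

Splitting on the separators is safe here because the ID grammar allows neither ',' nor '|' inside an identifier.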

GPI columns

Column Production Grammar Example Comments
1 DB_Object_ID ID UniProtKB:Q4VCS5
2 DB_Object_Symbol Text_No_Spaces AMOT
3 DB_Object_Name Text Angiomotin
4 DB_Object_Synonyms (Text ( '|' Text )* )? E230009N18Rik|KIAA1071
5 DB_Object_Type ID ( '|' ID )* PR:000000001 Identifier used MUST conform to the list in GPI entity types
6 DB_Object_Taxon ID NCBITaxon:9606 The taxon MUST be a term from the NCBITaxon ontology
7 Encoded_By ( ID ( '|' ID )* )? HGNC:17810 For proteins and transcripts, this refers to the gene id that encodes those entities.
8 Parent_Protein ( ID ( '|' ID )* )? When column 1 refers to a protein isoform or modified protein, this column refers to the gene-centric reference protein accession of the column 1 entry.
9 Protein_Containing_Complex_Members ( ID ( '|' ID )* )? UniProtKB:Q15021|UniProtKB:Q15003
10 DB_Xrefs ( ID ( '|' ID )* )? HGNC:17810 Identifiers used MUST include the required DB xref values
11 Gene_Product_Properties ( Property_Value_Pair ( '|' Property_Value_Pair )* )? db-subset=Swiss-Prot Properties and values MUST conform to the list in GPI gene product properties
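
A GPI row can be read the same way as a GPAD row, with the difference that several columns hold pipe-separated lists. The sketch below splits those columns into Python lists and leaves the single-valued columns as strings. The function name `parse_gpi_entity` is hypothetical, and the example row is assembled from the example values in the table above.

```python
GPI_COLUMNS = [
    "DB_Object_ID", "DB_Object_Symbol", "DB_Object_Name", "DB_Object_Synonyms",
    "DB_Object_Type", "DB_Object_Taxon", "Encoded_By", "Parent_Protein",
    "Protein_Containing_Complex_Members", "DB_Xrefs", "Gene_Product_Properties",
]
# Columns whose grammar is a pipe-separated list.
MULTI_VALUED = {
    "DB_Object_Synonyms", "DB_Object_Type", "Encoded_By", "Parent_Protein",
    "Protein_Containing_Complex_Members", "DB_Xrefs", "Gene_Product_Properties",
}


def parse_gpi_entity(line: str) -> dict:
    """Parse one GPI 2.0 entity row into a dict keyed by production name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GPI_COLUMNS):
        raise ValueError(f"expected 11 columns, got {len(fields)}")
    entity = {}
    for name, value in zip(GPI_COLUMNS, fields):
        if name in MULTI_VALUED:
            entity[name] = value.split("|") if value else []
        else:
            entity[name] = value
    return entity


row = (
    "UniProtKB:Q4VCS5\tAMOT\tAngiomotin\tE230009N18Rik|KIAA1071\tPR:000000001\t"
    "NCBITaxon:9606\tHGNC:17810\t\t\tHGNC:17810\tdb-subset=Swiss-Prot\n"
)
entity = parse_gpi_entity(row)
print(entity["DB_Object_Symbol"], entity["DB_Object_Synonyms"])
```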

Values

Production Grammar Comments
Header_Line ( Tag_Value_Header | Unstructured_Value_Header ) \n
Tag_Value_Header '!' Property ':' Space* Header_Value
Unstructured_Value_Header '!!' Header_Value
Header_Value Text
Extension_Conj Relational_Expression ( ',' Relational_Expression )*
Relational_Expression Relation_ID '(' Target_ID ')'
Relation_ID ID The identifier MUST be a term in the OBO relations ontology
Target_ID ID
Property_Value_Pair Property '=' Property_Value
Property (Alpha_Char | Digit | '-')+
Property_Value Text
ID Prefix ':' Local_ID
Prefix Alpha_Char ID_Char* The GO database registry contains a list of valid prefixes that can be used in GPAD or GPI files. Every identifier prefix used in a GPAD or GPI file MUST have an entry in the registry.
Local_ID ( ID_Char | ':' | '/' )+
ID_Char Alpha_Char | Digit | '_' | '-' | '.'
Date_Or_Date_Time Date | Date_Time
Date YYYY-MM-DD Corresponds to xsd:date without optional timezone (a subset of the ISO 8601 standard)
Date_Time YYYY-MM-DDTHH:MM:SS('.' s+)?((('+' | '-') hh ':' mm) | 'Z')? Corresponds to xsd:dateTime (a subset of the ISO 8601 standard)
Text Text_Char+
Text_No_Spaces Nonspace_Text_Char+
Text_Char Alpha_Char | Digit | Symbol_Char | Space
Nonspace_Text_Char Alpha_Char | Digit | Symbol_Char
Alpha_Char [A-Z] | [a-z]
Digit [0-9]
Symbol_Char '!' | '"' | '#' | '$' | '%' | '&' | ''' | '(' | ')' | '*' | '+' | ',' | '-' | '.' | '/' | ':' | ';' | '<' | '=' | '>' | '?' | '@' | '[' | '\' | ']' | '^' | '_' | '`' | '{' | '}' | '~' ASCII symbols minus |
Space ' '
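
The ID, Date_Or_Date_Time, and Extension_Conj productions translate fairly directly into regular expressions. The sketch below is one such translation, with hypothetical function names. It checks syntax only: it does not verify that a prefix is in the GO database registry, that a relation exists in RO, or that a date is a real calendar date.

```python
import re

# ID = Prefix ':' Local_ID
#   Prefix   = Alpha_Char ID_Char*
#   Local_ID = ( ID_Char | ':' | '/' )+
ID_PATTERN = r"[A-Za-z][A-Za-z0-9_.-]*:[A-Za-z0-9_.:/-]+"
ID_RE = re.compile(ID_PATTERN)
# Relational_Expression = Relation_ID '(' Target_ID ')'
REL_EXPR_RE = re.compile(rf"({ID_PATTERN})\(({ID_PATTERN})\)")
# Date, optionally followed by a time, fractional seconds, and a timezone.
DATE_OR_DATE_TIME_RE = re.compile(
    r"\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?)?"
)


def is_valid_id(value: str) -> bool:
    return ID_RE.fullmatch(value) is not None


def is_valid_date_or_date_time(value: str) -> bool:
    return DATE_OR_DATE_TIME_RE.fullmatch(value) is not None


def parse_extensions(value: str):
    """Parse column 11 into a list of conjunctions of (relation, target) pairs."""
    if not value:
        return []
    conjunctions = []
    for conj in value.split("|"):
        expressions = []
        for expr in conj.split(","):
            match = REL_EXPR_RE.fullmatch(expr)
            if match is None:
                raise ValueError(f"bad relational expression: {expr!r}")
            expressions.append((match.group(1), match.group(2)))
        conjunctions.append(expressions)
    return conjunctions


print(is_valid_id("UniProtKB:P11678"))
print(parse_extensions("BFO:0000066(GO:0005829)"))
```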

Allowed Gene Product to GO Term Relations

Default usage is indicated for MF and CC. Groups may choose which relation to use for BP annotations according to their curation practice. 'acts upstream of or within' is the parent Relations Ontology term for the BP relations listed below. A full view of the BP relation hierarchy can be found at http://www.ontobee.org/ or https://www.ebi.ac.uk/ols/index. Note: the RO term labels and IDs listed below are current as of 2020-06-09. However, to ensure accurate use of RO, groups should always derive mappings between RO term labels and IDs from the RO source file available here: https://github.com/oborel/obo-relations

| GO Aspect | Relations Ontology Label | Relations Ontology ID | Usage Guidelines |
| --- | --- | --- | --- |
| Molecular Function | enables | RO:0002327 | Default for MF |
| Molecular Function | contributes to | RO:0002326 | |
| Biological Process | involved in | RO:0002331 | |
| Biological Process | acts upstream of | RO:0002263 | |
| Biological Process | acts upstream of positive effect | RO:0004034 | |
| Biological Process | acts upstream of negative effect | RO:0004035 | |
| Biological Process | acts upstream of or within | RO:0002264 | Default for BP (GO:0008150) and child terms |
| Biological Process | acts upstream of or within positive effect | RO:0004032 | |
| Biological Process | acts upstream of or within negative effect | RO:0004033 | |
| Cellular Component | part of | BFO:0000050 | Default for protein-containing complex (GO:0032991) and child terms |
| Cellular Component | located in | RO:0001025 | Default for non-protein-containing complex CC terms |
| Cellular Component | is active in | RO:0002432 | Used to indicate where a gene product enables its MF |
| Cellular Component | colocalizes with | RO:0002325 | |
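
One way to use the table above in a validator is as a lookup from relation ID to the GO aspect it applies to. The sketch below hard-codes the listed IDs purely for illustration; as noted above, real mappings should be derived from the RO source file. The function name `relation_matches_aspect` is hypothetical, and determining the aspect of a GO term (which requires the ontology itself) is outside the scope of the sketch.

```python
# Relation ID -> (GO aspect, RO label), copied from the table above.
ALLOWED_RELATIONS = {
    "RO:0002327": ("Molecular Function", "enables"),
    "RO:0002326": ("Molecular Function", "contributes to"),
    "RO:0002331": ("Biological Process", "involved in"),
    "RO:0002263": ("Biological Process", "acts upstream of"),
    "RO:0004034": ("Biological Process", "acts upstream of positive effect"),
    "RO:0004035": ("Biological Process", "acts upstream of negative effect"),
    "RO:0002264": ("Biological Process", "acts upstream of or within"),
    "RO:0004032": ("Biological Process", "acts upstream of or within positive effect"),
    "RO:0004033": ("Biological Process", "acts upstream of or within negative effect"),
    "BFO:0000050": ("Cellular Component", "part of"),
    "RO:0001025": ("Cellular Component", "located in"),
    "RO:0002432": ("Cellular Component", "is active in"),
    "RO:0002325": ("Cellular Component", "colocalizes with"),
}


def relation_matches_aspect(relation_id: str, aspect: str) -> bool:
    """True if the relation is listed for the given GO aspect."""
    entry = ALLOWED_RELATIONS.get(relation_id)
    return entry is not None and entry[0] == aspect


print(relation_matches_aspect("RO:0002327", "Molecular Function"))
print(relation_matches_aspect("RO:0002327", "Biological Process"))
```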

GPAD Annotation Properties

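Each property takes a single value, as shown. A property that is allowed more than once per annotation is repeated as a separate property-value pair.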

| Property | Allowed usages per annotation | Value Grammar | Example | Comment |
| --- | --- | --- | --- | --- |
| 'id' | 0 or 1 | ID | id=WBOA:3219 | Unique identifier for an annotation in a contributing database. |
| 'model-state' | 0 or 1 | Alpha_Char+ | model-state=production | GO-CAM model state |
| 'noctua-model-id' | 0 or 1 | ID | noctua-model-id=gomodel:5a7e68a100001078 | Unique GO-CAM model id |
| 'contributor-id' | 0 or more | ID | contributor-id=orcid:0000-0002-1706-4196 | Identifier for curator or user who entered or changed an annotation. Prefix MUST be orcid or goc |
| 'reviewer-id' | 0 or more | ID | reviewer-id=orcid:0000-0001-7476-6306 | Identifier for curator or user who last reviewed an annotation. Prefix MUST be orcid or goc |
| 'creation-date' | 0 or 1 | Date_Or_Date_Time | creation-date=2019-02-05 | The date on which the annotation was created. |
| 'modification-date' | 0 or more | Date_Or_Date_Time | modification-date=2019-02-06 | The date(s) on which an annotation was modified. |
| 'reviewed-date' | 0 or more | Date_Or_Date_Time | reviewed-date=2019-02-06 | The date(s) on which the annotation was reviewed. |
| 'comment' | 0 or more | Text | comment=Confirmed species by checking PMID:nnnnnnnn. | Free-text field that allows curators or users to enter notes about a specific annotation. |
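
The "allowed usages" column can be enforced when reading column 12 of a GPAD row. The sketch below splits the column into property-value pairs, rejects unknown properties, and rejects repeats of the properties limited to "0 or 1". The function name is hypothetical, and the sketch does not check each value against its value grammar.

```python
from collections import Counter

SINGLE_USE = {"id", "model-state", "noctua-model-id", "creation-date"}
REPEATABLE = {
    "contributor-id", "reviewer-id", "modification-date", "reviewed-date", "comment",
}


def parse_annotation_properties(value: str):
    """Parse GPAD column 12 into a list of (property, value) pairs."""
    if not value:
        return []
    pairs = []
    for item in value.split("|"):
        # Split on the first '=' only: a Text value may itself contain '='.
        prop, sep, prop_value = item.partition("=")
        if not sep or not prop_value:
            raise ValueError(f"not a property-value pair: {item!r}")
        if prop not in SINGLE_USE | REPEATABLE:
            raise ValueError(f"unknown property: {prop!r}")
        pairs.append((prop, prop_value))
    counts = Counter(prop for prop, _ in pairs)
    for prop in SINGLE_USE:
        if counts[prop] > 1:
            raise ValueError(f"property {prop!r} may appear at most once")
    return pairs


print(parse_annotation_properties(
    "contributor-id=orcid:0000-0002-1706-4196|creation-date=2019-02-05"
))
```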

GPI Entity Types

The entity type value must be provided as an ontology term identifier from the Sequence Ontology, Protein Ontology, or Gene Ontology, and must correspond to one of the permitted GPI entity types or a more granular child term. Common entries include:

protein PR:000000001
protein-coding gene SO:0001217
gene SO:0000704
ncRNA SO:0000655
    any subtype of ncRNA in the Sequence Ontology, including ncRNA-coding gene SO:0001263
protein-containing complex GO:0032991

Required and Optional DB xrefs

Required:

  • MODs: Must associate gene ids, for protein-coding genes, with UniProtKB gene-centric reference protein accessions
  • UniProtKB: Must associate gene-centric reference protein accessions with MOD gene ids

Optional DB xref suggestions (where applicable):

  • RNAcentral
  • Ensembl gene
  • NCBI RefSeq gene
  • HGNC
  • ComplexPortal
  • PRO

GPI Gene Product Properties

Property Allowed usages per annotation Value Grammar Example Comment
'db-subset' 0 or 1 'TrEMBL' | 'Swiss-Prot' db-subset=TrEMBL The status of a UniProtKB accession with respect to curator review.
'uniprot-proteome' 0 or 1 ID uniprot-proteome=UP000001940 A unique UniProtKB identifier for the set of proteins that constitute an organism's proteome.
'go-annotation-complete' 0 or 1 Date_Or_Date_Time go-annotation-complete=2019-02-05 Indicates the date on which a curator determined that the set of GO annotations for a given entity is complete. Complete means that all information about a gene has been captured as a GO term, but not necessarily that all possible supporting evidence is annotated.
'go-annotation-summary' 0 or 1 Text go-annotation-summary=Sterol binding protein with a role in intracellular sterol transport; localizes to mitochondria and the cortical ER A textual gene or gene product description.