This document specifies the syntax of Gene Product Annotation Data (GPAD) and Gene Product Information (GPI) formats. GPAD describes the relationships between biological entities (such as gene products) and biological descriptors (such as GO terms). GPI describes the biological entities.
This specification has been approved as version 2.0.
- GPAD and GPI: columns 1 and 2 are now combined in a single column containing an id in CURIE syntax, e.g. UniProtKB:P56704.
- GPAD: negation is captured in a separate column, column 2, using the text string 'NOT'.
- GPAD: gene product-to-term relations captured in column 3 use a Relations Ontology (RO) identifier instead of a text string.
- GPAD: the With/From column, column 7, may contain identifiers separated by commas as well as pipes.
- GPAD and GPI: NCBI taxon ids are prefixed with 'NCBITaxon:' to indicate the source of the id, e.g. NCBITaxon:6239
- GPAD: Annotation Extensions in column 11 will use a Relation_ID, rather than a Relation_Symbol, in the Relational_Expression, e.g. RO:0002233(UniProtKB:Q00362)
- GPAD and GPI: dates follow the ISO-8601 format, e.g. YYYY-MM-DD; time may be included as YYYY-MM-DDTHH:MM:SS
- GPI: the entity type in column 5 is captured using an ID from the Sequence Ontology, Protein Ontology, or Gene Ontology.
- GPI: the parent object id in column 7 refers to the gene-centric parent, e.g. the UniProtKB Gene-Centric Reference Proteome accession or a Model Organism Database gene identifier
- Characters allowed in all fields have been explicitly specified
- Extensions in file names are: *.gpad and *.gpi
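
Several of the changes listed above are mechanical rewrites of individual field values. The Python sketch below illustrates three of them: joining two columns into a CURIE, re-prefixing taxon ids, and converting dates. It is not an official migration tool. The function names are hypothetical, and the legacy forms it accepts (a `taxon:` prefix and `YYYYMMDD` dates) are assumptions about the earlier files, not something this document states.

```python
from datetime import datetime


def to_curie(db: str, local_id: str) -> str:
    """Join a database prefix and a local id into a single CURIE."""
    return f"{db}:{local_id}"


def upgrade_taxon(value: str) -> str:
    """Rewrite a legacy 'taxon:' prefix (assumed) as 'NCBITaxon:'."""
    legacy_prefix = "taxon:"
    if value.startswith(legacy_prefix):
        return "NCBITaxon:" + value[len(legacy_prefix):]
    return value  # already in the 2.0 form, or not a taxon id


def upgrade_date(value: str) -> str:
    """Convert a legacy YYYYMMDD date (assumed) to ISO 8601 YYYY-MM-DD."""
    return datetime.strptime(value, "%Y%m%d").date().isoformat()


print(to_curie("UniProtKB", "P56704"))
print(upgrade_taxon("taxon:6239"))
print(upgrade_date("20190130"))
```
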
GPAD and GPI document structures are defined using a BNF notation similar to W3C specs, which is summarized below.
- terminal symbols are single quoted
- non-terminal symbols are unquoted
- zero or more symbols are indicated by following the symbol with a star; e.g. Annotation*
- one or more symbols are indicated by following the symbol with a plus; e.g. Digit+
- zero or one symbols are indicated by following the symbol with a question mark; e.g. Extension_Conj?
- alternative symbols are written using vertical bars
- groupings are written using parentheses
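
The notation summarized above maps closely onto regular-expression operators, which can be handy when prototyping a validator. The following Python sketch shows one possible mapping. The helper names (`star`, `plus`, `optional`, `alternatives`, `terminal`) are hypothetical and are not part of the specification.

```python
import re


def terminal(text: str) -> str:
    """A single-quoted terminal symbol: match the text literally."""
    return re.escape(text)


def star(pattern: str) -> str:
    """Symbol*: zero or more repetitions."""
    return f"(?:{pattern})*"


def plus(pattern: str) -> str:
    """Symbol+: one or more repetitions."""
    return f"(?:{pattern})+"


def optional(pattern: str) -> str:
    """Symbol?: zero or one occurrence."""
    return f"(?:{pattern})?"


def alternatives(*patterns: str) -> str:
    """A | B: any one of the alternatives, grouped with parentheses."""
    return "(?:" + "|".join(patterns) + ")"


# Two small productions expressed with the helpers.
DIGIT = "[0-9]"
ALPHA_CHAR = alternatives("[A-Z]", "[a-z]")
DIGITS = re.compile(plus(DIGIT))                   # Digit+
NEGATION = re.compile(optional(terminal("NOT")))   # 'NOT'?

print(bool(DIGITS.fullmatch("2019")))
print(bool(NEGATION.fullmatch("")))
```
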
GPI and GPAD documents consist of sequences of ASCII characters.

Production | Grammar | Comments |
---|---|---|
Doc | GPAD_Doc \| GPI_Doc | |
GPAD_Doc | GPAD_Header Annotation* | |
GPI_Doc | GPI_Header Entity* | |
GPAD_Header | '!gpad-version: 2.0' \n '!generated-by: ' Prefix \n '!date-generated: ' Date_Or_Date_Time \n Header_Line* | Groups may include optional additional header properties |
GPI_Header | '!gpi-version: 2.0' \n '!generated-by: ' Prefix \n '!date-generated: ' Date_Or_Date_Time \n Header_Line* | Groups may include optional additional header properties |
Annotation | DB_Object_ID \t Negation \t Relation \t Ontology_Class_ID \t Reference \t Evidence_Type \t With_Or_From \t Interacting_Taxon_ID \t Annotation_Date \t Assigned_By \t Annotation_Extensions \t Annotation_Properties \n | |
Entity | DB_Object_ID \t DB_Object_Symbol \t DB_Object_Name \t DB_Object_Synonyms \t DB_Object_Type \t DB_Object_Taxon \t Encoded_By \t Parent_Protein \t Protein_Containing_Complex_Members \t DB_Xrefs \t Gene_Product_Properties \n | |

In addition to the three required header properties specified in the grammars for GPAD and GPI, groups may decide to include optional additional information in header lines, either unstructured or using custom header properties. Examples include:

Header property | Example value | Comment |
---|---|---|
url | http://www.yeastgenome.org/ | |
project-release | WS275 | |
funding | NHGRI | |
columns | | file format written out |
go-version | http://purl.obolibrary.org/obo/go/releases/2023-10-09/go.owl | |
ro-version | http://purl.obolibrary.org/obo/ro/releases/2023-08-18/ro.owl | |

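
As an illustration of how the header lines could be read, the Python sketch below separates tag-value headers (`!property: value`) from unstructured headers (`!!text`), following the Tag_Value_Header and Unstructured_Value_Header productions. The function names are hypothetical. The sketch is also looser than the grammar in two ways: it accepts any characters in a header value, and it only checks that the three required properties are present, not that they come first or in order.

```python
import re

# '!' Property ':' Space* Header_Value, with Property = (Alpha_Char | Digit | '-')+
TAG_VALUE_HEADER = re.compile(r"!([A-Za-z0-9-]+): *(.+)")


def parse_header(lines):
    """Return (properties, notes) collected from GPAD or GPI header lines."""
    properties = {}   # property name -> list of values, in file order
    notes = []        # text of unstructured '!!' lines
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("!!"):
            notes.append(line[2:])
            continue
        match = TAG_VALUE_HEADER.fullmatch(line)
        if match is None:
            raise ValueError(f"not a header line: {line!r}")
        name, value = match.groups()
        properties.setdefault(name, []).append(value)
    return properties, notes


def missing_required(properties, version_property="gpad-version"):
    """List required header properties that are absent."""
    required = (version_property, "generated-by", "date-generated")
    return [name for name in required if name not in properties]


sample = [
    "!gpad-version: 2.0\n",
    "!generated-by: WB\n",
    "!date-generated: 2023-10-09\n",
    "!project-release: WS275\n",
    "!!an unstructured note\n",
]
props, notes = parse_header(sample)
print(props, notes, missing_required(props))
```

For a GPI file, `missing_required(props, "gpi-version")` would perform the corresponding check.
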

Column | Production | Grammar | Example | Comments |
---|---|---|---|---|
1 | DB_Object_ID | ID | UniProtKB:P11678 | |
2 | Negation | 'NOT'? | NOT | |
3 | Relation | ID | RO:0002263 | The relation used SHOULD come from the allowed gene-product-to-term relations |
4 | Ontology_Class_ID | ID | GO:0050803 | The identifier MUST be a term from the GO ontology |
5 | Reference | ID ( '\|' ID )* | PMID:30695063 | Different IDs, e.g. PMID and MOD paper ID, MUST correspond to the same publication or reference |
6 | Evidence_Type | ID | ECO:0000315 | The evidence identifier MUST be a term from the ECO ontology. See the GO evidence-ECO mapping file |
7 | With_Or_From | ( ID ( [\|,] ID )* )? | WB:WBVar00000510 | Pipe-separated entries represent independent evidence; comma-separated entries represent grouped evidence, e.g. two of three genes in a triply mutant organism |
8 | Interacting_Taxon_ID | ( ID ( '\|' ID )* )? | NCBITaxon:5476 | The taxon MUST be a term from the NCBITaxon ontology |
9 | Annotation_Date | Date_Or_Date_Time | 2019-01-30 | |
10 | Assigned_By | Prefix | MGI | |
11 | Annotation_Extensions | ( Extension_Conj ( '\|' Extension_Conj )* )? | BFO:0000066(GO:0005829) | |
12 | Annotation_Properties | ( Property_Value_Pair ( '\|' Property_Value_Pair )* )? | contributor-id=orcid:0000-0002-1478-7671 | Properties and values MUST conform to the list in GPAD annotation properties |

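
To show how the twelve GPAD columns fit together, here is a minimal Python sketch that splits one annotation line into named fields. The function and key names are hypothetical, and the sketch checks only the column count and the Negation value; it does not validate identifiers against any ontology. The sample row is assembled from the Example column and is not a real annotation, and the second and third With_Or_From identifiers are invented to demonstrate the pipe and comma separators.

```python
def split_pipe(value):
    """Split a pipe-separated column; an empty column yields an empty list."""
    return value.split("|") if value else []


def parse_annotation(line):
    """Split one GPAD 2.0 annotation line into a dictionary of fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, found {len(fields)}")
    (obj, negation, relation, term, reference, evidence, with_from,
     taxon, date, assigned_by, extensions, properties) = fields
    if negation not in ("", "NOT"):
        raise ValueError(f"column 2 must be empty or 'NOT': {negation!r}")
    return {
        "db_object_id": obj,
        "negated": negation == "NOT",
        "relation": relation,
        "ontology_class_id": term,
        "references": split_pipe(reference),
        "evidence_type": evidence,
        # Outer list: independent evidence (pipes); inner list: grouped (commas).
        "with_or_from": [group.split(",") for group in split_pipe(with_from)],
        "interacting_taxon_ids": split_pipe(taxon),
        "annotation_date": date,
        "assigned_by": assigned_by,
        # Outer list: alternatives (pipes); inner list: conjunction (commas).
        "annotation_extensions": [c.split(",") for c in split_pipe(extensions)],
        "annotation_properties": [
            (name, value)
            for name, _, value in (p.partition("=") for p in split_pipe(properties))
        ],
    }


sample = "\t".join([
    "UniProtKB:P11678", "", "RO:0002263", "GO:0050803", "PMID:30695063",
    "ECO:0000315", "WB:WBVar00000510,WB:WBVar00000511|WB:WBVar00000512",
    "", "2019-01-30", "MGI", "BFO:0000066(GO:0005829)",
    "contributor-id=orcid:0000-0002-1478-7671",
]) + "\n"
print(parse_annotation(sample))
```
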

Column | Production | Grammar | Example | Comments |
---|---|---|---|---|
1 | DB_Object_ID | ID | UniProtKB:Q4VCS5 | |
2 | DB_Object_Symbol | Text_No_Spaces | AMOT | |
3 | DB_Object_Name | Text | Angiomotin | |
4 | DB_Object_Synonyms | (Text ( '\|' Text )* )? | E230009N18Rik\|KIAA1071 | |
5 | DB_Object_Type | ID ( '\|' ID )* | PR:000000001 | Identifier used MUST conform to the list in GPI entity types |
6 | DB_Object_Taxon | ID | NCBITaxon:9606 | The taxon MUST be a term from the NCBITaxon ontology |
7 | Encoded_By | ( ID ( '\|' ID )* )? | HGNC:17810 | For proteins and transcripts, this refers to the gene id that encodes those entities. |
8 | Parent_Protein | ( ID ( '\|' ID )* )? | | When column 1 refers to a protein isoform or modified protein, this column refers to the gene-centric reference protein accession of the column 1 entry. |
9 | Protein_Containing_Complex_Members | ( ID ( '\|' ID )* )? | UniProtKB:Q15021\|UniProtKB:Q15003 | |
10 | DB_Xrefs | ( ID ( '\|' ID )* )? | HGNC:17810 | Identifiers used MUST include the required DB xref values |
11 | Gene_Product_Properties | ( Property_Value_Pair ( '\|' Property_Value_Pair )* )? | db-subset=Swiss-Prot | Properties and values MUST conform to the list in GPI gene product properties |

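
A GPI entity line can be handled in the same spirit. The Python sketch below is one possible approach with hypothetical field names. It treats columns 1, 2, 3, 5 and 6 as mandatory because their grammars are not wrapped in an optional group, and it rejects spaces in the symbol as Text_No_Spaces requires. The sample row mixes values from the Example column and does not describe a single real entity.

```python
GPI_FIELDS = (
    "db_object_id", "db_object_symbol", "db_object_name", "db_object_synonyms",
    "db_object_type", "db_object_taxon", "encoded_by", "parent_protein",
    "protein_containing_complex_members", "db_xrefs", "gene_product_properties",
)
# Columns whose grammar is not optional.
REQUIRED = {"db_object_id", "db_object_symbol", "db_object_name",
            "db_object_type", "db_object_taxon"}
# Columns that may hold several pipe-separated values.
MULTI_VALUED = {"db_object_synonyms", "db_object_type", "encoded_by",
                "parent_protein", "protein_containing_complex_members",
                "db_xrefs", "gene_product_properties"}


def parse_entity(line):
    """Split one GPI 2.0 entity line into a dictionary of fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GPI_FIELDS):
        raise ValueError(f"expected {len(GPI_FIELDS)} columns, found {len(values)}")
    entity = {}
    for name, value in zip(GPI_FIELDS, values):
        if name in REQUIRED and not value:
            raise ValueError(f"column {name} must not be empty")
        if name in MULTI_VALUED:
            entity[name] = value.split("|") if value else []
        else:
            entity[name] = value
    if " " in entity["db_object_symbol"]:
        raise ValueError("DB_Object_Symbol must not contain spaces")
    return entity


sample = "\t".join([
    "UniProtKB:Q4VCS5", "AMOT", "Angiomotin", "E230009N18Rik|KIAA1071",
    "PR:000000001", "NCBITaxon:9606", "HGNC:17810", "", "",
    "HGNC:17810", "db-subset=Swiss-Prot",
]) + "\n"
print(parse_entity(sample))
```
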

Production | Grammar | Comments |
---|---|---|
Header_Line | ( Tag_Value_Header \| Unstructured_Value_Header ) \n | |
Tag_Value_Header | '!' Property ':' Space* Header_Value | |
Unstructured_Value_Header | '!!' Header_Value | |
Header_Value | Text | |
Extension_Conj | Relational_Expression ( ',' Relational_Expression )* | |
Relational_Expression | Relation_ID '(' Target_ID ')' | |
Relation_ID | ID | The identifier MUST be a term in the OBO relations ontology |
Target_ID | ID | |
Property_Value_Pair | Property '=' Property_Value | |
Property | (Alpha_Char \| Digit \| '-')+ | |
Property_Value | Text | |
ID | Prefix ':' Local_ID | |
Prefix | Alpha_Char ID_Char* | The GO database registry contains a list of valid prefixes that can be used in GPAD or GPI files. Every identifier prefix used in a GPAD or GPI file MUST have an entry in the registry. |
Local_ID | ( ID_Char \| ':' \| '/' )+ | |
ID_Char | Alpha_Char \| Digit \| '_' \| '-' \| '.' | |
Date_Or_Date_Time | Date \| Date_Time | |
Date | YYYY-MM-DD | Corresponds to xsd:date without optional timezone (a subset of the ISO 8601 standard) |
Date_Time | YYYY-MM-DDTHH:MM:SS('.' s+)?((('+' \| '-') hh ':' mm) \| 'Z')? | Corresponds to xsd:dateTime (a subset of the ISO 8601 standard) |
Text | Text_Char+ | |
Text_No_Spaces | Nonspace_Text_Char+ | |
Text_Char | Alpha_Char \| Digit \| Symbol_Char \| Space | |
Nonspace_Text_Char | Alpha_Char \| Digit \| Symbol_Char | |
Alpha_Char | [A-Z] \| [a-z] | |
Digit | [0-9] | |
Symbol_Char | '!' \| '"' \| '#' \| '$' \| '%' \| '&' \| ''' \| '(' \| ')' \| '*' \| '+' \| ',' \| '-' \| '.' \| '/' \| ':' \| ';' \| '<' \| '=' \| '>' \| '?' \| '@' \| '[' \| '\' \| ']' \| '^' \| '_' \| '`' \| '{' \| '}' \| '~' | ASCII symbols minus \| |
Space | ' ' | |

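
The lexical productions above translate fairly directly into regular expressions. The Python sketch below is one such translation, offered as an illustration and not as a reference validator. The constant and function names are hypothetical. The patterns check syntax only: they do not confirm that a prefix is registered, that a term exists, or that a date is a valid calendar date.

```python
import re

ID_CHAR = r"[A-Za-z0-9_.\-]"                 # Alpha_Char | Digit | '_' | '-' | '.'
PREFIX = rf"[A-Za-z]{ID_CHAR}*"              # Alpha_Char ID_Char*
LOCAL_ID = r"[A-Za-z0-9_.\-:/]+"             # ( ID_Char | ':' | '/' )+
ID = rf"{PREFIX}:{LOCAL_ID}"                 # Prefix ':' Local_ID

DATE = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
DATE_TIME = (DATE + r"T[0-9]{2}:[0-9]{2}:[0-9]{2}"
             r"(?:\.[0-9]+)?(?:[+-][0-9]{2}:[0-9]{2}|Z)?")

ID_RE = re.compile(ID)
DATE_OR_DATE_TIME_RE = re.compile(rf"(?:{DATE_TIME}|{DATE})")
# Text: printable ASCII (space through '~') except the pipe character.
TEXT_RE = re.compile(r"[\x20-\x7B\x7D\x7E]+")
RELATIONAL_EXPRESSION_RE = re.compile(rf"({ID})\(({ID})\)")


def parse_relational_expression(text):
    """Return (relation_id, target_id) for a Relation_ID '(' Target_ID ')' string."""
    match = RELATIONAL_EXPRESSION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a relational expression: {text!r}")
    return match.group(1), match.group(2)


print(bool(ID_RE.fullmatch("UniProtKB:P56704")))
print(bool(DATE_OR_DATE_TIME_RE.fullmatch("2019-01-30T12:34:56Z")))
print(parse_relational_expression("RO:0002233(UniProtKB:Q00362)"))
```

All patterns are meant to be used with `fullmatch`, so that trailing characters are rejected instead of being silently ignored.
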
Default usage is indicated for MF and CC. Groups may choose which relation to use for BP annotations according to their curation practice. 'acts upstream of or within' is the parent Relations Ontology term for the BP relations listed below. A full view of the BP relation hierarchy can be found at http://www.ontobee.org/ or https://www.ebi.ac.uk/ols/index. Note: the RO term labels and IDs listed below are current as of 2020-06-09. However, to ensure accurate use of RO, groups should always derive mappings between RO term labels and IDs from the RO source file available here: https://github.com/oborel/obo-relations

GO Aspect | Relations Ontology Label | Relations Ontology ID | Usage Guidelines |
---|---|---|---|
Molecular Function | enables | RO:0002327 | Default for MF |
Molecular Function | contributes to | RO:0002326 | |
Biological Process | involved in | RO:0002331 | |
Biological Process | acts upstream of | RO:0002263 | |
Biological Process | acts upstream of positive effect | RO:0004034 | |
Biological Process | acts upstream of negative effect | RO:0004035 | |
Biological Process | acts upstream of or within | RO:0002264 | Default for BP (GO:0008150) and child terms |
Biological Process | acts upstream of or within positive effect | RO:0004032 | |
Biological Process | acts upstream of or within negative effect | RO:0004033 | |
Cellular Component | part of | BFO:0000050 | Default for protein-containing complex (GO:0032991) and child terms |
Cellular Component | located in | RO:0001025 | Default for non-protein-containing complex CC terms |
Cellular Component | is active in | RO:0002432 | Used to indicate where a gene product enables its MF |
Cellular Component | colocalizes with | RO:0002325 | |

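
For tooling that needs to pick or check a relation, the table above can be encoded as a lookup structure. The Python sketch below copies the labels and IDs from the table purely for illustration; as noted above, production code should derive them from the RO source file. The function names and the `is_protein_containing_complex` flag are hypothetical, and the caller is assumed to decide, using the GO ontology, whether a term is GO:0032991 or one of its child terms.

```python
RELATIONS = {
    "MF": {
        "enables": "RO:0002327",
        "contributes to": "RO:0002326",
    },
    "BP": {
        "involved in": "RO:0002331",
        "acts upstream of": "RO:0002263",
        "acts upstream of positive effect": "RO:0004034",
        "acts upstream of negative effect": "RO:0004035",
        "acts upstream of or within": "RO:0002264",
        "acts upstream of or within positive effect": "RO:0004032",
        "acts upstream of or within negative effect": "RO:0004033",
    },
    "CC": {
        "part of": "BFO:0000050",
        "located in": "RO:0001025",
        "is active in": "RO:0002432",
        "colocalizes with": "RO:0002325",
    },
}


def default_relation(aspect, is_protein_containing_complex=False):
    """Return the default relation ID for a GO aspect ('MF', 'BP' or 'CC')."""
    if aspect == "MF":
        return RELATIONS["MF"]["enables"]
    if aspect == "BP":
        return RELATIONS["BP"]["acts upstream of or within"]
    if aspect == "CC":
        label = "part of" if is_protein_containing_complex else "located in"
        return RELATIONS["CC"][label]
    raise ValueError(f"unknown aspect: {aspect!r}")


def is_allowed(aspect, relation_id):
    """Check whether a relation ID is listed for the given aspect."""
    return relation_id in RELATIONS[aspect].values()


print(default_relation("MF"))
print(default_relation("CC", is_protein_containing_complex=True))
print(is_allowed("BP", "RO:0002327"))
```
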
All properties are single valued as shown.

Property | Allowed usages per annotation | Value Grammar | Example | Comment |
---|---|---|---|---|
'id' | 0 or 1 | ID | id=WBOA:3219 | Unique identifier for an annotation in a contributing database. |
'model-state' | 0 or 1 | Alpha_Char+ | model-state=production | GO-CAM model state |
'noctua-model-id' | 0 or 1 | ID | noctua-model-id=gomodel:5a7e68a100001078 | Unique GO-CAM model id |
'contributor-id' | 0 or more | ID | contributor-id=orcid:0000-0002-1706-4196 | Identifier for curator or user who entered or changed an annotation. Prefix MUST be orcid or goc |
'reviewer-id' | 0 or more | ID | reviewer-id=orcid:0000-0001-7476-6306 | Identifier for curator or user who last reviewed an annotation. Prefix MUST be orcid or goc |
'creation-date' | 0 or 1 | Date_Or_Date_Time | creation-date=2019-02-05 | The date on which the annotation was created. |
'modification-date' | 0 or more | Date_Or_Date_Time | modification-date=2019-02-06 | The date(s) on which an annotation was modified. |
'reviewed-date' | 0 or more | Date_Or_Date_Time | reviewed-date=2019-02-06 | The date(s) on which the annotation was reviewed. |
'comment' | 0 or more | Text | comment=Confirmed species by checking PMID:nnnnnnnn. | Free-text field that allows curators or users to enter notes about a specific annotation. |

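
The "Allowed usages per annotation" column can be enforced with a simple count per property. The Python sketch below parses the Annotation_Properties column and rejects unknown properties and repeated use of a "0 or 1" property. The names are hypothetical, and the sketch does not check each value against its Value Grammar.

```python
from collections import Counter

# Properties that may appear at most once per annotation ("0 or 1").
SINGLE_USE = {"id", "model-state", "noctua-model-id", "creation-date"}
# Properties that may appear any number of times ("0 or more").
REPEATABLE = {"contributor-id", "reviewer-id", "modification-date",
              "reviewed-date", "comment"}


def parse_annotation_properties(column):
    """Return a list of (property, value) pairs from GPAD column 12."""
    pairs = []
    for item in (column.split("|") if column else []):
        # Split on the first '=' only; the value itself may contain '='.
        name, separator, value = item.partition("=")
        if not separator or not value:
            raise ValueError(f"not a property=value pair: {item!r}")
        if name not in SINGLE_USE and name not in REPEATABLE:
            raise ValueError(f"unknown annotation property: {name!r}")
        pairs.append((name, value))
    counts = Counter(name for name, _ in pairs)
    for name in SINGLE_USE:
        if counts[name] > 1:
            raise ValueError(f"property {name!r} may be used at most once")
    return pairs


column = ("id=WBOA:3219|contributor-id=orcid:0000-0002-1706-4196"
          "|modification-date=2019-02-06|modification-date=2019-02-07")
print(parse_annotation_properties(column))
```
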
Entity type value must be provided as an ontology term identifier from Sequence Ontology, Protein Ontology, or GO, and must correspond to one of the permitted GPI entity types or a more granular child term. Common entries include:
protein PR:000000001
protein-coding gene SO:0001217
gene SO:0000704
ncRNA SO:0000655
any subtype of ncRNA in the Sequence Ontology, including ncRNA-coding gene SO:0001263
protein-containing complex GO:0032991
- MODs: Must associate gene ids, for protein-coding genes, with UniProtKB gene-centric reference protein accessions
- UniProtKB: Must associate gene-centric reference protein accessions with MOD gene ids
- RNAcentral
- Ensembl gene
- NCBI RefSeq gene
- HGNC
- ComplexPortal
- PRO

Property | Allowed usages per annotation | Value Grammar | Example | Comment |
---|---|---|---|---|
'db-subset' | 0 or 1 | 'TrEMBL' \| 'Swiss-Prot' | db-subset=TrEMBL | The status of a UniProtKB accession with respect to curator review. |
'uniprot-proteome' | 0 or 1 | ID | uniprot-proteome=UP000001940 | A unique UniProtKB identifier for the set of proteins that constitute an organism's proteome. |
'go-annotation-complete' | 0 or 1 | Date_Or_Date_Time | go-annotation-complete=2019-02-05 | Indicates the date on which a curator determined that the set of GO annotations for a given entity is complete with respect to GO annotation. Complete means that all information about a gene has been captured as a GO term, but not necessarily that all possible supporting evidence is annotated. |
'go-annotation-summary' | 0 or 1 | Text | go-annotation-summary=Sterol binding protein with a role in intracellular sterol transport; localizes to mitochondria and the cortical ER | A textual gene or gene product description. |