diff --git a/DESCRIPTION b/DESCRIPTION index 2932547..d5229e2 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,6 +1,6 @@ -Date: 2024-03-03 +Date: 2024-03-04 Package: canprot -Version: 1.1.2-30 +Version: 1.1.2-31 Title: Chemical Analysis of Proteins Authors@R: c( person("Jeffrey", "Dick", email = "j3ffdick@gmail.com", role = c("aut", "cre"), diff --git a/R/read.fasta.R b/R/read.fasta.R index 4fdacbc..5c8b1e4 100644 --- a/R/read.fasta.R +++ b/R/read.fasta.R @@ -147,13 +147,23 @@ count.aa <- function(sequence, start = NULL, stop = NULL, type = "protein") { aasum <- function(AAcomp, abundance = 1, average = FALSE) { # Recycle abundance values into same length as number of proteins abundance <- rep(abundance, length.out = nrow(AAcomp)) - # Find columns with names for the amino acids and "chains" + # Find columns with names for the amino acids and 'chains' + # (number of polypeptide chains) AA_names <- c( "Ala", "Cys", "Asp", "Glu", "Phe", "Gly", "His", "Ile", "Lys", "Leu", "Met", "Asn", "Pro", "Gln", "Arg", "Ser", "Thr", "Val", "Trp", "Tyr", "chains" ) isAA <- tolower(colnames(AAcomp)) %in% tolower(AA_names) + # Remove proteins with NA amino acid composition or abundance + ina.aa <- is.na(rowSums(AAcomp[, isAA])) + ina.ab <- is.na(abundance) + ina <- ina.aa | ina.ab + if (any(ina)) { + AAcomp <- AAcomp[!ina, ] + abundance <- abundance[!ina] + message("aasum: dropped ", sum(ina), " proteins with NA amino acid composition and/or abundance") + } # Multiply amino acid counts by protein abundance AAcomp[, isAA] <- AAcomp[, isAA] * abundance # Sum amino acid counts diff --git a/README.md b/README.md index b553a55..083b5df 100644 --- a/README.md +++ b/README.md @@ -46,4 +46,4 @@ demo("canprot") Zc and pI for human proteins in subcellular locations These plots show carbon oxidation state (*Z*C) and isoelectric point (pI) for human proteins in different subcellular locations. -The localization data is from Table S6 of [Thul et al. (2017)](10.1126/science.aal3321) (*A subcellular map of the human proteome*), filtered to include proteins that have both a validated location and only one predicted location. +The localization data is from Table S6 of [Thul et al. (2017)](https://doi.org/10.1126/science.aal3321) (*A subcellular map of the human proteome*), filtered to include proteins that have both a validated location and only one annotated location. diff --git a/inst/NEWS b/inst/NEWS index 9a733b9..1697cd0 100644 --- a/inst/NEWS +++ b/inst/NEWS @@ -1,4 +1,4 @@ -CHANGES IN canprot 1.1.2-29 (2024-03-03) +CHANGES IN canprot 1.1.2-30 (2024-03-03) ---------------------------------------- - Objects in cplab are now `quote()`-ed instead of `expression()`s for easier diff --git a/inst/doc/demo.html b/inst/doc/demo.html new file mode 100644 index 0000000..4a64146 --- /dev/null +++ b/inst/doc/demo.html @@ -0,0 +1,419 @@ + + + + + + + + + + + + + + +Demo for canprot + + + + + + + + + + + + + + + + + + + + + + + + + + +

Demo for canprot

+ + + +

The canprot package calculates chemical metrics of proteins from amino acid compositions. This vignette was compiled on 2024-03-04 with canprot version 1.1.2-31.

+

Next vignette: Introduction to canprot

+
+

canprot demo

+

Run the demo using demo("canprot"). The code and output of the demo are shown below.

+

The canprot functions used are:

+ +
# Read SI table
+file <- system.file("extdata/protein/TAW+17_Table_S6_Validated.csv", package = "canprot")
+dat <- read.csv(file)
+# Keep only proteins with validated location
+dat <- dat[dat$Reliability == "Validated", ]
+# Keep only proteins with one annotated location
+dat <- dat[rowSums(dat[, 4:32]) == 1, ]
+
+# Get the amino acid compositions
+aa <- human.aa(dat$Uniprot)
+# Put the location into the amino acid data frame
+aa$location <- dat$IF.main.protein.location
+
+# Use top locations (and their colors) from Fig. 2B of Thul et al., 2017
+locations <- c("Cytosol","Mitochondria","Nucleoplasm","Nucleus","Vesicles","Plasma membrane")
+col <- c("#194964", "#2e6786", "#8a2729", "#b2333d", "#e0ce1d", "#e4d71c")
+# Keep the proteins in these locations
+aa <- aa[aa$location %in% locations, ]
+## Keep only proteins with length between 100 and 2000
+#aa <- aa[plength(aa) >= 100 & plength(aa) <= 2000, ]
+
+# Get amino acid composition for proteins in each location
+# (Loop over groups by piping location names into lapply)
+aalist <- lapply(locations, function(location) aa[aa$location == location, ] )
+
+# Setup plot
+par(mfrow = c(1, 2))
+titles <- c(Zc = "Carbon oxidation state", pI = "Isoelectric point")
+# Calculate Zc and pI
+for(metric in c("Zc", "pI")) {
+  datlist <- lapply(aalist, metric)
+  bp <- boxplot(datlist, ylab = cplab[[metric]], col = col, show.names = FALSE)
+  add.cld(datlist, bp)
+  # Make rotated labels
+  x <- (1:6) + 0.1
+  y <- par()$usr[3] - 1.5 * strheight("A")
+  text(x, y, locations, srt = 25, adj = 1, xpd = NA)
+  axis(1, labels = FALSE)
+  title(titles[metric], font.main = 1)
+}
+

+

The plots show carbon oxidation state (ZC) and isoelectric point (pI) for human proteins in different subcellular locations. The localization data is from Table S6 of Thul et al. (2017), filtered to include proteins that have both a validated location and only one predicted location.

+
+
+

References

+
+
+

Thul PJ, Åkesson L, Wiking M, Mahdessian D, Geladaki A, Blal HA, Alm T, Asplund A, Björk L, Breckels LM, et al. 2017. A subcellular map of the human proteome. Science 356(6340): eaal3321. doi: 10.1126/science.aal3321

+
+
+
+ + + + + + + + + diff --git a/inst/doc/introduction.html b/inst/doc/introduction.html index e0486a0..86db344 100644 --- a/inst/doc/introduction.html +++ b/inst/doc/introduction.html @@ -344,32 +344,32 @@

Introduction to canprot

-

The canprot package has functions to calculate chemical metrics of proteins from their amino acid composition. This vignette was compiled on 2024-02-29 with canprot version 1.1.2-24.

-

Next vignette: More about metrics

+

The canprot package calculates chemical metrics of proteins from amino acid compositions. This vignette was compiled on 2024-03-04 with canprot version 1.1.2-31.

+

Previous vignette: Demo for canprot | Next vignette: More about metrics

Reading FASTA files

-

KHAB17.fast was obtained from Supplemental Information of Kacar et al. (2017) and is provided in the extdata directory of canprot. Use read.fasta() to read the file and return a data frame of amino acid composition.

-
my_fasta_file <- system.file("extdata/fasta/KHAB17.fasta", package = "canprot")
-aa <- read.fasta(my_fasta_file)
+

KHAB17.fasta was obtained from Supplemental Information of Kacar et al. (2017) and is provided in the extdata/fasta directory of canprot. Use read.fasta() to read the file and return a data frame of amino acid composition.

+
fasta_file <- system.file("extdata/fasta/KHAB17.fasta", package = "canprot")
+aa <- read.fasta(fasta_file)
## read.fasta: reading KHAB17.fasta ... 62 lines ... 6 sequences

The result is small enough that we can look at it here. The data frame has four columns for identifying information (protein, organism, ref, and abbrv); the first two are filled by read.fasta(). The chains column is the number of polypeptide chains.

aa
-
##        protein organism ref abbrv chains Ala Cys Asp Glu Phe Gly His
-## 1 Anc_I/II/III   KHAB17  NA    NA      1  42   4  26  46  14  42  15
-## 2    Anc_I/III   KHAB17  NA    NA      1  42   2  26  45  14  41  16
-## 3   Anc_I/III'   KHAB17  NA    NA      1  42   1  25  44  15  42  14
-## 4        Anc_I   KHAB17  NA    NA      1  52   1  28  32  18  40   9
-## 5     Anc_IA/B   KHAB17  NA    NA      1  48   7  28  33  23  43  14
-## 6       Anc_IB   KHAB17  NA    NA      1  46   7  27  31  23  44  15
-##   Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr
-## 1  30  28  42  10   8  21   4  22  13  19  29   5  13
-## 2  30  29  39  13   9  22   6  22  15  17  29   5  13
-## 3  32  30  41  11   8  23   7  22  18  17  29   4  14
-## 4  18  28  42  12  13  24  13  31  16  29  37  10  17
-## 5  20  24  43  12  14  21  11  30  17  32  30  10  15
-## 6  20  24  44  12  15  22  13  29  16  31  28   9  16
+
##        protein organism ref abbrv chains Ala Cys Asp Glu Phe Gly His Ile Lys
+## 1 Anc_I/II/III   KHAB17  NA    NA      1  42   4  26  46  14  42  15  30  28
+## 2    Anc_I/III   KHAB17  NA    NA      1  42   2  26  45  14  41  16  30  29
+## 3   Anc_I/III'   KHAB17  NA    NA      1  42   1  25  44  15  42  14  32  30
+## 4        Anc_I   KHAB17  NA    NA      1  52   1  28  32  18  40   9  18  28
+## 5     Anc_IA/B   KHAB17  NA    NA      1  48   7  28  33  23  43  14  20  24
+## 6       Anc_IB   KHAB17  NA    NA      1  46   7  27  31  23  44  15  20  24
+##   Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr
+## 1  42  10   8  21   4  22  13  19  29   5  13
+## 2  39  13   9  22   6  22  15  17  29   5  13
+## 3  41  11   8  23   7  22  18  17  29   4  14
+## 4  42  12  13  24  13  31  16  29  37  10  17
+## 5  43  12  14  21  11  30  17  32  30  10  15
+## 6  44  12  15  22  13  29  16  31  28   9  16

None of the first five columns is necessary for the calculation of chemical metrics. They are provided for compatibility with CHNOSZ, and if you don’t plan to use CHNOSZ, you can put any information here that you want, including NA values, or remove the columns completely. The columns that do matter for the calculation of chemical metrics are the last 20 columns, which are named with the 3-letter abbreviations of the amino acids.

-

These particular sequences are ancestral sequences of Rubisco. A geochemical biology hypothesis is that the proteins should get more oxidized around the GOE. Is this what happens? Plot the carbon oxidation state (ZC) to find out.

+

These particular sequences are ancestral sequences of Rubisco. A geochemical biology hypothesis is that proteins are oxidized when the environment is oxidizing. Is this what happens? Plot the carbon oxidation state (ZC) to find out. Note the use of pre-formatted plot labels for chemical metrics available in cplab.

xlab <- "Ancestral sequences (older to younger)"
 plot(Zc(aa), type = "b", xaxt = "n", xlab = xlab, ylab = cplab$Zc)
 names <- gsub(".*_", "", aa$protein)
@@ -377,21 +377,17 @@ 

Reading FASTA files

abline(v = 3.5, lty = 2, col = 2) axis(3, at = 3.5, "GOE (proposed)")

-

Note the use of pre-formatted plot labels for chemical metrics available in cplab. The vertical line denotes the proposed timing of the Great Oxidation Event (GOE) between Anc. I/III and Anc. I (Kacar et al., 2017). This analysis shows that proteins do get more oxidized around the GOE.

+

The vertical line denotes the proposed timing of the Great Oxidation Event (GOE) between Anc. I/III and Anc. I (Kacar et al., 2017). This analysis shows that proteins become more oxidized around the GOE.

Human proteins in canprot

-

canprot has a database of amino acid compositions of human proteins assembled from UniProt. Use protcomp() to get the amino acid composition. This example is for alanine aminotransferase, which has a UniProt ID of P24298:

-
(pc <- protcomp("P24298"))
-
## $uniprot
-## [1] "P24298"
-## 
-## $aa
-##              protein organism  ref abbrv chains Ala Cys Asp Glu Phe Gly
-## 5457 sp|P24298|ALAT1    HUMAN <NA>  <NA>      1  51  10  20  34  21  38
-##      His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr
-## 5457   9  18  16  55  12  12  33  30  37  25  18  41   1  15
-
Zc(pc$aa)
+

canprot has a database of amino acid compositions of human proteins assembled from UniProt. Use human.aa() to get the amino acid composition. This example is for alanine aminotransferase, which has a UniProt ID of P24298:

+
(aa <- human.aa("P24298"))
+
##              protein organism  ref abbrv chains Ala Cys Asp Glu Phe Gly His Ile
+## 5457 sp|P24298|ALAT1    HUMAN <NA>  <NA>      1  51  10  20  34  21  38   9  18
+##      Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr
+## 5457  16  55  12  12  33  30  37  25  18  41   1  15
+
Zc(aa)
##       5457 
 ## -0.1482091

Do you have a list of UniProt IDs for a differential expression dataset? Great! We can use those to calculate chemical metrics and make a boxplot. The IDs in this example come from Figure 5 of Doron et al. (2020), where they were identified as differentially regulated proteins in aggregate (3D) cell culture compared to monolayer (2D) culture.

@@ -413,15 +409,15 @@

Human proteins in canprot

"P35222", "P20908", "Q15417", "O75822", "P17812", "P05997", "P04080", "O43294", "P08243", "P02458")

With those UniProt IDs for human proteins we can retrieve the amino acid compositions, then calculate a couple of chemical metrics and make some boxplots comparing the groups of differentially regulated proteins.

-
aa_down <- protcomp(down)$aa
-aa_up <- protcomp(up)$aa
+
aa_down <- human.aa(down)
+aa_up <- human.aa(up)
 bp_names <- paste0(c("Down (", "Up ("), c(nrow(aa_down), nrow(aa_up)), c(")", ")"))
 
 par(mfrow = c(1, 2))
 
 Zclist <- list(Zc(aa_down), Zc(aa_up))
 names(Zclist) <- bp_names
-boxplot(Zclist, ylab = cplab$Zc, col = c(3, 2))
+boxplot(Zclist, ylab = cplab$Zc, col = c(4, 2))
 names(Zclist) <- c("x", "y")
 p <- do.call(wilcox.test, Zclist)$p.value
 legend("bottomleft", paste("p =", round(p, 3)), bty = "n")
@@ -429,14 +425,14 @@ 

Human proteins in canprot

nH2Olist <- list(nH2O(aa_down), nH2O(aa_up)) names(nH2Olist) <- bp_names -boxplot(nH2Olist, ylab = cplab$nH2O, col = c(3, 2)) +boxplot(nH2Olist, ylab = cplab$nH2O, col = c(4, 2)) names(nH2Olist) <- c("x", "y") p <- do.call(wilcox.test, nH2Olist)$p.value legend("bottomleft", paste("p =", round(p, 3)), bty = "n") title("Stoichiometric hydration state", font.main = 1)
-

-

The colors are chosen because green and red traditionally are used for down- and up-regulated proteins. We find no significant difference of ZC for the differentially regulated proteins. In contrast, nH2O is significantly lower for up-regulated than for down-regulated proteins.

-

This dehydration trend characterizes most datasets for 3D cell culture (Dick, 2021). The differential expression datasets analyzed in that paper, which were previously in canprot, have been moved to JMDplots.

+

+

We find no significant difference of ZC for the differentially regulated proteins. In contrast, nH2O is significantly lower for up-regulated than for down-regulated proteins.

+

A similar dehydration trend characterizes most datasets for 3D cell culture (Dick, 2021). The differential expression datasets analyzed in that paper, which were previously in canprot, have been moved to JMDplots.

References

diff --git a/inst/doc/metrics.html b/inst/doc/metrics.html index 3948d4a..d67d9e9 100644 --- a/inst/doc/metrics.html +++ b/inst/doc/metrics.html @@ -344,11 +344,11 @@

More about metrics

-

The canprot package has functions to calculate chemical metrics of proteins from their amino acid composition. This vignette was compiled on 2024-02-29 with canprot version 1.1.2-24.

+

The canprot package calculates chemical metrics of proteins from amino acid compositions. This vignette was compiled on 2024-03-04 with canprot version 1.1.2-31.

Previous vignette: Introduction to canprot

More details on chemical metrics

-

Carbon oxidation state (ZC) is calculated from elemental ratios, but stoichiometric hydration state (nH2O) depends on a specific choice of thermodynamic components (or basis species). By default, nH2O is calculated from a theoretical reaction to form proteins from this set of basis species, abbreviated as QEC:

+

Carbon oxidation state (ZC) is calculated from elemental ratios, but stoichiometric hydration state (nH2O) depends on a specific choice of thermodynamic components (or basis species). In canprot, nH2O is calculated from a theoretical reaction to form proteins from the following set of basis species, abbreviated as QEC:

  • glutamine (Q)
  • glutamic acid (E)
  • @@ -420,7 +420,7 @@

    Implementation details

GRAVY, pI, and molecular weight

-

There are also functions for calculating the grand average of hydropathicity (GRAVY) and isoelectric point (pI) of proteins. Below, values for representative proteins are compared with results from the ProtParam tool (Gasteiger et al., 2005) in UniProt.

+

There are also functions for calculating the grand average of hydropathy (GRAVY) and isoelectric point (pI) of proteins. Below, values for representative proteins are compared with results from the ProtParam tool (Gasteiger et al., 2005) in UniProt.

We first look up a few proteins in CHNOSZ’s list of proteins, then get the amino acid compositions.

iprotein <- CHNOSZ::pinfo(c("LYSC_CHICK", "RNAS1_BOVIN", "AMYA_PYRFU", "CSG_HALJP"))
 AAcomp <- CHNOSZ::pinfo(iprotein)
diff --git a/man/CLES.Rd b/man/CLES.Rd index a256b60..30f8ee5 100644 --- a/man/CLES.Rd +++ b/man/CLES.Rd @@ -1,7 +1,7 @@ \encoding{UTF-8} \name{CLES} \alias{CLES} -\title{Common Language Effect Size} +\title{Common language effect size} \description{ Calculate the common language effect size. } diff --git a/man/add.cld.Rd b/man/add.cld.Rd index 37ab9ef..baf1eea 100644 --- a/man/add.cld.Rd +++ b/man/add.cld.Rd @@ -1,7 +1,7 @@ \encoding{UTF-8} \name{add.cld} \alias{add.cld} -\title{Compact Leter Display} +\title{Compact letter display} \description{ Adds compact letter display (significant difference letters) to a boxplot. } diff --git a/man/human.Rd b/man/human.Rd index 90de44b..8a0106d 100644 --- a/man/human.Rd +++ b/man/human.Rd @@ -7,7 +7,7 @@ \alias{human_additional} \alias{human_extra} \alias{uniprot_updates} -\title{Data for Amino Acid Compositions of Human Proteins} +\title{Data for amino acid compositions of human proteins} \description{ Data for amino acid compositions of proteins and conversion from old to new UniProt IDs. } diff --git a/man/human.aa.Rd b/man/human.aa.Rd index 2d63a6d..ce5ffe7 100644 --- a/man/human.aa.Rd +++ b/man/human.aa.Rd @@ -1,7 +1,7 @@ \encoding{UTF-8} \name{human.aa} \alias{human.aa} -\title{Get Amino Acid Compositions of Human Proteins} +\title{Get amino acid compositions of human proteins} \description{ Get amino acid compositions of human proteins from their UniProt IDs. } diff --git a/man/metrics.Rd b/man/metrics.Rd index 0d4f6d7..c79508b 100644 --- a/man/metrics.Rd +++ b/man/metrics.Rd @@ -15,7 +15,7 @@ \alias{OC} \alias{SC} \alias{cplab} -\title{Calculate Chemical Metrics for Proteins} +\title{Calculate chemical metrics for proteins} \description{ Calculate chemical metrics for proteins from their amino acid compositions. } diff --git a/man/read.fasta.Rd b/man/read.fasta.Rd index 13f2bf2..2244b47 100644 --- a/man/read.fasta.Rd +++ b/man/read.fasta.Rd @@ -3,7 +3,7 @@ \alias{read.fasta} \alias{count.aa} \alias{aasum} -\title{Functions for Reading FASTA Files} +\title{Functions for reading FASTA files} \description{ Read protein amino acid composition or sequences from a file and count numbers of amino acids in given sequences. @@ -58,10 +58,10 @@ If only one of \code{start} or \code{stop} is present, the other defaults to 1 ( \code{aasum} sums the amino acid compositions in the input \code{AAcomp} data frame. It only applies to columns with the three-letter abbreviations of amino acids and to a column named \code{chains} (if present). -The amino acid compositions are multiplied by the indicated \code{abundance} after recycling to the number of proteins. -If \code{average} is TRUE the sum is divided by the number of proteins. +The values in these columns are multiplied by the indicated \code{abundance} after recycling to the number of proteins. +The values in these columns are then summed; if \code{average} is TRUE then the sum is divided by the number of proteins. +Proteins with missing values (NA) of amino acid composition or abundance are omitted from the calculation. The output has one row and the same number of columns as the input; the value in the non-amino acid columns is taken from the first row of the input. -Note that missing values (NA) of amino acid composition are treated as zero. } \value{ diff --git a/vignettes/introduction.Rmd b/vignettes/introduction.Rmd index c287003..72751f8 100644 --- a/vignettes/introduction.Rmd +++ b/vignettes/introduction.Rmd @@ -37,8 +37,8 @@ Previous vignette: [Demo for canprot](demo.html) | Next vignette: [More about me Use `read.fasta()` to read the file and return a data frame of amino acid composition. ```{r read.fasta_KHAB17} -my_fasta_file <- system.file("extdata/fasta/KHAB17.fasta", package = "canprot") -aa <- read.fasta(my_fasta_file) +fasta_file <- system.file("extdata/fasta/KHAB17.fasta", package = "canprot") +aa <- read.fasta(fasta_file) ``` The result is small enough that we can look at it here.