New API #13

Merged
merged 235 commits into from
Sep 6, 2022

zeehio
Member

@zeehio zeehio commented Aug 7, 2022

I have been working on a new API for the package.

The API has several design advantages:

  • It integrates better with the Bioconductor ecosystem
  • It is designed to provide functions that can be applied to a dataset, a sample or a single spectrum/chromatogram (e.g. smooth)
  • Functions applied to datasets are delayed by default, so they are not applied until the data is needed (or until the user calls realize(dataset)). When the functions are finally applied, the samples are processed in parallel.
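
To make the delayed-evaluation idea in the last bullet concrete, here is a minimal base R sketch. It is not the GCIMS implementation: `new_lazy_dataset`, `delay_op` and `realize_ops` are hypothetical names invented for this illustration, and the "samples" are plain numeric vectors.

```r
# Toy stand-in for a dataset with delayed operations (not the GCIMS code).
# An environment gives reference semantics, so the object is modified in place.
new_lazy_dataset <- function(samples) {
  ds <- new.env()
  ds$samples <- samples      # named list of numeric vectors
  ds$pending <- list()       # operations queued but not yet applied
  ds$history <- character()  # names of the operations already applied
  ds
}

# Queue an operation instead of applying it right away.
delay_op <- function(ds, name, fun) {
  ds$pending[[name]] <- fun
  invisible(ds)
}

# Apply all pending operations to every sample, then move them to the history.
realize_ops <- function(ds) {
  for (name in names(ds$pending)) {
    fun <- ds$pending[[name]]
    # Samples are independent, so lapply() could be swapped for a parallel apply
    ds$samples <- lapply(ds$samples, fun)
    ds$history <- c(ds$history, name)
  }
  ds$pending <- list()
  invisible(ds)
}

ds <- new_lazy_dataset(list(a = c(1, 2, 3), b = c(4, 5, 6)))
delay_op(ds, "double", function(x) x * 2)
delay_op(ds, "add_one", function(x) x + 1)
length(ds$pending)   # two operations queued, samples still untouched
realize_ops(ds)
ds$samples$a         # operations applied in the order they were queued
```

Queuing the operations keeps cheap calls cheap: nothing is computed until a sample is actually requested, and all queued steps can then be run per sample in one pass.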

I have implemented a part of the vignette, including reshape/cutting and smoothing.

I will go for alignment next. We can review the design of the API if you like. Look at the vignette to see if it makes sense to you. I've used generic functions from ProtGenerics (filterRt) and Biobase (smooth), implementing methods for our classes.

To ease the work done on the GUI by @AnitaMSLH, I have been working on a whole new API.

@zeehio
Member Author

zeehio commented Aug 7, 2022

The vignette: It runs until line 213 where we exit because the rest isn't implemented yet:

---
title: "Introduction to GCIMS, Alternative API"
output: BiocStyle::pdf_document
package: GCIMS
author: "GCIMS authors"
date: "`r format(Sys.Date(), '%F')`"
abstract: >
  An introduction to the GCIMS package, showing the most relevant functions and
  a proposed workflow. This includes loading demo samples, adding sample
  annotations, preprocessing the spectra, alignment, detecting peaks and regions
  of interest (ROIs), clustering of ROIs across samples, peak integration and
  building a peak table.
vignette: >
  %\VignetteIndexEntry{Introduction to GCIMS, Alternative API}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r setup}
library(BiocParallel)
library(ggplot2)
library(GCIMS)
```
The GCIMS package allows you to import your Gas Chromatography - Ion Mobility Spectrometry samples,
preprocess them, align them to each other and build a peak table with the relevant features.
This vignette will use a small dataset consisting of a mixture of three ketones.
# Downloading the dataset:
```{r}
# The folder where we will download the samples:
samples_directory <- "threeketones"
# Download the ketones dataset:
tryCatch({
  download_three_ketones_dataset(samples_directory)
}, error = function(e) {
  message("The download of the samples did not succeed. The vignette can't continue.")
  message(conditionMessage(e))
  knitr::knit_exit()
})
```
```{r}
# Check that the files are downloaded:
list.files(samples_directory)
```
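Configure the parallelization of the workflow. By default we register a serial backend, which gives better error reporting; to use three workers, uncomment the `SnowParam` line instead: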
```{r}
# Disable parallelization (useful for better error reporting):
register(SerialParam(progressbar = TRUE), default = TRUE)
# Enable parallelization with 3 workers:
#register(SnowParam(workers = 3, progressbar = TRUE, exportglobals = FALSE), default = TRUE)
```
# Import data
Please start by preparing an Excel spreadsheet (or a CSV/TSV file if you prefer)
with your samples and their annotations. Please name the first column `SampleID`
and the second column `FileName`. We will use those annotations in plots.
```{r}
annotations <- readr::read_csv(file.path(samples_directory, "annotations.csv"), show_col_types = FALSE)
annotations
```
If you have some samples and you want to create your annotations file, you can use
the following code to create a template that you can fill with Excel:
```{r eval=FALSE}
# Your samples match this extension:
samples_ext <- "*.mea.gz"
filenames <- list.files(samples_directory, pattern = utils::glob2rx(samples_ext),
                        recursive = TRUE, include.dirs = FALSE)
# Build a template with one row per sample file:
annotations_template <- data.frame(
  SampleID = tools::file_path_sans_ext(filenames, compression = TRUE),
  FileName = filenames
)
# You may need to run: install.packages("writexl")
writexl::write_xlsx(annotations_template, file.path(samples_directory, "annotations.xlsx"))
```
This writes an `annotations.xlsx` file to the samples directory, which you can edit at will.
Once you are happy with the annotations, you can load them back here with:
```{r eval=FALSE}
annotations <- readxl::read_excel(file.path(samples_directory, "annotations.xlsx"))
annotations
```
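
Before building the dataset it can help to sanity-check the annotations table. The sketch below is only an illustration in base R: `check_annotations` is a hypothetical helper, not a GCIMS function, and it assumes the column names used by the template code above (`SampleID` and `FileName`). The file names are made up.

```r
# Hypothetical helper (not part of GCIMS): basic checks on an annotations table.
check_annotations <- function(annotations, required = c("SampleID", "FileName")) {
  missing_cols <- setdiff(required, colnames(annotations))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (anyDuplicated(annotations$SampleID) > 0) {
    stop("SampleID values must be unique")
  }
  invisible(annotations)
}

# Example table; the file names are invented for illustration.
annotations_example <- data.frame(
  SampleID = c("Ketones1", "Ketones2"),
  FileName = c("ketones1.mea.gz", "ketones2.mea.gz"),
  Group = c("A", "B")
)
check_annotations(annotations_example)

# A misnamed column is reported up front instead of failing later on:
bad <- data.frame(SampleID = "Ketones1", Filename = "ketones1.mea.gz")
result <- tryCatch(check_annotations(bad), error = function(e) conditionMessage(e))
result
```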
# Create a GCIMSDataset object
```{r}
# Delete the intermediate_files directory, if it exists
if (dir.exists("intermediate_files")) {
  unlink("intermediate_files", recursive = TRUE)
}
```
```{r}
dataset <- GCIMSDataset(annotations, base_dir = samples_directory, scratch_dir = "intermediate_files")
dataset
```
Explore one sample:
```{r}
ket1 <- getGCIMSSample(dataset, sample = "Ketones1")
plotRaw(ket1, rt_range = c(0, 1000), dt_range = c(7.5, 17))
```
To get the sample, we first had to *realize* all the pending operations, modifying
the dataset in place. We can see how the pending operation is now part of the history:
```{r}
dataset
```
Plot the RIC and the TIS to get an overview of the dataset:
```{r}
plotTIS(dataset)
```
```{r}
plotRIC(dataset)
```
# Filter the retention and drift time of your samples
```{r}
filterRt(dataset, rt = c(0, 1300)) # in s
filterDt(dataset, dt = c(1, 17)) # in ms
dataset
```
```{r}
ket1afterfilter <- getGCIMSSample(dataset, sample = "Ketones1")
ket1afterfilter
```
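
Conceptually, filtering by retention and drift time amounts to cropping the intensity matrix along both axes. The following base R sketch shows the idea on random data. The matrix layout (drift time along rows, retention time along columns) and the helper `crop_sample` are assumptions made for this example, not the GCIMS internals.

```r
# Toy sample: drift time along rows, retention time along columns.
# This layout is an assumption for the sketch, not necessarily the GCIMS one.
drift_time_ms <- seq(0, 20, by = 0.5)
retention_time_s <- seq(0, 1500, by = 50)
intensity <- matrix(
  runif(length(drift_time_ms) * length(retention_time_s)),
  nrow = length(drift_time_ms),
  ncol = length(retention_time_s)
)

# Hypothetical helper: keep only the points whose axis value is within [min, max].
crop_sample <- function(intensity, dt, rt, dt_range, rt_range) {
  keep_dt <- dt >= dt_range[1] & dt <= dt_range[2]
  keep_rt <- rt >= rt_range[1] & rt <= rt_range[2]
  list(
    intensity = intensity[keep_dt, keep_rt, drop = FALSE],
    drift_time_ms = dt[keep_dt],
    retention_time_s = rt[keep_rt]
  )
}

cropped <- crop_sample(intensity, drift_time_ms, retention_time_s,
                       dt_range = c(1, 17), rt_range = c(0, 1300))
range(cropped$drift_time_ms)
range(cropped$retention_time_s)
dim(cropped$intensity)
```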
# Smoothing
You can remove noise from your sample using a Savitzky-Golay filter, applied
both in drift time and in retention time.
The Savitzky-Golay filter has two main parameters: the filter length and the filter order.
It is recommended to use a filter order of 2, but the filter length must be selected
so it is large enough to remove noise but always smaller than the peak width to
prevent distorting the peaks.
You can apply the smoothing filter to a single IMS spectrum or to a single chromatogram
to see how noise is removed and how peaks are not distorted. Tweak the filter lengths
and, once you are happy, apply the smoothing filter to the whole dataset.
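
As a rough illustration of how the filter length and order interact, here is a small Savitzky-Golay sketch in base R. It is not the `sgolayfilt` code used by the package: `sgolay_coefs` and `sgolay_smooth` are hypothetical helpers, lengths are given in points rather than in ms or s, and the edges are simply left as `NA`.

```r
# Savitzky-Golay smoothing coefficients for a window of `filter_length` points
# (odd) and a polynomial of degree `filter_order`, from a least-squares fit.
sgolay_coefs <- function(filter_length, filter_order) {
  stopifnot(filter_length %% 2 == 1, filter_order < filter_length)
  half <- (filter_length - 1) / 2
  # Design matrix: one row per window position, one column per polynomial power
  X <- outer(-half:half, 0:filter_order, `^`)
  # The fitted value at the window centre is the first row of (X'X)^-1 X'
  solve(crossprod(X), t(X))[1, ]
}

sgolay_smooth <- function(x, filter_length, filter_order) {
  coefs <- sgolay_coefs(filter_length, filter_order)
  # Centred moving filter; the first and last `half` points are left as NA
  as.numeric(stats::filter(x, coefs, sides = 2))
}

set.seed(1)
x_axis <- seq(-5, 5, length.out = 201)
peak <- exp(-x_axis^2 / 2)                  # Gaussian peak, standard deviation 1
noisy <- peak + rnorm(length(x_axis), sd = 0.02)

# A window narrower than the peak, applied to the noisy signal:
narrow <- sgolay_smooth(noisy, filter_length = 11, filter_order = 2)

# A window much wider than the peak flattens it, even without noise:
wide <- sgolay_smooth(peak, filter_length = 151, filter_order = 2)
max(wide, na.rm = TRUE) < max(peak)
```

Plotting `noisy`, `narrow` and `wide` against `x_axis` shows the trade-off described above: the short window follows the peak shape, while the long one lowers and broadens it.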
```{r}
one_ims_spec <- getIMS(ket1afterfilter, rt_range = 97.11)
```
```{r}
one_ims_smoothed <- smooth(one_ims_spec, dt_length_ms = 0.14, dt_order = 2)
to_plot <- dplyr::bind_rows(
  NoSmoothed = as.data.frame(one_ims_spec),
  Smoothed = as.data.frame(one_ims_smoothed),
  .id = "Status"
)
ggplot(to_plot) +
  geom_line(aes(x = drift_time_ms, y = intensity, colour = Status)) +
  coord_cartesian(xlim = c(7, 10))
ggplot(to_plot) +
  geom_line(aes(x = drift_time_ms, y = intensity, colour = Status)) +
  coord_cartesian(xlim = c(7, 10), ylim = c(0, 300))
```
```{r}
one_chrom <- getEIC(ket1afterfilter, dt_range = 10.4)
```
```{r}
one_chrom_smoothed <- smooth(one_chrom, rt_length_s = 3, rt_order = 2)
to_plot <- dplyr::bind_rows(
  NoSmoothed = as.data.frame(one_chrom),
  Smoothed = as.data.frame(one_chrom_smoothed),
  .id = "Status"
)
ggplot(to_plot) +
  geom_line(aes(x = retention_time_s, y = intensity, colour = Status))
ggplot(to_plot) +
  geom_line(aes(x = retention_time_s, y = intensity, colour = Status)) +
  coord_cartesian(xlim = c(200, 250))
```

This branch can be installed with:

remotes::install_github("sipss/GCIMS@sergio/new-api", auth_token= "your-personal-access-token")

@zeehio
Member Author

zeehio commented Aug 30, 2022

Overview as of 508d153:

  • Can run the analysis entirely in RAM or saving the samples on disk.
  • Runs the 3 samples vignette (twice?) in roughly 3 minutes

The new API vignette has some differences with respect to the old vignette:

  • Cutting samples:

    • New uses rt: [0,1300] s and dt: [1, 17] ms [we easily keep a big region without peaks]
    • Old uses rt: [0,1300] s and dt: [7, 17] ms
  • Smoothing:

    • New uses rt: 3s dt: 0.14 ms
    • Old uses rt: 19 points, dt: 19 points
  • Decimation:

    • Old uses rt: 2, dt: 2
    • New uses rt: 1, dt: 2
  • Time required to build vignette:

    • Old takes 3.6 minutes
    • New takes 3.2 minutes (with a 60% larger drift time axis and twice the number of rt points)
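
The smoothing settings above are given in physical units for the new API and in points for the old one. To compare them, a length in seconds or milliseconds has to be converted into a number of points using the sampling step of the axis. The helper below is a hypothetical sketch of that arithmetic, not the conversion used by the package, and the sampling steps are made up.

```r
# Hypothetical helper: convert a filter length in physical units into an odd
# number of points, given the sampling step of the axis.
length_to_points <- function(length_units, step_units) {
  n <- round(length_units / step_units)
  if (n %% 2 == 0) n <- n + 1   # Savitzky-Golay windows need an odd length
  max(n, 3)
}

# Made-up sampling steps, only to show the arithmetic:
length_to_points(3, step_units = 0.5)       # 3 s at 0.5 s per point
length_to_points(0.14, step_units = 0.01)   # 0.14 ms at 0.01 ms per point
```

Rounding even lengths up to the next odd number is an arbitrary choice in this sketch; rounding down would be equally defensible.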

To do:

  • push changes to clustering

Naming issues:

  • Decide on as.data.frame vs tidy for GCIMSChromatogram and GCIMSSpectrum?
  • Better name for add_peaklist_rect? Also it needs the dataset, since we want the sample annotations as well to color_by.. or use a peaklist class?
  • getGCIMSSample vs getSample vs base::sample name/generic
  • getIMS for getting a spectrum? getSpectrum?
  • deprecated getEIC for getting a Chromatogram? getChromatogram?

Architecture issues:

  • Should we have an "AlignmentResults" class with plot method?
  • Should we have a "ClusteringResults" class with plot method?
  • getTIS, getRIC might return TIS and RIC classes with plot method?
  • Do we need MChromatograms and MSpectrum classes? They would simplify the smoothing and baseline plots, and getTIS and getRIC for datasets could benefit as well. So yes. They would also carry the annotations, so plots can be based on the annotations as well
  • Should we use class-method.R or method-class.R file names?
  • After realize() we should save the dataset object in the folder as well, so we can import the directory

Features we could split and depend on:

  • Should sgolayfilt be in an independent package?
  • Put sgolayfilt benchmarks on a vignette? [Not worth it]
  • Implement geom_*_matrix if possible. Create a package for it?
    • geom_raster_matrix with nativeRaster for overview plots
    • geom_line_matrix to plot multiple geom lines? [Not that easy to do it efficiently in ggplot]

Dependencies / Imports:

  • S4Vectors only used for DataFrame. Can we use data.frame instead? [Not worth the exploration]
  • Remove lifecycle dependency?
  • importFrom rlang abort warn inform

Reading data

  • Helpers for importing CSV files that are not overly specific to a software version
  • Support nos/scm files

Physical units:

  • Peak intensity in volts vs a.u.
  • Support for reduced mobility units. dt_range -> ims_range & ims_unit ?
  • Support for retention index units. rt_range -> gc_range & gc_unit ?
  • find_rip verbosity information prints too many significant digits on drift time limits

Plots

  • plot(dataset) (several samples in subplots, same intensity scale)
  • plot(list of samples) (several samples in subplots, same intensity scale)
  • align should save RIC and TIS before transforming samples, so we can get a before and after plot of them

Vignette and docs:

  • Review vignette plots
  • Review docs from:
    • as.data.frame vs tidy for GCIMSChromatogram and GCIMSSpectrum
    • plot for GCIMSChromatogram and GCIMSSpectrum
    • tidy for GCIMSSample
    • add_peaklist_rect
    • create_annotations_table
    • getIMS, getChromatogram
    • integratePeaks
    • peaks
    • plotRaw vs plot
  • Runnable examples for > 80% of functions of the new API

@zeehio zeehio merged commit 855edca into master Sep 6, 2022
@zeehio zeehio mentioned this pull request Nov 16, 2022
27 tasks
@lalo-caballero lalo-caballero deleted the sergio/new-api branch March 31, 2023 08:14