New API #13

zeehio · 2022-08-07T18:45:37Z

I have been working on a new API for the package.

The API has several design advantages:

- It integrates better with the Bioconductor ecosystem.
- It provides functions that can be applied to a dataset, a sample, or a single spectrum/chromatogram (e.g. smooth).
- Functions applied to datasets are delayed by default: they are not applied until the data is needed (or the user calls realize(dataset)). When the functions are finally applied, they are processed in parallel.
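
To make the delayed-evaluation idea concrete, here is a minimal sketch in plain R. It does not use the GCIMS classes: `new_lazy_dataset()`, `queue_op()` and `realize_ops()` are hypothetical names invented for this illustration, and the actual implementation (which also handles parallel processing) may work quite differently. Unlike a dataset that is modified in place, this sketch returns modified copies.

```r
# Hypothetical illustration of delayed operations; not the GCIMS implementation.
new_lazy_dataset <- function(samples) {
  list(samples = samples, pending = list(), history = character(0))
}

# Queue an operation instead of running it right away.
queue_op <- function(dataset, name, fun) {
  dataset$pending[[name]] <- fun
  dataset
}

# Apply all pending operations to every sample, in order, then clear the queue.
realize_ops <- function(dataset) {
  for (name in names(dataset$pending)) {
    dataset$samples <- lapply(dataset$samples, dataset$pending[[name]])
    dataset$history <- c(dataset$history, name)
  }
  dataset$pending <- list()
  dataset
}

ds <- new_lazy_dataset(list(a = c(1, 2, 3), b = c(4, 5, 6)))
ds <- queue_op(ds, "double", function(x) x * 2)
ds <- queue_op(ds, "add_one", function(x) x + 1)
# Nothing has been computed yet: ds$samples is still the original data.
ds_realized <- realize_ops(ds)
```

After `realize_ops()`, the queue is empty and the operation names have moved to `history`, which mirrors the "pending operation becomes part of history" behaviour described for the dataset.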

I have implemented a part of the vignette, including reshape/cutting and smoothing.

I will go for alignment next. We can review the design of the API if you like. Look at the vignette to see if it makes sense to you. I've used generic functions from ProtGenerics (filterRt) and Biobase (smooth), implementing methods for our classes.
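
As a rough sketch of what "one generic, several classes" looks like with S4, the example below defines a toy generic and two methods using only the methods package. `ToySpectrum`, `ToySample` and `toySmooth()` are made-up names for illustration; the branch implements methods for existing generics rather than defining new ones, and its classes are richer than these.

```r
library(methods)

# Hypothetical toy classes; not the GCIMS class definitions.
setClass("ToySpectrum", representation(intensity = "numeric"))
setClass("ToySample", representation(spectra = "list"))

# One generic, dispatched on the class of `x`.
setGeneric("toySmooth", function(x, ...) standardGeneric("toySmooth"))

# Method for a single spectrum: 3-point moving average, edges left untouched.
setMethod("toySmooth", "ToySpectrum", function(x, ...) {
  y <- x@intensity
  n <- length(y)
  if (n >= 3) {
    idx <- 2:(n - 1)
    y[idx] <- (x@intensity[idx - 1] + x@intensity[idx] + x@intensity[idx + 1]) / 3
  }
  x@intensity <- y
  x
})

# Method for a sample: reuse the spectrum method on each spectrum.
setMethod("toySmooth", "ToySample", function(x, ...) {
  x@spectra <- lapply(x@spectra, toySmooth)
  x
})

spec <- new("ToySpectrum", intensity = c(0, 3, 0, 3, 0))
smp <- new("ToySample", spectra = list(spec, spec))
smoothed_spec <- toySmooth(spec)
smoothed_smp <- toySmooth(smp)
```

The same call, `toySmooth(x)`, works on either object, which is the property the new API aims for with datasets, samples, spectra and chromatograms.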

ease the work done at the GUI by @AnitaMSLH I have been working on a whole new API

…changing the docs. These changes happened automatically with: usethis::use_lifecycle() You can now add badges in documentation topics by inserting one of: #' `r lifecycle::badge('experimental')` #' `r lifecycle::badge('superseded')` #' `r lifecycle::badge('deprecated')` See https://lifecycle.r-lib.org/articles/communicate.html


zeehio · 2022-08-07T21:26:11Z

The vignette runs until line 213, where we exit because the rest isn't implemented yet:

GCIMS/vignettes/introduction-to-gcims-alternative-api.Rmd

Lines 1 to 213 in da95167

    
           --- 
        
           title: "Introduction to GCIMS, Alternative API" 
        
           output: BiocStyle::pdf_document 
        
           package: GCIMS 
        
           author: "GCIMS authors" 
        
           date: "`r format(Sys.Date(), '%F')`" 
        
           abstract: > 
        
             An introduction to the GCIMS package, showing the most relevant functions and 
        
             a proposed workflow. This includes loading demo samples, adding sample 
        
             annotations, preprocessing the spectra, alignment, detecting peaks and regions 
        
             of interest (ROIs), clustering of ROIs across samples, peak integration and 
        
             building a peak table. 
        
           vignette: > 
        
             %\VignetteIndexEntry{Introduction to GCIMS, Alternative API} 
        
             %\VignetteEngine{knitr::rmarkdown} 
        
             %\VignetteEncoding{UTF-8} 
        
           --- 
        
           ```{r, include = FALSE} 
        
           knitr::opts_chunk$set( 
        
             collapse = TRUE, 
        
             comment = "#>" 
        
           ) 
        
           ``` 
        
           ```{r setup} 
        
           library(BiocParallel) 
        
           library(ggplot2) 
        
           library(GCIMS) 
        
           ``` 
        
           The GCIMS package allows you to import your Gas Chromatography - Ion Mobility Spectrometry samples, 
        
           preprocess them, align them to each other and build a peak table with the relevant features. 
        
           This vignette will use a small dataset consisting of a mixture of three ketones. 
        
           # Downloading the dataset: 
        
           ```{r} 
        
           # The folder where we will download the samples: 
        
           samples_directory <- "threeketones" 
        
           # Download the ketones dataset: 
        
           tryCatch({ 
        
             download_three_ketones_dataset(samples_directory) 
        
           }, error = function(e) { 
        
             message("The download of the samples did not succeed. The vignette can't continue.") 
        
             message(conditionMessage(e)) 
        
             knitr::knit_exit() 
        
           }) 
        
           ``` 
        
           ```{r} 
        
           # Check that the files are downloaded: 
        
           list.files(samples_directory) 
        
           ``` 
        
           Set up parallelization of the workflow. A serial backend is registered here; uncomment the `SnowParam` line to use three workers: 
        
           ```{r} 
        
           # disable parallelization: (Useful for better error reporting) 
        
           register(SerialParam(progressbar = TRUE), default = TRUE) 
        
           # enable parallelization with 3 workers: 
        
           #register(SnowParam(workers = 3, progressbar = TRUE, exportglobals = FALSE), default = TRUE) 
        
           ``` 
        
           # Import data 
        
           Please start by preparing an Excel spreadsheet (or a CSV/TSV file if you prefer) 
        
           with your samples and their annotations. Please name the first column `SampleID` 
        
           and the second column `FileName`. We will use those annotations in plots. 
        
           ```{r} 
        
           annotations <- readr::read_csv(file.path(samples_directory, "annotations.csv"), show_col_types = FALSE) 
        
           annotations 
        
           ``` 
        
           If you have some samples and you want to create your annotations file, you can use 
        
           the following code to create a template that you can fill with Excel: 
        
           ```{r eval=FALSE} 
        
           # Your samples match this extension: 
        
           samples_ext <- "*.mea.gz" 
        
           filenames <- list.files(samples_directory, pattern = utils::glob2rx(samples_ext), 
        
                                   recursive = TRUE, include.dirs = FALSE) 
        
           # 
        
           annotations_template <- data.frame( 
        
             SampleID = tools::file_path_sans_ext(filenames, compression = TRUE), 
        
             FileName = filenames 
        
           ) 
        
           # You may need to run: install.packages("writexl") 
        
           writexl::write_xlsx(annotations_template, file.path(samples_directory, "annotations.xlsx")) 
        
           ``` 
        
           You will find the `annotations.xlsx` file that you can edit and change at will. 
        
           Once you are happy with the annotations, you can load them back here with: 
        
           ```{r eval=FALSE} 
        
           annotations <- readxl::read_excel(file.path(samples_directory, "annotations.xlsx")) 
        
           annotations 
        
           ``` 
        
           # Create a GCIMSDataset object 
        
           ```{r} 
        
           # Delete the intermediate_files directory, if it exists 
        
           if (dir.exists("intermediate_files")) { 
        
             unlink("intermediate_files", recursive = TRUE) 
        
           } 
        
           ``` 
        
           ```{r} 
        
           dataset <- GCIMSDataset(annotations, base_dir = samples_directory, scratch_dir = "intermediate_files") 
        
           dataset 
        
           ``` 
        
           Explore one sample: 
        
           ```{r} 
        
           ket1 <- getGCIMSSample(dataset, sample = "Ketones1") 
        
           plotRaw(ket1, rt_range = c(0, 1000), dt_range = c(7.5, 17)) 
        
           ``` 
        
           To get the sample, we first had to *realize* all the pending operations, modifying 
        
           the dataset in place. We can see how the pending operation is now part of history: 
        
           ```{r} 
        
           dataset 
        
           ``` 
        
           Plot the RIC and the TIS to get an overview of the dataset: 
        
           ```{r} 
        
           plotTIS(dataset) 
        
           ``` 
        
           ```{r} 
        
           plotRIC(dataset) 
        
           ``` 
        
           # Filter the retention and drift time of your samples 
        
           ```{r} 
        
           filterRt(dataset, rt = c(0, 1300)) # in s 
        
           filterDt(dataset, dt = c(1, 17)) # in ms 
        
           dataset 
        
           ``` 
        
           ```{r} 
        
           ket1afterfilter <- getGCIMSSample(dataset, sample = "Ketones1") 
        
           ket1afterfilter 
        
           ``` 
        
           # Smoothing 
        
           You can remove noise from your sample using a Savitzky-Golay filter, applied 
        
           both in drift time and in retention time. 
        
           The Savitzky-Golay filter has two main parameters: the filter length and the filter order. 
        
           It is recommended to use a filter order of 2, but the filter length must be selected 
        
           so it is large enough to remove noise but always smaller than the peak width to 
        
           prevent distorting the peaks. 
        
           You can apply the smoothing filter to a single IMS spectrum or to a single chromatogram 
        
           to see how noise is removed and how peaks are not distorted. Tweak the filter lengths 
        
           and, once you are happy, apply the smoothing filter to the whole dataset. 
        
           ```{r} 
        
           one_ims_spec <- getIMS(ket1afterfilter, rt_range = 97.11) 
        
           ``` 
        
           ```{r} 
        
           one_ims_smoothed <- smooth(one_ims_spec, dt_length_ms = 0.14, dt_order = 2) 
        
           to_plot <- dplyr::bind_rows( 
        
             NoSmoothed = as.data.frame(one_ims_spec), 
        
             Smoothed = as.data.frame(one_ims_smoothed), 
        
             .id = "Status" 
        
           ) 
        
           ggplot(to_plot) + 
        
             geom_line(aes(x = drift_time_ms, y = intensity, colour = Status)) + 
        
             coord_cartesian(xlim = c(7, 10)) 
        
           ggplot(to_plot) + 
        
             geom_line(aes(x = drift_time_ms, y = intensity, colour = Status)) + 
        
             coord_cartesian(xlim = c(7, 10), ylim = c(0, 300)) 
        
           ``` 
        
           ```{r} 
        
           one_chrom <- getEIC(ket1afterfilter, dt_range = 10.4) 
        
           ``` 
        
           ```{r} 
        
           one_chrom_smoothed <- smooth(one_chrom, rt_length_s = 3, rt_order = 2) 
        
           to_plot <- dplyr::bind_rows( 
        
             NoSmoothed = as.data.frame(one_chrom), 
        
             Smoothed = as.data.frame(one_chrom_smoothed), 
        
             .id = "Status" 
        
           ) 
        
           ggplot(to_plot) + 
        
             geom_line(aes(x = retention_time_s, y = intensity, colour = Status)) 
        
           ggplot(to_plot) + 
        
             geom_line(aes(x = retention_time_s, y = intensity, colour = Status)) + 
        
             coord_cartesian(xlim = c(200, 250)) 
        
           ```
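
The Savitzky-Golay smoothing described in the vignette can be sketched in base R. This is not the package's `smooth()` method: `sgolay_coefs()` and `sgolay_smooth()` are hypothetical helpers that take the filter length in points (not in ms or s), and they leave the edges as `NA` instead of handling them.

```r
# Savitzky-Golay smoothing coefficients for an odd window length and a
# polynomial order, obtained from a least-squares polynomial fit.
sgolay_coefs <- function(filter_length, filter_order) {
  stopifnot(filter_length %% 2 == 1, filter_order < filter_length)
  half <- (filter_length - 1) / 2
  # Design matrix: powers 0..order of the offsets -half..half
  A <- outer(-half:half, 0:filter_order, `^`)
  # The fitted value at offset 0 is the intercept of the fit,
  # i.e. the first row of the pseudo-inverse of A
  pinv <- solve(crossprod(A), t(A))
  pinv[1, ]
}

# Apply the filter to the interior points; the edges are returned as NA.
sgolay_smooth <- function(x, filter_length, filter_order) {
  coefs <- sgolay_coefs(filter_length, filter_order)
  as.numeric(stats::filter(x, coefs, sides = 2))
}

# Usage on a synthetic noisy peak
set.seed(1)
time_axis <- seq(0, 1, length.out = 101)
noisy <- exp(-((time_axis - 0.5) / 0.1)^2) + rnorm(101, sd = 0.02)
smoothed <- sgolay_smooth(noisy, filter_length = 11, filter_order = 2)
```

A filter of order 2 reproduces any quadratic exactly, which is why a window narrower than the peak removes noise while barely distorting the peak shape.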

This branch can be installed with:

remotes::install_github("sipss/GCIMS@sergio/new-api", auth_token = "your-personal-access-token")


zeehio · 2022-08-30T06:24:05Z

Overview as of 508d153:

Can run the analysis entirely in RAM or saving samples to disk.
Runs the 3-sample vignette (twice?) in roughly 3 minutes.

The new API vignette has some differences with respect to the old vignette:

Cutting samples:
- New uses rt: [0, 1300] s and dt: [1, 17] ms [we easily keep a big region without peaks]
- Old uses rt: [0, 1300] s and dt: [7, 17] ms
Smoothing:
- New uses rt: 3 s, dt: 0.14 ms
- Old uses rt: 19 points, dt: 19 points
Decimation:
- New uses rt: 1, dt: 2
- Old uses rt: 2, dt: 2
Time required to build vignette:
- New takes 3.2 minutes (using a 60% larger drift time axis and twice the number of rt points)
- Old takes 3.6 minutes
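
The "60% larger drift time axis and twice the number of rt points" figure follows from the ranges and decimation factors listed above. A small arithmetic check, assuming both vignettes start from the same raw sampling (an assumption; the sampling rates are not stated here):

```r
# Ranges and decimation factors taken from the comparison above
old_api <- list(dt_range = c(7, 17), rt_range = c(0, 1300), rt_decimation = 2)
new_api <- list(dt_range = c(1, 17), rt_range = c(0, 1300), rt_decimation = 1)

# Ratio of drift time axis lengths (dt decimation is 2 in both cases)
dt_ratio <- diff(new_api$dt_range) / diff(old_api$dt_range)

# Ratio of retention time points: same range, so only decimation matters
rt_ratio <- (diff(new_api$rt_range) / new_api$rt_decimation) /
  (diff(old_api$rt_range) / old_api$rt_decimation)

# Approximate ratio of data points per sample
size_ratio <- dt_ratio * rt_ratio
```

Here `dt_ratio` is 1.6 and `rt_ratio` is 2, so under these assumptions the new vignette handles roughly 3.2 times as many points per sample.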

To do:

push changes to clustering

Naming issues:

Decide on as.data.frame vs tidy for GCIMSChromatogram and GCIMSSpectrum?
Better name for add_peaklist_rect? Also it needs the dataset, since we want the sample annotations as well to color_by.. or use a peaklist class?
getGCIMSSample vs getSample vs base::sample name/generic
getIMS for getting a spectrum? getSpectrum?
deprecated getEIC for getting a Chromatogram? getChromatogram?

Architecture issues:

Should we have an "AlignmentResults" class with plot method?
Should we have a "ClusteringResults" class with plot method?
getTIS, getRIC might return TIS and RIC classes with plot method?
Do we need MChromatograms / MSpectrum classes? They would simplify smoothing and baseline plots, and getTIS and getRIC for datasets could benefit as well. So yes, also because the object would carry the annotations and could plot based on them.
Should we use class-method.R or method-class.R file names?
After realize() we should save the dataset object in the folder as well, so we can import the directory
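
One possible shape for the "save the dataset object in the folder" idea, sketched with base R serialization. The toy list stands in for a dataset object, and the file name `dataset.rds` is a hypothetical choice, not something the package defines.

```r
# Hypothetical scratch directory for this demo (inside the session temp dir)
scratch_dir <- file.path(tempdir(), "intermediate_files_demo")
dir.create(scratch_dir, showWarnings = FALSE, recursive = TRUE)

# A toy stand-in for the dataset object: annotations plus operation history
toy_dataset <- list(
  annotations = data.frame(
    SampleID = c("s1", "s2"),
    FileName = c("s1.mea.gz", "s2.mea.gz")
  ),
  history = c("filterRt", "smooth")
)

# Save the object next to the intermediate files...
dataset_file <- file.path(scratch_dir, "dataset.rds")
saveRDS(toy_dataset, dataset_file)

# ...so that a later session can restore it from the directory alone
restored <- readRDS(dataset_file)
```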

Features we could split and depend on:

Should sgolayfilt be in an independent package?
Put sgolayfilt benchmarks on a vignette? [Not worth it]
Implement geom_*_matrix if possible. Create a package for it?
- geom_raster_matrix with nativeRaster for overview plots
- geom_line_matrix to plot multiple geom lines? [Not that easy to do it efficiently in ggplot]

Dependencies / Imports:

S4Vectors only used for DataFrame. Can we use data.frame instead? [Not worth the exploration]
Remove lifecycle dependency?
importFrom rlang abort warn inform

Reading data

Helpers for importing CSV files that are not overly specific to a software version
Support nos/scm files

Physical units:

Peak intensity in volts vs a.u.
Support for reduced mobility units. dt_range -> ims_range & ims_unit ?
Support for retention index units. rt_range -> gc_range & gc_unit ?
find_rip verbosity information prints too many significant digits on drift time limits
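
For the verbosity item, limiting the printed digits could be as simple as rounding to a few significant digits before building the message. The values and the message text below are made up for illustration; they are not what find_rip prints.

```r
# Hypothetical drift time limits with many digits
drift_time_limits_ms <- c(7.123456789, 9.876543211)

# Keep four significant digits before building the message
limits_txt <- format(signif(drift_time_limits_ms, 4))
msg <- sprintf("RIP searched between %s and %s ms", limits_txt[1], limits_txt[2])
message(msg)
```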

Plots

plot(dataset) (several samples in subplots, same intensity scale)
plot(list of samples) (several samples in subplots, same intensity scale)
align should save RIC and TIS before transforming samples, so we can get a before and after plot of them

Vignette and docs:

zeehio added 13 commits August 3, 2022 08:28

Replace image(GCIMSSample) with plotRaw(GCIMSSample)

417b55b

plotRaw() uses ggplot functions, as gcims_view_sample().

Create a GCIMSDelayedOp class to store delayed operations on GCIMSSam…

a2ec2d1

…ple objects from a GCIMSDataset

Merge branch 'master' of github.com:sipss/GCIMS

104d0e1

Include upper limit when subsetting the matrix for consistency with R…

b04e9f3

… indexing

New API from scratch until smoothing

084b877

Document

b1d1036

Merge branch 'master' of github.com:sipss/GCIMS

6bf0712

Merge branch 'master' into sergio/new-api

69d5201

Remove test from unused object

6876862

Merge remote-tracking branch 'origin/master' into sergio/new-api

6f2fd18

Fixes for the new API

29ea71e

devtools::document()

da95167

zeehio added 16 commits August 9, 2022 22:46

Create scratch dir on GCIMSDataset initialization

513966f

Merge master changes into sergio/new-api

422921c

Delete Rplots.pdf

ecc5e6d

gitlab-ci can now set the coverage string in the job

4e185a8

Fix warning about long line in manual

589af64

Merge remote-tracking branch 'origin/master' into sergio/new-api

60f9962

Provide updateObject method instead of custom function name

b6d6e00

Rename decimate internal function to decimate_impl. I would like to e…

b2e79bd

…xport the decimate name as a generic for the GCIMSDataset and GCIMSSample classes

decimate(GCIMSSample)

663a6ee

decimate(GCIMSDataset)

7a96442

decimate() generic

e9794ff

realize() early in the tutorial to introduce how operations are delay…

c066cf7

…ed and can be executed

New API up to decimation

6d619c9

document()

3d2ba5e

Fixes to S4 classes

6e1995e

- Add proc_params to GCIMSSample to store processing parameters - Fix getting dt/rt ranges with a single value (gets the closest one) - Rename files so class definition is always collated before methods

Update sample description if sample name changes

aac072d

zeehio added 8 commits August 29, 2022 19:44

build img only when dockerfile changes

854b382

imputePeakTable using a GCIMSDataset object

e6474fc

Fix plots and getters for RIC and TIS

a2c0a9d

Remove unused generic

e720364

Fix helper for sample selection

f530928

vignette reaches feature parity! :-)

f59bd21

document()

94673e6

Merge branch 'sergio/new-api' of gitlab.com:sipss/research/GCIMS into…

508d153

… sergio/new-api

zeehio added 18 commits August 31, 2022 13:20

Savitzky-Golay improvements

45d2510

document()

b769cc4

Remove useless parameter in vignette

0a8fac8

add_peaklist_rect not clean, but we can map phenotypes to color

461a7bc

Setup parallellization at setup

0d0cf99

fftw not needed

23bc2c8

Avoid false positive in BiocCheck

fbac12f

Remove some NOTEs in BiocCheck

ddd8668

move findPeaks to its own file

f3c2431

Massive rename class-method to method-class

82bb1fd

NOTE in BiocCheck

d8928bb

Remove extra sum

aa74f89

Remove redundant embed call

4c5a1cd

Another engine tried, does not provide much improvement

cb329fb

document()

598e44e

import inform warn abort so we don't have to rlang:: the calls

34e931f

rlang::warn() to warn()

21aecbb

rlang::abort() -> abort()

4c93aa9

zeehio merged commit 855edca into master Sep 6, 2022

zeehio mentioned this pull request Nov 16, 2022

Several to-dos #15

Open

27 tasks

lalo-caballero deleted the sergio/new-api branch March 31, 2023 08:14
