SpeCollate is the first deep learning-based peptide-spectrum similarity network. It searches a peptide database by generating embeddings for both mass spectra and database peptides. A k-nearest neighbor search is performed on a GPU in the embedding space to find the k (usually k=5) nearest peptides for each spectrum.
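The search step can be illustrated with a minimal brute-force sketch. It assumes unit-length embeddings compared by squared Euclidean distance; the function names and the toy 2-D vectors are hypothetical, and the actual tool runs the search on a GPU over 256-dimensional embeddings.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def knn_search(spectrum_emb, peptide_embs, k=5):
    """Return the indices of the k peptide embeddings closest to the spectrum.

    Brute force over the whole database, using squared Euclidean distance
    (an assumption made for this sketch).
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scored = ((sq_dist(spectrum_emb, p), i) for i, p in enumerate(peptide_embs))
    return [i for _, i in heapq.nsmallest(k, scored)]


# Toy 2-D embeddings (hypothetical values for illustration only).
peptides = [normalize(v) for v in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0])]
query = normalize([1.0, 0.05])
nearest = knn_search(query, peptides, k=2)
```

Here `nearest` holds the database indices of the two peptides whose embeddings lie closest to the query spectrum, ordered from nearest to farthest.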
The SpeCollate network consists of two branches: the Spectrum Sub-Network (SSN) and the Peptide Sub-Network (PSN). SSN processes spectra and generates spectral embeddings, while PSN processes peptide sequences and generates peptide embeddings. Both types of embeddings lie in a real space of dimension 256. The network architecture is shown in Fig 1 below.
Fig 1: SpeCollate network architecture. Spectra are encoded in dense arrays of length 80,000, where each index represents an m/z bin of width 0.1 Da. Hence, spectra with a maximum m/z of 8,000 can be encoded using this technique. Encoded spectra are passed through SSN, which consists of two fully connected layers of dimensions 80,000 x 1,024 and 1,024 x 256. The output from the second layer is normalized to unit length. Similarly, peptide sequences are integer encoded, with each amino acid and modification character assigned a unique integer value. These encoded peptide vectors are passed through the embedding layer, which learns a 256-dimensional embedding for each amino acid. The output from the embedding layer is then passed through PSN, which consists of two BiLSTMs and two fully connected layers of dimensions 2,048 x 1,024 and 1,024 x 256. The output from the last layer is normalized to unit length.
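The two input encodings described in the caption can be sketched as follows. This is a minimal illustration, not the project's preprocessing code: summing intensities that fall into the same bin, the 20-letter vocabulary, the padding value 0, and the function names are all assumptions. Modification characters would need their own entries in the vocabulary.

```python
NUM_BINS = 80_000  # dense array length, from the caption
BIN_WIDTH = 0.1    # Da covered by each bin, from the caption


def encode_spectrum(peaks):
    """Bin (m/z, intensity) pairs into a dense array of length NUM_BINS."""
    dense = [0.0] * NUM_BINS
    for mz, intensity in peaks:
        idx = int(mz / BIN_WIDTH)
        if 0 <= idx < NUM_BINS:
            # Assumption: intensities landing in the same bin are summed.
            dense[idx] += intensity
        # Peaks at or above 8,000 m/z fall outside the array and are dropped.
    return dense


AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical vocabulary: 0 is reserved for padding, residues start at 1.
VOCAB = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode_peptide(sequence, max_len=None):
    """Map each residue to its integer id, optionally padding or truncating."""
    ids = [VOCAB[aa] for aa in sequence]
    if max_len is not None:
        ids = ids[:max_len] + [0] * (max_len - len(ids))
    return ids
```

For example, `encode_peptide("ACD", max_len=5)` yields a fixed-length integer vector that an embedding layer can consume.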
To train SpeCollate, we design a custom loss function called SNAP-Loss, which is inspired by the Triplet Loss function. In SNAP-Loss, the loss is calculated over sextuplets of datapoints, where each sextuplet consists of an anchor spectrum, a positive peptide, two negative spectra, and two negative peptides.
SNAP-Loss extends Triplet Loss to multi-modal data, in our case numerical spectra and peptide sequences. For this purpose, we consider all possible negatives (qj, pk, ql, pm) for a given positive pair (qi, pi) and average the total loss. The four possible negatives are explained below:
- qj: The negative spectrum for qi.
- pk: The negative peptide for qi.
- ql: The negative spectrum for pi.
- pm: The negative peptide for pi.
To calculate the loss value, we first define a few variables that are precomputed in the distance matrices above as follows:
Then the SNAP-Loss is calculated for a batch of size b as follows:
The training process is visualized in the figure below:
Once the sextuplets are generated, the loss is calculated using the SNAP-Loss function and the network parameters are updated by backpropagation.
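As a rough illustration of a margin-based loss over one sextuplet, the sketch below assumes that each of the four negatives contributes a standard triplet-style hinge term against the positive pair and that the four terms are averaged. The function name, the use of squared Euclidean distance, and the equal weighting are assumptions; the exact SNAP-Loss formulation in the paper may differ.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sextuplet_margin_loss(q_i, p_i, q_j, p_k, q_l, p_m, margin=0.2):
    """Average of four triplet-style hinge terms for one sextuplet.

    q_i, p_i: anchor spectrum and its positive peptide.
    q_j, p_k: negative spectrum and negative peptide for q_i.
    q_l, p_m: negative spectrum and negative peptide for p_i.
    """
    d_pos = sq_dist(q_i, p_i)
    neg_dists = [
        sq_dist(q_i, q_j),
        sq_dist(q_i, p_k),
        sq_dist(p_i, q_l),
        sq_dist(p_i, p_m),
    ]
    # Each term is zero once the negative is farther than the positive by the margin.
    terms = [max(0.0, d_pos - d_neg + margin) for d_neg in neg_dists]
    return sum(terms) / len(terms)
```

With well-separated negatives every hinge term vanishes and the loss is zero; negatives that sit as close as the positive each contribute the full margin.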
The tuned hyperparameters, along with the values tested for each, are given in Table 1 below:
Hyperparameter | Value | Values Tested |
---|---|---|
Learning Rate | 0.0001 | 0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01 |
Weight Decay | 0.0001 | 1e-6, 1e-5, 1e-4, 1e-3 |
Margin | 0.2 | 0.1, 0.2, 0.3, 0.4 |
Embedding Dim | 256 | 32, 64, 128, 256, 512, 1028, 2048 |
FC Layers | 2 | 1, 2, 3 |
BiLSTM Layers | 2 | 1, 2, 3, 4 |
- A computer with Ubuntu 16.04 (or later) or CentOS 8.1 (or later).
- A CUDA-enabled GPU with at least 12 GB of memory.
- The OpenMS tool for creating a custom peptide database. (Optional)
- Crux for FDR analysis using its percolator option.
Step by Step Guide to Install Anaconda
- Fork the repository to your own account.
- Clone your fork to your machine.
cd SpeCollate
conda env create --file specollate_env.yml
(Installing the dependencies takes a few minutes.)
conda activate specollate
Our end-to-end pipeline uses the SpeCollate model.
- Use the mgf files for spectra in `sample_data/mgfs`, or use your own spectra files in mgf format.
- Use the peptidome subset in `sample_data/peptides`. You can also provide your own peptide database file created using the Digestor tool provided by OpenMS.
- Download the weights for the SpeCollate model here under the Assets section.
- Set the following parameters in the [search] section of the `config.ini` file:
  - `model_name`: Absolute path to the SpeCollate model (the file called specollate_model_weights.pt that you downloaded from here under the Assets section).
  - `mgf_dir`: Absolute path to the directory containing the mgf files (spectra) to be searched.
  - `prep_dir`: Absolute path to the directory where preprocessed mgf files will be saved.
  - `pep_dir`: Absolute path to the directory containing the peptide database.
  - `out_pin_dir`: Absolute path to a directory where percolator pin files will be saved.
  - Set the database search parameters.
- Run `python run_search.py`. It preprocesses the spectra, generates the embeddings for spectra and peptides, performs the search, and writes the output files (e.g., target.pin and decoy.pin).
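Before launching the search, the [search] section can be sanity-checked. The sketch below is a hypothetical helper, not part of SpeCollate: it assumes the five path parameters listed above are written as plain, unquoted values, and it only verifies that each one is set and is an absolute path.

```python
import configparser
import os

# Path parameters of the [search] section named in this README.
REQUIRED_KEYS = ["model_name", "mgf_dir", "prep_dir", "pep_dir", "out_pin_dir"]


def check_search_config(config_text):
    """Return a list of problems found in the [search] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if not parser.has_section("search"):
        return ["missing [search] section"]
    problems = []
    for key in REQUIRED_KEYS:
        value = parser.get("search", key, fallback="").strip()
        if not value:
            problems.append(f"{key} is not set")
        elif not os.path.isabs(value):
            problems.append(f"{key} is not an absolute path: {value}")
    return problems
```

A typical use would be `check_search_config(open("config.ini").read())`, run from the repository root; an empty list means all five paths are present and absolute.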
The database search outputs two files: target.pin and decoy.pin. target.pin contains the information about target peptide-spectrum matches, and decoy.pin contains the information about decoy peptide-spectrum matches. Both .pin files contain the same set of features for each peptide-spectrum match.
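To see which feature columns a generated .pin file contains, its header row can be read directly. This is a minimal sketch that assumes the file is tab-delimited with a single header line, as in the Percolator input format; `pin_feature_names` is a hypothetical helper name.

```python
import csv


def pin_feature_names(pin_path):
    """Return the column names from the header row of a tab-delimited .pin file."""
    with open(pin_path, newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"))
    return header
```

Calling `pin_feature_names("target.pin")` inside the output directory lists the columns that percolator will receive.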
Once the search is complete and the .pin files are generated, you can analyze them using the crux percolator tool:
cd <out_pin_dir>
crux percolator target.pin decoy.pin --list-of-files T --overwrite T
Execution time depends on many factors, including your machine, the size of the database, the size of the spectra, and the search-space reduction that is achieved. As an example, for a 3.9 GB database, our proposed method completes the search in 20.36 hours.
You can retrain the SpeCollate model if you wish.
- Prepare the spectra data (mgf format).
- Open the `config.ini` file in your favorite text editor and set the following parameters in the input section:
  - `mgf_dir`: Absolute path of the mgf files.
  - `prep_dir`: Absolute path to the directory where preprocessed mgf files will be saved.
  - Other parameters in the [ml] section: You can adjust different hyperparameters in the [ml] section, e.g., learning_rate, dropout, etc.
- Set up a wandb account and create a project. Then log in using `wandb login`. wandb stores the training logs.
- Run the training script with `python run_train.py`. The model weights are saved in an output directory.
If you use our tool, please cite our work:
[1] Tariq, Muhammad Usman, and Fahad Saeed. "SpeCollate: Deep cross-modal similarity network for mass spectrometry data based peptide deductions." PLoS ONE 16.10 (2021): e0259349.
For questions, suggestions, or technical problems, contact:
[email protected]