AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data.
AltaiR is a flexible C toolkit for alignment-free and temporal analysis of multi-FASTA data, designed for large-scale data such as extensive collections of genomes or proteomes. It is well suited to scenarios involving many sequences from epidemic and pandemic events. AltaiR is implemented in C, uses multi-threading to increase computational speed, is flexible enough for multiple applications, and has no external dependencies. The tool accepts any sequence(s) in (multi-)FASTA format.
The AltaiR toolkit contains one main menu (command: AltaiR) with six sub-menus for the features it provides:
- average: applies a moving-average filter to a column of floating-point values in a CSV file (the column to use is a parameter);
- filter: filters FASTA reads by characteristics: alphabet, completeness, length, CG quantity, multiple string patterns and pattern absence;
- frequency: computes the alphabet frequencies for each FASTA read (it enables alphabet filtering);
- nc: computes the Normalized Compression (NC) for all FASTA reads according to a compression level;
- ncd: computes the Normalized Compression Distance (NCD) for all FASTA reads according to a reference;
- raw: computes Relative Absent Words (RAWs) with CG quantity estimation for all RAWs.
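As a rough illustration of the idea behind the average sub-program, the sketch below smooths one column of a CSV file with a simple moving average. It is not AltaiR's implementation: the function names, the comma delimiter, and the trailing window that shrinks at the start of the series are all assumptions made for this example.

```python
import csv
import io


def read_column(csv_text, column):
    """Return one column of a CSV string as floats (comma delimiter assumed)."""
    return [float(row[column]) for row in csv.reader(io.StringIO(csv_text)) if row]


def moving_average(values, window):
    """Trailing moving average; the first points use a shorter window."""
    if window < 1:
        raise ValueError("window must be >= 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


if __name__ == "__main__":
    data = "0,1.0\n1,2.0\n2,3.0\n3,4.0\n4,5.0\n"
    print(moving_average(read_column(data, 1), 2))
```

A larger window gives a smoother profile at the cost of blurring short-lived changes.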
First, install Miniconda if you haven't already. Then, to create a new Conda environment named altair and install altair-mf using the Conda Forge and Bioconda channels, run the following command:
mamba create -n altair -c conda-forge -c bioconda altair-mf
To simply install altair-mf in an existing environment:
conda install -y -c bioconda altair-mf
Otherwise, CMake is needed for manual installation. You can download CMake directly from http://www.cmake.org/cmake/resources/software.html or use an appropriate package manager. Below are the instructions to install, compile, and run AltaiR:
sudo apt-get install cmake git
git clone https://github.com/cobilab/altair.git
cd altair/src/
cmake .
make
For certain scripts, the Gto toolkit is required, installable via Conda:
conda install -c cobilab gto --yes
Or manually:
git clone https://github.com/cobilab/gto.git
cd gto/src/
make
export PATH="$HOME/gto/bin:$PATH"
To see the possible options type
AltaiR
or
AltaiR -h
If you are interested in viewing the options of each sub-program, type
AltaiR average -h
AltaiR filter -h
AltaiR frequency -h
AltaiR nc -h
AltaiR ncd -h
AltaiR raw -h
Assuming AltaiR is compiled under the src/ folder and you are in the pipeline/ folder, copy the binary into the current folder:
cp ../src/AltaiR .
To filter sequences, use the following commands:
python3 Histogram.py
bash Filter.sh 29885 29921
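The two numbers passed to Filter.sh look like a minimum and a maximum sequence length, but that reading is an assumption; the script itself is the authority. Under that assumption, a length filter over multi-FASTA records could be sketched as below. The function names are hypothetical and this is not the code AltaiR runs.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from multi-FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len, max_len):
    """Keep records whose sequence length lies in [min_len, max_len]."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]


if __name__ == "__main__":
    fasta = ">a\nACGT\nAC\n>b\nAC\n>c\nACGTACGTAC\n"
    print(filter_by_length(parse_fasta(fasta), 4, 8))
```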
To simulate and measure similarity profiles:
bash Simulation.sh
bash Similarity.sh ORIGINAL.fa
bash SimProfile.sh sim-data.csv 2 0 1.2
mv NCDProfilesim-data.csv.pdf NCD_P1.pdf
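For readers unfamiliar with the measure behind these similarity profiles, the sketch below computes the textbook Normalized Compression Distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is a compressed size. It uses zlib as a stand-in compressor, which is an assumption made only to keep the example self-contained; AltaiR's own compressor and its exact NCD computation are not reproduced here.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes after zlib compression (stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two sequences."""
    bx, by = x.encode(), y.encode()
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    reference = "ACGTTGCAAGCTTAGC" * 50
    print(ncd(reference, reference))    # near 0 for identical input
    print(ncd(reference, "TTTTGGGGCCCCAAAA" * 50))
```

Values near 0 indicate that one sequence is well described by the other; values near 1 indicate little shared information. A general-purpose compressor such as zlib is a weak model for DNA, so absolute values will differ from those AltaiR reports.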
Use the tree.py script to construct a phylogenetic tree from NCD values:
python3 tree.py sim-data.csv -N 50
Run the following script to generate complexity profiles:
bash ComplexitySars.sh
python3 CompProfileSars.py comp-data.csv sorted_output.fa 0.961 0.9617
mv NCProfilecomp-data.csv.pdf NC.pdf
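The quantity plotted in a complexity profile is the Normalized Compression, commonly defined as NC(x) = C(x) / (|x| log2 |A|), where C(x) is the compressed size in bits, |x| the sequence length, and |A| the alphabet size. The sketch below computes it with zlib as a stand-in compressor; that choice and the function name are assumptions for illustration, and AltaiR's compression levels are not modeled.

```python
import math
import zlib


def normalized_compression(seq: str, alphabet_size: int = 4) -> float:
    """Compressed bits divided by the bits of a uniform encoding."""
    if not seq:
        raise ValueError("sequence must not be empty")
    compressed_bits = 8 * len(zlib.compress(seq.encode(), 9))
    return compressed_bits / (len(seq) * math.log2(alphabet_size))


if __name__ == "__main__":
    # A highly repetitive sequence compresses well, so its NC is low.
    print(normalized_compression("ACGT" * 500))
```

Lower values indicate more redundancy. With a general-purpose compressor the value can exceed 1 for sequences that look random, because of format overhead.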
Generate frequency profiles using the following commands:
bash FrequencySars.sh
python3 combine_freq_and_date.py
mv base_frequencies_plot.pdf Freq.pdf
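The per-sequence values behind such a frequency profile can be pictured with the short sketch below, which counts the relative frequency of each alphabet symbol in one sequence and ignores symbols outside the alphabet. The function name and the default DNA alphabet are assumptions; this is not the code AltaiR runs.

```python
from collections import Counter


def symbol_frequencies(seq, alphabet="ACGT"):
    """Relative frequency of each alphabet symbol; other symbols are skipped."""
    counts = Counter(s for s in seq.upper() if s in alphabet)
    total = sum(counts.values())
    return {s: (counts[s] / total if total else 0.0) for s in alphabet}


if __name__ == "__main__":
    # 'N' is outside the alphabet and does not contribute to the totals.
    print(symbol_frequencies("AACGTN"))
```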
To calculate RAW profiles:
bash RawSars.sh
python3 RawSarsProfile.py sorted_output.fa
mv relativeSingularityProfile.pdf RAWProfiles.pdf
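To give an intuition for what a RAW is, the sketch below takes a relative absent word to be a word of length k that occurs in a target sequence but not in a reference, and reports the CG fraction of each such word. That definition, the fixed k, and the function names are assumptions for illustration; minimality constraints and AltaiR's actual procedure are not reproduced.

```python
def kmers(seq, k):
    """Set of all words of length k occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def relative_absent_words(target, reference, k):
    """Words of length k present in target but absent from reference."""
    return sorted(kmers(target, k) - kmers(reference, k))


def cg_fraction(word):
    """Fraction of symbols in word that are C or G."""
    return sum(c in "CG" for c in word) / len(word)


if __name__ == "__main__":
    for word in relative_absent_words("ACGTAC", "ACGTT", 3):
        print(word, round(cg_fraction(word), 3))
```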
If you use AltaiR in your research, please cite: Silva, Jorge M., Armando J. Pinho, and Diogo Pratas. "AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data." GigaScience 13 (2024): giae086.
For any issues, please report at AltaiR Issues.
AltaiR is licensed under GPL v3. For more information, visit GPL v3 License.