Here we describe all steps required to reproduce the results from our publication in as much detail as possible. It should be possible to follow along this recipe, execute all the commands and arrive at the same results. However if you have difficulties reproducing the results or find any clitches in the description don’t hesitate to open an issue.
The original analyses have been performed on a machine with the following specs (except where otherwise noted):
- Intel(R) Core(TM)2 Duo CPU E8500 @3.16GHz
- 4GB RAM
- Ubuntu 14.04.3 LTS 64bit
- hmmsearch: HMMER version 3.1b1
- muscle: v3.8.31
- Gblocks: version 0.91b
- RAxML: version 8.2.4 (raxmlHPC-PTHREADS-SSE3)
To download all Lactobacillales from ezbiocloud execute the following command in reproduce/data:
mkdir -p ezbiocloud_lactobacillales
# Search for Lactobacillales, extract project IDs, download zipped peptide fastas of the projects
curl "http://www.ezbiocloud.net/search?k=ezgenome&v=Lactobacillales" |
grep case2 | cut -f5 -d"=" | cut -f1 -d \" | sort -u | tail -n+2 |
xargs -n1 -I{} wget -O ezbiocloud_lactobacillales/{}.zip "http://www.ezbiocloud.net/mod_download_fasta_ezgenome.jsp?acc="{}"&mode=CDS_AA"
cd ezbiocloud_lactobacillales
for i in *.zip; do unzip $i; done
rm *.zip
# move files with errors into error subdir
mkdir -p error
for i in $(grep --color=none -lr error); do mv $i error; done
# replace offending characters in file names
rename "s/[,:\s)(;'\]\[=]/_/g" *
# create options file for bcgTree
for i in $(\ls *.fasta)
do
# Filter empty sequences
cat $i | SeqFilter --min-len 1 --out $i
BASE=$(basename $i _1.0_CDS_AA.fasta)
echo --proteome $BASE=$PWD/$i
done >ezbiocloud_lactobacillales.options
As this returns all the Lactobacillales genomes in ezbiocloud at the time you execute it the results will vary over time. In order to reproduce the tree from the publication you can use the ezbiocloud_lactobacillales.options file in reproduce/precomputed. But be aware, that this file contains the paths relative to that directory so either make them absolute or call bcgTree from there if you want to use the precomputed file.
To download the faa files from all bacteria in genomes via ftp do the following in reproduce/data:
mkdir -p ncbi
cd ncbi
# The download link has been invalidated in the meantime, it was originally:
# wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.faa.tar.gz
# now it is:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/all.faa.tar.gz
tar xvzf all.faa.tar.gz
rm all.faa.tar.gz
# Combine the peptide files of each genome
for i in *
do
cat $i/*.faa > $i.fa
rm -rf $i
done
rename "s/[,:\s)(;'\]\[=]/_/g" *
# Create the options file for bcgTree
for i in *.fa
do
echo "--proteome "$(basename $i .fa)"="$PWD"/"$i
done >ncbi_all.options
cd ..
You can also use the precomputed options file in the reproduce/precomputed folder. But be aware, that this file contains the paths relative to that directory so either make them absolute or call bcgTree from there if you want to use the precomputed file.
To download the 16S rRNA sequences from NCBI execute the following commands in reproduce/data:
mkdir -p ncbi_16S
cd ncbi_16S
# The download link has been invalidated in the meantime, it was originally:
# wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.frn.tar.gz
# now it is:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/all.frn.tar.gz
tar xvzf all.frn.tar.gz
rm all.frn.tar.gz
for i in *
do
cat $i/*.frn | grep 16S | head -n1 | perl -pe 's/^>//;s/ .*//' >$i.ids
cat $i/*.frn | ../../../SeqFilter/bin/SeqFilter - --ids $i.ids --out $i.16S.fa
rm $i.ids
rm -rf $i
done
rename "s/[,:\s)(;'\]\[=]/_/g" *
cd ..
This way there is exactly one 16S sequence for most of the genomes. However that of Lactobacillus_brevis_KB290_uid195560 is missing due to a lack of annotation. To get this the whole genome is downloaded and rnammer used for extraction. You need rnammer version 1.2 for that (be aware that rnammer relies on hmmsearch version 2).
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Lactobacillus_brevis_KB290_uid195560/NC_020819.fna -O Lactobacillus_brevis_KB290_uid195560.fna
rnammer -S bac -m lsu,ssu,tsu -f Lactobacillus_brevis_KB290_uid195560.rrna.fa Lactobacillus_brevis_KB290_uid195560.fna
grep 16s Lactobacillus_brevis_KB290_uid195560.rrna.fa | head -n1 | perl -pe 's/^>//;s/ .*//' >Lactobacillus_brevis_KB290_uid195560.16S.id
../../../SeqFilter Lactobacillus_brevis_KB290_uid195560.rrna.fa --ids Lactobacillus_brevis_KB290_uid195560.16S.id --out Lactobacillus_brevis_KB290_uid195560.16S.fa
perl -i -pe 's/^>/>Lactobacillus_brevis_KB290_uid195560 /' Lactobacillus_brevis_KB290_uid195560.16S.fa
rm Lactobacillus_brevis_KB290_uid195560.fna Lactobacillus_brevis_KB290_uid195560.rrna.fa Lactobacillus_brevis_KB290_uid195560.16S.id
If you have difficulties here I also provide the precomputed 16S rRNA file for Lactobacillus_brevis_KB290_uid195560. Preliminary analyses also showed that the sequence of Lactobacillus_casei_LC2W_uid162121 was not really 16S due to misannotation. So to get the real 16S sequence execute the following commands:
wget ftp://ftp.ncbi.nih.gov/genomes/archive/old_refseq/Bacteria/Lactobacillus_casei_LC2W_uid162121/NC_017473.frn
echo "ref|NC_017473|:258849-260429|16S" | SeqFilter --ids - NC_017473.frn --out Lactobacillus_casei_LC2W_uid162121.16S.fa
rm NC_017473.frn
The full list of Lactobacillales genomes from ezbiocloud contains 2225 genomes. Of those 622 from the genus Enterococcus and 1188 from the genus Streptococcus. To reduce the size of the tree all genomes from ezbiocloud except those of Enterococcus and Streptococcus have been used and only 50 random genomes of each of those genera have been added: The commands to get this set were (in reproduce/data/ezbiocloud_lactobacillales)
grep -vP "Enterococcus|Streptococcus" ezbiocloud_lactobacillales.options >ezbiocloud_lactobacillales_reduced.options
grep "Enterococcus" ezbiocloud_lactobacillales.options | shuf | head -n50 >>ezbiocloud_lactobacillales_reduced.options
grep "Streptococcus" ezbiocloud_lactobacillales.options | shuf | head -n50 >>ezbiocloud_lactobacillales_reduced.options
This results in a list of 515 Lactobacillales genomes. As shuf returns a random list you will need to use the precomputed ezbiocloud_lactobacillales_reduced.options file in order to exactly reproduce the published tree. ATTENTION: This tree has been calculated on a 80 Core Interl Xeon CPU with 512GB of RAM running Ubuntu 12.04 LTS 64 bit. To start the calculation simply run in the reproduce analysis folder:
../../bin/bcgTree.pl @../data/ezbiocloud_lactobacillales/ezbiocloud_lactobacillales_reduced.options --outdir case_study_lactobacillus
The final tree has been refined with figtree (collapsing of nodes on Genus <G> or Species <S> level). Sorry, no idea how to automate this.
To calculate the Lactobacillus tree (bcgTree) run the following in reproduce/analysis
mkdir -p evaluation1/bcgTree
cd evaluation1/bcgTree
grep Lactobacillus ../../../data/ncbi/ncbi_all.options >lactobacillus_paenibacillus.options
grep Paenibacillus ../../../data/ncbi/ncbi_all.options >>lactobacillus_paenibacillus.options
perl ../../../../bin/bcgTree.pl @lactobacillus_paenibacillus.options --out lactobacillus_paenibacillus --bootstraps 100
python ../../../../bin/plot_matrix.py lactobacillus_paenibacillus/absence_presence.csv lactobacillus_paenibacillus_ap.svg
cd ../..
The final tree is in RAxML_bipartitions_final
To calculate the corresponding 16S tree:
mkdir -p evaluation1/16S
cd evaluation1/16S
rm -f lactobacillus_paenibacillus.16S.fa
for i in $(cat ../bcgTree/lactobacillus_paenibacillus.options | grep proteome | rev | cut -f1 -d"/" | rev | perl -pe 's/\.fa$/.16S.fa/')
do
BASE=$(basename $i .16S.fa)
cat ../../../data/ncbi_16S/$i | perl -pe 's/^>/>'$BASE' /' >>lactobacillus_paenibacillus.16S.fa
done
muscle -in lactobacillus_paenibacillus.16S.fa -out lactobacillus_paenibacillus.16S.aln
perl -i -pe 's/ .*//' lactobacillus_paenibacillus.16S.aln
Gblocks lactobacillus_paenibacillus.16S.aln -t=p -b4=4 -b5=h
perl -i -pe 's/ //g;s/:/_/g' lactobacillus_paenibacillus.16S.aln-gb
raxmlHPC -f a -m GTRGAMMA -p 12345 -s lactobacillus_paenibacillus.16S.aln-gb -n lactobacillus_16S -T 4 -x 12345 -N 100
cd ../..
The final tree is in RAxML_bipartitions.lactobacillus_16S
To calculate the high-level tree (bcgTree) run the following in reproduce/analysis
mkdir -p evaluation2/bcgTree
cd evaluation2/bcgTree
rm -f high_level.options
for i in Sulfolobus Thermococcus Bifidobacterium Rhodococcus Bacteroides Flavobacterium Chlamydia Parachlamydia Lactobacillus Paenibacillus Rhizobium Wolbachia Burkholderia Ralstonia Desulfovibrio Geobacter Campylobacter Helicobacter Erwinia Pseudomonas Borrelia Sphaerochaeta Kosmotoga Thermotoga
do
grep $i ../../../data/ncbi/ncbi_all.options | head -n1 >>high_level.options
done
perl ../../../../bin/bcgTree.pl @high_level.options --out high_level --bootstraps 100
cd ../..
To calculate the corresponding 16S tree:
mkdir -p evaluation2/16S
cd evaluation2/16S
rm -f high_level.16S.fa
for i in $(cat ../bcgTree/high_level.options | grep proteome | rev | cut -f1 -d"/" | rev | perl -pe 's/\.fa$/.16S.fa/')
do
BASE=$(basename $i .16S.fa)
cat ../../../data/ncbi_16S/$i | perl -pe 's/^>/>'$BASE' /' >>high_level.16S.fa
done
muscle -in high_level.16S.fa -out high_level.16S.aln
perl -i -pe 's/ .*//' high_level.16S.aln
Gblocks high_level.16S.aln -t=p -b4=4 -b5=h
perl -i -pe 's/ //g;s/:/_/g' high_level.16S.aln-gb
raxmlHPC -f a -m GTRGAMMA -p 12345 -s high_level.16S.aln-gb -n high_level_16S -T 4 -x 12345 -N 100
cd ../..
The final tree is in RAxML_bipartitions.high_level_16S
In order to get the distribution of bootstrap values for each tree you can execute:
perl -pe 's/\)/\n/g' evaluation1/16S/RAxML_bipartitions.lactobacillus_16S | cut -f1 -s -d":" | tail -n+2 | perl -pe 's/^/16S\tevaluation1\t/' >bootstraps
perl -pe 's/\)/\n/g' evaluation2/16S/RAxML_bipartitions.high_level_16S | cut -f1 -s -d":" | tail -n+2 | perl -pe 's/^/16S\tevaluation2\t/' >>bootstraps
perl -pe 's/\)/\n/g' evaluation1/bcgTree/lactobacillus_paenibacillus/RAxML_bipartitions.final | cut -f1 -s -d":" | tail -n+2 | perl -pe 's/^/bcgTree\tevaluation1\t/' >>bootstraps
perl -pe 's/\)/\n/g' evaluation2/bcgTree/high_level/RAxML_bipartitions.final | cut -f1 -s -d":" | tail -n+2 | perl -pe 's/^/bcgTree\tevaluation2\t/' >>bootstraps
Calculate trees with random subselections of the partitions on high-level and low-level use cases. In reproduce/analysis execute the following commands:
mkdir -p multimarker/evaluation1
cd multimarker/evaluation1
cp ../../evaluation1/bcgTree/lactobacillus_paenibacillus/full_alignment.concat.* .
for j in {0..9}
do
for i in {1..108}
do
NEWDIR=$(printf "%02d" $j)_$(printf "%03d" $((109-i)))
mkdir $NEWDIR
cp full_alignment.concat.fa full_alignment.concat.partition $NEWDIR
cd $NEWDIR
cut -f4 -d" " full_alignment.concat.partition | shuf | head -n $i >exclude_file
raxmlHPC -E exclude_file -s full_alignment.concat.fa -q full_alignment.concat.partition -n EXCLUDE -m GTRGAMMA
raxmlHPC -f a -m GTRGAMMA -p 12345 -q full_alignment.concat.partition.exclude_file -s full_alignment.concat.fa.exclude_file -n partsub_$i -T 6 -x 12345 -N 1
cd -
done
done
# Create qdist table
echo -e "batch\tgene_count\tqdist\tqdist_rel" >lactobacillus_paenibacillus.qdist_by_genenumber.tsv
for j in {0..9}
do
for i in {1..108}
do
echo -n $j$'\t'$i$'\t'
FILE="$(printf "%02d" $j)_$(printf "%03d" $i)/RAxML_bestTree.partsub_"
if [ -f $FILE ]
then
qdist ../../evaluation1/bcgTree/lactobacillus_paenibacillus/RAxML_bestTree.final $FILE | tail -n1 | cut -f7,8
else
echo -e "NA\tNA"
fi
done
done >>lactobacillus_paenibacillus.qdist_by_genenumber.tsv
echo -ne "16S\t1\t" >>lactobacillus_paenibacillus.qdist_by_genenumber.tsv
qdist ../../evaluation1/bcgTree/lactobacillus_paenibacillus/RAxML_bestTree.final ../../evaluation1/16S/RAxML_bestTree.lactobacillus_16S | tail -n1 | cut -f7,8 >>lactobacillus_paenibacillus.qdist_by_genenumber.tsv
cd ../..
Repeat this step for the evaluation2 set:
mkdir -p multimarker/evaluation2
cd multimarker/evaluation2
cp ../../evaluation2/bcgTree/high_level/full_alignment.concat.* .
for j in {0..9}
do
for i in {1..108}
do
NEWDIR=$(printf "%02d" $j)_$(printf "%03d" $((109-i)))
mkdir $NEWDIR
cp full_alignment.concat.fa full_alignment.concat.partition $NEWDIR
cd $NEWDIR
cut -f4 -d" " full_alignment.concat.partition | shuf | head -n $i >exclude_file
raxmlHPC -E exclude_file -s full_alignment.concat.fa -q full_alignment.concat.partition -n EXCLUDE -m GTRGAMMA
raxmlHPC -f a -m GTRGAMMA -p 12345 -q full_alignment.concat.partition.exclude_file -s full_alignment.concat.fa.exclude_file -n partsub_$i -T 6 -x 12345 -N 1
cd -
done
done
# Create qdist table
echo -e "batch\tgene_count\tqdist\tqdist_rel" >high_level.qdist_by_genenumber.tsv
for j in {0..9}
do
for i in {1..108}
do
echo -n $j$'\t'$i$'\t'
FILE="$(printf "%02d" $j)_$(printf "%03d" $i)/RAxML_bestTree.partsub_"
if [ -f $FILE ]
then
qdist ../../evaluation2/bcgTree/high_level/RAxML_bestTree.final $FILE | tail -n1 | cut -f7,8
else
echo -e "NA\tNA"
fi
done
done >>high_level.qdist_by_genenumber.tsv
echo -ne "16S\t1\t" >>high_level.qdist_by_genenumber.tsv
qdist ../../evaluation2/bcgTree/high_level/RAxML_bestTree.final ../../evaluation2/16S/RAxML_bestTree.high_level_16S | tail -n1 | cut -f7,8 >>high_level.qdist_by_genenumber.tsv
cd ../..
To get an impression on how the runtime scales with the number of genomes the following analysis has been performed. To reproduce go to reproduce/analysis:
mkdir -p performance
cd performance
for i in $(seq 5 1 15)
do
for j in $(seq 1 1 5)
do
shuf ../../data/ncbi/ncbi_all.options | head -n ${i} >${i}_${j}.options
../../../bin/bcgTree.pl @${i}_${j}.options --out ${i}_${j}
done
done
for i in $(seq 5 5 50)
do
shuf ../../data/ncbi/ncbi_all.options | head -n ${i} >${i}_0.options
../../../bin/bcgTree.pl @${i}_0.options --out ${i}_0
done
#Gather timestamps:
echo "genomes"$'\t'"rep"$'\t'"start"$'\t'"startRaxml"$'\t'"end" >timestamps.tsv
for i in *_*.options
do
BASE=$(basename $i .options)
echo -n $BASE$'\t' | perl -pe 's/_/\t/'
LOG="$BASE/bcgTree.log"
cat <(head -n1 $LOG) <(grep "Starting: raxml" $LOG) <(tail -n1 $LOG) | cut -f1,2 -d" " | perl -pe 's/\n/\t/' | perl -pe 's/\t$/\n/'
done >>timestamps.tsv
cd ..
Analysis of timestamps in R (go to folder reproduce/analysis/performance):
data <- read.table("timestamps.tsv", header=T, sep="\t")
#The date/time string can be converted to an appropriate object with strptime
#strptime(as.character(data$end), "[%m-%d %H:%M:%S]")
time_until_finished = difftime(strptime(as.character(data$end), "[%m-%d %H:%M:%S]"), strptime(as.character(data$start), "[%m-%d %H:%M:%S]"), unit="sec")
time_until_raxml = difftime(strptime(as.character(data$startRaxml), "[%m-%d %H:%M:%S]"), strptime(as.character(data$start), "[%m-%d %H:%M:%S]"), unit="sec")
cor(as.numeric(data$genomes), as.numeric(time_until_raxml))
cor(as.numeric(data$genomes), as.numeric(time_until_finished))
cor.test(as.numeric(data$genomes), as.numeric(time_until_finished))
# quite surprisingly the linear(!) correlation between number of genomes and runtime is very strong
pdf("runtime_vs_genome_number.pdf")
plot(data$genomes,
time_until_finished,
xlab="Number of genomes",
ylab="Time in s",
main="Runtime of bcgTree depending on number of genomes",
ylim=c(1,max(time_until_finished))
)
points(data$genomes,
time_until_raxml,
col="red"
)
# add regression line
abline(lm(time_until_finished ~ data$genomes))
legend("topleft", legend=c("runtime before raxML", "total runtime", "regression"), pch=c(1,1,NA), lty=c(0,0,1), col=c("red","black","black"))
dev.off()