Skip this step if using a Linux server. Refer to the WSL setup guide.
- Open the Windows command prompt: press the Win key, search for ‘command prompt’, and select ‘Run as administrator’. Click ‘Yes’ when Windows asks whether to allow the app to make changes.
- Type in the command prompt: wsl.exe --install
- Wait for the installation to finish, and type ‘Yes’ if prompted.
- Restart your computer.
- Click on Ubuntu.exe to register Ubuntu.
- To work as root, run this each time you start Ubuntu: sudo -s
- If any issues arise, unregister Ubuntu in PowerShell with: wsl --unregister Ubuntu (this deletes the data stored in that distribution), then click Ubuntu.exe to register Ubuntu again.
Open Ubuntu.exe or a terminal on your Linux server. Refer to the Conda manual for details.
cd ~ # or your working directory
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh -b -u -p ./miniconda3
./miniconda3/bin/conda init bash
exit
After exiting, reopen Ubuntu.exe; the prompt should now start with (base).
source ~/.bashrc
conda config --set auto_activate_base false
source ~/.bashrc
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
# To check
nano ~/.condarc
which conda
cd /your/work/folder/path # Replace with your actual path
mamba create -n plantismash python=2.7.15
conda activate plantismash
git clone https://github.com/plantismash/plantismash.git
cd plantismash
python install_dependencies.py # Takes about 10 minutes
python download_databases.py # Only need PFAM for fullhmmer
# For usage instructions of plantismash, use the command:
python run_antismash.py -h
mkdir Arabidopsis_thaliana
cd Arabidopsis_thaliana
datasets download genome accession GCF_000001735.4 --include gbff
unzip ncbi_dataset.zip
python ../run_antismash.py --clusterblast --knownclusterblast --verbose --debug --limit -1 --taxon plants --outputfolder result/ ncbi_dataset/data/GCF_000001735.4/genomic.gbff
# --clusterblast --knownclusterblast are optional
python run_antismash.py --verbose --debug --limit -1 --taxon plants --outputfolder result/ --use_phase --gff3 path/to/gff3/file path/to/fasta/file
# If this fails, check the error message. The sequence names in the GFF3 file may differ from those in the FASTA file, which causes an error.
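A quick way to check for such a naming mismatch before running plantiSMASH is to compare the sequence IDs in both files. The following is a minimal standalone Python 3 sketch, not part of plantiSMASH; the function names are hypothetical, and it assumes the FASTA ID is the first word after `>` and the GFF3 sequence ID is the first tab-separated column.

```python
def fasta_ids(lines):
    """Return the set of sequence IDs (first word after '>') in FASTA lines."""
    return {line[1:].split()[0]
            for line in lines
            if line.startswith(">") and line[1:].strip()}


def gff3_seqids(lines):
    """Return the set of sequence IDs (column 1) used by GFF3 feature lines."""
    ids = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        ids.add(line.split("\t")[0])
    return ids


def compare_ids(fasta_lines, gff3_lines):
    """Return (IDs only in the GFF3, IDs only in the FASTA), both sorted."""
    fa, gff = fasta_ids(fasta_lines), gff3_seqids(gff3_lines)
    return sorted(gff - fa), sorted(fa - gff)


# Made-up example data: "chr1" and "Chr1" differ only in case.
example_fasta = [">Chr1 example", "ACGT", ">Chr2", "GGCC"]
example_gff3 = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t1\t4\t.\t+\t.\tID=g1",
    "Chr2\tsrc\tgene\t1\t4\t.\t+\t.\tID=g2",
]
only_gff, only_fasta = compare_ids(example_fasta, example_gff3)
print("Only in GFF3:", only_gff)
print("Only in FASTA:", only_fasta)
```

For real files, pass the opened file objects instead of the example lists. Both result lists should be empty when the names agree.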
It is recommended to change the output folder, or delete it, every time you run plantiSMASH. When the same output folder is reused, files from the previous run may be partially preserved, because only files with the same names are overwritten.
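One simple way to follow this advice is to generate a new folder name for each run. This is a minimal sketch under the assumption that a timestamped name suits your workflow; `fresh_output_folder` is a hypothetical helper, not a plantiSMASH function.

```python
import os
import time


def fresh_output_folder(base="result", parent="."):
    """Return a path that does not exist yet, e.g. ./result_20240101-120000."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    candidate = os.path.join(parent, "%s_%s" % (base, stamp))
    suffix = 1
    # If a folder with this timestamp already exists, append a counter.
    while os.path.exists(candidate):
        candidate = os.path.join(parent, "%s_%s_%d" % (base, stamp, suffix))
        suffix += 1
    return candidate


print(fresh_output_folder())
```

The printed path can then be passed to --outputfolder.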
This module is in antismash/generic_modules/hmm_detection
Domains are detected with HMMs by running hmmscan. The HMM files, cluster_rules.txt, hmmdetails.txt, and filterhmmdetails.txt for plants are in the plant folder. hmmdetails.txt controls which HMMs are used (4th column) and the bitscore cutoff (3rd column) applied to the hmmscan results. The bitscore cutoff is usually -1, which means no filtering.
The results are recorded in the output .gbk files. To show only the HMM with the highest bitscore when several HMMs match the same range of a protein sequence, list those HMMs together in filterhmmdetails.txt. For example, if the same domain range matches both UDPGT and UDPGT_2, the output shows only the one with the higher bitscore.
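The two filtering steps described above can be sketched as follows. This is an illustrative standalone Python 3 example, not the plantiSMASH implementation. It assumes that all hits belong to one protein, that each hit is a tuple `(hmm_name, start, end, bitscore)`, that the cutoffs have already been read from hmmdetails.txt into a dictionary, and that each entry of filterhmmdetails.txt has been read into a set of HMM names. All function names are hypothetical.

```python
def apply_cutoffs(hits, cutoffs):
    """Keep hits whose bitscore reaches the cutoff of their HMM (-1 if unknown)."""
    return [h for h in hits if h[3] >= cutoffs.get(h[0], -1)]


def overlaps(a, b):
    """True if two hits cover overlapping ranges of the protein."""
    return a[1] <= b[2] and b[1] <= a[2]


def filter_overlapping(hits, filter_groups):
    """Within each filter group, keep only the best-scoring overlapping hit."""
    kept = []
    # Visit hits from highest to lowest bitscore so the best one is kept first.
    for hit in sorted(hits, key=lambda h: h[3], reverse=True):
        group = next((g for g in filter_groups if hit[0] in g), None)
        beaten = group is not None and any(
            k[0] in group and overlaps(k, hit) for k in kept)
        if not beaten:
            kept.append(hit)
    return sorted(kept, key=lambda h: h[1])


# Made-up hits on one protein.
hits = [("UDPGT", 10, 400, 250.0),
        ("UDPGT_2", 20, 390, 120.0),
        ("p450", 450, 900, 300.0)]
cutoffs = {"UDPGT": -1, "UDPGT_2": -1, "p450": -1}
groups = [{"UDPGT", "UDPGT_2"}]

result = filter_overlapping(apply_cutoffs(hits, cutoffs), groups)
print([h[0] for h in result])
```

In this example UDPGT_2 is dropped because it overlaps the higher-scoring UDPGT hit and both names are in the same filter group.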
Note: a separate module, enabled with --full-hmmer, uses Pfam-A.hmm to identify domains of any kind. Its results are only recorded in the output .gbk files and are not used by other modules.
This involves the essential and Cutoff of a rule in cluster_rules.txt.
A rule is usually formed as follows:
| Name | Rule | Cutoff (in kb) | Extension (in kb) |
| --- | --- | --- | --- |
| Product type | minimum(3,[required],[core_list]) | 5 | 1 |
The core_list is a list of HMM names of the domains that determine the product type, such as Chal_sti_synt_C/Chal_sti_synt_N for the rule polyketide. Here the two HMM names are joined by /, meaning a saved cluster must contain at least one of them. If they are joined by , instead, a saved cluster must contain both domains.
In required, the HMM names are joined by , and a saved cluster contains at least one of them. required is almost the full list of HMMs recorded in hmmdetails.txt, because clustering starts from a gene with a match to an HMM in the list. The HMM matches of the neighboring genes are then checked against required. If a neighboring gene has a match that belongs to required, that gene is added and its own neighbors are checked in turn, and so on until no new gene is added to the cluster.
The span of the left and right “neighborhood” of a gene on the contig or chromosome is calculated dynamically by the function get_dynamic_cutoff_multiplier.
Put simply: left span = right span = Cutoff * (the span of the nearest ten genes) / 10
In the ideal situation with Cutoff = 5, the neighboring genes are the nearest ten genes. But if one of the nearest genes is far away from the rest, it may fall outside the ‘neighborhood’. You can increase the cutoff to consider more neighboring genes. With Cutoff = 10, the neighborhood must include the nearest ten genes.
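The simplified formula can be tried out with a small standalone Python 3 sketch. It only illustrates the formula as stated here and is not the code of get_dynamic_cutoff_multiplier. It assumes genes are given as `(start, end)` pairs, that "nearest" is measured between gene midpoints, and that the "span of the nearest ten genes" runs from the leftmost start to the rightmost end of those genes. The function names are hypothetical.

```python
def neighborhood_span(seed_index, genes, cutoff, n_nearest=10):
    """Span (in bp) searched on each side of the seed gene."""
    seed_start, seed_end = genes[seed_index]
    seed_mid = (seed_start + seed_end) / 2.0
    others = [g for i, g in enumerate(genes) if i != seed_index]
    nearest = sorted(
        others, key=lambda g: abs((g[0] + g[1]) / 2.0 - seed_mid))[:n_nearest]
    if not nearest:
        return 0.0
    span = max(end for _, end in nearest) - min(start for start, _ in nearest)
    return cutoff * span / float(n_nearest)


def neighbors(seed_index, genes, cutoff):
    """Indices of genes that overlap the window around the seed gene."""
    span = neighborhood_span(seed_index, genes, cutoff)
    lo = genes[seed_index][0] - span
    hi = genes[seed_index][1] + span
    return [i for i, (start, end) in enumerate(genes)
            if i != seed_index and start <= hi and end >= lo]


# Ideal case: eleven evenly spaced 1 kb genes, seed in the middle.
even = [(s, s + 1000) for s in range(0, 22000, 2000)]
print(len(neighbors(5, even, 5)))

# One of the ten nearest genes lies far away (index 10).
far = [(s, s + 1000) for s in range(0, 20000, 2000)] + [(100000, 101000)]
print(10 in neighbors(5, far, 5), 10 in neighbors(5, far, 10))
```

In the evenly spaced case all ten other genes are neighbors at Cutoff = 5. In the second case the distant gene is only reached once the cutoff is raised to 10.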
The domain composition of the cluster includes at least two different domains (--min-domain-number 2 is the default) and satisfies core_list.
The cluster contains at least three genes with matches to required HMMs (the number after minimum( sets this), and these genes share less than 50% similarity (--cdh-cutoff 0.5 is the default). Similarity is calculated with CD-HIT. If CD-HIT runs into a memory error, raise --cdh-memory (the default is 2000).
The saved cluster also includes the genes within the Extension on both ends of the biosynthetic genes. The cluster type is the name of the rule that matched.
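The conditions above can be summarized in a short standalone Python 3 sketch. It is an illustration of the description in this section, not the rule parser in plantiSMASH. It assumes each gene is represented by the set of HMM names it matches, counts only distinct required domains towards --min-domain-number, and leaves out the CD-HIT similarity filter and the Extension step. The function names are hypothetical.

```python
def core_satisfied(core_list, domains):
    """core_list: names joined by ',' (all needed) and '/' (any one needed)."""
    groups = [g for g in core_list.split(",") if g]
    return all(any(alt in domains for alt in g.split("/")) for g in groups)


def cluster_passes(gene_domains, required, core_list,
                   minimum=3, min_domain_number=2):
    """Check the gene count, domain diversity and core_list conditions."""
    required = set(required)
    matched_per_gene = [set(d) & required for d in gene_domains]
    genes_with_match = sum(1 for m in matched_per_gene if m)
    distinct = set().union(*matched_per_gene)
    return (genes_with_match >= minimum
            and len(distinct) >= min_domain_number
            and core_satisfied(core_list, distinct))


required = ["Chal_sti_synt_C", "Chal_sti_synt_N", "p450", "Methyltransf_2"]
cluster = [{"Chal_sti_synt_C"}, {"p450"}, {"Methyltransf_2"}]

# '/' means either core domain is enough; ',' means both are needed.
print(cluster_passes(cluster, required, "Chal_sti_synt_C/Chal_sti_synt_N"))
print(cluster_passes(cluster, required, "Chal_sti_synt_C,Chal_sti_synt_N"))
```

The first call passes because one of the two core domains is present; the second fails because Chal_sti_synt_N is missing.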
If the HMMs you need are not included yet, add the HMM files to the plant folder and update hmmdetails.txt.
required can include just a few HMM names (even only those in core_list) if you want no other genes between the genes with required matches.
If several rules correspond to the same cluster type, link them with or. See the rule phenolamide in cluster_rules.txt for an example.
In cluster_rules.txt, the cyclopeptide rule saves every cluster that contains a gene with a BURP HMM match. You can follow this format to make a similar rule.
There are other ways to form rules; check __init__.py for them.
Keep the first line. Other rules can be deleted. Do not leave blank lines.
Such as --cdh-cutoff, --min-domain-number, --cdh-memory
Give a violet background to alkaloid clusters on the overview page by editing antismash/output_modules/html/css/style.css:
.alkaloid {
  background-color: violet;
}
.alkaloid a {
  color: black;
}
Set the cluster legend for genes with matches to pyridoxal synthase domain HMMs by editing antismash/output_modules/html/js/gene_colors.js:
{ label: "pyridoxal synthase", color: "#4CAF50", members : ["plants/YjeF_N", "plants/Pyridox_oxidase", "plants/PNPOx_C"] },
The subgroup module is located in antismash/generic_modules/subgroup. A subgroup is defined as a clade of a family phylogenetic tree whose members share similar functions. This module assigns protein sequences from several families to subgroups, based on the results of hmmer scans using domain pHMMs. The families it covers include the cellulose synthase (CSLs) family (Chung et al., 2020; Jozwiak et al., 2020), the UDP-glucuronosyltransferase (UGTs) family (Louveau & Osbourn, 2019), the short-chain dehydrogenases/reductases (SDRs) family (Moummou et al., 2012), and the oxidosqualene cyclase (OSCs) family (unpublished work).
You can choose which families to subgroup by switching the ‘Enable’ column between ‘Y’ and ‘N’ in Subgroup_Model.txt. Each family folder (Cellulose_synt, OSC, SDR, UDPGT) contains the HMM profiles of its subgroups. Note that for SDR, the HMMs are not based on a phylogenetic tree.
The Family model name(s) column in the configuration file designates which domain pHMMs can be associated with a particular family. The Subgroup Model and Match type columns determine what is displayed on the web pages for each cluster, in the overview, and within .gbk files.
Subgroup_Tree.txt works in a similar way. The reference package (in each family folder), which contains the reference tree, is used by pplacer to place the sequence on the tree.
Commands for the module include:
- --disable_subgroup to disable identifying the subgroup.
- --disable_treesvg to disable the creation of SVG pictures of subgrouping trees, which saves time.
- --subgroup_inputpath SUBGROUP_INPUTPATH to specify the path to a folder with the same structure as the subgroup folder in antismash/generic_modules.
Take the cellulose synthase family as an example. If a subgroup of the cellulose synthases on the family tree has been well studied, you can collect those sequences, divide them into subgroups, and create HMMs using hmmbuild. Make sure to add the corresponding information to Subgroup_Model.txt.
The .hmm file name and the NAME in the .hmm file content should match.
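One way to keep the file name and the NAME field in sync is to derive both from the alignment file name and pass the name to hmmbuild with its `-n` option. The sketch below (standalone Python 3) only builds the command and reads back the NAME line. The subgroup name `CslM` and the paths are hypothetical examples, and the helper names are not part of plantiSMASH.

```python
import os


def hmmbuild_command(alignment_path, out_dir="."):
    """Build an hmmbuild call whose HMM NAME equals the .hmm file name."""
    name = os.path.splitext(os.path.basename(alignment_path))[0]
    hmm_path = os.path.join(out_dir, name + ".hmm")
    # hmmbuild usage: hmmbuild [options] <hmmfile_out> <msafile>
    return ["hmmbuild", "-n", name, hmm_path, alignment_path]


def hmm_name(lines):
    """Return the value of the NAME line of an HMMER profile, or None."""
    for line in lines:
        if line.startswith("NAME"):
            return line.split(None, 1)[1].strip()
    return None


cmd = hmmbuild_command("alignments/CslM.afa", "Cellulose_synt")
print(" ".join(cmd))
```

The command list can be run with `subprocess.check_call(cmd)` once HMMER is installed. Afterwards, `hmm_name(open(path))` should return the file name without the `.hmm` extension.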
The reference package, which contains the reference tree that pplacer places the sequence on, is created using taxtastic with the following components:
- Alignment of reference sequences: CSLs_all_to_check_with_add.afa
- The reference tree based on the alignment: RAxML_bipartitionsBranchLabels.CSLs_all_to_check_with_add.newick
- The log file from the tool (iqtree, raxmlng, raxml, fasttree, phyml) used to create the tree: RAxML_info.CSLs_all_to_check_with_add
- A table containing node names and their corresponding subgroup names: node-subgroup.txt
cd /your/work/folder/path # Replace with your actual path
git clone https://github.com/fhcrc/taxtastic.git
cd taxtastic
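mamba create -n python3 python=3 # Separate Python 3 environment; do not reuse the name plantismash
conda activate python3
python -m venv taxtastic-env # Create the virtual environment before activating it
source taxtastic-env/bin/activate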
pip install .
cd ../plantismash/antismash/generic_modules/subgroup/Cellulose_synt/Cellulose_synt.refpkg
taxit create -P Cellulose_synt.refpkg -l plant_Cellulose_synt --aln-fasta CSLs_all_to_check_with_add.afa --tree-stats RAxML_info.CSLs_all_to_check_with_add --tree-file RAxML_bipartitions.CSLs_all_to_check_with_add --seq-info node-subgroup.txt
Taxtastic generates CONTENTS.json, which records the files, and phylo_model.json, which records information from the log file. Check phylo_model.json to make sure its parameters match those in the log file. If they do not, edit the file directly.
For example, change "subs_model": "AUTO" to "subs_model": "LG". The same approach works when the log comes from a tree-building tool that taxtastic does not accept.
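If you prefer to script this change rather than edit the file by hand, a minimal standalone Python 3 sketch could look like this. The helper name is hypothetical, and the only key it touches is the "subs_model" entry mentioned above.

```python
import json


def set_subs_model(path, model):
    """Set "subs_model" in a phylo_model.json file; return the old value."""
    with open(path) as handle:
        data = json.load(handle)
    old = data.get("subs_model")
    data["subs_model"] = model
    with open(path, "w") as handle:
        json.dump(data, handle, indent=2)
    return old


# Example call, assuming the file is in the current reference package:
# set_subs_model("Cellulose_synt.refpkg/phylo_model.json", "LG")
```

Taxtastic also records checksums of the package files in CONTENTS.json, so it may be worth checking whether your downstream tools verify them after a manual change.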
After creating the reference package, add the corresponding information to Subgroup_Tree.txt.
The module uses ete3 to parse the tree file. ete3 does not support branch labels in the form 0.003847[100]; bootstrap values in the Newick file should look like ')100:0.003847'.
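If your tree file only exists in the bracketed form, the labels can be rewritten with a regular expression. This is a minimal standalone Python 3 sketch that assumes every support value appears as `):<branch length>[<integer support>]`; the function name is hypothetical and the example tree is made up.

```python
import re

# Matches "):0.003847[100]" and captures the branch length and the support.
_BRANCH_LABEL = re.compile(r"\):([0-9.eE+-]+)\[(\d+)\]")


def branch_labels_to_node_support(newick):
    """Rewrite '):0.003847[100]' as ')100:0.003847' throughout a tree string."""
    return _BRANCH_LABEL.sub(r")\2:\1", newick)


tree = "((A:0.1,B:0.2):0.003847[100],C:0.3);"
print(branch_labels_to_node_support(tree))
# prints ((A:0.1,B:0.2)100:0.003847,C:0.3);
```

Leaf branches such as `A:0.1` are left untouched because they are not preceded by a closing parenthesis.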
Chung, S. Y., Seki, H., Fujisawa, Y., Shimoda, Y., Hiraga, S., Nomura, Y., Saito, K., Ishimoto, M., & Muranaka, T. (2020). A cellulose synthase-derived enzyme catalyses 3-O-glucuronosylation in saponin biosynthesis. Nature Communications, 11(1), 1–11. https://doi.org/10.1038/s41467-020-19399-0
Jozwiak, A., Sonawane, P. D., Panda, S., Garagounis, C., Papadopoulou, K. K., Abebie, B., Massalha, H., Almekias-Siegl, E., Scherf, T., & Aharoni, A. (2020). Plant terpenoid metabolism co-opts a component of the cell wall biosynthesis machinery. Nature Chemical Biology, 16(7), 740–748. https://doi.org/10.1038/s41589-020-0541-x
Louveau, T., & Osbourn, A. (2019). The Sweet Side of Plant-Specialized Metabolism. Cold Spring Harbor Perspectives in Biology, 11(12), a034744. https://doi.org/10.1101/CSHPERSPECT.A034744
Moummou, H., Kallberg, Y., Tonfack, L. B., Persson, B., & van der Rest, B. (2012). The Plant Short-Chain Dehydrogenase (SDR) superfamily: Genome-wide inventory and diversification patterns. BMC Plant Biology, 12(1), 1–17. https://doi.org/10.1186/1471-2229-12-219
Activate it with --clusterblast
It finds homologous clusters stored in the database.
The scripts are in antismash/generic_modules/clusterblast, together with the database files, including plantgeneclusters.txt and plantgeneclusterprots.fasta.
Use --clusterblastdir to specify the database directory you want to use.
- Change the minimum number of each rule in clusterblast_rules.txt to 2 (so that clusters with only 2 genes are also saved).
- Download the genomes you want to use to make the database, for example the NCBI reference genomes of Streptophyta:
datasets download genome taxon 35493 --annotated --reference --include gbff --dehydrated --filename streptophyta_ref_anno.zip
unzip streptophyta_ref_anno.zip -d ncbi_plant_ref_anno
datasets rehydrate --directory ncbi_plant_ref_anno
# Check that all genomes listed in fetch.txt were downloaded: this prints every genome folder that lacks genomic.gbff
find ncbi_plant_ref_anno/ncbi_dataset/data -mindepth 1 -maxdepth 1 -type d ! -exec sh -c 'ls "{}" | grep -q "genomic.gbff"' \; -print
- run plantiSMASH with
--clusterblast
to get the clusters, for example for using those Streptophyta genomes:
nohup bash -c 'find ncbi_dataset/data -mindepth 1 -maxdepth 1 -type d -exec bash -c "python2 ../plantismash/run_antismash.py --cpus 90 --cdh-memory 64000 --cdh-cutoff 0.9 --update_clusterblast --verbose --disable-svg --disable-html --disable-xls --disable_specific_modules --disable_subgroup --disable-genbank --limit -1 --taxon plants --outputfolder {}/for_blast_db {}/*.*" \; > clusterblast_database_making.log 2>&1' &
This updates plantgeneclusters.txt and plantgeneclusterprots.fasta in every run.
You can add --clusterblastdir to specify the directory where the database is saved.
- Get the number of clusters found in each genome:
grep -E 'INFO: Numbers of clusters' clusterblast_database_making.log > clusters_numbers.txt