Once install_databases.py
has been run, pharokka requires an input fasta file. An output directory can be specified using -o. Otherwise, an output/
directory will be created in your current working directory.
You can run pharokka
using the following command:
pharokka.py -i <fasta file> -o <output folder> -t <threads> -p <prefix>
pharokka
defaults to 1 thread.
As of v1.4.0, you can speed up pharokka
for small inputs (e.g. phage isolates or a small number of contigs) using -fast
. Note: this will run only PyHMMER against PHROGs (so no VFDB or CARD annotations will be done).
pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads> --fast
A prefix is not required - by default it is pharokka
e.g. pharokka.gff
To specify a different database directory (recommended) from the default:
pharokka.py -i <fasta file> -o <output folder> -d <path/to/databse_dir> -t <threads> -p <prefix>
To overwrite an existing output directory, use -f or --force
:
pharokka.py -i <fasta file> -o <output folder> -d <path/to/databse_dir> -t <threads> -p <prefix> -f
To specify a custom locus tag in the gff/genbank file, use -l:
pharokka.py -i <fasta file> -o <output folder> -d <path/to/databse_dir> -t <threads> -p <prefix> -f -l <locus_tag>
In versions from v0.1.7 and afterwards, the ability to specify an E-value threshold for CDS functional assignment using MMseqs2 was added using the -e flag. It defaults to 1E-5.
pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads> -e <E-value>
To use Prodigal (pyrodigal) gene predictions instead of PHANOTATE use -g prodigal
pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads> -g prodigal
To use Prodigal-gv (pyrodigal-gv) gene predictions instead of PHANOTATE use -g prodigal-gv
. This is recommended for metagenomic datasets where some phages likely have alternate genetic codes.
pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads> -g prodigal-gv
If you are annotating more than 1 contig, it is recommended that you run pharokka in meta mode using the -m
flag, which will enable pharokka to finish faster by making full use of all available threads when running tRNAscan-SE 2.
pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads> -m -g prodigal-gv
As of v1.5.0, you can skip running mash to find the closest match for each contig in INPHARED using skip_mash
.
pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads> --skip_mash
As of v1.5.0, you can skip running tRNAscan-SE 2, MinCED and Aragorn fusingg skip_extra_annotations
.
pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads> --skip_extra_annotations
As of v1.4.0, pharokka
will automatically run MMseqs2 (PHROGs, CARD, VFDB) and PyHMMER (PHROGs). To turn off PyHMMER, use --mmseqs2_only
.
pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads> --mmseqs2_only
As of v1.4.0, pharokka
will automatically run --mmseqs2_only
if you specify -m
or --meta
for meta mode. To force pharokka
to use PyHMMER on PHROGs in meta mode as well, use --meta_hmm
. Note: this might take a long time if you have a large metagenome!
pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads> -m --meta_hmm
You can also force pharokka
to only use PyHMMER on PHROGs in meta mode using --fast
.
pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads> -m --fast
If you are running meta mode, you can optionally specify split mode using the -s
flag. This will add output directories with separated FASTA, genbank and gff files for each input contig called single_fastas
, single_gbks
and single_gffs
.
pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads> -m -s
There is also support for alternative genetic codes if pharokka
is run with prodigal as a gene predictor using the -c
flag. See Prodigal's documentation, along with Yutin et al. 2021 and Peters et al. 2022 for more information and background on phage stop codon reassignment.
pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads> -g prodigal -c 4
As of v1.4.0 you can specify a genbank input file with --genbank
. This will take all (custom) CDS calls in genbank file and PHANOTATE/pyrodigal will not be run. So if you have done manual gene curation, this option is recommended.
pharokka.py -i <GENBANK file> -o <output folder> -d <path/to/database_dir> -t <threads> --genbank
As of v1.4.0 you want to automatically reorient your phage to begin with the large terminase subunit, you can use --dnaapler
which will run dnaapler to automatically reorient your phage. The reoriented FASTA will be named <prefix>_dnaapler_reoriented.fasta
in the output directory.
pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads> --dnaapler
--dnaapler
is recommended for reorientation as of v1.4.0. However, if you need to reorient your phage to begin at a certain point (for example, to reorient the phage to begin with the large terminase subunit) or if dnaapler
isn't working for you, you can acheive this using the --terminase
command along with specifying the --terminase_strand
denoting the strandedness of the large terminase subunit (must be 'pos' or 'neg') and --terminase_start
denoting the start co-ordinate. This requires the input to be a single contig, or else pharokka
will fail and warn you. For example, if you annotate your phage using pharokka
, and then see that your terminase large subunit begins at coordinate 9503 on the positive strand, you could run the following to reorient it to begin at the large terminase subunit:
pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads> --terminase --terminase_start 9503 --terminase_strand pos
Additionally, it should be noted that if you want to reorient to begin with the large terminase subunit that is on the negative strand (--terminase_strand neg
), you will need to specify a --terminase_start
coordinate that is actually the end (right) coordinate from the gff when --terminase_strand neg is specified. For example (thanks to Albert Vill) if your large terminase subunit is:
ON631220.1 PHANOTATE CDS 80073 82226 -1.721062366005452e+16 - ...
Then you need to specify the following to reorient the phage to begin with the large terminase subunit as follows:
pharokka.py \
-i <fasta file> \
-o <output folder> \
-d <path/to/database_dir> \
-t <threads> \
--terminase \
--terminase_start 82226 \
--terminase_strand neg
Of course, you can also use this functionality to reorient your phage however you wish!
As of v1.6.0 you can specify a custom mash distance threshold vs the INPHARED database with --mash_distance
pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads> --mash_distance 0.05
You can also specify custom arguments to be passed to MINced with --minced_args
Note 2 things: (1) that you need to leave off the leading hyphen (i.e. "minNR 2"
not "-minNR 2"
) and (2) you will need to use quotation marks (as you extra arguments will contain spaces)
pharokka.py -i <fasta file> -o <output folder> -d <path/to/database_dir> -t <threads> --minced_args "minNR 2 -minRL 21"
As of v1.7.4 you can specify the bacterial tRNAscan-SE
model using --trna_scan_model bacterial
. Otherwise, pharokka
uses the general model by default. See the tRNAscan-SE paper for more information.
usage: pharokka.py [-h] [-i INFILE] [-o OUTDIR] [-d DATABASE] [-t THREADS] [-f] [-p PREFIX] [-l LOCUSTAG] [-g GENE_PREDICTOR] [-m] [-s]
[-c CODING_TABLE] [-e EVALUE] [--fast] [--mmseqs2_only] [--meta_hmm] [--dnaapler] [--custom_hmm CUSTOM_HMM] [--genbank]
[--terminase] [--terminase_strand TERMINASE_STRAND] [--terminase_start TERMINASE_START] [--skip_extra_annotations]
[--skip_mash] [--minced_args MINCED_ARGS] [--mash_distance MASH_DISTANCE] [-V] [--citation]
pharokka: fast phage annotation program
options:
-h, --help show this help message and exit
-i INFILE, --infile INFILE
Input genome file in fasta format.
-o OUTDIR, --outdir OUTDIR
Directory to write the output to.
-d DATABASE, --database DATABASE
Database directory. If the databases have been installed in the default directory, this is not required. Otherwise specify the path.
-t THREADS, --threads THREADS
Number of threads. Defaults to 1.
-f, --force Overwrites the output directory.
-p PREFIX, --prefix PREFIX
Prefix for output files. This is not required.
-l LOCUSTAG, --locustag LOCUSTAG
User specified locus tag for the gff/gbk files. This is not required. A random locus tag will be generated instead.
-g GENE_PREDICTOR, --gene_predictor GENE_PREDICTOR
User specified gene predictor. Use "-g phanotate" or "-g prodigal" or "-g prodigal-gv" or "-g genbank".
Defaults to phanotate (not required unless prodigal is desired).
-m, --meta meta mode for metavirome input samples
-s, --split split mode for metavirome samples. -m must also be specified.
Will output separate split FASTA, gff and genbank files for each input contig.
-c CODING_TABLE, --coding_table CODING_TABLE
translation table for prodigal. Defaults to 11.
-e EVALUE, --evalue EVALUE
E-value threshold for MMseqs2 database PHROGs, VFDB and CARD and PyHMMER PHROGs database search. Defaults to 1E-05.
--fast, --hmm_only Runs PyHMMER (HMMs) with PHROGs only, not MMseqs2 with PHROGs, CARD or VFDB.
Designed for phage isolates, will not likely be faster for large metagenomes.
--mmseqs2_only Runs MMseqs2 with PHROGs, CARD and VFDB only (same as Pharokka v1.3.2 and prior). Default in meta mode.
--meta_hmm Overrides --mmseqs2_only in meta mode. Will run both MMseqs2 and PyHMMER.
--dnaapler Runs dnaapler to automatically re-orient all contigs to begin with terminase large subunit if found.
Recommended over using '--terminase'.
--custom_hmm CUSTOM_HMM
Run pharokka with a custom HMM profile database suffixed .h3m.
Please use create this with the create_custom_hmm.py script.
--genbank Flag denoting that -i/--input is a genbank file instead of the usual FASTA file.
The CDS calls in this file will be preserved and re-annotated.
--terminase Runs terminase large subunit re-orientation mode.
Single genome input only and requires --terminase_strand and --terminase_start to be specified.
--terminase_strand TERMINASE_STRAND
Strand of terminase large subunit. Must be "pos" or "neg".
--terminase_start TERMINASE_START
Start coordinate of the terminase large subunit.
--skip_extra_annotations
Skips tRNAscan-se, MINced and Aragorn.
--skip_mash Skips running mash to find the closest match for each contig in INPHARED.
--minced_args MINCED_ARGS
extra commands to pass to MINced (please omit the leading hyphen for the first argument). You will need to use quotation marks e.g. --minced_args "minNR 2 -minRL 21"
--mash_distance MASH_DISTANCE
mash distance for the search against INPHARED. Defaults to 0.2.
--trna_scan_model {general,bacterial}
tRNAscan-SE model
-V, --version Print pharokka Version
--citation Print pharokka Citation