input
Starting from the --inputdir folder, BiG-SCAPE will recursively look for files with the .gbk extension. The following files are excluded:
- Filenames that include the string(s) specified in --exclude_gbk_str (default: 'final'; this excludes the summary GenBank file produced by antiSMASH, which is named <clustername>.final.gbk)
- Files with spaces in their path (including the filename), because spaces don't work well with hmmer
- Files with the string '_ORF', which is used internally by BiG-SCAPE
- Files with duplicated names (e.g. in different folders)
- Files where no protein sequences could be extracted
- Files whose sequence (summed across all records) is shorter than min_bgc_size
- Files with format issues not parseable by BioPython
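The name-based exclusion rules above can be sketched in a few lines of Python. This is an illustration only, not BiG-SCAPE's actual code: the function name `find_candidate_gbks` and its return values are hypothetical, and it assumes that the first file seen with a given name is kept while later duplicates are skipped. The content-based checks (no protein sequences, min_bgc_size, BioPython parsing errors) are left out.

```python
from pathlib import Path


def find_candidate_gbks(inputdir, exclude_gbk_str=("final",)):
    """Recursively collect .gbk files, applying only the name-based exclusions.

    Hypothetical helper: returns (kept, skipped), where `kept` maps a BGC name
    (filename without extension) to its path and `skipped` lists (path, reason).
    """
    kept = {}
    skipped = []
    for path in sorted(Path(inputdir).rglob("*.gbk")):
        if not path.is_file():
            continue
        name = path.name
        if any(s in name for s in exclude_gbk_str):
            skipped.append((path, "matches an excluded string"))
        elif " " in str(path):
            skipped.append((path, "space in path"))
        elif "_ORF" in name:
            skipped.append((path, "contains '_ORF'"))
        elif path.stem in kept:
            # Assumption: the first occurrence wins, later duplicates are dropped.
            skipped.append((path, "duplicated name"))
        else:
            kept[path.stem] = path
    return kept, skipped
```

Calling `find_candidate_gbks("my_input_folder")` would return the surviving files keyed by name, plus the reason each other file was skipped.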
By default, only the following files are included:
- Files with 'cluster' in their name (antiSMASH 4)
- Files with 'region' in their name (antiSMASH 5)
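As a rough sketch of this default inclusion rule, a filename passes if it contains at least one of the two strings. The function `is_included` and its `include_strings` parameter are hypothetical names chosen for illustration, and plain case-sensitive substring matching is an assumption.

```python
def is_included(filename, include_strings=("cluster", "region")):
    """Return True if the filename contains any of the inclusion strings.

    Hypothetical helper; assumes case-sensitive substring matching.
    """
    return any(s in filename for s in include_strings)
```

For example, `is_included("BGC0001.cluster001.gbk")` and `is_included("NC_003888.region005.gbk")` both hold, while a file simply named `genome.gbk` would be left out.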
If you need to exclude or include files with certain strings in their name, use the --exclude_gbk_str
and
If two CDS features overlap (e.g. splicing events), BiG-SCAPE tolerates an overlap of at most 10% of the length of the shortest CDS. If more overlap is detected, BiG-SCAPE will discard the smaller of the two features from the analysis.
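The overlap rule can be illustrated with a minimal sketch. It assumes each CDS is a simple half-open `(start, end)` interval on the same record and ignores strand and multi-part locations; the function name `resolve_overlap` is hypothetical and this is not BiG-SCAPE's own implementation.

```python
def resolve_overlap(cds_a, cds_b, max_fraction=0.10):
    """Return the CDS intervals that survive the overlap check.

    Each CDS is a (start, end) tuple with end exclusive. If the overlap
    exceeds max_fraction of the shorter CDS, the smaller one is dropped.
    """
    len_a = cds_a[1] - cds_a[0]
    len_b = cds_b[1] - cds_b[0]
    # Length of the shared region (0 if the intervals do not overlap).
    overlap = max(0, min(cds_a[1], cds_b[1]) - max(cds_a[0], cds_b[0]))
    if overlap > max_fraction * min(len_a, len_b):
        # Too much overlap: keep only the larger feature.
        return [cds_a] if len_a >= len_b else [cds_b]
    return [cds_a, cds_b]
```

With a 100 bp CDS at `(0, 100)`, a neighbour starting at position 95 overlaps by 5 bp (5%) and both are kept, whereas a neighbour starting at position 50 overlaps by 50 bp (50%) and the 100 bp CDS is discarded.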
The file's name (without extension) is used as the BGC name from here on.
Note that, for the time being, BiG-SCAPE does not perform any taxon-specific analysis (i.e. bacterial, archaeal, fungal and plant BGCs are all treated the same).