Jorge edited this page Dec 19, 2021 · 1 revision

Input

Starting from the --inputdir folder, BiG-SCAPE will recursively look for files with the .gbk extension. The following files are excluded:

  • Filenames that include the string(s) specified in --exclude_gbk_str (default: 'final'). This excludes the summary GenBank file produced by antiSMASH, whose name ends with <clustername>.final.gbk
  • Files with spaces in their path (including the filename). Spaces don't work well with HMMER
  • Files with the string '_ORF', which is used internally by BiG-SCAPE
  • Files with duplicated names (e.g. in different folders)
  • Files where no protein sequences could be extracted
  • Files whose sequence (summed between all records) is shorter than min_bgc_size
  • Files with format issues not parseable by BioPython

By default, only the following files are included:

  • Files with 'cluster' in their name (antiSMASH 4)
  • Files with 'region' in their name (antiSMASH 5)
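
The name-based rules above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not BiG-SCAPE's actual code: the function name `candidate_gbk_files` and its parameters are hypothetical, and keeping the first file when names are duplicated is an assumption. The content-based checks (no extractable proteins, minimum size, BioPython parse errors) are left out.

```python
from pathlib import Path


def candidate_gbk_files(inputdir, include_strs=("cluster", "region"),
                        exclude_strs=("final",)):
    """Return .gbk files under inputdir that pass the name-based rules.

    Hypothetical helper for illustration only; not part of BiG-SCAPE.
    """
    seen_names = set()
    selected = []
    # rglob searches the input folder recursively
    for path in sorted(Path(inputdir).rglob("*.gbk")):
        name = path.name
        if " " in str(path):
            continue  # spaces anywhere in the path are not allowed
        if "_ORF" in name:
            continue  # string reserved for internal use
        if any(s in name for s in exclude_strs):
            continue  # e.g. the antiSMASH summary file *.final.gbk
        if not any(s in name for s in include_strs):
            continue  # default: only 'cluster' or 'region' files
        if name in seen_names:
            continue  # duplicated name (assumption: first one is kept)
        seen_names.add(name)
        selected.append(path)
    return selected
```

Calling `candidate_gbk_files("my_input")` on a folder of antiSMASH results would return only the per-cluster or per-region GenBank files, skipping the `.final.gbk` summary.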

If you need to exclude or include files with certain strings in their name, use the --exclude_gbk_str and

If two CDS features overlap (e.g. because of splicing events), BiG-SCAPE tolerates an overlap of at most 10% of the length of the shortest CDS. If a larger overlap is detected, BiG-SCAPE discards the smallest feature from the analysis.
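
As a rough illustration of the overlap rule, the sketch below compares two CDS features given as `(start, end)` coordinate pairs. The function `resolve_overlap` is hypothetical and is not taken from BiG-SCAPE; half-open coordinates and the tie-breaking choice (dropping the second feature when both have equal length) are assumptions.

```python
def resolve_overlap(cds_a, cds_b, max_fraction=0.10):
    """Return the CDS features kept after applying the overlap rule.

    Each CDS is a (start, end) tuple with end exclusive (assumption).
    """
    len_a = cds_a[1] - cds_a[0]
    len_b = cds_b[1] - cds_b[0]
    # Number of positions shared by both features (0 if they are disjoint)
    overlap = max(0, min(cds_a[1], cds_b[1]) - max(cds_a[0], cds_b[0]))
    shortest = min(len_a, len_b)
    if overlap > max_fraction * shortest:
        # Too much overlap: discard the smaller feature
        return [cds_a] if len_a >= len_b else [cds_b]
    return [cds_a, cds_b]


# 5 shared positions, shortest CDS is 100 long: within the 10% limit
print(resolve_overlap((0, 100), (95, 200)))
# 20 shared positions, shortest CDS is 100 long: the shorter one is dropped
print(resolve_overlap((0, 100), (80, 300)))
```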

The file's name (without extension) is used as the BGC name from this point on.

Note that, for the time being, BiG-SCAPE does not perform any taxon-specific analysis (i.e. bacterial, archaeal, fungal and plant BGCs are all treated the same).
