title | author | date | output |
---|---|---|---|
MIA | Flavio Lichtenstein | December 11, 2015 | html_document |
Mutual Information Analyzer, a software tool that clusters molecular sequences based on entropy and mutual information
Federal University of Sao Paulo (UNIFESP), DIS-Bioinformatics
Motivation: Here we propose a method to discriminate closely related species at the molecular level using entropy and mutual information. Sequences of orthologous genes in the same species might contain species-specific covariation patterns that can be identified through mutual information.
Summary: Mutual Information Analyzer (MIA) is a pipeline written in Python to calculate Vertical Entropy (VH), Vertical Mutual Information (VMI), and Horizontal Mutual Information (HMI). From the VH, VMI, and HMI distributions, the Jensen-Shannon Divergence (JSD) is calculated to estimate the distances between species' sequences. Each pair of mutual information distribution distances, with their respective standard errors, is calculated and stored in distance matrices. These distances between distributions can be presented as histograms or hierarchical cluster dendrograms.
See the install document.
We built executables for Windows and Linux with PyInstaller, but in the end they did not work (we could not determine why). We apologize.
Since the source code is open, you can run MIA from the command line, but first you must install some libraries and set the Python path.
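As a minimal sketch of the path setup (the directory below is hypothetical; point it at your own checkout), adding the source tree to `sys.path` inside Python is equivalent to exporting `PYTHONPATH` before running the scripts:

```python
import sys

# Hypothetical location of the MIA source checkout; adjust to your installation.
sys.path.append("/path/to/mia/src")
```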
Mutual Information Analyzer (MIA) is a pipeline written in Python with the following algorithms:
- A1) NCBI: gathers data from NCBI and stores it in GBK file format;
- A2) Gbk to Fasta: parses the GBK file and organizes the sequences into FASTA files per species;
- A3) Alignment: aligns all sequences and creates two FASTA files: "mincut", cutting out columns and sequences with large gaps, and "maxmer", keeping the maximum possible gaps;
- A4) Purging: replaces ambiguous nucleotides via the IUPAC nucleotide ambiguity table, and eliminates sequences with undesirable words, such as "synthetic", in their names;
- A5) Consensus: replaces gaps with the vertical consensus nucleotide of their column (see the first sketch after this list);
- A6) VMI: calculates and stores Vertical Entropy (VH) and Vertical Mutual Information (VMI) distributions, and plots the respective histograms and heat maps (see the second sketch after this list);
- A7) HMI: calculates and stores Horizontal Mutual Information (HMI) distributions, and plots the histograms;
- A8) JSD: calculates the Jensen-Shannon Divergence, stores the distances and their standard errors in distance matrix files, and plots the histograms;
- A9) HC: calculates the hierarchical clustering and presents it as a dendrogram (see the dendrogram sketch below);
- A10) Entropy: simulates Shannon Entropy.
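To make step A5 concrete, here is a minimal sketch of consensus gap-filling; the function name and the `-` gap character are our own illustration, not MIA's actual API:

```python
from collections import Counter

def consensus_fill(alignment):
    """Replace '-' gaps in each column with that column's most frequent nucleotide."""
    n_cols = len(alignment[0])
    consensus = []
    for j in range(n_cols):
        # Count only non-gap characters in column j.
        counts = Counter(seq[j] for seq in alignment if seq[j] != "-")
        consensus.append(counts.most_common(1)[0][0] if counts else "-")
    return ["".join(consensus[j] if seq[j] == "-" else seq[j] for j in range(n_cols))
            for seq in alignment]

print(consensus_fill(["AC-T", "ACGT", "A-GT"]))  # -> ['ACGT', 'ACGT', 'ACGT']
```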
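Likewise for A6, a minimal sketch of Vertical Entropy per column and Vertical Mutual Information between two columns, using the standard identity MI(i, j) = H(i) + H(j) - H(i, j); all names here are illustrative:

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy (bits) of a list of symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def column(alignment, j):
    """Vertical slice: the j-th character of every sequence."""
    return [seq[j] for seq in alignment]

def vertical_mi(alignment, i, j):
    """Vertical Mutual Information between columns i and j: MI = H(i) + H(j) - H(i, j)."""
    ci, cj = column(alignment, i), column(alignment, j)
    return entropy(ci) + entropy(cj) - entropy(list(zip(ci, cj)))

aln = ["ACGT", "ACGA", "ATGA"]
print([entropy(column(aln, j)) for j in range(4)])  # per-column VH
print(vertical_mi(aln, 1, 3))                       # VMI between columns 1 and 3
```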
HMI and VMI are calculated with and without bias corrections; therefore, the gain or loss of information for "mincut" versus "maxmer", with or without bias correction, can be compared. Distances between distributions are calculated via the square root of the JSD. Since Mutual Information and JSD are not linear functions of the data, their standard errors are calculated by empirical propagation.
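For the JSD itself, a minimal sketch of the square-root-of-JSD distance between two discrete distributions (bias correction and empirical error propagation are omitted):

```python
import math

def kld(p, q):
    """Kullback-Leibler divergence (bits) between discrete distributions p and q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: mean KLD of p and q against their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

def jsd_distance(p, q):
    """The square root of the JSD is a metric and is used as the distance."""
    return math.sqrt(jsd(p, q))

print(jsd_distance([0.5, 0.5, 0.0], [0.25, 0.25, 0.5]))
```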
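And for the clustering step (A9), a sketch assuming SciPy and Matplotlib are available; the distance matrix below is hypothetical, standing in for a stored sqrt(JSD) matrix:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Hypothetical symmetric sqrt(JSD) distance matrix between four species.
labels = ["species_1", "species_2", "species_3", "species_4"]
dist = np.array([
    [0.00, 0.12, 0.45, 0.50],
    [0.12, 0.00, 0.42, 0.48],
    [0.45, 0.42, 0.00, 0.15],
    [0.50, 0.48, 0.15, 0.00],
])

# Condense the matrix and build an average-linkage hierarchical cluster.
tree = linkage(squareform(dist), method="average")
dendrogram(tree, labels=labels)
plt.show()
```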
#### Images:
- Vertical Shannon Entropy
- Vertical Mutual Information 2D Heatmap
- Vertical Mutual Information 3D Heatmap
- Horizontal Entropy
- JSD Histogram
- Hierarchical Cluster
#### Soon: a nonparametric classifier
We have also developed MI_Classifier, a nonparametric classifier. Given all sequences in a FASTA file, it calculates the MI spectra and JSD distances, and exports the informational dendrogram to FigTree. It is still in testing.