#Comparative genome analysis with OrthoMCL

Author: Juan A. Ugalde, juanuu@gmail.com

##Some considerations

You need to have OrthoMCL installed on your machine, and the right permissions to create MySQL databases.

##Preparing the genome files

###Download the data from IMG

For each genome, download three files and store them in separate folders, one folder per data type:

* Annotation: named `GenomeNumber.info.xls`
* Protein FASTA: named `GenomeNumber.faa`
* Gene (nucleotide) FASTA: named `GenomeNumber.fna`

Keep in mind that a file downloaded from the IMG website sometimes arrives with the name main.cgi instead of the correct name. If that is the case, rename the file appropriately.

The files do not have to use the faa or fna extensions, but all the files in a folder must share the same extension.
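A quick check of a download folder can catch both problems before going further. The following is a minimal sketch, not part of the original scripts; the function name `check_genome_folder` is hypothetical.

```python
from pathlib import Path


def check_genome_folder(folder):
    """Return (misnamed, extensions) for the regular files in `folder`.

    misnamed   -- sorted names of files still called main.cgi (or main.cgi*)
    extensions -- sorted list of the distinct extensions of the other files
    """
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    misnamed = sorted(p.name for p in files if p.name.startswith("main.cgi"))
    extensions = sorted(
        {p.suffix for p in files if not p.name.startswith("main.cgi")}
    )
    return misnamed, extensions
```

A folder is ready when `misnamed` is empty and `extensions` holds exactly one entry.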

###Create a genome list file

This file is used to replace the JGI names with appropriate (and easier to remember) names. Each line has three fields, separated by tabs:

* Taxon OID (which is the prefix of each file)
* Full genome name
* Prefix to use

An example line, with `<TAB>` standing for a tab character: `645058727<TAB>Acidithiobacillus caldus ATCC 51756<TAB>Acaldus51756`
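Because the full genome name contains spaces, the file has to be split on tabs only. Here is a minimal sketch of reading it, assuming one genome per line and exactly three fields; the function name `read_genome_list` is hypothetical and this is not how the provided scripts necessarily do it.

```python
def read_genome_list(path):
    """Map each Taxon OID to a (full genome name, prefix) tuple."""
    genomes = {}
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            fields = line.split("\t")
            if len(fields) != 3:
                raise ValueError(
                    f"line {line_number}: expected 3 tab-separated fields, "
                    f"found {len(fields)}"
                )
            taxon_oid, full_name, prefix = (field.strip() for field in fields)
            genomes[taxon_oid] = (full_name, prefix)
    return genomes
```

A line that was typed with spaces instead of tabs raises a `ValueError` with its line number, which is usually the quickest way to find the formatting mistake.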

###Prepare the files

Use the PrepareGenomes_OrthoMCL.py script. The required inputs for the script are:

* Genome list file
* Folder with either nucleotide or protein sequences
* Extension of the files in the folder
* Name of the output folder (it will be created if it doesn't exist)
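To illustrate the kind of renaming involved, here is a minimal sketch that rewrites FASTA headers into the `taxon|sequence_id` form that OrthoMCL expects. It is an assumption about what the preparation amounts to, not the code of the provided script; the function name `relabel_fasta` and the sequence ID in the usage note are hypothetical.

```python
def relabel_fasta(lines, prefix):
    """Yield FASTA lines with each header rewritten as >prefix|first_token.

    Only the first whitespace-delimited token of the original header is
    kept as the sequence ID; sequence lines pass through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("FASTA header without a sequence ID")
            yield f">{prefix}|{tokens[0]}\n"
        else:
            yield line
```

For example, with the prefix `Acaldus51756`, a header such as `>12345 hypothetical protein` would become `>Acaldus51756|12345`.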

###Run Blast

1. Concatenate all of the genome files into a single file
2. Create the Blast database
3. Blast all the sequences against the Blast database; the output must be in tabular format

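Example commands:

    formatdb -i All_genomes.fasta -n AcidoProteins -p T

    blastall -p blastp -d AcidoProteins -i All_genomes.fasta -F 'm S' -v 1000000 -b 1000000 -z 19055 -e 1e-5 -m 8 -a 6 -o blastp.Acidothio

Here `-m 8` selects the tabular output format and `-a 6` sets the number of processors to use. The value given to `-z` is specific to this dataset.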
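The concatenation step can be done with `cat`, but a file that lacks a trailing newline will then glue its last sequence line to the next header. Here is a minimal sketch that guards against this and also counts the sequences written; the function name `concatenate_fasta` is hypothetical.

```python
from pathlib import Path


def concatenate_fasta(folder, extension, output_path):
    """Write every *.extension file in `folder` to `output_path`.

    Returns the number of sequences (header lines) written.
    """
    output_path = Path(output_path)
    n_sequences = 0
    with open(output_path, "w") as out:
        for fasta in sorted(Path(folder).glob(f"*.{extension}")):
            if fasta.resolve() == output_path.resolve():
                continue  # never read the output file back into itself
            last_line = "\n"
            with open(fasta) as handle:
                for line in handle:
                    if line.startswith(">"):
                        n_sequences += 1
                    out.write(line)
                    last_line = line
            if not last_line.endswith("\n"):
                out.write("\n")  # keep the next header on its own line
    return n_sequences
```

A call such as `concatenate_fasta("proteins", "faa", "All_genomes.fasta")` would produce the input file used in the example commands, where `proteins` is a placeholder folder name.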

###Create MySQL database and run MCL

1. Create a MySQL database and give privileges on it to the user running the analysis.
2. Create the config file (an example is provided in the orthoMCL folder) and edit it with the MySQL database information, user name and password.
3. Install the orthoMCL schema: `orthomclInstallSchema config_file`
4. Parse the Blast results: `orthomclBlastParser blast_input fasta_protein_folder >> similarSequences.txt`
5. Load the results into the database: `orthomclLoadBlast config_file similarSequences.txt`
6. Run the pairs step: `orthomclPairs config_file log_file cleanup=no`
7. Dump the pairs files: `orthomclDumpPairsFiles config_file`
8. Run the MCL analysis. Different inflation values (`-I`) can be tried here: `mcl mclInput --abc -I 1.5 -o 1.5-mclOutput`
9. Get group names for the MCL clusters: `orthomclMclToGroups group_name number < mcl_output_file > group_output_file`
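When comparing several inflation values, it can help to generate the `mcl` commands rather than typing each one. The sketch below reuses only the flags from the `mcl` command in this section and names each output after its inflation value; the function name `mcl_commands` and the values 2.0 and 3.0 are illustrative choices, not recommendations.

```python
def mcl_commands(inflation_values, mcl_input="mclInput"):
    """Build one mcl argument list per inflation value."""
    return [
        ["mcl", mcl_input, "--abc", "-I", str(value), "-o", f"{value}-mclOutput"]
        for value in inflation_values
    ]


for command in mcl_commands([1.5, 2.0, 3.0]):
    # Print the command line; to run it instead, pass `command` to
    # subprocess.run(command, check=True).
    print(" ".join(command))
```

Each resulting output file can then be passed to `orthomclMclToGroups` separately.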

###Analyze the orthoMCL output with the custom python scripts

* PrepareGenome_OrthoMCL.py
* AnnotateOrthoMCL_Clusters.py
* ClusterAlignmentTree.py
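Independently of these scripts, a quick look at the groups file can confirm that the clustering worked. The sketch below assumes the usual layout of the `orthomclMclToGroups` output, one group per line written as `group_name: prefix|id prefix|id ...`; check your own file before relying on it. The function names are hypothetical.

```python
from collections import Counter


def parse_groups(lines):
    """Map each group name to its list of members (prefix|id strings)."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, _, members = line.partition(":")
        groups[name.strip()] = members.split()
    return groups


def genomes_per_group(groups):
    """Count, for each group, how many members come from each genome prefix."""
    return {
        name: Counter(member.split("|", 1)[0] for member in members)
        for name, members in groups.items()
    }
```

Groups whose counter has one entry per genome, each with a count of 1, are candidate single-copy clusters shared by all genomes.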
