#Comparative genome analysis with OrthoMCL
Author: Juan A. Ugalde
##Some considerations
- You need to have OrthoMCL installed on your machine, and the right permissions to create MySQL databases.
##Preparing the genome files
You need to download three files for each genome, and store them in separate folders, according to data type:
- Annotation: the format is GenomeNumber.info.xls
- Fasta proteins: the format is GenomeNumber.faa
- Fasta genes (nucleotide): the format is GenomeNumber.fna
Keep in mind that a file downloaded from the IMG website sometimes does not have the correct name and is instead called main.cgi. If that is the case, you need to rename the file appropriately.
The extension of each file does not need to be faa or fna, but all the files in a folder must have the same extension.
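Before going further it can help to check each folder for these two problems. The following is a minimal sketch, not part of the original scripts: the function name `check_genome_folder` is hypothetical, and it assumes that misnamed downloads keep the `.cgi` extension.

```python
from collections import Counter
from pathlib import Path


def check_genome_folder(folder):
    """Return (extension counts, misnamed downloads) for one genome folder.

    Assumption: misnamed IMG downloads are called main.cgi (or a browser
    variant such as "main(1).cgi").
    """
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    # Files that still carry the IMG download name and need renaming.
    misnamed = sorted(
        p.name for p in files
        if p.suffix == ".cgi" and p.stem.startswith("main")
    )
    # Count the extensions of the remaining files; more than one key
    # means the folder mixes extensions.
    extensions = Counter(p.suffix for p in files if p.name not in misnamed)
    return extensions, misnamed
```

A folder is ready when the returned list of misnamed files is empty and the counter holds a single extension.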
You also need a genome list file. It is used to replace the JGI names with appropriate (and easier to remember) names. Each line of this file has three fields, separated by tabs:
- Taxon OID (which is the prefix of each file)
- Full genome name
- Prefix to use

An example line: `645058727	Acidithiobacillus caldus ATCC 51756	Acaldus51756`
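To make the expected layout concrete, here is a small sketch that reads such a file into a dictionary. It is only an illustration of the three-field, tab-separated format described above; the function name `read_genome_list` is hypothetical and this is not how PrepareGenomes_OrthoMCL.py is necessarily implemented.

```python
def read_genome_list(path):
    """Map each Taxon OID to a (full genome name, prefix) tuple."""
    genomes = {}
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # ignore blank lines
            fields = [field.strip() for field in line.split("\t")]
            if len(fields) != 3:
                raise ValueError(
                    f"line {line_number}: expected 3 tab-separated fields, "
                    f"found {len(fields)}"
                )
            taxon_oid, full_name, prefix = fields
            genomes[taxon_oid] = (full_name, prefix)
    return genomes
```

Raising an error on a malformed line is a design choice: a genome name typed with spaces instead of tabs is easier to fix at this point than after the BLAST run.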
Use the PrepareGenomes_OrthoMCL.py script. The required inputs for the script are:
- Genome list file
- Folder with either nucleotide or protein sequences
- Extension of the files in the folder
- Name of the output folder (it will be created if it doesn't exist)
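
Once the genome files are prepared, run an all-versus-all BLAST of the sequences:
- Concatenate all of the genome files into one single file
- Create the BLAST database
- BLAST all the sequences against the BLAST database; the output must be in tabular format

Example commands (`-m 8` requests tabular output and `-a 6` uses six processors, so adjust `-a` to your machine):
- `formatdb -i All_genomes.fasta -n AcidoProteins -p T`
- `blastall -p blastp -d AcidoProteins -i All_genomes.fasta -F 'm S' -v 1000000 -b 1000000 -z 19055 -e 1e-5 -m 8 -a 6 -o blastp.Acidothio`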
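The concatenation step has no command in the example, so here is one way it could be done in Python. This is a sketch under assumptions: the function name `concatenate_fasta` is hypothetical, and the output name `All_genomes.fasta` is taken from the formatdb example.

```python
from pathlib import Path


def concatenate_fasta(folder, extension, output_path):
    """Write every file with the given extension into one FASTA file.

    Returns the number of input files that were concatenated.
    """
    output_path = Path(output_path)
    pattern = "*." + extension.lstrip(".")
    # Collect the inputs before opening the output, and skip the output
    # itself in case it lives in the same folder.
    inputs = sorted(
        p for p in Path(folder).glob(pattern)
        if p.resolve() != output_path.resolve()
    )
    with open(output_path, "w") as out:
        for fasta in inputs:
            text = fasta.read_text()
            if not text:
                continue  # nothing to copy from an empty file
            # Make sure the next file starts on its own line.
            out.write(text if text.endswith("\n") else text + "\n")
    return len(inputs)
```

For example, `concatenate_fasta("proteins", "faa", "All_genomes.fasta")` would build the input used by the formatdb command, assuming the prepared protein files sit in a folder called `proteins`.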
- Create the MySQL database, and give privileges to the user running the analysis
- Create the config file (an example is provided in the orthoMCL folder), and edit it with the information of the MySQL database, user name and password
- Install the orthoMCL schema: `orthomclInstallSchema config_file`
- Parse the BLAST results: `orthomclBlastParser blast_input fasta_protein_folder >> similarSequences.txt`
- Load the results into the database: `orthomclLoadBlast config_file similarSequences.txt`
- Run MCL pairs: `orthomclPairs config_file log_file cleanup=no`
- Dump the pairs files (this writes the `mclInput` file used in the next step): `orthomclDumpPairsFiles config_file`
- MCL analysis. It is possible here to try different inflation values (`-I`): `mcl mclInput --abc -I 1.5 -o 1.5-mclOutput`
- Get group names for the MCL clusters: `orthomclMclToGroups group_name number < mcl_output_file > group_output_file`
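
The group file written by the last step has one cluster per line, in the form `group_id: prefix|gene prefix|gene ...`. As a quick sanity check before running the downstream scripts, it can be read with a few lines of Python. This is a minimal sketch with hypothetical function names (`read_groups`, `core_groups`); it is not taken from the custom scripts listed below.

```python
def read_groups(path):
    """Map each group ID to its list of 'prefix|gene' members."""
    groups = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            group_id, members = line.split(":", 1)
            groups[group_id.strip()] = members.split()
    return groups


def core_groups(groups, genome_prefixes):
    """Return the IDs of groups that contain every genome prefix."""
    wanted = set(genome_prefixes)
    core = []
    for group_id, members in groups.items():
        # The genome prefix is the part before the first '|'.
        present = {member.split("|", 1)[0] for member in members}
        if wanted <= present:
            core.append(group_id)
    return core
```

Comparing the number of core groups across different inflation values is one simple way to see how much the choice of `-I` changes the clustering.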
###Analyze the orthoMCL output with the custom python scripts
- PrepareGenome_OrthoMCL.py
- AnnotateOrthoMCL_Clusters.py
- ClusterAlignmentTree.py