This is the Gene Set Characterization Pipeline from the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence.
This pipeline ranks a user-supplied gene set against the KnowEnG gene set collection.
You can choose from three gene set characterization methods:
Options | Method | Parameters |
---|---|---|
Fisher exact test | Fisher | fisher |
Discriminative Random Walks with Restart | DRaWR | DRaWR |
Net Path | Net Path | net_path |
git clone https://github.com/KnowEnG/GeneSet_Characterization_Pipeline.git
apt-get install -y python3-pip
apt-get install -y libblas-dev liblapack-dev libatlas-base-dev gfortran
pip3 install numpy==1.11.1
pip3 install pandas==0.18.1
pip3 install scipy==0.19.1
pip3 install scikit-learn==0.17.1
apt-get install -y libfreetype6-dev libxft-dev
pip3 install matplotlib==1.4.2
pip3 install pyyaml
pip3 install knpackage
cd GeneSet_Characterization_Pipeline
cd test
make env_setup
- Run fisher pipeline
make run_fisher
- Run DRaWR pipeline
make run_drawr
- Run Net Path pipeline
make run_netpath
Follow steps 1-3 above then do the following:
mkdir run_directory
cd run_directory
mkdir results_directory
Example run_parameters files are in GeneSet_Characterization_Pipeline/data/run_files, for example BENCHMARK_1_fisher.yml.
- Update PYTHONPATH environment variable
export PYTHONPATH='../src':$PYTHONPATH
- Run
python3 ../src/geneset_characterization.py -run_directory ./run_dir -run_file BENCHMARK_1_fisher.yml
Key | Value | Comments |
---|---|---|
method | DRaWR or fisher or net_path | Choose DRaWR or fisher or Net Path as the gene set characterization method |
pg_network_name_full_path | directory+pg_network_name | Path and file name of the 4-column property file |
gg_network_name_full_path | directory+gg_network_name | Path and file name of the 4-column network file (needed in DRaWR and Net Path) |
spreadsheet_name_full_path | directory+spreadsheet_name | Path and file name of user supplied gene sets |
gene_names_map | directory+gene_names_map | Map ENSEMBL names to user specified gene names |
results_directory | directory | Directory to save the output files |
rwr_max_iterations | 500 | Maximum number of iterations without convergence in random walk with restart (needed in DRaWR and Net Path) |
rwr_convergence_tolerence | 0.0001 | Frobenius norm tolerance of spreadsheet vector in random walk (needed in DRaWR and Net Path) |
rwr_restart_probability | 0.5 | alpha in V_(n+1) = alpha * N * V_n + (1 - alpha) * V_0 (needed in DRaWR and Net Path) |
k_space | 100 | Number of dimensions of the new space in SVD (only needed in Net Path) |
max_cpu | 4 | Maximum number of processors to use |
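The three rwr_ parameters interact as in the following minimal sketch, which iterates the update formula exactly as written in the table. It is not the pipeline's own code: the function name `random_walk_with_restart`, the toy network, and the use of plain Python lists are assumptions made for illustration, and the network is assumed to be column-normalized.

```python
import math

def random_walk_with_restart(network, restart, alpha=0.5,
                             tolerance=1e-4, max_iterations=500):
    """Iterate V_(n+1) = alpha * N * V_n + (1 - alpha) * V_0.

    network: column-normalized matrix N as a list of rows (assumption).
    restart: restart vector V_0 as a list of floats that sums to 1.
    Returns the final vector and the number of iterations used.
    """
    current = list(restart)
    for step in range(1, max_iterations + 1):
        # One walk step: multiply the network matrix by the current vector.
        walked = [sum(w * v for w, v in zip(row, current)) for row in network]
        # Blend the walk step with the restart vector.
        updated = [alpha * w + (1 - alpha) * r for w, r in zip(walked, restart)]
        # For a vector, the Frobenius norm is the Euclidean norm.
        change = math.sqrt(sum((u - c) ** 2 for u, c in zip(updated, current)))
        current = updated
        if change < tolerance:
            return current, step
    return current, max_iterations

# Hypothetical 3-gene network; every column sums to 1.
N = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
V0 = [1.0, 0.0, 0.0]  # all restart weight on the first gene
scores, steps = random_walk_with_restart(N, V0)
print(scores, steps)
```

With the defaults from the table, the loop stops as soon as the change between successive vectors drops below 0.0001, or after 500 iterations if it never does.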
pg_network_name = kegg_pathway_property_gene.edge
gg_network_name = STRING_experimental_gene_gene.edge
spreadsheet_name = ProGENI_rwr20_STExp_GDSC_500.rname.gxc.tsv
gene_names_map = ProGENI_rwr20_STExp_GDSC_500_MAP.rname.gxc.tsv
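As a rough illustration of how the directory+name entries combine, the sketch below assembles the parameters that the table suggests a fisher run needs and renders them as simple key: value lines. The helper names `build_run_parameters` and `to_yaml_text` and the directory `/path/to/data` are placeholders, not part of the pipeline; compare the result with the shipped BENCHMARK_1_fisher.yml before relying on it.

```python
import os

def build_run_parameters(data_directory, results_directory):
    """Assemble fisher run parameters; each path is directory + file name."""
    return {
        "method": "fisher",
        "pg_network_name_full_path": os.path.join(
            data_directory, "kegg_pathway_property_gene.edge"),
        "spreadsheet_name_full_path": os.path.join(
            data_directory, "ProGENI_rwr20_STExp_GDSC_500.rname.gxc.tsv"),
        "gene_names_map": os.path.join(
            data_directory, "ProGENI_rwr20_STExp_GDSC_500_MAP.rname.gxc.tsv"),
        "results_directory": results_directory,
        "max_cpu": 4,
    }

def to_yaml_text(parameters):
    """Render flat key/value pairs as 'key: value' lines (valid simple YAML)."""
    return "".join("{}: {}\n".format(key, value)
                   for key, value in parameters.items())

# '/path/to/data' is a placeholder; point it at your own data directory.
params = build_run_parameters("/path/to/data", "./results_directory")
yaml_text = to_yaml_text(params)
print(yaml_text)
```

A DRaWR or Net Path run would additionally need gg_network_name_full_path and the rwr_ keys, and Net Path also needs k_space, as noted in the table.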
- Each of the three methods saves the sorted properties for each gene set in a file named {method}_ranked_by_property_{timestamp}.df.
user gene set name1 | user gene set name2 | ... | user gene set name n |
---|---|---|---|
property (most significant) | property (most significant) | ... | property (most significant) |
... | ... | ... | ... |
property (least significant) | property (least significant) | ... | property (least significant) |
- The Fisher method saves one output file with seven columns, sorted in descending order based on pval. The name of the file is fisher_sorted_by_property_score_{timestamp}.df.
user_gene_set | property_gene_set | pval | universe_count | user_count | property_count | overlap_count |
---|---|---|---|---|---|---|
user gene 1 | property 1 | float | int | int | int | int |
... | ... | ... | ... | ... | ... | ... |
user gene n | property m | float | int | int | int | int |
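The four count columns are the inputs to a 2x2 enrichment test. As a hedged sketch, a one-sided Fisher exact test on those counts is the upper tail of a hypergeometric distribution, which can be computed directly. The function name `fisher_right_tail` and the example counts are made up for illustration, and how the pipeline itself computes or transforms pval is not specified here. The code needs Python 3.8 or later for `math.comb`.

```python
from math import comb

def fisher_right_tail(universe_count, user_count, property_count, overlap_count):
    """One-sided p-value: probability of an overlap at least this large.

    Draw user_count genes from a universe of universe_count genes, of which
    property_count belong to the property gene set, and sum the
    hypergeometric probabilities for overlaps >= overlap_count.
    """
    largest_overlap = min(user_count, property_count)
    total = comb(universe_count, user_count)
    tail = sum(
        comb(property_count, k)
        * comb(universe_count - property_count, user_count - k)
        for k in range(overlap_count, largest_overlap + 1)
    )
    return tail / total

# Hypothetical counts: 20 genes in the universe, 5 in the user set,
# 6 in the property set, 4 shared between the two.
p_value = fisher_right_tail(universe_count=20, user_count=5,
                            property_count=6, overlap_count=4)
print(p_value)
```

A small value indicates that the overlap between the user gene set and the property gene set is larger than expected by chance.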
- The DRaWR method saves two output files with five columns each, sorted in descending order based on difference_score. The files are DRaWR_sorted_by_gene_score_{timestamp}.df and DRaWR_sorted_by_property_score_{timestamp}.df.
user_gene_set | gene_node_id | difference_score | query_score | baseline_score |
---|---|---|---|---|
user gene 1 | gene node 1 | float | float | float |
... | ... | ... | ... | ... |
user gene n | gene node m | float | float | float |
user_gene_set | property_gene_set | difference_score | query_score | baseline_score |
---|---|---|---|---|
user gene 1 | property 1 | float | float | float |
... | ... | ... | ... | ... |
user gene n | property m | float | float | float |
- The Net Path method saves one output file with three columns, sorted in descending order based on cosine_sum. The name of the file is net_path_sorted_by_property_score_{timestamp}.df.
user_gene_set | property_gene_set | cosine_sum |
---|---|---|
user gene 1 | property 1 | float |
... | ... | ... |
user gene n | property m | float |
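To inspect a result table programmatically, something like the following sketch may help. It assumes the .df files are tab-separated text with a header row, which is not stated above, so check one of your own output files first. The helper `top_properties` and the example rows are made up for illustration.

```python
import csv
import io

def top_properties(table_text, score_column, limit=3):
    """Return the rows with the highest score, highest first."""
    rows = list(csv.DictReader(io.StringIO(table_text), delimiter="\t"))
    rows.sort(key=lambda row: float(row[score_column]), reverse=True)
    return rows[:limit]

# Made-up rows in the Net Path layout, for illustration only.
example = (
    "user_gene_set\tproperty_gene_set\tcosine_sum\n"
    "set_A\tproperty_1\t0.12\n"
    "set_A\tproperty_2\t0.87\n"
    "set_A\tproperty_3\t0.45\n"
)
best = top_properties(example, "cosine_sum", limit=2)
for row in best:
    print(row["property_gene_set"], row["cosine_sum"])
```

The same helper works for the DRaWR tables by passing difference_score as the score column; for a real file, read its contents with `open(path).read()` and pass the text in.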