alternative_to_odds_ratio.Rmd

---
title: "alternative to odds ratio"
output: html_document
---

Clean version of alternative to odds ratio!

The problem of odds ratio is that it is not really interpretable. In order to solve this, this markdown aims at doing another statistical analysis. It looks at what % of compounds does the top X % of matching compounds contain at least 1 compound with the same annotated MOA.

```{r setup, include=FALSE}
library(magrittr)
library(dplyr)
library(foreach)
library(doMC)
library(reshape2)
library(stringr)
```

## Parameters
```{r parameters}
# top percentage of matching compound.
top.x <- 0.02 # 0.1 = 10%, 0.05 = 5%, 0.02 = 2%

# for reproductibility
seed <- 42
set.seed(seed)

# number of CPU cores for parallelization
registerDoMC(7)
```

## Import data

```{r import data, message=FALSE}

filename <- "Hit_pearson_random_100_5838.rds" #"Hit_jaccard_30n_FS2_svd_5950.rds" 
#"Hit_jaccard_30n_FS2_svd_5950.rds", "Hit_jaccard_30n_fs_6206.rds", "Hit_jaccard_50n_6225.rds", "Hit_pearson_FS2_svd_6438.rds", "Hit_pearson_fs_5975.rds", "Hit_pearson_5925.rds"

pf <- readRDS(file.path("..", "..", "input", "BBBC022_2013", "single_cells", filename))

profiles <- pf$data

variables <- pf$feat_cols
```

## Metadata 

Select compounds that have a MOA that is at least shared with another compounds. In final 88 MOAs are selected. For 412 compounds.
moa size: 432x3 (because some compounds have more than 1 MOA)
```{r MOAs}
# import MOAs data
moa <- 
  read.csv("../../input/BBBC022_2013/MOAs.csv", na.strings = c("", "NA")) %>%
  mutate_if(is.factor, as.character) # has to do that to duplicate rows where multiple MOA

moa %<>% 
  filter(!is.na(MOA_group)) %>% # remove rows where moa = NA
  rename(Image_Metadata_SOURCE_COMPOUND_NAME = Name) # rename compound name

# remove duplicate compounds
moa$Image_Metadata_SOURCE_COMPOUND_NAME <- toupper(moa$Image_Metadata_SOURCE_COMPOUND_NAME) # put compound names in upper character (to make it comparable)
moa %<>%
  group_by(Image_Metadata_SOURCE_COMPOUND_NAME) %>%
  slice(1) %>%
  ungroup

# duplicate rows where multiple MOA associated to one compounds
for (i in 1:nrow(moa)){
  # if there are more than 1 moa associated
  if (str_detect(moa$MOA_group[i], ",")){
    t1 <- str_trim(str_split(moa$MOA_group[i], ",")[[1]])
    moa$MOA_group[i] <- t1[1]
    new.row <- moa[i,]
    new.row$MOA_group <- t1[2]
    moa <- rbind(moa, new.row)
  }
}
```

Compound-ID: 959 unique (same number as in the other!)
```{r compound-id}
# mapping IDs-Compounds: give access to the compound knowing the IDs
id.cmpds <-
  pf$data %>%
  dplyr::select(one_of(c("Image_Metadata_BROAD_ID","Image_Metadata_SOURCE_COMPOUND_NAME"))) %>% # select ID and Compound
  mutate_if(is.character, funs(toupper)) %>% # same compound can be writen in upper and lower case and looks different# be sure it is a dataframe
  group_by(Image_Metadata_SOURCE_COMPOUND_NAME) %>%
  slice(1) %>%
  ungroup
```

```{r metadata}
# metadata: compound, MOA, target, ID
metadata <- 
  moa %>%
  left_join(., id.cmpds, by = "Image_Metadata_SOURCE_COMPOUND_NAME")

# select only rows that have a BROAD_ID
metadata %<>% filter(!is.na(Image_Metadata_BROAD_ID))

# select MOA that are appearing more than once (meaning at least two compounds are related to it)
n.MOA <- table(metadata$MOA_group) %>% as.data.frame() %>% filter(Freq != 1)
metadata %<>%  filter(MOA_group %in% n.MOA$Var1)
```

## Profiles of the compound

Do the average along replicates.

```{r compound profile}
# average along replicate (keep only id and variables)
pf.cmpds <-
  pf$data %>%
  filter(Image_Metadata_BROAD_ID %in% metadata$Image_Metadata_BROAD_ID) %>% # select ID that have a unique compound
  group_by(Image_Metadata_BROAD_ID) %>% 
  summarise_each(funs(mean(., na.rm=TRUE)), -c(Image_Metadata_SOURCE_COMPOUND_NAME, Well, Plate)) %>% # mean along replicate
  ungroup %>%
  as.data.frame()

# keep track of ID of the compounds
row.names(pf.cmpds) <- pf.cmpds$Image_Metadata_BROAD_ID
```

## Combination profile + metadata

```{r profile + metadata}
# Attention: since some compounds have more than one MOAs, few rows have same compounds name.
pf.cmpd.meta <- 
  metadata %>%
  dplyr::left_join(., pf.cmpds, by = "Image_Metadata_BROAD_ID") %>%
  as.data.frame
dim(pf.cmpd.meta)
```

## Number of MOA in common for each compounds pair

```{r}
# binary indicator matrix of ID vs MOA 
n.moa.ID <- 
  pf.cmpd.meta %>%
  select(MOA_group, Image_Metadata_BROAD_ID) %>%
  table %>%
  as.data.frame.matrix %>%
  as.matrix

# number of moa in common for each ID pairs
n.moa.cmpd.pair <-
  t(n.moa.ID)  %*% n.moa.ID %>% # calculate matrix compound ID - compound ID relation
  melt(.) %>% # transform matrix into column
  filter(Var1 != Var2)

```

## Correlation compound-compound 

```{r correlation cmpd-cmpd}
# correlation compound-compound
cor.cmpd <-
  pf.cmpds[, variables] %>% 
  as.matrix() %>%
  t() %>%
  cor()
```

## Percentage
Percentage of compound that have at least 1 compound with same MOA in the top X% of correlation. 

```{r function knn}
top.percentage.matching.moa <- function(cor.cmpd, n.moa.cmpd.pair){
  cor.cmpd.pair <- 
    melt(cor.cmpd) %>%
    rename(cmpd1 = Var1, cmpd2 = Var2, corr = value) %>% # rename columns
    filter(cmpd1 != cmpd2) %>% # remove column were same compound
    left_join(.,
              n.moa.cmpd.pair,
              by = c("cmpd1" = "Var1", "cmpd2" = "Var2"))
  
  top.moa.matching <- 
    cor.cmpd.pair %>%
    group_by(cmpd1) %>% # group by compound
    arrange(cmpd1, desc(corr)) %>% # sort correlation from higher to lower
    filter(corr > quantile(corr, 1.0-top.x)) %>% # look at the top 10% correlation in each group
    summarise(p = sum(value)/n()) %>% # percentage of similar moa
    ungroup()

  final.number <-
    top.moa.matching %>%
    filter(p > 0) %>% # p bigger than 0 mean that at least there is one moa in common
    summarise(n = n()/nrow(top.moa.matching)) %>% # number of compound that have a least one MOA in common divided by the total number of compounds
    as.numeric()

  return(final.number)
}

```

```{r knn data}

final.number <- top.percentage.matching.moa(cor.cmpd, n.moa.cmpd.pair)

```


## Baseline
1000x do random shuffles of rows -> give a way to compare result with randomness

```{r baseline}
start.time <- Sys.time()

v <- rownames(cor.cmpd) # extract names of the rows and columns to shuffle

set.seed(seed)
N <- 1000 # number of time reproduce the same analysis
seeds <- sample(1:10000, N, replace=F)


random.percent <- foreach(i = 1:N, .combine=cbind) %dopar% {
  # for reproducibility
  set.seed(seeds[i])
  # randomly shuffle names of the compounds
  t <- sample(v)
  
  # shuflle in the same random way the names of the rows and the columns
  cor.comp.random <- cor.cmpd
  rownames(cor.comp.random) <- t
  colnames(cor.comp.random) <- t
  
  # knn for random
  random.percent <- top.percentage.matching.moa(cor.comp.random, n.moa.cmpd.pair)
}

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken # 50 secs
```

## Results

```{r mean and 95 quantile}
# mean of the baseline
mean.random <- mean(random.percent)

# 95-quantile of the baseline
quant <- quantile(random.percent, .95)

#print result
print("Mean of the baseline: ")
mean.random
print("95-quantile of the baseline: ")
quant
print("percentage of the data: ")
final.number

```
## Results with random features selection from 100 to 700 
# Jaccard

For 100:
Mean of the baseline: 0.1468037
95-quantile of the baseline: 0.1917808 
percentage of the data: 0.3105023

For 200:
Mean of the baseline: 0.1465429
95-quantile of the baseline: 0.1877551 
percentage of the data: 0.3142857

For 300:
Mean of the baseline: 0.146032
95-quantile of the baseline: 0.184 
percentage of the data: 0.324

For 400:
Mean of the baseline: 0.1478452
95-quantile of the baseline: 0.1882845 
percentage of the data: 0.334728

For 500:
Mean of the baseline: 0.1600837
95-quantile of the baseline: 0.2070485 
percentage of the data: 0.3524229

For 600:
Mean of the baseline: 0.1574522
95-quantile of the baseline: 0.2 
percentage of the data: 0.3478261

For 700:
Mean of the baseline: 0.1591293
95-quantile of the baseline: 0.2025862 
percentage of the data: 0.3663793

# Pearson

For 100:
Mean of the baseline: 0.1500418
95-quantile of the baseline: 0.1924686 
percentage of the data: 0.3138075

For 200:
Mean of the baseline: 0.1532301
95-quantile of the baseline: 0.199115 
percentage of the data: 0.2876106

For 300:
Mean of the baseline: 0.1507574
95-quantile of the baseline: 0.1914894 
percentage of the data: 0.3021277

For 400:
Mean of the baseline: 0.1486473
95-quantile of the baseline: 0.186722 
percentage of the data: 0.2987552

For 500:
Mean of the baseline: 0.1480805
95-quantile of the baseline: 0.190678 
percentage of the data: 0.3008475

For 600:
Mean of the baseline: 0.150089
95-quantile of the baseline: 0.1949153 
percentage of the data: 0.3050847

For 700:
Mean of the baseline: 0.148958
95-quantile of the baseline: 0.1932773 
percentage of the data: 0.2983193

## Results with features selection from 100 to 700 SVD-entropy FS2
# Jaccard

For 100:
Mean of the baseline: 0.0832695
95-quantile of the baseline: 0.1276596 
percentage of the data: 0.2624113

For 200:
Mean of the baseline: 0.138462
95-quantile of the baseline: 0.1847826 
percentage of the data: 0.3097826

For 300:
Mean of the baseline: 0.1477851
95-quantile of the baseline: 0.1929825 
percentage of the data: 0.2894737

For 400:
Mean of the baseline: 0.1545439
95-quantile of the baseline: 0.1973684 
percentage of the data: 0.3245614

For 500:
Mean of the baseline: 0.1599414
95-quantile of the baseline: 0.2072072 
percentage of the data: 0.3243243

For 600:
Mean of the baseline: 0.1432762
95-quantile of the baseline: 0.1841004 
percentage of the data: 0.3138075

For 700:
Mean of the baseline: 0.1555862
95-quantile of the baseline: 0.1982759 
percentage of the data: 0.3362069

# Pearson

For 100:
Mean of the baseline: 0.09419728
95-quantile of the baseline: 0.1428571 
percentage of the data: 0.2312925

For 200:
Mean of the baseline: 0.1261397
95-quantile of the baseline: 0.1731844 
percentage of the data: 0.2849162

For 300:
Mean of the baseline: 0.1557892
95-quantile of the baseline: 0.1973094 
percentage of the data: 0.2959641

For 400:
Mean of the baseline: 0.1487215
95-quantile of the baseline: 0.1898734 
percentage of the data: 0.3037975

For 500:
Mean of the baseline: 0.1445021
95-quantile of the baseline: 0.1851852 
percentage of the data: 0.3004115

For 600:
Mean of the baseline: 0.1483191
95-quantile of the baseline: 0.187234 
percentage of the data: 0.293617

For 700:
Mean of the baseline: 0.14541
95-quantile of the baseline: 0.1882845 
percentage of the data: 0.3054393

## Results only with compounds that have at least one shared MOA (new results)

For top 2%

For Jaccard Hit selection with FS2 svd-entropy feature selection:
Mean of the baseline: 0.1430
95-quantile of the baseline: 0.1878
percentage of the data: 0.3144

For Jaccard Hit selection with findCorrelation feature selection:
Mean of the baseline: 0.1505
95-quantile of the baseline: 0.1901  
percentage of the data: 0.2975

For Jaccard Hit selection with no feature selection:
Mean of the baseline: 0.1529
95-quantile of the baseline: 0.1949  
percentage of the data: 0.3475

For Pearson Hit selection with FS2 svd-entropy feature selection:
Mean of the baseline: 0.1680
95-quantile of the baseline: 0.2095  
percentage of the data: 0.3123

For Pearson Hit selection with findCorrelation feature selection:
Mean of the baseline: 0.1513
95-quantile of the baseline: 0.1909  
percentage of the data: 0.2839

For Pearson Hit selection with no feature selection:
Mean of the baseline: 0.1503
95-quantile of the baseline: 0.1888  
percentage of the data: 0.3004