-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathalternative_to_odds_ratio.Rmd
437 lines (341 loc) · 11.5 KB
/
alternative_to_odds_ratio.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
---
title: "alternative to odds ratio"
output: html_document
---
Clean version of alternative to odds ratio!
The problem of odds ratio is that it is not really interpretable. In order to solve this, this markdown aims at doing another statistical analysis. It looks at what % of compounds does the top X % of matching compounds contain at least 1 compound with the same annotated MOA.
```{r setup, include=FALSE}
library(magrittr)
library(dplyr)
library(foreach)
library(doMC)
library(reshape2)
library(stringr)
```
## Parameters
```{r parameters}
# top percentage of matching compound.
top.x <- 0.02 # 0.1 = 10%, 0.05 = 5%, 0.02 = 2%
# for reproductibility
seed <- 42
set.seed(seed)
# number of CPU cores for parallelization
registerDoMC(7)
```
## Import data
```{r import data, message=FALSE}
filename <- "Hit_pearson_random_100_5838.rds" #"Hit_jaccard_30n_FS2_svd_5950.rds"
#"Hit_jaccard_30n_FS2_svd_5950.rds", "Hit_jaccard_30n_fs_6206.rds", "Hit_jaccard_50n_6225.rds", "Hit_pearson_FS2_svd_6438.rds", "Hit_pearson_fs_5975.rds", "Hit_pearson_5925.rds"
pf <- readRDS(file.path("..", "..", "input", "BBBC022_2013", "single_cells", filename))
profiles <- pf$data
variables <- pf$feat_cols
```
## Metadata
Select compounds that have a MOA that is at least shared with another compounds. In final 88 MOAs are selected. For 412 compounds.
moa size: 432x3 (because some compounds have more than 1 MOA)
```{r MOAs}
# import MOAs data
moa <-
read.csv("../../input/BBBC022_2013/MOAs.csv", na.strings = c("", "NA")) %>%
mutate_if(is.factor, as.character) # has to do that to duplicate rows where multiple MOA
moa %<>%
filter(!is.na(MOA_group)) %>% # remove rows where moa = NA
rename(Image_Metadata_SOURCE_COMPOUND_NAME = Name) # rename compound name
# remove duplicate compounds
moa$Image_Metadata_SOURCE_COMPOUND_NAME <- toupper(moa$Image_Metadata_SOURCE_COMPOUND_NAME) # put compound names in upper character (to make it comparable)
moa %<>%
group_by(Image_Metadata_SOURCE_COMPOUND_NAME) %>%
slice(1) %>%
ungroup
# duplicate rows where multiple MOA associated to one compounds
for (i in 1:nrow(moa)){
# if there are more than 1 moa associated
if (str_detect(moa$MOA_group[i], ",")){
t1 <- str_trim(str_split(moa$MOA_group[i], ",")[[1]])
moa$MOA_group[i] <- t1[1]
new.row <- moa[i,]
new.row$MOA_group <- t1[2]
moa <- rbind(moa, new.row)
}
}
```
Compound-ID: 959 unique (same number as in the other!)
```{r compound-id}
# mapping IDs-Compounds: give access to the compound knowing the IDs
id.cmpds <-
pf$data %>%
dplyr::select(one_of(c("Image_Metadata_BROAD_ID","Image_Metadata_SOURCE_COMPOUND_NAME"))) %>% # select ID and Compound
mutate_if(is.character, funs(toupper)) %>% # same compound can be writen in upper and lower case and looks different# be sure it is a dataframe
group_by(Image_Metadata_SOURCE_COMPOUND_NAME) %>%
slice(1) %>%
ungroup
```
```{r metadata}
# metadata: compound, MOA, target, ID
metadata <-
moa %>%
left_join(., id.cmpds, by = "Image_Metadata_SOURCE_COMPOUND_NAME")
# select only rows that have a BROAD_ID
metadata %<>% filter(!is.na(Image_Metadata_BROAD_ID))
# select MOA that are appearing more than once (meaning at least two compounds are related to it)
n.MOA <- table(metadata$MOA_group) %>% as.data.frame() %>% filter(Freq != 1)
metadata %<>% filter(MOA_group %in% n.MOA$Var1)
```
## Profiles of the compound
Do the average along replicates.
```{r compound profile}
# average along replicate (keep only id and variables)
pf.cmpds <-
pf$data %>%
filter(Image_Metadata_BROAD_ID %in% metadata$Image_Metadata_BROAD_ID) %>% # select ID that have a unique compound
group_by(Image_Metadata_BROAD_ID) %>%
summarise_each(funs(mean(., na.rm=TRUE)), -c(Image_Metadata_SOURCE_COMPOUND_NAME, Well, Plate)) %>% # mean along replicate
ungroup %>%
as.data.frame()
# keep track of ID of the compounds
row.names(pf.cmpds) <- pf.cmpds$Image_Metadata_BROAD_ID
```
## Combination profile + metadata
```{r profile + metadata}
# Attention: since some compounds have more than one MOAs, few rows have same compounds name.
pf.cmpd.meta <-
metadata %>%
dplyr::left_join(., pf.cmpds, by = "Image_Metadata_BROAD_ID") %>%
as.data.frame
dim(pf.cmpd.meta)
```
## Number of MOA in common for each compounds pair
```{r}
# binary indicator matrix of ID vs MOA
n.moa.ID <-
pf.cmpd.meta %>%
select(MOA_group, Image_Metadata_BROAD_ID) %>%
table %>%
as.data.frame.matrix %>%
as.matrix
# number of moa in common for each ID pairs
n.moa.cmpd.pair <-
t(n.moa.ID) %*% n.moa.ID %>% # calculate matrix compound ID - compound ID relation
melt(.) %>% # transform matrix into column
filter(Var1 != Var2)
```
## Correlation compound-compound
```{r correlation cmpd-cmpd}
# correlation compound-compound
cor.cmpd <-
pf.cmpds[, variables] %>%
as.matrix() %>%
t() %>%
cor()
```
## Percentage
Percentage of compound that have at least 1 compound with same MOA in the top X% of correlation.
```{r function knn}
top.percentage.matching.moa <- function(cor.cmpd, n.moa.cmpd.pair){
cor.cmpd.pair <-
melt(cor.cmpd) %>%
rename(cmpd1 = Var1, cmpd2 = Var2, corr = value) %>% # rename columns
filter(cmpd1 != cmpd2) %>% # remove column were same compound
left_join(.,
n.moa.cmpd.pair,
by = c("cmpd1" = "Var1", "cmpd2" = "Var2"))
top.moa.matching <-
cor.cmpd.pair %>%
group_by(cmpd1) %>% # group by compound
arrange(cmpd1, desc(corr)) %>% # sort correlation from higher to lower
filter(corr > quantile(corr, 1.0-top.x)) %>% # look at the top 10% correlation in each group
summarise(p = sum(value)/n()) %>% # percentage of similar moa
ungroup()
final.number <-
top.moa.matching %>%
filter(p > 0) %>% # p bigger than 0 mean that at least there is one moa in common
summarise(n = n()/nrow(top.moa.matching)) %>% # number of compound that have a least one MOA in common divided by the total number of compounds
as.numeric()
return(final.number)
}
```
```{r knn data}
final.number <- top.percentage.matching.moa(cor.cmpd, n.moa.cmpd.pair)
```
## Baseline
1000x do random shuffles of rows -> give a way to compare result with randomness
```{r baseline}
start.time <- Sys.time()
v <- rownames(cor.cmpd) # extract names of the rows and columns to shuffle
set.seed(seed)
N <- 1000 # number of time reproduce the same analysis
seeds <- sample(1:10000, N, replace=F)
random.percent <- foreach(i = 1:N, .combine=cbind) %dopar% {
# for reproducibility
set.seed(seeds[i])
# randomly shuffle names of the compounds
t <- sample(v)
# shuflle in the same random way the names of the rows and the columns
cor.comp.random <- cor.cmpd
rownames(cor.comp.random) <- t
colnames(cor.comp.random) <- t
# knn for random
random.percent <- top.percentage.matching.moa(cor.comp.random, n.moa.cmpd.pair)
}
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken # 50 secs
```
## Results
```{r mean and 95 quantile}
# mean of the baseline
mean.random <- mean(random.percent)
# 95-quantile of the baseline
quant <- quantile(random.percent, .95)
#print result
print("Mean of the baseline: ")
mean.random
print("95-quantile of the baseline: ")
quant
print("percentage of the data: ")
final.number
```
## Results with random features selection from 100 to 700
# Jaccard
For 100:
Mean of the baseline: 0.1468037
95-quantile of the baseline: 0.1917808
percentage of the data: 0.3105023
For 200:
Mean of the baseline: 0.1465429
95-quantile of the baseline: 0.1877551
percentage of the data: 0.3142857
For 300:
Mean of the baseline: 0.146032
95-quantile of the baseline: 0.184
percentage of the data: 0.324
For 400:
Mean of the baseline: 0.1478452
95-quantile of the baseline: 0.1882845
percentage of the data: 0.334728
For 500:
Mean of the baseline: 0.1600837
95-quantile of the baseline: 0.2070485
percentage of the data: 0.3524229
For 600:
Mean of the baseline: 0.1574522
95-quantile of the baseline: 0.2
percentage of the data: 0.3478261
For 700:
Mean of the baseline: 0.1591293
95-quantile of the baseline: 0.2025862
percentage of the data: 0.3663793
# Pearson
For 100:
Mean of the baseline: 0.1500418
95-quantile of the baseline: 0.1924686
percentage of the data: 0.3138075
For 200:
Mean of the baseline: 0.1532301
95-quantile of the baseline: 0.199115
percentage of the data: 0.2876106
For 300:
Mean of the baseline: 0.1507574
95-quantile of the baseline: 0.1914894
percentage of the data: 0.3021277
For 400:
Mean of the baseline: 0.1486473
95-quantile of the baseline: 0.186722
percentage of the data: 0.2987552
For 500:
Mean of the baseline: 0.1480805
95-quantile of the baseline: 0.190678
percentage of the data: 0.3008475
For 600:
Mean of the baseline: 0.150089
95-quantile of the baseline: 0.1949153
percentage of the data: 0.3050847
For 700:
Mean of the baseline: 0.148958
95-quantile of the baseline: 0.1932773
percentage of the data: 0.2983193
## Results with features selection from 100 to 700 SVD-entropy FS2
# Jaccard
For 100:
Mean of the baseline: 0.0832695
95-quantile of the baseline: 0.1276596
percentage of the data: 0.2624113
For 200:
Mean of the baseline: 0.138462
95-quantile of the baseline: 0.1847826
percentage of the data: 0.3097826
For 300:
Mean of the baseline: 0.1477851
95-quantile of the baseline: 0.1929825
percentage of the data: 0.2894737
For 400:
Mean of the baseline: 0.1545439
95-quantile of the baseline: 0.1973684
percentage of the data: 0.3245614
For 500:
Mean of the baseline: 0.1599414
95-quantile of the baseline: 0.2072072
percentage of the data: 0.3243243
For 600:
Mean of the baseline: 0.1432762
95-quantile of the baseline: 0.1841004
percentage of the data: 0.3138075
For 700:
Mean of the baseline: 0.1555862
95-quantile of the baseline: 0.1982759
percentage of the data: 0.3362069
# Pearson
For 100:
Mean of the baseline: 0.09419728
95-quantile of the baseline: 0.1428571
percentage of the data: 0.2312925
For 200:
Mean of the baseline: 0.1261397
95-quantile of the baseline: 0.1731844
percentage of the data: 0.2849162
For 300:
Mean of the baseline: 0.1557892
95-quantile of the baseline: 0.1973094
percentage of the data: 0.2959641
For 400:
Mean of the baseline: 0.1487215
95-quantile of the baseline: 0.1898734
percentage of the data: 0.3037975
For 500:
Mean of the baseline: 0.1445021
95-quantile of the baseline: 0.1851852
percentage of the data: 0.3004115
For 600:
Mean of the baseline: 0.1483191
95-quantile of the baseline: 0.187234
percentage of the data: 0.293617
For 700:
Mean of the baseline: 0.14541
95-quantile of the baseline: 0.1882845
percentage of the data: 0.3054393
## Results only with compounds that have at least one shared MOA (new results)
For top 2%
For Jaccard Hit selection with FS2 svd-entropy feature selection:
Mean of the baseline: 0.1430
95-quantile of the baseline: 0.1878
percentage of the data: 0.3144
For Jaccard Hit selection with findCorrelation feature selection:
Mean of the baseline: 0.1505
95-quantile of the baseline: 0.1901
percentage of the data: 0.2975
For Jaccard Hit selection with no feature selection:
Mean of the baseline: 0.1529
95-quantile of the baseline: 0.1949
percentage of the data: 0.3475
For Pearson Hit selection with FS2 svd-entropy feature selection:
Mean of the baseline: 0.1680
95-quantile of the baseline: 0.2095
percentage of the data: 0.3123
For Pearson Hit selection with findCorrelation feature selection:
Mean of the baseline: 0.1513
95-quantile of the baseline: 0.1909
percentage of the data: 0.2839
For Pearson Hit selection with no feature selection:
Mean of the baseline: 0.1503
95-quantile of the baseline: 0.1888
percentage of the data: 0.3004