---
title: "featureSelectionOld"
output: html_document
---
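This R Markdown document performs feature selection. In this version, we use the findCorrelation approach, applied here through the `correlation_threshold` operation of `cytominer::select`, to remove features that are too highly correlated with one another.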
```{r libraries, include=FALSE, message=FALSE}
library(ggplot2)
library(caret)
library(magrittr)
library(dplyr)
library(tidyverse)
library(stringr)
```
## Data
The input contains only the compounds with a high median correlation.
The input data is a 3752 by 803 matrix: 3752 observations and 799 features (extracted with CellProfiler), leaving 4 non-feature columns.
Each of the 938 compounds has 4 replicates (938 × 4 = 3752).
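
As a quick consistency check on the figures quoted above, the arithmetic can be written out explicitly. This is only a sketch: the variable names are introduced here for illustration and do not appear in the analysis code.

```r
# Figures quoted in the text above
n_compounds  <- 938
n_replicates <- 4
n_columns    <- 803
n_features   <- 799

n_observations <- n_compounds * n_replicates   # rows expected in the matrix
n_other_cols   <- n_columns - n_features       # columns that are not features

stopifnot(n_observations == 3752)
print(c(observations = n_observations, non_feature_columns = n_other_cols))
```
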
```{r import data old, message=FALSE}
set.seed(42)
# name of the data file
filename <- "Pf_Gustafsdottir.rds"
# import data
pf <- readRDS(file.path("..", "..", "input", "BBBC022_2013", "old", filename))
profiles <- pf$data
dim(profiles)
variables <- pf$feat_cols
metadata <- pf$factor_cols
```
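
The import chunk reads three elements from the RDS object: `data`, `feat_cols` and `factor_cols`. A toy object of that shape may help when testing the later chunks without the real file. Everything in this sketch (the `toy_pf` object, the column names, the compound IDs) is hypothetical and only mirrors the structure the code above appears to expect.

```r
# Hypothetical stand-in for the list stored in the RDS file
toy_pf <- list(
  data = data.frame(
    Image_Metadata_BROAD_ID = c("BRD-0001", "BRD-0001", ""),  # made-up IDs
    Cells_feature_1  = c(0.1, 0.2, 0.3),
    Nuclei_feature_2 = c(1.0, 0.9, 1.1),
    stringsAsFactors = FALSE
  ),
  feat_cols   = c("Cells_feature_1", "Nuclei_feature_2"),
  factor_cols = "Image_Metadata_BROAD_ID"
)

# Round trip through a temporary RDS file, as saveRDS/readRDS would do
tmp <- tempfile(fileext = ".rds")
saveRDS(toy_pf, tmp)
restored <- readRDS(tmp)

# Base R equivalent of the prefix-based feature lookup used in this document
feature_names <- grep("^Cells_|^Cytoplasm_|^Nuclei_",
                      names(restored$data), value = TRUE)
print(feature_names)
```

The prefix pattern is the same one used later to rebuild `variables` after each filtering step, so metadata columns are never treated as features.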
## Feature Selection
```{r correlation}
# remove zero-variance features
profiles %<>%
  cytominer::select(
    sample =
      profiles %>%
      filter(Image_Metadata_BROAD_ID %in% ""),
    variables = variables,
    operation = "variance_threshold"
  )

variables <-
  names(profiles) %>% str_subset("^Cells_|^Cytoplasm_|^Nuclei_")

# correlation between features
correlation <-
  profiles %>%
  select(one_of(variables)) %>%
  cor()

corrplot::corrplot(correlation, tl.cex = 0.5, method = "color", tl.pos = "n", order = "hclust")
```
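
To illustrate why the variance filter comes before the correlation matrix, here is a minimal base R sketch on toy data. It is a simplified stand-in, not cytominer's `variance_threshold` (which may use a different criterion), and the data frame `toy` and its column names are hypothetical. A constant column has zero standard deviation, so `cor()` would return `NA` for it.

```r
# Toy data: three features, one of which is constant
toy <- data.frame(
  Cells_a  = c(1, 2, 3, 4),
  Cells_b  = c(5, 5, 5, 5),   # zero variance
  Nuclei_c = c(2, 1, 4, 3)
)

# Keep only the columns whose variance is strictly positive
feature_variance <- vapply(toy, var, numeric(1))
kept <- names(feature_variance)[feature_variance > 0]
toy_filtered <- toy[, kept, drop = FALSE]

# Pearson correlation between the remaining features
toy_cor <- cor(toy_filtered)
print(kept)
print(toy_cor)
```
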
```{r find correlation}
# remove features that are too correlated (threshold of 0.9)
profiles %<>%
  cytominer::select(
    sample = profiles,
    variables = variables,
    operation = "correlation_threshold"
  )
# alternative sample, currently disabled:
# profiles %>% filter(Metadata_pert_type == "control")

variables <-
  names(profiles) %>% str_subset("^Cells_|^Cytoplasm_|^Nuclei_")

# plot the correlation after feature selection
correlation <-
  profiles %>%
  select(one_of(variables)) %>%
  cor()

corrplot::corrplot(correlation, tl.cex = 0.5, method = "color", tl.pos = "n", order = "hclust")

dim(profiles)

# save the new dataset
pf$data <- profiles
pf$feat_cols <- variables
pf %>%
  saveRDS(file.path("..", "..", "input", "BBBC022_2013", "old", "Pf_Gustafsdottir_fs.rds"))
```
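
The idea behind the correlation filter can be sketched in base R as a greedy loop: find the most correlated pair of features, drop the member of that pair with the larger mean absolute correlation to the other features, and repeat until no pair exceeds the cutoff. This is a simplified illustration under those assumptions, not the implementation used by caret or cytominer. The helper `drop_correlated` and the toy column names are hypothetical.

```r
# Simplified greedy correlation filter (illustrative only)
drop_correlated <- function(x, cutoff = 0.9) {
  cor_abs <- abs(cor(x))
  diag(cor_abs) <- 0                      # ignore self-correlation
  dropped <- character(0)
  repeat {
    if (max(cor_abs) <= cutoff) break     # no pair above the cutoff remains
    # locate the most correlated remaining pair
    idx  <- which(cor_abs == max(cor_abs), arr.ind = TRUE)[1, ]
    pair <- colnames(cor_abs)[idx]
    # drop the member with the larger mean absolute correlation
    mean_cor <- rowMeans(cor_abs[pair, , drop = FALSE])
    victim   <- pair[which.max(mean_cor)]
    dropped  <- c(dropped, victim)
    keep     <- setdiff(colnames(cor_abs), victim)
    cor_abs  <- cor_abs[keep, keep, drop = FALSE]
  }
  dropped
}

# Toy data: the first two features are almost perfectly correlated
toy <- data.frame(
  Cells_x     = c(1, 2, 3, 4, 5, 6),
  Cytoplasm_y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.1),
  Nuclei_z    = c(1, -1, 1, -1, 1, -1)
)

dropped   <- drop_correlated(toy, cutoff = 0.9)
remaining <- setdiff(names(toy), dropped)
print(dropped)
print(remaining)
```

On this toy input exactly one of the two near-duplicate features is removed and `Nuclei_z` is kept. Which of the pair goes depends on their mean absolute correlation with the other features.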