---
title: "Feature Selection SVD entropy C++"
output: html_document
---
This R Markdown document performs feature selection following the paper [Novel Unsupervised Feature Filtering of Biological Data](https://academic.oup.com/bioinformatics/article-abstract/22/14/e507/227946/Novel-Unsupervised-Feature-Filtering-of-Biological).
The method is based on SVD-entropy: features are selected according to their contribution to the entropy (CE) of the data matrix.

There are three feature selection methods:

1) Simple Ranking (SR): select the features whose entropy contribution exceeds the mean plus one standard deviation of all entropy contributions.
2) Forward Selection (FS): choose the feature with the highest contribution, then recalculate the contributions of the remaining features without it. Repeat until mc features are selected.
3) Backward Elimination (BE): repeatedly eliminate the feature with the lowest entropy contribution until only mc features are left.

For performance, the entropy computations are implemented in C++ and called from R through the Rcpp package.
The linear algebra relies on the Armadillo library.
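
To make the definitions above concrete, here is a minimal pure-R sketch of the SVD-entropy, the leave-one-out contribution of each feature, and the SR threshold. It assumes the usual formulation, in which the squared singular values are normalised to sum to one and the entropy is scaled by the log of their count. The names `svd_entropy`, `ce_leave_one_out` and the toy matrix are hypothetical and are not taken from `ranking_SVD_entropy.cpp`, whose implementation may differ.

```r
# Hypothetical helper: SVD-entropy of a feature-by-observation matrix.
svd_entropy <- function(A) {
  s <- svd(A, nu = 0, nv = 0)$d        # singular values only
  stopifnot(length(s) > 1)             # log(1) = 0 would divide by zero
  v <- s^2 / sum(s^2)                  # normalised squared singular values
  v <- v[v > 0]                        # treat 0 * log(0) as 0
  -sum(v * log(v)) / log(length(s))    # scaled to lie in [0, 1]
}

# Hypothetical helper: contribution of feature i = E(all) - E(all without i).
ce_leave_one_out <- function(A) {
  e_full <- svd_entropy(A)
  vapply(seq_len(nrow(A)),
         function(i) e_full - svd_entropy(A[-i, , drop = FALSE]),
         numeric(1))
}

# Toy data (assumption): 6 features in rows, 10 observations in columns.
set.seed(1)
A_toy <- matrix(rnorm(6 * 10), nrow = 6,
                dimnames = list(paste0("feat", 1:6), NULL))

ce_toy <- ce_leave_one_out(A_toy)
# Simple Ranking: keep the features with CE above mean + sd (may be empty).
keep_toy <- rownames(A_toy)[ce_toy > mean(ce_toy) + sd(ce_toy)]
```

Each contribution needs one SVD of the reduced matrix, so this naive version performs one decomposition per feature, which is presumably why the notebook moves the computation to C++.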
```{r libraries, include=FALSE, message=FALSE}
library(ggplot2)
library(caret)
library(magrittr)
library(dplyr)
library(tidyverse)
library(stringr)
library(Rcpp)
```
## Data
The input contains only the compounds with a high median correlation.
The input data is a 3752 by 803 matrix: 3752 observations and 799 features (extracted with CellProfiler); the remaining 4 columns are presumably the metadata columns.
There are 938 different compounds, each with 4 replicates.
```{r import data old, message=FALSE}
#set.seed(42)
# name of the data file
filename <- "Pf_Gustafsdottir.rds"
# import data
pf <- readRDS(file.path("..", "..", "input", "BBBC022_2013", "old", filename))
profiles <- pf$data
dim(profiles)
variables <- pf$feat_cols
metadata <- pf$factor_cols
```
## Feature selection
```{r feature selection with SR}
start.time <- Sys.time()
# load the c++ function
Rcpp::sourceCpp('ranking_SVD_entropy.cpp')
# transpose the dataset to have featxobs (mxn)
A <- profiles %>% select(one_of(variables)) %>% as.matrix() %>% t(.)
CE <- CE_entropy_SR(A)
# mean of all CE (named ce.mean to avoid masking base::c)
ce.mean <- mean(CE)
# standard deviation of all CE
ce.sd <- sd(CE)
# features to keep: CEi >= mean + sd
ind.CEi <- which(CE >= ce.mean + ce.sd) # select 387 features
# names of the features to keep
names.CEi <- rownames(A)[ind.CEi]
profiles %<>% select(one_of(names.CEi, metadata))
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken # 12.06843 mins
```
With SR: keep 9 features: 1 33 56 121 185 267 278 281 307
```{r feature selection with FS1}
start.time <- Sys.time()
# load the c++ function
Rcpp::sourceCpp('ranking_SVD_entropy.cpp')
# transpose the dataset to have featxobs (mxn)
A <- profiles %>% select(one_of(variables)) %>% as.matrix() %>% t(.)
feat.idx <- CE_entropy_FS1(A, 250)
# names of the features to keep
names.CEi <- rownames(A)[feat.idx]
profiles %<>% select(one_of(names.CEi, metadata))
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken # 12.90768 mins (3.816624 hours for 250 features)
```
With FS1: keep 9 features: 5 56 121 185 291 349 416 421 537
Selected 250 features
Best features
```{r feature selection with FS2}
start.time <- Sys.time()
# load the c++ function
Rcpp::sourceCpp('ranking_SVD_entropy.cpp')
# transpose the dataset to have featxobs (mxn)
A <- profiles %>% select(one_of(variables)) %>% as.matrix() %>% t(.)
feat.idx <- CE_entropy_FS2(A, 1)
# names of the features to keep
names.CEi <- rownames(A)[feat.idx]
profiles %<>% select(one_of(names.CEi, metadata))
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken # 1.79 hours (1.381417 days for 250 features) (12.36 mins for 1 feature)
```
With FS2: keep 9 features: 1 33 45 56 185 267 278 281 307
Select 250 features
Best features
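
As an illustration of the forward selection described in the introduction (pick the feature with the highest contribution, then recompute the contributions of the remaining features without it), here is a small pure-R sketch. The function names and the toy matrix are hypothetical, and the sketch is not meant to reproduce `CE_entropy_FS2` exactly.

```r
# Hypothetical helper: SVD-entropy of a feature-by-observation matrix.
svd_entropy <- function(A) {
  s <- svd(A, nu = 0, nv = 0)$d
  stopifnot(length(s) > 1)
  v <- s^2 / sum(s^2)
  v <- v[v > 0]
  -sum(v * log(v)) / log(length(s))
}

# Hypothetical sketch of forward selection with recomputed contributions.
forward_select <- function(A, mc) {
  remaining <- seq_len(nrow(A))
  selected <- integer(0)
  # Stop early when fewer than 3 features remain, because the
  # leave-one-out entropy needs at least 2 rows.
  while (length(selected) < mc && length(remaining) > 2) {
    sub <- A[remaining, , drop = FALSE]
    e_full <- svd_entropy(sub)
    ce <- vapply(seq_along(remaining),
                 function(k) e_full - svd_entropy(sub[-k, , drop = FALSE]),
                 numeric(1))
    best <- remaining[which.max(ce)]      # highest contribution this round
    selected <- c(selected, best)
    remaining <- setdiff(remaining, best) # recompute without it next round
  }
  selected
}

# Toy data (assumption): 6 features in rows, 10 observations in columns.
set.seed(1)
A_fs_toy <- matrix(rnorm(6 * 10), nrow = 6)
sel_toy <- forward_select(A_fs_toy, mc = 3)
```

Every round recomputes one SVD per remaining feature, so the cost grows roughly with the number of features times mc, which is in line with the long run times recorded in the chunk comments.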
```{r feature selection with FS2 new}
start.time <- Sys.time()
# load the c++ function
Rcpp::sourceCpp('ranking_SVD_entropy.cpp')
# transpose the dataset to have featxobs (mxn)
X <- profiles %>% select(one_of(variables)) %>% as.matrix() %>% t(.)
A <- tcrossprod(X, X) # takes 0.3676331 to calculate the transpose and the cross-product
feat.idx <- CE_entropy_FS2_new(A, 250)
# names of the features to keep
names.CEi <- rownames(X)[feat.idx]
profiles %<>% select(one_of(names.CEi, metadata))
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken # (5.10729 hours for 250 features) (2.358974 mins for 1 feature)
# obtains exactly the same result
```
With FS2: keep 9 features:
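
The chunk above passes the cross-product `tcrossprod(X, X)` instead of the data matrix. A plausible reason, sketched below on toy data, is that the eigenvalues of X Xᵀ equal the squared singular values of X, and that removing a feature amounts to dropping the matching row and column of X Xᵀ. Whether `CE_entropy_FS2_new` exploits exactly this is an assumption; the variable names here are hypothetical.

```r
# Toy data (assumption): 5 features in rows, 40 observations in columns.
set.seed(1)
X_toy <- matrix(rnorm(5 * 40), nrow = 5)
G_toy <- tcrossprod(X_toy)                 # X %*% t(X), a 5 x 5 matrix

# Squared singular values of X versus eigenvalues of the cross-product.
sv_sq <- svd(X_toy, nu = 0, nv = 0)$d^2
ev <- eigen(G_toy, symmetric = TRUE, only.values = TRUE)$values
same_spectrum <- isTRUE(all.equal(sv_sq, ev))

# Removing feature 2 from X is the same as dropping row and column 2 of G.
same_submatrix <- isTRUE(all.equal(tcrossprod(X_toy[-2, ]),
                                   G_toy[-2, -2]))
```

With far more observations than features, the cross-product is much smaller than the data matrix, so each leave-one-out step can work on a features-by-features matrix instead of the full data.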
## Saving the new dataset
```{r}
pf$data <- profiles
pf$feat_cols <- names.CEi
pf %>%
saveRDS("../../input/BBBC022_2013/old/Pf_Gustafsdottir_fs_svd_FS2_new_250.rds")
```