-
Notifications
You must be signed in to change notification settings - Fork 1
/
feature_selection_SVD_entropy_cpp_old.Rmd
174 lines (117 loc) · 6.37 KB
/
feature_selection_SVD_entropy_cpp_old.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
---
title: "Feature Selection SVD entropy c++"
output: html_document
---
This Markedown perform feature selection based on the paper [Novel Unsupervised Feature Filtering of Biological Data](https://academic.oup.com/bioinformatics/article-abstract/22/14/e507/227946/Novel-Unsupervised-Feature-Filtering-of-Biological).
The method is based on SVD-entropy. The features are selected based on their contribution to the entropy.
There are three different methods of feature selection:
1) Simple Ranking (SR), select the features that contribute to the Entropy more than the mean + std of all entropy contribution
2) Forward Selection (FS), choose best and recalculate entropy of all other without the best. Do that until mc features are selected
3) Backward Elimination (BE), eliminate the lowest entropy contribution feature, until there are just mc features left.
For computational reason, the code was generated in c++, which can be linked using Rcpp library.
Moreover to do some linear algebra, the library Armadillo is used.
```{r libraries, include=FALSE, message=FALSE}
library(ggplot2)
library(caret)
library(magrittr)
library(dplyr)
library(tidyverse)
library(stringr)
library(Rcpp)
```
## Data
The input contains only the high median correlation compounds.
The input data is a 3752 by 803 matrix.
There are 3752 different observations and 799 features (extracted with CellProfiler).
Each compound (938 different) has 4 replicates.
```{r import data old, message=FALSE}
#set.seed(42)
# name of the data file
filename <- "Pf_Gustafsdottir.rds"
# import data
pf <- readRDS(file.path("..", "..", "input", "BBBC022_2013", "old", filename))
profiles <- pf$data
dim(profiles)
variables <- pf$feat_cols
metadata <- pf$factor_cols
```
## Feature selection
```{r feature selection with SR}
start.time <- Sys.time()
# load the c++ function
Rcpp::sourceCpp('ranking_SVD_entropy.cpp')
# transpose the dataset to have featxobs (mxn)
A <- profiles %>% select(one_of(variables)) %>% as.matrix() %>% t(.)
CE <- CE_entropy_SR(A)
# average of all CE
c <- mean(CE)
# standard deviation of all CE
d <- sd(CE)
# features to keep, when CEi > c + d
ind.CEi <- which(CE >= c + d) # select 387 features
# names of the features to keep
names.CEi <- rownames(A)[ind.CEi]
profiles %<>% select(one_of(names.CEi, metadata))
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken # 12.06843 mins
```
With SR: keep 9 features: 1 33 56 121 185 267 278 281 307
```{r feature selection with FS1}
start.time <- Sys.time()
# load the c++ function
Rcpp::sourceCpp('ranking_SVD_entropy.cpp')
# transpose the dataset to have featxobs (mxn)
A <- profiles %>% select(one_of(variables)) %>% as.matrix() %>% t(.)
feat.idx <- CE_entropy_FS1(A, 250)
# names of the features to keep
names.CEi <- rownames(A)[feat.idx]
profiles %<>% select(one_of(names.CEi, metadata))
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken # 12.90768 mins (3.816624 hours for 250 features)
```
With FS1: keep 9 features: 5 56 121 185 291 349 416 421 537
Selected 250 features
Best features: 56 421 5 537 349 416 185 121 291 552 302 52 118 278 283 281 288 282 276 287 277 735 273 270 26 279 557 161 286 274 34 7 741 33 262 289 22 271 290 44 115 285 16 90 337 258 35 284 217 24 779 549 264 683 280 160 41 587 749 187 23 269 30 113 588 92 241 781 272 635 313 746 36 295 418 586 151 275 55 742 582 789 536 31 139 732 221 37 578 25 293 740 266 745 32 69 87 165 522 265 43 111 179 131 27 267 255 540 117 564 636 171 133 585 642 96 94 228 780 181 615 348 191 640 518 589 112 530 546 195 294 584 580 381 173 367 583 778 114 343 311 734 682 49 38 120 733 332 186 119 42 771 788 342 545 320 731 17 637 550 341 581 198 216 577 527 521 110 744 554 150 777 251 89 592 606 29 85 338 220 237 572 396 579 795 602 164 570 254 2 14 641 141 116 532 315 336 46 169 770 594 184 391 28 240 574 412 182 292 535 19 211 104 303 607 785 268 661 539 600 93 18 748 393 601 138 21 575 159 170 235 767 363 226 346 407 791 323 784 395 205 528 632 736 197 411 54 231 665 253
```{r feature selection with FS2}
start.time <- Sys.time()
# load the c++ function
Rcpp::sourceCpp('ranking_SVD_entropy.cpp')
# transpose the dataset to have featxobs (mxn)
A <- profiles %>% select(one_of(variables)) %>% as.matrix() %>% t(.)
feat.idx <- CE_entropy_FS2(A, 1)
# names of the features to keep
names.CEi <- rownames(A)[feat.idx]
profiles %<>% select(one_of(names.CEi, metadata))
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken # 1.79 hours (1. 381417 days for 250 features) (12.36min for 1 feature)
```
With FS2: keep 9 features: 1 33 45 56 185 267 278 281 307
Select 250 features
Best features: 56 1 267 185 33 307 278 45 281 46 121 282 283 273 287 114 47 277 270 262 288 120 170 171 261 552 292 591 291 115 276 418 417 23 34 286 416 6 590 349 271 285 584 259 558 575 279 26 275 280 579 284 583 389 274 290 589 272 571 265 58 253 517 518 767 766 175 5 35 574 211 179 468 415 164 557 258 7 36 215 214 153 527 547 548 537 538 467 528 566 195 578 289 588 117 116 231 210 230 22 393 266 269 24 16 165 204 145 118 119 59 189 191 737 48 218 219 150 398 570 144 41 586 577 587 608 40 653 354 648 98 406 361 362 565 263 254 582 57 205 151 181 410 573 385 607 337 636 342 641 601 302 130 131 85 90 293 42 184 37 733 569 309 747 294 568 183 235 139 135 402 239 103 249 159 141 221 220 592 14 264 25 248 238 233 188 229 228 199 241 251 250 240 736 609 234 244 245 161 160 173 152 182 667 232 133 201 169 208 168 209 225 224 683 125 155 154 124 149 129 148 158 403 198 138 532 82 526 217 216 525 466 308 51 76 81 106 86 91 71 646 96 631 549 550 539 55 66 317 101
```{r feature selection with FS2 new}
start.time <- Sys.time()
# load the c++ function
Rcpp::sourceCpp('ranking_SVD_entropy.cpp')
# transpose the dataset to have featxobs (mxn)
X <- profiles %>% select(one_of(variables)) %>% as.matrix() %>% t(.)
A <- tcrossprod(X, X) # takes 0.3676331 to calculate the transpose and the cross-product
feat.idx <- CE_entropy_FS2_new(A, 250)
# names of the features to keep
names.CEi <- rownames(X)[feat.idx]
profiles %<>% select(one_of(names.CEi, metadata))
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken # (5.10729 hours for 250 features) (2.358974 mins for 1 feature)
# obtain exaclty same result
```
With FS2: keep 9 features:
# saving the new dataset
```{r}
pf$data <- profiles
pf$feat_cols <- names.CEi
pf %>%
saveRDS("../../input/BBBC022_2013/old/Pf_Gustafsdottir_fs_svd_FS2_new_250.rds")
```