A categorical feature selection approach based on information-theoretic considerations.
Implementation of the fast correlation-based filter (FCBF) proposed by Yu and Liu:
@inproceedings{inproceedings,
author = {Yu, Lei and Liu, Huan},
year = {2003},
month = {01},
pages = {856-863},
title = {Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution},
volume = {2},
booktitle = {Proceedings, Twentieth International Conference on Machine Learning}
}
Data for testing is taken from the UCI Machine Learning Repository. See also notes on the contained lung cancer dataset.
from fcbf import fcbf, data

# Load the bundled lung cancer dataset.
dataset = data.lung_cancer

# Column 1 is used as the class label; every other column is a feature.
X = dataset.loc[:, [dataset.columns[0]] + dataset.columns[2:].tolist()]
y = dataset[dataset.columns[1]].astype(int)
print(X)
print(y)

# Run the filter and report the features it kept and the ones it discarded.
relevant_features, irrelevant_features, correlations = fcbf(X, y, su_threshold=0.1, base=2)
print('relevant_features:', relevant_features, '(count:', len(relevant_features), ')')
print('irrelevant_features:', irrelevant_features, '(count:', len(irrelevant_features), ')')
print('correlations:', correlations)
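In the paper by Yu and Liu, the correlation measure behind FCBF is symmetrical uncertainty, SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), and the `su_threshold` argument in the example above presumably sets the minimum SU a feature needs with the class label. The following is a minimal, standard-library sketch of that measure for categorical sequences. The function names `entropy` and `symmetrical_uncertainty` are hypothetical and are not taken from this package, and returning 0 for two constant variables is an assumption of the sketch.

```python
from collections import Counter
from math import log


def entropy(values, base=2):
    """Shannon entropy of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * log(c / n, base) for c in Counter(values).values())


def symmetrical_uncertainty(x, y, base=2):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value between 0 and 1."""
    h_x, h_y = entropy(x, base), entropy(y, base)
    if h_x + h_y == 0:
        # Both variables are constant; this sketch treats them as uncorrelated.
        return 0.0
    # Joint entropy H(X, Y), computed over the paired observations.
    h_xy = entropy(list(zip(x, y)), base)
    # Information gain (mutual information): I(X; Y) = H(X) + H(Y) - H(X, Y).
    mutual_information = h_x + h_y - h_xy
    return 2.0 * mutual_information / (h_x + h_y)


# Hypothetical toy data: one feature that mirrors the label, one that ignores it.
feature = ["a", "a", "b", "b"]
label = [0, 0, 1, 1]
noise = ["u", "v", "u", "v"]

su_informative = symmetrical_uncertainty(feature, label)
su_noise = symmetrical_uncertainty(noise, label)
print("SU(feature, label):", su_informative)
print("SU(noise, label):", su_noise)
```

With a threshold such as 0.1, a feature like `feature` would be kept as relevant, while `noise`, which shares no information with the label, would fall below the threshold.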
Install the package with pip:
pip install fcbf
TODO
TODO
The code is released under the MIT License. All dependencies are copyrighted by their respective authors and released under their respective licenses.