This model takes as input a set of text-based features derived from the headline and body of an article, then feeds those features into Gradient Boosted Trees to predict the relation between the headline and the body (`agree`/`disagree`/`discuss`/`unrelated`).
1. Preprocessing (`generateFeatures.py`)
The labels (`agree`, `disagree`, `discuss`, `unrelated`) are encoded into the numeric target values (0, 1, 2, 3). The text of the headline and body is then tokenized and stemmed (by `preprocess_data()` in `helpers.py`). Finally, uni-grams, bi-grams and tri-grams are created out of the list of tokens. These grams, together with the original text, are used by the following feature extractor modules.
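As an illustration, here is a minimal sketch of the tokenize/stem/n-gram step, assuming NLTK's `word_tokenize` and the Porter stemmer (the actual `preprocess_data()` in `helpers.py` may differ in its tokenization rules and stop-word handling):

```python
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

def make_grams(text):
    # Lowercase, tokenize, drop punctuation-only tokens and stem each token
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(t) for t in word_tokenize(text.lower()) if t.isalnum()]
    # Join consecutive tokens into bi-grams and tri-grams
    bigrams = ["_".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    trigrams = ["_".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    return tokens, bigrams, trigrams

unigrams, bigrams, trigrams = make_grams("Police find mass graves with at least 15 bodies")
```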
2. Basic Count Features (`CountFeatureGenerator.py`)
This module takes the uni-grams, bi-grams and tri-grams and creates various counts and ratios that could potentially signal how a body text is related to a headline. Specifically, it counts how many grams appear in the headline, how many unique grams there are in the headline, and the ratio between the two. The same statistics are computed for the body text. It then counts how many grams in the headline also appear in the body text, along with a version of this overlap count normalized by the number of grams in the headline. The results are saved in a pickle file that is read back in by the classifier.
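A rough sketch of the kind of counts and ratios this produces (the helper below is hypothetical; the real `CountFeatureGenerator.py` computes these per n-gram order and adds further refinements):

```python
def count_features(headline_grams, body_grams):
    body_set = set(body_grams)
    feats = {
        "count_headline": len(headline_grams),
        "count_unique_headline": len(set(headline_grams)),
        "count_body": len(body_grams),
        "count_unique_body": len(body_set),
    }
    # Ratios of unique grams to total grams (float() guards against Python 2 integer division)
    feats["ratio_unique_headline"] = feats["count_unique_headline"] / float(max(feats["count_headline"], 1))
    feats["ratio_unique_body"] = feats["count_unique_body"] / float(max(feats["count_body"], 1))
    # How many headline grams also occur in the body, raw and normalized by headline length
    overlap = sum(1 for g in headline_grams if g in body_set)
    feats["count_overlap"] = overlap
    feats["ratio_overlap"] = overlap / float(max(feats["count_headline"], 1))
    return feats
```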
3. TF-IDF Features (`TfidfFeatureGenerator.py`)
This module constructs sparse vector representations of the headline and body by calculating the term frequency of each gram and normalizing it by its inverse document frequency. First, a `TfidfVectorizer` is fit to the concatenations of headline and body text to obtain the vocabulary. Then, using that same vocabulary, the headline grams and body grams are separately transformed into sparse TF-IDF vectors. The module also calculates the cosine similarity between the headline vector and the body vector.
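A minimal sketch of this two-stage vectorization with scikit-learn (the toy sentences and the `ngram_range` are only illustrative; the real generator works on the stemmed grams from step 1):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

headlines = ["robert plant ripped up led zeppelin reunion contract",
             "police find no evidence of mass graves"]
bodies = ["led zeppelin singer robert plant reportedly tore up a reunion contract",
          "officials confirmed the reports of mass graves were unfounded"]

# 1) Learn the shared vocabulary from headline+body concatenations
vocab_vec = TfidfVectorizer(ngram_range=(1, 3))
vocab_vec.fit([h + " " + b for h, b in zip(headlines, bodies)])
vocabulary = vocab_vec.vocabulary_

# 2) Vectorize headlines and bodies separately, re-using that vocabulary
headline_tfidf = TfidfVectorizer(ngram_range=(1, 3), vocabulary=vocabulary).fit_transform(headlines)
body_tfidf = TfidfVectorizer(ngram_range=(1, 3), vocabulary=vocabulary).fit_transform(bodies)

# 3) One cosine similarity per headline/body pair
sims = [cosine_similarity(headline_tfidf[i], body_tfidf[i])[0, 0]
        for i in range(headline_tfidf.shape[0])]
```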
4. SVD Features (`SvdFeatureGenerator.py`)
This module takes the TF-IDF features and applies Singular Value Decomposition to them to obtain a compact, dense vector representation of the headline and body respectively. This procedure is well known and corresponds to finding the latent topics in the corpus and representing each headline/body text as a mixture of these topics. The cosine similarity between the SVD features of the headline and the body text is also computed. This similarity metric is very indicative of whether the body is related to the headline or not.
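A sketch of how this could be done with scikit-learn's `TruncatedSVD`, assuming sparse TF-IDF matrices like those from the previous step (the number of components below is a placeholder; the actual value is set inside `SvdFeatureGenerator.py`):

```python
from scipy.sparse import vstack
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def svd_features(headline_tfidf, body_tfidf, n_components=50):
    """Reduce sparse TF-IDF matrices to dense 'topic' vectors and
    compute one cosine similarity per headline/body pair."""
    # Fit a single SVD on headlines and bodies together so both live in the same latent space
    svd = TruncatedSVD(n_components=n_components, n_iter=15)
    svd.fit(vstack([headline_tfidf, body_tfidf]))
    headline_svd = svd.transform(headline_tfidf)
    body_svd = svd.transform(body_tfidf)
    sims = [cosine_similarity(headline_svd[i:i + 1], body_svd[i:i + 1])[0, 0]
            for i in range(headline_svd.shape[0])]
    return headline_svd, body_svd, sims
```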
5. Word2Vec Features (`Word2VecFeatureGenerator.py`)
This module uses pre-trained word vectors from public sources and adds them up to build vector representations of the headline and body. The word vectors were trained on a Google News corpus of about 100 billion words, with a vocabulary size of 3 million. The resulting word vectors can be used to find synonyms, predict the next word given the previous words, or manipulate semantics. For example, calculating `vector('Berlin') - vector('Germany') + vector('England')` yields a vector that is very close to `vector('London')`. For the current problem, building the representation out of word vectors could help overcome the ambiguity introduced when the headline and body use synonyms rather than exactly the same words.
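A sketch of how the summed representation and its similarity could be built with gensim (newer gensim exposes `KeyedVectors`; the original Python 2.7 code may use the older `Word2Vec.load_word2vec_format` API):

```python
import numpy as np
from gensim.models import KeyedVectors

# Load the pre-trained Google News vectors (300-dimensional, ~3M-word vocabulary)
model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def text_vector(tokens):
    """Sum the word vectors of in-vocabulary tokens and L2-normalize the result."""
    vecs = [model[w] for w in tokens if w in model]
    if not vecs:
        return np.zeros(model.vector_size)
    v = np.sum(vecs, axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

headline_vec = text_vector("robert plant ripped up reunion contract".split())
body_vec = text_vector("led zeppelin singer robert plant turned down the contract".split())
sim = float(np.dot(headline_vec, body_vec))  # cosine similarity of the two unit vectors
```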
6. Sentiment Features (`SentimentFeatureGenerator.py`)
This module uses the Sentiment Analyzer in the NLTK package to assign a sentiment polarity score to the headline and body separately. For example, a negative score means the text expresses a negative opinion about something. This score can be informative about whether the body is positive about a subject while the headline is negative. It does not indicate whether the same subject appears in both the body and the headline; that piece of information should be preserved by the other features.
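A minimal sketch using NLTK's VADER analyzer, averaging polarity scores over sentences (the exact aggregation in `SentimentFeatureGenerator.py` may differ):

```python
import numpy as np
from nltk import sent_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# nltk.download('vader_lexicon') and nltk.download('punkt') may be needed beforehand
sid = SentimentIntensityAnalyzer()

def sentiment_features(text):
    """Average the VADER polarity scores over the sentences of a text."""
    scores = [sid.polarity_scores(s) for s in sent_tokenize(text)]
    return {k: float(np.mean([s[k] for s in scores]))
            for k in ("neg", "neu", "pos", "compound")}

print(sentiment_features("The police found no evidence of mass graves."))
```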
The classifier used in this model is Gradient Boosted Decision Trees (GBDT), and XGBoost provides a very efficient implementation of it. 10-fold cross-validation is used to estimate the performance of this model.
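A rough sketch of cross-validated XGBoost training for the four-class problem. The hyper-parameters and the plain stratified split below are placeholders; the actual script (`xgb_train_cvBodyId.py`) hard-codes its own parameters and, as its name suggests, splits folds by body ID to avoid leakage:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

# X: numpy feature matrix stacked from the generators above; y: encoded stance labels (0-3)
params = {"objective": "multi:softprob", "num_class": 4,
          "eta": 0.1, "max_depth": 6, "eval_metric": "mlogloss"}

def cv_score(X, y, num_boost_round=500, n_folds=10):
    scores = []
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=2017)
    for train_idx, valid_idx in skf.split(X, y):
        dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx])
        dvalid = xgb.DMatrix(X[valid_idx], label=y[valid_idx])
        bst = xgb.train(params, dtrain, num_boost_round)
        # multi:softprob returns one probability per class; take the argmax as the prediction
        pred = np.argmax(bst.predict(dvalid), axis=1)
        scores.append(accuracy_score(y[valid_idx], pred))
    return np.mean(scores)
```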
- Python 2.7
- Scipy Stack (`numpy`, `scipy` and `pandas`)
- scikit-learn
- XGBoost
- gensim (for word2vec)
- NLTK (python NLP library)
1. Install all the dependencies
2. `git clone https://github.com/Cisco-Talos/fnc-1.git`
3. Download the `word2vec` model trained on the Google News corpus. The file `GoogleNews-vectors-negative300.bin` has to be present in both `/deep_learning_model` and `/tree_model`.
4. Generate predictions from the `deep_learning_model` by running `/deep_learning_model/clf.py`. This output is represented by `dosiblOutputFinal.csv`, renamed as `deepoutput.csv` in our directories.
5. Run `generateFeatures.py` to produce all the feature files (`train_stances_processed.csv` and `test_stances_processed.csv` are the encoding-wise cleaned-up versions of the original csv files, the same as the updated files in the original FNC-1 GitHub repository). The following files will be generated:
data.pkl
test.basic.pkl
test.body.senti.pkl
test.body.svd.pkl
test.body.tfidf.pkl
test.body.word2vec.pkl
test.headline.senti.pkl
test.headline.svd.pkl
test.headline.tfidf.pkl
test.headline.word2vec.pkl
test.sim.svd.pkl
test.sim.tfidf.pkl
test.sim.word2vec.pkl
train.basic.pkl
train.body.senti.pkl
train.body.svd.pkl
train.body.tfidf.pkl
train.body.word2vec.pkl
train.headline.senti.pkl
train.headline.svd.pkl
train.headline.tfidf.pkl
train.headline.word2vec.pkl
train.sim.svd.pkl
train.sim.tfidf.pkl
train.sim.word2vec.pkl
6. Comment out line 121 in `TfidfFeatureGenerator.py`, then uncomment line 122 in the same file. Raw TF-IDF vectors are needed by `SvdFeatureGenerator.py` during feature generation, but only the similarities are needed for training.
7. Run `xgb_train_cvBodyId.py` to train and make predictions on the test set. The output file is `predtest_cor2.csv` (used for model averaging).
8. Run `average.py` to compute the weighted average of this tree model and the CNN model from Doug Sibley (`deepoutput.csv`). The final output file is `averaged_2models_cor4.csv`, which is the final submission of our team.
The `run()` function in `average.py` is used to find the optimal weights when combining the two models. Uncomment lines 164 and 165 to call this function. It needs the probability predictions from both the deep learning model (`dosiblOutput.csv`) and the tree model (`predtest_cor2.csv`) on the holdout set of the official split. This step is not necessary, as we decided to use weights of 0.5 and 0.5.
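For reference, the weighted averaging itself amounts to a few lines of pandas/numpy; the column names below are assumptions about the CSV layout, not taken from `average.py`:

```python
import numpy as np
import pandas as pd

LABELS = ["agree", "disagree", "discuss", "unrelated"]
w_tree, w_cnn = 0.5, 0.5  # the weights the team settled on

# Hypothetical columns: one probability column per stance label, plus Headline and Body ID
tree = pd.read_csv("predtest_cor2.csv")
cnn = pd.read_csv("deepoutput.csv")

avg_probs = w_tree * tree[LABELS].values + w_cnn * cnn[LABELS].values
final = pd.DataFrame({"Headline": tree["Headline"],
                      "Body ID": tree["Body ID"],
                      "Stance": [LABELS[i] for i in np.argmax(avg_probs, axis=1)]})
final.to_csv("averaged_2models_cor4.csv", index=False)
```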
All the output files are also stored under `./results/`, and all parameters are hard-coded.
Contact Yuxi Pan (yuxpan@cisco.com) for bugs and questions.