Project by Veronica Salm, Ian DeHaan, Noah Gergel, Siyuan Yu, Xiang Zhang
For this project, we used topic modelling to analyze and understand public discourse on Twitter. Our results and findings, as well as the approach we took, are discussed in our final report and video. This repository contains all of the relevant code files we used to achieve these results, and their contents are briefly described in this README.
Below are all of the code files in this repository.
* intruder-detection/
* intruder-detection.py
- Contains code for running intruder detection on our results.
* scripts/
* covid-policy-tracker/
- Contains scripts for summarizing COVID-19 lockdown information, to determine the best weeks to use.
* extract_data/
- Code for extracting the tweets and constructing the dataset we run the algorithms on.
* coherence.py
- Contains the functions we used to get the coherence of our topics.
* DataManager.py
- Contains functions and a wrapper class we used for loading in and managing our datasets.
* lda_cross_validation_twitter.py
- Performs 10-fold cross validation on the Twitter dataset to determine the best number of topics, K.
* newsgroup_util.py
- Contains utility functions for working with the 20newsgroup dataset (downloading, normalizing, etc.).
* twitter_util.py
- Contains utility functions for working with the Twitter dataset.
* settings.py
- Contains project-wide settings and constants.
* main.py
- Code demo for the project.
* stop_list_iter.py
- Uses LDA to build a stop list of uninformative terms.
* lda_gibbs.py
- Our implementation of LDA using Gibbs sampling.
Note that intruder-detection and the sub-directories in scripts all have their own READMEs; please see them for more information.
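The 10-fold cross validation described for lda_cross_validation_twitter.py can be illustrated with a minimal, standard-library sketch. It is not the code in that file: the function names, the `score_fn` callable, and the "higher score is better" convention are all assumptions made for illustration. In practice `score_fn` would train a K-topic model on the training documents and evaluate it on the held-out ones.

```python
import random
from statistics import mean


def k_fold_indices(n_items, n_folds=10, seed=0):
    """Shuffle the indices 0..n_items-1 and deal them into n_folds folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    return [indices[i::n_folds] for i in range(n_folds)]


def select_num_topics(docs, candidate_ks, score_fn, n_folds=10, seed=0):
    """Return (best_k, mean_scores), assuming higher scores are better.

    score_fn(train_docs, test_docs, k) is a hypothetical callable that
    trains a k-topic model on train_docs and scores it on test_docs.
    """
    folds = k_fold_indices(len(docs), n_folds, seed)
    mean_scores = {}
    for k in candidate_ks:
        fold_scores = []
        for held_out in folds:
            held_out_set = set(held_out)
            # Every document outside the held-out fold is used for training.
            train = [d for i, d in enumerate(docs) if i not in held_out_set]
            test = [docs[i] for i in held_out]
            fold_scores.append(score_fn(train, test, k))
        mean_scores[k] = mean(fold_scores)
    best_k = max(mean_scores, key=mean_scores.get)
    return best_k, mean_scores
```

The same folds are reused for every candidate K, so differences in the mean score reflect the choice of K rather than a different split of the data.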
In order to run any of our code files, several libraries need to be installed: numpy, sklearn, nltk, and gensim. These can be installed with the command pip3 install numpy scikit-learn nltk gensim (scikit-learn is the package name that provides sklearn). Make sure you also have Python 3.7 installed.
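Before running anything, it can help to confirm that the required libraries are importable. The snippet below is a small sketch using only the standard library; the helper names are hypothetical and not part of this repository, and treating Python 3.7 as a minimum version (rather than an exact requirement) is an assumption.

```python
import importlib.util
import sys

# Import names of the libraries listed above.
REQUIRED_MODULES = ("numpy", "sklearn", "nltk", "gensim")
# Assumption: 3.7 is treated as the oldest acceptable Python version.
MIN_PYTHON = (3, 7)


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_recent_enough(minimum=MIN_PYTHON):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum
```

Calling missing_modules() returns an empty list when every library is available; any names it does return still need to be installed.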
There is sometimes an error with nltk after you install it. If this happens, uncomment lines 19-21 in newsgroup_util.py and run the code again to fix it.
Once these libraries are installed, simply run python3 main.py to see our demo code running both topic modelling algorithms.
In this repository, we've added functions to streamline the process of downloading the Twitter dataset. In particular, to download the dataset, run the following function:
    import os

    def download_twitter(path='./TwitterDataset'):
        '''
        Downloads the Twitter dataset from the git repository:
            https://github.com/VeronicaSalm/TwitterDataset

        Arguments:
            - path (string): an absolute or relative path to the directory where the
                             Twitter repository should be downloaded to, defaults to
                             './TwitterDataset'
        '''
        # Clone the dataset repository into the requested directory.
        os.system(f"git clone https://github.com/VeronicaSalm/TwitterDataset {path}")
This function is defined in twitter_util.py. You can import it with the line from twitter_util import download_twitter.
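If the download may be triggered more than once, a wrapper that skips the clone when the target directory already exists can be convenient. The following is a sketch only: `clone_if_missing` and its `run` parameter are hypothetical names that do not exist in twitter_util.py, and it uses subprocess.run in place of os.system so that paths containing spaces are passed to git unchanged.

```python
import os
import subprocess

REPO_URL = "https://github.com/VeronicaSalm/TwitterDataset"


def clone_if_missing(path="./TwitterDataset", run=subprocess.run):
    """Clone the dataset into `path` unless that directory already exists.

    Returns True if a clone was attempted and False if it was skipped.
    `run` is injectable so the logic can be exercised without network access.
    """
    if os.path.isdir(path):
        return False
    # A list of arguments (no shell) avoids quoting problems in the path.
    # check=True raises CalledProcessError if git exits with an error.
    run(["git", "clone", REPO_URL, path], check=True)
    return True
```

Note that this sketch only checks that the directory exists, not that it holds a complete copy of the dataset.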
All of the utility code provided is thoroughly documented with function headers and comments.