SOParser is a parser and analyzer of the StackOverflow data.
The package contains a bash script for downloading and extracting data into a manageable format. In addition, the repository contains topic modeling code in Python.
Run ./downloadAndPrepareData.sh to download and prepare the data. NB: You will need ~100 GB of free disk space to run the script.
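Because the download needs roughly 100 GB, it can help to check free space before starting. The following is a minimal sketch, not part of the repository; the function name `has_enough_space` and the `REQUIRED_GB` constant are hypothetical, and the threshold is simply the ~100 GB figure mentioned above.

```python
import shutil

# Approximate space requirement taken from the note above (an estimate, not an exact figure).
REQUIRED_GB = 100


def has_enough_space(path: str = ".", required_gb: float = REQUIRED_GB) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3


if __name__ == "__main__":
    if has_enough_space():
        print("Enough free space to run downloadAndPrepareData.sh")
    else:
        print("Less than ~100 GB free; the script may fail partway through")
```

Run it from the directory where the data will be stored, since `shutil.disk_usage` reports on the filesystem that contains the given path.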
downloadAndPrepareData.sh
- A bash script that downloads and prepares the data. The script creates one file per month (Jan. 2013 - Dec. 2014); each file contains the questions and answers posted in that month.
SOParser.py
- A Python script that 1) extracts all users that will be used in the analysis (users with a minimum of 50 posts over 2013-2014), and 2) extracts the questions and answers (title, text excluding code snippets, tags) written by those users and saves them in data files used in later stages of the analysis. The output is one TSV file per month.
TextProcessor.py
- Performs tokenization, stemming, TF-IDF, and month-by-month LDA on the files generated by SOParser.py.
TopicComparator.py
- Compares topics month-by-month, e.g. compares the topics generated for 2013-05 with the topics generated for 2013-06 and 2013-07, etc.
UserComparator.py
- Compares topics month-by-month in terms of users.
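The user selection step described for SOParser.py (keeping users with at least 50 posts over 2013-2014) can be illustrated roughly as follows. This is a sketch under assumptions, not the code in SOParser.py: the function name `select_active_users` and the idea of passing in a flat sequence of post owner IDs are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set

# Threshold stated in the SOParser.py description above.
MIN_POSTS = 50


def select_active_users(owner_ids: Iterable[str], min_posts: int = MIN_POSTS) -> Set[str]:
    """Return the users who wrote at least `min_posts` posts.

    `owner_ids` holds one entry per post (question or answer), namely the ID
    of the user who wrote it. How those IDs are read from the monthly files
    is left out here.
    """
    counts = Counter(owner_ids)
    return {user for user, n in counts.items() if n >= min_posts}


if __name__ == "__main__":
    # Toy data: alice has exactly 50 posts, bob 49, carol 120.
    posts = ["alice"] * 50 + ["bob"] * 49 + ["carol"] * 120
    print(sorted(select_active_users(posts)))  # prints ['alice', 'carol']
```

Counting across all months before filtering matters here, because the threshold applies to the whole 2013-2014 period rather than to any single month.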
You might need to run nltk.download() to download stopwords.
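Calling nltk.download() with no arguments opens the interactive downloader. If only the stopwords are needed, a small helper along these lines may be enough. It is a sketch, assuming the scripts rely on the standard NLTK `stopwords` corpus; the helper name `ensure_stopwords` is hypothetical.

```python
def ensure_stopwords() -> None:
    """Download the NLTK stopwords corpus only if it is not already installed."""
    # Imported inside the function so this file can be loaded before NLTK is installed.
    import nltk

    try:
        # Raises LookupError when the corpus is not found in any NLTK data directory.
        nltk.data.find("corpora/stopwords")
    except LookupError:
        nltk.download("stopwords")


if __name__ == "__main__":
    ensure_stopwords()
```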