This project investigates the changing sentiments of developers on Stack Overflow.
Class: CAPP 30123 (Computer Science with Applications III) @ UChicago
Group name: HackyStacks
Group members: Adam Shelton, Dhruval Bhatt, Li Liu, Sanittawan Tan
Data sets: Stack Overflow Data
- Code Files: Contains all scripts and code used throughout the project
- Analysis: Code used for the final Big Data analysis
- Sentiment Popularity: The meat and potatoes - all the code for the sentiment analysis by language
- C
- Git
- Dataproc
- Java
- Javascript
- Python
- R
- Rust
- SQL
- Unix
- Visualization
- VADER sentiment analysis.ipynb
- Sentiments with Dask.ipynb: Parallel computing with Dask
- adam_text_sentiment_test.py: Alternative sentiment analysis code that was tested
- Time Series Plot
- Exploring languages & frameworks: Top tags
- descr_toptags.py
- descr_toptag.py: Alternative version
- Exploring Users: Distribution of user activities (questions and answers)
- descr_users_activities.py: MapReduce version
- descr_spark_users_activities.py: Apache Spark version
- descr_bash_users_activities.sh: Accompanying Bash script for launching the Spark cluster
- Exploring Questions: Questions that received the most answers, 2008 to 2019
- Exploring Answer Providers: Locations of users who received the "Illuminator" badge
- descr_users_gold_bash.sh: Bash script for data prep before running the code below
- descr_users_gold_ans.py: GeoPy version (see Note 1)
- descr_gmap_users_gold_ans.py: Google Maps version (see Note 1)
- descr_optimized_users_locations.py: Improved version
- mrjob.conf MRJob Dataproc configuration file
- Exploring Tag Network: Bi-grams of adjacent tags
- Processing: Code used to prepare raw XML data-sets for the analysis
- First Drafts: Several first drafts of processing code, each completed by a different person
- cleaning_MPI.py: Final processing code using MPI
- main_process_data.py: Final processing code (not parallelized)
- main_process_data_no_hardcoded.py: Final processing code (not parallelized)
- output_to_csv.py: MPI script to convert text files from analysis to CSVs
- Visualizations: Code used to create all the MPI descriptive statistics visualizations
- visualizations_files: Contains raw SVG files for all figures (see the main Visualizations folder for rasterized visualizations and their descriptions)
- visualizations.md: Markdown document displaying visualizations and their code
- visualizations.Rmd: R Markdown document displaying visualizations and their full code
- Data: Contains all data files small enough to upload to GitHub
- Samples: Small subsets of data files used for testing
- sample_badges.csv: 500 lines of Badges.xml converted to CSV
- sample_comments.csv: 500 lines of Comments.xml converted to CSV
- sample_PostHistory.csv: 500 lines of PostHistory.xml converted to CSV
- sample_PostLinks.csv: 500 lines of PostLinks.xml converted to CSV
- sample_posts.csv: 500 lines of Posts.xml converted to CSV
- sample_tags.csv: 500 lines of Tags.xml converted to CSV
- sample_users.csv: 500 lines of Users.xml converted to CSV
- sample_votes.csv: 500 lines of Votes.xml converted to CSV
- Meeting Minutes: Contains resources from weekly minutes
- Output Data: Contains the results from the Big Data analysis
- raw_text_files: Contains unprocessed text files straight from the analysis scripts
- top_questions.txt: Most popular questions by year
- top_tags.txt: Tags by number of posts
- twograms.txt: All two-grams of tags in posts
- users_gold_badge_locations.txt: The location of each user with a gold badge
- mpi_trials.csv: Running times of different MPI configurations
- top_questions.csv: Most popular questions by year
- top_tags.csv: Tags by number of posts
- twograms.csv: All two-grams of tags in posts
- user_ac_out: The number of posts per user
- users_gold_badge_locations.csv: The location of each user with a gold badge
- Reference Materials and Documents
- Project Proposal
- Server Access Guide: Provides instructions on how to access and use the server
- Visualizations: Contains all rasterized PNGs of the visualizations
- mpi-1-1.png: Line plot of how MPI running time decreases as more nodes are added
- mpi-2-1.png: Bar plot of the relationship between number of hosts and running time
- mpi-cost-1.png: Line plot of how Google Cloud costs increase as more nodes are added
- ques-year-1.png: Bar plot of top questions for each year
- top-tags-1.png: Bar plot of the top six tags in all posts
- two-grams-1.png: Network graph of the top 150 tags
- user-act-1.png: Density plot of user activity
- usr-act-tmap-1.png: Treemap of user activity
- Zenhub Workflow
Note 1: The GeoPy and Google Maps versions (descr_users_gold_ans.py and descr_gmap_users_gold_ans.py) are almost identical. The main difference is the package used for geocoding.
Note 2: Different bi-gram generating methods.
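
To make the idea of "different bi-gram generating methods" concrete, here is a minimal sketch of two ways tag bi-grams could be counted. It is not the project's actual code. The function names are hypothetical, and it assumes each post's tags arrive as a single string in the form `<python><pandas>`. The first method pairs only adjacent tags (matching the "Bi-grams of adjacent tags" description above); the second pairs every tag with every other tag on the same post, ignoring order.

```python
import re
from collections import Counter
from itertools import combinations


def parse_tags(tag_field):
    """Split a tag string such as '<python><pandas>' into ['python', 'pandas'].

    The '<tag1><tag2>' input format is an assumption for this sketch.
    """
    return re.findall(r"<([^<>]+)>", tag_field)


def adjacent_bigrams(tags):
    """Pair each tag with the tag that immediately follows it."""
    return list(zip(tags, tags[1:]))


def all_pair_bigrams(tags):
    """Pair every tag with every other tag on the post, order-independent."""
    return list(combinations(sorted(tags), 2))


# Tiny made-up sample, only to show the two counting strategies side by side.
posts = ["<python><pandas><dataframe>", "<python><pandas>"]

adjacent_counts = Counter()
pair_counts = Counter()
for field in posts:
    tags = parse_tags(field)
    adjacent_counts.update(adjacent_bigrams(tags))
    pair_counts.update(all_pair_bigrams(tags))

print(adjacent_counts.most_common(3))
print(pair_counts.most_common(3))
```

The adjacent method yields fewer, order-dependent pairs, while the all-pairs method grows quadratically with the number of tags per post but captures every co-occurrence, which may suit a network graph better.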