Big Data Analysis of the Developer Community

This project investigates the changing sentiments of developers on Stack Overflow.

Project Information

Class: CAPP 30123 (Computer Science with Applications III) @ UChicago

Group name: HackyStacks

Group members: Adam Shelton, Dhruval Bhatt, Li Liu, Sanittawan Tan

Data sets: Stack Overflow Data

Quick Links

Code Files: Contains all scripts and code used throughout the project
- Analysis: Code used for the final Big Data analysis
  - Sentiment Popularity: The meat and potatoes - all the code for the sentiment analysis by language
    - C
    - Git
    - Dataproc
    - Java
    - Javascript
    - Python
    - R
    - Rust
    - SQL
    - Unix
    - Visualization
    - VADER sentiment analysis.ipynb
    - Sentiments with Dask.ipynb: Parallel computing with Dask
    - adam_text_sentiment_test.py: Alternative sentiment analysis code that was tested
    - Time Series Plot
  - Descriptive Analysis
    - Exploring languages & frameworks: Top tags
      - descr_toptags.py
      - descr_toptag.py: Alternative version
    - Exploring Users: Distribution of user activities (questions and answers)
      - descr_users_activities.py: MapReduce version
      - descr_spark_users_activities.py: Apache Spark version
      - descr_bash_users_activities.sh: Accompanying Bash script for launching the Spark cluster
    - Exploring Questions: Questions that received the most answers, 2008 to 2019
      - descr_max_ans_q.py
    - Exploring Answer Providers: Locations of users who received the "Illuminator" badge
      - descr_users_gold_bash.sh: Bash script for data prep before running the scripts below
      - descr_users_gold_ans.py: GeoPy version¹
      - descr_gmap_users_gold_ans.py: Google Maps version¹
      - descr_optimized_users_locations.py: Improved version
      - mrjob.conf: MRJob Dataproc configuration file
    - Exploring Tag Network: Bi-grams of adjacent tags
      - decrs_bi_grams_tags.py²
      - decrs_n_grams_tags.py²
- Processing: Code used to prepare raw XML data sets for the analysis
  - First Drafts: Several first drafts of processing code, each completed by a different person
    - adam_process_data.py
    - dhruval_process_data.py
    - nikki_process_users_votes.py
  - cleaning_MPI.py: Final processing code using MPI
  - main_process_data.py: Final processing code (not parallelized)
  - main_process_data_no_hardcoded.py: Final processing code (not parallelized)
  - output_to_csv.py: MPI script to convert text files from analysis to CSVs
- Visualizations: Code used to create all the MPI descriptive statistics visualizations
  - visualizations_files: Contains raw SVG files for all figures (see the main visualizations folder for rasterized visualizations and their descriptions)
  - visualizations.md: Markdown document displaying visualizations and their code
  - visualizations.Rmd: R Markdown document displaying visualizations and their full code
Data: Contains all data files small enough to upload to GitHub
- Samples: Small subsets of data files used for testing
  - sample_badges.csv: 500 lines of Badges.xml converted to CSV
  - sample_comments.csv: 500 lines of Comments.xml converted to CSV
  - sample_PostHistory.csv: 500 lines of PostHistory.xml converted to CSV
  - sample_PostLinks.csv: 500 lines of PostLinks.xml converted to CSV
  - sample_posts.csv: 500 lines of Posts.xml converted to CSV
  - sample_tags.csv: 500 lines of Tags.xml converted to CSV
  - sample_users.csv: 500 lines of Users.xml converted to CSV
  - sample_votes.csv: 500 lines of Votes.xml converted to CSV
Meeting Minutes: Contains resources from weekly meetings
- Whiteboard Pictures: Photos of drawings/diagrams from meetings
  - Week 4
  - Week 6
  - Week 7
  - Week 8
- Notes
  - Week 4
Output Data: Contains the results from the Big Data analysis
- raw_text_files: Contains unprocessed text files straight from the analysis scripts
  - top_questions.txt: Most popular questions by year
  - top_tags.txt: Tags by number of posts
  - twograms.txt: All two-grams of tags in posts
  - users_gold_badge_locations.txt: The location of each user with a gold badge
- mpi_trials.csv: Running times of different MPI configurations
- top_questions.csv: Most popular questions by year
- top_tags.csv: Tags by number of posts
- twograms.csv: All two-grams of tags in posts
- user_ac_out: The number of posts per user
- users_gold_badge_locations.csv: The location of each user with a gold badge
Reference Materials and Documents
- Project Proposal
- Server Access Guide: Provides instructions on how to access and use the server
Visualizations: Contains all rasterized PNGs of the visualizations
- mpi-1-1.png: Line plot of how MPI running time decreases as more nodes are added
- mpi-2-1.png: Bar plot of the relationship between number of hosts and running time
- mpi-cost-1.png: Line plot of how Google Cloud costs increase as more nodes are added
- ques-year-1.png: Bar plot of top questions for each year
- top-tags-1.png: Bar plot of the top six tags in all posts
- two-grams-1.png: Network graph of the top 150 tags
- user-act-1.png: Density plot of user activity
- usr-act-tmap-1.png: Treemap of user activity
Zenhub Workflow

¹ Note: The two versions marked with this footnote are almost identical. The main difference is the package used for geocoding.

² Note: These two scripts use different methods of generating bi-grams.
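
For readers unfamiliar with the "bi-grams of adjacent tags" idea used in the tag network analysis, here is a minimal sketch in plain Python. It is not the project's code and does not use MapReduce. It assumes each post's tags arrive as a single string of the form `<tag1><tag2><tag3>`. The function names `tag_bigrams` and `count_bigrams` and the sample posts are hypothetical and exist only for illustration.

```python
import re
from collections import Counter


def tag_bigrams(tag_string):
    """Return pairs of adjacent tags from a string such as '<python><pandas>'.

    Assumption: tags are wrapped in angle brackets with nothing between them.
    """
    tags = re.findall(r"<([^<>]+)>", tag_string)
    # Pair each tag with the one that follows it; a single tag yields no pairs.
    return list(zip(tags, tags[1:]))


def count_bigrams(tag_strings):
    """Count how often each adjacent tag pair occurs across many posts."""
    counts = Counter()
    for tag_string in tag_strings:
        counts.update(tag_bigrams(tag_string))
    return counts


# Hypothetical sample data, not taken from the project's data set.
sample_posts = [
    "<python><pandas><dataframe>",
    "<python><pandas>",
    "<javascript>",
]
bigram_counts = count_bigrams(sample_posts)
```

The resulting counts can serve as edge weights when drawing a tag network, with each tag as a node.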
