Stack Overflow mining scripts used for the following paper, presented at the 19th International Conference on Mining Software Repositories (MSR '22):
Mining the Usage of Reactive Programming APIs: A Mining Study on GitHub and Stack Overflow.
Complementary scripts, also used during the production of the paper, are available in:
The folders under /assets contain data either generated by or collected for the scripts' execution. The table below gives a brief description of each folder:
Folder | Description |
---|---|
data explorer | Contains posts collected from Stack Exchange Data Explorer |
extracted-posts | Includes JSON files with the posts related to the most relevant topics (RQ3) |
lda-results | Contains the results of the last LDA execution |
operators-search | Includes the results for the operator search for Rx libraries |
operators | Includes JSON files consisting of Rx libraries' operators |
result-processing | Contains data presented in the Result section (RQ2) |
The file stopwords.txt contains a list of stop words used during preprocessing.
The results of the last LDA (Latent Dirichlet Allocation) execution are available under /assets/2022-01-12 02-21-28/. As detailed in the paper, the execution with the following settings generated the most coherent results:
Parameter | Value |
---|---|
Topics | 23 |
Hyperparameters | α = β = 0.01 |
Iterations | 1,000 |
Each result comprises three CSV files following the file name patterns below:
- [file name of the posts file]_doctopicdist_[#topics]_[analyzed post field].csv - contains the posts' IDs and their topic distribution (topic + proportion), including the dominant topic and its proportion in a separate column for easy retrieval;
- [file name of the posts file]_topicdist_[#topics]_[analyzed post field].csv - contains the topic distribution along with each topic's words (word + proportion), sorted by word proportion in descending order;
- [file name of the posts file]_topicdist_[#topics]_[analyzed post field] - topwords.csv - (extra) the same as the above, but presenting each topic only with its top words (set in config) to facilitate the open card sorting technique.
Where:
- [file name of the posts file]: a file under assets/data explorer/consolidated sources, set through config;
- [#topics]: the number of topics for that specific execution;
- [analyzed post field]: either Title or Body (see Configuration).
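For instance, with the default configuration (fileName all_withAnswers, field Body) and a 23-topic execution, the three file names derived from the patterns above would look like this (illustrative example):

```
all_withAnswers_doctopicdist_23_Body.csv
all_withAnswers_topicdist_23_Body.csv
all_withAnswers_topicdist_23_Body - topwords.csv
```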
Most of the scripts are written in Go and were executed with the following version:
- Go v1.17.5
Before executing the Go scripts, run the following command in a terminal (at the root of the project) to download the dependencies:
go mod tidy
The Go scripts are available under the /cmd folder.
Script to unify all the CSV files acquired from Stack Exchange Data Explorer.
go run cmd/consolidate-sources/main.go
💾 After execution, the result is available at assets/data explorer/consolidated sources/.
Script to extract posts from a given topic.
go run cmd/extract-posts/main.go
💾 After execution, the result is available at assets/extracted-posts.
Script to execute the LDA algorithm.
go run cmd/lda/main.go
💾 After execution, the result is available at assets/lda-results.
Script to sample random posts according to their topics, facilitating the open card sorting (topic labeling) step.
go run cmd/open-sort/main.go
💾 After execution, the result is available at assets/opensort.
Script to search for operators among the Stack Overflow posts.
go run cmd/operators-search/main.go
💾 After execution, the result is available at assets/operators-search.
Script to process the results and generate information about the topics, their popularity, and their difficulty.
go run cmd/process-results/main.go
💾 After execution, the result is available at assets/result-processing.
The LDA script requires some configuration in a JSON file (config.json) under the /configs folder. This JSON is expected to contain an array of objects, each one representing an LDA execution. Each object must have the following structure (this is the object present by default in config.json):
{
"fileName": "all_withAnswers",
"field": "Body",
"combineTitleBody": true,
"minTopics": 10,
"maxTopics": 35,
"sampleWords": 20
}
Where:
- fileName (string): the name of the file with the posts (at assets/data explorer/consolidated sources);
- field (string): the field to be considered in the LDA (either Title or Body);
- combineTitleBody (boolean): set it to combine title and body and assign the result to the post's Body field (only applicable if field is set to "Body");
- minTopics (integer): the minimum number of topics to be generated;
- maxTopics (integer): the maximum number of topics to be generated;
- sampleWords (integer): the number of sample top words to be included in the extra file whose name ends with - topwords.
Possible requirements:
- Internet browser
- Node.js (tested with v14.17.5)
We wrote a small JS script to download the Stack Overflow posts (questions with and without accepted answers) related to the Rx libraries from Stack Exchange Data Explorer (SEDE).
It is available at /scripts/data explorer/data-explorer.js. To execute it, one must:
- Be logged in to SEDE;
- Paste the script into the DevTools console;
- Call executeQuery passing 0 (for RxJava), 1 (for RxJS), or 2 (for RxSwift) as a parameter.
Moreover, there is a second script (/scripts/data explorer/rename.js) that can be used to move (and rename) the results to their proper folder, /assets/data explorer/[rx library folder], so they can be further used by the Go consolidate-sources script. For this second JS script to work, one must place the results under /scripts/data explorer/staging area and run the script in a terminal (with Node.js), passing either 0 (for RxJava), 1 (for RxJS), or 2 (for RxSwift). For example:
node rename 0
Before executing the Node.js scripts, one must run the following terminal command within /scripts/data explorer/:
npm install
As detailed in the paper, these were the Stack Overflow tags used:
- rx-java, rx-java2, rx-java3 (RxJava)
- rxjs, rxjs5, rxjs6, rxjs7 (RxJS)
- rx-swift (RxSwift)
As defined in the preprocessing phase in the paper, some terms commonly found in Stack Overflow posts were removed from the corpus. Those include:
differ, specif, deal, prefer, easili, easier, mind, current, solv, proper, modifi, explain, hope, help, wonder, altern, sens, entir, ps, solut, achiev, approach, answer, requir, lot, feel, pretti, easi, goal, think, complex, eleg, improv, look, complic, day, chang, issu, add, edit, remov, custom, suggest, comment, ad, refer, stackblitz, link, mention, detect, face, fix, attach, perfect, mark, reason, suppos, notic, snippet, demo, line, piec, appear
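A minimal sketch of how such a stop word filter could be applied during preprocessing (the helper below is hypothetical, not the repository's actual code; it assumes the tokens have already been stemmed, matching the stemmed entries in stopwords.txt):

```go
package main

import (
	"fmt"
	"strings"
)

// removeStopWords drops any (already stemmed) token found in the stop list.
func removeStopWords(tokens []string, stopWords []string) []string {
	stop := make(map[string]struct{}, len(stopWords))
	for _, w := range stopWords {
		stop[w] = struct{}{}
	}
	kept := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if _, isStop := stop[t]; !isStop {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	// A few entries from the list above, plus domain tokens that must survive.
	stopWords := []string{"differ", "specif", "solv", "snippet", "demo"}
	tokens := strings.Fields("observ emit differ valu snippet subscrib")
	fmt.Println(removeStopWords(tokens, stopWords))
	// Prints: [observ emit valu subscrib]
}
```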
Topic # | Label/Name |
---|---|
0 | Concurrency |
1 | Stream Creation and Composition |
2 | Typing and Correctness |
3 | UI for Web-based Systems |
4 | Input Validation |
5 | Introductory Questions |
6 | Testing and Debugging |
7 | REST API Calls |
8 | Android Development |
9 | Data Access |
10 | State Management and JavaScript |
11 | Control Flow |
12 | HTTP Handling |
13 | Stream Manipulation |
14 | Error Handling |
15 | Stream Lifecycle |
16 | Array Manipulation |
17 | Web Development |
18 | General Programming |
19 | iOS Development |
20 | Multicasting |
21 | Timing |
22 | Dependency Management |
Scripts used to produce some tables and figures present in the paper are located in the GitHub Mining repository. This was done to facilitate the cross evaluation (GitHub + SO data) that some of those illustrations required.