Skip to content

Latest commit

 

History

History

stackoverflow

StackOverflow Dataset Generation

SQL parser is built and only functions in Python2 env. However, since NaturalCC is designed on Python3, we have processed SQL/C#/Python data in a Python2 based environment. and saved them in stack_overflow.zip. If interested in the data processing, you can follow original stack_overflow.

Step 1. Download StackOverflow C#/SQL/Python datasets

bash dataset/stack_overflow/download.sh

Step 2. SQL Generation

  1. flatten SQL code/docstring at ~/stack_overflow/flatten/sql
python -m dataset.stack_overflow.flatten -l sql
  1. move those decompressed files to ~/stack_overflow/flatten/sql
unzip dataset/stack_overflow/sql_tokens.zip -d ~/stack_overflow/flatten/sql
  1. binarize SQL dataset
python -m dataset.stack_overflow.summarization.preprocess -f config/sql

Step 3. C# Generation

  1. install antlr4-python3-runtime
pip install antlr4-python3-runtime==4.5.2
  1. flatten C# code/docstring at ~/stack_overflow/flatten/csharp
python -m dataset.stack_overflow.flatten -l csharp
  1. tokenize code/docstring into code_token/docstring_token
python -m dataset.stack_overflow.tokenization -l csharp

Since generating code_token/docstring_token is slow, you can move those decompressed files to ~/stack_overflow/flatten/csharp

unzip dataset/stack_overflow/csharp_tokens.zip -d ~/stack_overflow/flatten/csharp
  1. binarize C# dataset
python -m dataset.stack_overflow.summarization.preprocess -f config/csharp

Step 4. Python Generation

  1. flatten Python code/docstring at ~/stack_overflow/flatten/python
python -m dataset.stack_overflow.flatten -l python
  1. tokenize code/docstring into code_token/docstring_token
python -m dataset.stack_overflow.tokenization -l python
  1. binarize Python dataset
python -m dataset.stackoverflow.summarization.preprocess -f config/python