The SQL parser only works in a Python 2 environment, while NaturalCC is built on Python 3. We have therefore pre-processed the SQL/C#/Python data in a Python 2 environment and saved the results in stack_overflow.zip. If you are interested in the data processing itself, you can follow the original stack_overflow pipeline.
- download the raw dataset
bash dataset/stack_overflow/download.sh
- flatten SQL code/docstring at ~/stack_overflow/flatten/sql
python -m dataset.stack_overflow.flatten -l sql
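For intuition, flattening pairs each code snippet with its docstring and writes them out as parallel, line-aligned sequences. The sketch below is a simplified illustration under assumed field names (`code`, `docstring`), not the actual layout produced by `dataset.stack_overflow.flatten`:

```python
# Hypothetical raw records: each pairs a code snippet with its docstring.
records = [
    {"code": "SELECT id FROM users WHERE age > 21", "docstring": "select adult users"},
    {"code": "DELETE FROM logs WHERE ts < '2020-01-01'", "docstring": "purge old logs"},
]

def flatten(records):
    """Split records into two parallel, line-aligned lists.

    Newlines inside a snippet are collapsed so that line i of the code
    file always corresponds to line i of the docstring file.
    """
    codes = [r["code"].replace("\n", " ") for r in records]
    docs = [r["docstring"].replace("\n", " ") for r in records]
    return codes, docs

codes, docs = flatten(records)
assert len(codes) == len(docs) == 2
```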
- move the decompressed files to ~/stack_overflow/flatten/sql
unzip dataset/stack_overflow/sql_tokens.zip -d ~/stack_overflow/flatten/sql
- binarize SQL dataset
python -m dataset.stack_overflow.summarization.preprocess -f config/sql
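Binarization maps every token to an integer id from a vocabulary so training can consume compact index arrays instead of strings. This is only a rough sketch of the idea; the vocabulary construction, special symbols, and on-disk format here are simplified assumptions, not what the `preprocess` module actually does:

```python
from collections import Counter

def build_vocab(token_lines, min_count=1, specials=("<pad>", "<unk>")):
    """Assign integer ids by frequency, reserving ids for special symbols."""
    counts = Counter(tok for line in token_lines for tok in line)
    vocab = {s: i for i, s in enumerate(specials)}
    for tok, c in counts.most_common():
        if c >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def binarize(line, vocab):
    """Replace tokens with ids; unknown tokens fall back to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in line]

lines = [["SELECT", "id", "FROM", "users"], ["SELECT", "*", "FROM", "logs"]]
vocab = build_vocab(lines)
ids = binarize(["SELECT", "name", "FROM", "users"], vocab)
# "name" never appeared in the training lines, so it maps to <unk>.
assert ids[1] == vocab["<unk>"]
```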
- install antlr4-python3-runtime
pip install antlr4-python3-runtime==4.5.2
- flatten C# code/docstring at ~/stack_overflow/flatten/csharp
python -m dataset.stack_overflow.flatten -l csharp
- tokenize code/docstring into code_token/docstring_token
python -m dataset.stack_overflow.tokenization -l csharp
- Since generating code_token/docstring_token is slow, you can instead move the decompressed files to ~/stack_overflow/flatten/csharp
unzip dataset/stack_overflow/csharp_tokens.zip -d ~/stack_overflow/flatten/csharp
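To see what tokenization does at a glance: it splits a code string into a list of lexical units. The crude regex lexer below is purely illustrative; the real C# tokenizer in `dataset.stack_overflow.tokenization` is far more precise about literals, operators, and comments:

```python
import re

# Illustrative only: identifiers, integer literals, or any single
# non-space character. A real lexer handles strings, multi-character
# operators, comments, etc.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|\S")

def rough_tokenize(code):
    return TOKEN_RE.findall(code)

tokens = rough_tokenize("int x = foo(42);")
assert tokens == ["int", "x", "=", "foo", "(", "42", ")", ";"]
```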
- binarize C# dataset
python -m dataset.stack_overflow.summarization.preprocess -f config/csharp
- flatten Python code/docstring at ~/stack_overflow/flatten/python
python -m dataset.stack_overflow.flatten -l python
- tokenize code/docstring into code_token/docstring_token
python -m dataset.stack_overflow.tokenization -l python
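For Python source specifically, the standard library's `tokenize` module gives a feel for the token stream this step produces. This is for intuition only; the dataset's own tokenization rules may differ:

```python
import io
import tokenize

def python_tokens(code):
    """Lex Python source with the stdlib tokenizer, dropping
    structural tokens (newlines, end markers) to keep only content."""
    skip = (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER, tokenize.INDENT,
            tokenize.DEDENT)
    toks = tokenize.generate_tokens(io.StringIO(code).readline)
    return [t.string for t in toks if t.type not in skip]

assert python_tokens("x = f(1) + 2\n") == ["x", "=", "f", "(", "1", ")", "+", "2"]
```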
- binarize Python dataset
python -m dataset.stack_overflow.summarization.preprocess -f config/python