An ML classification task where the input consists of simplified AST representations of Python programs. Here is a short presentation slideshow of this project.
The dataset is in `data`. It has been collected, processed, anonymized and filtered from this Codeforces dataset. More information about the dataset is available here. Code for creating my sub-dataset is in `util`.
The dataset consists of simplified ASTs of ~400,000 short Python programs grouped into 104 classes. The programs in one class are accepted Python submissions for one particular Codeforces task. I have removed information about all identifiers. The line `x = ['elem1', 'elem2']` is represented as

```
[('Assign', [('Name', ['Store']), ('List', ['Constant', 'Constant', 'Load'])])]
```
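For reference, here is a minimal sketch of how such a representation can be produced with Python's `ast` module; the actual conversion code lives in `util` and may differ in details:

```python
import ast

def simplify(node: ast.AST):
    # Recursively convert an AST node into a (node type, children) structure,
    # dropping identifier names and constant values.
    children = [simplify(child) for child in ast.iter_child_nodes(node)]
    name = type(node).__name__
    return (name, children) if children else name

tree = ast.parse("x = ['elem1', 'elem2']")
print([simplify(stmt) for stmt in tree.body])
# [('Assign', [('Name', ['Store']), ('List', ['Constant', 'Constant', 'Load'])])]
```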
`data/data_statistics.txt` contains dataset statistics. Deciles for the number of lines, number of tokens and maximal tree depth are: (TODO(aleloi): add plots)
```
num lines stats:  [4.0, 6.0, 7.0, 8.0, 10.0, 11.0, 13.0, 16.0, 21.0]
num tokens stats: [36.0, 47.0, 57.0, 65.0, 73.0, 82.0, 94.0, 112.0, 146.0]
depth stats:      [7.0, 7.0, 8.0, 8.0, 9.0, 9.0, 10.0, 10.0, 11.0]
```
Some ASTs are very large because some solutions start with a long template. Initially I filtered out the longest 1% of the data. Modify `dataset.DataArgs.drop_large_threshold_tokens` to change this.
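A rough sketch of that filtering step, assuming token counts are available per sample (the real cutoff is an explicit token threshold set via `dataset.DataArgs.drop_large_threshold_tokens`):

```python
import numpy as np

def drop_largest(samples, token_counts, quantile=0.99):
    # Hypothetical helper: keep only samples whose token count is at or
    # below the chosen quantile of the token-count distribution.
    threshold = np.quantile(token_counts, quantile)
    return [s for s, n in zip(samples, token_counts) if n <= threshold]
```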
Two classes in the dataset correspond to the same problem. It appears twice on Codeforces, as codeforces.com/contest/1465/problem/A and as codeforces.com/contest/1411/problem/A. Since I noticed this quite late in development, there is a hack in dataset.py that manually assigns both these problems the same label. Otherwise, the labels would be permuted, and that hinders evaluation of saved models.
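The hack amounts to mapping both problem IDs to one canonical label before building the label index. An illustrative sketch (the naming scheme here is made up; the actual code in `dataset.py` differs):

```python
# Both IDs refer to the same task on Codeforces, so they share one label.
DUPLICATE_PROBLEMS = {"1465_A": "1411_A"}

def canonical_problem_id(problem_id: str) -> str:
    return DUPLICATE_PROBLEMS.get(problem_id, problem_id)
```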
The dataset contains plenty of duplicate ASTs. As stated above, there are ~400k ASTs, but only ~200k unique ones. There is a version of the dataset without duplicates in `data/cont*_prob*.txt.uniq.txt`.
The solution distribution in the dataset is quite inhomogeneous: the most common problems have ~20k solutions, and the least common has ~1000. I want to compare results to Tree-based CNN, which used exactly 500 programs per class. To ensure a uniform prior over problem classes, I first randomly split the dataset and then use weighted sampling; one epoch draws `500 * num_classes` samples from the training split. Check `train_dgl.py` for dataset options.
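A minimal sketch of that class-balanced sampling with `torch.utils.data.WeightedRandomSampler` (the actual configuration lives in `train_dgl.py` and `dataset.py`):

```python
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler

def make_balanced_sampler(labels, samples_per_class=500):
    # Weight each sample inversely to its class frequency so that every class
    # contributes roughly samples_per_class draws per epoch in expectation.
    counts = Counter(labels)
    weights = torch.tensor([1.0 / counts[y] for y in labels], dtype=torch.double)
    return WeightedRandomSampler(
        weights, num_samples=samples_per_class * len(counts), replacement=True
    )
```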
The goal of this project is to compare the performance of Tree-LSTM to Tree-based CNN. I only plan to implement Tree-LSTM and compare with the TBCNN performance on a classification task similar to the one described in the TBCNN article.
This is primarily an educational project about training neural networks that take ASTs as input.
I got 97% accuracy on my 104-problem dataset with an ordinary LSTM (TODO eval on test set at epoch 28) and 98% accuracy with TreeLSTM. Training the TreeLSTM took about a day on a laptop. Training speed was ~30 samples per second for TreeLSTM and about 400 samples per second for LSTM. I used the Adam optimizer.
With the original dataset, I do not observe any overfitting. The dataset contains duplicate entries that end up in both the validation and training sets; this reduces the observed overfitting because we partly evaluate on the training set. It also contains two identical problems. After around epoch 100 and 3-4 hours of training, the misclassified training samples are almost entirely composed of the duplicated problem. This makes the network see relatively large gradients every ~100 samples, which I think has a slight regularizing effect.
When duplicate solutions are removed and the two duplicate problems are merged, I got 97.2% accuracy (TODO eval on test set at epoch 28) on the 103 remaining classes with the linear LSTM. It does start overfitting after processing 28 * 104 * 500 ≈ 1.5M samples, trained at `lr=0.002` for the first 10 epochs and then at `lr=0.0001`. I did not include any regularization because I discovered the overfitting quite late, and because the Tree-based CNN article says they found that no regularization worked best for RNN models.
I made a simple PyTorch Tree-LSTM implementation by following equations (2)-(8) in the Tree-Structured Long Short-Term Memory article. It trained at ~16 samples per second on CPU, which I managed to push up to 22 samples/second. Training on GPU was even slower at 5 samples/s. I then adapted a Tree-LSTM implementation from the Deep Graph Library (DGL) examples repository, which now runs at 40 samples/s. This is about 10 times slower than the linear LSTM speed.
```
DGLTreeLSTM(
  (embedding): Embedding(88, 40, padding_idx=87)
  (lstm_cell): ChildSumTreeLSTMCell(
    (W_iou): Linear(in_features=40, out_features=600, bias=False)
    (U_iou): Linear(in_features=200, out_features=600, bias=False)
    (U_f): Linear(in_features=200, out_features=200, bias=True)
  )
  (fc_layers): Sequential(
    (0): Linear(in_features=200, out_features=200, bias=True)
    (1): Tanh()
    (2): Linear(in_features=200, out_features=104, bias=True)
  )
  (loss): CrossEntropyLoss()
  (train_loss): CrossEntropyLoss()
)
```
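For reference, here is a readable, unbatched sketch of the Child-Sum Tree-LSTM cell from equations (2)-(8). The DGL cell in the printout above batches this via message passing, and judging from the printout it places biases differently and has no `W_f` input term in the forget gate; the version below follows the paper:

```python
import torch
import torch.nn as nn

class ChildSumCell(nn.Module):
    """One Child-Sum Tree-LSTM step (Tai et al., eqs. 2-8); illustrative only."""

    def __init__(self, x_dim: int, h_dim: int):
        super().__init__()
        self.W_iou = nn.Linear(x_dim, 3 * h_dim)          # input, output, update gates
        self.U_iou = nn.Linear(h_dim, 3 * h_dim, bias=False)
        self.W_f = nn.Linear(x_dim, h_dim)                # forget gate, one per child
        self.U_f = nn.Linear(h_dim, h_dim, bias=False)

    def forward(self, x, child_h, child_c):
        # x: (x_dim,); child_h, child_c: (num_children, h_dim)
        h_tilde = child_h.sum(dim=0)                                  # eq. 2
        i, o, u = (self.W_iou(x) + self.U_iou(h_tilde)).chunk(3)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)   # eqs. 3, 5, 6
        f = torch.sigmoid(self.W_f(x) + self.U_f(child_h))            # eq. 4
        c = i * u + (f * child_c).sum(dim=0)                          # eq. 7
        h = o * torch.tanh(c)                                         # eq. 8
        return h, c
```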
Tree-LSTMs train much faster when measured in samples processed. They also produce better results.
TODO: image when the deduped dataset run finishes
To run you need Python >= 3.9 and CUDA 11.3. Run `pip install -r requirements.txt`. Python 3.9 is required because I used `argparse.BooleanOptionalAction`, which is a 3.9 feature. CUDA 11.3 is required because the `pytorch` and `dgl` versions in `requirements.txt` say so. I think I haven't hard-coded running on GPU, so a CUDA-less `dgl` and `pytorch` probably also work, but I haven't tested that.
There is a runner script with far too many arguments. One can point it to a Python file and ask it to predict which Codeforces task it solves. I've included a saved model, so this should work after installing the dependencies:
```
python main.py --mode predict \
    --saved_model_dir \
    ./results/model_linear__classes_104__emb_dim_40__lstm_dim_200__fc_depth_3__label_smoothing_0_05__lr_0_002__prune_uniq_True \
    --predict_path ./data/my/1454_D.py
```
This currently outputs:

```
Using device [cuda].
Loading model from ./results/model_linear__classes_104__emb_dim_40__lstm_dim_200__fc_depth_3__label_smoothing_0_05__lr_0_002__prune_uniq_True...
./results/model_linear__classes_104__emb_dim_40__lstm_dim_200__fc_depth_3__label_smoothing_0_05__lr_0_002__prune_uniq_True contains runs 04-03-2022-21--21--08
Latest run is: 04-03-2022-21--21--08
Loading epoch: Epoch--199-Loss--3.50.pt out of 200 epochs
Missing/unexpected: <All keys matched successfully>
P=0.951, problem=1454D
P=0.001, problem=467A
P=0.001, problem=271A
```
I am logging training curves, gradient values, gradient norms, weight values and the token embeddings to Tensorboard.
The `--small_train_samples` argument specifies how many samples of the training set are used after each epoch to estimate average gradient values and norms. The idea is to keep the same sample distribution as the training data, so I use a smaller subset of training samples and the same batch size. For every weight matrix and bias vector, the average gradient value and the gradient norm are logged.
The figure shows that gradient norms slowly decrease as training progresses. This is a useful debugging tool: before I added label smoothing, the gradients increased instead of decreasing.
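A sketch of that kind of per-parameter logging with `torch.utils.tensorboard` (the tag names are made up; the real training script may organize the logs differently):

```python
import torch
from torch.utils.tensorboard import SummaryWriter

def log_gradients(writer: SummaryWriter, model: torch.nn.Module, step: int):
    # Log the gradient norm and a histogram of gradient values for every
    # weight matrix and bias vector that received a gradient.
    for name, param in model.named_parameters():
        if param.grad is not None:
            writer.add_scalar(f"grad_norm/{name}", param.grad.norm().item(), step)
            writer.add_histogram(f"grad/{name}", param.grad, step)
```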
To see the graphs, run `python -m tensorboard.main --logdir=results/` and open http://localhost:6006 once it has processed the data. I checked in some saved TensorBoard data together with the trained LSTM model (although that run may not have any histograms).
I wondered whether there is a pattern in which problems are misclassified as which. The figure below is a kind of confusion matrix, but I've removed the diagonal, and the cell numbers are counts rather than probabilities. It is essentially a 2D histogram of (true class, predicted class) pairs over the misclassified samples from the test set.
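A sketch of how such a matrix can be computed (names are illustrative):

```python
import numpy as np

def misclassification_counts(y_true, y_pred, num_classes):
    # Count (true, predicted) class pairs for misclassified samples only;
    # the diagonal stays zero and cells are raw counts, not probabilities.
    counts = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        if t != p:
            counts[t, p] += 1
    return counts
```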
The classes are ordered lexicographically by string comparison. I don't remember why, but problem names ended up as strings in `util/collect_dataset.py`. When I generated the dataset, I made a frequency analysis on part of the full Codeforces data and picked the 104 most common classes. Then I re-ran data collection on a much larger subset of the CF dump, and in the end I got lots of problems from contests 1400-1500. Classes 11-65 are all from 14xx contests. They also seem to be the classes between which there is the least confusion.
The only thing I can think of that separates them from the other classes is the number of programs per class. The 14xx problems are on average 5-10 times less common in the dataset than other problems. It could of course be that 14xx problems differ in some other way, e.g. problem setters keep track of recent problems and don't propose similar ones, but I find that explanation unlikely.
When a problem has a large class, say 5000 data points, it takes about 10 epochs until the network sees the same instance again (each class contributes ~500 samples per epoch). Perhaps for a large class the network learns that there is large variability in the training data and that all kinds of inputs may belong to this class. It becomes less confident about individual samples and often assigns high probability to the largest classes.
- Do some analysis of the resulting embedding: which nodes are closest, are there clusters? I expect `For` and `While` to be close. `Break` and `Continue` should also form a group.
- Write TreeLSTM
- Extract some interesting metrics: between which tokens does the LSTM forget?
- Check whether training is faster if we initialize the embedding and LSTM weights from a pre-trained model with fewer classes.