Discussion about baselines for CodRep #13

Closed
mallamanis opened this issue Apr 12, 2018 · 8 comments

Comments

@mallamanis

Hi,
Thanks a lot for organizing this :) Hope that you don't mind the drive-by issue submission: I would like to suggest three additional, strong, but reasonable, baselines:

  • Random prediction over the lines for which the code still parses after the replacement;
  • The line that is the most similar to the line being added (e.g. max % common tokens between the lines);
  • The combination of the above.

The reason I am suggesting this is that these baselines seem like easy "hacks" to achieve reasonable performance without any machine learning.
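As a rough illustration of the second baseline, here is a minimal sketch that picks the file line sharing the largest fraction of tokens with the new line. The tokenizer regex, the use of Jaccard overlap as the "% common tokens" measure, and the function names are all assumptions made for this example, not something taken from the CodRep repository.

```python
import re

# Assumed tokenizer: identifiers, integer literals, or any single
# non-whitespace character. A real Java lexer would be more precise.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(line):
    """Split a source line into a list of crude tokens."""
    return TOKEN_RE.findall(line)


def token_overlap(a, b):
    """Jaccard overlap of the token sets of two lines, in [0, 1]."""
    ta, tb = set(tokenize(a)), set(tokenize(b))
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def most_similar_line(file_lines, new_line):
    """Return the index of the line most similar to new_line.

    Ties are broken in favour of the earliest line.
    """
    scores = [token_overlap(line, new_line) for line in file_lines]
    return max(range(len(file_lines)), key=scores.__getitem__)


file_lines = ["int a = 0;", "return a + b;", "int b = 1;"]
predicted = most_similar_line(file_lines, "return a - b;")
```

Here `predicted` is the zero-based index of the line the baseline would replace; any other similarity measure could be swapped in for `token_overlap`.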

@egor-bogomolov
Contributor

Sounds reasonable. For now I've implemented the second one and am halfway through implementing the first and third (halfway meaning that they are implemented but run slowly and are not properly debugged).
The second idea on its own gives around 85% accuracy.

@egor-bogomolov
Contributor

Small correction: I matched prefixes and suffixes instead of tokens.
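The comment does not say exactly how prefixes and suffixes were matched, so the following is only one plausible reading, sketched under that assumption: score each line by the length of its common character prefix plus common character suffix with the new line, normalised by the longer line. The function names and the normalisation are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_suffix_score(a, b):
    """Shared prefix plus shared suffix, as a fraction of the longer line."""
    a, b = a.strip(), b.strip()
    if not a or not b:
        return 0.0
    p = common_prefix_len(a, b)
    # Measure the suffix only on what is left after the shared prefix,
    # so that no character is counted twice.
    s = common_prefix_len(a[p:][::-1], b[p:][::-1])
    return (p + s) / max(len(a), len(b))


def best_prefix_suffix_match(file_lines, new_line):
    """Index of the line with the highest prefix/suffix score."""
    return max(
        range(len(file_lines)),
        key=lambda i: prefix_suffix_score(file_lines[i], new_line),
    )


file_lines = ["int a = 0;", "return a + b;", "int b = 1;"]
predicted = best_prefix_suffix_match(file_lines, "return a - b;")
```

This variant rewards edits that change only the middle of a line, which is a common shape for one-line fixes.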

@chenzimin
Collaborator

chenzimin commented Apr 12, 2018

Hi all,

Thank you for sharing ideas to improve the competition.

The baseline algorithms that I included in this competition are really "stupid". I did not include other baselines because I do not know how they would perform compared with your algorithms. The baseline algorithms should be easy to beat and easy to understand.

If, however, everyone seems to be able to beat the current baseline algorithms, and the ideas of "the line that is the most similar to the line being added" or "the code still parses" are commonly adopted, then I will add the suggested baselines to the set of baseline algorithms in the competition.

@monperrus monperrus changed the title Baselines Discussion about baselines Apr 13, 2018
@monperrus monperrus changed the title Discussion about baselines Discussion about baselines for CodRep Apr 13, 2018
@monperrus
Collaborator

Very nice to see you here @mallamanis!

Hope that you don't mind the drive-by issue submission

Oh no, that's great to have a lively issue tracker for discussing all sorts of things!

@mallamanis
Author

Hi all,
Thanks for your prompt responses :)

My rationale for suggesting those baselines was that filtering out the lines where parsing fails is a constraint that I would bake into any model to remove obvious false positives. The "most similar line" baseline would be the first "sanity check" to check if an ML model captures something beyond line similarity (such as context around each line) which seems a somewhat strong predictor.
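A minimal sketch of how such a parse filter could wrap any scorer is shown below. The dataset is Java, and a real implementation would call an actual Java parser; `balanced_brackets` here is only a toy stand-in that checks bracket nesting, and every function name is hypothetical. When no candidate survives the filter, the sketch falls back to all lines, which is an assumption rather than something discussed in the thread.

```python
import random


def balanced_brackets(source):
    """Toy stand-in for a parser: True if (), [] and {} nest correctly."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in source:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack


def parsing_candidates(file_lines, new_line, parses=balanced_brackets):
    """Indices i such that replacing line i with new_line still parses."""
    keep = []
    for i in range(len(file_lines)):
        patched = file_lines[:i] + [new_line] + file_lines[i + 1:]
        if parses("\n".join(patched)):
            keep.append(i)
    return keep


def random_parsing_baseline(file_lines, new_line, parses=balanced_brackets, rng=random):
    """First baseline: a uniformly random choice among parsing candidates."""
    pool = parsing_candidates(file_lines, new_line, parses) or list(range(len(file_lines)))
    return rng.choice(pool)


def filtered_argmax(file_lines, new_line, score, parses=balanced_brackets):
    """Combined baseline: best-scoring line among parsing candidates."""
    pool = parsing_candidates(file_lines, new_line, parses) or list(range(len(file_lines)))
    return max(pool, key=lambda i: score(file_lines[i], new_line))


file_lines = ["void f() {", "  int a = g(1;", "  return a;", "}"]
new_line = "  int a = g(1);"
candidates = parsing_candidates(file_lines, new_line)
```

The `score` argument can be a line-similarity function or the output of a learned model, so the same filter serves both as a baseline and as a post-processing step.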

@monperrus
Collaborator

The "most similar line" baseline would be the first "sanity check" to check if an ML model captures something beyond line similarity (such as context around each line) which seems a somewhat strong predictor.

100% agree.

The corollary is that this dataset and loss function help to show which ML techniques are good at capturing line similarity.

@mallamanis
Author

Agreed, although I am sure that capturing some information about the context will also be useful.

@monperrus
Collaborator

No more activity here, closing the issue. Don't hesitate to reopen if appropriate!

4 participants