Discussion about baselines for CodRep #13
Sounds reasonable. For now I've implemented the second one and am halfway through implementing the first and third (halfway meaning that they are implemented but run slowly and are not properly debugged).
Small correction: I matched prefix and suffix instead of tokens.
Hi all, Thank you for sharing ideas to improve the competition. The baseline algorithms that I included in this competition are really "stupid". I did not include other baselines since I do not know how they would perform compared with your algorithms. The baseline algorithms should be easily beaten and easily understood. If, however, everyone seems to be able to beat the current baseline algorithms, and the ideas of "the line that is the most similar to the line being added" or "the code still parses" are commonly adopted, then I would add the proposed baselines to the set of baseline algorithms in the competition.
Very nice to see you here @mallamanis!
Oh no, that's great to have a lively issue tracker for discussing all sorts of things!
Hi all, My rationale for suggesting those baselines was that filtering out the lines where parsing fails is a constraint that I would bake into any model to remove obvious false positives. The "most similar line" baseline would be the first sanity check of whether an ML model captures something beyond line similarity (such as the context around each line), since line similarity seems to be a somewhat strong predictor.
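For illustration, the "most similar line" baseline mentioned above could be sketched roughly as follows. This is only a minimal sketch, not the competition's implementation: the function name `most_similar_line`, the use of `difflib.SequenceMatcher` as the similarity measure, and the 1-based line numbering are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def most_similar_line(file_lines, new_line):
    """Return the 1-based number of the line most similar to new_line.

    Hypothetical helper: character-level similarity via difflib is an
    assumption; a token-based measure could be swapped in instead.
    Returns 0 if file_lines is empty.
    """
    target = new_line.strip()
    best_index, best_score = 0, -1.0
    for i, line in enumerate(file_lines, start=1):
        # ratio() is in [0, 1]; 1.0 means the two strings are identical.
        score = SequenceMatcher(None, line.strip(), target).ratio()
        if score > best_score:  # strict '>' keeps the earliest line on ties
            best_index, best_score = i, score
    return best_index


if __name__ == "__main__":
    source = ["int x = 0;", "return x + 1;", "}"]
    print(most_similar_line(source, "int x = 1;"))
```

A model that cannot beat a guess like this is probably not learning much beyond surface similarity between lines.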
100% agree. The corollary is that this dataset and loss function help to show which ML techniques are good at capturing line similarity.
Agreed, although I am sure that capturing some information about the context will also be useful.
No more activity here, closing the issue. Don't hesitate to reopen if appropriate!
Hi,
Thanks a lot for organizing this :) Hope that you don't mind the drive-by issue submission: I would like to suggest three additional, strong, but reasonable, baselines:
The reason I am suggesting this is that these baselines seem to be easy "hacks" for achieving reasonable performance without any machine learning.