Port new Tokeniser from Linguist #193
Linguist's tokenizer is defined using a flex-based grammar. There are three options for porting it:

1. Generating Go code from the flex grammar. Go does have a limited port of flex, https://gitlab.com/cznic/golex, but it is missing 2 features needed in order to handle the above definition (see logs in the details for reproduction instructions). At this point it is hard to estimate the effort of adding those features upstream.

2. Porting the lexer grammar to Ragel. An instructive go-nuts thread on this subject suggests that a somewhat more complex solution is worth trying, similar to the discussion in #167, based on Ragel, another FSM generator that can be "compiled" to Go code. That would only require porting 1 file.

3. Using the flex-generated native lexer through cgo. Hidden behind a build tag, this option uses the same native, flex-generated tokenizer from Linguist directly. This is low-hanging fruit, as it does not require much porting effort and is the simplest way to verify the hypothesis about classifier accuracy from #194.
#193 (comment) updated to include another option: using the existing flex-based tokenizer through cgo.
Part of #155.

Right now enry uses a content tokenization approach based on regexps from Linguist before v5.3.2. This issue is about enry supporting/producing the same results as the new, flex-based scanner introduced in github/linguist#3846.

This is important as it affects Bayesian classifier accuracy, and the classifier tests in both projects make a strong assumption that all samples can be distinguished by a content classifier alone.