Problem with underscore when using RegexTokenizer() #40

ajolles-kenshoo · 2018-03-18T12:12:44Z

Hello,
here is an issue I'm facing when using RegexTokenizer:
When RegexTokenizer is used in a Spark pipeline, jpmml-sparkml allows only two patterns:
"\s+" and "\W+".
With "\W+" and gaps=True, non-alphanumeric characters are removed, but underscores ("_") are not, so they stay inside the tokens.
However, when underscores appear in the text, the function toPMMLBytes returns an error that is related to the underscore.
So it looks like underscores cannot be removed, but cannot be left in either.
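
To illustrate the tokenizing behavior described above, here is a minimal sketch using Python's built-in re module rather than Spark itself. It assumes that splitting on the pattern approximates RegexTokenizer with gaps=True, and that the regex engines agree that "\W" does not match "_" (true for both Python and Java regular expressions). The sample text and the wider "[\W_]+" pattern are hypothetical and shown only for contrast; nothing here implies that jpmml-sparkml accepts that pattern.

```python
import re

# Hypothetical sample text containing an underscore.
text = "foo_bar baz-qux"

# Splitting on "\W+" approximates RegexTokenizer(pattern="\\W+", gaps=True).
# "\W" is the complement of "\w", and "\w" includes "_", so the
# underscore is never treated as a separator.
tokens = [t for t in re.split(r"\W+", text) if t]
print(tokens)  # ['foo_bar', 'baz', 'qux']

# For contrast only: a pattern that also treats "_" as a separator.
# This is an illustrative assumption, not a pattern from the report.
tokens_no_underscore = [t for t in re.split(r"[\W_]+", text) if t]
print(tokens_no_underscore)  # ['foo', 'bar', 'baz', 'qux']
```

The first result shows why "foo_bar" survives as a single token with the underscore still inside it.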

Thanks

vruusmann · 2018-03-18T12:54:27Z

> the function toPMMLBytes returns an error which is related to the underscore.

Can you paste this error here?
