What would be involved in supporting Tokenizer, IDF, and HashingTF features? #6
Comments
This RFE is closely related to jpmml/jpmml-sklearn#4 - Scikit-Learn and Apache Spark ML use the same thinking in the area of text feature engineering. All those transformations should map to the PMML TextIndex element. Please ignore the DocumentTermMatrix element. The biggest obstacle in the way of moving on with NLP functionality is that the JPMML-Evaluator library doesn't support the TextIndex element yet.
Thanks Villu. I'll look into that as soon as I can figure out how to actually harvest the terms, term frequencies, and inverse document frequencies that are relevant to the final model from the pipeline. I am a Spark noob and have yet to figure out how to do this using the Java ML API without making an extra pass through the Dataset and computing it.
Perhaps a custom transformer added to the pipeline will do the trick.
For example, if you have a CountVectorizer or IDF stage in your pipeline, then after fitting you can ask the resulting CountVectorizerModel for its vocabulary and the IDFModel for its inverse document frequencies. The same principle applies to other Apache Spark ML transformers as well - upon fitting a pipeline, every Estimator stage is replaced by its fitted Model counterpart, which exposes the learned parameters.
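A minimal Java sketch of that idea, assuming Spark 2.x and a fitted PipelineModel whose stages happen to include a CountVectorizerModel and an IDFModel (the method and variable names here are illustrative only):

```java
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.Transformer;
import org.apache.spark.ml.feature.CountVectorizerModel;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.ml.linalg.Vector;

public class InspectFittedStages {

    public static void inspect(PipelineModel pipelineModel) {
        // Walk the fitted stages; estimators such as CountVectorizer and IDF
        // have been replaced by their fitted Model counterparts at this point
        for (Transformer stage : pipelineModel.stages()) {
            if (stage instanceof CountVectorizerModel) {
                CountVectorizerModel cvModel = (CountVectorizerModel) stage;
                // The learned vocabulary: one term per vector index
                String[] vocabulary = cvModel.vocabulary();
                System.out.println("Vocabulary size: " + vocabulary.length);
            } else if (stage instanceof IDFModel) {
                IDFModel idfModel = (IDFModel) stage;
                // The learned inverse document frequencies, aligned with the term indices
                Vector idf = idfModel.idf();
                System.out.println("IDF weights: " + idf);
            }
        }
    }
}
```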
Excellent, thanks.
My pipeline consists of the following stages: Tokenizer, StopWordsRemover, CountVectorizer, followed by the model stage. I was checking the list of features and models supported by the ConverterUtil class, and the first three (Tokenizer, StopWordsRemover, CountVectorizer) are not listed there. I am using them to convert texts into a feature vector. Do you have any plans to support them? Can you please suggest a workaround? Thanks.
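For reference, a minimal Java sketch of that kind of pipeline; the column names and the LogisticRegression classifier are assumptions, and the ConverterUtil.toPMML(schema, pipelineModel) entry point shown in the project README will only succeed for stages that the converter actually recognizes:

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.StopWordsRemover;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.dmg.pmml.PMML;
import org.jpmml.sparkml.ConverterUtil;

public class TextPipelineExample {

    public static PMML buildAndConvert(Dataset<Row> trainingData) {
        // Split the raw "text" column into tokens
        Tokenizer tokenizer = new Tokenizer()
            .setInputCol("text")
            .setOutputCol("tokens");

        // Drop common stop words from the token list
        StopWordsRemover remover = new StopWordsRemover()
            .setInputCol("tokens")
            .setOutputCol("filteredTokens");

        // Count term occurrences into a sparse feature vector
        CountVectorizer countVectorizer = new CountVectorizer()
            .setInputCol("filteredTokens")
            .setOutputCol("features");

        // Hypothetical downstream classifier; any estimator consuming "features" would do
        LogisticRegression classifier = new LogisticRegression()
            .setLabelCol("label")
            .setFeaturesCol("features");

        Pipeline pipeline = new Pipeline()
            .setStages(new PipelineStage[]{tokenizer, remover, countVectorizer, classifier});
        PipelineModel pipelineModel = pipeline.fit(trainingData);

        // Conversion fails for stage types that ConverterUtil does not support yet
        return ConverterUtil.toPMML(trainingData.schema(), pipelineModel);
    }
}
```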
These transforms implement very simple string splitting/filtering/counting functionality. Even in combination, they would be much easier to implement than the originally requested HashingTF and IDF transformations.
Would need to implement basic TextIndex support in the JPMML-Evaluator library first.
Implement the missing pieces yourself, and submit PR(s).
Support for the Tokenizer, StopWordsRemover and CountVectorizer transformation types has been added.
Hi, first thanks for this excellent project (and also pmml-evaluator)! I have a Spark ML pipeline which uses Tokenizer, HashingTF, and IDF in order to feed a column containing text to a multiclass classifier which predicts a category. How feasible / hard would it be to support such a pipeline in jpmml-sparkml? I was thinking about taking a shot at it. Should Tokenizer get converted to an org.dmg.pmml.DocumentTermMatrix, or something else? And what about HashingTF and IDF? What pmml objects should those be converted to? Thanks in advance
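For concreteness, a minimal Java sketch of the kind of pipeline described above; the column names and the LogisticRegression classifier are assumptions. The open question in this issue is which PMML constructs the fitted Tokenizer, HashingTF, and IDF stages should map to:

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class HashingTfIdfPipeline {

    public static PipelineModel fit(Dataset<Row> trainingData) {
        // Split the raw "text" column into tokens
        Tokenizer tokenizer = new Tokenizer()
            .setInputCol("text")
            .setOutputCol("tokens");

        // Hash tokens into a fixed-size term-frequency vector
        HashingTF hashingTF = new HashingTF()
            .setInputCol("tokens")
            .setOutputCol("rawFeatures")
            .setNumFeatures(1 << 18);

        // Re-weight term frequencies by inverse document frequency
        IDF idf = new IDF()
            .setInputCol("rawFeatures")
            .setOutputCol("features");

        // Hypothetical multiclass classifier consuming the "features" column
        LogisticRegression classifier = new LogisticRegression()
            .setLabelCol("label")
            .setFeaturesCol("features");

        Pipeline pipeline = new Pipeline()
            .setStages(new PipelineStage[]{tokenizer, hashingTF, idf, classifier});
        return pipeline.fit(trainingData);
    }
}
```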