
What would be involved in supporting Tokenizer, IDF, and HashingTF features? #6

Closed
n-merritt opened this issue Oct 26, 2016 · 8 comments

@n-merritt

Hi, first off, thanks for this excellent project (and also pmml-evaluator)! I have a Spark ML pipeline which uses Tokenizer, HashingTF, and IDF in order to feed a column containing text to a multiclass classifier, which predicts a category. How feasible/hard would it be to support such a pipeline in jpmml-sparkml? I was thinking about taking a shot at it. Should Tokenizer get converted to an org.dmg.pmml.DocumentTermMatrix, or something else? And what about HashingTF and IDF? What PMML objects should those be converted to? Thanks in advance

@vruusmann
Member

vruusmann commented Oct 26, 2016

This RFE is closely related to jpmml/jpmml-sklearn#4 - Scikit-Learn and Apache Spark ML take the same approach to text feature engineering.

All those transformations should map to the TextIndex element in one way or another.

Please ignore the DocumentTermMatrix element, and all other elements that are defined in the scope of the TextModel element. This part of the PMML specification has been deprecated, so it would be pointless to use it as a foundation for future work.

The biggest obstacle to moving forward with NLP functionality is that the JPMML-Evaluator library doesn't support the TextIndex element yet. So, you should direct your initial efforts there.

@n-merritt
Author

Thanks Villu. I'll look into that as soon as I can figure out how to actually harvest the terms, term frequencies, and inverse document frequencies that are relevant to the final model from the pipeline. I am a Spark noob and have yet to figure out how to do this using the Java ML API without making an extra pass through the Dataset and computing them myself.

@n-merritt
Author

Perhaps a custom transformer added to the pipeline will do the trick.

@vruusmann
Member

For example, if you have an ml.feature.IDF instance in the pipeline, then fitting the pipeline turns it into an ml.feature.IDFModel instance. You can then call IDFModel#idf() to obtain its IDF vector.

The same principle applies to other Apache Spark ML transformers as well - upon fitting a pipeline, an Estimator instance becomes a Model instance, which then provides all the information about the training dataset.
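
For illustration, a self-contained sketch along these lines (the toy data, column names and class name are assumptions made up for this example, not anything prescribed by Apache Spark):

```java
import java.util.Arrays;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class IdfVectorExample {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("IdfVectorExample")
            .master("local[*]")
            .getOrCreate();

        // Toy training data with a single "text" column
        Dataset<Row> trainingData = spark.createDataFrame(
            Arrays.asList(
                RowFactory.create("the quick brown fox"),
                RowFactory.create("jumps over the lazy dog")),
            new StructType().add("text", DataTypes.StringType));

        Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
        HashingTF hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures");
        IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");

        Pipeline pipeline = new Pipeline()
            .setStages(new PipelineStage[]{tokenizer, hashingTF, idf});

        // Fitting the pipeline turns the IDF estimator into a fitted IDFModel
        PipelineModel pipelineModel = pipeline.fit(trainingData);

        // The third stage (index 2) is now an IDFModel; idf() returns the inverse document frequency vector
        IDFModel idfModel = (IDFModel) pipelineModel.stages()[2];
        Vector idfVector = idfModel.idf();
        System.out.println(idfVector);

        spark.stop();
    }
}
```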

@n-merritt
Author

Excellent, thanks

@bipptech

My pipeline consists of the following stages:
Tokenizer, StopWordsRemover, CountVectorizer, StringIndexer, RandomForestClassifier, IndexToString

I was checking the list of features and models supported in the ConverterUtil class; the first three (Tokenizer, StopWordsRemover, CountVectorizer) are not listed there. I am using them to convert text into feature vectors.

Do you have any plans to support them? Can you please suggest a workaround?
Alternatively, can you suggest another set of features/models (listed in the ConverterUtil class) that could be used to convert text into feature vectors?

Thanks,
Bipp.

@vruusmann
Member

My pipeline consists of the following stages: Tokenizer, StopWordsRemover, CountVectorizer, ..

These transforms implement very simple string splitting/filtering/counting functionality. Even in combination, they would be much easier to implement than the originally requested IDF transform (see the sketch at the end of this comment).

Do you have any plans to support them?

Basic TextIndex element support (string tokenization and stop word filtering) would need to be implemented in the JPMML-Evaluator library first.

Can you please suggest a workaround?

Implement the missing pieces yourself, and submit PR(s).
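
For reference, a minimal sketch of those three stages in the Spark ML Java API (the class name and column names are made up for this example):

```java
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.StopWordsRemover;
import org.apache.spark.ml.feature.Tokenizer;

public class TextStages {

    public static PipelineStage[] textStages() {
        // String splitting
        Tokenizer tokenizer = new Tokenizer()
            .setInputCol("text")
            .setOutputCol("tokens");

        // Stop word filtering
        StopWordsRemover stopWordsRemover = new StopWordsRemover()
            .setInputCol("tokens")
            .setOutputCol("filteredTokens");

        // Term counting (produces a sparse term frequency vector)
        CountVectorizer countVectorizer = new CountVectorizer()
            .setInputCol("filteredTokens")
            .setOutputCol("features");

        return new PipelineStage[]{tokenizer, stopWordsRemover, countVectorizer};
    }
}
```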

@vruusmann
Member

Support for CountVectorizer, Tokenizer, NGram, StopWordsRemover, RegexTokenizer and IDF transformers is available in JPMML-SparkML version 1.1.8 and newer.
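
A rough sketch of the conversion step, assuming a pipeline containing these stages has already been fitted. The class and method names below are made up for this example; ConverterUtil#toPMML(StructType, PipelineModel) is the conversion entry point in the 1.1.x line, and JPMML-Model's JAXBUtil is one way to serialize the result:

```java
import java.io.FileOutputStream;
import java.io.OutputStream;

import javax.xml.transform.stream.StreamResult;

import org.apache.spark.ml.PipelineModel;
import org.apache.spark.sql.types.StructType;
import org.dmg.pmml.PMML;
import org.jpmml.model.JAXBUtil;
import org.jpmml.sparkml.ConverterUtil;

public class PmmlExport {

    // schema is the schema of the training Dataset; pipelineModel is the fitted pipeline
    public static void export(StructType schema, PipelineModel pipelineModel, String path) throws Exception {
        PMML pmml = ConverterUtil.toPMML(schema, pipelineModel);

        try (OutputStream os = new FileOutputStream(path)) {
            JAXBUtil.marshalPMML(pmml, new StreamResult(os));
        }
    }
}
```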
