
What would be involved in supporting Tokenizer, IDF, and HashingTF features? #6

Closed
n-merritt opened this issue Oct 26, 2016 · 8 comments

@n-merritt

Hi, first off, thanks for this excellent project (and also pmml-evaluator)! I have a Spark ML pipeline which uses Tokenizer, HashingTF, and IDF in order to feed a column containing text to a multiclass classifier, which predicts a category. How feasible/hard would it be to support such a pipeline in jpmml-sparkml? I was thinking about taking a shot at it. Should Tokenizer get converted to an org.dmg.pmml.DocumentTermMatrix, or something else? And what about HashingTF and IDF? What PMML objects should those be converted to? Thanks in advance

@vruusmann
Member

vruusmann commented Oct 26, 2016

This RFE is closely related to jpmml/jpmml-sklearn#4 - Scikit-Learn and Apache Spark ML take the same approach to text feature engineering.

All those transformations should map to the TextIndex element in one way or another.

Please ignore the DocumentTermMatrix element, and all other elements that are defined in the scope of the TextModel element. This part of the PMML specification has been deprecated, so it would be pointless to use it as a foundation for future work.

The biggest obstacle to moving forward with NLP functionality is that the JPMML-Evaluator library doesn't support the TextIndex element yet. So, you should direct your initial efforts there.

@n-merritt
Author

Thanks Villu. I'll look into that as soon as I can figure out how to actually harvest the terms, term frequencies, and inverse document frequencies that are relevant to the final model from the pipeline. I am a Spark noob and have yet to figure out how to do this using the Java ML API without making an extra pass through the Dataset and computing them myself.

@n-merritt
Author

Perhaps a custom transformer added to the pipeline will do the trick.

@vruusmann
Member

For example, if you have an ml.feature.IDF instance in the pipeline, then fitting the pipeline turns it into an ml.feature.IDFModel instance. You can then call IDFModel#idf() to obtain its IDF vector.

The same principle applies to other Apache Spark ML transformers as well - upon fitting a pipeline, an Estimator instance becomes a Model instance, which then provides all the information about the training dataset.
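
For illustration, a self-contained sketch along these lines (the toy data, column names and class name are assumptions made up for this example, not anything prescribed by Apache Spark):

```java
import java.util.Arrays;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class IdfVectorExample {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("IdfVectorExample")
            .master("local[*]")
            .getOrCreate();

        // Toy training data with a single "text" column
        Dataset<Row> trainingData = spark.createDataFrame(
            Arrays.asList(
                RowFactory.create("the quick brown fox"),
                RowFactory.create("jumps over the lazy dog")),
            new StructType().add("text", DataTypes.StringType));

        Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
        HashingTF hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures");
        IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");

        Pipeline pipeline = new Pipeline()
            .setStages(new PipelineStage[]{tokenizer, hashingTF, idf});

        // Fitting the pipeline turns the IDF estimator into a fitted IDFModel
        PipelineModel pipelineModel = pipeline.fit(trainingData);

        // The third stage (index 2) is now an IDFModel; idf() returns the inverse document frequency vector
        IDFModel idfModel = (IDFModel) pipelineModel.stages()[2];
        Vector idfVector = idfModel.idf();
        System.out.println(idfVector);

        spark.stop();
    }
}
```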

@n-merritt
Author

Excellent, thanks

@bipptech

My pipeline consists of the following stages:
Tokenizer, StopWordsRemover, CountVectorizer, StringIndexer, RandomForestClassifier, IndexToString

I was checking the list of features and models supported in the ConverterUtil class; the first three (Tokenizer, StopWordsRemover, CountVectorizer) are not listed there. I am using them to convert text into feature vectors.

Do you have any plans to support them? Can you please suggest a workaround?
Alternatively, can you suggest another set of features/models (listed in the ConverterUtil class) that could be used to convert text into feature vectors?

Thanks,
Bipp.

@vruusmann
Member

My pipeline consists of the following stages: Tokenizer, StopWordsRemover, CountVectorizer, ..

These transforms implement very simple string splitting/filtering/counting functionality. Even in combination, they would be much easier to implement than the originally requested IDF transform (see the sketch at the end of this comment).

Do you have any plans to support them?

Basic TextIndex element support (string tokenization and stop word filtering) would need to be implemented in the JPMML-Evaluator library first.

Can you please suggest a workaround?

Implement the missing pieces yourself, and submit PR(s).
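
For reference, a minimal sketch of those three stages in the Spark ML Java API (the class name and column names are made up for this example):

```java
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.StopWordsRemover;
import org.apache.spark.ml.feature.Tokenizer;

public class TextStages {

    public static PipelineStage[] textStages() {
        // String splitting
        Tokenizer tokenizer = new Tokenizer()
            .setInputCol("text")
            .setOutputCol("tokens");

        // Stop word filtering
        StopWordsRemover stopWordsRemover = new StopWordsRemover()
            .setInputCol("tokens")
            .setOutputCol("filteredTokens");

        // Term counting (produces a sparse term frequency vector)
        CountVectorizer countVectorizer = new CountVectorizer()
            .setInputCol("filteredTokens")
            .setOutputCol("features");

        return new PipelineStage[]{tokenizer, stopWordsRemover, countVectorizer};
    }
}
```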

@vruusmann
Member

Support for CountVectorizer, Tokenizer, NGram, StopWordsRemover, RegexTokenizer and IDF transformers is available in JPMML-SparkML version 1.1.8 and newer.
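
A rough sketch of the conversion step, assuming a pipeline containing these stages has already been fitted. The class and method names below are made up for this example; ConverterUtil#toPMML(StructType, PipelineModel) is the conversion entry point in the 1.1.x line, and JPMML-Model's JAXBUtil is one way to serialize the result:

```java
import java.io.FileOutputStream;
import java.io.OutputStream;

import javax.xml.transform.stream.StreamResult;

import org.apache.spark.ml.PipelineModel;
import org.apache.spark.sql.types.StructType;
import org.dmg.pmml.PMML;
import org.jpmml.model.JAXBUtil;
import org.jpmml.sparkml.ConverterUtil;

public class PmmlExport {

    // schema is the schema of the training Dataset; pipelineModel is the fitted pipeline
    public static void export(StructType schema, PipelineModel pipelineModel, String path) throws Exception {
        PMML pmml = ConverterUtil.toPMML(schema, pipelineModel);

        try (OutputStream os = new FileOutputStream(path)) {
            JAXBUtil.marshalPMML(pmml, new StreamResult(os));
        }
    }
}
```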
