Failure to create a pmml file when using TfidfVectorizer with analyzer = 'char_wb' #202

selbenna · 2020-01-07T17:51:18Z

Hi Villu,

I'm trying to create a pmml file from the sklearn model below. I use TfidfVectorizer on character level and a random forest classifier.
The model predicts the data type of a column from the column's name. It works fine with the configuration below, but when I create a PMML pipeline I get an error.

Here's my code:

raw_data = pd.read_csv("data_columns.csv")
X = raw_data["name"].tolist() 

labels = raw_data["type"].tolist()
le = LabelEncoder()
labels = le.fit_transform(labels)
labels = to_categorical(labels)

vectorizer = TfidfVectorizer(analyzer = "char_wb", ngram_range = (1, 3), preprocessor = None,
                             lowercase = False, tokenizer = Splitter(), token_pattern = None,
                             norm = None)

rfc = RandomForestClassifier(n_estimators=500)

pipeline = PMMLPipeline([
  ("transformer", vectorizer),
  ("classifier", rfc)
])

pipeline.fit(X, labels)
sklearn2pmml(pipeline, "datatype_prediction.pmml", with_repr = True)

I get this error:

Standard output is empty
Standard error:
janv. 07, 2020 6:41:26 PM org.jpmml.sklearn.Main run
INFOS: Parsing PKL..
janv. 07, 2020 6:43:00 PM org.jpmml.sklearn.Main run
INFOS: Parsed PKL in 93847 ms.
janv. 07, 2020 6:43:00 PM org.jpmml.sklearn.Main run
INFOS: Converting..
janv. 07, 2020 6:43:00 PM sklearn2pmml.pipeline.PMMLPipeline initTargetFields
WARNING: Attribute 'sklearn2pmml.pipeline.PMMLPipeline.target_fields' is not set. Assuming y as the name of the target field
janv. 07, 2020 6:43:00 PM org.jpmml.sklearn.Main run
**SEVERE: Failed to convert
java.lang.IllegalArgumentException: The value of 'sklearn.ensemble.forest.RandomForestClassifier.classes_' attribute (Java class java.util.ArrayList) is not a supported array type**
        at org.jpmml.sklearn.PyClassDict.getArray(PyClassDict.java:163)
        at sklearn.Classifier.getClasses(Classifier.java:43)
        at sklearn.ClassifierUtil.getClasses(ClassifierUtil.java:32)
        at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:133)
        at org.jpmml.sklearn.Main.run(Main.java:145)
        at org.jpmml.sklearn.Main.main(Main.java:94)

Exception in thread "main" java.lang.IllegalArgumentException: The value of 'sklearn.ensemble.forest.RandomForestClassifier.classes_' attribute (Java class java.util.ArrayList) is not a supported array type
        at org.jpmml.sklearn.PyClassDict.getArray(PyClassDict.java:163)
        at sklearn.Classifier.getClasses(Classifier.java:43)
        at sklearn.ClassifierUtil.getClasses(ClassifierUtil.java:32)
        at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:133)
        at org.jpmml.sklearn.Main.run(Main.java:145)
        at org.jpmml.sklearn.Main.main(Main.java:94)

Traceback (most recent call last):

  File "<ipython-input-293-05f0766c0610>", line 1, in <module>
    sklearn2pmml(pipeline, "datatype_prediction.pmml", with_repr = True)

  File "/Users/.local/lib/python3.7/site-packages/sklearn2pmml/__init__.py", line 265, in sklearn2pmml
    raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams")

**RuntimeError**: The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams

Could you please tell me what's wrong with the vectorizer? Or how I could use a character-level analyzer in TfidfVectorizer?
Thank you,

Sarah


vruusmann · 2020-01-07T19:58:19Z

Let's solve one issue at a time. Right now, the SkLearn2PMML/JPMML-SkLearn stack is complaining about an unexpected RandomForestClassifier.classes_ attribute value.

Why are you fitting a LabelEncoder object separately (and then passing its transformation results to the (PMML)Pipeline.fit(X, y) method)? Why don't you simply do the following?

raw_data = pd.read_csv("data_columns.csv")

X = raw_data["name"]
y = raw_data["type"]

pipeline = PMMLPipeline([
  ("transformer", vectorizer),
  ("classifier",rfc)
])
pipeline.fit(X, y)
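
A minimal sketch of why the label format might matter here, assuming the `to_categorical` call produced a 2-D one-hot array (the one-hot matrix below is built by hand with NumPy, and the toy `X`/`y` values are hypothetical). When scikit-learn's RandomForestClassifier is fitted on a 2-D target it treats the problem as multi-output and stores `classes_` as a Python list of arrays, which would be consistent with the `java.util.ArrayList` complaint in the converter log. With a 1-D label vector, `classes_` is a single NumPy array.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data standing in for the vectorized column names.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array(["int", "str", "int", "str"])

# Hand-built one-hot labels, assumed to resemble to_categorical output.
classes, codes = np.unique(y, return_inverse=True)
y_onehot = np.eye(len(classes))[codes]

rf_onehot = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y_onehot)
rf_plain = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# 2-D target: classes_ is a list with one array per output column.
print(type(rf_onehot.classes_))
# 1-D target: classes_ is a single array of the original labels.
print(type(rf_plain.classes_), rf_plain.classes_)
```

Passing the raw `type` column as `y` keeps the classifier in the single-output mode shown by `rf_plain`.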

selbenna · 2020-01-08T09:11:32Z

Thank you for your response!

I tried what you suggested and I get this error now:

Standard output is empty
Standard error:
janv. 08, 2020 9:47:22 AM org.jpmml.sklearn.Main run
INFOS: Parsing PKL..
janv. 08, 2020 9:48:16 AM org.jpmml.sklearn.Main run
INFOS: Parsed PKL in 53752 ms.
janv. 08, 2020 9:48:16 AM org.jpmml.sklearn.Main run
INFOS: Converting..
janv. 08, 2020 9:48:16 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
**java.lang.IllegalArgumentException: char_wb**
        at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:153)
        at sklearn.feature_extraction.text.TfidfVectorizer.encodeDefineFunction(TfidfVectorizer.java:84)
        at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:115)
        at sklearn.feature_extraction.text.TfidfVectorizer.encodeFeatures(TfidfVectorizer.java:77)
        at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
        at sklearn.Composite.encodeFeatures(Composite.java:129)
        at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:208)
        at org.jpmml.sklearn.Main.run(Main.java:145)
        at org.jpmml.sklearn.Main.main(Main.java:94)

Exception in thread "main" java.lang.IllegalArgumentException: char_wb
        at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:153)
        at sklearn.feature_extraction.text.TfidfVectorizer.encodeDefineFunction(TfidfVectorizer.java:84)
        at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:115)
        at sklearn.feature_extraction.text.TfidfVectorizer.encodeFeatures(TfidfVectorizer.java:77)
        at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
        at sklearn.Composite.encodeFeatures(Composite.java:129)
        at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:208)
        at org.jpmml.sklearn.Main.run(Main.java:145)
        at org.jpmml.sklearn.Main.main(Main.java:94)

What do you think?
Thanks!

vruusmann · 2020-01-08T09:20:00Z

> java.lang.IllegalArgumentException: char_wb

That's the exception we've been looking for - it means that the "char_wb" text analyzer type is currently not supported.

Here's what can be done about it:

1. Would it be possible to replace the "char_wb" text analyzer with the "char" text analyzer? I presume your texts are mostly single-word tokens, so it shouldn't make any difference.
2. If it's possible, then transform your text input from words to whitespace-separated tokens using the regex string transformer (insert it as the first step into your pipeline). Then, replace the "char" text analyzer with the "word" text analyzer, and everything should work.
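
The idea behind the second option can be sketched in plain Python. This is an illustration only: `char_ngrams` and `word_ngrams` are hypothetical helpers, not scikit-learn functions, and they mimic the "char" analyzer rather than "char_wb" (which additionally pads n-grams at word edges with spaces). Once every character is separated by a space, word-level n-grams over the spaced string correspond one-to-one to character-level n-grams over the original string.

```python
def char_ngrams(text, lo, hi):
    """All character n-grams of text with lo <= n <= hi."""
    return [text[i:i + n]
            for n in range(lo, hi + 1)
            for i in range(len(text) - n + 1)]


def word_ngrams(tokens, lo, hi):
    """All word n-grams (joined by a space) with lo <= n <= hi."""
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]


name = "abc"
spaced = " ".join(name)  # one "word" per character

from_chars = char_ngrams(name, 1, 2)
from_words = word_ngrams(spaced.split(), 1, 2)

# Same n-grams, except that the word-level ones keep the separating space.
print(from_chars)
print(from_words)
```

The feature names differ (for example "ab" versus "a b"), but each character n-gram has exactly one word n-gram counterpart, so the resulting counts carry the same information.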

selbenna · 2020-01-08T09:26:32Z

1 - I already tried with the "char" text analyzer and I get the same error.

2 - I thought about this idea but wasn't sure it was going to work. Thanks for suggesting it, I'm going to try it and will let you know :)

Thanks

vruusmann · 2020-01-08T09:30:06Z

> 1 - I already tried with the "char" text analyzer and I get the same error.

The only supported text analyzer mode is "word".

Try to transform your text input column so that it could be regarded as a collection of words. The simplest solution would be to use a regex that surrounds every "useful" character with whitespace characters.
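
A minimal sketch of such a regex, using Python's `re` module. The pattern below is one possible choice, not one given in this thread, and `space_out` is a hypothetical helper name. It matches the empty position between any two characters, so substituting a space there separates every character. Whether a PMML-side regex engine treats the pattern identically is an assumption that would need checking.

```python
import re

# Zero-width match at every position except the very start and very end.
BETWEEN_CHARS = re.compile(r"(?<!^)(?!$)")


def space_out(text):
    """Insert a single space between every pair of adjacent characters."""
    return BETWEEN_CHARS.sub(" ", text)


names = ["id", "first_name", "x"]
spaced = [space_out(name) for name in names]
print(spaced)
```

For these inputs the result is the same as `" ".join(name)`, which is convenient for checking the pattern before moving it into a pipeline step.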

It should be possible to upgrade the SkLearn2PMML/JPMML-SkLearn stack to support "char" and "char_wb" text analyzers as well. However, this is not a priority for me.

Leaving this issue open, in case my priorities change.

selbenna · 2020-01-08T14:51:21Z

I used a lambda function to split the column names into characters like this:
raw_data["name"] = raw_data["name"].apply(lambda word: " ".join(word))
x = raw_data["name"]

Then I used the pipeline above with the 'word' analyzer and I don't have the error anymore.

But I don't know how to include this preprocessing in the pipeline ...
Can I add this lambda function to the PMMLPipeline using the ExpressionTransformer feature?

Could you please help?

Thanks

vruusmann · 2020-01-08T15:11:36Z

> Can I add this lambda function in the PMMLpipeline using ExpressionTransformer feature?

Probably not, because your lambda uses Python language constructs/functions that are not supported by ExpressionTransformer yet.

It should be possible using special-purpose string transformers. I'd try to formalize a regular expression (regex) pattern, and apply it to the original (aka raw) text feature using the sklearn2pmml.preprocessing.ReplaceTransformer.

What you need is some regex that inserts whitespace characters into a string.

selbenna · 2020-01-09T09:51:22Z

I found a regex pattern to insert whitespace characters:

regex_pattern = "(?<!^)(\B|b)(?!$)"
transformer =  ReplaceTransformer(regex_pattern, " ")

vectorizer = TfidfVectorizer(analyzer = "word", ngram_range = (1, 2), preprocessor = None, lowercase = False,
                             tokenizer = Splitter(), norm = None)

pipeline = PMMLPipeline([
  ("transformer", transformer),
  ("preprocessing", vectorizer),
  ("classifier", rfc)
])

pipeline.fit(x, y)

But I get this error:

TypeError: cannot use a string pattern on a bytes-like object

How should I fit the data into this pipeline?

Thanks a lot for your help.

vruusmann · 2020-01-09T10:09:00Z

> but i get this error:

That's a 100% Python language stack error (not (J)PMML one). It means that you're mixing/confusing str and bytes data types somewhere.

Perhaps you need to convert a bytes object to a str object by specifying what the intended character encoding is. Also, upgrading from Python 2.7 to Python 3.X might solve the issue.
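
A minimal sketch of the str/bytes mismatch in isolation, using only the standard library. The sample value and the UTF-8 encoding are assumptions for illustration. A pattern compiled from a `str` cannot be applied to a `bytes` object; decoding the bytes first avoids the error.

```python
import re

pattern = re.compile(r"\w+")   # a str pattern
raw = b"first_name"            # a bytes object (hypothetical input)

try:
    pattern.findall(raw)
    message = None
except TypeError as exc:
    message = str(exc)         # the same wording as in the report above

# Decoding with the intended encoding (assumed UTF-8) gives a str.
decoded = raw.decode("utf-8")
tokens = pattern.findall(decoded)
print(message)
print(tokens)
```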

selbenna · 2020-01-10T10:06:48Z

I found the cause of the error. When I feed the output of the transformer, which is an ndarray object, to the TfidfVectorizer, I get:

TypeError: cannot use a string pattern on a bytes-like object

So I convert the ndarray to a Series:

transformed_data = transformer.fit_transform(x)

array_to_series = map(lambda x: x[0], transformed_data)
ser = pd.Series(array_to_series)

I apply the TfidfVectorizer to ser

X = vectorizer.fit_transform(ser)

and I don't get the error anymore.

So my question is: how can I convert the ndarray output of the transformer to a Series object, so that I can feed it to the vectorizer inside the pipeline?

pipeline = PMMLPipeline([
  ("transformer", transformer),
  ("preprocessing", vectorizer),
  ("classifier", rfc)
])

Should I insert another transformation between the transformer and the vectorizer?

Thank you!

vruusmann · 2020-01-10T20:34:10Z

Are you using the latest SkLearn2PMML package version? The ReplaceTransformer transformer should be returning a single-column 2-D Numpy array currently (the _col2d(X) utility method):
https://github.com/jpmml/sklearn2pmml/blob/0.52.1/sklearn2pmml/preprocessing/__init__.py#L295

I wonder why the TfidfVectorizer step doesn't like it. Perhaps the ReplaceTransformer transformer should use some other return type/configuration.
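
One possible explanation, sketched with NumPy and `re` only (the sample strings are hypothetical, and this does not call TfidfVectorizer itself): iterating over a single-column 2-D array yields one-element arrays rather than strings, and a `str` regex pattern rejects those with a TypeError. Flattening the column to 1-D first yields plain strings that tokenize normally.

```python
import re

import numpy as np

token_re = re.compile(r"\S+")

# A single-column 2-D array, the shape described above.
column = np.array([["f i r s t"], ["l a s t"]])

# Each "document" seen while iterating is a 1-element array, not a str.
row_errors = 0
for row in column:
    try:
        token_re.findall(row)
    except TypeError:
        row_errors += 1

# After flattening, each element is a string and tokenizes normally.
flat = column.ravel()
tokens = [token_re.findall(doc) for doc in flat]
print(row_errors, tokens)
```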

> how can I do this conversion from ndarray output of the transformer to a series object to feed it the vectorizer in the pipeline

Have you tried wrapping the whole feature engineering into sklearn_pandas.DataFrameMapper or sklearn.compose.ColumnTransformer? These meta-transformers are pretty good at reshaping data between steps.

Something like this:

pipeline = PMMLPipeline([
  ("mapper", DataFrameMapper([
    (["name"], [ReplaceTransformer(..), TfidfVectorizer(..)])
  ])),
  ("classifier", RandomForestClassifier())
])

selbenna · 2020-01-13T12:47:51Z

Yes, I'm using the latest SkLearn2PMML. The ReplaceTransformer returns a 2-D NumPy array and that's the problem: the TfidfVectorizer doesn't like it.

I tried:

pipeline = PMMLPipeline([
  ("mapper", DataFrameMapper([
    (["name"], [ReplaceTransformer("(?<!^)(\B)(?!$)", " "), vectorizer])
  ])),
  ("classifier", rfc)
])

pipeline.fit(raw_data, y)

But I get this error:

TypeError: ['name']: cannot use a string pattern on a bytes-like object

I also tried ColumnTransformer and I get the same error.

selbenna · 2020-01-13T13:06:36Z

Is there a way I could convert the output of _col2d(x) using this function inside the pipeline?

array_to_series = map(lambda x: x[0], transform(x))
ser = pd.Series(array_to_series)

This ser variable would then be fed to the vectorizer:

vectorizer.fit_transform(ser)

Thank you!
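
A minimal sketch of one way to do that conversion inside a plain scikit-learn Pipeline, under several assumptions: `SpaceOutColumn` is a hypothetical stand-in that mimics a transformer returning a single-column 2-D array (it is not the real ReplaceTransformer), `token_pattern=r"\S"` is used in place of the custom `Splitter()` tokenizer, and the sample names are made up. A `FunctionTransformer(np.ravel)` step flattens the column to 1-D before the vectorizer. This only addresses the Python-side shape mismatch; whether the JPMML-SkLearn converter accepts such a step is not established in this thread.

```python
import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


class SpaceOutColumn(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: spaces out characters, returns a 2-D column."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        spaced = [re.sub(r"(?<!^)(?!$)", " ", str(s)) for s in X]
        return np.array(spaced, dtype=object).reshape(-1, 1)


features = Pipeline([
    ("space_out", SpaceOutColumn()),
    # Flatten the (n, 1) column to shape (n,) so each document is a string.
    ("flatten", FunctionTransformer(np.ravel)),
    ("tfidf", TfidfVectorizer(analyzer="word", token_pattern=r"\S",
                              lowercase=False, norm=None)),
])

names = ["id", "first_name", "last_name"]
matrix = features.fit_transform(names)
vocabulary = sorted(features.named_steps["tfidf"].vocabulary_)
print(matrix.shape, vocabulary)
```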

vruusmann mentioned this issue May 5, 2023

Failure to create a pmml file when using CountVectorizer with analyzer = 'char' #380

Closed
