Failure to create a pmml file when using TfidfVectorizer with analyzer = 'char_wb' #202
Comments
Let's solve one issue at a time. Right now, the SkLearn2PMML/JPMML-SkLearn stack is complaining about an unexpected Why are you fitting a
raw_data = pd.read_csv("data_columns.csv")
X = raw_data["name"]
y = raw_data["type"]
pipeline = PMMLPipeline([
    ("transformer", vectorizer),
    ("classifier", rfc)
])
pipeline.fit(X, y) |
Thank you for your response! I tried what you suggested and I get this error now:
What do you think? |
That's the exception we've been looking for - it means that the "char_wb" text analyzer type is currently not supported. Here's what can be done about it:
|
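For readers unfamiliar with the analyzer in question: scikit-learn documents "char_wb" as building character n-grams only from text inside word boundaries, with n-grams at the edges of words padded with a space. The following is a simplified, library-free sketch of that behavior as I understand it. The function name `char_wb_ngrams` is hypothetical, and this is not scikit-learn's own code.

```python
def char_wb_ngrams(text, min_n, max_n):
    """Simplified sketch of word-boundary-aware character n-grams."""
    grams = []
    for word in text.split():
        # Pad each word with one space on both sides.
        padded = " " + word + " "
        for n in range(min_n, max_n + 1):
            if len(padded) <= n:
                # A short word is counted once, then larger n are skipped.
                grams.append(padded)
                break
            # Slide a window of width n across the padded word only,
            # so no n-gram ever spans two words.
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


print(char_wb_ngrams("ab cd", 2, 2))
```

Because every n-gram stays inside one padded word, this mode cannot be expressed as a plain "word" tokenization without an extra preprocessing step.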
1 - I already tried the "char" text analyzer and I get the same error.
2 - I thought about this idea but wasn't sure it was going to work. Thanks for suggesting it, I'm going to try it and will let you know :) Thanks |
The only supported text analyzer mode is "word". Try to transform your text input column so that it can be regarded as a collection of words. The simplest solution would be to use a regex that surrounds every "useful" character with whitespace characters. It should be possible to upgrade the SkLearn2PMML/JPMML-SkLearn stack to support the "char" and "char_wb" text analyzers as well. However, this is not a priority for me. Leaving this issue open, in case my priorities change. |
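The whitespace-insertion idea can be tried out in plain Python before wiring it into a pipeline. The sketch below is only an illustration: the pattern and the `space_out` helper are assumptions made for this example, not part of SkLearn2PMML, and the regex dialect applied during PMML conversion may differ from Python's `re`.

```python
import re

# Zero-width pattern: matches every position that has a character on both
# sides. (Hypothetical pattern, chosen for illustration only.)
BETWEEN_CHARS = re.compile(r"(?<=.)(?=.)")


def space_out(text):
    """Insert a single space between every pair of adjacent characters."""
    return BETWEEN_CHARS.sub(" ", text)


spaced = space_out("user_id")
print(spaced)          # characters separated by single spaces
print(spaced.split())  # each character is now a whitespace-delimited "word"
```

After this step, a whitespace-based "word" tokenizer sees each character as its own token.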
I used a lambda function to split the column names into characters like this: Then I used the pipeline above with the "word" analyzer and I don't get the error anymore. But I don't know how to include this preprocessing in the pipeline. Could you please help? Thanks |
Probably not, because your lambda uses Python language constructs/functions that are not supported by It should be possible using special-purpose string transformers. I'd try to formalize a regular expression (regex) pattern, and apply it to the original (aka raw) text feature using the Some regex that inserts whitespace characters into a string. |
I found a regex pattern to insert whitespace characters:
regex_pattern = "(?<!^)(\B|b)(?!$)"
transformer = ReplaceTransformer(regex_pattern, " ")
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), preprocessor=None, lowercase=False,
                             tokenizer=Splitter(), norm=None)
pipeline = PMMLPipeline([
    ("transformer", transformer),
    ("preprocessing", vectorizer),
    ("classifier", rfc)
])
pipeline.fit(x, y)
but I get this error:
How should I fit the data into this pipeline? Thanks a lot for your help. |
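The pipeline above relies on the idea that word n-grams over space-separated characters carry the same information as character n-grams over the original string. A minimal, library-free sketch of that correspondence follows. The helper names are hypothetical, the input is assumed to contain no whitespace of its own, and details of scikit-learn's real analyzers (such as whitespace normalization) are ignored.

```python
def char_ngrams(text, min_n, max_n):
    """All character n-grams of text for n in [min_n, max_n]."""
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


def word_ngrams(tokens, min_n, max_n):
    """All token n-grams, with tokens joined by a single space."""
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(" ".join(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return grams


name = "user_id"
# Splitting the spaced-out string on whitespace yields one token per character.
tokens = list(name)

# Dropping the joining spaces from the word n-grams recovers the char n-grams.
from_words = [g.replace(" ", "") for g in word_ngrams(tokens, 1, 2)]
assert from_words == char_ngrams(name, 1, 2)
```

The features differ only in their labels (for example "u s" instead of "us"), which should not matter to a downstream classifier.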
That's a 100% Python language stack error (not a (J)PMML one). It means that you're mixing/confusing Perhaps you need to convert a |
I found the error. When I try to feed the output of the transformer, which is an ndarray object, to the TfidfVectorizer, I get:
So I convert the ndarray to a Series:
I apply the TfidfVectorizer to ser
and I don't get the error anymore. So my question is: how can I do this conversion, from the ndarray output of the transformer to a Series object that is fed to the vectorizer, inside the pipeline?
Should I insert another transformation between the transformer and the vectorizer? Thank you! |
Are you using the latest SkLearn2PMML package version? The I wonder why the
Have you tried wrapping the whole feature engineering into Something like this:
pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([
        (["name"], [ReplaceTransformer(..), TfidfVectorizer(..)])
    ])),
    ("classifier", RandomForestClassifier())
]) |
Yes, I'm using the latest SkLearn2PMML. the I tried: `pipeline = PMMLPipeline([ pipeline.fit(raw_data,y)` But I get this error:
I also tried ColumnTransformer and I get the same error. |
Is there a way I could convert the output of _col2d(x) using this function inside the pipeline?
This ser variable would then be fed to the vectorizer:
Thank you! |
Hi Villu,
I'm trying to create a PMML file from the sklearn model below. I use TfidfVectorizer at the character level and a random forest classifier.
The model predicts the data type of a column based on the name of that column. It works just fine with the configuration below, but when I create a PMML pipeline I get an error.
Here's my code:
I get this error:
Could you please tell me what's wrong with the vectorizer? Or how could I use a character-level analyzer in TfidfVectorizer?
Thank you,
Sarah