Detect invalid value treatment policy based on the "transformer composition" of SkLearn pipeline #436
Are you talking about … ?
The sklearn version :-)
I read the initial comment wrong, because I got the impression that the … So, the question is really about "retrofitting" an existing SkLearn pipeline, to make it "invalid value aware" long after it was trained and saved?

The trouble is that Scikit-Learn is missing consistent support for invalid values in the first place. It has been added sporadically, to different estimator classes at different times. There are really two options here: …
Please indicate which pathway (of the above two) you are likely to consider, so that we can keep brainstorming in the right direction.
I fixed it for now using the second option. You could consider making all categoricals with …
My bad, there is a third option, which may qualify as a SkLearn2PMML/JPMML-SkLearn bug: any time the converter sets a … Right now, the …
This issue reminds me of another issue: #428
The converter currently knows whether the input column had an explicit decorator. If the decorator was set, then its stated invalid value treatment should prevail. However, when it was not set, then a flexible default should be applied.

Can you point me to official documentation about Scikit-Learn's invalid value (aka unknown value) handling policy? I assume that invalid values were not allowed in the past (e.g. SkLearn 0.X versions), but have been gradually enabled in recent versions (esp. 1.3.X and newer). The "flexible default" should try to match this evolution.
I'm not sure where to find that. I think the check for invalid values is done at the transformer or estimator level. E.g. the sklearn OneHotEncoder has the option …
Thinking out loud for my future self.

The right place for detecting the correct invalid value treatment is … Manual detection by each transformer converter seems too complex and fragile in comparison.

Also, the automated detection component should land in the JPMML-Converter library, so that it would be easily reusable in other PMML production libraries such as JPMML-R, JPMML-SparkML, etc.
@fritshermans Thanks for raising the issue! However, a proper fix looks like a major change in another library, which may take an unspecified amount of time (i.e. it can't be fixed quickly at the SkLearn2PMML package level). You can keep running your manual PMML post-processing workflow in the meantime.
Thanks a lot for your quick response! I'm creating a small regex replace to fix the PMML :-)
That should do the job. But since we're dealing with XML documents, you may also consider using XSL Transformations (XSLT), applied via a small Java or Python application.
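For instance, using only the Python standard library, such a post-processing step could look like the sketch below. The inline document is a trimmed-down, invented stand-in for a real converter-produced PMML file; `MiningField` and `invalidValueTreatment="asIs"` are standard PMML 4.x constructs:

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"

# Invented miniature PMML document; a real file would be produced by the
# converter and carry the full model content.
PMML = f"""<PMML xmlns="{PMML_NS}" version="4.4">
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x1" invalidValueTreatment="returnInvalid"/>
      <MiningField name="x2"/>
    </MiningSchema>
  </RegressionModel>
</PMML>"""

ET.register_namespace("", PMML_NS)  # keep the default namespace on output
root = ET.fromstring(PMML)

# Force every MiningField to pass invalid (unseen) values through as-is.
for field in root.iter(f"{{{PMML_NS}}}MiningField"):
    field.set("invalidValueTreatment", "asIs")

patched = ET.tostring(root, encoding="unicode")
```

Parsing the XML avoids the brittleness of matching attribute quoting and ordering with a regex; an XSLT stylesheet would express the same transformation declaratively.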
When I train a sklearn pipeline containing a `TargetEncoder` and convert it to a PMML file using `sklearn2pmml`, I get an error when a categorical value that was not seen during training is present in new data. The desired behavior is that the default value is returned.

When I create the pipeline using the `PMMLPipeline` object and define the categorical variable using `CategoricalDomain` with `invalid_value_treatment = "as_is"`, it works well on unseen categorical data.

Is there a way to avoid this problem when I want to convert an existing trained sklearn pipeline?