Detect invalid value treatment policy based on the "transformer composition" of SkLearn pipeline #436

fritshermans · 2024-12-03T10:09:11Z

When I train a sklearn pipeline containing a TargetEncoder and convert it to a PMML file using sklearn2pmml, evaluating the PMML fails with an error when new data contains a categorical value that was not seen during training. The desired behavior is that the default value is returned. If I instead build the pipeline as a PMMLPipeline object and declare the categorical variable using CategoricalDomain with invalid_value_treatment = "as_is", it works well on unseen categorical data.

Is there a way to avoid this problem when I want to convert an existing trained sklearn pipeline?

vruusmann · 2024-12-03T10:34:07Z

When I train a sklearn pipeline containing a TargetEncoder ...

Are you talking about category_encoders.target_encoder.TargetEncoder or sklearn.preprocessing.TargetEncoder here?

fritshermans · 2024-12-03T10:38:56Z

The sklearn version :-)

vruusmann · 2024-12-03T11:02:37Z

Is there a way to avoid this problem when I want to convert an existing trained sklearn pipeline?

I misread the initial comment, because I got the impression that the TargetEncoder converter was doing a bad job. However, this cannot be the case, because the MapValues@defaultValue attribute is correctly set to the mean value here (this is the value that gets returned when the MapValues table does not contain a mapping for the input value):
https://github.com/jpmml/jpmml-sklearn/blob/1.8.6/pmml-sklearn/src/main/java/sklearn/preprocessing/TargetEncoder.java#L78-L80

So, the question is really about "retrofitting" an existing SkLearn pipeline, to make it "invalid value aware" a long time after it was trained and saved?

The trouble is that Scikit-Learn lacks consistent support for invalid values in the first place. Support has been added sporadically, to different estimator classes at different times.

There are really two options here:

Modify the SkLearn pipeline, by prepending a meta-transformer to it that filters all problematic columns through appropriate ContinuousDomain, CategoricalDomain or OrdinalDomain decorators. These decorators allow you to set the desired invalid value treatment using the Domain.invalid_value_treatment attribute (you already got this).
Modify the resulting PMML document, by visiting all MiningField elements, and appending a MiningField@invalidValueTreatment="asIs" attribute to them. Please note that the default value for this attribute is returnInvalid (it also takes effect when the attribute is not defined): https://dmg.org/pmml/v4-4-1/MiningSchema.html#xsdType_INVALID-VALUE-TREATMENT-METHOD
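
A minimal sketch of the second option, using only the Python standard library. The helper names and the sample document are hypothetical and not part of SkLearn2PMML. The sketch assumes that only active (input) fields should be touched, and that a field whose attribute is absent or set to returnInvalid should be switched. Note that the PMML schema spells the value asIs, whereas the Python-side Domain decorators spell it "as_is".

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def relax_invalid_value_treatment(pmml_text, treatment="asIs"):
    """Hypothetical helper: returns (patched PMML text, number of fields changed)."""
    root = ET.fromstring(pmml_text)
    # Keep the PMML namespace as the default namespace when serializing,
    # so that the output does not get 'ns0:' prefixes.
    if root.tag.startswith("{"):
        ET.register_namespace("", root.tag[1:].split("}", 1)[0])
    changed = 0
    for elem in root.iter():
        if local_name(elem.tag) != "MiningField":
            continue
        # Assumption: leave target and other non-input fields alone.
        if elem.get("usageType", "active") != "active":
            continue
        # An absent attribute is equivalent to returnInvalid.
        if elem.get("invalidValueTreatment", "returnInvalid") == "returnInvalid":
            elem.set("invalidValueTreatment", treatment)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Made-up, heavily abbreviated document, for illustration only.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="y" usageType="target"/>
      <MiningField name="color"/>
    </MiningSchema>
  </RegressionModel>
</PMML>"""

patched, changed = relax_invalid_value_treatment(SAMPLE)
print(changed)
print(patched)
```

Working on the parsed tree rather than on raw text makes it straightforward to skip the target field and any field that already declares a non-default treatment.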

Please indicate which of the above two pathways you are likely to consider, so that we can keep brainstorming in the right direction.

fritshermans · 2024-12-03T11:06:43Z

I fixed it for now using the second option. You could consider setting invalidValueTreatment="asIs" on all categorical fields by default, but I can understand you wouldn't like that...

vruusmann · 2024-12-03T11:07:41Z

There are really two options here:

My bad - there is a third option, which may qualify as a SkLearn2PMML/JPMML-SkLearn bug.

Any time when the converter sets a <Expression>@defaultValue attribute, it should perform an internal sanity check that the input field has "invalid values enabled".

Right now, the MapValues@defaultValue attribute is set, but invalid values are actually prevented from reaching it because there is a blocking MiningField@invalidValueTreatment="returnInvalid" declaration in the way.
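
The blocking effect can be illustrated with a toy model in plain Python. This is not how JPMML-Evaluator is implemented; the function names, the exception class and the numbers are all made up for illustration. The only point is the order of operations: the MiningField check runs first, so the MapValues default is reachable only when the field lets invalid values through.

```python
class InvalidValueError(Exception):
    """Stand-in for the evaluation error raised on an invalid input value."""


def apply_mining_field(value, valid_values, invalid_value_treatment="returnInvalid"):
    """Toy model of the MiningField check that runs before any expression."""
    if value in valid_values:
        return value
    if invalid_value_treatment == "returnInvalid":
        raise InvalidValueError(f"invalid value: {value!r}")
    if invalid_value_treatment == "asMissing":
        return None
    if invalid_value_treatment == "asIs":
        return value
    raise ValueError(f"unsupported treatment: {invalid_value_treatment!r}")


def map_values(value, table, default_value, map_missing_to=None):
    """Toy model of MapValues with defaultValue and mapMissingTo."""
    if value is None:
        return map_missing_to
    return table.get(value, default_value)


def evaluate(value, treatment):
    # Made-up target encoding; 0.55 plays the role of the training-set mean.
    table = {"red": 0.8, "blue": 0.3}
    checked = apply_mining_field(value, set(table), treatment)
    return map_values(checked, table, default_value=0.55)


print(evaluate("red", "returnInvalid"))  # seen category, mapped normally
print(evaluate("green", "asIs"))         # unseen category reaches the default
try:
    evaluate("green", "returnInvalid")   # unseen category is rejected up front
except InvalidValueError as exc:
    print("blocked:", exc)
```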

vruusmann · 2024-12-03T11:10:31Z

This issue reminds me of another issue: #428

vruusmann · 2024-12-03T11:15:49Z

Any time when the converter sets a <Expression>@defaultValue attribute, it should perform an internal sanity check that the input field has "invalid values enabled".

The converter currently knows whether the input column had an explicit Domain decorator assigned to it or not.

If the decorator was set, then its stated invalid value treatment should prevail. However, when it was not set, then a flexible default should be applied.
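
One possible reading of this rule, as a hedged sketch. The function name and both parameters are hypothetical and do not correspond to any existing JPMML API. The "flexible default" shown here (asIs when a downstream expression defines a default value, returnInvalid otherwise) is only an assumption that combines this comment with the sanity check proposed earlier. Treatment names use the PMML spelling.

```python
from typing import Optional


def resolve_invalid_value_treatment(
    decorator_treatment: Optional[str],
    expression_has_default: bool,
) -> str:
    """Hypothetical decision rule for MiningField@invalidValueTreatment."""
    # An explicit Domain decorator always prevails.
    if decorator_treatment is not None:
        return decorator_treatment
    # Assumed flexible default: let invalid values through only when some
    # downstream expression is prepared to handle them via a default value.
    return "asIs" if expression_has_default else "returnInvalid"


print(resolve_invalid_value_treatment("asMissing", True))
print(resolve_invalid_value_treatment(None, True))
print(resolve_invalid_value_treatment(None, False))
```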

Can you point me to official documentation about Scikit-Learn's invalid value (aka unknown value) handling policy? I assume that they were not allowed in the past (eg. SkLearn 0.X versions), but have been gradually enabled in recent versions (esp. 1.3.X and newer). The "flexible default" should try to match this evolution.

fritshermans · 2024-12-03T11:19:07Z

I'm not sure where to find that. I think the check for invalid values is done at the transformer or estimator level. E.g. the sklearn OneHotEncoder has the option handle_unknown='error'. So if an unseen value is presented to the trained OneHotEncoder, it will throw an error. In a pipeline, this is the place where the value is checked.

vruusmann · 2024-12-03T11:31:07Z

Thinking out loud for my future self.

The MiningSchema element (together with all MiningField elements) is generated automatically based on the model body (that's why it often appears "pruned" - if a model does not need some input, it is not listed in model's input schema).

The right place for detecting the correct MiningField@invalidValueTreatment (and possibly MiningField@missingValueTreatment) attribute values would be around the same time/place.

Manual detection by each transformer converter seems too complex and fragile in comparison. Also, the automated detection component should land in the JPMML-Converter library, and would be easily reusable in other PMML production libraries such as JPMML-R, JPMML-SparkML, etc. as well.

vruusmann · 2024-12-03T11:34:08Z

@fritshermans Thanks for raising the issue! However, a proper fix looks like a major change in another library, which may take an unspecified amount of time (i.e. it can't be fixed quickly at the SkLearn2PMML package level). Please keep running your manual PMML post-processing workflow in the meantime.

fritshermans · 2024-12-03T11:38:11Z

thanks a lot for your quick response! i'm creating a small regex-replace to fix the pmml :-)

vruusmann · 2024-12-03T11:56:30Z

i'm creating a small regex-replace to fix the pmml

That should do the job.

But since we're dealing with XML documents, you may also consider using XSL Transformations (XSLT), applied using a small Java or Python application.

vruusmann changed the title ~~Converted sklearn pipeline with TargetEncoder does not work for unseen categorical values~~ Detect invalid value treatment policy based on the "transformer composition" of SkLearn pipeline Dec 3, 2024

vruusmann mentioned this issue Dec 21, 2024

Support for customizing missing/invalid value handling across all customer Transformer classes (similar to what's already available in ExpressionTransformer) #438

Open
