The MiningField@invalidValueTreatment attribute gets silently overridden by OneHotEncoder transformer #428
Comments
First of all, if you want to train an XGBoost model using categorical features, then you shouldn't be using an explicit one-hot encoding transformation (such as
Can confirm this observation. When running your example code (thanks for taking the time to provide it, including the imports section), I get the following PMML markup:
<MiningSchema>
<MiningField name="Y" usageType="target"/>
<MiningField name="X2" missingValueReplacement="OTHER" missingValueTreatment="asIs"/>
<MiningField name="X1" missingValueReplacement="-1" missingValueTreatment="asIs" invalidValueTreatment="asMissing"/>
</MiningSchema>
Indeed, the This attribute is set two times in the pipeline. First, the TLDR: It's a bug - there are two instances of The
I've had this "new decorator overriding the old decorator" issue in my private TODO list for a long time. Perhaps it's time to work on it, now that someone is demonstrably suffering from it. I think the solution would be to assume that an explicit decoration (here: However, the converter should log a warning message every time it "ignores" some decoration.
Also, a general note about the PMML input value preparation algorithm. An input value belongs to one of three value spaces: valid, invalid (i.e. non-missing, but not valid) or missing. The invalid value treatment is applied first; the missing value treatment is applied after that. If the model supports missing values, then at the end of input value preparation only valid or missing values should be present. If the model does not support missing values, only valid values should be present.
I can see the following logical error in your Python pipeline: you're replacing missing values with invalid values. For example, the domain of the "X1" field is It means that you're intentionally feeding an invalid value (i.e. a value that was not present in the training dataset) to your model. What good can that do?
Granted, decision tree models (such as XGBoost) are quite lenient towards invalid values. The evaluation path simply follows the "default way" (instead of erroring out).
This makes me think that perhaps the SkLearn2PMML decorator classes should also check that the provided invalid and missing value replacement values are actually valid values.
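For readers following along, the ordering described above (invalid value treatment first, missing value treatment second) can be sketched in plain Python. This is only an illustration, not the JPMML implementation: the function name `prepare_value`, its parameters, and the valid set `{1, 2, 3}` are made up for the example; only the treatment names (`returnInvalid`, `asIs`, `asMissing`, `asValue`) come from PMML.

```python
MISSING = None  # this sketch represents a missing value as None


def prepare_value(value, valid_values, invalid_treatment="returnInvalid",
                  invalid_replacement=None, missing_replacement=None):
    """Hypothetical helper: prepare one input value in two ordered steps."""
    # Step 1: invalid value treatment. Only non-missing values that fall
    # outside the valid value space are affected.
    if value is not MISSING and value not in valid_values:
        if invalid_treatment == "returnInvalid":
            raise ValueError(f"Invalid value: {value!r}")
        elif invalid_treatment == "asMissing":
            value = MISSING
        elif invalid_treatment == "asValue":
            value = invalid_replacement
        elif invalid_treatment != "asIs":
            raise ValueError(f"Unknown treatment: {invalid_treatment!r}")

    # Step 2: missing value treatment. Runs after step 1, so values that
    # were just turned into missing ones are replaced here as well.
    if value is MISSING and missing_replacement is not None:
        value = missing_replacement
    return value


# Assumed valid value space for the illustration (not taken from the issue).
VALID_X1 = {1, 2, 3}

# An invalid input becomes missing, then gets the missing value replacement.
result = prepare_value(99, VALID_X1, invalid_treatment="asMissing",
                       missing_replacement=-1)

# The replacement itself lies outside the valid value space.
replacement_is_valid = result in VALID_X1
```

In this sketch `result` is `-1` and `replacement_is_valid` is `False`, which is the situation criticised above: the model ends up receiving a value it never saw during training.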
@vruusmann I would appreciate a fix; it would save us the extra pipeline step of inserting the tags into the PMML file.
You're right about the example, this is a very simplified version of the data for the sake of example, but I should perhaps have taken the time to make it a bit more realistic or complete.
This could be good. I ran into a Java error (below) when calling sklearn2pmml() with a pipeline that assigned missing categorical data a value that was not present in the original data set (this also happens when changing an X2 value to missing in the example I gave). I know this doesn't make sense to do, but it happened when testing the pipeline with a subset of data during development, which by chance didn't contain any instances of the category that missing values were assigned to. Having a check that gives an error or warning on the Python side would probably be clearer and easier to troubleshoot.
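Until such a check exists in the library, something along these lines could be run on the Python side before conversion. It is a minimal sketch using plain lists; `check_replacement` is a hypothetical helper (not part of sklearn2pmml), missing values are assumed to be encoded as `None`, and the sample categories are invented.

```python
def check_replacement(field_name, training_values, replacement):
    """Hypothetical helper: fail early if a replacement value was never
    observed in the training data for this field."""
    # Collect the categories actually seen, skipping missing entries.
    observed = {value for value in training_values if value is not None}
    if replacement not in observed:
        raise ValueError(
            f"Replacement value {replacement!r} for field {field_name!r} "
            f"is not among the observed values {sorted(observed)!r}"
        )
    return replacement


# Invented sample column: "OTHER" happens to be absent from this subset.
x2_subset = ["A", "B", None, "A"]

try:
    check_replacement("X2", x2_subset, "OTHER")
    message = None
except ValueError as error:
    message = str(error)
```

With the subset above the check fails and `message` names both the field and the offending replacement value, which is easier to act on than a stack trace from the converter.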
The fix about "decorator overrides" would go into the base JPMML-Converter library. It'll take some time to propagate it up to the JPMML-SkLearn library level. Is the use case about
I also have something similar noted in my private TODO list. The situation can likely be fixed by tweaking Please note that Scikit-Learn calls invalid values "unknown values". Alternatively, you may try replacing But it would be even better if you got rid of one-hot encoding in your XGBoost pipelines in the first place.
Another idea: the JPMML-SkLearn library should raise an error when it encounters a People keep following 5+ year old tutorials, completely missing out on the new and correct way of doing things.
The use of But thanks for pointing out that skipping
Hello
I'm trying to create a PMML file that includes a specification for how to handle invalid values for both numerical and categorical features. I'm getting the PMML output I expect for the numerical features, but the categorical features don't have any tags mentioning invalidValueTreatment in the schema.
I've tried two things: handling invalid values as missing and handling invalid values by assigning them a specific value (both in code below), but am getting the same result.
Reproducible example:
Evaluation of TEST_INPUT_2 using jpmml_evaluator results in an error without the invalidValueTreatment tags, but if I manually add invalidValueTreatment="asMissing" for X2 in the PMML file, then evaluation works as expected.
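For anyone needing the same workaround, the manual edit can be scripted with the standard library. This is a rough sketch rather than an official recipe: `set_invalid_value_treatment` is a hypothetical helper, and the sample document is a stripped-down fragment (the PMML 4.4 namespace is an assumption, and a real file would nest the MiningSchema inside a model element).

```python
import xml.etree.ElementTree as ET


def set_invalid_value_treatment(pmml_text, field_name, treatment="asMissing"):
    """Hypothetical helper: set invalidValueTreatment on one MiningField."""
    root = ET.fromstring(pmml_text)
    # Keep the document's default namespace on output instead of "ns0:".
    if root.tag.startswith("{"):
        ET.register_namespace("", root.tag[1:].split("}", 1)[0])

    changed = 0
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # drop "{namespace}"
        if local_name == "MiningField" and element.get("name") == field_name:
            element.set("invalidValueTreatment", treatment)
            changed += 1
    if changed == 0:
        raise ValueError(f"No MiningField named {field_name!r} found")
    return ET.tostring(root, encoding="unicode")


# Simplified fragment for illustration only.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <MiningSchema>
    <MiningField name="Y" usageType="target"/>
    <MiningField name="X2" missingValueReplacement="OTHER" missingValueTreatment="asIs"/>
  </MiningSchema>
</PMML>"""

patched = set_invalid_value_treatment(SAMPLE, "X2")
```

Only the X2 field is touched; the target field and any other attributes are left as they were.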
Is there any way to get the PMML output to contain info on how to handle invalid values, or have I missed something about the intended behavior here? Thanks