SelectKBest leading to Logistic Regression probability discrepancies between scikit-learn and jpmml-evaluator #93
Comments
My setup for reference:
java -cp /root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/guava-20.0.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/istack-commons-runtime-3.0.5.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jaxb-core-2.3.0.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jaxb-runtime-2.3.0.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-converter-1.2.6.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-lightgbm-1.1.3.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.4.2.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.2.4.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-agent-1.3.8.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-model-1.3.8.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-model-metro-1.3.8.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-schema-1.3.8.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pyrolite-4.19.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/serpent-1.18.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/slf4j-api-1.7.25.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.25.jar org.jpmml.sklearn.Main --pkl-pipeline-input /tmp/pipeline-buwtw3w9.pkl.z --pmml-output pipeline.pmml
Very interesting observation. What happens if you replace the "direct use" of the feature selector with a SelectorProxy wrapper? Please try rearranging your code like this, and report back!
from sklearn2pmml import SelectorProxy
pipeline = PMMLPipeline([
    ("vectorizer", vectorizer),
    ("feature_selector", SelectorProxy(feature_selector)), # THIS!
    ("classifier", classifier)
])
Using SelectorProxy(feature_selector), the results align perfectly between sklearn and jpmml-sklearn:
If that's the suggested workaround, we'll go with it. Thanks!
Thanks for reporting back such great news! The results between Scikit-Learn and (J)PMML should actually align up to the 14th or 15th decimal place (you're only checking the first seven decimal places). In the future, if you continue your research and happen to find a discrepancy around the 12th or 13th decimal place, then you should let me know about it again.
Apparently, the JPMML-SkLearn library does not handle the direct use of SelectKBest correctly. There are several other bug reports about Scikit-Learn and (J)PMML prediction mismatches, and all these pipelines appear to contain a SelectKBest step.
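The stricter decimal-place check mentioned above could be sketched roughly as follows. This is only an illustration, assuming the class probabilities from both sides have already been collected into plain Python lists; the function names and the sample values are hypothetical and are not part of scikit-learn or any JPMML library.

```python
def max_abs_difference(expected, actual):
    """Largest absolute gap between paired probabilities."""
    if len(expected) != len(actual):
        raise ValueError("both lists must have the same length")
    return max(abs(e - a) for e, a in zip(expected, actual))


def agree_within(expected, actual, places):
    """True if every pair differs by less than 10**-places."""
    return max_abs_difference(expected, actual) < 10.0 ** -places


# Hypothetical values for illustration only, not taken from this issue.
sklearn_probs = [0.25, 0.5, 0.75]
pmml_probs = [0.25, 0.5 + 3e-10, 0.75]

print(agree_within(sklearn_probs, pmml_probs, 7))   # coarse check passes
print(agree_within(sklearn_probs, pmml_probs, 13))  # strict check fails
```

A check at seven places would hide the small gap in the second value, while a check at 13 places flags it.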
Yup, SelectKBest could also be causing those discrepancies in #68 and #69. I'll take a look at SelectKBest.java to see if I can track it down, but in the meantime I'll close this ticket given the workaround with SelectorProxy(). Thanks!
Similar to #82, I noticed a sizable inconsistency when I incorporated SelectKBest feature selection with the Logistic Regression classifier and altered the number of features k.
I'm using the following sklearn snippet:
and jpmml snippet:
For the sentence above, the class 1 predictions for different values of k are:
When I run the pipeline without feature selection, the results match perfectly. I ran this across multiple datasets and got the same strange behavior. Below is a 100-line training file (extracted from the UofMich Sentiment Analysis Challenge corpus on Kaggle) that I used to generate the above results:
and read with this python snippet: