Difference in Accuracies between scikit-learn and pmml models #68
Comments
What do you mean - replacing a class?
Yes. That, and also replacing LinearSVC in scikit-learn with SVC with a linear kernel in the PMML pipeline.
Also, could the fact that the scikit-learn pipeline's TfidfVectorizer uses the default tokenizer (None), while the PMML pipeline's TfidfVectorizer uses the Splitter, be causing such a major change?
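For readers following along, a minimal sketch of the two substitutions being discussed (the 5000-feature count comes from the issue description; the other parameters are illustrative, not taken from the actual project):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC, SVC
from sklearn2pmml import PMMLPipeline
from sklearn2pmml.feature_extraction.text import Splitter

# Scikit-learn variant: default tokenizer (None), LinearSVC
sklearn_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features = 5000)),
    ("classifier", LinearSVC())
])

# PMML variant: Splitter tokenizer, SVC with a linear kernel
pmml_pipeline = PMMLPipeline([
    ("tfidf", TfidfVectorizer(max_features = 5000, tokenizer = Splitter())),
    ("classifier", SVC(kernel = "linear"))
])
```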
Impossible! If you look at the source code of the Splitter class, it simply overrides the __call__ method to split the text into tokens.
Thank you for your clarification. Can I ask what the class Splitter.java does?
You mean the Splitter Java class? It's the Java counterpart for the Python class sklearn2pmml.feature_extraction.text.Splitter. As you can see from its source code, both do the same thing - split a line of text into tokens.
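For reference, a rough approximation of what the Python Splitter does; the default separator pattern below is an assumption, not a quotation of the actual source code:

```python
import re

# Approximation of sklearn2pmml.feature_extraction.text.Splitter;
# the default whitespace pattern is an assumption
class SplitterSketch(object):

    def __init__(self, word_separator_re = r"\s+"):
        self.word_separator_re = word_separator_re

    def __call__(self, line):
        # Split on the separator pattern, dropping empty tokens
        return [token for token in re.split(self.word_separator_re, line) if token]

print(SplitterSketch()("Hello  PMML world"))  # ['Hello', 'PMML', 'world']
```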
Villu, thank you for your comments. I think the PMMLPipeline does not support a TfidfVectorizer with l2 norm. Is that right? If so, could the difference between the accuracies of the scikit-learn pipeline model and the PMMLPipeline model be a result of not normalizing the tf-idf vectors? Is there a way to have a PMMLPipeline with normalized tf-idfs?
If the norm attribute of the TfidfVectorizer is anything other than None, then the conversion fails with an error - normalization is currently not supported.
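A hedged configuration sketch of what a convertible vectorizer would then look like (all parameters other than norm are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn2pmml.feature_extraction.text import Splitter

# Disabling normalization so that the vectorizer becomes convertible to PMML
tfidf = TfidfVectorizer(
    tokenizer = Splitter(),
    norm = None,        # the default "l2" is not supported by the converter
    max_features = 5000
)
```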
It is rather difficult to make normalization work, because in order to compute the denominator, one would need to compute term frequencies for all the terms in your vocabulary. However, the PMML representation only keeps "active" terms (ie. terms that are actually needed by the model), and omits "inactive" terms. For example, suppose your starting vocabulary contains 10'000 terms, but the random forest model "uses" only 1'000 of them. To make normalization work, it would then be necessary to compute the frequencies of the unused 9'000 terms as well - which seems like a waste of resources if they are not needed for anything else at later stages. Scikit-Learn does not distinguish between "active" and "inactive" terms so clearly.
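A toy arithmetic illustration of this point (the numbers are invented): the l2 denominator depends on every term's weight, so dropping "inactive" terms changes the normalized values of the "active" ones.

```python
import math

# Made-up term weights; the model only references "alpha" and "beta"
all_term_weights = {"alpha": 3.0, "beta": 4.0, "gamma": 12.0}
active_terms = {"alpha", "beta"}

full_norm = math.sqrt(sum(w * w for w in all_term_weights.values()))            # 13.0
active_norm = math.sqrt(sum(all_term_weights[t] ** 2 for t in active_terms))    # 5.0

print(all_term_weights["alpha"] / full_norm)    # ~0.231, the correct normalized value
print(all_term_weights["alpha"] / active_norm)  # 0.6, wrong if inactive terms are dropped
```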
It follows from the previous comment that you can improve the relative performance of (J)PMML by performing "term selection" in your pipeline - for example, by adding a dedicated feature selection step so that the vocabulary only contains terms that the model actually uses.

Well, it also follows that the converter should enable normalization when the last step of your pipeline is a "greedy" classifier. For example, in the case of LinearSVC every term in the vocabulary is "active" (each one carries a coefficient), so the normalization denominator could be computed in full.
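One possible way to set up such a term selection step; SelectKBest is this editor's example, not necessarily the step the author had in mind:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn2pmml import PMMLPipeline
from sklearn2pmml.feature_extraction.text import Splitter

pipeline = PMMLPipeline([
    ("tfidf", TfidfVectorizer(tokenizer = Splitter(), norm = None)),
    ("selector", SelectKBest(chi2, k = 1000)),  # keep only the most informative terms
    ("classifier", LinearSVC())
])
```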
Hi, this is a follow-up to https://github.com/jpmml/sklearn2pmml/issues/65.
As described there, I have two models (both linear SVMs with 5000 features): one created using scikit-learn's Pipeline and one created using the sklearn2pmml PMMLPipeline. Both models are multi-class (around 13 classes) and are trained on the same data with the same parameters. Evaluating them on the same data results in different accuracies: the difference in precision between the two models is 24% for one class and 5% for another, and recall differs similarly. Is this behavior expected?
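For concreteness, a minimal runnable sketch of the comparison described above (the toy documents, labels, and file name are placeholders invented for illustration; the real setup uses ~13 classes and held-out evaluation data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn2pmml import PMMLPipeline, sklearn2pmml
from sklearn2pmml.feature_extraction.text import Splitter

# Placeholder training data; the real data set is multi-class
X_train = ["good product", "bad service", "great support", "terrible quality"]
y_train = ["pos", "neg", "pos", "neg"]

# Scikit-learn variant
sk_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features = 5000)),
    ("classifier", LinearSVC())
])
sk_pipeline.fit(X_train, y_train)

# PMML variant; norm = None per the normalization discussion above
pmml_pipeline = PMMLPipeline([
    ("tfidf", TfidfVectorizer(max_features = 5000, tokenizer = Splitter(), norm = None)),
    ("classifier", LinearSVC())
])
pmml_pipeline.fit(X_train, y_train)

# The reported accuracy differences would show up here on a held-out set
print(sk_pipeline.score(X_train, y_train))
print(pmml_pipeline.score(X_train, y_train))

# Export the PMML document (requires a Java runtime)
sklearn2pmml(pmml_pipeline, "LinearSVM.pmml")
```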