
Difference in Accuracies between scikit-learn and pmml models #68

Closed
nejatb opened this issue Nov 22, 2017 · 8 comments
@nejatb

nejatb commented Nov 22, 2017

Hi, this is a follow up to https://github.com/jpmml/sklearn2pmml/issues/65.

As described, I have two models (both linear SVMs with 5000 features): one created using scikit-learn's pipeline and one created using the sklearn2pmml pipeline. Both are multi-class models (around 13 classes) and are trained on the same data with the same parameters. Evaluating them on the same data yields different accuracies: the difference in precision between the two models is 24% for one class and 5% for another, and recall shows similar gaps. Is this behavior expected?

@vruusmann
Member

I have two models (both linear SVMs with 5000 features): one created using scikit-learn's pipeline and one created using the sklearn2pmml pipeline.

What do you mean - replacing the class sklearn.pipeline.Pipeline with sklearn2pmml.PMMLPipeline in your script causes different predictions?

@nejatb
Author

nejatb commented Nov 22, 2017

Yes. That, and also replacing LinearSVC in scikit-learn with SVC with a linear kernel for the PMML pipeline.

pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('InputField1', Pipeline([
                ('selector', ItemSelector(key='InputField1')),
                ('vect', TfidfVectorizer(stop_words='english')),
            ])),
            ('InputField2', Pipeline([
                ('selector', ItemSelector(key='InputField2')),
                ('vect', TfidfVectorizer(stop_words='english')),
            ])),
        ],
    )),
    ('select', SelectKBest(chi2, k=5000)),
    ('clf', LinearSVC(dual=False, tol=1e-3)),
])

and

pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([
        ("InputField1", TfidfVectorizer(stop_words='english',
                                        norm=None,
                                        tokenizer=Splitter(),
                                        ngram_range=(1, 1),
                                        max_df=0.9,
                                        min_df=2)),
        ("InputField2", TfidfVectorizer(stop_words='english',
                                        norm=None,
                                        tokenizer=Splitter(),
                                        ngram_range=(1, 1),
                                        max_df=0.7,
                                        min_df=5)),
    ], df_out=True)),
    ("selector", SelectKBest(chi2, k=5000)),
    ("classifier", SVC(kernel='linear', tol=1e-3)),
])

Could the fact that the scikit-learn pipeline uses the TfidfVectorizer's default tokenizer (None), while the PMML pipeline's TfidfVectorizer uses the Splitter, be causing such a major change?
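For what it's worth, the two tokenization strategies can produce quite different vocabularies. A minimal sketch of the difference (using scikit-learn's actual default `token_pattern`, and plain whitespace splitting as a stand-in for Splitter-style behaviour):

```python
import re

text = "A cat-like robot, version 2.0"

# scikit-learn's default token_pattern: word sequences of 2+ characters,
# so punctuation is stripped and single-character tokens are dropped
default_tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())

# Whitespace splitting: punctuation stays attached, short tokens survive
whitespace_tokens = text.lower().split()

print(default_tokens)     # ['cat', 'like', 'robot', 'version']
print(whitespace_tokens)  # ['a', 'cat-like', 'robot,', 'version', '2.0']
```

With 5000 features selected by chi2 downstream, a vocabulary shift of this kind could plausibly change which features get selected.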

@vruusmann
Member

Yes.

Impossible!

If you look at the source code of sklearn2pmml.PMMLPipeline, then it does not contain any "business logic" in itself:
https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/__init__.py#L23-L49

It overrides the fit(X, y) method, but only to grab the column names; control is then passed on to the fit(X, y) method of the regular Pipeline class.
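The delegation pattern described above can be sketched as follows. This is not the actual sklearn2pmml source, just a minimal illustration of a Pipeline subclass that records metadata and then defers entirely to the parent's fit, so the fitted model is identical:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Sketch only (NOT the sklearn2pmml implementation): the subclass grabs
# column names, then hands control straight back to Pipeline.fit.
class MetadataPipeline(Pipeline):
    def fit(self, X, y=None, **fit_params):
        if hasattr(X, "columns"):  # feature names from a DataFrame
            self.active_fields = [str(c) for c in X.columns]
        if hasattr(y, "name"):     # target name from a Series
            self.target_field = str(y.name)
        # Unchanged learning logic
        return super().fit(X, y, **fit_params)

X = pd.DataFrame({"x1": [0.0, 1.0, 2.0, 3.0], "x2": [1.0, 0.0, 1.0, 0.0]})
y = pd.Series([0, 0, 1, 1], name="label")

pipeline = MetadataPipeline([("clf", LogisticRegression())])
pipeline.fit(X, y)
print(pipeline.active_fields, pipeline.target_field)  # ['x1', 'x2'] label
```

Since nothing about the estimator changes, swapping the pipeline class alone should not alter predictions; the differing TfidfVectorizer settings are the more likely culprit.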

@nejatb
Author

nejatb commented Nov 22, 2017

Thank you for the clarification. Can I ask what the class Splitter.java does?
Does it split based on words? And is it different from the TfidfVectorizer's default tokenizer (None)?

@vruusmann
Member

Can I ask you about what the class Splitter.java does?

You mean the following Java class?
https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn2pmml/feature_extraction/text/Splitter.java

It's the Java counterpart for the following Python class:
https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/feature_extraction/text/__init__.py#L16-L33

As you can see from its fit(X, y) method, the default behaviour is to tokenize text by whitespace character(s).
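To see the effect of whitespace tokenization inside a TfidfVectorizer, here is a sketch using a plain whitespace-splitting callable as a stand-in for Splitter (whose default behaviour, per the comment above, is to tokenize by whitespace). Passing `tokenizer=` overrides the default `token_pattern`, so punctuation stays attached to tokens:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for sklearn2pmml's Splitter: split on whitespace only
def whitespace_tokenizer(text):
    return text.split()

vect = TfidfVectorizer(tokenizer=whitespace_tokenizer, norm=None)
vect.fit(["good-bye world", "hello world"])
print(sorted(vect.vocabulary_))  # 'good-bye' survives as a single token
```

With the default tokenizer, "good-bye" would instead be split into "good" and "bye".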

@nejatb
Author

nejatb commented Dec 1, 2017

Villu, thank you for your comments. I think the PMMLPipeline does not support a TfidfVectorizer with l2 norm. Is that right? If so, could the difference between the accuracies of the scikit-learn pipeline model and the PMMLPipeline model be a result of not normalizing the tf-idf vectors?

Is there a way that I could have a PMMLPipeline with normalized tf-idf vectors?
Thank you

@vruusmann
Member

I think the PMMLPipeline does not support a TfidfVectorizer with l2 norm. Is that right?

If the norm attribute is anything other than null (aka None), then the converter should throw the following IllegalArgumentException (sorry, no helpful exception message there):
https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn/feature_extraction/text/TfidfVectorizer.java#L71-L74

Is there a way that I could have a PMMLPipeline with normalized tf-idf vectors?

It is rather difficult to make normalization work, because in order to compute the denominator, one would need to compute term frequencies for all the terms in the vocabulary. However, the PMML representation only keeps "active" terms (i.e. terms that are actually needed by the model) and omits "inactive" terms.

For example, suppose your starting vocabulary contains 10,000 terms, but the random forest model "uses" only 1,000 of them. To make normalization work, it would be necessary to compute the frequencies of the unused 9,000 terms as well - which seems like a waste of resources if they are not needed for anything else at later stages.

Scikit-learn does not distinguish between "active" and "inactive" terms so clearly: if there is a TfidfVectorizer transformation in the pipeline, the frequencies of all 10,000 features are computed (and it is straightforward to sum their values to get the denominator).
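The point about the denominator can be shown with a small numeric sketch (hypothetical numbers, not taken from the issue): the l2 norm is computed over the full tf-idf row, so dropping "inactive" columns changes the normalized value of the "active" ones.

```python
import numpy as np

# Hypothetical 5-term tf-idf row, of which a model only "uses" the first two
row = np.array([3.0, 4.0, 2.0, 1.0, 2.0])

full_norm = np.linalg.norm(row)        # over all 5 terms: sqrt(34)
active_norm = np.linalg.norm(row[:2])  # over the 2 "active" terms: 5.0

print(row[:2] / full_norm)    # what scikit-learn would produce
print(row[:2] / active_norm)  # what a pruned representation could compute
```

The two results differ for the very same terms, which is why a representation that only stores active terms cannot reproduce l2-normalized values exactly.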

@vruusmann
Member

It follows from the previous comment that you can improve the relative performance of (J)PMML by performing "term selection" in your pipeline.

For example, try replacing SVC (uses all terms) with RandomForestClassifier (uses a subset of most significant terms), and see how the performance numbers change.

Well, it also follows that the converter should enable normalization when the last step of the pipeline is a "greedy" classifier: for example, in the case of SVC, the frequencies of all terms in the vocabulary will be computed anyway.
