
Difference in Accuracies between scikit-learn and pmml models #68

Closed
nejatb opened this issue Nov 22, 2017 · 8 comments
@nejatb

nejatb commented Nov 22, 2017

Hi, this is a follow up to https://github.com/jpmml/sklearn2pmml/issues/65.

As described, I have two models (both linear SVMs with 5000 features): one created using scikit-learn's pipeline and one created using the sklearn2pmml pipeline. Both are multi-class models (around 13 classes) and are trained on the same data with the same parameters. Evaluating them on the same data yields different accuracies: the difference in precision between the two models is 24% for one class and 5% for another, and recall shows similar gaps. Is this behavior expected?

@vruusmann
Member

I have two models (both linear SVMs with 5000 features): one created using scikit-learn's pipeline and one created using the sklearn2pmml pipeline.

What do you mean - replacing the class sklearn.pipeline.Pipeline with sklearn2pmml.PMMLPipeline in your script causes different predictions?

@nejatb
Author

nejatb commented Nov 22, 2017

Yes. That, and also replacing LinearSVC in scikit-learn with SVC with a linear kernel for the PMML pipeline.

pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('InputField1', Pipeline([
                ('selector', ItemSelector(key='InputField1')),
                ('vect', TfidfVectorizer(stop_words='english')),
            ])),
            ('InputField2', Pipeline([
                ('selector', ItemSelector(key='InputField2')),
                ('vect', TfidfVectorizer(stop_words='english')),
            ])),
        ],
    )),
    ('select', SelectKBest(chi2, k=5000)),
    ('clf', LinearSVC(dual=False, tol=1e-3)),
])

and

pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([
        ("InputField1", TfidfVectorizer(stop_words='english',
                                        norm=None,
                                        tokenizer=Splitter(),
                                        ngram_range=(1, 1),
                                        max_df=0.9,
                                        min_df=2)),
        ("InputField2", TfidfVectorizer(stop_words='english',
                                        norm=None,
                                        tokenizer=Splitter(),
                                        ngram_range=(1, 1),
                                        max_df=0.7,
                                        min_df=5)),
    ], df_out=True)),
    ("selector", SelectKBest(chi2, k=5000)),
    ("classifier", SVC(kernel='linear', tol=1e-3)),
])

Could the fact that the scikit-learn pipeline uses the TfidfVectorizer's default tokenizer (None), while the PMML pipeline's TfidfVectorizer uses the Splitter, be causing such a major change?
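For what it's worth, the two tokenization strategies can produce quite different vocabularies. A minimal sketch of the difference (using scikit-learn's actual default `token_pattern`, and plain whitespace splitting as a stand-in for Splitter-style behaviour):

```python
import re

text = "A cat-like robot, version 2.0"

# scikit-learn's default token_pattern: word sequences of 2+ characters,
# so punctuation is stripped and single-character tokens are dropped
default_tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())

# Whitespace splitting: punctuation stays attached, short tokens survive
whitespace_tokens = text.lower().split()

print(default_tokens)     # ['cat', 'like', 'robot', 'version']
print(whitespace_tokens)  # ['a', 'cat-like', 'robot,', 'version', '2.0']
```

With 5000 features selected by chi2 downstream, a vocabulary shift of this kind could plausibly change which features get selected.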

@vruusmann
Member

Yes.

Impossible!

If you look at the source code of sklearn2pmml.PMMLPipeline, then it does not contain any "business logic" in itself:
https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/__init__.py#L23-L49

It overrides the fit(X, y) method, but only to grab the column names; control is then passed on to the fit(X, y) method of the regular Pipeline class.
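The delegation pattern described above can be sketched as follows. This is not the actual sklearn2pmml source, just a minimal illustration of a Pipeline subclass that records metadata and then defers entirely to the parent's fit, so the fitted model is identical:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Sketch only (NOT the sklearn2pmml implementation): the subclass grabs
# column names, then hands control straight back to Pipeline.fit.
class MetadataPipeline(Pipeline):
    def fit(self, X, y=None, **fit_params):
        if hasattr(X, "columns"):  # feature names from a DataFrame
            self.active_fields = [str(c) for c in X.columns]
        if hasattr(y, "name"):     # target name from a Series
            self.target_field = str(y.name)
        # Unchanged learning logic
        return super().fit(X, y, **fit_params)

X = pd.DataFrame({"x1": [0.0, 1.0, 2.0, 3.0], "x2": [1.0, 0.0, 1.0, 0.0]})
y = pd.Series([0, 0, 1, 1], name="label")

pipeline = MetadataPipeline([("clf", LogisticRegression())])
pipeline.fit(X, y)
print(pipeline.active_fields, pipeline.target_field)  # ['x1', 'x2'] label
```

Since nothing about the estimator changes, swapping the pipeline class alone should not alter predictions; the differing TfidfVectorizer settings are the more likely culprit.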

@nejatb
Author

nejatb commented Nov 22, 2017

Thank you for the clarification. Can I ask what the class Splitter.java does?
Does it split based on words? And is it different from the TfidfVectorizer's default tokenizer (None)?

@vruusmann
Member

Can I ask you about what the class Splitter.java does?

You mean the following Java class?
https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn2pmml/feature_extraction/text/Splitter.java

It's the Java counterpart for the following Python class:
https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/feature_extraction/text/__init__.py#L16-L33

As you can see from its fit(X, y) method, the default behaviour is to tokenize text by whitespace character(s).
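To see the effect of whitespace tokenization inside a TfidfVectorizer, here is a sketch using a plain whitespace-splitting callable as a stand-in for Splitter (whose default behaviour, per the comment above, is to tokenize by whitespace). Passing `tokenizer=` overrides the default `token_pattern`, so punctuation stays attached to tokens:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for sklearn2pmml's Splitter: split on whitespace only
def whitespace_tokenizer(text):
    return text.split()

vect = TfidfVectorizer(tokenizer=whitespace_tokenizer, norm=None)
vect.fit(["good-bye world", "hello world"])
print(sorted(vect.vocabulary_))  # 'good-bye' survives as a single token
```

With the default tokenizer, "good-bye" would instead be split into "good" and "bye".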

@nejatb
Author

nejatb commented Dec 1, 2017

Villu, thank you for your comments. I think the PMMLPipeline does not support a TfidfVectorizer with l2 norm. Is that right? If so, could the difference between the accuracies of the scikit-learn pipeline model and the PMMLPipeline model be a result of not normalizing the tf-idf vectors?

Is there a way that I could have a PMMLPipeline with normalized tf-idf vectors?
Thank you

@vruusmann
Member

I think the PMMLPipeline does not support a TfidfVectorizer with l2 norm. Is that right?

If the norm attribute is anything other than null (aka None), then the converter should throw the following IllegalArgumentException (sorry, no helpful exception message there):
https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn/feature_extraction/text/TfidfVectorizer.java#L71-L74

Is there a way that I could have a PMMLPipeline with normalized tf-idf vectors?

It is rather difficult to make normalization work, because in order to compute the denominator, one would need to compute term frequencies for all the terms in the vocabulary. However, the PMML representation only keeps "active" terms (i.e. terms that are actually needed by the model) and omits "inactive" terms.

For example, suppose your starting vocabulary contains 10,000 terms, but the random forest model "uses" only 1,000 of them. To make normalization work, it would be necessary to compute the frequencies of the unused 9,000 terms as well - which seems like a waste of resources if they are not needed for anything else at later stages.

Scikit-learn does not distinguish between "active" and "inactive" terms so clearly: if there is a TfidfVectorizer transformation in the pipeline, the frequencies of all 10,000 features are computed (and it is straightforward to sum their values to get the denominator).
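The point about the denominator can be shown with a small numeric sketch (hypothetical numbers, not taken from the issue): the l2 norm is computed over the full tf-idf row, so dropping "inactive" columns changes the normalized value of the "active" ones.

```python
import numpy as np

# Hypothetical 5-term tf-idf row, of which a model only "uses" the first two
row = np.array([3.0, 4.0, 2.0, 1.0, 2.0])

full_norm = np.linalg.norm(row)        # over all 5 terms: sqrt(34)
active_norm = np.linalg.norm(row[:2])  # over the 2 "active" terms: 5.0

print(row[:2] / full_norm)    # what scikit-learn would produce
print(row[:2] / active_norm)  # what a pruned representation could compute
```

The two results differ for the very same terms, which is why a representation that only stores active terms cannot reproduce l2-normalized values exactly.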

@vruusmann
Member

It follows from the previous comment that you can improve the relative performance of (J)PMML by performing "term selection" in your pipeline.

For example, try replacing SVC (uses all terms) with RandomForestClassifier (uses a subset of most significant terms), and see how the performance numbers change.

Well, it also follows that the converter should enable normalization when the last step of the pipeline is a "greedy" classifier: for example, in the case of SVC, the frequencies of all terms in the vocabulary will be computed anyway.
