
Implementation of sklearn.preprocessing.Normalizer #50

Closed
yaronskaya opened this issue Oct 5, 2017 · 8 comments


@yaronskaya

yaronskaya commented Oct 5, 2017

Hi, I've tried to implement sklearn.preprocessing.Normalizer with the l1 norm as a custom Transformer.

public List<Feature> encodeFeatures(List<Feature> features, SkLearnEncoder encoder){
    List<Feature> result = new ArrayList<>();

    Apply sumExpression = PMMLUtil.createApply("+");
    for(Feature feature : features){
        sumExpression.addExpressions(feature.toContinuousFeature().ref());
    }

    FieldName name = FieldName.create("sum-of-features");
    DerivedField sumField = encoder.createDerivedField(name, sumExpression);
    ContinuousFeature sumFeature = new ContinuousFeature(encoder, sumField);

    for(int i = 0; i < features.size(); i++){
        Feature feature = features.get(i);
        ContinuousFeature continuousFeature = feature.toContinuousFeature();
        Expression expression = continuousFeature.ref();
        expression = PMMLUtil.createApply("/", expression, sumFeature.ref());
        DerivedField derivedField = encoder.createDerivedField(createName(continuousFeature), expression);
        result.add(new ContinuousFeature(encoder, derivedField));
    }

    return result;
}

It builds the PMML, but fails during evaluation with the error "Expected 2 arguments, but got 3000 arguments", where 3000 is features.size().

Am I doing something wrong?

@vruusmann
Member

It builds the PMML, but fails during evaluation with the error "Expected 2 arguments, but got 3000 arguments", where 3000 is features.size().

The + arithmetic function is a binary function, which takes exactly two arguments:
http://dmg.org/pmml/v4-3/BuiltinFunctions.html#arith

If you want to sum any number of arguments, then you should use the sum aggregation function:
http://dmg.org/pmml/v4-3/BuiltinFunctions.html#min

So, the following code change should do the trick:

Apply sumExpression = PMMLUtil.createApply("sum");
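
What the resulting expression tree computes can be sketched in plain Java, independent of the JPMML-Converter API (a minimal illustration of the semantics, not converter code):

```java
import java.util.Arrays;

// Plain-Java illustration of the expression tree built above:
// one "sum" apply over all feature refs, then one "/" apply per feature.
public class L1Normalize {

    public static double[] normalize(double[] features){
        double sum = 0.0;
        for(double value : features){
            sum += value;
        }
        double[] result = new double[features.length];
        for(int i = 0; i < features.length; i++){
            result[i] = features[i] / sum;
        }
        return result;
    }

    public static void main(String[] args){
        System.out.println(Arrays.toString(normalize(new double[]{1.0, 1.0, 2.0})));
        // prints [0.25, 0.25, 0.5]
    }
}
```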

@yaronskaya
Author

@vruusmann, thanks for the response!
Anyway, is there any implementation of normalization? I'm particularly interested in why TfIdfVectorizer doesn't support it.

@vruusmann
Member

@vruusmann, thanks for the response!

Did it solve your problem? I closed this issue because, in theory, the change from + to sum should be all that's needed, but it would be nice to hear what happened in practice.

Anyway, is there any implementation of normalization? I'm particularly interested in why TfIdfVectorizer doesn't support it.

Lack of time - need to focus on more important projects.

@yaronskaya
Author

Yes, it works for me!
Can I ask one more question here? I've found that the JPMML evaluator selects equal tf-idf values for the original and lowercased strings, even if the transformation dictionary contains both variants. Is there a way to overcome that?

@vruusmann
Member

vruusmann commented Oct 9, 2017

I've found that the JPMML evaluator selects equal tf-idf values for the original and lowercased strings, even if the transformation dictionary contains both variants.

This behaviour can be controlled by customizing the value of the TextIndex@isCaseSensitive attribute:
http://dmg.org/pmml/v4-3/Transformations.html#xsdElement_TextIndex

Please note that this attribute may be overridden by the TextIndexNormalization@isCaseSensitive attribute:
http://dmg.org/pmml/v4-3/Transformations.html#xsdElement_TextIndexNormalization

Check your PMML document to see the actual configuration (the TextIndex element should always be present; the TextIndexNormalization element is more specific).

The default value of both attributes is false, which means that the capitalization of tokens is ignored. If you change it to true, then only tokens with the correct capitalization will be taken into consideration.
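
For illustration, a minimal PMML fragment with the attribute set (the field and term names here are hypothetical):

```xml
<!-- isCaseSensitive="true" keeps "Word" and "word" as distinct tokens -->
<DerivedField name="tf(word)" optype="continuous" dataType="double">
  <TextIndex textField="text" isCaseSensitive="true">
    <Constant>word</Constant>
  </TextIndex>
</DerivedField>
```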

Is the (J)PMML behaviour different from the Scikit-Learn behaviour? If so, then it might be worthwhile to open a new issue to implement an appropriate fix. However, be sure to accompany this issue with a reproducible Python code example (e.g. based on the "Sentiment" dataset, which is part of the JPMML-SkLearn integration test suite under the src/test/resources/ directory) - I don't have the time to triangulate the potential problem myself.

@yaronskaya
Author

Thank you for the quick response.

This behaviour can be controlled by customizing the value of the TextIndex@isCaseSensitive attribute:

That helped me.

Is the (J)PMML behaviour different from Scikit-Learn behaviour?

After the fix it gives the same results.

I've implemented different ways of normalization that give the same result as the Scikit-Learn normalizer. I'm also thinking of integrating it into TfIdfVectorizer.
Can I create a pull request for it?
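
As a side note, scikit-learn's Normalizer(norm='l1') divides each row by the sum of the absolute values of its elements, so a plain sum of feature values only matches it when all features are non-negative (as with term counts). A minimal standalone sketch of the scikit-learn semantics:

```java
// Standalone sketch of sklearn.preprocessing.Normalizer(norm='l1') semantics:
// each row is divided by the sum of the ABSOLUTE values of its elements.
public class SklearnL1 {

    public static double[] l1(double[] row){
        double denominator = 0.0;
        for(double value : row){
            denominator += Math.abs(value);
        }
        double[] out = new double[row.length];
        for(int i = 0; i < row.length; i++){
            out[i] = row[i] / denominator;
        }
        return out;
    }
}
```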

@vruusmann
Member

vruusmann commented Oct 9, 2017

Is the (J)PMML behaviour different from Scikit-Learn behaviour?

After the fix it gives the same results.

The goal is that (J)PMML and Scikit-Learn predictions should match by default. So, it might be necessary to revisit the converter for the CountVectorizer (or TfidfVectorizer) transformation, and make sure that all case-sensitivity attributes are properly initialized.

Can I create a pull request for it?

I don't generally accept PRs for IPR (copyrights etc.) reasons.

However, you're welcome to summarize your observations and code changes (e.g. as a patch file), and I will carry them over to the JPMML-SkLearn repository as my original work. I'll credit you in the commit message.

@vruusmann
Member

Opened a new issue about TF-IDF case-sensitivity:
#51

Please list all your relevant observations (and suggested fixes) there.
