
Implementation of sklearn.preprocessing.Normalizer #50

Closed
yaronskaya opened this issue Oct 5, 2017 · 8 comments


@yaronskaya

yaronskaya commented Oct 5, 2017

Hi, I've tried to implement sklearn.preprocessing.Normalizer with the l1 norm as a custom Transformer.

public List<Feature> encodeFeatures(List<Feature> features, SkLearnEncoder encoder){
    List<Feature> result = new ArrayList<>();

    Apply sumExpression = PMMLUtil.createApply("+");
    for(Feature feature : features){
        sumExpression.addExpressions(feature.toContinuousFeature().ref());
    }

    FieldName name = FieldName.create("sum-of-features");
    DerivedField sumField = encoder.createDerivedField(name, sumExpression);
    ContinuousFeature sumFeature = new ContinuousFeature(encoder, sumField);

    for(int i = 0; i < features.size(); i++){
        Feature feature = features.get(i);
        ContinuousFeature continuousFeature = feature.toContinuousFeature();
        Expression expression = continuousFeature.ref();
        expression = PMMLUtil.createApply("/", expression, sumFeature.ref());
        DerivedField derivedField = encoder.createDerivedField(createName(continuousFeature), expression);
        result.add(new ContinuousFeature(encoder, derivedField));
    }

    return result;
}

It builds the PMML, but fails during evaluation with the error "Expected 2 arguments, but got 3000 arguments", where 3000 is features.size().

Am I doing something wrong?

@vruusmann
Member

It builds the PMML, but fails during evaluation with the error "Expected 2 arguments, but got 3000 arguments", where 3000 is features.size().

The + arithmetic function is a binary function, which takes exactly two arguments:
http://dmg.org/pmml/v4-3/BuiltinFunctions.html#arith

If you want to sum any number of arguments, then you should use the sum aggregation function:
http://dmg.org/pmml/v4-3/BuiltinFunctions.html#min

So, the following code change should do the trick:

Apply sumExpression = PMMLUtil.createApply("sum");
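
What the resulting expression tree computes can be sketched in plain Java, independent of the JPMML-Converter API (a minimal illustration of the semantics, not converter code):

```java
import java.util.Arrays;

// Plain-Java illustration of the expression tree built above:
// one "sum" apply over all feature refs, then one "/" apply per feature.
public class L1Normalize {

    public static double[] normalize(double[] features){
        double sum = 0.0;
        for(double value : features){
            sum += value;
        }
        double[] result = new double[features.length];
        for(int i = 0; i < features.length; i++){
            result[i] = features[i] / sum;
        }
        return result;
    }

    public static void main(String[] args){
        System.out.println(Arrays.toString(normalize(new double[]{1.0, 1.0, 2.0})));
        // prints [0.25, 0.25, 0.5]
    }
}
```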

@yaronskaya
Author

@vruusmann, thanks for the response!
Anyway, is there any implementation of normalization? I'm particularly interested in why TfIdfVectorizer doesn't support it.

@vruusmann
Member

@vruusmann, thanks for the response!

Did it solve your problem? I closed this issue because, in theory, the change from + to sum should be all that's needed, but it would be nice to hear what happened in practice.

Anyway, is there any implementation of normalization? I'm particularly interested in why TfIdfVectorizer doesn't support it.

Lack of time - need to focus on more important projects.

@yaronskaya
Author

Yes, it works for me!
Can I ask one more question here? I've found that the JPMML evaluator selects equal tf-idf values for the original and lowercased strings, even if the transformation dictionary contains both variants. Is there a way to overcome that?

@vruusmann
Member

vruusmann commented Oct 9, 2017

I've found that the JPMML evaluator selects equal tf-idf values for the original and lowercased strings, even if the transformation dictionary contains both variants.

This behaviour can be controlled by customizing the value of the TextIndex@isCaseSensitive attribute:
http://dmg.org/pmml/v4-3/Transformations.html#xsdElement_TextIndex

Please note that this attribute may be overridden by the TextIndexNormalization@isCaseSensitive attribute:
http://dmg.org/pmml/v4-3/Transformations.html#xsdElement_TextIndexNormalization

Check your PMML document to see the actual configuration (the TextIndex element should always be present; the TextIndexNormalization element is more specific).

The default value of both attributes is false, which means that the capitalization of tokens is ignored. If you change it to true, then only tokens with the correct capitalization will be taken into consideration.
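
For illustration, a minimal PMML fragment with the attribute set (the field and term names here are hypothetical):

```xml
<!-- isCaseSensitive="true" keeps "Word" and "word" as distinct tokens -->
<DerivedField name="tf(word)" optype="continuous" dataType="double">
  <TextIndex textField="text" isCaseSensitive="true">
    <Constant>word</Constant>
  </TextIndex>
</DerivedField>
```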

Is the (J)PMML behaviour different from the Scikit-Learn behaviour? If so, then it might be worthwhile to open a new issue to implement an appropriate fix. However, be sure to accompany this issue with a reproducible Python code example (e.g. based on the "Sentiment" dataset, which is part of the JPMML-SkLearn integration test suite under the src/test/resources/ directory) - I don't have the time to triangulate the potential problem myself.

@yaronskaya
Author

Thank you for the quick response.

This behaviour can be controlled by customizing the value of the TextIndex@isCaseSensitive attribute:

That helped me.

Is the (J)PMML behaviour different from Scikit-Learn behaviour?

After the fix it gives the same results.

I've implemented different ways of normalization that give the same result as the Scikit-Learn normalizer. I'm also thinking of integrating it into TfIdfVectorizer.
Can I create a pull request for it?
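
As a side note, scikit-learn's Normalizer(norm='l1') divides each row by the sum of the absolute values of its elements, so a plain sum of feature values only matches it when all features are non-negative (as with term counts). A minimal standalone sketch of the scikit-learn semantics:

```java
// Standalone sketch of sklearn.preprocessing.Normalizer(norm='l1') semantics:
// each row is divided by the sum of the ABSOLUTE values of its elements.
public class SklearnL1 {

    public static double[] l1(double[] row){
        double denominator = 0.0;
        for(double value : row){
            denominator += Math.abs(value);
        }
        double[] out = new double[row.length];
        for(int i = 0; i < row.length; i++){
            out[i] = row[i] / denominator;
        }
        return out;
    }
}
```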

@vruusmann
Member

vruusmann commented Oct 9, 2017

Is the (J)PMML behaviour different from Scikit-Learn behaviour?

After the fix it gives the same results.

The goal is that (J)PMML and Scikit-Learn predictions should match by default. So, it might be necessary to revisit the converter for the CountVectorizer (or TfidfVectorizer) transformation, and make sure that all case-sensitivity attributes are properly initialized.

Can I create a pull request for it?

I don't generally accept PRs for IPR (copyrights etc.) reasons.

However, you're welcome to summarize your observations and code changes (e.g. as a patch file), and I will carry them over to the JPMML-SkLearn repository as my original work. I'll credit you in the commit message.

@vruusmann
Member

Opened a new issue about TF-IDF case-sensitivity:
#51

Please list all your relevant observations (and suggested fixes) there.
