-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PMMLCountVectorizer and PMMLTfIdfVectorizer classes #74
Comments
This exception is thrown to inform you that normalization in general, and the "norm" attribute in particular, is not supported at the moment. If the converter simply ignored this fact, then the PMML representation of the Scikit-Learn pipeline wouldn't be exact/correct, and you'd be getting mismatching predictions later on. I remember that I have discussed the background of this "not supported" decision someplace else before. In brief, Scikit-Learn calculates term frequencies all in one go, whereas (J)PMML calculates them one by one, as they are needed (kind of lazy evaluation approach). It would be terribly inefficient to perform normalization over the complete vocabulary - (J)PMML would need to invoke term frequency calculation both for "active" and "inactive" terms. In most cases, the majority of terms fall into the "inactive" category. You should consider rearranging your Scikit-Learn pipeline to perform normalization after you've identified which terms are "active": pipeline = Pipeline([
("generate terms", TfIdfVectorizer()), # All terms
("select terms", SelectKBest(k = 500)) # Identify/select 500 "active" terms
("normalize terms", Normalizer()) # Perform normalization over the subset of 500 "active" terms
("classifier", RandomForestClassifier())
]) Would the explicit normalization using the Some estimator types contain "internal" normalization logic, so the explicit normalization step might be unnecessary/redundant.
The Java side must reproduce the prediction logic of the Scikit-Learn/Python side. In other words, if some Python class attribute is significant from the prediction logic perspective, then the converter must look it up, and generate appropriate PMML markup for it.
When in doubt, compare Scikit-Learn predictions with (J)PMML predictions over a representative data set. |
I refactored this issue to address a more generic usability concern. Namely, it's rather tricky to get the paramerization of Scikit-Learn's The solution would be to introduce |
Hi Villu,
Following issue #68 , we have been attempting to add support for normalized tfidfVectorizer, since our tfidfVectorizer was part of the pipeline. The way I went about doing this is rather hacky but is:
I created a python class called NormalizedFeatureExtraction and in that class, I copied the logistics of scikit learn's tfidfVectorizer but I enforced normalization by removing the norm parameter from the init method and also removing self.norm and instead, normalizing the vectors wherever needed. However, to satisfy the need of the java side, I still return None for
and I kept your implementation of jpmml tfidfVectorizer as its corresponding java side.
I wanted to ask you if there is any better way to do this? And also, when making this change to the python side, do I have to make any changes to the java side as well? Is there any reason why the jpmml tfidfVectorizer throws an exception when norm is not None? Could having normalization the way I have added it, without any change to the java side, potentially cause any logistic problems?
Many thanks for your clarifications.
The text was updated successfully, but these errors were encountered: