Supporting sklearn.feature_extraction.text.TfidfVectorizer #4

Closed
ricaldam opened this issue Jan 29, 2016 · 6 comments

Comments

@ricaldam

It would be great if JPMML-sklearn supported the sklearn.feature_extraction.text.TfidfVectorizer transformation. It would be even better if both sklearn.feature_extraction.text.CountVectorizer and sklearn.feature_extraction.text.TfidfTransformer were supported, since TfidfVectorizer is the combination of these two.
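
To make the "combination of these two" point concrete, here is a minimal pure-Python sketch of the two stages: a counting step followed by a tf-idf weighting step. It assumes scikit-learn's default settings (lowercasing, tokens of two or more word characters, smoothed idf, L2 row normalization). The function names `count_vectorize` and `tfidf_transform` are hypothetical, chosen only for illustration. This is neither scikit-learn's nor JPMML-sklearn's implementation.

```python
import math
import re
from collections import Counter

# Assumed default tokenization: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_vectorize(corpus):
    """Stage 1 (CountVectorizer-like): lowercase, tokenize, count terms."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in corpus]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for term, n in Counter(tokens).items():
            row[index[term]] = n
        counts.append(row)
    return vocabulary, counts


def tfidf_transform(counts):
    """Stage 2 (TfidfTransformer-like): smoothed idf, then L2 row normalization."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


# Small demo corpus standing in for a real text dataset.
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]
vocabulary, counts = count_vectorize(corpus)
tfidf = tfidf_transform(counts)
```

With scikit-learn installed, the `tfidf` rows can be compared against `TfidfVectorizer().fit_transform(corpus).toarray()` as a sanity check. Keeping the two stages separate mirrors the split between CountVectorizer and TfidfTransformer requested above.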

@vruusmann
Member

This is a major change, because it also requires updating the core JPMML-Evaluator library. The difficult part is the CountVectorizer class, because it must be translated to PMML's TextIndex element and evaluated as such.

Earlier this week, I took a deeper look into the CountVectorizer and TfidfTransformer transformers. My estimate was that it would take approximately two weeks to implement a full solution (an MVP-style solution would be doable in a week or so).

@vruusmann
Member

Can you provide an example script that captures your intended use of TfidfVectorizer? This class takes a large number of parameters, so it is important to know which subset of parameters should be implemented first.

There's example usage in the scikit-learn documentation, but I'm interested in the exact details of your use case. For demonstration purposes, you could replace your real text corpus with a demo text corpus that is publicly accessible on the internet.

@ricaldam
Author

I'm currently working almost exclusively with text datasets, so this functionality would indeed be really helpful. And I'm pretty sure this is something other people would appreciate.

Regarding the script, I think you can use the example given in the sklearn documentation http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction, since in my case I'm just using the default parameters of TfidfVectorizer. That might change in the future, but for the moment you can consider this to be my use case.

Thank you very much.

@ricaldam
Author

Hello @vruusmann. Is there any news regarding this issue? Thank you.

@clocklear

I too am interested in whether there has been any progress on this front.

@vruusmann
Member

vruusmann commented Feb 1, 2017

Fixed in commit 65636a7
