feature parity with sklearn2pmml #25

Closed
tyers opened this issue Nov 11, 2019 · 6 comments

@tyers

tyers commented Nov 11, 2019

Hi Villu,
I've been investigating the use of Spark pipelines to resolve some of the issues I've been having with sklearn2pmml recently. I'm in the process of working through all of my feature transformations and I have found a number of coverage issues.
I'll list them all here, but please let me know if you would like me to split these into separate issues to make tracking easier.

Modulo
Spark SQL provides a modulo operator, but the jpmml-sparkml documentation does not list it among the supported operators.
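One detail worth checking when porting a modulo feature between the two stacks is the sign convention. The sketch below is plain Python and only illustrates the arithmetic. It assumes that Spark SQL's `%` follows Java's truncated remainder (the result takes the sign of the dividend) and that `pmod` re-wraps a negative remainder, as I understand the Spark documentation. The helper names are hypothetical, and which convention the PMML side applies is not established here.

```python
def truncated_mod(a: int, n: int) -> int:
    """Remainder whose sign follows the dividend (assumed Spark SQL `%` behaviour)."""
    r = abs(a) % abs(n)
    return -r if a < 0 else r


def spark_pmod(a: int, n: int) -> int:
    """Assumed behaviour of Spark SQL `pmod`: shift a negative remainder by n."""
    r = truncated_mod(a, n)
    return truncated_mod(r + n, n) if r < 0 else r


def floored_mod(a: int, n: int) -> int:
    """Python's own `%`, whose result takes the sign of the divisor."""
    return a % n


print(truncated_mod(-7, 3))  # -1
print(spark_pmod(-7, 3))     # 2
print(floored_mod(-7, 3))    # 2
```

For non-negative inputs all three agree, so the difference only matters if the feature can be negative.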

Duration Transformations
Within sklearn2pmml you recently introduced the duration transformer classes. Spark SQL does not appear to have a drop-in replacement, but it is possible to convert a datetime value to a Unix timestamp, which effectively provides seconds_since_year(date, 1970).
There is also a datediff function that calculates the days between two dates directly; in PMML it could potentially be expressed as two days_since_year transformations followed by a subtraction.
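The two equivalences suggested above can be checked with a small pure-Python sketch. The functions `days_since_year`, `seconds_since_year`, and `datediff` below are hypothetical stand-ins named after the functions mentioned in this issue, not the sklearn2pmml classes or the Spark implementations. The sketch assumes all timestamps are in UTC; Spark's own timestamp functions may depend on the session time zone.

```python
from datetime import date, datetime, timezone


def days_since_year(d: date, year: int) -> int:
    """Whole days elapsed since January 1 of the given year."""
    return (d - date(year, 1, 1)).days


def seconds_since_year(dt: datetime, year: int) -> int:
    """Whole seconds elapsed since midnight UTC on January 1 of the given year."""
    origin = datetime(year, 1, 1, tzinfo=timezone.utc)
    return int((dt - origin).total_seconds())


def datediff(end: date, start: date) -> int:
    """Day difference expressed as two days_since_year values and a subtraction."""
    return days_since_year(end, 1970) - days_since_year(start, 1970)


print(datediff(date(2019, 11, 11), date(2019, 11, 1)))  # 10

ts = datetime(2019, 11, 11, tzinfo=timezone.utc)
# With 1970 as the base year, the result matches the Unix timestamp.
print(seconds_since_year(ts, 1970) == int(ts.timestamp()))  # True
```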

Lookup Transformation
As far as I can tell, there is no simple way to do the equivalent of sklearn2pmml.preprocessing.LookupTransformer within a Spark ML pipeline. I just wanted to check with you: would generating a SQL CASE WHEN/ELSE (or when/otherwise) expression work as a suitable replacement here, or would it result in some horrendously inefficient PMML representation?
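To make the question concrete, here is a minimal sketch of generating such a statement from a Python dict. The function names and the example column names are hypothetical. The `__THIS__` placeholder is the one used by `pyspark.ml.feature.SQLTransformer` for its input table. String escaping is deliberately not handled, so values containing quotes or backslashes are rejected.

```python
def sql_literal(value):
    """Render a str, int, or float as a SQL literal (no escaping support)."""
    if isinstance(value, str):
        if "'" in value or "\\" in value:
            raise ValueError("quotes and backslashes are not handled in this sketch")
        return f"'{value}'"
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return repr(value)
    raise TypeError(f"unsupported literal type: {type(value).__name__}")


def case_when_statement(column, mapping, output, default=None):
    """Build a SELECT statement that maps `column` through `mapping`."""
    if not mapping:
        raise ValueError("mapping must not be empty")
    branches = " ".join(
        f"WHEN {column} = {sql_literal(key)} THEN {sql_literal(value)}"
        for key, value in mapping.items()
    )
    # Without an ELSE branch, unmatched rows evaluate to NULL.
    else_part = "" if default is None else f" ELSE {sql_literal(default)}"
    return f"SELECT *, CASE {branches}{else_part} END AS {output} FROM __THIS__"


statement = case_when_statement(
    "color", {"red": 1, "green": 2}, "color_code", default=0
)
print(statement)
```

The resulting string could be passed as the `statement` argument of `SQLTransformer`. The statement grows by one WHEN branch per dict entry, which is where the efficiency concern comes from.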

@vruusmann
Member

> I've been investigating the use of Spark pipelines to resolve some of the issues I've been having with sklearn2pmml recently.

Scikit-Learn and Apache Spark are distinct ML frameworks. If the underlying platforms do not provide feature parity (are not designed with FP in mind), then why do you expect SkLearn2PMML and PySpark2PMML to do so?

> I'll list them all here, but please let me know if you would like me to split these into separate issues to make tracking easier.

Would be appreciated if you could make some time and let me know what's missing and the associated priority.

You could/should open separate issues with the JPMML-SparkML project. The actual Apache Spark-to-PMML conversion code will land there. The PySpark2PMML project is just a thin Python wrapper around the JPMML-SparkML library.

Modulo

There isn't a modulo built-in function in the PMML 4.3 specification. The (J)PMML stack has "standardized" on implementing an extension function "x-modulo" as a workaround.

Duration Transformations

Apache Spark SQL contains some date/time functions. Will have to see if some of them can be mapped to PMML's built-in date/time functions without much "concept drift". If not, should probably introduce our own transformer subclasses or SQL functions.

Lookup Transformation

You could use the Apache Spark SQL "CASE-WHEN" or "IF-ELSE" functions, which translate to PMML's built-in "if" function.

However, a standalone transformer subclass might be more effective, as it allows you to provide the mapping in some dict (aka java.util.Map) form.

@vruusmann
Member

vruusmann commented Nov 11, 2019

Shall we close this issue, or change it to some umbrella issue to track all those sub-tasks? You decide.

Anyway, the big TODO here is that the JPMML-SparkML library is currently 100% Java, but it should be converted to a Java/Scala mix so that it becomes possible to start implementing custom transformer subclasses (AFAIK, these cannot be done effectively in Java).

Once the project is past that hurdle, I'd like to start with implementing the very basic stuff like porting CategoricalDomain and ContinuousDomain decorator classes to JPMML-SkLearn/PySpark2PMML.

@tyers
Author

tyers commented Nov 11, 2019

> then why do you expect SkLearn2PMML and PySpark2PMML to do so?

Poor choice of words on my part here; these are really just my observations on things that are implemented on the sklearn-pmml side (so are possible in PMML) but do not have coverage in Spark.

> Would be appreciated if you could make some time and let me know what's missing and the associated priority.
> You could/should open separate issues with the JPMML-SparkML project.

This is everything I have found in my c. 1 day of investigation, in no real priority order. I'll get issues opened there for these things now, and can open more if I come across anything else.

> There isn't a modulo built-in function in the PMML 4.3 specification. The (J)PMML stack has "standardized" on implementing an extension function "x-modulo" as a workaround.

Understood; I was aware of this from previous discussions over on jpmml-sklearn and figured that the same may be possible here.

> However, a standalone transformer subclass might be more effective, as it allows you to provide the mapping in some dict (aka java.util.Map) form.

Having reviewed the docs, Spark SQL does provide a map function, which looks like it does what is required here.
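As a rough illustration of the map-based idea, the sketch below builds a Spark SQL expression that combines the `map` and `element_at` built-in functions. The helper name and the example column are hypothetical, escaping is not handled, and whether JPMML-SparkML can convert such an expression to PMML is not established in this thread.

```python
def map_lookup_expression(column, mapping):
    """Build `element_at(map(k1, v1, ...), column)` from a Python dict."""

    def lit(value):
        # Only plain strings and numbers are supported in this sketch.
        if isinstance(value, str) and "'" not in value and "\\" not in value:
            return f"'{value}'"
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return repr(value)
        raise ValueError(f"unsupported literal in this sketch: {value!r}")

    if not mapping:
        raise ValueError("mapping must not be empty")
    args = ", ".join(f"{lit(k)}, {lit(v)}" for k, v in mapping.items())
    return f"element_at(map({args}), {column})"


expression = map_lookup_expression("color", {"red": 1, "green": 2})
print(expression)
```

Compared with a CASE WHEN chain, this keeps the mapping in a single expression, although the keys and values are still inlined as SQL literals.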

@vruusmann
Member

All newly raised sub-issues have been ack'ed. Will work on them probably starting from the end of this month, as I'm currently busy preparing for the unveil/release of a new product, and would like to do another iteration of JPMML-SkLearn/SkLearn2PMML first as well.

@tyers
Author

tyers commented Nov 11, 2019

Thanks, Villu.

@vruusmann
Member

Closing as a functional duplicate of jpmml/jpmml-sparkml#81 and jpmml/jpmml-sparkml#83
