Creating PMMLs from lifelines #833

Open
AbdealiLoKo opened this issue Sep 17, 2019 · 6 comments

@AbdealiLoKo
Contributor

Following the conversation on #188, I thought I'd create this issue so the problem can be tracked and a solution found.

It would be awesome if we could figure out a way to create PMML/PFA files from lifelines models, as those are the standard formats for exchanging models between tools.

Currently, PMML creation using sklearn2pmml does not work on lifelines because of a pickling error.

@AbdealiLoKo
Contributor Author

I spent some time on this today.
I tried sklearn2pmml, from the JPMML project, since it is more or less the standard tool available.
With it, you first build a PMMLPipeline and then dump that pipeline to PMML.

Code:

import pandas as pd
from sklearn2pmml.pipeline import PMMLPipeline
from lifelines.utils.sklearn_adapter import sklearn_adapter
from lifelines import CoxPHFitter

df = pd.DataFrame({
    'T': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
    'E': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    'var': [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2],
    'age': [4, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
})

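# sklearn_adapter builds a scikit-learn-style class around CoxPHFitter.
# The event column 'E' travels inside X; the duration 'T' is passed as y.
CoxSklearnRegression = sklearn_adapter(CoxPHFitter, event_col='E')
pipeline = PMMLPipeline([
    ("regressor", CoxSklearnRegression())
])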
pipeline.fit(df[['var', 'age', 'E']], df[['T']])

print(pd.DataFrame({
    'pred': pipeline.predict(df[['var', 'age', 'E']]),
    'actual': df['T'].tolist(),
}))

Creating the PMML Pipeline worked great.

After this, when I try:

from sklearn2pmml import sklearn2pmml
sklearn2pmml(pipeline, "coxph.pmml", with_repr = True)

It errors:

Standard error:
Oct 04, 2019 10:16:35 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Oct 04, 2019 10:16:35 PM org.jpmml.sklearn.Main run
SEVERE: Failed to parse PKL
net.razorvine.pickle.PickleException: failed to __setstate__()
	at net.razorvine.pickle.Unpickler.load_build(Unpickler.java:409)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:234)
	at org.jpmml.sklearn.PickleUtil$1.dispatch(PickleUtil.java:77)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:123)
	at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:98)
	at org.jpmml.sklearn.Main.run(Main.java:104)
	at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.NoSuchMethodException: net.razorvine.pickle.objects.ClassDict.__setstate__([Ljava.lang.Object;)
	at java.lang.Class.getMethod(Class.java:1786)
	at net.razorvine.pickle.Unpickler.load_build(Unpickler.java:406)
	... 6 more

Exception in thread "main" net.razorvine.pickle.PickleException: failed to __setstate__()
	at net.razorvine.pickle.Unpickler.load_build(Unpickler.java:409)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:234)
	at org.jpmml.sklearn.PickleUtil$1.dispatch(PickleUtil.java:77)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:123)
	at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:98)
	at org.jpmml.sklearn.Main.run(Main.java:104)
	at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.NoSuchMethodException: net.razorvine.pickle.objects.ClassDict.__setstate__([Ljava.lang.Object;)
	at java.lang.Class.getMethod(Class.java:1786)
	at net.razorvine.pickle.Unpickler.load_build(Unpickler.java:406)
	... 6 more

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-f6552a17c384> in <module>
     26 
     27 from sklearn2pmml import sklearn2pmml
---> 28 sklearn2pmml(pipeline, "coxph.pmml", with_repr = True)

~/miniconda3.7/lib/python3.7/site-packages/sklearn2pmml/__init__.py in sklearn2pmml(pipeline, pmml, user_classpath, with_repr, debug, java_encoding)
    250                                 print("Standard error is empty")
    251                 if retcode:
--> 252                         raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams")
    253         finally:
    254                 if debug:

RuntimeError: The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams

I think I know why it happens.
The library is probably, at some point, pickling the model, sending it to a Java runtime, and unpickling it there. The problem is that the class created by sklearn_adapter(CoxPHFitter, event_col='E') doesn't really exist anywhere; it's just a temporary class created in https://github.com/CamDavidsonPilon/lifelines/blob/master/lifelines/utils/sklearn_adapter.py#L123

So as long as sklearn_adapter creates dynamic classes, this won't work.
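A minimal sketch of the suspected mechanism, using only the standard library. `Base`, `make_adapter`, and `StaticAdapter` are hypothetical names for illustration, not lifelines code: pickle stores classes by module and name, so an instance of a class built at call time and never bound to a module-level name cannot be pickled, while an instance of an equivalent module-level class can.

```python
import pickle


class Base:
    """Hypothetical stand-in for a model class (not part of lifelines)."""

    def __init__(self, penalizer=0.0):
        self.penalizer = penalizer


def make_adapter(base):
    # Builds a new class at call time. The class is returned to the caller
    # but never bound to a module-level name, so pickle cannot look it up.
    return type("Adapted" + base.__name__, (base,), {})


class StaticAdapter(Base):
    """Same idea, but defined at module level so pickle can find it by name."""


def can_pickle(obj):
    # Round-trip through pickle and report whether it succeeded.
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (pickle.PicklingError, AttributeError):
        return False


dynamic_ok = can_pickle(make_adapter(Base)())
static_ok = can_pickle(StaticAdapter())
print("dynamic class picklable:", dynamic_ok)
print("static class picklable:", static_ok)
```

Whether this is the actual cause of the Java-side failure above is an assumption; the sketch only shows why dynamically created classes are fragile under pickle.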

@CamDavidsonPilon I'm curious to understand why dynamic classes were chosen for this. And thoughts on moving it to static classes? While more verbose, it would be a bit more robust, I think.

@CamDavidsonPilon
Owner

And thoughts on moving it to static classes

I'm open to this. As you know, current support for sklearn is limited because of these dynamic classes. Can you describe what this might look like?

@AbdealiLoKo
Contributor Author

AbdealiLoKo commented Oct 4, 2019

Could you point me to docs/info on what the sklearn adapter is intended for and what limitations are known?

That way we may be able to identify a better architecture. I do have some thoughts, but wanted to check if they solve other needs.

Note: With regard to PMML, I think that even if the classes are made static, some more changes may be needed. I'll continue my exploration.

@CamDavidsonPilon
Owner

CamDavidsonPilon commented Oct 4, 2019

Its intention is to create an API that a) resembles and b) is compatible with scikit-learn. That is, to have classes that behave like, for example, sklearn.linear_model.LinearRegression but contain a lifelines model. That way, these classes can plug into tools like GridSearchCV to find the best group of parameters for a model.
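As a rough illustration of that idea with a statically defined class, here is a sketch that assumes scikit-learn and NumPy are installed. `ToyModel` and `ToyRegressor` are hypothetical stand-ins (a penalized mean instead of a real lifelines fitter), and this is not how sklearn_adapter is implemented; it only shows the estimator conventions that let a wrapper plug into GridSearchCV.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import GridSearchCV


class ToyModel:
    """Hypothetical stand-in for a lifelines fitter: predicts a shrunken mean."""

    def __init__(self, penalizer=0.0):
        self.penalizer = penalizer

    def fit(self, durations):
        # A larger penalizer shrinks the fitted value toward zero.
        self.center_ = float(np.sum(durations)) / (len(durations) + self.penalizer)
        return self

    def predict(self, n_rows):
        return np.full(n_rows, self.center_)


class ToyRegressor(RegressorMixin, BaseEstimator):
    """Module-level wrapper that follows scikit-learn's estimator conventions."""

    def __init__(self, penalizer=0.0):
        # __init__ only stores hyperparameters, so get_params/set_params
        # (inherited from BaseEstimator) and clone() work as expected.
        self.penalizer = penalizer

    def fit(self, X, y):
        # The wrapped model is created at fit time and kept as an attribute.
        self.model_ = ToyModel(penalizer=self.penalizer).fit(np.asarray(y, dtype=float))
        return self

    def predict(self, X):
        return self.model_.predict(len(X))


X = np.arange(12, dtype=float).reshape(-1, 1)
y = np.array([5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7], dtype=float)

search = GridSearchCV(ToyRegressor(), {"penalizer": [0.0, 1.0, 10.0]}, cv=3)
search.fit(X, y)
best = search.best_params_["penalizer"]
print("best penalizer:", best)
```

Because `ToyRegressor` lives at module level, it can be found by name when serialized, which is the trade-off against the more compact dynamic approach.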

Known limitations are the ones you've bumped into, namely serialization: i) because autograd/jax creates anonymous functions, the models don't work well with most serialization libraries; ii) dynamically created classes almost always fail with serialization libraries too.
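Limitation i) can be reproduced in isolation with the standard library. In this sketch, `Model` and `named_link` are hypothetical names, and the lambda merely stands in for the kind of anonymous function an autodiff library might attach to a fitted object:

```python
import pickle


def named_link(x):
    # A module-level function: pickle can store it by name.
    return 2.0 * x


class Model:
    """Hypothetical model object that keeps a callable as an attribute."""

    def __init__(self, link):
        self.link = link


def try_pickle(obj):
    # Return "ok" on success, otherwise the name of the exception raised.
    try:
        pickle.dumps(obj)
        return "ok"
    except (pickle.PicklingError, AttributeError) as exc:
        return type(exc).__name__


lambda_result = try_pickle(Model(lambda x: 2.0 * x))
named_result = try_pickle(Model(named_link))
print("model holding a lambda:", lambda_result)
print("model holding a named function:", named_result)
```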

Docs are here: https://lifelines.readthedocs.io/en/latest/Compatibility%20with%20scikit-learn.html
The unit tests are useful too: https://github.com/CamDavidsonPilon/lifelines/blob/master/tests/utils/test_utils.py#L917 (note the xfail tests as well)

@AbdealiLoKo
Contributor Author

AbdealiLoKo commented Oct 5, 2019

@CamDavidsonPilon Now that the pickling issues are resolved, I wanted to take a look at PMML too.
I realized that the error I'm seeing here may not be related to the pickling/joblib issues; it is probably that the Java library jpmml-sklearn doesn't support lifelines pickles.
I created jpmml/jpmml-sklearn#183 for this.

I was wondering if you're aware of any library (in Python, R, Java, etc.) that supports PMML for estimators similar to the ones in lifelines?

The part that gets me a bit confused is that lifelines returns a prob value for every time T in the timeline - while most PMMLs I see for Regression etc. have a single score value and not an array of scores.
So, I wanted to see the implementation that some other libraries may have.

@CamDavidsonPilon
Owner

The part that gets me a bit confused is that lifelines returns a prob value for every time T in the timeline - while most PMMLs I see for Regression etc. have a single score value and not an array of scores.

This is true if you are predicting the survival function. Choosing predict_median or predict_percentile returns a single value per observation.
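To make the distinction concrete, here is a small sketch of how a whole survival curve can be reduced to one number per observation. The timeline and probabilities are made-up illustrative values, and `percentile_time` is a hypothetical helper, not the lifelines implementation: it returns the first time at which the survival probability drops to `p` or below, or infinity if it never does.

```python
import math


def percentile_time(timeline, survival, p=0.5):
    """First time at which survival probability is <= p; inf if never reached."""
    for t, s in zip(timeline, survival):
        if s <= p:
            return t
    return math.inf


# One survival curve per subject, evaluated on a shared timeline.
timeline = [2, 3, 4, 5, 6, 7, 8, 9]
curves = {
    "subject_a": [0.95, 0.80, 0.62, 0.48, 0.35, 0.20, 0.10, 0.05],
    "subject_b": [0.99, 0.97, 0.93, 0.88, 0.80, 0.72, 0.65, 0.58],
}

# p=0.5 gives the median survival time: a single score per subject.
medians = {name: percentile_time(timeline, s) for name, s in curves.items()}
print(medians)
```

A single score like this is closer to what a typical regression PMML expects than an array of probabilities over the timeline.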
