Support for customizing missing/invalid value handling across all custom Transformer classes (similar to what's already available in ExpressionTransformer)
#438
Comments
You can make this requirement transparent by using an in-line if-else expression (see the ExpressionTransformer snippet below). This will eliminate all doubt about the first part of the computation.
Explicit missing/invalid value treatment support is currently available at the first step of a Scikit-Learn pipeline. The SkLearn2PMML package calls these special transformers "decorators", and they are located in the sklearn2pmml.decoration module.

Now, in principle, it is possible to make some "ordinary" transformers also support invalid/missing value treatment, if the underlying PMML element (that gets generated) has the corresponding treatment attributes available.

So, to begin answering your question - can you perhaps move the non-valid value treatment commands to the ExpressionTransformer level? But I wouldn't want to upgrade only a single transformer class; this should be available across the board.
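For illustration, a minimal sketch of the decorator approach, assuming the Domain decorators in sklearn2pmml.decoration accept missing_value_treatment / invalid_value_treatment keyword arguments (the argument names and values shown here are my assumption, not quoted from this thread):

```python
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import ContinuousDomain

# Sketch only: decorate an input column at the first pipeline step,
# declaring how missing and invalid values should be treated in the
# generated PMML document.
mapper = DataFrameMapper([
    (["x1"], ContinuousDomain(missing_value_treatment = "as_is", invalid_value_treatment = "as_missing"))
])
```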
Seems closely related to #436.
The business logic can be made explicit with an in-line if-else expression:

transformer = ExpressionTransformer("sklearn2pmml.preprocessing.seconds_since_year(X[0]) if pandas.notnull(X[0]) else None")
@philip-bingham You can take the PMML document generated by SkLearn2PMML, and post-process it using your own Python helper tool, which adds those attributes as appropriate.
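A minimal sketch of such a helper, assuming the PMML 4.4 namespace and standard-library XML parsing; the file names and the attribute value being set are placeholders, not something prescribed in this thread:

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"

# Placeholder file names; adjust to your own pipeline.
tree = ET.parse("pipeline.pmml")
root = tree.getroot()

# Illustrative post-processing step: make every Apply element treat
# invalid inputs as missing values. Other attributes (eg. mapMissingTo)
# could be set the same way.
for apply_el in root.iter("{%s}Apply" % PMML_NS):
    apply_el.set("invalidValueTreatment", "asMissing")

tree.write("pipeline-postprocessed.pmml", xml_declaration = True, encoding = "UTF-8")
```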
Thanks for looking into this @vruusmann. From the above comments it doesn't seem that there's a way to achieve this without changes to the package?

The first approach (adding the null check to the downstream ExpressionTransformer) doesn't work, because X[0] and X[1] are the results of the SecondsSinceYearTransformer, which is where the error is thrown, so we don't even reach this transformer.

The second approach (calling a seconds_since_year helper from inside the expression) looks promising, but it would require some new functions, right? Also, the expression evaluator has a predefined list of modules that it can use functions from, so would sklearn2pmml need to be added to this list for this to work? I will play around with this in my local branch.

I also tried modifying the transformer in my local branch to convert to float instead of int, so that nulls are allowed and propagate. This allows me to get a PMML file out; however, when I try to evaluate on the same dataframe with jpmml_evaluator, I get an error about using the pandas datetime dtype - the fitted DateTimeDomain() is expecting this dtype. So I guess I need to go back and cast to a supported dtype before fitting, but then I think I'm going to run into issues with the subtraction part of the operation, because I think non-pandas datetime dtypes don't support subtraction with nulls.
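For what it's worth, a rough standalone sketch of that float-instead-of-int idea (not the actual SkLearn2PMML source, and the function and parameter names are placeholders): subtracting an epoch timestamp and dividing by a one-second Timedelta yields a float column, so NaT inputs propagate as NaN instead of triggering the int cast error.

```python
import pandas

def seconds_since_epoch(X, epoch = "1970-01-01"):
    # Timedelta division yields float64, so NaT inputs become NaN
    # instead of raising IntCastingNaNError on an int cast.
    delta = pandas.Series(X) - pandas.Timestamp(epoch)
    return (delta / pandas.Timedelta(seconds = 1)).values
```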
You could work around this situation by introducing more "branching" into your pipeline. The idea is to generate a "field is NA / is not NA" flag in the first step(s), and then export it for later use via SkLearn2PMML's cross-references mechanism. Literally, in one branch you handle the "all flags are OK" situation, and in the other(s) all other situations.
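A minimal sketch of the flag-generation step only (the cross-referencing and branch-selection wiring are omitted here); the column name is taken from the issue description, and using an ExpressionTransformer to produce the flag is my assumption:

```python
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import ExpressionTransformer

# Derive an "is not NA" flag for the optional datetime column.
flag_mapper = DataFrameMapper([
    (["historic_event"], ExpressionTransformer("0 if pandas.isnull(X[0]) else 1"))
])
```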
It would be necessary to extract the contents of the transformer's transform() method into a standalone utility function (eg. the sklearn2pmml.preprocessing.seconds_since_year function referenced in the expression above). This function could then be called from its current position inside the transformer, or directly from an ExpressionTransformer expression.
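A sketch of what such a standalone helper might look like, assuming the logic is "seconds elapsed since the start of a given year"; the function does not exist in SkLearn2PMML as released, its name simply mirrors the expression above:

```python
import pandas

def seconds_since_year(x, year = 1970):
    # Return None for missing inputs so that nulls propagate downstream.
    if pandas.isnull(x):
        return None
    delta = pandas.Timestamp(x) - pandas.Timestamp(year = year, month = 1, day = 1)
    return int(delta.total_seconds())
```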
Yes, all expressions are evaluated in a vanilla (ie. newly created and empty) environment. Whatever you want to use in this environment, you have to import it first.
Support for low-level Numpy/Pandas classes is handled by the JPMML-Python library; the JPMML-SkLearn library deals with Scikit-Learn level classes. I just released JPMML-Python 1.2.7, which added support for unpickling all sorts of Numpy datetime64 data type arrays; your issue appears to be closely related. But JPMML-Python probably needs some more tweaking in this area. Anyway, such pickling errors should be posted to the JPMML-Python issue tracker (preferably accompanied with a Pickle file of some sort). They will be ignored here as off-topic.
I'm trying to take advantage of the datetime functionality presented here: https://openscoring.io/blog/2020/03/08/sklearn_date_datetime_pmml/, which works great for datetime fields that are always populated.
For each sample in my data I have the datetime the sample was created, then a historic datetime for an event related to this sample that may or may not have happened. I would like to calculate a feature that is the difference between these timestamps if both are present, but null if the historic event hasn't happened.
I'm currently using this mapper config:
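A minimal sketch of such a mapper config, following the pattern from the blog post above; the column names and the reference year are placeholders:

```python
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import DateTimeDomain
from sklearn2pmml.preprocessing import SecondsSinceYearTransformer, ExpressionTransformer

# Placeholder column names: "sample_created" is always populated,
# "historic_event" may contain NaT values.
mapper = DataFrameMapper([
    (["sample_created", "historic_event"],
     [DateTimeDomain(), SecondsSinceYearTransformer(year = 2020), ExpressionTransformer("X[0] - X[1]")])
])
```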
When I attempt to fit_transform, I get an error because the SecondsSinceYearTransformer is receiving some NaT values, and the DurationTransformer class attempts to cast whatever value it gets to int, which fails:
IntCastingNaNError: ['historic_event']: Cannot convert non-finite values (NA or inf) to integer
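For reference, the failure mode can be reproduced directly in pandas, independently of SkLearn2PMML (a standalone illustration, not the library code):

```python
import pandas

s = pandas.Series([pandas.Timestamp("2023-01-01"), pandas.NaT])
delta = s - pandas.Timestamp("2020-01-01")

# Casting a float column that contains NaN to int raises
# IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
seconds = (delta / pandas.Timedelta(seconds = 1)).astype(int)
```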
Is there a functional reason why the SecondsSinceYearTransformer doesn't have missing/invalid treatment options like other transformers? Ideally I'd be able to tell it to just pass through missing values and return a null that LGBM is capable of handling, although I assume I'd then have to update my duration_transformer() to understand what to do with null values.