Support for customizing missing/invalid value handling across all customer Transformer classes (similar to what's already available in ExpressionTransformer) #438

Open
philip-bingham opened this issue Dec 20, 2024 · 6 comments

Comments

@philip-bingham

philip-bingham commented Dec 20, 2024

I'm trying to take advantage of the datetime functionality presented here https://openscoring.io/blog/2020/03/08/sklearn_date_datetime_pmml/ which works great for datetime fields that are always populated.

For each sample in my data I have the datetime the sample was created, then a historic datetime for an event related to this sample that may or may not have happened. I would like to calculate a feature that is the difference between these timestamps if both are present, but null if the historic event hasn't happened.

I'm currently using this mapper config:

def duration_transformer():
    return ExpressionTransformer("(X[0] - X[1])/(60*60*24)", dtype=float)

memory = Memory()

mapper = DataFrameMapper(
    [
        # Memorize the sample datetime, then derive the hour of day
        (["datetime"],
         [DateTimeDomain(),
          make_memorizer_union(memory, names=["memorized_datetime"]),
          SecondsSinceMidnightTransformer(),
          Alias(make_hour_of_day_transformer(), "HourOfDay", prefit=False)],
         {"alias": "hour_of_day"}),
        # Recall the sample datetime next to the historic event, then take the difference in days
        (["historic_event"],
         [DateTimeDomain(),
          make_recaller_union(memory, names=["memorized_datetime"]),
          SecondsSinceYearTransformer(year=1900),
          Alias(duration_transformer(), "days_since_historic_event", prefit=False)],
         {"alias": "days_since_historic_event"}),
    ],
    input_df=False, df_out=True,
)

When I attempt to fit_transform, I get an error because the SecondsSinceYearTransformer is receiving some NaT values, and the DurationTransformer class attempts to cast whatever value it gets to int, which fails:

IntCastingNaNError: ['historic_event']: Cannot convert non-finite values (NA or inf) to integer
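
For reference, the failure can be reproduced outside of the pipeline with plain pandas. This is only a minimal sketch of the same arithmetic (subtract an epoch, convert to seconds, cast to int), not the package's actual code; the sample timestamps and variable names are made up for illustration.

```python
import math

import pandas as pd

# Hypothetical sample data: one populated timestamp, one missing (NaT)
epoch = pd.Timestamp("1900-01-01")
events = pd.Series(pd.to_datetime(["2024-12-20 10:00:00", None]))

# Subtracting the epoch keeps the missing value; total_seconds() yields NaN for it
seconds = (events - epoch).dt.total_seconds()

# Casting to int fails as soon as a NaN is present.
# pandas raises IntCastingNaNError, which is a subclass of ValueError.
int_cast_failed = False
try:
    seconds.astype(int)
except ValueError:
    int_cast_failed = True

# Casting to float keeps the missing value as NaN
as_float = seconds.astype(float)
missing_propagates = math.isnan(as_float.iloc[1])
```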

Is there a functional reason why the SecondsSinceYearTransformer doesn't have missing/invalid treatment options like other transformers? Ideally I'd be able to tell it to just pass through missing values and return a null that LGBM is capable of handling, although I assume I'd then have to update my duration_transformer() to handle null values.

@vruusmann
Member

I would like to calculate a feature that is the difference between these timestamps if both are present, but null if the historic event hasn't happened.
ExpressionTransformer("(X[0] - X[1])/(60*60*24)", dtype=float)

You can make this requirement transparent by using an in-line if-else expression:

transformer = ExpressionTransformer("(X[0] - X[1])/(60*60*24) if pandas.notnull(X[1]) else None", dtype=float)

This will eliminate all doubt about the first part of the computation.

Is there a functional reason why the SecondsSinceYearTransformer doesn't have missing/invalid treatment options like other transformers?

Explicit missing/invalid value treatment support is currently available only at the first step of a Scikit-Learn pipeline. The SkLearn2PMML package calls these special transformers "decorators", and they are located in the sklearn2pmml.decoration package (as opposed to "ordinary" transformers, which are located in the sklearn2pmml.preprocessing package).

Now, in principle, it is possible to make some "ordinary" transformers also support invalid/missing value treatment, if the underlying PMML element (that gets generated) has <Expression>@mapMissingTo and/or <Expression>@defaultValue attributes.

For example, the ExpressionTransformer class generates Apply elements, which do provide such attributes:
https://github.com/jpmml/sklearn2pmml/blob/0.112.0/sklearn2pmml/preprocessing/__init__.py#L230

So, to begin answering your question - can you perhaps move the non-valid value treatment commands to the ExpressionTransformer step?

All DurationTransformer subclasses appear to be generating Apply elements as well, which means that it's possible to introduce DurationTransformer@mapMissingTo, DurationTransformer@defaultValue, etc. attributes if necessary.

But I wouldn't want to upgrade only the DurationTransformer class in isolation. This functional enhancement should be applied to all SkLearn2PMML custom transformer classes at once. That seems like quite a lot of work, so I cannot give any estimate of when it might happen.

vruusmann changed the title from "NaT/Missing value handling for datetime preprocessing functions" to "Support for customizing missing/invalid value handling across all customer Transformer classes (similar to what's already available in ExpressionTransformer)" on Dec 21, 2024
@vruusmann
Member

Seems closely related to #436

@vruusmann
Member

can you perhaps move the non-valid value treatment commands to the ExpressionTransformer step?

The business logic of DurationTransformer subclasses could be extracted into a utility function, which would be callable from within Python expressions:

transformer = ExpressionTransformer("sklearn2pmml.preprocessing.seconds_since_year(X[0]) if pandas.notnull(X[0]) else None")

@vruusmann
Member

All DurationTransformer subclasses appear to be generating Apply elements as well, which means that it's possible to introduce DurationTransformer@mapMissingTo, DurationTransformer@defaultValue, etc. attributes if necessary.

@philip-bingham You can take the PMML document generated by SkLearn2PMML, and post-process using your own Python helper tool, which adds those attributes as appropriate.
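
Such a helper could look roughly like the sketch below, which uses only the standard library. The sample PMML fragment, the function name set_map_missing_to and the replacement value -1 are all assumptions made for illustration; a real document exported by SkLearn2PMML may nest its Apply elements differently, so the lookup would need adjusting.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"
# Serialize PMML elements without a namespace prefix
ET.register_namespace("", PMML_NS)

# Hypothetical fragment standing in for a document exported by SkLearn2PMML
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <TransformationDictionary>
    <DerivedField name="days_since_historic_event" optype="continuous" dataType="double">
      <Apply function="/">
        <FieldRef field="seconds_difference"/>
        <Constant dataType="integer">86400</Constant>
      </Apply>
    </DerivedField>
  </TransformationDictionary>
</PMML>"""


def set_map_missing_to(root, field_name, value):
    """Set Apply@mapMissingTo on the top-level Apply of the named DerivedField.

    Returns the number of Apply elements that were modified.
    """
    touched = 0
    for derived_field in root.iter("{%s}DerivedField" % PMML_NS):
        if derived_field.get("name") != field_name:
            continue
        apply = derived_field.find("{%s}Apply" % PMML_NS)
        if apply is not None:
            apply.set("mapMissingTo", str(value))
            touched += 1
    return touched


root = ET.fromstring(SAMPLE_PMML)
touched = set_map_missing_to(root, "days_since_historic_event", -1)
patched = ET.tostring(root, encoding="unicode")
```

The same loop could set defaultValue or invalidValueTreatment instead, since the PMML Apply element accepts those attributes as well.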

@philip-bingham
Author

philip-bingham commented Dec 24, 2024

Thanks for looking into this, @vruusmann. From the above comments, it doesn't seem that there's a way to achieve this without changes to the package?

This approach:
transformer = ExpressionTransformer("(X[0] - X[1])/(60*60*24) if pandas.notnull(X[1]) else None", dtype=float)

Doesn't work, because X[0] and X[1] are the results of the SecondsSinceYearTransformer, which is where the error is thrown so we don't even reach this transformer.

This looks promising:

transformer = ExpressionTransformer("sklearn2pmml.preprocessing.seconds_since_year(X[0]) if pandas.notnull(X[0]) else None")

but would require some new functions, right? Also, the expression evaluator has a predefined list of modules whose functions it can use:
def to_expr_func(expr, modules = ["math", "re", "pcre", "pcre2", "numpy", "pandas", "scipy"]):

so would sklearn2pmml need to be added to this list for this to work? I will play around with this in my local branch.

I also tried modifying the transformer in my local branch to convert to float instead of int so that nulls are allowed and propagate:

def _float(X):
	if numpy.isscalar(X):
		return float(X)
	else:
		return cast(X, float)

# Modified DurationTransformer.transform method
def transform(self, X):
	def to_float_duration(X):
		duration = self._to_duration(pandas.to_timedelta(X - self.epoch))
		return _float(duration)

	return dt_transform(X, to_float_duration)

This allows me to get a PMML file out, however when I try to evaluate on the same dataframe with jpmml_evaluator, I get an error about using the pandas datetime dtype:
JavaError: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pandas._libs.tslibs.timestamps._unpickle_timestamp)

The fitted DateTimeDomain() is expecting this dtype
[screenshot omitted]

So I guess I need to go back and cast to a supported dtype before fitting, but then I think I'm going to run into issues with this part of the operation:
self._to_duration(pandas.to_timedelta(X - self.epoch))

because I think non-pandas datetime dtypes don't support subtraction with nulls.

@vruusmann
Member

This approach doesn't work, because X[0] and X[1] are the results of the SecondsSinceYearTransformer, which is where the error is thrown so we don't even reach this transformer.

You could work around this situation by introducing more "branching" into your pipeline. The idea is to generate the "field is NA/is not NA" flag in the first step(s), and then export it for later use via SkLearn2PMML's cross-references mechanism.

The branching can be implemented using the sklearn2pmml.preprocessing.SelectFirstTransformer transformer:
https://github.com/jpmml/sklearn2pmml/blob/0.112.1/sklearn2pmml/preprocessing/__init__.py#L716

Literally, in one branch you handle the "all flags are OK" situation, and in the other(s) all other situations.
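
The flag-then-branch idea can be illustrated with plain pandas. Note that this sketch does not use SelectFirstTransformer or the cross-references mechanism; it only shows the logic of computing an "is not NA" flag first and running the int-casting computation on the rows where the flag holds. The column names follow the mapper above, and the sample values are made up.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical sample data: the second row has no historic event
df = pd.DataFrame({
    "datetime": pd.to_datetime(["2024-12-20", "2024-12-21"]),
    "historic_event": pd.to_datetime(["2024-12-10", None]),
})

# Step 1: compute the "field is not NA" flag up front
has_event = df["historic_event"].notna()

# Step 2: start from the "flag is not OK" branch, where the feature stays missing
days = pd.Series(np.nan, index=df.index, dtype=float)

# Step 3: in the "flag is OK" branch the int cast is safe, because no NaT remains
ok_rows = df[has_event]
seconds = (ok_rows["datetime"] - ok_rows["historic_event"]).dt.total_seconds().astype(int)
days.loc[has_event] = seconds / (60 * 60 * 24)
```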

This looks promising, but would require some new functions right?

It would be necessary to extract the contents of the SecondsSinceTransformer.transform(X) method into a reusable utility function.

This function could then be called from its current position, or from the ExpressionTransformer transformer.
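
A rough sketch of what such a utility function might look like is given below. The name seconds_since_year is taken from the expression example earlier in this thread; it is hypothetical and does not exist in the package at the time of writing, and the scalar-only signature and the year default are assumptions.

```python
import pandas as pd


def seconds_since_year(value, year=1900):
    """Return whole seconds elapsed since January 1st of `year`, or None if missing.

    Hypothetical helper; works on a single scalar value.
    """
    if pd.isnull(value):
        return None
    epoch = pd.Timestamp(year=year, month=1, day=1)
    return int((pd.Timestamp(value) - epoch).total_seconds())


one_day = seconds_since_year("1900-01-02")
missing = seconds_since_year(pd.NaT)
```

Because the missing-value check happens before any cast to int, the function can be called directly from an if-else expression, or without one.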

So would sklearn2pmml need to be added to this list for this to work?

Yes, all expressions are evaluated in a vanilla (ie. newly created and empty) environment. Whatever you want to use in this environment, you have to import first.
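
The effect of the module list can be illustrated with a toy evaluator. This is not SkLearn2PMML's actual implementation, only a sketch of the general idea: the expression sees nothing except the modules that were explicitly imported into its environment. The helper names make_environment and evaluate are made up.

```python
import importlib


def make_environment(modules):
    """Build a fresh namespace that contains only the named modules."""
    return {name: importlib.import_module(name) for name in modules}


def evaluate(expr, X, modules):
    """Evaluate a Python expression against X in a newly created environment."""
    env = make_environment(modules)
    return eval(expr, env, {"X": X})


expr = "X[0] / (60*60*24) if pandas.notnull(X[0]) else None"

# Works, because pandas is part of the environment
two_days = evaluate(expr, [172800.0], ["math", "pandas"])
missing = evaluate(expr, [float("nan")], ["math", "pandas"])

# Fails, because pandas was never imported into this environment
try:
    evaluate(expr, [172800.0], ["math"])
    name_error_raised = False
except NameError:
    name_error_raised = True
```

By the same logic, an expression that refers to sklearn2pmml.preprocessing would need that module to be importable from the environment.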

I get an error about using the pandas datetime dtype

Support for low-level Numpy/Pandas classes is handled by the JPMML-Python library. The JPMML-SkLearn library deals with Scikit-Learn level classes.

I just released JPMML-Python 1.2.7, which added support for unpickling all sorts of Numpy datetime64 data type arrays:
jpmml/jpmml-python@3481e2d

Your issue appears to be closely related. But JPMML-Python probably needs some more tweaking in the area, because it deals with timestamp data type (as opposed to datetime64 data types). But looks easily doable.

Anyway, such pickling errors should be posted to the JPMML-Python issue tracker (preferably accompanied by a pickle file of some sort). They will be ignored here as off-topic.
