CategoricalDomain decorator doesn't work with nullable pandas Int64 column #420

ghost · 2024-05-14T07:59:03Z

I'm working with a pandas DataFrame that uses the pandas Int64 data type for integer columns (since missing values may be present, represented as pd.NA).

I have reduced the data set for testing purposes to just 2 columns:

cat_col      Int64
num_col    float64

The code is as follows:

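from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline
from xgboost import XGBClassifier

# Decorate each column with a domain; cat_col is cast to the pandas "category" dtype
df_mapper = DataFrameMapper([
        (["num_col"], [ContinuousDomain()]),
        (["cat_col"], [CategoricalDomain(dtype="category")]),
    ], input_df = True, df_out = True)

xgb = XGBClassifier(enable_categorical=True)

pipeline = PMMLPipeline([
        ("mapper", df_mapper),
        ("classifier", xgb)
    ])

# X: the two-column DataFrame described above; y: the target labels
pipeline.fit(X, y)
sklearn2pmml(pipeline, "test.pmml")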

which results in the following error:

Standard output is empty
Standard error:
Exception in thread "main" org.jpmml.python.AttributeException: Attribute 'pandas.core.indexes.base.data.data' has an unsupported value (Python class pandas.core.arrays.integer.IntegerArray)
	at org.jpmml.python.CastFunction.apply(CastFunction.java:48)
	at org.jpmml.python.PythonObject.get(PythonObject.java:180)
	at pandas.core.Index$NDArrayData.getData(Index.java:162)
	at pandas.core.Index$NDArrayData.getValues(Index.java:156)
	at pandas.core.Index.getValues(Index.java:76)
	at pandas.core.Index.getArrayContent(Index.java:52)
	at org.jpmml.python.PythonObject.getArray(PythonObject.java:324)
	at org.jpmml.python.PythonObject.getObjectArray(PythonObject.java:364)
	at sklearn2pmml.decoration.DiscreteDomain.getDataValues(DiscreteDomain.java:150)
	at sklearn2pmml.decoration.DiscreteDomain.getDataType(DiscreteDomain.java:66)
	at sklearn.Transformer.updateFeatures(Transformer.java:101)
	at sklearn.Transformer.encode(Transformer.java:75)
	at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:67)
	at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:48)
	at sklearn.Initializer.encode(Initializer.java:59)
	at sklearn.Composite.encodeFeatures(Composite.java:112)
	at sklearn.Composite.initFeatures(Composite.java:255)
	at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:113)
	at com.sklearn2pmml.Main.run(Main.java:80)
	at com.sklearn2pmml.Main.main(Main.java:65)
Caused by: java.lang.ClassCastException: Cannot cast pandas.core.MaskedArray to numpy.core.NDArray
	at java.base/java.lang.Class.cast(Class.java:4067)
	at org.jpmml.python.CastFunction.apply(CastFunction.java:45)

If I change the data type of cat_col to the standard numpy int64, it works without any error. However, I cannot change the source that produces the DataFrame; it always uses the pandas Int64 data type because the data can contain missing values.
Also, if I use the ContinuousDomain decorator for cat_col, the error disappears (but then the column is no longer treated as categorical).

vruusmann · 2024-05-14T19:52:41Z

My internal TODO list has quite a few open items about pandas.Int64Dtype data type support.

The current issue can be corrected by initializing the CategoricalDomain.data_values_ attribute using the data_values constructor parameter:

cat_domain = CategoricalDomain(data_values = [[1, 2, 3]])
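When the category levels are not known in advance, the list passed as data_values could be derived from the column itself. The following is only a sketch: the DataFrame contents and the distinct_values helper are hypothetical, made up for illustration, and are not part of sklearn2pmml.

```python
import pandas

# Hypothetical data standing in for the reporter's DataFrame
df = pandas.DataFrame({
    "cat_col": pandas.array([3, 1, pandas.NA, 2, 1], dtype="Int64"),
    "num_col": [0.5, 1.5, 2.5, 3.5, 4.5],
})

def distinct_values(column):
    """Return the sorted distinct non-missing values as plain Python ints."""
    # Drop pd.NA first so that only real integers are converted
    return sorted(int(v) for v in column.dropna().unique())

cat_values = distinct_values(df["cat_col"])
print(cat_values)  # [1, 2, 3]
```

The resulting list would then be wrapped in an outer list, matching the nested form used above: CategoricalDomain(data_values = [cat_values]).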

Alternatively, a problematic attribute may be manually simplified to a dense/unmasked numpy array:

cat_domain.data_values_ = numpy.asarray(cat_domain.data_values_.tolist())

Perhaps such simplification should be applied automatically by the CategoricalDomain.fit(X, y) method.
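The effect of the tolist() round trip can be checked in isolation, without fitting a pipeline. This is a minimal sketch that assumes a recent pandas version; the masked variable is a hypothetical stand-in for a fitted data_values_ attribute backed by a nullable integer array.

```python
import numpy
import pandas

# Hypothetical stand-in for a data_values_ attribute backed by a masked array
masked = pandas.array([1, 2, 3], dtype="Int64")

# Going through a plain Python list drops the mask and yields a regular ndarray
dense = numpy.asarray(masked.tolist())

print(type(masked).__name__)  # IntegerArray
print(type(dense).__name__)   # ndarray
```

This presumes that the values contain no pd.NA. If a missing value were still present, the converted array would likely fall back to the object dtype rather than an integer dtype.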

ghost · 2024-05-15T08:27:08Z

Thanks for the quick response.
The suggested solution resolves the issue.

vruusmann · 2024-05-15T15:23:44Z

Reopening, because the conversion should succeed without any manual intervention.

ghost closed this as completed May 15, 2024

vruusmann reopened this May 15, 2024
