`ExpressionTransformer` should try to rectify feature type information #397

woodly0 · 2023-10-06T13:35:21Z

Hello Villu,

it's been a while and I hope you're fine. I've come back with more questions.
Let's start with some code:

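import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import CategoricalDomain
from sklearn2pmml.preprocessing import ExpressionTransformer, MatchesTransformer

# create some data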
X = pd.DataFrame(
    {
        "numbers": [1, 2, 3, 40, 5],
        "colors": ["yellow ", "blue", "BLACK", "green", "red"],
    }
)

# create a simple mapper
mapper = DataFrameMapper(
    [
        (
            ["colors"],
            [
                # CategoricalDomain(dtype=str),
                ExpressionTransformer("X[0].lower()"),
                MatchesTransformer("green"),
            ],
            {"alias": "color_green"},
        )
    ],
    df_out=True,
    default=False,
)

The following pipeline doesn't make much sense from a machine learning point of view, but it shows the issue very well:

from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

pmml_pipe = PMMLPipeline(
    [
        ("mapper", mapper)
    ]
)
# fit and transform
pmml_pipe.fit_transform(X)

# export as PMML
sklearn2pmml(pmml_pipe, "output.pmml", with_repr=True)

In Python, everything works as expected. Now the issue is within the generated output.pmml file, where you can find the following:

<DataDictionary>
	<DataField name="colors" optype="continuous" dataType="double"/>
</DataDictionary>

Knowing that the input has an infinite number of possible values, how can I set this data type to "string"?


vruusmann · 2023-10-06T18:58:15Z

It's pretty unusual to start the pipeline with an ExpressionTransformer object. There should be some "clarifications" in front of it.

The simplest way to make a clarification is to give feature specification using one of SkLearn2PMML decorators (eg. sklearn2pmml.decoration.CategoricalDomain, OrdinalDomain or ContinuousDomain).

You already have CategoricalDomain in place, but have it commented out. You probably didn't like that it captured the valid value space of your X dataset:

<DataDictionary>
	<DataField name="colors" optype="categorical" dataType="string">
		<Value value="BLACK"/>
		<Value value="blue"/>
		<Value value="green"/>
		<Value value="red"/>
		<Value value="yellow "/>
	</DataField>
</DataDictionary>

Well, if you don't like the valid value space information, then simply disable it using the with_data = False flag:

color_transformers = [
	# THIS!
	CategoricalDomain(dtype = str, with_data = False),
	ExpressionTransformer("X[0].lower()"),
	MatchesTransformer("green"),
]
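To confirm that the exported document now declares the field as a string, the `DataDictionary` can be inspected with the standard library. This is only a sketch: the helper name `data_field_types` and the inline sample document are illustrative assumptions, not part of SkLearn2PMML.

```python
import xml.etree.ElementTree as ET


def data_field_types(pmml_text):
    """Map each DataField name to its (optype, dataType) pair."""
    root = ET.fromstring(pmml_text)
    fields = {}
    for elem in root.iter():
        # Drop the "{namespace}" prefix so the check does not depend
        # on the PMML schema version declared in the document.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "DataField":
            fields[elem.get("name")] = (elem.get("optype"), elem.get("dataType"))
    return fields


# Inline sample for illustration; for the exported file, pass
# open("output.pmml", "rb").read() instead.
sample = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="colors" optype="categorical" dataType="string"/>
  </DataDictionary>
</PMML>"""

print(data_field_types(sample))  # {'colors': ('categorical', 'string')}
```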

vruusmann · 2023-10-06T19:07:23Z

> It's pretty unusual to start the pipeline with an ExpressionTransformer object. There should be some "clarifications" in front of it.

The ExpressionTransformer can triangulate its position in the pipeline by observing if there are any wildcard features (ie. org.jpmml.converter.WildcardFeature objects) among the arguments.

If there are, then it should make an effort to rectify their types. For example, if string methods are being called on a wildcard feature, then it's reasonable to assume that the type of this feature should be categorical+string (instead of continuous+double).

It is likely that such type rectification should happen during expression parsing phase, which means that the code change should land in the JPMML-Python library instead.
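The heuristic described above can be sketched in plain Python with the `ast` module. This is an illustration of the idea only, not the JPMML-Python implementation; the function name `infer_string_columns` and the `STRING_METHODS` set are assumptions made for the example.

```python
import ast

# Illustrative subset of str methods; not an exhaustive or official list.
STRING_METHODS = {"lower", "upper", "strip", "startswith", "endswith"}


def infer_string_columns(expression):
    """Return the indices i for which the expression calls a string method on X[i]."""
    string_columns = set()
    for node in ast.walk(ast.parse(expression, mode="eval")):
        # Look for the shape X[<int>].<method>(...)
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        target = node.func.value
        if (
            node.func.attr in STRING_METHODS
            and isinstance(target, ast.Subscript)
            and isinstance(target.value, ast.Name)
            and target.value.id == "X"
        ):
            # Assumes Python 3.9+, where the subscript index is stored directly.
            index = target.slice
            if isinstance(index, ast.Constant) and isinstance(index.value, int):
                string_columns.add(index.value)
    return string_columns


print(infer_string_columns("X[0].lower()"))  # {0}
```

A column flagged this way would be a candidate for categorical+string typing; columns used only in arithmetic (eg. `X[0] + X[1]`) are left alone.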

woodly0 · 2023-10-10T08:09:22Z

> Well, if you don't like the valid value space information, then simply disable it using the with_data = False flag

This was exactly what I was looking for. Thank you!

> The ExpressionTransformer can triangulate its position in the pipeline by observing if there are any wildcard features (ie. org.jpmml.converter.WildcardFeature objects) among the arguments.

So you are saying that it is still OK to start the pipeline with an ExpressionTransformer or should it be generally avoided?

vruusmann · 2023-10-10T19:48:31Z

> So you are saying that it is still OK to start the pipeline with an ExpressionTransformer or should it be generally avoided?

The ExpressionTransformer works best with non-wildcard features.

The easiest way to convert a wildcard feature (has continuous+double type) to a non-wildcard feature is to use SkLearn2PMML decorators.

vruusmann · 2023-10-10T19:51:55Z

Alternatively, the ExpressionTransformer should simply raise a value error when there are wildcard features among the arguments.

IMO, it's better to have the conversion fail, rather than to have it produce an invalid/incomplete PMML document.
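A fail-fast check of this kind might look roughly like the following. The real conversion happens on the Java side, so this Python sketch only illustrates the intended behaviour; the `Feature` class and `require_typed_features` function are hypothetical stand-ins, not JPMML or SkLearn2PMML APIs.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """Hypothetical stand-in for a converter-side feature; not the JPMML class."""

    name: str
    is_wildcard: bool


def require_typed_features(features):
    """Raise ValueError if any argument feature still lacks type information."""
    untyped = [feature.name for feature in features if feature.is_wildcard]
    if untyped:
        raise ValueError(
            "ExpressionTransformer received untyped input column(s) "
            + ", ".join(untyped)
            + "; declare them with a domain decorator first"
        )


# A decorated (non-wildcard) feature passes silently.
require_typed_features([Feature("colors", is_wildcard=False)])
```

Failing here would surface the problem at conversion time, instead of leaving a continuous+double declaration in the DataDictionary.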

vruusmann changed the title ~~Declare input dataType="string"~~ ExpressionTransformer should try to rectify feature type information Oct 6, 2023
