ExpressionTransformer should try to rectify feature type information #397
Comments
It's pretty unusual to start the pipeline with an ExpressionTransformer. The simplest way to make a clarification is to give a feature specification using one of the SkLearn2PMML decorators (e.g. CategoricalDomain). You already have
<DataDictionary>
<DataField name="colors" optype="categorical" dataType="string">
<Value value="BLACK"/>
<Value value="blue"/>
<Value value="green"/>
<Value value="red"/>
<Value value="yellow "/>
</DataField>
</DataDictionary>
Well, if you don't like the valid value space information, then simply disable it using the with_data = False setting:
color_transformers = [
    # THIS!
    CategoricalDomain(dtype = str, with_data = False),
    ExpressionTransformer("X[0].lower()"),
    MatchesTransformer("green"),
]
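For readers who want to see what the last two steps of that list compute, here is a plain-Python sketch that does not use SkLearn2PMML at all. It assumes that the expression "X[0].lower()" lowercases the first input column and that MatchesTransformer("green") behaves like a regular-expression search. The helper name matches_green is made up for illustration.

```python
import re

# Raw values as listed in the <DataDictionary> above (note the mixed case
# and the trailing space in "yellow ").
colors = ["BLACK", "blue", "green", "red", "yellow "]

def matches_green(value):
    """Hypothetical stand-in for the expression step followed by the matches step."""
    lowered = value.lower()  # what "X[0].lower()" yields for a single row
    # Assumption: the matches step uses search semantics, not a full match.
    return re.search("green", lowered) is not None

flags = [matches_green(color) for color in colors]
print(flags)
```

Because the values are lowercased first, an input such as "GREEN" would also be flagged, which is presumably why the expression step comes before the match.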
The ExpressionTransformer should check if there are wildcard features in the expression. If there are, then it should make an effort to rectify their types. For example, if there are string methods being called on a wildcard feature, then it's reasonable to assume that the type of this feature should be string. It is likely that such type rectification should happen during the expression parsing phase, which means that the code change should land in the JPMML-Python library instead.
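To make the idea concrete, here is a rough sketch of parse-time type inference using Python's standard ast module. It is not how JPMML-Python does (or will do) it. The function name infer_string_features, the STRING_METHODS set, and the convention that features appear as X[i] with a literal integer index are all assumptions made for illustration.

```python
import ast

# Assumption: calling any of these methods is treated as evidence of a string feature.
STRING_METHODS = {"lower", "upper", "strip", "startswith", "endswith"}

def infer_string_features(expression):
    """Return sorted indices i such that a string method is called directly on X[i]."""
    tree = ast.parse(expression, mode="eval")
    indices = set()
    for node in ast.walk(tree):
        # Only method calls such as <something>.lower() are of interest.
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        if node.func.attr not in STRING_METHODS:
            continue
        target = node.func.value
        # The call receiver must be a subscript of the name X, e.g. X[0].
        if (isinstance(target, ast.Subscript)
                and isinstance(target.value, ast.Name)
                and target.value.id == "X"):
            index = target.slice
            # Python 3.9+ stores a literal index directly as ast.Constant.
            if isinstance(index, ast.Constant) and isinstance(index.value, int):
                indices.add(index.value)
    return sorted(indices)

print(infer_string_features("X[0].lower()"))  # prints [0]
```

A converter could use such a result to declare the corresponding input field as a string instead of leaving it untyped. Chained calls like X[1].strip().upper() are still attributed to X[1], since the innermost call is made on the subscript.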
This was exactly what I was looking for. Thank you!
So you are saying that it is still OK to start the pipeline with an ExpressionTransformer?
The The easiest way to convert a wildcard feature (has
Alternatively, the IMO, it's better to have the conversion fail, rather than to have it produce an invalid/incomplete PMML document.
Hello Villu,
It's been a while and I hope you're fine. I've come back with more questions.
Let's start with some code:
The following pipeline doesn't make much sense from a machine learning point of view, but it shows the issue very well:
In Python, everything works as expected. Now the issue is within the generated output.pmml file, where you can find the following:
Knowing that the input has an infinite number of possible values, how can I set this data type to "string"?