Detect feature promotion from (high cardinality-) categorical to pseudo-numeric #4

Closed
Paulo19920228 opened this issue Feb 20, 2019 · 4 comments

Comments

@Paulo19920228

Hi Villu,

Thanks for your assistance with the previous issue that I raised, it's greatly appreciated!

I have however stumbled across a new issue and was wondering whether you could perhaps take a look at it? I'm getting the following error when trying to convert a Tweedie GBM to PMML:

[image: screenshot of the error message]

It seems that one of the inputs to the model (MAKE) is causing the problem. However, the same input was used for the Poisson model that I raised with you previously, and there were no issues with it once you made allowance for Poisson models in your code.

Your assistance with this would be greatly appreciated.

Thanks for your time.

Regards,
Paulo

@vruusmann
Member

This is a failing "sanity check" - it appears that one of the GBM trees contains a split instruction, where a "continuous-type" split is attempted on a "categorical-type" feature. For example, a split instruction ${string_feature} < 2.0.
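To make the failing check concrete, here is a minimal Python sketch of this kind of sanity check. It is not the converter's actual code: the `Split` class, the `check_splits` function, and the "AGE" feature are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Split:
    feature: str
    operator: str  # "<" for a continuous-type split, "in" for a categorical-type split
    value: object


def check_splits(splits, feature_types):
    """Return the splits that apply a continuous-type test to a categorical feature."""
    offending = []
    for split in splits:
        if feature_types[split.feature] == "categorical" and split.operator == "<":
            offending.append(split)
    return offending


# Hypothetical feature typing: MAKE is declared categorical, AGE continuous.
feature_types = {"MAKE": "categorical", "AGE": "continuous"}

splits = [
    Split("AGE", "<", 35.0),          # fine: continuous split on a continuous feature
    Split("MAKE", "in", {"A", "B"}),  # fine: set-membership split on a categorical feature
    Split("MAKE", "<", 2.0),          # suspicious: continuous split on a categorical feature
]

bad = check_splits(splits, feature_types)
for split in bad:
    print(f"Continuous split on categorical feature: {split.feature} {split.operator} {split.value}")
```

Only the third split is reported, which mirrors the `${string_feature} < 2.0` case described above.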

I have no way of verifying whether this sanity check is doing its job correctly, because I don't know how the "MAKE" feature is defined (is it a text string, a numeric string or a number?), or what instructions about this feature you have given to H2O.ai. AFAIK, there is a way to explicitly state which columns are continuous and which are categorical.

How are you interacting with H2O.ai in the first place? Directly, using its Scikit-Learn or R wrappers, or in some other way? My integration tests are developed using the Scikit-Learn wrapper, and I didn't encounter any feature typing issues there.

@Paulo19920228
Author

Hi Villu,

Thanks for the speedy response and detailed feedback. After doing some investigating, it appears as though the problem is being caused by the cardinality of the 'MAKE' field being too high (e.g. MAKE has 451 distinct levels). Looking at the POJO file that H2O produces, it seems that H2O converts high-cardinality fields to numeric, which your tool doesn't appear to allow for.

I have since reduced the cardinality of this field and the conversion was successful.
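The thread does not say how the cardinality was reduced. One common approach, shown here purely as an assumption, is to keep the most frequent levels and lump the rest into a single catch-all level. The sketch below uses made-up data with 451 levels; `lump_rare_levels`, the `"OTHER"` label, and the limit of 50 are hypothetical choices, not a threshold known to apply to H2O.

```python
from collections import Counter


def count_levels(values):
    """Number of distinct levels in a categorical column."""
    return len(set(values))


def lump_rare_levels(values, max_levels, other_label="OTHER"):
    """Keep the (max_levels - 1) most frequent levels and map the rest to other_label."""
    counts = Counter(values)
    keep = {level for level, _ in counts.most_common(max_levels - 1)}
    return [v if v in keep else other_label for v in values]


# Made-up column: 451 distinct makes, where MAKE_0 is the most frequent
# and MAKE_450 occurs exactly once.
makes = [f"MAKE_{i}" for i in range(451) for _ in range(451 - i)]

reduced = lump_rare_levels(makes, max_levels=50)
print(count_levels(makes), "->", count_levels(reduced))  # 451 -> 50
```

The same idea carries over to R, for example by recoding infrequent factor levels before training.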

In response to your question, I'm interacting with H2O via R.

Thanks for your time.

Regards,
Paulo

@vruusmann
Member

> the problem is being caused by the cardinality of the 'MAKE' field being too high (e.g. MAKE has 451 distinct levels).

Interesting fact. Will have to generate a synthetic dataset that would trigger this automatic "categorical-to-pseudo numeric" conversion.
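A synthetic dataset of this kind could be generated along the following lines. This is only a sketch: the column names `EXPOSURE` and `CLAIM`, the row count, and the response distribution are assumptions, and whether 451 levels is enough to trigger the conversion in a given H2O setup would still need to be verified.

```python
import csv
import io
import random


def make_synthetic_rows(n_rows, n_levels, seed=42):
    """Build rows with one high-cardinality categorical column (requires n_rows >= n_levels)."""
    rng = random.Random(seed)
    levels = [f"make_{i:03d}" for i in range(n_levels)]
    rows = []
    for i in range(n_rows):
        # Use every level once first, then sample at random, so no level is missing.
        make = levels[i] if i < n_levels else rng.choice(levels)
        exposure = round(rng.uniform(0.1, 1.0), 3)
        # Mostly zero with occasional positive values, loosely Tweedie-like.
        claim = round(rng.expovariate(1.0), 3) if rng.random() < 0.1 else 0.0
        rows.append({"MAKE": make, "EXPOSURE": exposure, "CLAIM": claim})
    return rows


rows = make_synthetic_rows(n_rows=5000, n_levels=451)

# Write to an in-memory buffer; swap in open("synthetic.csv", "w", newline="") to save a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["MAKE", "EXPOSURE", "CLAIM"])
writer.writeheader()
writer.writerows(rows)

print(len({row["MAKE"] for row in rows}))  # 451
```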

> In response to your question, I'm interacting with H2O via R.

There's a native H2O integration available in the SkLearn2PMML package. One day there will be an integration available for the R2PMML package as well.

@vruusmann changed the title from "Throws error while converting Tweedie GBM to PMML" to "Detect feature promotion from (high cardinality-) categorical to pseudo-numeric" on Feb 21, 2019

@Paulo19920228
Author

Thanks for the feedback! I come from an R background, which is why I've been using R to interact with H2O. In the future, I'll explore Python.

It would be really great if support for H2O models were added to the R2PMML package.
