This repository has been archived by the owner on May 18, 2022. It is now read-only.

Support for missing attribute #19

Open
liumy601 opened this issue Dec 25, 2021 · 7 comments

Comments

@liumy601

Hi vruusmann,

Sorry to disturb you again. I've been struggling with this inconsistency problem for several months. After checking the xgboost4j docs, I saw that versions after 0.9 include some fixes for the missing value problem, so I upgraded xgboost4j-spark to 1.2.0 with Spark 3. But I still get inconsistent predictions.

image

As you can see, I only have one categorical feature, hour, which doesn't contain missing values. But if I remove the categorical feature and use only numeric features, the predictions are consistent.

Do you have any clues?

@vruusmann vruusmann transferred this issue from jpmml/jpmml-sparkml Dec 25, 2021
@vruusmann
Member

I only have one categorical feature, hour, which doesn't contain missing values

What is your definition of a "missing value"? A Java null reference, Double.NaN (or Float.NaN value), or something else?

The JPMML-XGBoost library has been very thoroughly tested with continuous/categorical/missing/invalid data for 6+ years, without a single major issue. So, again, I must assume that the problem resides somewhere in your application code.

Please prepare & share a minimal reproducible example - a CSV data file plus an Apache Spark script (Scala or PySpark), which I can run and explore locally.
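
To make the question about the definition of a "missing value" concrete, here is a minimal Python sketch of the three candidates mentioned above: a null reference (`None` in Python), a NaN, and an application-chosen sentinel. The `is_missing` helper and its `sentinel` parameter are hypothetical names used only for illustration; they are not part of any JPMML or XGBoost API.

```python
import math

def is_missing(value, sentinel=None):
    """Return True if `value` should count as missing.

    `sentinel` is a hypothetical, application-chosen marker (for example 0).
    """
    if value is None:
        # Python analogue of a Java null reference.
        return True
    if isinstance(value, float) and math.isnan(value):
        # Analogue of Double.NaN / Float.NaN; NaN never equals itself,
        # so it needs an explicit isnan() check.
        return True
    # Only treat the sentinel as missing when one was explicitly configured.
    return sentinel is not None and value == sentinel

for hour in (None, float("nan"), 0, 13):
    print(repr(hour), is_missing(hour), is_missing(hour, sentinel=0))
```

The point of the sketch is that a value such as 0 is only "missing" if both sides agree on the sentinel; without that agreement it is an ordinary number.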

@vruusmann
Member

This project contains an integration test that uses sparse categorical data:
https://github.com/jpmml/jpmml-sparkml-xgboost/blob/master/src/test/resources/XGBoostAudit.scala

This test is 100% reproducible.

@liumy601 liumy601 reopened this Dec 26, 2021
@liumy601
Author

This project contains an integration test that uses sparse categorical data: https://github.com/jpmml/jpmml-sparkml-xgboost/blob/master/src/test/resources/XGBoostAudit.scala

This test is 100% reproducible.

I've tried SparseToDenseTransformer before, and I can see that it fixes the inconsistency caused by sparse vectors.
But my dataset is big and has over 28000 feature dimensions, so the XGBoost model can't be trained that way because it runs into memory problems.
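
A rough illustration of why sparse and dense inputs can disagree, written as a plain-Python sketch rather than actual Spark or XGBoost code. It assumes that a sparse vector stores only non-zero entries and that one reader treats every absent slot as missing (NaN) while another treats it as a numeric zero. The function names and the sample row are made up for the example.

```python
import math

def to_sparse(row):
    """Keep only non-zero entries, the way a sparse vector does: {index: value}."""
    return {i: v for i, v in enumerate(row) if v != 0}

def read_with_missing_zero(sparse, size):
    """Expand a sparse row, reporting every absent (zero) slot as NaN (missing)."""
    return [sparse.get(i, math.nan) for i in range(size)]

def read_dense(sparse, size):
    """Expand a sparse row, reporting every absent slot as a numeric 0.0."""
    return [sparse.get(i, 0.0) for i in range(size)]

# Made-up row: imagine two one-hot slots followed by two numeric features.
row = [0.0, 1.0, 0.0, 3.5]
sparse = to_sparse(row)

print(sparse)                             # {1: 1.0, 3: 3.5}
print(read_with_missing_zero(sparse, 4))  # [nan, 1.0, nan, 3.5]
print(read_dense(sparse, 4))              # [0.0, 1.0, 0.0, 3.5]
```

If training sees the first interpretation and scoring sees the second, the same row can follow different tree branches, which would show up as inconsistent predictions.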

@liumy601
Author

I only have one categorical feature, hour, which doesn't contain missing values

What is your definition of a "missing value"? A Java null reference, Double.NaN (or Float.NaN value), or something else?

The JPMML-XGBoost library has been very thoroughly tested with continuous/categorical/missing/invalid data for 6+ years, without a single major issue. So, again, I must assume that the problem resides somewhere in your application code.

Please prepare & share a minimal reproducible example - a CSV data file plus an Apache Spark script (Scala or PySpark), which I can run and explore locally.

I set the missing value to 0. In xgboost4j-spark 1.2.0, if I set missing to any other value, it gives an XGBoost training failed error.

@vruusmann
Member

I set the missing value to 0

The DataField element for the "hour" column does not convey any information about the fact that in your case, the 0 value should be regarded as a missing value (and not as a numeric zero value).

How can the PMML engine make correct predictions if it is missing this critical piece of information?

Take the PMML document, and insert the following DataField/Value child element manually:

<DataField name="hour" optype="categorical" dataType="integer">
  <!-- THIS -->
  <Value property="missing" value="0"/>
</DataField>
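
For anyone who would rather script this manual edit, here is a minimal sketch using only Python's standard `xml.etree.ElementTree`. The `mark_missing` helper, the `SAMPLE` document, and its namespace URI are assumptions made for illustration; this is not a JPMML tool, and a real PMML file may need extra care (for example, preserving comments).

```python
import xml.etree.ElementTree as ET

def mark_missing(pmml_text, field_name, missing_value):
    """Return PMML text where DataField `field_name` declares `missing_value` as missing."""
    root = ET.fromstring(pmml_text)
    # Tags look like "{namespace}PMML"; reuse whatever namespace the document declares.
    ns = root.tag[1:root.tag.index("}")] if root.tag.startswith("{") else ""
    prefix = "{" + ns + "}" if ns else ""
    if ns:
        # Keep the default namespace unprefixed when serializing.
        ET.register_namespace("", ns)
    for field in root.iter(prefix + "DataField"):
        if field.get("name") != field_name:
            continue
        already = any(
            v.get("property") == "missing" and v.get("value") == missing_value
            for v in field.findall(prefix + "Value")
        )
        if not already:
            # Value elements come last inside DataField, so append rather than prepend.
            field.append(ET.Element(prefix + "Value",
                                    {"property": "missing", "value": missing_value}))
    return ET.tostring(root, encoding="unicode")

# Hypothetical, heavily trimmed document used only to exercise the helper.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="hour" optype="categorical" dataType="integer"/>
    <DataField name="age" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>"""

print(mark_missing(SAMPLE, "hour", "0"))
```

The helper is idempotent: running it twice on the same document does not add a second Value element.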

@vruusmann
Member

It would be nice to automate the generation of extra DataField/Value@property="missing" etc. elements.

Here are some related feature requests: jpmml/jpmml-sparkml#14 and jpmml/jpmml-sparkml#25

Newer XGBoost versions also store this information in model dumps. Here's a related Scikit-Learn issue: jpmml/jpmml-sklearn#166

@vruusmann vruusmann changed the title from "inconsistent predict problem between jpmml and xgboost4j-spark" to "Support for missing attribute" Dec 26, 2021
@vruusmann vruusmann reopened this Dec 26, 2021
@liumy601
Author

Hi vruusmann,

Unfortunately, after adding the extra DataField/Value@property="missing" elements, the inconsistency still exists. I'm frustrated.
I've tried both xgboost4j-spark 0.82 and 1.2.0, and both give inconsistent predictions.
I'm out of ideas now.
