
test #99

Closed
wants to merge 1 commit into from

Conversation

yang-wang-ck

No description provided.

@yang-wang-ck
Author

oops meant for forked repo

@yang-wang-ck
Author

Hi Villu! I was going to ask you about this but just trying out a hack for now.

We're using XGBoost's scikit-learn API on Google's Cloud ML Engine (with a single machine) with a ton of data, and we're running into memory problems (ML Engine currently doesn't offer a very high-memory machine). This was one way to mitigate that, and it seems to work in my test.
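For context, the memory saving is easy to quantify: a float16 column takes exactly half the bytes of a float32 one. A minimal sketch (the array shape and column count here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical feature matrix: 100k rows x 10 columns.
rng = np.random.default_rng(0)
data = rng.random((100_000, 10))

df32 = pd.DataFrame(data, dtype=np.float32)
df16 = pd.DataFrame(data, dtype=np.float16)

mb32 = df32.memory_usage(index=False).sum() / 1e6
mb16 = df16.memory_usage(index=False).sum() / 1e6
print(f"float32: {mb32:.1f} MB, float16: {mb16:.1f} MB")  # 4.0 MB vs 2.0 MB
```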

In general, at CK (I think you visited recently :)) we have a framework for unifying TensorFlow and scikit-learn, and we want to see if we can use float16 in all cases without sacrificing accuracy.

I don't think it's urgent or serious; it's more of a cost-minimization step.

@yang-wang-ck yang-wang-ck deleted the 1.5.9_test branch March 22, 2019 21:28
@vruusmann
Member

vruusmann commented Mar 22, 2019

By switching from binary32 to binary16 you should be able to keep twice as many matrix elements in memory.

Have you figured out whether the XGBoost algorithm also operates on binary16 values, or does it silently promote them to binary32 (aka float) values? In other words, if you train an XGBoost model using binary16, does it make predictions in the binary16 value space as well? I'm not too familiar with this data type, but I suppose there are only about three-four significant decimal digits in use (as opposed to six-seven in the case of binary32)?
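For reference, NumPy reports the precision of both formats directly. A quick sketch (these numbers are properties of the IEEE 754 formats themselves, not of XGBoost):

```python
import numpy as np

# binary16: 1 sign + 5 exponent + 10 significand bits -> ~3 decimal digits
# binary32: 1 sign + 8 exponent + 23 significand bits -> ~6-7 decimal digits
print(np.finfo(np.float16).precision)  # 3
print(np.finfo(np.float32).precision)  # 6
print(np.finfo(np.float16).max)        # 65504.0 -- largest finite binary16

# Rounding is visible immediately:
print(float(np.float16(0.1)))          # 0.0999755859375
```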

If XGBoost treats the binary16 data type "natively", then I should teach the JPMML-Evaluator library about it too. The good news is that it wouldn't be too much work - I'd simply have to create a subclass of org.jpmml.evaluator.ValueFactory that produces custom org.jpmml.evaluator.HalfValue objects.

@yang-wang-ck
Author

yang-wang-ck commented Mar 24, 2019

So I'm looking at the XGBoost library code, and it seems like internally everything is indeed converted to float32 except for label values. The resulting boosting model is also in float32 for future predictions (this is just me skimming the code). And this does make sense: I was falling into switch(size){ case 2: ... } only when the PMML converter was deciding what to put for <OutputField name="probability(0.0)"> and <OutputField name="probability(1.0)">. I was initially confused about why the tree cutoff values didn't fall in here as well.

It seems like when storing label data, XGBoost just takes whatever it's been passed and doesn't touch it again afterwards: https://github.com/dmlc/xgboost/blob/74009afcacc8ac567b5f00d6f82736189490cb47/python-package/xgboost/core.py#L422

So the solution for me is simply changing our code from X, y = pd.DataFrame(feature_val_dict), pd.Series(np.concatenate(labels)) to X, y = pd.DataFrame(feature_val_dict), pd.Series(np.concatenate(labels), dtype=np.float32). This way our feature data remain float16, while the label data are cast to float32. This seems to work without any library changes.
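A self-contained sketch of that one-line change (the feature dict and per-batch label arrays here are placeholders standing in for the real pipeline's data):

```python
import numpy as np
import pandas as pd

# Stand-ins for the real feature dict and per-batch label arrays.
feature_val_dict = {
    "f0": np.array([0.1, 0.2, 0.3], dtype=np.float16),
    "f1": np.array([1.0, 2.0, 3.0], dtype=np.float16),
}
labels = [np.array([0.0], dtype=np.float16),
          np.array([1.0, 0.0], dtype=np.float16)]

# Features stay half-precision; only the labels are widened to float32.
X = pd.DataFrame(feature_val_dict)
y = pd.Series(np.concatenate(labels), dtype=np.float32)

print(X.dtypes.unique())  # all float16
print(y.dtype)            # float32
```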

I wonder if I should mention this to the XGBoost team... Anyway, thanks for the insightful comment!

@vruusmann
Member

But isn't it the case that when you convert your pandas.DataFrame to xgboost.DMatrix, then XGBoost is performing binary16 -> binary32 conversion automatically:
https://github.com/dmlc/xgboost/blob/74009afcacc8ac567b5f00d6f82736189490cb47/python-package/xgboost/core.py#L255

That is, XGBoost is unable to ingest binary16 values directly, so the application's peak memory consumption may actually be higher than when working with binary32 (aka float) values right from the beginning. Compare:

  • half-precision pandas.DataFrame, which gets promoted to a full-precision xgboost.DMatrix. At some point in time, you have two separate NumPy arrays sitting in memory.
  • full-precision pandas.DataFrame, which gets wrapped into a full-precision xgboost.DMatrix. There will be only one NumPy array, which gets passed from one wrapper object (DataFrame) to another (DMatrix). There's no memory increase (related to allocating another NumPy array) happening.
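The two scenarios can be simulated with plain NumPy (an illustration of the copy-versus-wrap behavior only, not of XGBoost's actual internals):

```python
import numpy as np

# Scenario 1: half-precision input must be promoted,
# so a second array is allocated and both are alive at once.
half = np.ones(1_000, dtype=np.float16)
full = half.astype(np.float32)           # new buffer
print(np.shares_memory(half, full))      # False

# Scenario 2: input is already full-precision, so no copy is needed.
already = np.ones(1_000, dtype=np.float32)
wrapped = np.asarray(already, dtype=np.float32)  # same buffer passes through
print(np.shares_memory(already, wrapped))        # True
```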

The label data type depends on your modeling problem. When dealing with regression-type problems, it needs to be a floating-point data type (in most cases binary32). But when dealing with classification-type problems, it can be an integer data type, including 8- and 16-bit integer types (Byte and Short in the Java/JVM world).
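As an illustration of that label-sizing point (with made-up labels): a one-byte integer dtype holds classification targets in a quarter of the space float32 would take.

```python
import numpy as np

# Classification labels fit in one byte each; float32 needs four.
labels_i8 = np.array([0, 1, 1, 0, 2], dtype=np.int8)
labels_f32 = labels_i8.astype(np.float32)

print(labels_i8.nbytes)   # 5
print(labels_f32.nbytes)  # 20
```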

Anyway, I think that the JPMML-SkLearn library should be able to recognize the half-precision floating-point data type. I just opened an issue about it. Please don't delete this PR, because it's a good reference + discussion for future generations.
