
test #99

Closed
wants to merge 1 commit into from

Conversation

yang-wang-ck

No description provided.

@yang-wang-ck
Author

oops meant for forked repo

@yang-wang-ck
Author

Hi Villu! I was going to ask you about this but just trying out a hack for now.

We're using XGBoost's scikit-learn API on Google's Cloud ML Engine (with a single machine) with a ton of data, and we're running into memory problems (ML Engine currently doesn't offer a very high-memory machine). This was one way to mitigate that, and it seems to work in my test.
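For context, the memory saving is easy to quantify: a float16 column takes exactly half the bytes of a float32 one. A minimal sketch (the array shape and column count here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical feature matrix: 100k rows x 10 columns.
rng = np.random.default_rng(0)
data = rng.random((100_000, 10))

df32 = pd.DataFrame(data, dtype=np.float32)
df16 = pd.DataFrame(data, dtype=np.float16)

mb32 = df32.memory_usage(index=False).sum() / 1e6
mb16 = df16.memory_usage(index=False).sum() / 1e6
print(f"float32: {mb32:.1f} MB, float16: {mb16:.1f} MB")  # 4.0 MB vs 2.0 MB
```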

In general, at CK (I think you visited recently :)) we have a framework for unifying TensorFlow and scikit-learn, and we want to see if we can use float16 in all cases without sacrificing accuracy.

I don't think it's urgent or serious; it's more of a cost-minimization step.

@yang-wang-ck yang-wang-ck deleted the 1.5.9_test branch March 22, 2019 21:28
@vruusmann
Member

vruusmann commented Mar 22, 2019

By switching from binary32 to binary16 you should be able to keep twice as many matrix elements in memory.

Have you figured out whether the XGBoost algorithm also operates on binary16 values, or does it silently promote them to binary32 (aka float) values? In other words, if you train an XGBoost model using binary16, does it make predictions in the binary16 value space as well? I'm not too familiar with this data type, but I suppose there are only about three-four significant decimal digits in use (as opposed to six-seven in the case of binary32)?
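For reference, NumPy reports the precision of both formats directly. A quick sketch (these numbers are properties of the IEEE 754 formats themselves, not of XGBoost):

```python
import numpy as np

# binary16: 1 sign + 5 exponent + 10 significand bits -> ~3 decimal digits
# binary32: 1 sign + 8 exponent + 23 significand bits -> ~6-7 decimal digits
print(np.finfo(np.float16).precision)  # 3
print(np.finfo(np.float32).precision)  # 6
print(np.finfo(np.float16).max)        # 65504.0 -- largest finite binary16

# Rounding is visible immediately:
print(float(np.float16(0.1)))          # 0.0999755859375
```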

If XGBoost treats the binary16 data type "natively", then I should teach the JPMML-Evaluator library about it too. The good news is that it wouldn't be too much work - I'd simply have to create a subclass of org.jpmml.evaluator.ValueFactory that produces custom org.jpmml.evaluator.HalfValue objects.

@yang-wang-ck
Author

yang-wang-ck commented Mar 24, 2019

So I'm looking at the XGBoost library code, and it seems like internally everything is indeed converted to float32 except for label values. The resulting boosting model is also in float32 for future predictions (this is just me skimming the code). And this does make sense: I was falling into switch(size){ case 2: ... } only when the PMML converter was deciding what to put for <OutputField name="probability(0.0)"> and <OutputField name="probability(1.0)">. I was initially confused about why the tree cutoff values didn't fall in here as well.

It seems like when storing label data, XGBoost just takes whatever it's been passed and doesn't touch it again afterwards: https://github.com/dmlc/xgboost/blob/74009afcacc8ac567b5f00d6f82736189490cb47/python-package/xgboost/core.py#L422

So the solution for me is simply changing our code from X, y = pd.DataFrame(feature_val_dict), pd.Series(np.concatenate(labels)) to X, y = pd.DataFrame(feature_val_dict), pd.Series(np.concatenate(labels), dtype=np.float32). This way our feature data remain float16, while the label data are cast to float32. This seems to work without any library changes.
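A self-contained sketch of that one-line change (the feature dict and per-batch label arrays here are placeholders standing in for the real pipeline's data):

```python
import numpy as np
import pandas as pd

# Stand-ins for the real feature dict and per-batch label arrays.
feature_val_dict = {
    "f0": np.array([0.1, 0.2, 0.3], dtype=np.float16),
    "f1": np.array([1.0, 2.0, 3.0], dtype=np.float16),
}
labels = [np.array([0.0], dtype=np.float16),
          np.array([1.0, 0.0], dtype=np.float16)]

# Features stay half-precision; only the labels are widened to float32.
X = pd.DataFrame(feature_val_dict)
y = pd.Series(np.concatenate(labels), dtype=np.float32)

print(X.dtypes.unique())  # all float16
print(y.dtype)            # float32
```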

I wonder if I should mention this to the XGBoost team... Anyway, thanks for the insightful comment!

@vruusmann
Member

But isn't it the case that when you convert your pandas.DataFrame to xgboost.DMatrix, then XGBoost is performing binary16 -> binary32 conversion automatically:
https://github.com/dmlc/xgboost/blob/74009afcacc8ac567b5f00d6f82736189490cb47/python-package/xgboost/core.py#L255

That is, XGBoost is unable to ingest binary16 values directly, so the application's peak memory consumption may actually be higher than when working with binary32 (aka float) values right from the beginning. Compare:

  • half-precision pandas.DataFrame, which gets promoted to a full-precision xgboost.DMatrix. At some point in time, you have two separate NumPy arrays sitting in memory.
  • full-precision pandas.DataFrame, which gets wrapped into a full-precision xgboost.DMatrix. There will be only one NumPy array, which gets passed from one wrapper object (DataFrame) to another (DMatrix). There's no memory increase (related to allocating another NumPy array) happening.
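The two scenarios can be simulated with plain NumPy (an illustration of the copy-versus-wrap behavior only, not of XGBoost's actual internals):

```python
import numpy as np

# Scenario 1: half-precision input must be promoted,
# so a second array is allocated and both are alive at once.
half = np.ones(1_000, dtype=np.float16)
full = half.astype(np.float32)           # new buffer
print(np.shares_memory(half, full))      # False

# Scenario 2: input is already full-precision, so no copy is needed.
already = np.ones(1_000, dtype=np.float32)
wrapped = np.asarray(already, dtype=np.float32)  # same buffer passes through
print(np.shares_memory(already, wrapped))        # True
```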

The label data type depends on your modeling problem. When dealing with regression-type problems, it needs to be a floating-point data type (in most cases binary32). But when dealing with classification-type problems, it can be an integer data type, including 8- and 16-bit integer types (Byte and Short in the Java/JVM world).
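As an illustration of that label-sizing point (with made-up labels): a one-byte integer dtype holds classification targets in a quarter of the space float32 would take.

```python
import numpy as np

# Classification labels fit in one byte each; float32 needs four.
labels_i8 = np.array([0, 1, 1, 0, 2], dtype=np.int8)
labels_f32 = labels_i8.astype(np.float32)

print(labels_i8.nbytes)   # 5
print(labels_f32.nbytes)  # 20
```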

Anyway, I think that the JPMML-SkLearn library should be able to recognize the half-precision floating-point data type. I just opened an issue about it. Please don't delete this PR, because it's a good reference + discussion for future generations.
