This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-261]Update MKLDNN & Add CPP Test #10365

Closed
wants to merge 4 commits into from

xinyu-intel
Contributor

@xinyu-intel xinyu-intel commented Apr 2, 2018

Description

This PR aims to fix the bugs in #8712 by updating MKLDNN to the newest version. CPP tests are added to monitor data format changes in MKL-DNN (MXNET-98).

@pengzhao-intel

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
      • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
      • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
      • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
      • For user-facing API changes, API doc string has been updated.
      • For new C++ functions in header files, their functionalities and arguments are documented.
      • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
      • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Update MKLDNN
  • Add cpp tests

@xinyu-intel xinyu-intel requested a review from cjolivier01 as a code owner April 2, 2018 07:38
@zheng-da
Contributor

zheng-da commented Apr 2, 2018

Should we update MKLDNN to the latest commit on its master branch, or should we always pin it to a certain version tag? What rule should we follow?
@szha @cjolivier01 @piiswrong

@xinyu-intel xinyu-intel changed the title Update MKLDNN & Add CPP Test [MXNET-261]Update MKLDNN & Add CPP Test Apr 2, 2018
@marcoabreu
Contributor

marcoabreu commented Apr 2, 2018

I'd vote for using a stable version instead of the latest master.

@xinyu-intel @pengzhao-intel what's the stability of the master branch?

@zheng-da
Contributor

zheng-da commented Apr 2, 2018

Are you sure this is an MKL problem? It fails in all configurations.


TEST(MKLDNN_UTIL_FUNC, MemFormat) {
// Check whether the number of format is correct.
CHECK_EQ(mkldnn_format_last, 56);
Member

Is mkldnn_format_last an enum or a constant? If so, you can use static_assert somewhere in the code and it doesn't have to be a unit test.

Member

Yes, it's an enum. This request is from @marcoabreu. Please refer to #9918 (comment) and jira#98 for the background. Thanks.

@pengzhao-intel
Contributor

@zheng-da will double check. The tests in python2/3 MKLDNN-CPU passed.

@xinyu-intel
Contributor Author

xinyu-intel commented Apr 3, 2018

I think the failure of this unit test may be related to this old version of mklml.
https://github.com/apache/incubator-mxnet/blob/5245ef68191a6d47594bf331ec6e20ba6e93ad4c/ci/docker/install/ubuntu_mklml.sh#L24

@zheng-da
Contributor

zheng-da commented Apr 3, 2018

@xinyu-intel I guess you mean mklml?

@xinyu-intel
Contributor Author

@zheng-da yes, I've made a mistake. It's mklml not mkldnn:)

@xinyu-intel
Contributor Author

I have tried the following four tests with seed(1):

First two passed:
exe1 = y1.simple_bind(mx.cpu(), x=shape)
exe2 = y2.simple_bind(mx.cpu(), x=shape, w=(num_filter, shape[1]//num_group)+kernel, b=(num_filter,))

exe1 = y1.simple_bind(mx.cpu(), x=shape)
exe2 = y2.simple_bind(mx.gpu(), x=shape, w=(num_filter, shape[1]//num_group)+kernel, b=(num_filter,))

Others failed:

exe1 = y1.simple_bind(mx.gpu(), x=shape)
exe2 = y2.simple_bind(mx.cpu(), x=shape, w=(num_filter, shape[1]//num_group)+kernel, b=(num_filter,))

(mismatch 94.4444444444%)
x: array([[[[ 11.774242, 10.667873, 37.325356],
[ -45.697014, 59.5456 , -50.37157 ],
[ -39.387352, -65.5543 , -1.68909 ]]],...
y: array([[[[ 11.774241, 59.5456 , -1.689087],
[ 41.73616 , 95.499115, 16.014626],
[ 12.258306, 22.502499, 45.119247]]],...

exe1 = y1.simple_bind(mx.gpu(), x=shape)
exe2 = y2.simple_bind(mx.gpu(), x=shape, w=(num_filter, shape[1]//num_group)+kernel, b=(num_filter,))

(mismatch 94.4444444444%)
x: array([[[[ 11.77424 , 10.667873, 37.32536 ],
[ -45.697014, 59.5456 , -50.37157 ],
[ -39.387352, -65.5543 , -1.68909 ]]],...
y: array([[[[ 11.774241, 59.545593, -1.689092],
[ 41.736153, 95.49912 , 16.014624],
[ 12.258305, 22.5025 , 45.119247]]],...

It seems that this test cannot pass when using GPU to compute exe1.
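
For readers unfamiliar with the test, the weight shape `(num_filter, shape[1]//num_group)+kernel` above is the layout of a grouped convolution. The sketch below is a pure-Python, 1-D illustration (not MXNet code). It assumes that `y1` is a grouped convolution with one group per channel and that `y2` is the same computation built from per-channel convolutions whose outputs are concatenated; that assumption is not stated in this thread. The function names `grouped_conv1d` and `per_channel_conv1d` and all the data are hypothetical.

```python
def grouped_conv1d(x, w, b, num_group):
    """Stride-1, unpadded grouped cross-correlation.

    x: [channels][length]
    w: [num_filter][channels // num_group][kernel]
    b: [num_filter]
    """
    channels, num_filter = len(x), len(w)
    in_per_group = channels // num_group
    out_per_group = num_filter // num_group
    kernel = len(w[0][0])
    out_len = len(x[0]) - kernel + 1
    out = []
    for f in range(num_filter):
        g = f // out_per_group  # group that filter f belongs to
        row = []
        for i in range(out_len):
            acc = b[f]
            for ci in range(in_per_group):
                xc = x[g * in_per_group + ci]
                for j in range(kernel):
                    acc += xc[i + j] * w[f][ci][j]
            row.append(acc)
        out.append(row)
    return out


def per_channel_conv1d(x, w, b):
    """Slice each channel, convolve it with its own filter, concatenate."""
    out = []
    for c, xc in enumerate(x):
        out.extend(grouped_conv1d([xc], [w[c]], [b[c]], num_group=1))
    return out


# Made-up data: 2 channels, 2 filters, kernel size 3.
x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
w = [[[1.0, 0.0, -1.0]], [[2.0, 1.0, 0.5]]]
b = [0.1, -0.2]

depthwise = grouped_conv1d(x, w, b, num_group=2)  # one group per channel
reference = per_channel_conv1d(x, w, b)
print(depthwise == reference)
```

If the two formulations are meant to be equivalent, as assumed here, a large mismatch between `exe1` and `exe2` would point at one of the device implementations rather than at the test data.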

@pengzhao-intel
Contributor

@marcoabreu @cjolivier01 @zheng-da I think the conclusion (based on @xinyu-intel's analysis) is that the latest MKL-DNN fixed the problem, but the GPU results of exe1 are not correct. Could anyone look into the GPU side?

@marcoabreu Regarding MKL-DNN, there are release versions we can use. But MXNet development moves very fast, so newer features (or bug fixes) are needed. Thus, I think it's OK to select the master branch (based on a commit id). Each CI in MKL-DNN is fully verified and tested.
https://github.com/intel/mkl-dnn/releases

BTW, as we have seen, enabling all test cases would be a great practice to improve quality.

@marcoabreu
Contributor

> @marcoabreu Regarding MKL-DNN, there are release versions we can use. But MXNet development moves very fast, so newer features (or bug fixes) are needed. Thus, I think it's OK to select the master branch (based on a commit id). Each CI in MKL-DNN is fully verified and tested.
> https://github.com/intel/mkl-dnn/releases

I'm indifferent about this one, at least for CI. My only concern is that when we make an MXNet release, we can't expect our users to use a (potentially) unstable master commit; people usually prefer a stable release. This means that users could run into problems because we're validating against a version of MKLDNN which is not even out yet. We have to consider this and find a solution, e.g. Intel making more frequent releases of the library, back-porting these fixes, or something else along those lines. In the end, we don't want to make a release of MXNet that requires a dependency which is not even out yet.

> BTW, as we have seen, enabling all test cases would be a great practice to improve quality.

Definitely! I appreciate efforts in that direction a lot!

Thanks a lot everybody for all your efforts!

@zheng-da
Contributor

zheng-da commented Apr 4, 2018

@xinyu-intel @pengzhao-intel could you describe the root cause of this problem?
It's a little weird that both MKL-DNN and CuDNN have the same bug. Is the bug in the convolution operator? Does the native implementation of MXNet have the bug? Thanks.

@xinyu-intel
Contributor Author

@nihui Please help take a look at this PR. The GPU unit test 'depth_wise_conv', which was skipped in #10098, can't pass now. Thank you:)

@pengzhao-intel
Contributor

ping @nihui

@nihui
Contributor

nihui commented Apr 16, 2018

@xinyu-intel hello

I just played with the latest code, f3c01d5,

built with CUDA 8.0.61 and without MKL.

I uncommented the skip line and removed all test cases except the depth_wise_conv one,
and the test passed whether I bound the device to mx.cpu() or mx.gpu().

[nihuini@TENCENT64 ~/incubator-mxnet]$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 CUDA_VISIBLE_DEVICES=4,5,6,7 nosetests --verbose --nocapture tests/python/unittest
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1054985103 to reproduce.
test_operator.test_depthwise_convolution ... ok

----------------------------------------------------------------------
Ran 1 test in 4.753s

OK

@xinyu-intel
Contributor Author

@nihui Thanks. A bit confused. I just tested on Tesla P100 and got the same error as before.

@xinyu-intel
Contributor Author

xinyu-intel commented Apr 16, 2018

@nihui Can you please help double check, based on the following change in incubator-mxnet/tests/python/unittest/test_operator.py:

                             dev = default_context()
-                            exe1 = y1.simple_bind(dev, x=shape)
-                            exe2 = y2.simple_bind(mx.cpu(), x=shape, w=(num_filter, shape[1]//num_group)+kernel,
+                            exe1 = y1.simple_bind(mx.gpu(), x=shape)
+                            exe2 = y2.simple_bind(mx.gpu(), x=shape, w=(num_filter, shape[1]//num_group)+kernel,
                                     b=(num_filter,))

And I got the following error:

[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=896461014 to reproduce.
test_operator.test_depthwise_convolution ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1571766810 to reproduce.
FAIL

======================================================================
FAIL: test_operator.test_depthwise_convolution
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/nfs/pdx/home/zhaopen1/incubator-mxnet/tests/python/unittest/common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "/nfs/pdx/home/zhaopen1/incubator-mxnet/tests/python/unittest/test_operator.py", line 1303, in test_depthwise_convolution
    np.testing.assert_allclose(arr1.asnumpy(), arr2.asnumpy(), rtol=1e-3, atol=1e-3)
  File "/nfs/pdx/home/zhaopen1/.local/lib/python2.7/site-packages/numpy/testing/utils.py", line 1395, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/nfs/pdx/home/zhaopen1/.local/lib/python2.7/site-packages/numpy/testing/utils.py", line 778, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=0.001

(mismatch 94.4444444444%)
 x: array([[[[ -31.054831, -108.180809,  -26.939766],
         [  10.257776,   15.99695 ,   96.046448],
         [   4.541703,   -8.48899 ,   44.320747]]],...
 y: array([[[[ -31.054831,   15.996953,   44.32074 ],
         [  28.470549,   10.288336,   -7.459843],
         [  56.939667,   36.969101,    2.033797]]],...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=896461014 to reproduce.
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1571766810 to reproduce.
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 5.755s

FAILED (failures=1) 

Thank you!
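
To make the failing assertion in the traceback above easier to read, here is a minimal pure-Python sketch of the element-wise criterion that `np.testing.assert_allclose` documents, `|actual - desired| <= atol + rtol * |desired|`, applied only to the nine values visible in the log. The rest of the array is elided in the output ("..."), so this does not reproduce the 94.44% figure. The helper names `within_tolerance` and `mismatch_percent` are hypothetical.

```python
def within_tolerance(actual, desired, rtol=1e-3, atol=1e-3):
    """Element-wise allclose-style check: |actual - desired| <= atol + rtol * |desired|."""
    return abs(actual - desired) <= atol + rtol * abs(desired)


def mismatch_percent(xs, ys, rtol=1e-3, atol=1e-3):
    """Percentage of element pairs that fall outside the tolerance."""
    bad = sum(not within_tolerance(a, d, rtol, atol) for a, d in zip(xs, ys))
    return 100.0 * bad / len(xs)


# Only the values printed in the log; the remaining elements are not shown there.
x_visible = [-31.054831, -108.180809, -26.939766,
             10.257776, 15.99695, 96.046448,
             4.541703, -8.48899, 44.320747]
y_visible = [-31.054831, 15.996953, 44.32074,
             28.470549, 10.288336, -7.459843,
             56.939667, 36.969101, 2.033797]

print(mismatch_percent(x_visible, y_visible))
```

On this visible slice only the first element is within tolerance, so the differences are far larger than floating-point rounding between devices would explain.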

@nihui
Contributor

nihui commented Apr 17, 2018

@xinyu-intel issue reproduced on another machine .. investigating ...

@nihui
Contributor

nihui commented Apr 17, 2018

#10578
new pull request raised for fixing

@xinyu-intel
Contributor Author

Thanks, I will retrigger the unit tests of this PR as soon as #10578 has been merged.

@xinyu-intel
Contributor Author

All commits have been merged with #10578, so I am closing this PR.

7 participants