This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Slow CPU inference in Gluon GRU module #13634

Closed
marekjg opened this issue Dec 13, 2018 · 11 comments

@marekjg

marekjg commented Dec 13, 2018

Description

Gluon.GRU is slow on the CPU compared to ndarray.RNN in GRU mode for the same input.

Environment info

Deep Learning AMI 19, Tesla V100

----------Python Info----------
Version      : 3.7.1
Compiler     : GCC 7.3.0
Build        : ('default', 'Oct 23 2018 19:19:42')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 18.1
Directory    : /home/ec2-user/anaconda3/envs/gmarek_mx13/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.5.0
Directory    : /home/ec2-user/anaconda3/envs/gmarek_mx13/lib/python3.7/site-packages/mxnet
Commit Hash   : b45e1273ece8eba1a011107ce12032af58efe661
----------System Info----------
Platform     : Linux-4.14.77-70.59.amzn1.x86_64-x86_64-with-glibc2.10
system       : Linux
node         : ip-172-31-44-214
release      : 4.14.77-70.59.amzn1.x86_64
version      : #1 SMP Mon Nov 12 22:02:45 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2701.073
BogoMIPS:              4600.18
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-7
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0018 sec, LOAD: 0.7860 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0006 sec, LOAD: 0.5938 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0006 sec, LOAD: 0.0175 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0004 sec, LOAD: 1.0119 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0114 sec, LOAD: 0.4352 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0004 sec, LOAD: 0.0866 sec.

Minimum reproducible example

from time import time

import mxnet as mx
from mxnet import nd
from mxnet import gluon
from mxnet.gluon import rnn

inp_dim = 1024
hid_dim = 1024
n_layers = 1
n_parameters = (inp_dim * hid_dim + hid_dim + hid_dim * hid_dim + hid_dim) * 3
n_steps = 100

for ctx in [mx.cpu(), mx.gpu()]:
    gru_params = nd.random.uniform(low=-1, high=1, shape=(n_parameters,), ctx=ctx)
    gru_ndarray = lambda x, h_0: nd.RNN(x, gru_params, h_0, num_layers=n_layers,
                                        state_size=hid_dim, mode='gru', state_outputs=True)
    gru_gluon = rnn.GRU(hid_dim, n_layers, input_size=inp_dim)
    gru_gluon.collect_params().initialize(ctx=ctx)
    gru_gluon.hybridize()

    x = nd.random_normal(0, 1, (1, 1, inp_dim), ctx=ctx)
    h_0 = x

    # Warm-up call (just in case) so one-time setup cost stays out of the timings
    _, _ = gru_gluon(x, h_0)
    nd.waitall()

    for method, gru in [('ndarray', gru_ndarray), ('gluon', gru_gluon)]:
        h = h_0
        start = time()
        for step in range(n_steps):
            _, h = gru(x, h)
            if method == 'gluon':
                h = h[0]
        nd.waitall()
        dt = time() - start
        print(ctx, method, dt)
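
The `n_parameters` expression in the script covers a single unidirectional layer. As a hedged sketch of how that count generalizes, the helper below (`gru_param_count` is a hypothetical name, not an MXNet function) assumes three gates, separate input and hidden biases, and that deeper layers take the previous layer's hidden state as input:

```python
def gru_param_count(input_size, hidden_size, num_layers=1):
    """Length of the flat parameter vector for a unidirectional GRU.

    Assumption: 3 gates, each with an input weight matrix, a hidden weight
    matrix, an input bias and a hidden bias, as in the script's formula.
    """
    gates = 3
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        weights = gates * hidden_size * (layer_input + hidden_size)
        biases = gates * hidden_size * 2  # input bias + hidden bias
        total += weights + biases
        layer_input = hidden_size  # deeper layers consume the previous layer's output
    return total


print(gru_param_count(1024, 1024, 1))  # 6297600, same as n_parameters in the script
```

For `num_layers=1` this reduces to the script's `(inp_dim * hid_dim + hid_dim + hid_dim * hid_dim + hid_dim) * 3`.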

Steps to reproduce

Run the above script with python
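
Note that the script warms up only the Gluon path before timing. A minimal sketch of a timing helper that warms up whichever method it is given and synchronizes before reading the clock is shown here. `time_steps` and its parameters are hypothetical names, and the stand-in workload exists only so the sketch runs without MXNet:

```python
from time import perf_counter


def time_steps(step_fn, state, n_steps=100, warmup=3, sync=lambda: None):
    """Time n_steps sequential calls of step_fn(state) -> new_state."""
    for _ in range(warmup):
        state = step_fn(state)
    sync()  # drain pending asynchronous work before starting the clock
    start = perf_counter()
    for _ in range(n_steps):
        state = step_fn(state)
    sync()  # make sure all queued work is counted in the measurement
    return perf_counter() - start, state


# Stand-in workload: increment a counter instead of running a GRU step.
elapsed, final_state = time_steps(lambda s: s + 1, 0, n_steps=100, warmup=3)
print(f"{elapsed:.6f} s, final state {final_state}")
```

With the script above, one would presumably pass `sync=nd.waitall` and a `step_fn` that wraps `gru(x, h)` and returns the new hidden state.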

Output

Gluon.GRU is significantly slower than ndarray.RNN
device,method,time:
cpu(0) ndarray 0.07194805145263672
cpu(0) gluon 4.735473394393921
gpu(0) ndarray 0.013593673706054688
gpu(0) gluon 0.04437994956970215
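
To put the gap in numbers, the following sketch computes the slowdown and the per-step latency from the timings reported above. It only does arithmetic on those four values; `slowdown` is a hypothetical helper name:

```python
# Timings in seconds for 100 steps, copied from the output above.
timings = {
    ("cpu", "ndarray"): 0.07194805145263672,
    ("cpu", "gluon"): 4.735473394393921,
    ("gpu", "ndarray"): 0.013593673706054688,
    ("gpu", "gluon"): 0.04437994956970215,
}
n_steps = 100


def slowdown(device):
    """Ratio of Gluon time to ndarray.RNN time on the given device."""
    return timings[(device, "gluon")] / timings[(device, "ndarray")]


for device in ("cpu", "gpu"):
    per_step_ms = timings[(device, "gluon")] / n_steps * 1e3
    print(f"{device}: gluon is {slowdown(device):.1f}x slower, "
          f"{per_step_ms:.2f} ms per step")
```

This gives roughly a 66x slowdown on the CPU versus about 3x on the GPU, so the problem is mostly CPU-specific.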

@pengzhao-intel
Contributor

@ciyongch could you help take a look at GRU inference?
Is the fused GRU being used?

@TaoLv
Member

TaoLv commented Dec 13, 2018

I think Gluon GRU is calling unfused RNN cells, which consist of stacked fully connected and activation operators, whereas ndarray.RNN is calling a fused implementation. So to me the performance is as expected.
@marekjg Have you compared the results of the two implementations?

@pengzhao-intel
Contributor

As a next step, @marekjg, if you can build with USE_BLAS=mkl, the performance should improve a lot.

@pengzhao-intel
Contributor

@szha is it possible to use the fused RNN in Gluon?

@marekjg
Author

marekjg commented Dec 13, 2018

Thanks for the quick response. @TaoLv yes, the results are the same, but I removed the comparison and the parameter-loading step for the sake of brevity. @pengzhao-intel I've installed mxnet-cu92mkl, and there was already a boost in performance compared to mxnet-cu92, which I had installed by mistake earlier. Not sure if it helps, but I've checked this script in 1.3, 1.4 (when it was at master), and now 1.5.

@ciyongch
Contributor

@pengzhao-intel @TaoLv @marekjg The current MXNet already supports the fused RNN in Gluon: gluon.rnn.GRU calls the fused GRU, while gluon.rnn.GRUCell calls the FullyConnected + Activation implementation. I will take a look at this.
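
To make the fused/unfused distinction concrete, here is a minimal pure-Python sketch of one GRU step built from separate fully connected and activation operations, which is the shape of work an unfused cell performs. The gate equations follow the common cuDNN-style formulation and are an assumption here, not taken from this thread. `dense`, `gru_cell_step`, and the parameter names are hypothetical:

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense(W, b, v):
    """Fully connected op: W @ v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def gru_cell_step(x, h, p):
    """One GRU step as six separate dense ops plus elementwise activations."""
    # Reset gate: r = sigmoid(W_ir x + b_ir + W_hr h + b_hr)
    r = [sigmoid(a + c) for a, c in
         zip(dense(p["W_ir"], p["b_ir"], x), dense(p["W_hr"], p["b_hr"], h))]
    # Update gate: z = sigmoid(W_iz x + b_iz + W_hz h + b_hz)
    z = [sigmoid(a + c) for a, c in
         zip(dense(p["W_iz"], p["b_iz"], x), dense(p["W_hz"], p["b_hz"], h))]
    # Candidate: n = tanh(W_in x + b_in + r * (W_hn h + b_hn))
    hn = dense(p["W_hn"], p["b_hn"], h)
    n = [math.tanh(a + ri * c) for a, ri, c in
         zip(dense(p["W_in"], p["b_in"], x), r, hn)]
    # New state: h' = (1 - z) * n + z * h
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


# Tiny example with all-zero parameters: r = z = 0.5 and n = 0, so h' = 0.5 * h.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {}
for gate in ("r", "z", "n"):
    params["W_i" + gate] = zeros
    params["W_h" + gate] = zeros
    params["b_i" + gate] = [0.0, 0.0]
    params["b_h" + gate] = [0.0, 0.0]
h_next = gru_cell_step([1.0, -1.0], [0.4, -0.2], params)
print(h_next)
```

A fused kernel computes the same result but can batch the matrix products and elementwise work into far fewer operator launches, which matters most at batch size 1 and sequence length 1, as in the script above.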

@ciyongch
Contributor

@marekjg please build MXNet from source with the option USE_BLAS=mkl, since the mxnet-mkl package is currently built with USE_BLAS=openblas by default. Please correct me if this behavior has changed @TaoLv

@TaoLv
Member

TaoLv commented Dec 13, 2018

@ciyongch Thank you for correcting me. Yes, rnn.GRU also calls the fused RNN implementation and can be hybridized now.

@marekjg please build MXNet from source with the option USE_BLAS=mkl, since the mxnet-mkl package is currently built with USE_BLAS=openblas by default. Please correct me if this behavior has changed.

Yes, pip packages are built with openblas.
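
One way to check which BLAS backend an installed build uses is the runtime feature query. The sketch below assumes a build that ships the `mxnet.runtime` module (it appeared around MXNet 1.5) and that the feature names `BLAS_MKL`, `BLAS_OPEN`, and `MKLDNN` exist in that build. `blas_build_flags` is a hypothetical helper, and it returns `None` when MXNet cannot be imported:

```python
def blas_build_flags(names=("BLAS_MKL", "BLAS_OPEN", "MKLDNN")):
    """Return {feature_name: enabled} for the installed MXNet, or None if unavailable."""
    try:
        # Assumption: this module is present only in sufficiently recent builds.
        from mxnet.runtime import Features
    except (ImportError, OSError):
        return None
    features = Features()
    flags = {}
    for name in names:
        try:
            flags[name] = features.is_enabled(name)
        except Exception:
            flags[name] = None  # feature name not known to this build
    return flags


print(blas_build_flags())
```

If `BLAS_OPEN` is reported as enabled and `BLAS_MKL` is not, the build matches the pip-package situation described above.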

@szha
Member

szha commented Dec 13, 2018

gluon.rnn.GRU supports unrolling of samples with different lengths in the same batch, which is not yet supported in the fused kernel interface. cuDNN supports that, so for the GPU implementation we'd need the integration. The CPU version is yet to be implemented.

@vdantu
Contributor

vdantu commented Dec 13, 2018

@mxnet-label-bot add [Gluon, performance, question]

@eric-haibin-lin
Member

CPU kernels were added: #9977
