Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TOPI] Improve memory layout inside GPU NMS kernel #7257

Merged
merged 1 commit into from
Jan 13, 2021

Conversation

masahi
Copy link
Member

@masahi masahi commented Jan 12, 2021

Motivation

Currently, our NMS API expects input and output data to be in a format [batch, num_anchors, 6], where 6 is for 4 bbox coordinates, score and class id. This format is highly inefficient for class-aware NMS, because to check if two bboxes are the same class, we need to load other 5 values. More discussion in #7154 (comment) and other comments in that PR.

This PR improved the memory layout of buffers used in the NMS kernel. Specifically, it unpacks [batch, num_anchors, 6] input into one [batch, num_anchors, 4] buffer for bboxes, and two [batch, num_anchors] buffers for scores and class ids.

Speed up

This change is expected to give a good speed up for the cases when the number valid boxes is large and there are class ids involved. For example, although the change in #7154 brings only modest speed up, combined with this PR the speed up becomes good, as shown below:

PyTorch NMS workload from FasterRCNN / MaskRCNN (4500 boxes):

Current main : 5.78 milli sec
After rewrite in #7154 to restore class aware NMS: 5.23 milli sec
With memory layout improvement in this PR and the rewrite in #7154: 4.51 milli sec

Other consideration

For other cases, it is likely that there would be no speed up, but it should be no worse. For example, it doesn't improve NMS workload from Gluon SSD. When return_indices=False, which I think only applies to MXNet, there is additional concatenation of three buffers at the end. For Gluon SSD, this concat takes 40 micro sec on GTX 1070 ti, while the main NMS kernel takes 200 micro sec. If people think this additional concat is expensive, we really need to rethink our NMS API: The same concat is already happenning in all other frontends, because all frameworks other than MXNet gives NMS inputs in an unpacked form and concat is necessary to satisfy our NMS API.

please review @mbrookhart @Laurawly @vinx13 @trevor-m

@masahi masahi changed the title Unpack NMS inputs into bbox, scores and class ids [TOPI] Improve memory layout inside GPU NMS kernel Jan 12, 2021
Copy link
Contributor

@mbrookhart mbrookhart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I'm glad this speeds up NMS, that confirms some of our suspicions from [Torch] Restore class-aware NMS for detection models by graph rewrite #7154
  2. I don't love that after this PR, ever framework but MXNET will have an op of the form: concat->split->nms->concat->split. Can we talk about moving the split and concat steps out of this op and reworking the frameworks for a new API?

@mbrookhart
Copy link
Contributor

Even after this, I think we will still need a loop over classes for ONNX and TF, since ONNX explicitly and TF implicitly need max_output_boxes_per_class, while this op even with class id will return max_output_boxes for all classes.

@masahi
Copy link
Member Author

masahi commented Jan 12, 2021

I don't quite follow, maybe you are missing something?

First, this PR doesn't change our NMS API, it only changes the buffer layouts used internally.

Second, the final concat is only required for MXNet, which uses return_indices=False. Our NMS returns something completely different depending on return_indices flag : If True, it returns a big output tensor packed with bboxes, scores and class ids, with invalid entries indicated by -1.

(The valid entries are supposed to move to the top, if invalid_to_bottom flag is True. But our GPU NMS kernel ignores this argument and the output is not reordered. This is another difference with CPU implementation, I think this is a bug)

invalid_to_bottom : optional, boolean
Whether to move all valid bounding boxes to the top.

If return_indices=True, which applies to TF, ONNX, and PyTorch, we only return survived box indices, so there is no need to concat bboxes, scores, and class ids. For this case, there is zero additional overhead after this PR, it is just that concat before NMS done by frontends are now completely pointless.

Copy link
Contributor

@mbrookhart mbrookhart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, yep.

We don't seem to have a test for invalid_to_bottom, it's False by default:

invalid_to_bottom=False,
and the only test that mentions it sets it to False (in test_any). Grep-ing around the codebase, the only place it's ever set as True is in the MXNet frontend:
invalid_to_bottom=True,
and I don't see a unit test for that op.

It might be a bug because it never runs outside of the gluon ssd tutorial, which doesn't check box order:

for target in ["llvm", "cuda"]:
ctx = tvm.context(target, 0)
if ctx.exist:
lib = build(target)
class_IDs, scores, bounding_boxs = run(lib, ctx)
######################################################################
# Display result
ax = utils.viz.plot_bbox(
img,
bounding_boxs.asnumpy()[0],
scores.asnumpy()[0],
class_IDs.asnumpy()[0],
class_names=block.classes,
)
plt.show()

We should probably open an issue for it, or add a unit test and fix the kernel.

I'd still like to see us change the NMS API to better reflect what's going on internally (i.e., remove the concat/split for TF/ONNX/Pytorch), but that doesn't need to be this PR.

@masahi
Copy link
Member Author

masahi commented Jan 12, 2021

Yes, ideally I want to update our NMS to be closer to TF/ONNX/PyTorch, and let MXNet frontend handle split and concat, rather than the other way around (what we have now). Current API is over complicated due to the need to support both styles. If we can assume that return_indices is always True, we can clean up our API a lot. For example, invalid_to_bottom argument only makes sense for MXNet. We don't need coord_start, score_index, and id_index arguments, if inputs are only unpacked.

Supporting max_output_boxes_per_class needs change in the implementation as well. We need to count the number of survived boxes per class. But that's the only change I think, it is definitely doable without writing another kernel.

commit fe8fda81774c2e1a4d434179f62e3a299e084cb7
Author: Masahiro Masuda <[email protected]>
Date:   Wed Dec 30 20:31:29 2020 +0900

    fix write by a single thread

commit 0c21e36
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 04:32:18 2020 +0900

    minor improvement when topk is available

commit 68c6866
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 04:10:24 2020 +0900

    finish concat output

commit 37d7a19
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 03:59:28 2020 +0900

    fixed topk handling

commit 1913f97
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:34:24 2020 +0900

    more refactoring

commit 70c65f0
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:27:15 2020 +0900

    unpack input data

commit 3a27397
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:22:16 2020 +0900

    slight change to initialization

commit 9b42008
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:50:36 2020 +0900

    add some comments, remove check the check on negative class id

commit 0aa375d
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:39:49 2020 +0900

    leave a TODO on write by only one thread

commit d75ee0a
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:13:04 2020 +0900

    temp disable write by only thread 0

commit 20b5630
Author: Masahiro Masuda <[email protected]>
Date:   Sat Dec 26 10:06:43 2020 +0900

    use one block two avoid global sync issue

commit dd1e230
Author: Masahiro Masuda <[email protected]>
Date:   Sat Dec 26 07:59:19 2020 +0900

    make NMS inner loop parallel
fix write by a single thread
@masahi masahi force-pushed the nms-cuda-data-unpack branch from 48beb10 to 03fd7ba Compare January 13, 2021 06:46
@masahi masahi merged commit 1f2b40f into apache:main Jan 13, 2021
@masahi
Copy link
Member Author

masahi commented Jan 13, 2021

Thanks @mbrookhart

masahi added a commit to masahi/tvm that referenced this pull request Jan 14, 2021
commit fe8fda81774c2e1a4d434179f62e3a299e084cb7
Author: Masahiro Masuda <[email protected]>
Date:   Wed Dec 30 20:31:29 2020 +0900

    fix write by a single thread

commit 0c21e36
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 04:32:18 2020 +0900

    minor improvement when topk is available

commit 68c6866
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 04:10:24 2020 +0900

    finish concat output

commit 37d7a19
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 03:59:28 2020 +0900

    fixed topk handling

commit 1913f97
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:34:24 2020 +0900

    more refactoring

commit 70c65f0
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:27:15 2020 +0900

    unpack input data

commit 3a27397
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:22:16 2020 +0900

    slight change to initialization

commit 9b42008
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:50:36 2020 +0900

    add some comments, remove check the check on negative class id

commit 0aa375d
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:39:49 2020 +0900

    leave a TODO on write by only one thread

commit d75ee0a
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:13:04 2020 +0900

    temp disable write by only thread 0

commit 20b5630
Author: Masahiro Masuda <[email protected]>
Date:   Sat Dec 26 10:06:43 2020 +0900

    use one block two avoid global sync issue

commit dd1e230
Author: Masahiro Masuda <[email protected]>
Date:   Sat Dec 26 07:59:19 2020 +0900

    make NMS inner loop parallel
fix write by a single thread
masahi added a commit to masahi/tvm that referenced this pull request Jan 18, 2021
commit fe8fda81774c2e1a4d434179f62e3a299e084cb7
Author: Masahiro Masuda <[email protected]>
Date:   Wed Dec 30 20:31:29 2020 +0900

    fix write by a single thread

commit 0c21e36
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 04:32:18 2020 +0900

    minor improvement when topk is available

commit 68c6866
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 04:10:24 2020 +0900

    finish concat output

commit 37d7a19
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 03:59:28 2020 +0900

    fixed topk handling

commit 1913f97
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:34:24 2020 +0900

    more refactoring

commit 70c65f0
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:27:15 2020 +0900

    unpack input data

commit 3a27397
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:22:16 2020 +0900

    slight change to initialization

commit 9b42008
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:50:36 2020 +0900

    add some comments, remove check the check on negative class id

commit 0aa375d
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:39:49 2020 +0900

    leave a TODO on write by only one thread

commit d75ee0a
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:13:04 2020 +0900

    temp disable write by only thread 0

commit 20b5630
Author: Masahiro Masuda <[email protected]>
Date:   Sat Dec 26 10:06:43 2020 +0900

    use one block two avoid global sync issue

commit dd1e230
Author: Masahiro Masuda <[email protected]>
Date:   Sat Dec 26 07:59:19 2020 +0900

    make NMS inner loop parallel
fix write by a single thread
TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request Jan 20, 2021
commit fe8fda81774c2e1a4d434179f62e3a299e084cb7
Author: Masahiro Masuda <[email protected]>
Date:   Wed Dec 30 20:31:29 2020 +0900

    fix write by a single thread

commit 0c21e36
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 04:32:18 2020 +0900

    minor improvement when topk is available

commit 68c6866
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 04:10:24 2020 +0900

    finish concat output

commit 37d7a19
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 03:59:28 2020 +0900

    fixed topk handling

commit 1913f97
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:34:24 2020 +0900

    more refactoring

commit 70c65f0
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:27:15 2020 +0900

    unpack input data

commit 3a27397
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:22:16 2020 +0900

    slight change to initialization

commit 9b42008
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:50:36 2020 +0900

    add some comments, remove check the check on negative class id

commit 0aa375d
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:39:49 2020 +0900

    leave a TODO on write by only one thread

commit d75ee0a
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:13:04 2020 +0900

    temp disable write by only thread 0

commit 20b5630
Author: Masahiro Masuda <[email protected]>
Date:   Sat Dec 26 10:06:43 2020 +0900

    use one block two avoid global sync issue

commit dd1e230
Author: Masahiro Masuda <[email protected]>
Date:   Sat Dec 26 07:59:19 2020 +0900

    make NMS inner loop parallel
fix write by a single thread
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Jan 21, 2021
commit fe8fda81774c2e1a4d434179f62e3a299e084cb7
Author: Masahiro Masuda <[email protected]>
Date:   Wed Dec 30 20:31:29 2020 +0900

    fix write by a single thread

commit 0c21e36
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 04:32:18 2020 +0900

    minor improvement when topk is available

commit 68c6866
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 04:10:24 2020 +0900

    finish concat output

commit 37d7a19
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 03:59:28 2020 +0900

    fixed topk handling

commit 1913f97
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:34:24 2020 +0900

    more refactoring

commit 70c65f0
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:27:15 2020 +0900

    unpack input data

commit 3a27397
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:22:16 2020 +0900

    slight change to initialization

commit 9b42008
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:50:36 2020 +0900

    add some comments, remove check the check on negative class id

commit 0aa375d
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:39:49 2020 +0900

    leave a TODO on write by only one thread

commit d75ee0a
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:13:04 2020 +0900

    temp disable write by only thread 0

commit 20b5630
Author: Masahiro Masuda <[email protected]>
Date:   Sat Dec 26 10:06:43 2020 +0900

    use one block two avoid global sync issue

commit dd1e230
Author: Masahiro Masuda <[email protected]>
Date:   Sat Dec 26 07:59:19 2020 +0900

    make NMS inner loop parallel
fix write by a single thread
electriclilies pushed a commit to electriclilies/tvm that referenced this pull request Feb 18, 2021
commit fe8fda81774c2e1a4d434179f62e3a299e084cb7
Author: Masahiro Masuda <[email protected]>
Date:   Wed Dec 30 20:31:29 2020 +0900

    fix write by a single thread

commit 0c21e36
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 04:32:18 2020 +0900

    minor improvement when topk is available

commit 68c6866
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 04:10:24 2020 +0900

    finish concat output

commit 37d7a19
Author: Masahiro Masuda <[email protected]>
Date:   Tue Dec 29 03:59:28 2020 +0900

    fixed topk handling

commit 1913f97
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:34:24 2020 +0900

    more refactoring

commit 70c65f0
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:27:15 2020 +0900

    unpack input data

commit 3a27397
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 21:22:16 2020 +0900

    slight change to initialization

commit 9b42008
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:50:36 2020 +0900

    add some comments, remove check the check on negative class id

commit 0aa375d
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:39:49 2020 +0900

    leave a TODO on write by only one thread

commit d75ee0a
Author: Masahiro Masuda <[email protected]>
Date:   Mon Dec 28 19:13:04 2020 +0900

    temp disable write by only thread 0

commit 20b5630
Author: Masahiro Masuda <[email protected]>
Date:   Sat Dec 26 10:06:43 2020 +0900

    use one block two avoid global sync issue

commit dd1e230
Author: Masahiro Masuda <[email protected]>
Date:   Sat Dec 26 07:59:19 2020 +0900

    make NMS inner loop parallel
fix write by a single thread
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants