
One of a few experiments I did with moving libnabo to CUDA #33

Open
wants to merge 4 commits into master

Conversation

LouisCastricato

As stated before, all attempts were unsuccessful as I was not able to beat nor match libnabo CPU performance.

I'm passing this project on to a colleague of mine who has been working with KD-trees and KNN for over 10 years. I hope he can do a much better job than I did.

The code isn't linked to any C or C++ code. The wrapper I wrote for that was very messy and kept changing as I tried different approaches. I'm only putting this here as a proof of concept, in case anyone wants to tinker with it.

Dustin, my colleague, will be writing a new C++ wrapper in the coming month when he does his attempt.

I don't know what his approach will be, but his fork is here: https://github.com/dustprog/libnabo

…d is about 20% faster than the CPU

implementation. My code is very poorly optimized and should be taken with a grain of salt.

I'll be working on it some more later next week.

In the best case, this function will run in O(n log n) time. In the worst case,
it will be O(n^2), where the worst case is two points within the
same cluster sitting on exactly opposite sides of the tree.

Points should be organized as follows:

[Cluster|Cluster|Cluster|Cluster]

where each cluster is 32 points large and is ordered from nearest to farthest
by distance from the center of the cluster.
The farthest point should be no more than max_rad from the center.
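The layout above can be sketched in host-side C++. This is a hypothetical helper, not code from the actual port: it assumes a simple `Point` struct (the real code works with libnabo's Eigen matrices), sorts one 32-point cluster from nearest to farthest around its centroid, and checks the `max_rad` invariant.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical 3-D point; the actual port uses libnabo's Eigen types.
struct Point { float x, y, z; };

constexpr int kClusterSize = 32;  // one cluster per CUDA warp

float dist(const Point& a, const Point& b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Order the points of one cluster from nearest to farthest from their
// centroid, then verify the max_rad invariant described above.
void orderCluster(std::vector<Point>& cluster, float max_rad) {
    Point c{0, 0, 0};
    for (const Point& p : cluster) { c.x += p.x; c.y += p.y; c.z += p.z; }
    c.x /= cluster.size(); c.y /= cluster.size(); c.z /= cluster.size();

    std::sort(cluster.begin(), cluster.end(),
              [&](const Point& a, const Point& b) {
                  return dist(a, c) < dist(b, c);
              });

    // The farthest point must lie within max_rad of the centroid.
    assert(dist(cluster.back(), c) <= max_rad);
}
```

Concatenating the ordered clusters into one flat buffer would then give the `[Cluster|Cluster|...]` layout the comment describes.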

Eventually, if my predictions hold up, this code should perform 7-9x faster than the CPU implementation.
Obviously that's quite a long way off, will take countless days of work, and will never be perfectly optimized,
but you get the idea.

This code is highly unstable and has a very large amount of bugs (I counted 20+). DO NOT use this
code in any application yet. It will almost certainly crash either your GPU driver or the application.

This code was written to look, initially, as close to the OpenCL code as possible. As a result, the amount
of branching that currently occurs is huge, since I directly copied the traversal patterns
of the OpenCL version. I will be reducing the amount of branching soon.

-Louis
Optimized branching.

Reduced memory overhead.

First tests with dynamic parallelism failed. I'll try again tomorrow.

Began to diverge from libnabo's default traversal patterns. I'm using my own, as they
seem better suited to CUDA. This may have a negative impact later on; I don't know yet.

Improved comments.
Fixed some syntax errors that were preventing it from compiling.
Added a best practice guide and installation guide for CUDA
@ethzasl-jenkins

Can one of the admins verify this patch?

@LouisCastricato
Author

Oh... I didn't mean to push the change to the README... Um... Should I create a cudareadme.md?

@LouisCastricato
Author

#29 Linking the thread I had started about this.

My experience implementing a KD search in CUDA is that, no matter what I did, thread divergence was always the biggest overhead. In the best case the threads were idle about half the time; in the worst case they were idle 31/32 of the time.
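Those idle fractions follow directly from how a warp executes: all 32 lanes run in lockstep until the deepest traversal finishes. A toy host-side model (not the actual kernel) makes the arithmetic concrete: utilization is the mean lane depth divided by the maximum lane depth.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Toy model of warp divergence in tree traversal: each lane descends to a
// different depth, but the warp as a whole executes until the deepest lane
// finishes. Utilization = useful lane-steps / total lane-steps.
double warpUtilization(const std::vector<int>& laneDepths) {
    int maxDepth = *std::max_element(laneDepths.begin(), laneDepths.end());
    long long useful =
        std::accumulate(laneDepths.begin(), laneDepths.end(), 0LL);
    return static_cast<double>(useful) /
           (static_cast<double>(maxDepth) * laneDepths.size());
}
```

Uniform depths give a utilization of 1.0; one deep lane among 32 shallow ones drives it toward 1/32, which matches the "idle 31/32 of the time" worst case above.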

My general prediction is that KD-tree search algorithms like the one libnabo uses will not be as effective in a GPGPU environment until something like NVIDIA's Volta releases, which would allow the use of an on-board ARM chip to queue new work for the GPU and therefore potentially reduce warp divergence.

FLANN's approach to CUDA seemed interesting, but the error was far too large for it to be practical for anything more than super basic applications.

Unless Dustin sees something that I didn't, I'd say that doing this on the GPU may not be very viable for a while longer (2 years, perhaps?). And even then, the monstrous improvements in cache performance that Intel Skylake promises may further reduce the potential advantage of GPGPU KD search over CPU KD search.

@simonlynen
Collaborator

@MatrixCompSci thanks a lot for sharing your insight, thoughts and code. Great!

@LouisCastricato
Author

The one thing I still really want to try is doing this on a Xeon Phi. I'm genuinely curious, haha.
