
Performance improvements #94

Merged
merged 4 commits into master from jb/perf on Jul 9, 2020
Conversation

@joaquimg (Member) commented Jul 4, 2020

I added a benchmark file: bench/runbench.jl

There are major speedups from these small changes.
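
For context, a minimal sketch of what a TimerOutputs-based harness like bench/runbench.jl might look like; the section names mirror the tables below, but the phase functions here are hypothetical stand-ins, not the real LP build/solve code:

using TimerOutputs

const to = TimerOutput()

# Hypothetical stand-ins for the real phases (building a model, copying
# it to the solver, and optimizing); the real script does this with LPs.
build_model() = rand(1_000, 1_000)
copy_to_solver(A) = copy(A)
solve_lp(A) = sum(A)

for _ in 1:10
    @timeit to "c + s" begin
        A = @timeit to "build" build_model()
        B = @timeit to "copy" copy_to_solver(A)
        @timeit to "opt" solve_lp(B)
    end
end

print_timer(to)  # prints a table like the ones below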

Current master:

 ──────────────────────────────────────────────────────────────────
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      85.9s / 100%            25.6GiB / 100%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────
 c + s         10    17.3s  20.2%   1.73s   4.85GiB  19.0%   497MiB
   opt         10    9.82s  11.4%   982ms     0.00B  0.00%    0.00B
   copy        10    7.14s  8.32%   714ms   4.19GiB  16.4%   429MiB
   build       10    370ms  0.43%  37.0ms    676MiB  2.58%  67.6MiB
 bcs           10    17.2s  20.0%   1.72s   4.92GiB  19.2%   504MiB
   opt         10    16.8s  19.6%   1.68s   4.26GiB  16.6%   436MiB
   build       10    352ms  0.41%  35.2ms    677MiB  2.58%  67.7MiB
 cs            10    17.2s  20.0%   1.72s   4.92GiB  19.2%   504MiB
   opt         10    16.8s  19.6%   1.68s   4.26GiB  16.6%   436MiB
   build       10    370ms  0.43%  37.0ms    677MiB  2.58%  67.7MiB
 bcs + v       10    17.1s  19.9%   1.71s   5.45GiB  21.3%   558MiB
   opt         10    16.3s  19.0%   1.63s   4.26GiB  16.6%   436MiB
   build       10    719ms  0.84%  71.9ms   1.19GiB  4.66%   122MiB
 bc + s        10    16.4s  19.1%   1.64s   4.86GiB  19.0%   497MiB
   opt         10    8.91s  10.4%   891ms     0.00B  0.00%    0.00B
   copy        10    7.07s  8.24%   707ms   4.20GiB  16.4%   430MiB
   build       10    413ms  0.48%  41.3ms    676MiB  2.58%  67.6MiB
 data          10    677ms  0.79%  67.7ms    617MiB  2.35%  61.7MiB
 ──────────────────────────────────────────────────────────────────

After this PR:

 ──────────────────────────────────────────────────────────────────
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      52.8s / 100%            8.91GiB / 100%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────
 bcs + v       10    10.9s  20.7%   1.09s   2.11GiB  23.7%   216MiB
   opt         10    10.2s  19.3%   1.02s    938MiB  10.3%  93.8MiB
   build       10    744ms  1.41%  74.4ms   1.19GiB  13.4%   122MiB
 bcs           10    10.6s  20.0%   1.06s   1.58GiB  17.7%   161MiB
   opt         10    10.1s  19.1%   1.01s    938MiB  10.3%  93.8MiB
   build       10    454ms  0.86%  45.4ms    677MiB  7.43%  67.7MiB
 bc + s        10    10.3s  19.4%   1.03s   1.53GiB  17.1%   156MiB
   opt         10    8.31s  15.7%   831ms     0.00B  0.00%    0.00B
   copy        10    1.72s  3.25%   172ms    887MiB  9.73%  88.7MiB
   build       10    212ms  0.40%  21.2ms    676MiB  7.41%  67.6MiB
 cs            10    10.1s  19.2%   1.01s   1.58GiB  17.7%   161MiB
   opt         10    9.80s  18.6%   980ms    938MiB  10.3%  93.8MiB
   build       10    332ms  0.63%  33.2ms    677MiB  7.43%  67.7MiB
 c + s         10    10.1s  19.1%   1.01s   1.51GiB  17.0%   155MiB
   opt         10    8.15s  15.4%   815ms     0.00B  0.00%    0.00B
   copy        10    1.62s  3.07%   162ms    874MiB  9.58%  87.4MiB
   build       10    325ms  0.62%  32.5ms    676MiB  7.41%  67.6MiB
 data          10    789ms  1.49%  78.9ms    617MiB  6.76%  61.7MiB
 ──────────────────────────────────────────────────────────────────

cc @odow @mlubin @blegat

@codecov bot commented Jul 4, 2020

Codecov Report

Merging #94 into master will increase coverage by 2.02%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master      #94      +/-   ##
==========================================
+ Coverage   54.41%   56.44%   +2.02%     
==========================================
  Files           3        3              
  Lines         408      427      +19     
==========================================
+ Hits          222      241      +19     
  Misses        186      186              
Impacted Files                   Coverage Δ
src/MOI_wrapper/MOI_wrapper.jl   83.40% <100.00%> (+1.38%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cb20fa3...b4acc17.

@mlubin (Member) commented Jul 4, 2020

Nice! Let me put the right copyright header on my snippet before you commit this.

@mlubin (Member) left a comment:
Never mind about the copyright. At first glance it looked like this was a copy of the facility location benchmark I wrote, but it's solving a random LP instead.

bench/runbench.jl
add_sizehint!(I, n_terms)
add_sizehint!(J, n_terms)
add_sizehint!(V, n_terms)
for c_index in list
Member:

Why are we doing the loop twice? Is it for cache friendliness, since we modify different vectors?
The disadvantage is that we get the function twice.

Member:

I agree. How much does this particular optimization help?

Member Author (joaquimg):

Doing it twice lets us pre-allocate slots in I, J, and V.
It does not seem to make the code meaningfully more or less readable.
It is responsible for 30% of the overall speedup.
An extra step would be to also cache f. I was worried that might be too much, but it can give an extra 10%:

 ──────────────────────────────────────────────────────────────────
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      52.8s / 100%            8.91GiB / 100%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────
 bcs + v       10    10.9s  20.7%   1.09s   2.11GiB  23.7%   216MiB
   opt         10    10.2s  19.3%   1.02s    938MiB  10.3%  93.8MiB
   build       10    744ms  1.41%  74.4ms   1.19GiB  13.4%   122MiB
 bcs           10    10.6s  20.0%   1.06s   1.58GiB  17.7%   161MiB
   opt         10    10.1s  19.1%   1.01s    938MiB  10.3%  93.8MiB
   build       10    454ms  0.86%  45.4ms    677MiB  7.43%  67.7MiB
 bc + s        10    10.3s  19.4%   1.03s   1.53GiB  17.1%   156MiB
   opt         10    8.31s  15.7%   831ms     0.00B  0.00%    0.00B
   copy        10    1.72s  3.25%   172ms    887MiB  9.73%  88.7MiB
   build       10    212ms  0.40%  21.2ms    676MiB  7.41%  67.6MiB
 cs            10    10.1s  19.2%   1.01s   1.58GiB  17.7%   161MiB
   opt         10    9.80s  18.6%   980ms    938MiB  10.3%  93.8MiB
   build       10    332ms  0.63%  33.2ms    677MiB  7.43%  67.7MiB
 c + s         10    10.1s  19.1%   1.01s   1.51GiB  17.0%   155MiB
   opt         10    8.15s  15.4%   815ms     0.00B  0.00%    0.00B
   copy        10    1.62s  3.07%   162ms    874MiB  9.58%  87.4MiB
   build       10    325ms  0.62%  32.5ms    676MiB  7.41%  67.6MiB
 data          10    789ms  1.49%  78.9ms    617MiB  6.76%  61.7MiB
 ──────────────────────────────────────────────────────────────────

Member Author (joaquimg):

Actually, it's VERY important for the bridged cache with a separate solver case!
I will commit it!

Member Author (joaquimg):

Now the bridging overhead is basically zero, and copy is less than 20% of solve time.

Member:

Indeed, I missed the computation of n_terms; it makes sense then.

Member:

Why is getting functions slow? Isn't the bridged cache just a MOIU.AbstractModel?

Member Author (joaquimg):

Not sure why; yes, it should be.

Member Author (joaquimg):

I don't think it's especially slow, though.
The loop is just tight.
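
To make the two-pass pattern under discussion concrete, here is a hedged, self-contained sketch; the names and the plain-Julia stand-in for MOI functions are hypothetical (the real code gets each constraint function from the model, and caching those functions between the passes is the "cache f" idea above):

# Each "function" is modeled as a list of column => coefficient terms,
# a stand-in for an MOI ScalarAffineFunction.
function assemble_triplets(funcs::Vector{Vector{Pair{Int,Float64}}})
    # Pass 1: count terms so the triplet vectors can be pre-allocated exactly.
    n_terms = 0
    for f in funcs
        n_terms += length(f)
    end
    I, J, V = Int[], Int[], Float64[]
    sizehint!(I, n_terms)
    sizehint!(J, n_terms)
    sizehint!(V, n_terms)
    # Pass 2: fill the triplets; pushes no longer trigger reallocation.
    for (row, f) in enumerate(funcs)
        for (col, coef) in f
            push!(I, row)
            push!(J, col)
            push!(V, coef)
        end
    end
    return I, J, V
end

# Example: two constraints over three variables.
funcs = [[1 => 2.0, 3 => 1.5], [2 => -1.0]]
I, J, V = assemble_triplets(funcs)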

@joaquimg (Member, Author) commented Jul 4, 2020

With the latest commit, the bridging time is always basically zero:

 ──────────────────────────────────────────────────────────────────
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      52.8s / 100%            8.91GiB / 100%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────
 bcs + v       10    10.9s  20.7%   1.09s   2.11GiB  23.7%   216MiB
   opt         10    10.2s  19.3%   1.02s    938MiB  10.3%  93.8MiB
   build       10    744ms  1.41%  74.4ms   1.19GiB  13.4%   122MiB
 bcs           10    10.6s  20.0%   1.06s   1.58GiB  17.7%   161MiB
   opt         10    10.1s  19.1%   1.01s    938MiB  10.3%  93.8MiB
   build       10    454ms  0.86%  45.4ms    677MiB  7.43%  67.7MiB
 bc + s        10    10.3s  19.4%   1.03s   1.53GiB  17.1%   156MiB
   opt         10    8.31s  15.7%   831ms     0.00B  0.00%    0.00B
   copy        10    1.72s  3.25%   172ms    887MiB  9.73%  88.7MiB
   build       10    212ms  0.40%  21.2ms    676MiB  7.41%  67.6MiB
 cs            10    10.1s  19.2%   1.01s   1.58GiB  17.7%   161MiB
   opt         10    9.80s  18.6%   980ms    938MiB  10.3%  93.8MiB
   build       10    332ms  0.63%  33.2ms    677MiB  7.43%  67.7MiB
 c + s         10    10.1s  19.1%   1.01s   1.51GiB  17.0%   155MiB
   opt         10    8.15s  15.4%   815ms     0.00B  0.00%    0.00B
   copy        10    1.62s  3.07%   162ms    874MiB  9.58%  87.4MiB
   build       10    325ms  0.62%  32.5ms    676MiB  7.41%  67.6MiB
 data          10    789ms  1.49%  78.9ms    617MiB  6.76%  61.7MiB
 ──────────────────────────────────────────────────────────────────

@blegat (Member) commented Jul 4, 2020

Is sparse taking a lot of time?
It seems we could create A' directly instead of I, J, V. For this, we need to call MOIU.canonical(f), since SparseMatrixCSC does not allow duplicates, and then add f as a new row of A' by adding an element to colptr and appending the new entries to rowval and nzval. Then we can do A = copy(transpose(A')) instead of A = sparse(I, J, V, ...). One small detail: SparseMatrixCSC is immutable, so we cannot modify A'.n and A'.m; we would probably need to store colptr, rowval, and nzval outside the SparseMatrixCSC structure, but the reasoning is the same.
Looking at the code of SparseArrays.halfperm!, it seems more efficient than sparse, because sparse needs to remove duplicates and sort, while in this new approach that is already done by MOIU.canonical.
I was wondering about this while thinking about the implementation of MatrixOptInterface and about improving the performance of SCS and the like (probably by using MatrixOptInterface), but I haven't tried it out yet.
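
A hedged sketch of that idea, assuming each constraint's terms are already canonical (sorted column indices, no duplicates, as MOIU.canonical would ensure); the names and the plain data representation are hypothetical:

using SparseArrays

# Build At column by column (one column of At per constraint row of A),
# then transpose once at the end; copy(transpose(...)) on a
# SparseMatrixCSC goes through the halfperm! machinery mentioned above.
function build_A(rows::Vector{Vector{Pair{Int,Float64}}}, n_vars::Int)
    colptr = Int[1]
    rowval = Int[]
    nzval = Float64[]
    for terms in rows
        for (var, coef) in terms  # assumed sorted and duplicate-free
            push!(rowval, var)
            push!(nzval, coef)
        end
        push!(colptr, length(rowval) + 1)
    end
    At = SparseMatrixCSC(n_vars, length(rows), colptr, rowval, nzval)
    return copy(transpose(At))  # materialize A in CSC form
end

# Example: the same two constraints over three variables as above.
A = build_A([[1 => 2.0, 3 => 1.5], [2 => -1.0]], 3)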

@joaquimg (Member, Author) commented Jul 4, 2020

Is sparse taking a lot of time?

It does take a reasonable amount of time: between 40 and 100 ms, which I'd guess is 25 to 40% of the copy time.

So yes, I think there is room for improvement.

Building colptr, rowval, and nzval directly might be good.

I vote to merge this PR and experiment with canonicalize + colptr, rowval, and nzval of At in a new PR. It seems that Clp would still require transposing as an extra step; on the other hand, other solvers take At instead of A, in which case it can be much better.

For Xpress, CPLEX, and Gurobi, we could start with this copy_to function for the linear functions and loop through the other constraints as we do today. Most of the time that we need high-performance loading it is an LP anyway. We can improve the loading of the other constraints as needed.

@joaquimg (Member, Author) commented Jul 4, 2020

One last thing: YES, I like the idea of doing that in MatrixOptInterface.
Besides sharing code among Clp, Xpress, Gurobi, CPLEX, SCS, etc.,
MatrixOI could also be a key component of a good implementation of the differentiation GSoC project.

@mtanneau (Contributor) commented Jul 4, 2020 via email

@joaquimg merged commit 8ea859e into master on Jul 9, 2020
@odow deleted the jb/perf branch on October 8, 2020