
Performance improvements #94

Merged
merged 4 commits into master from jb/perf on Jul 9, 2020
Conversation

@joaquimg (Member) commented Jul 4, 2020

I added a benchmark file: bench/runbench.jl

There are major speedups from these small changes.
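
For context, a minimal sketch of what a TimerOutputs-based harness like bench/runbench.jl might look like; the section names mirror the tables below, but the phase functions here are hypothetical stand-ins, not the real LP build/solve code:

using TimerOutputs

const to = TimerOutput()

# Hypothetical stand-ins for the real phases (building a model, copying
# it to the solver, and optimizing); the real script does this with LPs.
build_model() = rand(1_000, 1_000)
copy_to_solver(A) = copy(A)
solve_lp(A) = sum(A)

for _ in 1:10
    @timeit to "c + s" begin
        A = @timeit to "build" build_model()
        B = @timeit to "copy" copy_to_solver(A)
        @timeit to "opt" solve_lp(B)
    end
end

print_timer(to)  # prints a table like the ones below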

Current master:

 ──────────────────────────────────────────────────────────────────
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      85.9s / 100%            25.6GiB / 100%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────
 c + s         10    17.3s  20.2%   1.73s   4.85GiB  19.0%   497MiB
   opt         10    9.82s  11.4%   982ms     0.00B  0.00%    0.00B
   copy        10    7.14s  8.32%   714ms   4.19GiB  16.4%   429MiB
   build       10    370ms  0.43%  37.0ms    676MiB  2.58%  67.6MiB
 bcs           10    17.2s  20.0%   1.72s   4.92GiB  19.2%   504MiB
   opt         10    16.8s  19.6%   1.68s   4.26GiB  16.6%   436MiB
   build       10    352ms  0.41%  35.2ms    677MiB  2.58%  67.7MiB
 cs            10    17.2s  20.0%   1.72s   4.92GiB  19.2%   504MiB
   opt         10    16.8s  19.6%   1.68s   4.26GiB  16.6%   436MiB
   build       10    370ms  0.43%  37.0ms    677MiB  2.58%  67.7MiB
 bcs + v       10    17.1s  19.9%   1.71s   5.45GiB  21.3%   558MiB
   opt         10    16.3s  19.0%   1.63s   4.26GiB  16.6%   436MiB
   build       10    719ms  0.84%  71.9ms   1.19GiB  4.66%   122MiB
 bc + s        10    16.4s  19.1%   1.64s   4.86GiB  19.0%   497MiB
   opt         10    8.91s  10.4%   891ms     0.00B  0.00%    0.00B
   copy        10    7.07s  8.24%   707ms   4.20GiB  16.4%   430MiB
   build       10    413ms  0.48%  41.3ms    676MiB  2.58%  67.6MiB
 data          10    677ms  0.79%  67.7ms    617MiB  2.35%  61.7MiB
 ──────────────────────────────────────────────────────────────────

After this PR:

 ──────────────────────────────────────────────────────────────────
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      52.8s / 100%            8.91GiB / 100%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────
 bcs + v       10    10.9s  20.7%   1.09s   2.11GiB  23.7%   216MiB
   opt         10    10.2s  19.3%   1.02s    938MiB  10.3%  93.8MiB
   build       10    744ms  1.41%  74.4ms   1.19GiB  13.4%   122MiB
 bcs           10    10.6s  20.0%   1.06s   1.58GiB  17.7%   161MiB
   opt         10    10.1s  19.1%   1.01s    938MiB  10.3%  93.8MiB
   build       10    454ms  0.86%  45.4ms    677MiB  7.43%  67.7MiB
 bc + s        10    10.3s  19.4%   1.03s   1.53GiB  17.1%   156MiB
   opt         10    8.31s  15.7%   831ms     0.00B  0.00%    0.00B
   copy        10    1.72s  3.25%   172ms    887MiB  9.73%  88.7MiB
   build       10    212ms  0.40%  21.2ms    676MiB  7.41%  67.6MiB
 cs            10    10.1s  19.2%   1.01s   1.58GiB  17.7%   161MiB
   opt         10    9.80s  18.6%   980ms    938MiB  10.3%  93.8MiB
   build       10    332ms  0.63%  33.2ms    677MiB  7.43%  67.7MiB
 c + s         10    10.1s  19.1%   1.01s   1.51GiB  17.0%   155MiB
   opt         10    8.15s  15.4%   815ms     0.00B  0.00%    0.00B
   copy        10    1.62s  3.07%   162ms    874MiB  9.58%  87.4MiB
   build       10    325ms  0.62%  32.5ms    676MiB  7.41%  67.6MiB
 data          10    789ms  1.49%  78.9ms    617MiB  6.76%  61.7MiB
 ──────────────────────────────────────────────────────────────────

cc @odow @mlubin @blegat

@codecov bot commented Jul 4, 2020

Codecov Report

Merging #94 into master will increase coverage by 2.02%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master      #94      +/-   ##
==========================================
+ Coverage   54.41%   56.44%   +2.02%     
==========================================
  Files           3        3              
  Lines         408      427      +19     
==========================================
+ Hits          222      241      +19     
  Misses        186      186              
Impacted Files                   Coverage Δ
src/MOI_wrapper/MOI_wrapper.jl   83.40% <100.00%> (+1.38%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cb20fa3...b4acc17.

@mlubin (Member) commented Jul 4, 2020

Nice! Let me put the right copyright header on my snippet before you commit this.

@mlubin (Member) left a comment:
Never mind about the copyright. At first glance it looked like this was a copy of the facility location benchmark I wrote, but it's solving a random LP instead.

bench/runbench.jl
add_sizehint!(I, n_terms)
add_sizehint!(J, n_terms)
add_sizehint!(V, n_terms)
for c_index in list
Member:

Why are we doing the loop twice? Is it for cache friendliness, since we modify different vectors?
The disadvantage is that we get the function twice.

Member:

I agree. How much does this particular optimization help?

Member Author (joaquimg):

Doing it twice lets us pre-allocate slots in I, J, and V.
It does not seem to make the code meaningfully more or less readable.
It is responsible for 30% of the overall speedup.
An extra step would be to also cache f. I was worried that might be too much, but it can give an extra 10%:

 ──────────────────────────────────────────────────────────────────
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      52.8s / 100%            8.91GiB / 100%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────
 bcs + v       10    10.9s  20.7%   1.09s   2.11GiB  23.7%   216MiB
   opt         10    10.2s  19.3%   1.02s    938MiB  10.3%  93.8MiB
   build       10    744ms  1.41%  74.4ms   1.19GiB  13.4%   122MiB
 bcs           10    10.6s  20.0%   1.06s   1.58GiB  17.7%   161MiB
   opt         10    10.1s  19.1%   1.01s    938MiB  10.3%  93.8MiB
   build       10    454ms  0.86%  45.4ms    677MiB  7.43%  67.7MiB
 bc + s        10    10.3s  19.4%   1.03s   1.53GiB  17.1%   156MiB
   opt         10    8.31s  15.7%   831ms     0.00B  0.00%    0.00B
   copy        10    1.72s  3.25%   172ms    887MiB  9.73%  88.7MiB
   build       10    212ms  0.40%  21.2ms    676MiB  7.41%  67.6MiB
 cs            10    10.1s  19.2%   1.01s   1.58GiB  17.7%   161MiB
   opt         10    9.80s  18.6%   980ms    938MiB  10.3%  93.8MiB
   build       10    332ms  0.63%  33.2ms    677MiB  7.43%  67.7MiB
 c + s         10    10.1s  19.1%   1.01s   1.51GiB  17.0%   155MiB
   opt         10    8.15s  15.4%   815ms     0.00B  0.00%    0.00B
   copy        10    1.62s  3.07%   162ms    874MiB  9.58%  87.4MiB
   build       10    325ms  0.62%  32.5ms    676MiB  7.41%  67.6MiB
 data          10    789ms  1.49%  78.9ms    617MiB  6.76%  61.7MiB
 ──────────────────────────────────────────────────────────────────

Member Author (joaquimg):

Actually, it's VERY important for the bridged cache with a separate solver case!
I will commit it!

Member Author (joaquimg):

Now the bridging overhead is basically zero, and copy is less than 20% of solve time.

Member:

Indeed, I missed the computation of n_terms; it makes sense then.

Member:

Why is getting functions slow? Isn't the bridged cache just a MOIU.AbstractModel?

Member Author (joaquimg):

Not sure why; yes, it should be.

Member Author (joaquimg):

I don't think it's especially slow, though.
The loop is just tight.
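
To make the two-pass pattern under discussion concrete, here is a hedged, self-contained sketch; the names and the plain-Julia stand-in for MOI functions are hypothetical (the real code gets each constraint function from the model, and caching those functions between the passes is the "cache f" idea above):

# Each "function" is modeled as a list of column => coefficient terms,
# a stand-in for an MOI ScalarAffineFunction.
function assemble_triplets(funcs::Vector{Vector{Pair{Int,Float64}}})
    # Pass 1: count terms so the triplet vectors can be pre-allocated exactly.
    n_terms = 0
    for f in funcs
        n_terms += length(f)
    end
    I, J, V = Int[], Int[], Float64[]
    sizehint!(I, n_terms)
    sizehint!(J, n_terms)
    sizehint!(V, n_terms)
    # Pass 2: fill the triplets; pushes no longer trigger reallocation.
    for (row, f) in enumerate(funcs)
        for (col, coef) in f
            push!(I, row)
            push!(J, col)
            push!(V, coef)
        end
    end
    return I, J, V
end

# Example: two constraints over three variables.
funcs = [[1 => 2.0, 3 => 1.5], [2 => -1.0]]
I, J, V = assemble_triplets(funcs)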

@joaquimg (Member, Author) commented Jul 4, 2020

With the latest commit, the bridging time is always basically zero:

 ──────────────────────────────────────────────────────────────────
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      52.8s / 100%            8.91GiB / 100%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────
 bcs + v       10    10.9s  20.7%   1.09s   2.11GiB  23.7%   216MiB
   opt         10    10.2s  19.3%   1.02s    938MiB  10.3%  93.8MiB
   build       10    744ms  1.41%  74.4ms   1.19GiB  13.4%   122MiB
 bcs           10    10.6s  20.0%   1.06s   1.58GiB  17.7%   161MiB
   opt         10    10.1s  19.1%   1.01s    938MiB  10.3%  93.8MiB
   build       10    454ms  0.86%  45.4ms    677MiB  7.43%  67.7MiB
 bc + s        10    10.3s  19.4%   1.03s   1.53GiB  17.1%   156MiB
   opt         10    8.31s  15.7%   831ms     0.00B  0.00%    0.00B
   copy        10    1.72s  3.25%   172ms    887MiB  9.73%  88.7MiB
   build       10    212ms  0.40%  21.2ms    676MiB  7.41%  67.6MiB
 cs            10    10.1s  19.2%   1.01s   1.58GiB  17.7%   161MiB
   opt         10    9.80s  18.6%   980ms    938MiB  10.3%  93.8MiB
   build       10    332ms  0.63%  33.2ms    677MiB  7.43%  67.7MiB
 c + s         10    10.1s  19.1%   1.01s   1.51GiB  17.0%   155MiB
   opt         10    8.15s  15.4%   815ms     0.00B  0.00%    0.00B
   copy        10    1.62s  3.07%   162ms    874MiB  9.58%  87.4MiB
   build       10    325ms  0.62%  32.5ms    676MiB  7.41%  67.6MiB
 data          10    789ms  1.49%  78.9ms    617MiB  6.76%  61.7MiB
 ──────────────────────────────────────────────────────────────────

@blegat (Member) commented Jul 4, 2020

Is sparse taking a lot of time?
It seems we could create A' directly instead of I, J, V. For this, we need to call MOIU.canonical(f), since SparseMatrixCSC does not allow duplicates, and then add f as a new row of A' by adding an element to colptr and appending the new entries to rowval and nzval. Then we can do A = copy(transpose(A')) instead of A = sparse(I, J, V, ...). One small detail: SparseMatrixCSC is immutable, so we cannot modify A'.n and A'.m; we would probably need to store colptr, rowval, and nzval outside the SparseMatrixCSC structure, but the reasoning is the same.
Looking at the code of SparseArrays.halfperm!, it seems more efficient than sparse, because sparse needs to remove duplicates and sort, while in this new approach that is already done by MOIU.canonical.
I was wondering about this while thinking about the implementation of MatrixOptInterface and about improving the performance of SCS and the like (probably by using MatrixOptInterface), but I haven't tried it out yet.
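
A hedged sketch of that idea, assuming each constraint's terms are already canonical (sorted column indices, no duplicates, as MOIU.canonical would ensure); the names and the plain data representation are hypothetical:

using SparseArrays

# Build At column by column (one column of At per constraint row of A),
# then transpose once at the end; copy(transpose(...)) on a
# SparseMatrixCSC goes through the halfperm! machinery mentioned above.
function build_A(rows::Vector{Vector{Pair{Int,Float64}}}, n_vars::Int)
    colptr = Int[1]
    rowval = Int[]
    nzval = Float64[]
    for terms in rows
        for (var, coef) in terms  # assumed sorted and duplicate-free
            push!(rowval, var)
            push!(nzval, coef)
        end
        push!(colptr, length(rowval) + 1)
    end
    At = SparseMatrixCSC(n_vars, length(rows), colptr, rowval, nzval)
    return copy(transpose(At))  # materialize A in CSC form
end

# Example: the same two constraints over three variables as above.
A = build_A([[1 => 2.0, 3 => 1.5], [2 => -1.0]], 3)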

@joaquimg (Member, Author) commented Jul 4, 2020

Is sparse taking a lot of time?

It does take a reasonable amount of time: between 40 and 100 ms, which I'd guess is 25 to 40% of the copy time.

So yes, I think there is room for improvement.

Building colptr, rowval, and nzval directly might be good.

I vote to merge this PR and experiment with canonicalize + colptr, rowval, and nzval of At in a new PR. It seems that Clp would still require transposing as an extra step; on the other hand, other solvers take At instead of A, in which case it can be much better.

For Xpress, CPLEX, and Gurobi, we could start with this copy_to function for the linear functions and loop through the other constraints as we do today. Most of the time that we need high-performance loading it is an LP anyway. We can improve the loading of the other constraints as needed.

@joaquimg (Member, Author) commented Jul 4, 2020

One last thing: YES, I like the idea of doing that in MatrixOptInterface.
Besides sharing code among Clp, Xpress, Gurobi, CPLEX, SCS, etc.,
MatrixOI could also be a key component of a good implementation of the differentiation GSoC project.

@mtanneau (Contributor) commented Jul 4, 2020 via email

@joaquimg merged commit 8ea859e into master on Jul 9, 2020
@odow deleted the jb/perf branch on October 8, 2020