Performance improvements #94
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master      #94      +/-   ##
==========================================
+ Coverage   54.41%   56.44%   +2.02%
==========================================
  Files           3        3
  Lines         408      427      +19
==========================================
+ Hits          222      241      +19
  Misses        186      186
Continue to review full report at Codecov.
Nice! Let me put the right copyright header on my snippet before you commit this.
Never mind about the copyright. At first glance it looked like this was a copy of the facility location benchmark I wrote, but it's solving a random LP instead.
src/MOI_wrapper/MOI_wrapper.jl
add_sizehint!(I, n_terms)
add_sizehint!(J, n_terms)
add_sizehint!(V, n_terms)
for c_index in list
Why are we doing the loop twice? Is it for cache friendliness as we modify different vectors?
The disadvantage is that we get the function twice.
I agree. How much does this particular optimization help?
Doing it twice lets us preallocate slots in I, J and V.
It does not seem to make the code any more or less readable.
It is responsible for 30% of the overall speedup.
An extra step would be to cache f as well; I was thinking it might be way too much, but that can give an extra 10%:
──────────────────────────────────────────────────────────────────
                          Time                   Allocations
                  ──────────────────────   ───────────────────────
Tot / % measured:      52.8s / 100%            8.91GiB / 100%

Section   ncalls     time   %tot     avg     alloc   %tot      avg
──────────────────────────────────────────────────────────────────
bcs + v       10    10.9s  20.7%   1.09s   2.11GiB  23.7%   216MiB
  opt         10    10.2s  19.3%   1.02s    938MiB  10.3%  93.8MiB
  build       10    744ms  1.41%  74.4ms   1.19GiB  13.4%   122MiB
bcs           10    10.6s  20.0%   1.06s   1.58GiB  17.7%   161MiB
  opt         10    10.1s  19.1%   1.01s    938MiB  10.3%  93.8MiB
  build       10    454ms  0.86%  45.4ms    677MiB  7.43%  67.7MiB
bc + s        10    10.3s  19.4%   1.03s   1.53GiB  17.1%   156MiB
  opt         10    8.31s  15.7%   831ms     0.00B  0.00%    0.00B
  copy        10    1.72s  3.25%   172ms    887MiB  9.73%  88.7MiB
  build       10    212ms  0.40%  21.2ms    676MiB  7.41%  67.6MiB
cs            10    10.1s  19.2%   1.01s   1.58GiB  17.7%   161MiB
  opt         10    9.80s  18.6%   980ms    938MiB  10.3%  93.8MiB
  build       10    332ms  0.63%  33.2ms    677MiB  7.43%  67.7MiB
c + s         10    10.1s  19.1%   1.01s   1.51GiB  17.0%   155MiB
  opt         10    8.15s  15.4%   815ms     0.00B  0.00%    0.00B
  copy        10    1.62s  3.07%   162ms    874MiB  9.58%  87.4MiB
  build       10    325ms  0.62%  32.5ms    676MiB  7.41%  67.6MiB
data          10    789ms  1.49%  78.9ms    617MiB  6.76%  61.7MiB
──────────────────────────────────────────────────────────────────
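For context, here is a minimal sketch of the two-pass pattern discussed above, assuming scalar affine constraints and using Base's sizehint! in place of the PR's add_sizehint! helper; the function name fill_triplets is hypothetical and the term field names follow the MOI API of that era, so treat this as an illustration rather than the PR's actual code:

import MathOptInterface
const MOI = MathOptInterface

# Hypothetical sketch, not the PR's actual code.
function fill_triplets(model, list)  # list: constraint indices, one per row
    I, J, V = Int[], Int[], Float64[]
    # Pass 1: count the total number of affine terms.
    n_terms = 0
    for c_index in list
        f = MOI.get(model, MOI.ConstraintFunction(), c_index)
        n_terms += length(f.terms)
    end
    # Preallocate so the push!es below never trigger a reallocation.
    sizehint!(I, n_terms)
    sizehint!(J, n_terms)
    sizehint!(V, n_terms)
    # Pass 2: get each function a second time and fill the triplets.
    # (The "cache f" variant mentioned above would store the functions
    # from pass 1 instead of getting them again, trading memory for time.)
    for (row, c_index) in enumerate(list)
        f = MOI.get(model, MOI.ConstraintFunction(), c_index)
        for term in f.terms
            push!(I, row)
            push!(J, term.variable_index.value)
            push!(V, term.coefficient)
        end
    end
    return I, J, V
end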
Actually, it's VERY important for the bridged cache with a separate solver case!
I will commit it!
Now bridging overhead is basically zero, and copy is less than 20% of solve time.
Indeed, I missed the computation of n_terms; it makes sense then.
Why is getting functions slow? Isn't the bridged cache just a MOIU.AbstractModel?
Not sure why; yes, it should be.
I don't think it's especially slow, though. The loop is just tight.
With the latest commit, the bridging time is basically always zero.
It does take some reasonable time, so yes, I think there is room for improvement. Building colptr, rowval and nzval directly might be good. I vote to merge this PR and experiment with canonicalize + building colptr, rowval and nzval of At in a new PR. Although it seems that Clp would still require transposing as an extra step, other solvers take At instead of A, in which case it can be much better. For Xpress, CPLEX and Gurobi, we could start with this copy_to function for the linear function and loop through the other constraints as we do today. Most of the time, when we need high-performance loading it is an LP anyway. We can improve the loading of the other constraints as needed.
Last thing, YES, I like the idea of doing that in MatrixOptInterface.
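As an illustration of the colptr/rowval/nzval point above (plain SparseArrays, not this repo's code): building At directly from the (I, J, V) triplets skips the explicit transpose, which is what makes the At-based path attractive for solvers that accept it.

using SparseArrays

I = [1, 1, 2]         # constraint (row) indices of A
J = [1, 2, 2]         # variable (column) indices of A
V = [1.0, 2.0, 3.0]   # coefficients
m, n = 2, 2           # size of A

A  = sparse(I, J, V, m, n)   # A in CSC form
At = sparse(J, I, V, n, m)   # A' in CSC form, built directly

# The raw CSC arrays a solver would consume:
At.colptr, At.rowval, At.nzval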
In Tulip, I keep both A and A’ in memory, see
https://github.com/ds4dm/Tulip.jl/blob/master/src/Core/problemData.jl
A list of rows and columns of A is kept up-to-date. It allows for fast
column- and row-based operations, especially if the input is already sorted.
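A minimal sketch of that keep-both-representations idea, with assumed names rather than Tulip's actual types:

using SparseArrays

# Hypothetical container: A gives cheap column access, At gives
# cheap row access (columns of At are rows of A); both must be
# kept in sync when the problem data changes.
struct RowColData
    A::SparseMatrixCSC{Float64,Int}
    At::SparseMatrixCSC{Float64,Int}
end

RowColData(A::SparseMatrixCSC{Float64,Int}) =
    RowColData(A, copy(transpose(A)))

data = RowColData(sparse([1, 2], [1, 2], [1.0, 2.0], 2, 2))
col1 = data.A[:, 1]    # column-based operation on A
row1 = data.At[:, 1]   # row-based operation via At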
I added that benchmark file: bench/runbench.jl
There are major speedups with these little changes.
Current master:
After this PR:
cc @odow @mlubin @blegat