Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix matrix_exp_multiply accuracy bug #2619

Merged
merged 35 commits into from
May 23, 2023
Merged

fix matrix_exp_multiply accuracy bug #2619

merged 35 commits into from
May 23, 2023

Conversation

yizhang-yiz
Copy link
Contributor

Summary

Fix #2529 by re-implementing the prim version of the function.

Tests

  • test matrix_exp_multiply_handle components that calculate the taylor expansion order and helper function approximating matrix power norm.
  • test matrices from bug in scale_matrix_exp_multiply #2529.

Side Effects

n/a

Release notes

Bugfix: matrix_exp_multiply function's accuracy issue.

Checklist

  • Math issue #(issue number)

  • Copyright holder: Metrum Research Group.
    By submitting this pull request, the copyright holder is agreeing to the license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • dependencies checks pass, (make test-math-dependencies)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

@yizhang-yiz
Copy link
Contributor Author

I'm not sure how to interpret the failure of expression test (tests7_test.cpp). @rok-cesnovar can you take a look?

@rok-cesnovar
Copy link
Member

If I run it locally for the matrix_exp_multiply only (python3 runTests.py test/expressions/ --only-function=matrix_exp_multiply, I can see the following error:

Running main() from lib/benchmark_1.5.1/googletest/googletest/src/gtest_main.cc
[==========] Running 3 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 1 test from ExpressionTestPrim
[ RUN      ] ExpressionTestPrim.matrix_exp_multiply0
test/expressions/tests0_test.cpp:24: Failure
Expected: (matrix1_expr3_counter) <= (int9), actual: 2 vs 1
[  FAILED  ] ExpressionTestPrim.matrix_exp_multiply0 (0 ms)
[----------] 1 test from ExpressionTestPrim (0 ms total)

It seemed as if the second matrix would be an Eigen expression it would be evaluated twice. And it was, see last commit for the fix.

Copy link
Member

@rok-cesnovar rok-cesnovar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly have zero knowledge of these algorithms, so cant review the changes in the sense of correctness, here are just a few general code-related comments.

stan/math/prim/fun/matrix_exp_action_handler.hpp Outdated Show resolved Hide resolved
stan/math/prim/fun/matrix_exp_action_handler.hpp Outdated Show resolved Hide resolved
stan/math/prim/fun/matrix_exp_action_handler.hpp Outdated Show resolved Hide resolved
stan/math/prim/fun/matrix_exp_action_handler.hpp Outdated Show resolved Hide resolved
@yizhang-yiz
Copy link
Contributor Author

I was able to reproduce the expression test failure on linux but not on mac. @rok-cesnovar can you take another look? Thanks.

@rok-cesnovar
Copy link
Member

Will check tomorrow.

@rok-cesnovar
Copy link
Member

I am unable to replicate locally (even with clang++-6.0) on Ubuntu.

My next guess would be that the instance runs out of memory compiling 15 tests in parallel or something along these lines. Not sure why that would happen after this change, but who knows. Decreasing the parallel -j setting in the tests should be simple to try.

We are in the middle of a move of our testing infrastructure, so there will be a bit of downtime or a period of unrelated issues popping up, so testing on the actual instances will be a bit difficult in the next few day, but will make sure to test this as soon we are back up.

@yizhang-yiz
Copy link
Contributor Author

@rok-cesnovar thanks. We can continue after the migration.

@rok-cesnovar
Copy link
Member

Getting back to this finally.

  • For the GHA error with Windows tests, I think what we need to do is Add the CXX flag for only running rev tests with the test_ad #2630

    I would prefer closing Moving to C++14 #2489 and bumping the minimum g++, but that seems unlikely to happen very soon (see thread).

  • For the Jenkins tests, I am going to debug on this PR. I am going to force-push a bit. I apologize in advance for the spam e-mails and force pushing, but I just don't see another way of getting to the bottom of this, seeing was no one can reproduce this one locally.

// L1 norm
double normA = mat.colwise().template lpNorm<1>().maxCoeff();

if (normA < tol || t < tol) {
m = 0;
s = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd want to eagerly return here so it's clear that there's no more that happens below it.

That will unnest the lower else statement and it'll be slightly easier to read.

const std::vector<double>& theta = theta_m_double;
Eigen::VectorXd alpha(p_max - 1);

if (normA
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for efficiency, I'd move the / (m_max * b.cols()) to the left hand side by * (m_max * b.cols()).

m = m_max;
const std::vector<double>& theta = theta_m_double;
Eigen::VectorXd alpha(p_max - 1);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put whitespace above 185 and remove the whitespace in 186.

The output of the if statement here is alpha. We want to keep that declaration close together if possible for readability.

@syclik
Copy link
Member

syclik commented May 19, 2023

@yizhang-yiz, for what it's worth, it's implemented really cleanly relative to the paper. The naming convention follows and is close. It looks right to me.

syclik
syclik previously approved these changes May 19, 2023
Copy link
Member

@syclik syclik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good to me.

syclik
syclik previously approved these changes May 19, 2023
Copy link
Member

@syclik syclik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@yizhang-yiz
Copy link
Contributor Author

@syclik I completely let this slip through. Thanks for fixing the PR!

@syclik
Copy link
Member

syclik commented May 22, 2023

Looks like the "fix" I introduced causes it to have linking issues with multiple translation units.

From Jenkins https://jenkins.flatironinstitute.org/blue/organizations/jenkins/Stan%2FMath/detail/PR-2619/11/pipeline/214:
image

I believe this is real. I'll try to address it in a different way.

@syclik
Copy link
Member

syclik commented May 22, 2023

I'm looking at the intent of the code and I think the optimization is too much effort relative to the difficulty in getting it working reliably across C++ versions.

To be specific, there are two major optimizations happening. The first is that the variable is a static class member variable. I believe the idea is that on multiple uses of the class these constants don't have to be reallocated with the class. This, in and of itself, is great, but it interacts with the second thing.

The second is the use of constexpr and there are some differences in the C++ across the pre C++-17 / C++-17 boundary. This interacts with the static keyword in a way that requires the definition of the static member variable. And this is what is the last error posted -- multiple translation units can not have the same definition. We could define the variable in a central place like wherever we define the int main() function, but we're not really set up to do this in the math library well.

So... I think we can simplify this a bit and lose this optimization. Note: this is for 3 primitives only. I doubt it'd be that much of a difference in practice.

@syclik
Copy link
Member

syclik commented May 22, 2023

Also, this would be dominated by the allocation of the theta_m_single and theta_m_double variables.

@syclik syclik merged commit 1871303 into develop May 23, 2023
@syclik syclik deleted the bugfix_issue_2529 branch May 23, 2023 12:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug in scale_matrix_exp_multiply
5 participants