[Core] - AMGCL Solution differs with different number of iterations/threads #11763
Comments
How recently? I don't think the amgcl code in Kratos has been updated recently.
@ddemidov we found this out when we were debugging the test failures in the Optimization App in CI. So I think this started to fail one or two weeks ago (not sure exactly when).
My change affects only amgcl_ns, so I do not think it could change your result...
I am sorry, I don't have a working Kratos environment on my machine.
It's not too surprising that the max iteration count has an effect on the solutions if the solver fails to converge. What's more worrisome is that the number of threads has an impact as well.
@matekelemen In this case it makes a difference, since the warning says it has converged up to 1e-11 but cannot reach 1e-12. So the solution should be the same for both iteration limits, since both reach 1e-11 convergence.
I don't think the results should be the same at all. I assume the solver continues iterating until one of the following is satisfied:
1. the residual drops below the requested tolerance, or
2. the maximum number of iterations is reached.
Both cases terminate on condition 2, and since the max iteration limit is different, the solver performs a different number of iterations, so the results will obviously be different (see the sketch below). Why do you think otherwise?
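A minimal sketch of the stopping pattern assumed above (not Kratos or AMGCL code; the `step` callback standing in for one sweep of the actual method is hypothetical):

```python
import numpy as np

def iterative_solve(step, x0, b, tol=1e-12, max_iteration=100):
    """Generic skeleton: iterate until the tolerance OR the iteration limit is hit."""
    x = x0.copy()
    b_norm = np.linalg.norm(b)
    for it in range(1, max_iteration + 1):
        x, residual = step(x)               # one sweep of the actual method
        if residual / b_norm < tol:         # condition 1: converged to tolerance
            return x, it, True
    return x, max_iteration, False          # condition 2: iteration budget exhausted
```

If the loop always exits through the second return, the answer you get is simply "wherever the solver happened to be after max_iteration sweeps", so two different limits give two different vectors.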
@sunethwarna I am running your case with current master.
The issue is, the residual AMGCL prints says it converged to around 1e-11, so the solution should not be so different if I run it for a couple more iterations. Should it be? Am I missing something?
This started to happen around 1e-6; I set the tolerance to 1e-12 to rule out tolerance issues. The change in the norms is around 6000.0 (the printed value), which is not a small change between two solutions as far as I can see.
The worrying part is that we see different solutions on macOS than on Ubuntu and Manjaro. I will get an old version of Kratos and check again.
Could this be a slight difference in the underlying linear algebra implementation of each system?
Ok, I took a look at the actual result vector and that is worrying. This is the output printing sol_1 and sol_2.
There is a big difference in the results, even though the residual norm is small. I wonder where this is coming from...
The convergence criterion is the relative error, so could it be that the rhs norm is huge, which makes the differences between solutions relatively small?
@ddemidov The RHS values are between 0 and 0.5, and the LHS values are around 1e-1 to 1e-2.
I think the matrix's conditioning is more to blame (the condition number is ~5e13), so oscillations of the magnitude @ddiezrod shows are to be expected.
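To put numbers on that: for a relative residual r = ||b - A x|| / ||b||, the relative error in x is only bounded by cond(A) * r, so with cond(A) ~ 5e13 even a 1e-11 residual permits O(1) errors in the solution. A small numpy illustration with made-up values:

```python
import numpy as np

# Illustrative (made-up) 2x2 system with a huge condition number.
A = np.array([[1.0, 0.0],
              [0.0, 1.0e-13]])
b = np.array([1.0, 1.0e-13])          # exact solution is [1, 1]
print(np.linalg.cond(A))              # ~1e13, same order as the reported ~5e13

x_perturbed = np.array([1.0, 0.5])    # badly wrong in the second component...
r = np.linalg.norm(b - A @ x_perturbed) / np.linalg.norm(b)
print(r)                              # ...yet the relative residual is ~5e-14
```

So two iterates can both satisfy a tiny residual tolerance while being far apart as vectors.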
@matekelemen Yes, I was thinking about that. Does this system come from a real problem?
@ddiezrod This is a matrix which I took from the failing test of OptApp. When I switched to a different linear solver, the test started to pass (now it is failing in MeshMovingApplication, where AMGCL is again used).
Is the matrix scaled? There is an option for that in the builder and solver.
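For reference, this is not the Kratos builder-and-solver option itself, just a generic sketch of what symmetric diagonal (Jacobi) scaling means for a system like this:

```python
import numpy as np
import scipy.sparse as sp

def jacobi_scale(A, b):
    """Return D^-1/2 A D^-1/2, D^-1/2 b and D^-1/2, so that x = D^-1/2 @ y."""
    d = np.sqrt(np.abs(A.diagonal()))
    d[d == 0.0] = 1.0                   # guard against zero diagonal entries
    D_inv = sp.diags(1.0 / d)
    return D_inv @ A @ D_inv, D_inv @ b, D_inv
```

Scaling of this kind often brings the condition number down considerably, which is the point of the question.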
Following is our observation, and following are my concerns:
We are also at a loss here trying to identify the problem :/
AMGCL also manages to solve the system if you change the subsolver to conjugate gradients. I guess fine-tuning the solver is unavoidable with ill-conditioned systems like this.
I used to work on another Python project that also used hard-coded values as references for system tests. I also noticed small deviations across machines (even between different machines running the same Linux distro with the same updates). I never found out what the issue was :/
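Regarding switching the subsolver to conjugate gradients: a hedged sketch of how that can be requested through the Kratos linear solver factory. The key names ("krylov_type", "tolerance", "max_iteration") follow my recollection of the AMGCL wrapper settings and may differ between Kratos versions, so verify against your installation:

```python
import KratosMultiphysics
from KratosMultiphysics.python_linear_solver_factory import ConstructSolver

# Assumed settings; check the AMGCL solver defaults in your Kratos version.
settings = KratosMultiphysics.Parameters("""{
    "solver_type"   : "amgcl",
    "krylov_type"   : "cg",
    "tolerance"     : 1e-12,
    "max_iteration" : 500
}""")
linear_solver = ConstructSolver(settings)
```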
Can you check whether it is related to this default change: #11138?
@loumalouomega I will try it in the coming days and update here :)
Guys, on one side I agree with the comment about the system conditioning. A condition number of 1e13 implies the system is essentially undefined. Aside from that, please consider that when you do floating-point operations in parallel you lose predictability. Just think of adding 1e-4 + 1e10 + 1e-1: the result will be different depending on the order in which you do the sum, and that order is not guaranteed when you are in parallel.
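A quick illustration of that non-associativity (whether the exact numbers from the comment round differently depends on the machine, but the classic case below always does):

```python
# Floating-point addition is not associative: the grouping changes the rounding.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a, b, a == b)   # 0.6000000000000001 0.6 False

# A parallel reduction effectively picks an arbitrary grouping/order of the terms,
# so bitwise-identical results across different thread counts are not guaranteed.
```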
If that really is the issue here, I'm not sure how to deal with it.
I assume the solution will be to run these tests with a predefined number of threads (same as we do with MPI processes), but I don't know how to come up with robust tolerances.
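One hedged way to pin the thread count for such tests is to fix the standard OpenMP environment variable before Kratos is imported; whether the test runner honours this depends on how the tests are launched:

```python
import os

# Must be set before the OpenMP runtime is initialised, i.e. before importing Kratos.
os.environ["OMP_NUM_THREADS"] = "1"

import KratosMultiphysics  # noqa: E402  (import after setting the environment)
```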
Description
Recently, AMGCL started to give totally different solutions (refer to sol_1 and sol_2) for different values of max_iteration. It throws a warning saying it has not converged to 1e-12 (but it has converged to 1e-11, which should be close enough for the difference between the solutions to be small). The difference between the solutions is very large, which was causing one of the tests to fail in CI (refer to #11760). If you reduce the tolerance to 1e-10, then the difference between the two solutions (sol_1 and sol_2) is 0.0.
Following is the script to replicate the bug (a rough sketch of its structure is given below). I am attaching A.mm, b.mm.rhs and the python script in the zip file: data.zip
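The actual test_linear_solver.py lives in the attached zip and is not reproduced here. The following is a rough scipy-based sketch of what it presumably does: the file names come from the description, the solver, tolerance and iteration limits are placeholders, and the real script presumably goes through the Kratos AMGCL wrapper rather than scipy.

```python
import numpy as np
import scipy.io
import scipy.sparse
import scipy.sparse.linalg as spla

# Read the attached system (assuming both files are in MatrixMarket format).
A = scipy.io.mmread("A.mm").tocsr()
b = scipy.io.mmread("b.mm.rhs")
b = b.toarray().ravel() if scipy.sparse.issparse(b) else np.asarray(b).ravel()

def solve(max_iteration):
    # Placeholder Krylov solver; the issue uses AMGCL with tolerance 1e-12.
    x, _ = spla.gmres(A, b, maxiter=max_iteration)
    return x

sol_1 = solve(100)   # the two iteration limits here are placeholders
sol_2 = solve(200)
print(np.linalg.norm(sol_1 - sol_2))   # the report expects this to print 0.0
```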
Scope
To Reproduce
Unzip the contents of the attached zip file and run test_linear_solver.py.
Expected behavior
To print 0.0.
Environment
@roigcarlo @matekelemen @Igarizza