[Core] - AMGCL Solution differs with different number of iterations/threads #11763
Comments
How recently? I don't think the amgcl code in Kratos has been updated recently.
@ddemidov we found this out when we were debugging the test failures in the Optimization App in CI. So I think this started to fail one or two weeks ago (not sure exactly when).
My change affects only amgcl_ns, so I do not think it could change your result...
I am sorry, I don't have a working Kratos environment on my machine.
It's not too surprising that the max iteration count has an effect on the solutions if the solver fails to converge. What's more worrisome is that the number of threads has an impact as well.
@matekelemen In this case it makes a difference, since the warning says it has converged up to 1e-11 but cannot reach 1e-12. So the solution should be the same for both iteration limits, since both reach 1e-11 convergence.
I don't think the results should be the same at all. I assume the solver continues iterating until one of the following is satisfied:
1. the residual drops below the requested tolerance, or
2. the maximum number of iterations is reached.
Both cases terminate on condition 2, and since the max iteration limit is different, the solver performs a different number of iterations, so the results will obviously be different (see the sketch below). Why do you think otherwise?
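A minimal sketch of the stopping pattern assumed above (not Kratos or AMGCL code; the `step` callback standing in for one sweep of the actual method is hypothetical):

```python
import numpy as np

def iterative_solve(step, x0, b, tol=1e-12, max_iteration=100):
    """Generic skeleton: iterate until the tolerance OR the iteration limit is hit."""
    x = x0.copy()
    b_norm = np.linalg.norm(b)
    for it in range(1, max_iteration + 1):
        x, residual = step(x)               # one sweep of the actual method
        if residual / b_norm < tol:         # condition 1: converged to tolerance
            return x, it, True
    return x, max_iteration, False          # condition 2: iteration budget exhausted
```

If the loop always exits through the second return, the answer you get is simply "wherever the solver happened to be after max_iteration sweeps", so two different limits give two different vectors.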
@sunethwarna I am running your case with current master.
The issue is, the residual AMGCL prints says it converged to around 1e-11, so the solution should not be so different if I run it for a couple more iterations. Should it be? Am I missing something?
This started to happen around 1e-6; I set the tolerance to 1e-12 to rule out tolerance issues. The change in the norms is around 6000.0 (the printed value), which is not a small change between two solutions as far as I can see.
The worrying part is that we see different solutions on macOS than on Ubuntu and Manjaro. I will get an old version of Kratos and check again.
Could this be a slight difference in the underlying linear algebra implementation of each system?
Ok, I took a look at the actual result vector and that is worrying. This is the output printing sol_1 and sol_2.
There is a big difference in the results, even though the residual norm is small. I wonder where this is coming from...
The convergence criterion is the relative error, so could it be that the rhs norm is huge, which makes the differences between solutions relatively small?
@ddemidov The RHS values are between 0 and 0.5, and the LHS values are around 1e-1 to 1e-2.
I think the matrix's conditioning is more to blame (the condition number is ~5e13), so oscillations of the magnitude @ddiezrod shows are to be expected.
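To put numbers on that: for a relative residual r = ||b - A x|| / ||b||, the relative error in x is only bounded by cond(A) * r, so with cond(A) ~ 5e13 even a 1e-11 residual permits O(1) errors in the solution. A small numpy illustration with made-up values:

```python
import numpy as np

# Illustrative (made-up) 2x2 system with a huge condition number.
A = np.array([[1.0, 0.0],
              [0.0, 1.0e-13]])
b = np.array([1.0, 1.0e-13])          # exact solution is [1, 1]
print(np.linalg.cond(A))              # ~1e13, same order as the reported ~5e13

x_perturbed = np.array([1.0, 0.5])    # badly wrong in the second component...
r = np.linalg.norm(b - A @ x_perturbed) / np.linalg.norm(b)
print(r)                              # ...yet the relative residual is ~5e-14
```

So two iterates can both satisfy a tiny residual tolerance while being far apart as vectors.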
@matekelemen Yes, I was thinking about that. Does this system come from a real problem?
@ddiezrod This is a matrix which I took from the failing test of OptApp. When I switched to a different linear solver, the test started to pass (now it is failing in MeshMovingApplication, where AMGCL is again used).
Is the matrix scaled? There is an option for that in the builder and solver.
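For reference, this is not the Kratos builder-and-solver option itself, just a generic sketch of what symmetric diagonal (Jacobi) scaling means for a system like this:

```python
import numpy as np
import scipy.sparse as sp

def jacobi_scale(A, b):
    """Return D^-1/2 A D^-1/2, D^-1/2 b and D^-1/2, so that x = D^-1/2 @ y."""
    d = np.sqrt(np.abs(A.diagonal()))
    d[d == 0.0] = 1.0                   # guard against zero diagonal entries
    D_inv = sp.diags(1.0 / d)
    return D_inv @ A @ D_inv, D_inv @ b, D_inv
```

Scaling of this kind often brings the condition number down considerably, which is the point of the question.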
Following is our observation, and following are my concerns:
We are also at a loss here trying to identify the problem :/
AMGCL also manages to solve the system if you change the subsolver to conjugate gradients. I guess fine-tuning the solver is unavoidable with ill-conditioned systems like this.
I used to work on another Python project that also used hard-coded values as references for system tests. I also noticed small deviations across machines (even between different machines running the same Linux distro with the same updates). I never found out what the issue was :/
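Regarding switching the subsolver to conjugate gradients: a hedged sketch of how that can be requested through the Kratos linear solver factory. The key names ("krylov_type", "tolerance", "max_iteration") follow my recollection of the AMGCL wrapper settings and may differ between Kratos versions, so verify against your installation:

```python
import KratosMultiphysics
from KratosMultiphysics.python_linear_solver_factory import ConstructSolver

# Assumed settings; check the AMGCL solver defaults in your Kratos version.
settings = KratosMultiphysics.Parameters("""{
    "solver_type"   : "amgcl",
    "krylov_type"   : "cg",
    "tolerance"     : 1e-12,
    "max_iteration" : 500
}""")
linear_solver = ConstructSolver(settings)
```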
Can you check whether it is related to this default change: #11138?
@loumalouomega I will try it in the coming days and update here :)
Guys, on one side I agree with the comment about the system conditioning. A condition number of 1e13 implies the system is essentially undefined. Aside from that, please consider that when you do floating-point operations in parallel you lose predictability. Just think of adding 1e-4 + 1e10 + 1e-1: the result will be different depending on the order in which you do the sum, and that order is not guaranteed when you are in parallel.
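A quick illustration of that non-associativity (whether the exact numbers from the comment round differently depends on the machine, but the classic case below always does):

```python
# Floating-point addition is not associative: the grouping changes the rounding.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a, b, a == b)   # 0.6000000000000001 0.6 False

# A parallel reduction effectively picks an arbitrary grouping/order of the terms,
# so bitwise-identical results across different thread counts are not guaranteed.
```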
If that really is the issue here, I'm not sure how to deal with it.
I assume the solution will be to run these tests with a predefined number of threads (same as we do with MPI processes), but I don't know how to come up with robust tolerances.
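One hedged way to pin the thread count for such tests is to fix the standard OpenMP environment variable before Kratos is imported; whether the test runner honours this depends on how the tests are launched:

```python
import os

# Must be set before the OpenMP runtime is initialised, i.e. before importing Kratos.
os.environ["OMP_NUM_THREADS"] = "1"

import KratosMultiphysics  # noqa: E402  (import after setting the environment)
```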
Description
Recently, AMGCL started to give totally different solutions (refer to sol_1 and sol_2) for different values of max_iteration. It throws a warning saying it has not converged to 1e-12 (but it has converged to 1e-11, which should be close enough for the difference between the solutions to be small). The difference between the solutions is very large, which was causing one of the tests to fail in CI (refer to #11760). If you reduce the tolerance to 1e-10, then the difference between the two solutions (sol_1 and sol_2) is 0.0.
Following is the script to replicate the bug (a rough sketch of its structure is given below). I am attaching A.mm, b.mm.rhs and the python script in the zip file: data.zip
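The actual test_linear_solver.py lives in the attached zip and is not reproduced here. The following is a rough scipy-based sketch of what it presumably does: the file names come from the description, the solver, tolerance and iteration limits are placeholders, and the real script presumably goes through the Kratos AMGCL wrapper rather than scipy.

```python
import numpy as np
import scipy.io
import scipy.sparse
import scipy.sparse.linalg as spla

# Read the attached system (assuming both files are in MatrixMarket format).
A = scipy.io.mmread("A.mm").tocsr()
b = scipy.io.mmread("b.mm.rhs")
b = b.toarray().ravel() if scipy.sparse.issparse(b) else np.asarray(b).ravel()

def solve(max_iteration):
    # Placeholder Krylov solver; the issue uses AMGCL with tolerance 1e-12.
    x, _ = spla.gmres(A, b, maxiter=max_iteration)
    return x

sol_1 = solve(100)   # the two iteration limits here are placeholders
sol_2 = solve(200)
print(np.linalg.norm(sol_1 - sol_2))   # the report expects this to print 0.0
```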
Scope
To Reproduce
Unzip the contents of the attached zip file and run test_linear_solver.py.
Expected behavior
To print 0.0.
Environment
@roigcarlo @matekelemen @Igarizza