
[CORE] Fix memory issue in ublas_space.h #12761

Merged — 1 commit merged into master on Oct 17, 2024

Conversation

@rfaasse (Contributor) commented Oct 17, 2024

📝 Description
In our TeamCity windows pipeline, tests started failing with a memory violation error. When tracking down the place where invalid memory was accessed, it turned out to be in ublas_space.h in the CheckAndCorrectZeroDiagonalValues function.

In that function, the values of matrix rA are fetched through a double*, initialized with the begin iterator of rA.value_data(). However, rA is modified inside the loop (the IndexPartition on lines 723-734), which means rA.value_data() changes as well. As a result, the underlying vector (obtained via value_data()) may have to be re-allocated when values are added that no longer fit in the allocated/reserved space. Indeed, when printing rA.value_data().size() inside the loop, I saw the size double (which points to the vector being re-allocated) just before the crash. The crash then occurs because the double* Avalues is invalidated when the underlying vector is re-allocated.

This PR fixes the issue by storing the vector reference returned by value_data() instead of using a double*. Unlike the raw pointer, this reference is not invalidated when the vector is re-allocated. Note that no copy is made here: value_data(), index1_data() and index2_data() return references, which we store as references (not by value).

Since using a double* pointing at the begin iterator is very tricky and can lead to hard-to-find issues like this one (it only surfaces in very specific situations), I also applied the same strategy to the other places in this file where this pattern occurred.

@rfaasse rfaasse added the Hotfix label Oct 17, 2024
@rfaasse rfaasse self-assigned this Oct 17, 2024
@rfaasse rfaasse requested a review from a team as a code owner October 17, 2024 07:34
@rfaasse (Contributor, Author) commented Oct 17, 2024

FYI @roigcarlo, I don't know whether this fix would help in the pipeline issue for windows you are looking into, but it's probably worth a shot trying this fix on your branch (since it is quite an elusive/subtle issue that only happens in very specific situations).

@loumalouomega (Member) left a comment

The original implementation is mine; if CI passes I will be okay. By the way, a similar approach is followed in the other B&S, so we may need to take a look.

@roigcarlo (Member) commented

> FYI @roigcarlo, I don't know whether this fix would help in the pipeline issue for windows you are looking into, but it's probably worth a shot trying this fix on your branch (since it is quite an elusive/subtle issue that only happens in very specific situations).

Ty, will try it, but I am afraid the problem is related to IO and file descriptors.

@rfaasse (Contributor, Author) commented Oct 17, 2024

> The original implementation is mine; if CI passes I will be okay. By the way, a similar approach is followed in the other B&S, so we may need to take a look.

@loumalouomega Since the pipelines are passing, could you approve this PR? I'm not sure of the procedure here: does someone from the @KratosMultiphysics/technical-committee also need to approve, or is it enough if you do it (since you created this code in the first place)?

@roigcarlo (Member) commented

Looks good to me as well, but please @RiccardoRossi take a look.

```diff
@@ -710,8 +710,8 @@ class UblasSpace
 {
     const std::size_t system_size = rA.size1();

-    const double* Avalues = rA.value_data().begin();
-    const std::size_t* Arow_indices = rA.index1_data().begin();
+    const auto& Avalues = rA.value_data();
```
Member commented:

In my opinion the incriminating change is on line 737, as that is the only point at which you may trigger a reallocation. If no reallocation happens, then the current code should be ok... or at least I cannot understand why it would not be so.

This is not to say that I dislike the change, simply that if a reallocation happens on 737, then the code is still not threadsafe...

Contributor (Author) commented:

That's a good point; I can imagine that writing to the A matrix in a parallel loop, while using the underlying value_data, could still lead to issues. I think this needs to be investigated further.

However, in our test cases the re-allocation consistently caused a segfault/memory access violation, and with this change they consistently pass (since at least the reference is not invalidated like the double* was before).

Therefore, I'd propose to merge this PR for now, since it fixes our pipelines (which are blocking our development at the moment); afterwards there can be an investigation into whether we need other changes as well to make this part more secure/threadsafe.

Member commented:

To clarify, the point is that if you do A(i,i) and that row does not have the diagonal, then A(i,i) reallocates the matrix.

Aside from not being threadsafe, this is VERY expensive... so I don't think we should allow it, but rather avoid using the (i,j) operator within the loop.

Contributor commented:

Line 737:

```cpp
rA(Index, Index) = scale_factor;
```

I don't see how this would trigger a reallocation. Also, Index is unique in each iteration, so writing to that component is threadsafe.

I think there might be something fishy going on when iterators are converted to raw pointers.

@rfaasse (Contributor, Author) commented Oct 17, 2024

> To clarify, the point is that if you do A(i,i) and that row does not have the diagonal, then A(i,i) reallocates the matrix.
>
> Aside from not being threadsafe, this is VERY expensive... so I don't think we should allow it, but rather avoid using the (i,j) operator within the loop.

Completely agreed, I think this function should be refactored to improve both the performance and safety, but that will take more time and thorough testing.

Do you think it is acceptable to merge this PR to unblock the pipelines we have for geomechanics, such that the refactoring can be done without time pressure? Since the change in this PR makes the function at least somewhat more robust, and the function has worked like this for quite a long time, I would prefer this approach. At the moment, the failing pipelines are blocking our development.

Contributor commented:

@rfaasse if the insertion in parallel is the true cause of this segfault, then this PR doesn't fix it and your pipeline just dodged the race condition by chance. Merging this will not change that.

If you want to merge this PR as quickly as possible regardless (I'm fine with that), please open that separate issue and link this PR so the problem won't be forgotten.

Member commented:

@RiccardoRossi Just to re-iterate, the current state is blocking the development of the @KratosMultiphysics/geomechanics team. This PR at least gets rid of the implicit conversion from iterator to pointer and fixes the pipelines at our end. Therefore, can we merge this PR and investigate more in-depth at a later moment by creating a separate issue?

Hi @rfaasse, I approved the PR; nevertheless, let me tell you that this is not a fix.

The problem is that if one of the threads reallocates the matrix (by doing rA(i,i) in a case in which it does not exist), the other threads will have a conflict, as the reallocation will happen while they are writing. This is possibly even worse to find than the current behaviour...

Member commented:

Sorry to be so verbose; this is just to copy here how find_element works, because at that point we can also do only what is needed directly in code:

```cpp
const_pointer find_element (size_type i, size_type j) const {
    size_type element1 (layout_type::index_M (i, j));
    size_type element2 (layout_type::index_m (i, j));
    if (filled1_ <= element1 + 1)
        return 0;
    vector_const_subiterator_type itv (index1_data_.begin () + element1);
    const_subiterator_type it_begin (index2_data_.begin () + zero_based (*itv));
    const_subiterator_type it_end (index2_data_.begin () + zero_based (*(itv + 1)));
    const_subiterator_type it (detail::lower_bound (it_begin, it_end, k_based (element2), std::less<size_type> ()));
    if (it == it_end || *it != k_based (element2))
        return 0;
    return &value_data_ [it - index2_data_.begin ()];
}
```

Contributor (Author) commented:

I think there are actually two related problems here. One is that the double* is invalidated upon re-allocation of the underlying vector, regardless of parallel vs. serial execution. By using a reference, we at least still point to the right memory in a serial loop after the vector has re-allocated. My guess is that this was the issue for our pipeline, since it was consistently failing before and consistently passes with the fix (but that is just an assumption).

The second issue is, as you all rightly pointed out, that this implementation is not thread-safe; for that I'll create an issue and link this PR and this discussion. Thank you for all the input, and see you in the next one!

Member commented:

Interesting stuff

There should be no reallocation any more once it's fixed properly. I'd be curious whether it would still fail in your pipelines after reverting this PR.

Can you please include me in the next PR?

@rfaasse rfaasse merged commit 574bd67 into master Oct 17, 2024
9 of 11 checks passed
@rfaasse rfaasse deleted the core/fix-memory-issue-after-vector-reallocation branch October 17, 2024 18:32