
[CORE] Fix memory issue in ublas_space.h #12761

Merged — 1 commit merged into master on Oct 17, 2024

Conversation

@rfaasse (Contributor) commented Oct 17, 2024

📝 Description
In our TeamCity windows pipeline, tests started failing with a memory violation error. When tracking down the place where invalid memory was accessed, it turned out to be in ublas_space.h in the CheckAndCorrectZeroDiagonalValues function.

In that function, the values of matrix rA are fetched through a double*, initialized with the begin iterator of rA.value_data(). However, rA is modified inside the loop (the IndexPartition on lines 723-734), which means rA.value_data() changes as well. As a result, the underlying vector (obtained via value_data()) may have to be re-allocated when values are added that no longer fit in the allocated/reserved space. Indeed, when printing rA.value_data().size() inside the loop, I saw the size double (which points to the vector being re-allocated) just before the crash. The crash then occurs because the double* Avalues is invalidated when the underlying vector is re-allocated.

This PR fixes the issue by storing the vector reference returned by value_data() instead of using a double*. Unlike the raw pointer, this reference is not invalidated when the vector is re-allocated. Note that no copy is made here: value_data(), index1_data() and index2_data() return references, which we store as references (not by value).

Since using a double* pointing at the begin iterator is very tricky and can lead to hard-to-find issues like this one (it only surfaces in very specific situations), I also applied the same strategy to the other places in this file where this pattern occurred.

@rfaasse rfaasse added the Hotfix label Oct 17, 2024
@rfaasse rfaasse self-assigned this Oct 17, 2024
@rfaasse rfaasse requested a review from a team as a code owner October 17, 2024 07:34
@rfaasse (Contributor, Author) commented Oct 17, 2024

FYI @roigcarlo, I don't know whether this fix would help in the pipeline issue for windows you are looking into, but it's probably worth a shot trying this fix on your branch (since it is quite an elusive/subtle issue that only happens in very specific situations).

@loumalouomega (Member) left a comment

The original implementation is mine; if CI passes I will be okay. By the way, a similar approach is followed in the other B&S, so we may need to take a look.

@roigcarlo (Member) commented

> FYI @roigcarlo, I don't know whether this fix would help in the pipeline issue for windows you are looking into, but it's probably worth a shot trying this fix on your branch (since it is quite an elusive/subtle issue that only happens in very specific situations).

Ty, will try it, but I am afraid the problem is related to IO and file descriptors.

@rfaasse (Contributor, Author) commented Oct 17, 2024

> The original implementation is mine; if CI passes I will be okay. By the way, a similar approach is followed in the other B&S, so we may need to take a look.

@loumalouomega Since the pipelines are passing, could you approve this PR? I'm not sure of the procedure here: does someone from the @KratosMultiphysics/technical-committee also need to approve, or is it enough if you do it (since you created this code in the first place)?

@roigcarlo (Member) commented

Looks good to me as well, but please @RiccardoRossi take a look.

```diff
@@ -710,8 +710,8 @@ class UblasSpace
 {
     const std::size_t system_size = rA.size1();

-    const double* Avalues = rA.value_data().begin();
-    const std::size_t* Arow_indices = rA.index1_data().begin();
+    const auto& Avalues = rA.value_data();
```
Member commented:

In my opinion the incriminating change is on line 737, as that is the only point at which you may trigger a reallocation. If no reallocation happens, then the current code should be ok... or at least I cannot understand why it would not be so.

This is not to say that I dislike the change, simply that if a reallocation happens on 737, then the code is still not threadsafe...

Contributor (Author) commented:

That's a good point; I can imagine that writing to the A matrix in a parallel loop, while using the underlying value_data, could still lead to issues. I think this needs to be investigated further.

However, in our test cases the re-allocation consistently caused a segfault/memory access violation, and with this change they consistently pass (since at least the reference is not invalidated like the double* was before).

Therefore, I'd propose to merge this PR for now, since it fixes our pipelines (which are blocking our development at the moment); afterwards there can be an investigation into whether we need other changes as well to make this part more secure/threadsafe.

Member commented:

To clarify, the point is that if you do A(i,i) and that row does not have the diagonal, then A(i,i) reallocates the matrix.

Aside from not being threadsafe, this is VERY expensive... so I don't think we should allow it, but rather avoid using the (i,j) operator within the loop.

Contributor commented:

Line 737:

```cpp
rA(Index, Index) = scale_factor;
```

I don't see how this would trigger a reallocation. Also, Index is unique in each iteration, so writing to that component is threadsafe.

I think there might be something fishy going on when iterators are converted to raw pointers.

@rfaasse (Contributor, Author) commented Oct 17, 2024

> To clarify, the point is that if you do A(i,i) and that row does not have the diagonal, then A(i,i) reallocates the matrix.
>
> Aside from not being threadsafe, this is VERY expensive... so I don't think we should allow it, but rather avoid using the (i,j) operator within the loop.

Completely agreed, I think this function should be refactored to improve both the performance and safety, but that will take more time and thorough testing.

Do you think it is acceptable to merge this PR to unblock the pipelines we have for geomechanics, such that the refactoring can be done without time pressure? Since the change in this PR makes the function at least somewhat more robust, and the function has worked like this for quite a long time, I would prefer this approach. At the moment, the failing pipelines are blocking our development.

Contributor commented:

@rfaasse if the insertion in parallel is the true cause of this segfault, then this PR doesn't fix it and your pipeline just dodged the race condition by chance. Merging this will not change that.

If you want to merge this PR as quickly as possible regardless (I'm fine with that), please open that separate issue and link this PR so the problem won't be forgotten.

Member commented:

@RiccardoRossi Just to re-iterate, the current state is blocking the development of the @KratosMultiphysics/geomechanics team. This PR at least gets rid of the implicit conversion from iterator to pointer and fixes the pipelines at our end. Therefore, can we merge this PR and investigate more in-depth at a later moment by creating a separate issue?

Hi @rfaasse, I approved the PR; nevertheless, let me tell you that this is not a fix.

The problem is that if one of the threads reallocates the matrix (by doing rA(i,i) in a case in which it does not exist), the other threads will have a conflict, as the reallocation will happen while they are writing. This is possibly even worse to find than the current behaviour...

Member commented:

Sorry to be so verbose; this is just to copy here how find_element works, because at that point we can also do only what is needed directly in code:

```cpp
const_pointer find_element (size_type i, size_type j) const {
    size_type element1 (layout_type::index_M (i, j));
    size_type element2 (layout_type::index_m (i, j));
    if (filled1_ <= element1 + 1)
        return 0;
    vector_const_subiterator_type itv (index1_data_.begin () + element1);
    const_subiterator_type it_begin (index2_data_.begin () + zero_based (*itv));
    const_subiterator_type it_end (index2_data_.begin () + zero_based (*(itv + 1)));
    const_subiterator_type it (detail::lower_bound (it_begin, it_end, k_based (element2), std::less<size_type> ()));
    if (it == it_end || *it != k_based (element2))
        return 0;
    return &value_data_ [it - index2_data_.begin ()];
}
```

Contributor (Author) commented:

I think there are actually two related problems here. One is that the double* is invalidated upon re-allocation of the underlying vector, regardless of parallel vs. serial execution. By using a reference, we at least still point to the right memory in a serial loop after the vector has re-allocated. My guess is that this was the issue for our pipeline, since it was consistently failing before and consistently passes with the fix (but that is just an assumption).

The second issue is, as you all rightly pointed out, that this implementation is not thread-safe; for that I'll create an issue and link this PR and this discussion. Thank you for all the input, and see you in the next one!

Member commented:

Interesting stuff

There should be no reallocation any more once it's fixed properly. I'd be curious whether it would still fail in your pipelines after reverting this PR.

Can you please include me in the next PR?

@rfaasse rfaasse merged commit 574bd67 into master Oct 17, 2024
9 of 11 checks passed
@rfaasse rfaasse deleted the core/fix-memory-issue-after-vector-reallocation branch October 17, 2024 18:32