[CORE] Fix memory issue in ublas_space.h #12761
Conversation
FYI @roigcarlo, I don't know whether this fix would help in the pipeline issue for Windows you are looking into, but it's probably worth a shot trying this fix on your branch (since it is quite an elusive/subtle issue that only happens in very specific situations).
The original implementation is mine; if CI passes I will be okay. By the way, a similar approach is followed in the other B&S, so we may need to take a look.
Ty, will try it, but I am afraid the problem is related to IO and file descriptors.
@loumalouomega Since the pipelines are passing, could you approve this PR? I'm not sure of the procedure here: does someone from the @KratosMultiphysics/technical-committee also need to approve, or is it enough if you do it (since you created this code in the first place)?
Looks good to me as well, but please @RiccardoRossi take a look |
@@ -710,8 +710,8 @@ class UblasSpace
 {
     const std::size_t system_size = rA.size1();

-    const double* Avalues = rA.value_data().begin();
-    const std::size_t* Arow_indices = rA.index1_data().begin();
+    const auto& Avalues = rA.value_data();
In my opinion the incriminating change is on line 737, as that is the only point at which you may trigger a reallocation. If no reallocation happens, then the current code should be OK... or at least I cannot understand why it would not be so.
This is not to say that I dislike the change, simply that if a reallocation happens on 737, then the code is still not thread-safe...
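For reference, here is a minimal standalone sketch of the pattern under discussion (an approximation, not the actual Kratos code; the function name and the exact diagonal check are assumptions). The raw pointers are taken once, before the loop, while the rA(Index, Index) assignment ("line 737") may insert a new entry and reallocate the arrays they point into:

#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <cstddef>

namespace ublas = boost::numeric::ublas;

// Approximate "before" shape of the function: raw pointers into the CSR arrays.
void CheckAndCorrectZeroDiagonalSketch(ublas::compressed_matrix<double>& rA, const double scale_factor)
{
    const std::size_t system_size = rA.size1();

    const double* Avalues = rA.value_data().begin();             // raw pointer into the values
    const std::size_t* Arow_indices = rA.index1_data().begin();  // raw pointer into the row starts
    const std::size_t* Acol_indices = rA.index2_data().begin();  // raw pointer into the column indices

    for (std::size_t Index = 0; Index < system_size; ++Index) {  // an IndexPartition in the real code
        bool zero_or_missing_diagonal = true;
        for (std::size_t k = Arow_indices[Index]; k < Arow_indices[Index + 1]; ++k) {
            if (Acol_indices[k] == Index && Avalues[k] != 0.0) {
                zero_or_missing_diagonal = false;
                break;
            }
        }
        if (zero_or_missing_diagonal)
            rA(Index, Index) = scale_factor;  // inserts when (Index, Index) is not stored,
                                              // which may reallocate and dangle the pointers above
    }
}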
That's a good point. I can imagine that writing to the A matrix in a parallel loop, while using the underlying value_data, could still lead to issues. I think this needs to be investigated further.
However, in our test cases we saw that the re-allocation consistently caused a segfault/memory access violation, and with this change it consistently passes (since at least the reference is not invalidated like the double* that was there before).
Therefore, I'd propose to merge this PR for now, since it fixes our pipelines (which are blocking our development at the moment); afterwards there can be an investigation into whether we need other changes as well to make this part more secure/thread-safe.
To clarify, the point is that if you do A(i,i) and that row does not have the diagonal, then A(i,i) reallocates the matrix.
Aside from not being thread-safe, this is VERY expensive... so I don't think we should allow it, but rather avoid using the (i,j) operator within the loop.
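A hedged sketch of what avoiding the (i,j) operator inside the loop could look like: locate the diagonal through the CSR index arrays and overwrite it in place, and collect the rows whose diagonal is structurally missing for a separate, serial insertion pass afterwards. The function name and structure are illustrative assumptions, not Kratos code:

#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <cstddef>
#include <vector>

namespace ublas = boost::numeric::ublas;

void CorrectZeroDiagonalsWithoutInsertion(ublas::compressed_matrix<double>& rA, const double scale_factor)
{
    const auto& row_indices = rA.index1_data();   // CSR row starts
    const auto& col_indices = rA.index2_data();   // CSR column indices
    auto& values            = rA.value_data();    // CSR values (mutable)

    std::vector<std::size_t> rows_missing_diagonal;

    // No insertion happens in this loop, so nothing can reallocate here.
    // (A parallel version would need a thread-local list instead of push_back.)
    for (std::size_t i = 0; i < rA.size1(); ++i) {
        bool found = false;
        for (std::size_t k = row_indices[i]; k < row_indices[i + 1]; ++k) {
            if (col_indices[k] == i) {              // diagonal entry is stored
                if (values[k] == 0.0)
                    values[k] = scale_factor;       // overwrite in place
                found = true;
                break;
            }
        }
        if (!found)
            rows_missing_diagonal.push_back(i);
    }

    // Structural insertions, if any, are done serially and outside any parallel region.
    for (const std::size_t i : rows_missing_diagonal)
        rA(i, i) = scale_factor;
}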
Line 737:
rA(Index, Index) = scale_factor;
I don't see how this would trigger a reallocation. Also, Index is unique in each iteration, so writing to that component is thread-safe.
I think there might be something fishy going on when iterators are converted to raw pointers.
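A small standalone illustration of that suspicion (an assumed mechanism, not a reproduction of the actual crash): a raw pointer obtained from value_data().begin() can dangle after inserting assignments, while a reference to the array returned by value_data() stays valid, because it refers to the container itself rather than to its storage:

#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <cstddef>
#include <iostream>

int main()
{
    namespace ublas = boost::numeric::ublas;
    ublas::compressed_matrix<double> A(1000, 1000, 1);  // reserve room for a single nonzero

    A(0, 0) = 1.0;

    const double* raw = A.value_data().begin();  // raw pointer into the value array
    const auto& ref   = A.value_data();          // reference to the value array itself

    for (std::size_t i = 1; i < 1000; ++i)
        A(i, i) = 1.0;                           // insertions force the array to grow/reallocate

    // 'raw' may now point to freed memory (dereferencing it is undefined behaviour),
    // while 'ref' still refers to the (relocated) array and reads the correct data.
    std::cout << "first stored value via the reference: " << ref[0] << std::endl;
    // std::cout << *raw;  // do NOT do this: the pointer may be dangling
    return 0;
}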
To clarify, the point is that if you do A(i,i) and that row does not have the diagonal, then A(i,i) reallocates the matrix.
Aside from not being thread-safe, this is VERY expensive... so I don't think we should allow it, but rather avoid using the (i,j) operator within the loop.
Completely agreed. I think this function should be refactored to improve both performance and safety, but that will take more time and thorough testing.
Do you think it is acceptable to merge this PR to unblock our geomechanics pipelines, so that the refactoring can be done without time pressure? Since the change in this PR makes the function at least a bit more robust, and the function has worked like this for quite a long time, I would prefer this approach. At the moment the failing pipelines are blocking our development.
@rfaasse if the insertion in parallel is the true cause of this segfault, then this PR doesn't fix it and your pipeline just dodged the race condition by chance. Merging this will not change that.
If you want to merge this PR as quickly as possible regardless (I'm fine with that), please open that separate issue and link this PR so the problem won't be forgotten.
@RiccardoRossi Just to reiterate: the current state is blocking the development of the @KratosMultiphysics/geomechanics team. This PR at least gets rid of the implicit conversion from iterator to pointer and fixes the pipelines on our end. Therefore, can we merge this PR and investigate more in-depth at a later moment, by creating a separate issue?
Hi @rfaasse, I approved the PR; nevertheless, let me tell you that this is not a fix.
The problem is that if one of the threads reallocates the matrix (by doing rA(i,i) in a case in which it does not exist), the other threads will have a conflict, as the reallocation will happen while they are writing. This is possibly even worse to find than the current behaviour...
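One possible way to sidestep that race (sketched under the assumption that all diagonal entries can be inserted up front; this is not part of this PR): make sure every (i, i) entry exists in a serial pass before the parallel loop, so no thread can trigger an inserting rA(i, i) while the others are writing:

#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <cstddef>

namespace ublas = boost::numeric::ublas;

void EnsureDiagonalEntriesExist(ublas::compressed_matrix<double>& rA)
{
    // Serial pass: find_element returns a null pointer when (i, i) is not stored.
    for (std::size_t i = 0; i < rA.size1(); ++i) {
        if (rA.find_element(i, i) == nullptr)
            rA(i, i) = 0.0;  // structural insertion, done once, single-threaded
    }
    // After this, a parallel loop may overwrite diagonal values in place
    // (each thread touching its own row) without triggering any reallocation.
}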
Sorry to be so verbose; this is just to copy here how find_element works, because at that point we can also do only what is needed directly in code:
const_pointer find_element (size_type i, size_type j) const {
    size_type element1 (layout_type::index_M (i, j));
    size_type element2 (layout_type::index_m (i, j));
    if (filled1_ <= element1 + 1)
        return 0;
    vector_const_subiterator_type itv (index1_data_.begin () + element1);
    const_subiterator_type it_begin (index2_data_.begin () + zero_based (*itv));
    const_subiterator_type it_end (index2_data_.begin () + zero_based (*(itv + 1)));
    const_subiterator_type it (detail::lower_bound (it_begin, it_end, k_based (element2), std::less<size_type> ()));
    if (it == it_end || *it != k_based (element2))
        return 0;
    return &value_data_ [it - index2_data_.begin ()];
}
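A hedged sketch of what doing only what is needed directly in code might look like on the UblasSpace side: the same lookup as find_element above, written against the public CSR arrays with std::lower_bound, returning a pointer to the stored (i, i) value or nullptr. It assumes zero-based indices (the ublas default) and fully filled row pointers; the function name is illustrative:

#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <algorithm>
#include <cstddef>

namespace ublas = boost::numeric::ublas;

double* FindStoredDiagonal(ublas::compressed_matrix<double>& rA, const std::size_t i)
{
    const auto& row_indices = rA.index1_data();
    const auto& col_indices = rA.index2_data();
    auto& values            = rA.value_data();

    const std::size_t* it_begin = col_indices.begin() + row_indices[i];
    const std::size_t* it_end   = col_indices.begin() + row_indices[i + 1];
    const std::size_t* it       = std::lower_bound(it_begin, it_end, i);

    if (it == it_end || *it != i)
        return nullptr;                        // the diagonal is not stored
    return &values[it - col_indices.begin()];  // pointer to the stored value
}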
I think there are actually two related problems here. One is that the double* is invalidated upon re-allocation of the underlying vector, regardless of parallel vs serial execution. By using a reference, we at least still point to the right memory in a serial loop when the vector has re-allocated. My guess is that this was the issue for our pipeline, since it was consistently failing before and consistently passing with the fix (but that is just an assumption).
The second issue is, as you all rightly pointed out, that this implementation is not thread-safe, for which I'll create an issue and link to this PR and this discussion. Thank you for all the input, and see you in the next one!
Interesting stuff
There should be no reallocation any more once it's fixed properly. I'd be curious whether it would still fail in your pipelines after reverting this PR.
Can you please include me in the next PR?
📝 Description
In our TeamCity Windows pipeline, tests started failing with a memory violation error. When tracking down the place where invalid memory was accessed, it turned out to be in ublas_space.h, in the CheckAndCorrectZeroDiagonalValues function.
In that function, the values of matrix rA are fetched using a double*, initialized with the begin iterator of rA.value_data(). However, in the loop (the IndexPartition from line 723-734), rA is actually changed. This means that rA.value_data() also changes. Therefore, it could happen that the underlying vector (obtained using value_data()) has to be re-allocated because values are added to the vector that no longer fit in the allocated/reserved space. Indeed, when printing rA.value_data().size() in the loop, I saw that the size doubled (which points to the vector being re-allocated) just before the crash. The crash then occurs because the double* Avalues is invalidated when the underlying vector gets re-allocated.
This PR fixes the issue by storing the vector reference returned by value_data() instead of using the double*. This reference does not have the same issue of being invalidated when the vector is re-allocated. Note that we do not make a copy here, since the value_data(), index1_data() and index2_data() functions return references, which we store (not by value).
Since this practice of using a double* to point to the begin iterator is very tricky and can lead to these hard-to-find issues (they only occur in very specific situations), I also changed the other places in this file where this was happening, using the same strategy.
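For illustration, a self-contained sketch of the function after the change described above (an approximation, not the exact Kratos diff): the handles are now references to the CSR containers rather than raw pointers to their first elements, so they survive a reallocation triggered by an inserting rA(Index, Index):

#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <cstddef>

namespace ublas = boost::numeric::ublas;

void CheckAndCorrectZeroDiagonalValuesSketch(ublas::compressed_matrix<double>& rA, const double scale_factor)
{
    const std::size_t system_size = rA.size1();

    // Before: const double* Avalues = rA.value_data().begin();
    //         const std::size_t* Arow_indices = rA.index1_data().begin();
    // After: references to the containers themselves, which remain valid even if the
    // underlying arrays are re-allocated.
    const auto& Avalues      = rA.value_data();
    const auto& Arow_indices = rA.index1_data();
    const auto& Acol_indices = rA.index2_data();

    for (std::size_t Index = 0; Index < system_size; ++Index) {
        bool zero_or_missing_diagonal = true;
        for (std::size_t k = Arow_indices[Index]; k < Arow_indices[Index + 1]; ++k) {
            if (Acol_indices[k] == Index && Avalues[k] != 0.0) {
                zero_or_missing_diagonal = false;
                break;
            }
        }
        if (zero_or_missing_diagonal)
            rA(Index, Index) = scale_factor;  // may still insert; see the thread-safety discussion above
    }
}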