Use fast_union_uint16 instead of union_uint16. #452
It is possible. The reason I did not use them is that I was concerned about safety. It is a good idea to revisit the issue.
Branch updated from af171fd to 3fc74c2.
I read the code of
The layout is more complex. We don't allocate a new array for the output, so we have one memory region where we have both the input and the output. We have the following layout...
That is, input 1 and the output are in the same memory region, but the output is offset. At the end of the computation, it is definitely the case that the output will overwrite input 1. If you are doing a textbook computation, then there is no problem because you cannot write value But it gets more complicated once you have a vectorized function. Note that we can't just say that it looks OK... we have to be sure. If you think it is safe, can you work out the analysis from what I just wrote?
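To make the "textbook" case concrete, here is a minimal scalar sketch in C. It is not the library's code: the helper name `inplace_union_scalar` is hypothetical, and it assumes that input 1 is first shifted right by the length of input 2 so that the output can be written from the start of the same buffer. The assertions check that the write index never passes the next unread element of input 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of the library: scalar in-place union.
 * buf holds len1 sorted, distinct values and has room for len1 + len2.
 * in2 holds len2 sorted, distinct values and must not alias buf.
 * Returns the cardinality of the union, which is stored at buf[0..]. */
static size_t inplace_union_scalar(uint16_t *buf, size_t len1,
                                   const uint16_t *in2, size_t len2) {
    /* Assumed layout: shift input 1 right by len2 so that the
     * output can start at buf[0] in the same memory region. */
    memmove(buf + len2, buf, len1 * sizeof(uint16_t));
    const uint16_t *in1 = buf + len2;

    size_t i = 0, j = 0, out = 0;
    while (i < len1 && j < len2) {
        /* Invariant: the write index never passes the next unread
         * slot of input 1, which sits at index len2 + i in buf. */
        assert(out <= len2 + i);
        uint16_t a = in1[i], b = in2[j]; /* read before any write */
        if (a < b) {
            buf[out++] = a;
            i++;
        } else if (b < a) {
            buf[out++] = b;
            j++;
        } else { /* equal: emit once, consume from both inputs */
            buf[out++] = a;
            i++;
            j++;
        }
    }
    while (i < len1) { /* input 2 exhausted: copy the rest of input 1 */
        assert(out <= len2 + i);
        buf[out++] = in1[i++];
    }
    while (j < len2) { /* input 1 exhausted: nothing left to protect */
        buf[out++] = in2[j++];
    }
    return out;
}
```

Calling it with a buffer of capacity 8 that holds {1, 3, 5, 7} and a second input {2, 3, 6, 9} leaves {1, 2, 3, 5, 6, 7, 9} at the start of the buffer. The sketch only covers the one-element-at-a-time case; it says nothing about code that loads or stores several elements per step.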
@lemire We just need to focus on the reads and writes performed on array1. In Let's say the length (cardinality) of input2 is L2:
Let's define 3 __m128i pointers,
The union output never contains more elements than the two inputs combined, so we have:
therefore:
which means you will not overwrite data beyond pos1, so the data that has not been read yet is safe, and we don't care about the data that has already been read. This is the basic idea; I don't know how to illustrate it with code details... I hope you can consider this and go through the code again.
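One way to write down the counting argument above, for the element-by-element case only, is sketched here. The symbols are introduced for illustration and are not taken from the thread: $r_1$ and $r_2$ are the numbers of elements consumed so far from input 1 and input 2, $w$ is the number of elements written so far, and input 1 is assumed to start at offset $L_2$ in the shared buffer.

```latex
\begin{align*}
w   &\le r_1 + r_2 && \text{(every written element comes from a consumed input element)} \\
r_2 &\le L_2       && \text{(input 2 has only } L_2 \text{ elements)} \\
\Rightarrow\quad w &\le r_1 + L_2 && \text{(the write index does not pass the next unread element of input 1)}
\end{align*}
```

Under that assumed layout, $r_1 + L_2$ is the buffer index of the next unread element of input 1, so a write at index $w$ cannot destroy unread data. The bound counts elements; a vectorized implementation that loads or stores a whole register at a time would need the same kind of bound to hold for every lane it touches, which this sketch does not establish.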
Merging.
I am reverting this PR. It causes failures with Intel compilers.
I wonder why fast_union_uint16 was not used in these 2 functions: I think union_vector16 is still safe here.