-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix "Got packets our of order" errors on connect #184
Fix "Got packets our of order" errors on connect #184
Conversation
When trying to upgrade to Trilogy 2.8.0, my application started raising errors when establishing connections to our database: ``` 1156: Got packets out of order ``` Bisecting the changes between 2.7.0 and 2.8.0 indicated that [the change][1] to remove sequence number hardcoding introduced the bug. Unfortunately, the issue was not easily reproducible locally, so I [added][1] a `printf` statement to try to understand what sequence number was being used instead of the previously hardcoded 1 (or 2). This log showed that in failure cases the sequence number was 0. After that, [adding more][3] `printf`s helped provide a clearer picture of the sequence of events when the incorrect sequence number is used: ``` trilogy=init seq=0 trilogy-pre-try_connect seq=0 trilogy-connect_recv seq_was=0 trilogy-connect_recv seq=0 trilogy-connect_recv-read_packet seq=0 trilogy-connect_recv seq_was=0 trilogy-connect_recv seq=0 trilogy-connect_recv-read_packet seq=1 trilogy-connect_recv-seq_was=1 trilogy-connect_recv seq=0 trilogy-connect_recv-read_packet seq=0 trilogy-connect_recv-parse_handshake seq=0 trilogy=auth_send use_ssl=0, seq=0 ``` My understanding of what happens here is that `trilogy_connect_recv` is getting called 3 times in the `while` loop inside `try_connect`. The first time, it sets the `packet_parser`'s `sequence_number` to 0 before it calls `read_packet` but only reads some amount up to the packet's `sequence_number` before exiting with `TRILOGY_AGAIN`. `trilogy_connect_recv` gets called a second time, and again it sets the `packet_parser`'s `sequence_number` to 0. This time `read_packet` consumes the packet's `sequence_number` and so `packet_parser`'s `sequence_number` get incremented. However, this iteration also exits with `TRILOGY_AGAIN` before the packet is finished. `trilogy_connect_recv` is then called a third time, and again it sets the `packet_parser`'s `sequence_number` to 0. `read_packet` finally finishes consuming the packet and this iteration is finally able to parse the handshake from the completed packet. Since the `sequence_number` is being reset on each loop, `auth_send` ends up incorrectly using a sequence number of 0 instead of 1 (or 2). Therefore it seems incorrect for each call to `trilogy_connect_recv` to reset the `sequence_number` since it is not guaranteed that the last call will be the one that consumes the handshake's `sequence_numer` and increments the value for use in `auth_send`. [1]: trilogy-libraries/trilogy@922544c [2]: f5afac3 [3]: e052939
I'm not sure how best to test this 😅 My thinking would be a test could open a connection to a socket, write part of the handshake packet (past the seq number) to the socket, call |
I can't really see a way to end up in the situation where we'd re-call and end up anywhere near here while needing the reset (i.e. not on a fresh connection that's already zeroed). But in principle, I think what makes the connect flow unique is the fact that its 'send' function doesn't actually send anything: I haven't checked the docs, but I guess the protocol says the server talks first. So, every other packet type, prior to 'recv', gets its sequence reset by its builder, in its 'send', but this one does not. Thus [even though I can't see any reason it wouldn't be safe to just delete it], I claim it would be most consistent for us to instead move this line up a bit, into |
That makes sense to me 👍 Happy to squash as well if you'd prefer Edit: whoops, typo'd the field 🤦 |
While it does not make sense to reset the sequence number in the middle of parsing the handshake packet, it is reasonable to continue resetting the sequence number once immediately when the connection is established. As mentioned during review, this makes the send/recv of establishing the connection similar to other send/recv pairs where the sequence number gets reset during the send portion.
bbdbcaf
to
8b80fce
Compare
Thanks @skipkayhil! |
The changelog entry was missing for #184, which fixes broken sequencing that we were seeing in some cases.
Add changelog entry for #184
When trying to upgrade to Trilogy 2.8.0, my application started raising errors when establishing connections to our database:
Bisecting the changes between 2.7.0 and 2.8.0 indicated that the change to remove sequence number hardcoding introduced the bug.
Unfortunately, the issue was not easily reproducible locally, so I added a
printf
statement to try to understand what sequence number was being used instead of the previously hardcoded 1 (or 2). This log showed that in failure cases the sequence number was 0.After that, adding more
printf
s helped provide a clearer picture of the sequence of events when the incorrect sequence number is used:My understanding of what happens here is that
trilogy_connect_recv
is getting called 3 times in thewhile
loop insidetry_connect
.The first time, it sets the
packet_parser
'ssequence_number
to 0 before it callsread_packet
but only reads some amount up to the packet'ssequence_number
before exiting withTRILOGY_AGAIN
.trilogy_connect_recv
gets called a second time, and again it sets thepacket_parser
'ssequence_number
to 0. This timeread_packet
consumes the packet'ssequence_number
and sopacket_parser
'ssequence_number
get incremented. However, this iteration also exits withTRILOGY_AGAIN
before the packet is finished.trilogy_connect_recv
is then called a third time, and again it sets thepacket_parser
'ssequence_number
to 0.read_packet
finally finishes consuming the packet and this iteration is finally able to parse the handshake from the completed packet.Since the
sequence_number
is being reset on each loop,auth_send
ends up incorrectly using a sequence number of 0 instead of 1 (or 2). Therefore it seems incorrect for each call totrilogy_connect_recv
to reset thesequence_number
since it is not guaranteed that the last call will be the one that consumes the handshake'ssequence_numer
and increments the value for use inauth_send
.