Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix "Got packets our of order" errors on connect #184

Merged

Conversation

skipkayhil
Copy link
Contributor

When trying to upgrade to Trilogy 2.8.0, my application started raising errors when establishing connections to our database:

1156: Got packets out of order

Bisecting the changes between 2.7.0 and 2.8.0 indicated that the change to remove sequence number hardcoding introduced the bug.

Unfortunately, the issue was not easily reproducible locally, so I added a printf statement to try to understand what sequence number was being used instead of the previously hardcoded 1 (or 2). This log showed that in failure cases the sequence number was 0.

After that, adding more printfs helped provide a clearer picture of the sequence of events when the incorrect sequence number is used:

trilogy=init seq=0
trilogy-pre-try_connect seq=0
trilogy-connect_recv seq_was=0
trilogy-connect_recv seq=0
trilogy-connect_recv-read_packet seq=0
trilogy-connect_recv seq_was=0
trilogy-connect_recv seq=0
trilogy-connect_recv-read_packet seq=1
trilogy-connect_recv-seq_was=1
trilogy-connect_recv seq=0
trilogy-connect_recv-read_packet seq=0
trilogy-connect_recv-parse_handshake seq=0
trilogy=auth_send use_ssl=0, seq=0

My understanding of what happens here is that trilogy_connect_recv is getting called 3 times in the while loop inside try_connect.

The first time, it sets the packet_parser's sequence_number to 0 before it calls read_packet but only reads some amount up to the packet's sequence_number before exiting with TRILOGY_AGAIN.

trilogy_connect_recv gets called a second time, and again it sets the packet_parser's sequence_number to 0. This time read_packet consumes the packet's sequence_number and so packet_parser's sequence_number get incremented. However, this iteration also exits with TRILOGY_AGAIN before the packet is finished.

trilogy_connect_recv is then called a third time, and again it sets the packet_parser's sequence_number to 0. read_packet finally finishes consuming the packet and this iteration is finally able to parse the handshake from the completed packet.

Since the sequence_number is being reset on each loop, auth_send ends up incorrectly using a sequence number of 0 instead of 1 (or 2). Therefore it seems incorrect for each call to trilogy_connect_recv to reset the sequence_number since it is not guaranteed that the last call will be the one that consumes the handshake's sequence_numer and increments the value for use in auth_send.

When trying to upgrade to Trilogy 2.8.0, my application started raising
errors when establishing connections to our database:

```
1156: Got packets out of order
```

Bisecting the changes between 2.7.0 and 2.8.0 indicated that [the
change][1] to remove sequence number hardcoding introduced the bug.

Unfortunately, the issue was not easily reproducible locally, so I
[added][1] a `printf` statement to try to understand what sequence
number was being used instead of the previously hardcoded 1 (or 2). This
log showed that in failure cases the sequence number was 0.

After that, [adding more][3] `printf`s helped provide a clearer picture
of the sequence of events when the incorrect sequence number is used:

```
trilogy=init seq=0
trilogy-pre-try_connect seq=0
trilogy-connect_recv seq_was=0
trilogy-connect_recv seq=0
trilogy-connect_recv-read_packet seq=0
trilogy-connect_recv seq_was=0
trilogy-connect_recv seq=0
trilogy-connect_recv-read_packet seq=1
trilogy-connect_recv-seq_was=1
trilogy-connect_recv seq=0
trilogy-connect_recv-read_packet seq=0
trilogy-connect_recv-parse_handshake seq=0
trilogy=auth_send use_ssl=0, seq=0
```

My understanding of what happens here is that `trilogy_connect_recv` is
getting called 3 times in the `while` loop inside `try_connect`.

The first time, it sets the `packet_parser`'s `sequence_number` to 0
before it calls `read_packet` but only reads some amount up to the
packet's `sequence_number` before exiting with `TRILOGY_AGAIN`.

`trilogy_connect_recv` gets called a second time, and again it sets the
`packet_parser`'s `sequence_number` to 0. This time `read_packet`
consumes the packet's `sequence_number` and so `packet_parser`'s
`sequence_number` get incremented. However, this iteration also exits
with `TRILOGY_AGAIN` before the packet is finished.

`trilogy_connect_recv` is then called a third time, and again it sets
the `packet_parser`'s `sequence_number` to 0. `read_packet` finally
finishes consuming the packet and this iteration is finally able to
parse the handshake from the completed packet.

Since the `sequence_number` is being reset on each loop, `auth_send`
ends up incorrectly using a sequence number of 0 instead of 1 (or 2).
Therefore it seems incorrect for each call to `trilogy_connect_recv` to
reset the `sequence_number` since it is not guaranteed that the last
call will be the one that consumes the handshake's `sequence_numer` and
increments the value for use in `auth_send`.

[1]: trilogy-libraries/trilogy@922544c
[2]: f5afac3
[3]: e052939
@skipkayhil
Copy link
Contributor Author

I'm not sure how best to test this 😅

My thinking would be a test could open a connection to a socket, write part of the handshake packet (past the seq number) to the socket, call trilogy_connect_recv, finish writing the handshake packet, call trilogy_connect_recv again, and then assert the seq number ends up as 1. However I'm not sure exactly how I would do that in the C test suite or if that's a reasonable test to write.

@matthewd
Copy link
Contributor

matthewd commented May 8, 2024

I can't really see a way to end up in the situation where we'd re-call and end up anywhere near here while needing the reset (i.e. not on a fresh connection that's already zeroed).

But in principle, I think what makes the connect flow unique is the fact that its 'send' function doesn't actually send anything: I haven't checked the docs, but I guess the protocol says the server talks first.

So, every other packet type, prior to 'recv', gets its sequence reset by its builder, in its 'send', but this one does not.

Thus [even though I can't see any reason it wouldn't be safe to just delete it], I claim it would be most consistent for us to instead move this line up a bit, into trilogy_connect_send_socket: that's the moment that we send a zero-length sequence, skipping the builder involvement that would otherwise handle it.

@skipkayhil
Copy link
Contributor Author

skipkayhil commented May 8, 2024

That makes sense to me 👍 Happy to squash as well if you'd prefer

Edit: whoops, typo'd the field 🤦

While it does not make sense to reset the sequence number in the middle
of parsing the handshake packet, it is reasonable to continue resetting
the sequence number once immediately when the connection is established.
As mentioned during review, this makes the send/recv of establishing the
connection similar to other send/recv pairs where the sequence number
gets reset during the send portion.
@eileencodes eileencodes merged commit 3703624 into trilogy-libraries:main May 8, 2024
11 checks passed
@eileencodes
Copy link
Member

Thanks @skipkayhil!

eileencodes added a commit that referenced this pull request May 13, 2024
The changelog entry was missing for #184, which fixes
broken sequencing that we were seeing in some cases.
composerinteralia added a commit that referenced this pull request May 13, 2024
@skipkayhil skipkayhil deleted the hm-no-seq-reset-on-connect branch June 18, 2024 16:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants