async_connect does not handle socket from libpq properly, fails to connect in some environments #304

msherman13 · 2022-05-25T17:19:47Z

I believe that the async_connect code violates the libpq documentation by assuming that the underlying socket file-descriptor will not change over the course of the connection sequence. The relevant section of the documentation is below:

https://www.postgresql.org/docs/current/libpq-connect.html

If PQconnectStart or PQconnectStartParams succeeds, the next stage is to poll libpq so that it can proceed with the connection sequence. Use PQsocket(conn) to obtain the descriptor of the socket underlying the database connection. (Caution: do not assume that the socket remains the same across PQconnectPoll calls.) Loop thus: If PQconnectPoll(conn) last returned PGRES_POLLING_READING, wait until the socket is ready to read (as indicated by select(), poll(), or similar system function). Then call PQconnectPoll(conn) again. Conversely, if PQconnectPoll(conn) last returned PGRES_POLLING_WRITING, wait until the socket is ready to write, then call PQconnectPoll(conn) again. On the first iteration, i.e., if you have yet to call PQconnectPoll, behave as if it last returned PGRES_POLLING_WRITING. Continue this loop until PQconnectPoll(conn) returns PGRES_POLLING_FAILED, indicating the connection procedure has failed, or PGRES_POLLING_OK, indicating the connection has been successfully made.
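For illustration, the loop quoted above might be sketched as follows. This is a minimal sketch, not ozo's code or libpq's API: `PollStatus`, `drive_connect`, and `FakeConn` are hypothetical names, `conn.socket()` stands in for `PQsocket(conn)`, and `conn.poll()` stands in for `PQconnectPoll(conn)`. The point it shows is that the descriptor is queried again on every iteration instead of being cached. The fake connection deliberately swaps its descriptor between polls so the sketch runs without a database.

```cpp
#include <poll.h>
#include <unistd.h>
#include <cassert>

// Hypothetical mirror of the PGRES_POLLING_* results named in the libpq docs.
enum class PollStatus { Reading, Writing, Ok, Failed };

// Drives a connection to completion. Conn is a hypothetical wrapper:
// conn.socket() stands in for PQsocket(conn), conn.poll() for PQconnectPoll(conn).
template <typename Conn>
bool drive_connect(Conn& conn, int timeout_ms) {
    // Per the docs: before the first poll, behave as if it returned WRITING.
    PollStatus status = PollStatus::Writing;
    while (status != PollStatus::Ok && status != PollStatus::Failed) {
        pollfd pfd{};
        pfd.fd = conn.socket();  // re-read every iteration, never cached
        pfd.events = (status == PollStatus::Reading) ? POLLIN : POLLOUT;
        if (::poll(&pfd, 1, timeout_ms) <= 0) {
            return false;  // timeout or error (EINTR is not retried in this sketch)
        }
        status = conn.poll();
    }
    return status == PollStatus::Ok;
}

// Fake connection for demonstration only: it replaces its descriptors
// during the first poll, the way libpq is allowed to.
struct FakeConn {
    int fds[2] = {-1, -1};
    int step = 0;
    int socket_queries = 0;

    FakeConn() {
        int rc = ::pipe(fds);
        assert(rc == 0);
        (void)rc;
    }
    FakeConn(const FakeConn&) = delete;
    FakeConn& operator=(const FakeConn&) = delete;
    ~FakeConn() { close_all(); }

    void close_all() {
        for (int& fd : fds) {
            if (fd >= 0) {
                ::close(fd);
                fd = -1;
            }
        }
    }

    // Write end before the first poll, read end afterwards.
    int socket() {
        ++socket_queries;
        return step == 0 ? fds[1] : fds[0];
    }

    PollStatus poll() {
        if (step++ == 0) {
            // Drop the old descriptors and open new ones, then make the
            // new read end readable so the next wait completes.
            close_all();
            int rc = ::pipe(fds);
            assert(rc == 0);
            (void)rc;
            char byte = 'x';
            ssize_t n = ::write(fds[1], &byte, 1);
            assert(n == 1);
            (void)n;
            return PollStatus::Reading;
        }
        return PollStatus::Ok;
    }
};
```

With the fake above, `drive_connect` asks for the descriptor twice, once per wait, and waits on a different descriptor the second time.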

In my environment (running a local Postgres 12 server in a Docker container), the connection always times out. After debugging, I found that the code below fixes the issue, although it is probably a huge hack. I am not familiar enough with the ozo codebase to feel confident in this fix:

diff --git a/include/ozo/impl/connection.h b/include/ozo/impl/connection.h
index fc92b5e..d458d43 100644
--- a/include/ozo/impl/connection.h
+++ b/include/ozo/impl/connection.h
@@ -62,12 +62,14 @@ ozo::pg::conn connection<OidMap, Statistics>::release() {
 template <typename OidMap, typename Statistics>
 template <typename WaitHandler>
 void connection<OidMap, Statistics>::async_wait_write(WaitHandler&& h) {
+    assign(release());
     socket_.async_write_some(asio::null_buffers(), std::forward<WaitHandler>(h));
 }
 
 template <typename OidMap, typename Statistics>
 template <typename WaitHandler>
 void connection<OidMap, Statistics>::async_wait_read(WaitHandler&& h) {
+    assign(release());
     socket_.async_read_some(asio::null_buffers(), std::forward<WaitHandler>(h));
 }

Unrelated, but it is also worth noting that the null_buffers method of waiting for the socket to become ready is deprecated. The preferred method is to use socket_.async_wait.


thed636 · 2022-05-26T16:45:46Z

Hi!
Thanks for reaching out. Well, yes, it definitely violates the documentation in the case of a multi-host connection string. Do you use multiple hosts in your connection string, or do you see the issue with a single host?

msherman13 · 2022-05-27T15:17:35Z

I am having the issue with a single-host connection string.

msherman13 · 2022-05-27T15:19:42Z

The original code seems to work on some client hosts but not others. I'm not sure what the difference is between the two hosts I tested on; both run CentOS 7.5.

computerquip-work · 2024-11-05T20:12:11Z

This is a nasty bug. As far as I understand, PQconnectPoll() may call closesocket() on the underlying socket between calls. So PQconnectPoll() -> poll(PQsocket(conn)) is how it normally goes. ozo seems to be caching the result of PQsocket(conn) though and re-using the fd. Because libpq closes the fd, and fds may end up being reused, any application that creates other fds can run into this issue, along with any other issues that come with poll'ing on fds that don't belong to ozo. It manifests in such a way that it's somewhat hard to debug unfortunately.

I would think this could cause other problems but so far, I've only seen the PGRES_POLLING_OK event getting dropped and some PGRES_POLLING_WRITING and PGRES_POLLING_READING events getting caught. So I figure I don't understand the whole situation and would probably assume my understanding above is somewhat off. Still, probably not great.
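To make the descriptor-reuse hazard concrete, here is a small POSIX-only sketch. It does not use libpq or ozo: `ReuseResult` and `demonstrate_fd_reuse` are hypothetical names, closing the pipe end stands in for libpq closing its socket, and `dup()` stands in for any unrelated descriptor the application opens afterwards. It assumes a single-threaded process, where POSIX hands out the lowest free descriptor number.

```cpp
#include <unistd.h>
#include <cassert>

// Hypothetical result type: the number that was cached, and the number
// given to an unrelated descriptor opened after the close.
struct ReuseResult {
    int stale_fd;
    int new_fd;
};

ReuseResult demonstrate_fd_reuse() {
    int first[2];
    if (::pipe(first) != 0) {
        return {-1, -1};
    }

    // Cache the number, the way a saved PQsocket() result would be cached.
    int stale = first[0];

    // Stand-in for the library closing its socket between polls.
    ::close(first[0]);

    // Stand-in for any other descriptor created by the application.
    // POSIX returns the lowest free number, which is the one just closed.
    int other = ::dup(first[1]);

    ReuseResult result{stale, other};
    ::close(other);
    ::close(first[1]);
    return result;
}
```

In this sketch the cached number and the new descriptor's number come out equal, so anything still waiting on the cached value would be watching an object it does not own.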

thed636 self-assigned this May 26, 2022

thed636 added the bug label May 26, 2022
