
async_connect does not handle socket from libpq properly, fails to connect in some environments #304

Open
msherman13 opened this issue May 25, 2022 · 4 comments

@msherman13

I believe that the async_connect code violates the libpq documentation by assuming that the underlying socket file descriptor will not change over the course of the connection sequence. The relevant section of the documentation is below:

https://www.postgresql.org/docs/current/libpq-connect.html

If PQconnectStart or PQconnectStartParams succeeds, the next stage is to poll libpq so that it can proceed with the connection sequence. Use PQsocket(conn) to obtain the descriptor of the socket underlying the database connection. (Caution: do not assume that the socket remains the same across PQconnectPoll calls.) Loop thus: If PQconnectPoll(conn) last returned PGRES_POLLING_READING, wait until the socket is ready to read (as indicated by select(), poll(), or similar system function). Then call PQconnectPoll(conn) again. Conversely, if PQconnectPoll(conn) last returned PGRES_POLLING_WRITING, wait until the socket is ready to write, then call PQconnectPoll(conn) again. On the first iteration, i.e., if you have yet to call PQconnectPoll, behave as if it last returned PGRES_POLLING_WRITING. Continue this loop until PQconnectPoll(conn) returns PGRES_POLLING_FAILED, indicating the connection procedure has failed, or PGRES_POLLING_OK, indicating the connection has been successfully made.
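
For reference, the loop described above can be written roughly as follows. This is a minimal blocking sketch using poll(); error handling and cleanup are abbreviated, and the key point is that PQsocket(conn) is re-queried on every iteration rather than cached:

    #include <libpq-fe.h>
    #include <poll.h>

    bool connect_blocking_poll(const char* conninfo) {
        PGconn* conn = PQconnectStart(conninfo);
        if (!conn || PQstatus(conn) == CONNECTION_BAD) {
            return false;
        }
        // On the first iteration, behave as if PQconnectPoll last returned
        // PGRES_POLLING_WRITING, as the documentation instructs.
        PostgresPollingStatusType status = PGRES_POLLING_WRITING;
        while (status != PGRES_POLLING_OK && status != PGRES_POLLING_FAILED) {
            pollfd pfd{};
            // Do NOT cache this: the descriptor may change between
            // PQconnectPoll() calls.
            pfd.fd = PQsocket(conn);
            pfd.events = (status == PGRES_POLLING_READING) ? POLLIN : POLLOUT;
            if (poll(&pfd, 1, -1) < 0) {
                break;
            }
            status = PQconnectPoll(conn);
        }
        return status == PGRES_POLLING_OK;
    }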

In my environment (a local PostgreSQL 12 server running in a Docker container), the connection always times out. After debugging, I found that the code below fixes the issue, although it is probably a huge hack. I am not familiar enough with the ozo codebase to feel confident in this fix:

diff --git a/include/ozo/impl/connection.h b/include/ozo/impl/connection.h
index fc92b5e..d458d43 100644
--- a/include/ozo/impl/connection.h
+++ b/include/ozo/impl/connection.h
@@ -62,12 +62,14 @@ ozo::pg::conn connection<OidMap, Statistics>::release() {
 template <typename OidMap, typename Statistics>
 template <typename WaitHandler>
 void connection<OidMap, Statistics>::async_wait_write(WaitHandler&& h) {
+    assign(release());
     socket_.async_write_some(asio::null_buffers(), std::forward<WaitHandler>(h));
 }
 
 template <typename OidMap, typename Statistics>
 template <typename WaitHandler>
 void connection<OidMap, Statistics>::async_wait_read(WaitHandler&& h) {
+    assign(release());
     socket_.async_read_some(asio::null_buffers(), std::forward<WaitHandler>(h));
 }

Unrelated, but it is also worth noting that the null_buffers method of waiting for the socket to become ready is deprecated. The preferred method is to use socket_.async_wait.
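
A sketch of what an async_wait-based variant might look like, assuming socket_ is (or wraps) an asio::posix::stream_descriptor; this is not ozo's actual code, and note that the completion handler signature changes from (error_code, size_t) to just (error_code):

    template <typename OidMap, typename Statistics>
    template <typename WaitHandler>
    void connection<OidMap, Statistics>::async_wait_write(WaitHandler&& h) {
        // async_wait only reports readiness; the handler takes a single
        // error_code argument instead of (error_code, bytes_transferred).
        socket_.async_wait(asio::posix::stream_descriptor::wait_write,
                           std::forward<WaitHandler>(h));
    }

    template <typename OidMap, typename Statistics>
    template <typename WaitHandler>
    void connection<OidMap, Statistics>::async_wait_read(WaitHandler&& h) {
        socket_.async_wait(asio::posix::stream_descriptor::wait_read,
                           std::forward<WaitHandler>(h));
    }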

@thed636
Contributor

thed636 commented May 26, 2022

Hi!
Thanks for reaching out to us. Well, yes, it definitely violates the documentation in the case of a multi-host connection string. Are you using multiple hosts in your connection string, or do you see the issue with a single host?

@thed636 thed636 self-assigned this May 26, 2022
@thed636 thed636 added the bug label May 26, 2022
@msherman13
Author

I am having the issue with a single-host connection string.

@msherman13
Author

The original code seems to work on some client hosts but not others. I'm not sure what the difference is between the two hosts I tested on; both are CentOS 7.5.

@computerquip-work

computerquip-work commented Nov 5, 2024

This is a nasty bug. As far as I understand, PQconnectPoll() may call closesocket() on the underlying socket between calls, so the normal sequence is PQconnectPoll() -> poll(PQsocket(conn)), re-querying the descriptor each time. ozo, however, seems to cache the result of PQsocket(conn) and reuse that fd. Because libpq closes the fd, and fds may end up being reused, any application that creates other fds can run into this issue, along with any other issues that come with polling fds that don't belong to ozo. It manifests in a way that is unfortunately somewhat hard to debug.
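
As a sketch of the pattern that avoids the stale fd (the names conn and handle are illustrative, not ozo's actual members): re-read PQsocket() before every wait and hand the new fd to asio if it has changed, releasing rather than closing the old one since libpq owns it.

    void refresh_descriptor(PGconn* conn, asio::posix::stream_descriptor& handle) {
        const int fd = PQsocket(conn);
        if (handle.is_open() && handle.native_handle() == fd) {
            return;  // still the same descriptor, nothing to do
        }
        if (handle.is_open()) {
            // release(), not close(): libpq owns the old fd and may already
            // have closed it, in which case the number could now refer to an
            // unrelated descriptor elsewhere in the process.
            handle.release();
        }
        handle.assign(fd);
    }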

I would expect this to cause other problems, but so far I've only seen the PGRES_POLLING_OK event getting dropped while some PGRES_POLLING_WRITING and PGRES_POLLING_READING events are still caught. So I figure I don't understand the whole situation and my description above is probably somewhat off. Still, probably not great.
