Replies: 3 comments 1 reply
-
The most common cause of this would be some use of forking, where database connections are used in forked processes. Can you check whether there is any forking in the application (or at least, any forking while a Sequel::Database object has connections)?
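For example, this is the problematic pattern in a nutshell (just a sketch; the connection URL is a placeholder):

```ruby
require 'sequel'

DB = Sequel.connect('postgres://localhost/app_db')  # placeholder URL

# The parent process checks out a connection, so the pool now holds a
# live socket to the database.
DB.get(1)

fork do
  # The child inherits that same socket. If both processes use it, the
  # wire protocol state gets corrupted, which shows up as random query
  # errors and can even segfault the driver.
  DB.get(1)
end

Process.wait
```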
-
I was trying to see how they use message_bus themselves at Discourse, and they have a config where they tell ActiveRecord to close all connections on disconnect. That seems like it may be relevant: does Sequel have a function like that?
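Roughly, I'm imagining something along these lines (a hedged sketch; `DB` stands in for our Sequel::Database object):

```ruby
# Close every idle connection in this database's pool so nothing stale
# gets handed out again.
DB.disconnect

# Or across all Sequel::Database objects the process has created:
Sequel::DATABASES.each(&:disconnect)
```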
-
🤦🏼 Wow, I just found this: https://sequel.jeremyevans.net/rdoc/files/doc/fork_safety_rdoc.html, and adding that before_fork hook seems to have solved it. I can't believe I didn't know about this before, and I'm even more surprised we haven't seen issues until now, as we've been using Puma in forking mode for ages.
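For anyone who finds this later, the hook is essentially what the fork safety doc shows; a minimal sketch for a Puma config (config/puma.rb):

```ruby
before_fork do
  # Disconnect every Sequel::Database in the parent before Puma forks
  # its workers, so no child inherits a live database socket.
  Sequel::DATABASES.each(&:disconnect)
end
```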
-
Hey Jeremy, hope all is well.
I'm at my wits' end trying to diagnose a bizarre threading issue when using the above tools in combination. We're trying to add message_bus to our system, and all was well at first in development, where we use Puma in single mode. But as soon as we deployed to our real environments, we started seeing random 500s in the app that includes the message_bus middleware.
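For context, here's roughly what the setup looks like (a hedged sketch; the app class and connection details are placeholders, and the worker/thread counts are what we run in our real environments):

```ruby
# config.ru (sketch)
require 'sequel'
require 'message_bus'

DB = Sequel.connect(ENV.fetch('DATABASE_URL'))  # placeholder

use MessageBus::Rack::Middleware
run MyApp  # placeholder Rack application
```

```ruby
# config/puma.rb (sketch)
workers 2      # clustered mode
threads 5, 5   # five threads per worker
```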
To start off, I don't think this is a problem with Sequel at all, but since it affects the DB connection pool, I figured I'd start here. Here's what we're seeing:
When we have our Rack app running in Puma clustered mode (with 2 or more workers, 5 threads each), and we have the message_bus middleware mounted in the app, we see errors like this on random endpoints, including the message_bus ones themselves:
I've even intermittently seen hard seg faults:
The message_bus middleware does some threading behind the scenes that I don't really understand, and it makes use of Rack hijack. So I'm guessing there's some scenario where it hijacks a request onto one of its threads, closes connections, and those connections then end up being reused elsewhere.
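For anyone unfamiliar with the mechanism, here's a bare-bones illustration of Rack partial hijacking (the generic pattern only, not message_bus's actual implementation):

```ruby
# The server writes the status and headers, then calls the proc stored
# under the 'rack.hijack' response header with the raw socket. Keeping
# that socket open on a background thread is how long polling outlives
# the normal request/response cycle.
app = lambda do |env|
  if env['rack.hijack?']
    hijack = lambda do |io|
      Thread.new do
        io.write("data: hello\n\n")
        io.close
      end
    end
    [200, { 'Content-Type' => 'text/event-stream', 'rack.hijack' => hijack }, []]
  else
    [200, { 'Content-Type' => 'text/plain' }, ['hijacking not supported']]
  end
end
```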
To validate that, I enabled the connection_validation extension with validation on every checkout, and it does indeed "fix" the problem, since unusable connections are no longer handed out. But of course, I'd rather not keep that because of the performance hit.
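Concretely, the workaround looks like this (a sketch; -1 forces validation on every checkout instead of only for connections that have been idle a while):

```ruby
# Check connections before handing them out of the pool.
DB.extension(:connection_validation)

# Validate on every checkout (at the cost of an extra round trip per
# checkout), rather than only for long-idle connections.
DB.pool.connection_validation_timeout = -1
```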
Here's our config that I think is relevant:
Any leads or debugging tips you could provide would be greatly appreciated! Thanks