[SIP-26] Proposal for Implementing Connection Pooling for Analytics Database Connections #8574
There's a challenge here around the fact that each subprocess (gunicorn worker and celery worker) gets its own pool, and that each one of those can connect to multiple databases. Depending on whether you configure these things to use threads or workers (subprocesses), you can very quickly end up with a number of open connections that is vastly bigger than the number of active connections. One problem is that while you probably want to cap the number of connections to your db, you want to have a dynamic number of workers as you need more capacity. There's also a challenge around the fact that threading and SQLAlchemy pools have intricate issues; there are endless Stack Overflow threads documenting this. Another thought is that we may want to limit concurrency to analytics databases, but this approach does not achieve that in any way, as there's no global state shared across server/workers. That would need to be handled with some sort of environment-global variable (Redis?) that would keep track of the number of active connections. What's the cost of spawning / destroying a connection? On the server / on the client? Milliseconds of CPU time? Worth it?
Yes, there is a challenge around each process getting its own pool. I see your point that for some systems (Redshift is getting my side-eye here) having connections held open could be a larger problem. I'll amend the recommendation above to make one of the options a NullPool. This would retain the existing behavior for databases that are unable to handle enough open connections.

RE: the cost of spawning/destroying the connection, I think it's impossible to come up with a really solid specific number. The range is likely to be from around 10 milliseconds, when the server is a more application-focused one (Postgres, MySQL, etc.) living on the same network, up to potentially multiple seconds for systems separated geographically or with a chattier protocol for initiating a database connection. Under load, these numbers can get quite large. A goal down the line would be to limit total connections, but I'd like to push that off into a future SIP. I believe a reasonable way to attack that would be to implement a custom pool that leverages Redis as a distributed lock for the system. The Redis lookups for this system would potentially add a fair amount of latency, so that's something we should discuss separately in my mind.

RE: the intricacies of SQLAlchemy in a threaded environment, it appears that connection pools and engines are thread-safe, but that sessions and connections are not. This makes sense intellectually: the connection pool and engine are designed to protect the non-thread-safe resources they contain. None of this is safe across a process boundary, so the multiprocessing module in Python is a danger to connection pools. We already have this issue when it comes to the metadata database. Post-fork, any existing connection pools would need to be recreated. From my research, some database drivers implement connections as blocking calls, which will break multithreading due to the Global Interpreter Lock. For us to really achieve the best throughput, I think we will want to use lazy-loaded connections from process-based Celery workers that then become long-running connections. This, however, is multiple SIPs away, and I anticipate that we will need to retain the ability to run all queries in the foreground for the foreseeable future.

RE: is it worth it? I think that depends heavily on the workload. In terms of freeing up processor cycles on the web server, it could be very worth it. If there is a substantial geographical separation between Superset and the database accepting the connection, or if connections are slow to instantiate on that server, it will definitely be very worth it. I think providing the option of connection pooling could greatly accelerate certain workloads, though you have convinced me that retaining the option of a NullPool is a wise choice.
Another thing to cover as part of this SIP is the configurability of the pools per database. The params are potentially heterogeneous based on pool type, and it's hard to strike a balance between something comprehensive and static vs. flexible. We could have both: maybe different presets in a dropdown list, and the possibility to override with actual pool objects in a dict in superset_config.py. Planning on making another comment addressing your points above.
A global redis-based pool would be very useful from a user perspective. Doing that in a future SIP makes a lot of sense though, since right now there is no connection limit at all. I suggest avoiding offering any configuration in the UI until then, as a config intended to be applied per-process would be difficult to reliably translate to a global config.
I think allowing people to configure their connection pool is a great thing, let's provide ways to do this. I see two main approaches as well as a hybrid.

1 - by config: doing it as configuration, in superset_config.py (see the sketch after this list).

2 - in UI: doing it in the UI, and exposing only the common pool type(s) and parameters that matter. This is likely to be simple at first (only allow …).

3 - hybrid: a hybrid approach could be easy to implement too, where we look for the dict config and fall back on the UI config.
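A minimal sketch of what the config-based option could look like (the `DB_CONNECTION_POOLS` key and its shape are hypothetical, not an existing Superset setting):

```python
# superset_config.py — hypothetical key illustrating the "by config" option
from sqlalchemy.pool import NullPool, QueuePool

# Keyed by database name as configured in Superset; values are the pool
# class plus keyword arguments passed through to SQLAlchemy's create_engine.
DB_CONNECTION_POOLS = {
    "analytics_postgres": {
        "poolclass": QueuePool,
        "pool_size": 5,
        "max_overflow": 10,
        "pool_pre_ping": True,
    },
    # Databases that dislike long-lived connections can opt out entirely.
    "redshift_warehouse": {"poolclass": NullPool},
}
```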
Personally I like the idea of supporting a hybrid approach, giving precedence to the code-based config. However, given that Superset usually runs on multiple concurrent worker processes, I think the only way of achieving true pooling would require some sort of locking outside Python scope (Redis being the top contender, as mentioned above). While it does pose its own set of challenges (not to mention added complexity), somehow it feels simple enough to be manageable, especially if it can be rolled out as an opt-in feature. Therefore I'd vote to at least try building a Redis locking POC, as it should be pretty quick to put together and see what type of overhead or other problems it might introduce.
I think it should be possible to build a SQLAlchemy-compatible pool called …
I like the hybrid approach as well. I'm not excited about adding this configuration to …, though I think there is a valid use-case for un-capped pools, especially when the database connections are to datastores like BigQuery and Athena, or anywhere that supports massively concurrent access. Capped pools are more important for systems like Redshift, Postgres, MySQL, etc., where too many open connections can cause difficulty. I can put together a proof of concept of a Redis-lock-based, capped, distributed pool and we can take a look (a rough sketch of the locking idea is below). I expect to be able to get to it in a couple of days.
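To make the POC idea concrete, here is a sketch of a Redis-backed counting semaphore that a capped, distributed pool could check connections out against (the class and key names are hypothetical, and this is not integrated with SQLAlchemy):

```python
import time

import redis


class RedisConnectionSemaphore:
    """Global cap on open connections, shared across processes via Redis."""

    def __init__(self, client: redis.Redis, key: str, max_connections: int):
        self.client = client
        self.key = key
        self.max_connections = max_connections

    def acquire(self, timeout: float = 30.0, poll: float = 0.1) -> bool:
        """Block until a connection slot is free or the timeout expires."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            # INCR is atomic; if we overshoot the cap, give the slot back.
            if self.client.incr(self.key) <= self.max_connections:
                return True
            self.client.decr(self.key)
            time.sleep(poll)
        return False

    def release(self) -> None:
        # Must be called after the underlying connection is closed/returned.
        self.client.decr(self.key)
```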
I think priority #1 is pool "configurability" per database (achieved with a …). I'd wait for a direct request prior to actually building a …
I don't feel strongly about adding a column or adding on to …. Hybrid data models (mixing structured and semi-structured fields) are becoming more common over time, and databases support them much better than they used to. That doesn't mean it's right, but it clearly reflects a reality of rapidly-evolving / complex schemas.
I'm willing to compromise on the unstructured column. I've updated the SIP to reflect this discussion. |
A couple questions/comments: Will this SIP also cover using connection pooling for connecting to the metadata database from Celery workers? That's a common issue that we've seen, where if too many connections get opened in Celery, we start to overwhelm the total connections available to MySQL (this may also be related to not closing connections after finishing an async query; this is uncertain). What would the user experience be if there are no available connections when making a sync query? Would that request stall and wait for a connection to become available, or instantly fail? Finally, although this may be tangential to the work here, a …
Some answers/responses for @etr2460: This SIP is for connection pooling for analytical databases, not the metadata database. I'd be happy to talk with you to understand your metadata DB issues, but do not want to consider that problem as part of this SIP. I think the most reasonable behavior for a sync query that cannot check out a connection would be to block and wait for a connection to become available, with the failure state being an eventual timeout, but this isn't a strong opinion. I'd be interested in others' thoughts on the matter. I understand the desire for a …
[tangential to this SIP and related to @etr2460 's comment] about Celery connection pooling to the metadata database, it seems reasonable to think that the connection requirements in the Celery context are different from the typical web-server-building-a-page use case, and it'd be good to offer configurability that is context aware (celery / web). Personally I don't think it requires a SIP though if it's just a configuration hook. |
Thanks for the answers Will, it all makes sense to me. I agree that blocking and waiting is the right experience for now. No further questions! |
I like the idea of being able to use connection pooling wherever necessary, even for metadata connections.
@villebro completely agreed RE: connection pooling everywhere being desirable. Given the structure of the system though, pooling for the metadata database is a special case. Connection pooling should already be in use for foreground connections, but Celery presents special concerns, which is why I'd like to treat it separately. We definitely share the same end goal though! |
🏷database |
Approved! |
Does anyone here know the implementation status of this proposal? |
I would be also interested in an update for this proposal. Can anyone post something about this? |
I think this SIP is stale. It may also need to be revised to cover Global Async Queries (or not, I haven't thought this through), as that's probably what we'll try to work towards in the long term. |
Ok, thanks for your reply. |
So it seems like this functionality may already have been partially implemented, sans the configuration within the UI. Specifically the … If the connection pool settings need to differ from the metadata database, then one can leverage SQLAlchemy binds via Flask-SQLAlchemy's … The one missing step is exposing the ability to choose between using the … (a sketch of the engine-options configuration follows below).
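For reference, a sketch of pool tuning via Flask-SQLAlchemy engine options in superset_config.py; `SQLALCHEMY_ENGINE_OPTIONS` is a real Flask-SQLAlchemy setting, but the values shown are illustrative and exact support depends on the Flask-SQLAlchemy version in use:

```python
# superset_config.py — pool settings for the metadata database, passed
# through to SQLAlchemy's create_engine by Flask-SQLAlchemy.
SQLALCHEMY_ENGINE_OPTIONS = {
    "pool_size": 10,        # steady-state connections kept open
    "max_overflow": 5,      # extra connections allowed under burst load
    "pool_pre_ping": True,  # validate connections before use
    "pool_recycle": 3600,   # recycle connections older than an hour
}
```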
Rereading this now, I wonder if the correct design for this should in fact exist downstream of Superset, and not within Superset, for the reasons explained by Max in his first post:
Especially in the context of horizontal scaling of workers, keeping connections open on the worker processes will quickly lead to lots of idle open connections that are not efficiently utilized. So if someone has seen a generic connection pooler for SQLAlchemy connections, similar to Pgpool-II, that could be an awesome solution to this problem (although I'm sure configuring it would be a challenge in itself..). |
Is anyone intending to continue with this? If not, it'll be considered stale and move to the closed/deferred column. |
As much as this matters for OLTP proportionally, where you potentially have high transactions-per-second and a super small unit of work (your average atomic operations are measured in low milliseconds), I feel like it matters significantly less [proportionally] for analytics workloads, where your average latency is more like hundreds or thousands of milliseconds.
Ok... calling it dead in the water... if anyone wants to rekindle this or keep the discussion going, just say the word and we can bring it back in the kanban. |
[SIP] Proposal for Implementing Connection Pooling for Analytics Database Connections
Motivation
Currently, Superset’s connections to analytics databases do not have long-lived connection pools. In most instances, a database connection is spawned immediately before a query is executed and discarded after a single use. This introduces a small amount of latency into every query. While most queries run against data warehouses are expected to be longer-running than a typical web application query, this latency will be noticeable when performing operations such as loading schema and table lists for display in the UI, or loading table definitions and previews.
A more serious concern is that the number of open database connections to analytics databases is only bounded by the number of threads available to the application across all processes. Under peak load, this can lead to hammering databases with a large number of connection requests and queries simultaneously. This does not allow us to provide meaningful upper bounds for the number of available database connections. Implementing connection pooling at the process level will allow us to provide a configurable maximum number of connections that Superset is able to leverage.
Proposed Change
I recommend we add a singleton object to hold a SQLAlchemy Engine instance for each configured database in the application. I believe that engines should not be instantiated on startup, but instead instantiated on first use to avoid unnecessary connection negotiation.
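As a rough illustration of the proposed singleton, a sketch under the assumption that engines are keyed by database ID and built lazily (names and structure are illustrative, not existing Superset code):

```python
from threading import Lock
from typing import Dict

from sqlalchemy import create_engine
from sqlalchemy.engine import Engine


class EngineRegistry:
    """Process-local singleton mapping database IDs to SQLAlchemy engines."""

    _engines: Dict[int, Engine] = {}
    _lock = Lock()

    @classmethod
    def get_engine(cls, database_id: int, uri: str, **engine_kwargs) -> Engine:
        # Engines are created on first use so that startup does not pay
        # connection-negotiation costs for databases that are never queried.
        with cls._lock:
            if database_id not in cls._engines:
                cls._engines[database_id] = create_engine(uri, **engine_kwargs)
            return cls._engines[database_id]

    @classmethod
    def invalidate(cls, database_id: int) -> None:
        # Called after a Database record is created or updated so the
        # process rebuilds its engine (and pool) on next use.
        with cls._lock:
            engine = cls._engines.pop(database_id, None)
            if engine is not None:
                engine.dispose()
```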
I further recommend that we use the SQLAlchemy QueuePool as the default pool implementation while retaining the ability to configure Superset to use a NullPool, configurable via the Database setup system. I would like to make the `pool_size` and `max_overflow` properties configurable, as well as whether to treat the queue as FIFO or LIFO, the `pool_pre_ping` option, and customization of the `connect_args` passed on engine instantiation (which controls things like connection timeouts). I believe that LIFO queues will be preferable for infrequently-accessed database connections, as they will generally maintain a lower number of connections in the pool, and thus should be the default. I would also recommend that for LIFO queues we default to the `pool_pre_ping` option to trigger pool member invalidation when necessary, as stale connections are more likely under the LIFO configuration.

As part of this work, I recommend moving engine instantiation code out of the Database model and into its own module, probably as part of the singleton that will maintain an in-memory list of database pools. We will need to update the code that alters database records to reinitialize the processes' engine after Database record creation and update.
One further change will be in regards to Celery’s connection pooling. Right now, we use the NullPool in Celery and instantiate database connections when needed. For Celery, I would recommend moving to the StaticPool, which will create one database connection per worker process. Because Celery reuses worker processes, this will reduce the overhead on backgrounded queries. An alternative would be to move to threaded workers (gevent or eventlet) and maintain the same pool configuration as the UI. I’d love suggestions from the community on what to recommend here.
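A minimal sketch of the Celery-side recommendation, assuming a small helper that builds one long-lived engine per worker process (the helper name is hypothetical):

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import StaticPool


def make_worker_engine(uri: str):
    # StaticPool keeps a single connection open and reuses it for every
    # checkout, which suits Celery's prefork workers where each process
    # handles one task at a time.
    return create_engine(uri, poolclass=StaticPool)
```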
New or Changed Public Interfaces
This change should have minimal impact on the UI, the primary change being the addition of more configuration options in the Databases section. I would recommend having sensible defaults and hiding the pool setup under an `Advanced` configuration section. I plan to provide guidance on the meaning of the `pool_size`, `max_overflow`, and FIFO vs LIFO configuration parameters, both in the UI and in new documentation. The configuration approach will be hybrid, allowing global configuration of defaults in `config.py`, with overrides available on a per-database basis in the UI.

New dependencies
No additional dependencies will be necessary.
Migration Plan and Compatibility
A database migration will be necessary to add an additional field to the DBs table to hold connection pooling arguments.
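For illustration, the migration would be roughly of this shape (the column name and type are placeholders; the actual migration would be generated within the Superset repo):

```python
import sqlalchemy as sa
from alembic import op


def upgrade():
    # Add a nullable field on the dbs table to hold per-database pool settings.
    op.add_column(
        "dbs",
        sa.Column("connection_pool_params", sa.Text(), nullable=True),
    )


def downgrade():
    op.drop_column("dbs", "connection_pool_params")
```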
No URLs will change as part of this work. I would like feedback from the community, particularly engineers at Airbnb, Lyft, and other organizations with large Superset installs, on what sensible defaults for connection pools would look like.
Rejected Alternatives
The primary alternative rejected is the current, connection-pool-less state. While this state allows for only the number of connections needed at any given time to be in use, it falls down with regard to performance and the predictability of the number of open connections at any given time.
I also considered the other connection pool implementations in SQLAlchemy, but it appears that our use-case is best served by the QueuePool implementation.
One additional piece I considered was providing an option to the user of configuring an overall, rather than per-process, maximum number of connections. In that case, processes would need to “check out” the ability to make a connection from a distributed lock built in Redis, or the max size would need to be large enough to provide at least one connection per live process. While I think this would be a better experience for most users, I’m concerned about the additional application complexity required by such a change. Would processes need to register themselves in Redis on boot so we could get a correct count of the number of live processes? What happens when we need to scale up beyond the global maximum number of database connections? I think solving those problems is not easy, and most use-cases will be well-enough served by a per-process max number of connections.