dashboard locked if api/v1/viz DELETE requests are sent on table involved in batch sql jobs #12829
Comments
This is caused because the … The solution is to set a …

cc @dgaubert Could this break something engine-wise?
We can leave this for next shift.
From the docs: …

So I'm not sure a …
The point is, if we want consistency and the dashboard, map, and dataset pages need to peek into the catalog, they have to wait or break (lock timeout) during the …

To be clear, I don't think this is an easy problem to solve without adding extra complexity. Also, Engine/Core can help, but the point is which queries get blocked from Backend (and the wonderful Rails ORMs), and how to deal with them from there: resilience to statement/lock timeouts, relaxed consistency, etc.
On second thought, the …

Many people went through this ticket before. Comments from anyone are more than welcome.
@andy-esch can you give me read perms on the table |
Set to private with link @rafatower: |
@rafatower We would not only need to apply …
That's more or less why I proposed setting it at the database level, since it would help for all cases, including Rails, but also things like the Batch SQL API and analyses (which also skip timeouts). We have had problems with those components in the past with competing analyses or user deletion, i.e., this is not exclusive to Rails; you can trigger a similar situation just by not being careful with the Batch API. Although most cases come from Rails, since it's the main user of direct …

I agree that setting a timeout in the most problematic Rails queries is a solution to this particular case, but I still think setting a global …

Of course, the dashboard is still going to break if there is something locked at that point (a long transaction with an exclusive lock), so we are only talking about mitigation: trying to avoid such long locks by limiting waiting queries.

In summary: …
Bonus: we may want to consider setting the …
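For illustration, a database-level setting of the kind proposed above could look like this; the 1s value and the database/role names are placeholders, not what was actually deployed:

```sql
-- Apply a lock_timeout to every new connection to a user database
-- (value and database name are illustrative).
ALTER DATABASE user_db SET lock_timeout = '1s';

-- Or scope it to the role Rails connects with, so Batch SQL API
-- sessions keep the default behaviour.
ALTER ROLE rails_user SET lock_timeout = '1s';

-- Both take effect for new sessions only; existing connections keep
-- their current setting until they reconnect or run SET lock_timeout.
```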
Forgive me if this is a really bad idea, as I do not know the software architecture very well, and I realize this is not an easy idea to carry out, but... what about having a database for dashboards (table metadata, etc.) that's separate from the user database (which stores the actual tables, runs analysis queries, etc.)?
@andy-esch That is actually what we are doing right now. We still need to connect to the user database in order to gather some info, namely: the size of the database and the list of tables (for the ghost tables process, and also if we ever want to show non-cartodbfied tables in the dashboard). We could figure out other ways to do some of that, but we would still need to connect to the user database at some point (although maybe less frequently if we cache it).
@javitonino thanks a lot for your points, you got me into it. Then the simplest solution would be to add a global …

I will run some tests locally, then set up the config changes for staging and production.
These are the queries I got in my local setup: …
The locked queries look slightly more interesting: …
There might also be locks due to autovacuum: …
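For reference, traces like these can be gathered with a query along these lines (not necessarily the exact one used above), assuming PostgreSQL 9.6+, which provides pg_blocking_pids:

```sql
-- List sessions currently waiting on a lock, together with the PIDs
-- of the sessions blocking them.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```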
So, here's what happens (see the sketch at the end of this comment): …

(Please fill in the gaps I might have left out.) At the same time, there are other operations related to DB maintenance and Ghost Tables that may also try to acquire exclusive locks on the target table. (BTW, I generate the table from scratch to ease "reproducibility" of the issue.)

Let's imagine for a moment that we have a relatively short …
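A minimal sketch of that sequence as I read it; the table name is taken from the test dataset in this issue, and the statements are only stand-ins for what Rails and the Batch SQL API actually run:

```sql
-- Session A: a Batch SQL job runs a long statement on the table and
-- holds a ROW EXCLUSIVE lock on it until it finishes.
UPDATE batch_sql_viz_api_lock SET the_geom = the_geom;

-- Session B: the api/v1/viz DELETE request ends up issuing a DROP
-- TABLE, which needs an ACCESS EXCLUSIVE lock and therefore queues
-- behind session A.
DROP TABLE batch_sql_viz_api_lock;

-- Session C: dashboard / ghost tables queries on the same table (even
-- plain SELECTs, which only need ACCESS SHARE) queue behind the
-- pending ACCESS EXCLUSIVE request of session B, so the dashboard
-- request eventually returns a 504.
SELECT count(*) FROM batch_sql_viz_api_lock;
```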
@javitonino any thoughts? Revisiting some alternatives: how about…?
Uhm. I was thinking about the lock at the user DB (…). This is the scenario I have in mind (which I've seen in production), without …

We stay like this until A finishes. With … (see the sketch at the end of this comment):

The deletion of the table will fail due to the timeout until A finishes. So, the lock I had in mind was happening at the user DB, not in the metadata DB. I think we have different scenarios in mind.

About …
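Sketching the two variants above, assuming the setting under discussion is a session-level lock_timeout: without it, the DROP simply queues as in the previous sketch; with it (the 1s value is illustrative):

```sql
-- Session B, now with a cap on how long it may wait for the lock:
SET lock_timeout = '1s';
DROP TABLE batch_sql_viz_api_lock;
-- ERROR:  canceling statement due to lock timeout
-- The DROP gives up after ~1s instead of parking an ACCESS EXCLUSIVE
-- request in front of the dashboard SELECTs; the deletion keeps
-- failing (and can be retried) until session A finishes.
```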
So, in any case, VACUUM won't get blocked by INSERT/UPDATE/DELETE/SELECT (plain VACUUM takes a SHARE UPDATE EXCLUSIVE lock, which does not conflict with those). About alternatives:
We could add a lock timeout to ghost tables (Rails) and the DB size function (extension). As discussed previously, we can also add it for dropping tables. All of these will help. My only concern is that catching all cases can be complicated.
As I mentioned, I discovered that a transaction with …

Another alternative for me: I think we can work around the bulk of this issue by setting a timeout in the …
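One way to scope such a timeout to a single Rails/extension query, rather than globally, is SET LOCAL inside the transaction that runs it; the catalog query below is only a stand-in for whatever ghost tables or the DB size function actually execute:

```sql
BEGIN;
-- Applies only to this transaction and reverts on COMMIT/ROLLBACK.
SET LOCAL lock_timeout = '1s';
-- Illustrative metadata lookup of the kind the dashboard needs:
SELECT n.nspname, c.relname, pg_total_relation_size(c.oid) AS bytes
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r' AND n.nspname = 'public';
COMMIT;
```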
There is indeed, check the lock traces I posted above.
Look at the trace I posted above. It seems to have gotten a …
Oh! There's hope! 😄 At least initially, I prefer that approach. It has clear semantics and no surprising side effects down the road.
Eventually we're going to have to reengineer a little, not just …
I think there is something wrong in the way you gather … I think it might be the …
I'm thinking we should get together and go step by step, since there is certainly something we are missing here. Yes, …

About the missing endpoints, we will most likely consider them in one of our next projects, since we want users to be able to access this. We are aware that we do not have a good way to list tables. In fact, the only public official way is to rely on a query to …
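For illustration only (this is plain PostgreSQL catalog querying, not an official CARTO endpoint), listing a user's tables directly would look something like:

```sql
-- Ordinary tables owned by the current user, skipping system schemas.
SELECT n.nspname AS schema, c.relname AS table_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND pg_get_userbyid(c.relowner) = current_user;
```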
My opinion here is similar to what I have said in other issues related to dashboard components (ghost tables, etc.): I think we are still thinking about how to add a stone to the mountain, instead of rethinking it.

The problem with anything related to what @andy-esch pointed out is that the user mental model does not fit ours, which is pretty much designed from an engineering point of view. Users just want to do an action (delete a table, for instance) and continue working on their things. They do not expect that, while trying to list, let's say, their data, they cannot do so because there is a hidden relationship between the first action and the second one. Just as we do not expect that, by uploading a file to Google Drive, we cannot continue working on a document or searching for something.

From my perspective, that is the theme here. It is not about database internals, it is about how we have to design the entire experience assuming those goals. I believe the new dashboard is a good opportunity to rethink these things. I do not know if @jorgesancha and @alonsogarciapablo, leading that initiative, share my view though.
Since this is a rather complex scenario, I created a script to reproduce it: https://gist.github.com/rafatower/e80b159d0fd66ccd6e7d573470c18604
The DROP SEQUENCE is done only IF EXISTS, so no exception is raised here, just a NOTICE from the DB if it does not exist. That was added a long time ago. It is not really needed, but nobody's touched it, probably out of fear of breaking things.
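For reference, with a made-up sequence name:

```sql
DROP SEQUENCE IF EXISTS batch_sql_viz_api_lock_copy_id_seq;
-- NOTICE:  sequence "batch_sql_viz_api_lock_copy_id_seq" does not exist, skipping
-- No error is raised, so the surrounding code carries on.
```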
Working on a fix here: #14127

This is the result of running the test after the fix: …
So, instead of blocking the …

As a bonus, I added the …
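For illustration, the general technique of pinning a lock_timeout to a single function looks like this (function name, body, and the 1s value are hypothetical, not the actual cartodb extension code):

```sql
-- Hypothetical sketch: attach a lock_timeout to one size function so
-- it fails fast instead of queueing behind an ACCESS EXCLUSIVE lock
-- on some user table.
CREATE OR REPLACE FUNCTION _user_tables_size()
RETURNS bigint AS $$
  SELECT coalesce(sum(pg_total_relation_size(c.oid)), 0)::bigint
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE c.relkind = 'r'
    AND n.nspname NOT IN ('pg_catalog', 'information_schema');
$$ LANGUAGE sql
SET lock_timeout = '1s';
```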
Referenced PR: Update cartodb PG extension to 0.23.0 (…-update) #12829
And fixed in production: https://gist.github.com/rafatower/e80b159d0fd66ccd6e7d573470c18604
Obviously the …
Context
I get 504 gateway timeouts (and a frozen dashboard) when a delete table request is applied on a table involved in a batch sql operation.
Steps to Reproduce
Current Result
Dashboard is frozen until the Batch SQL job completes. Map and dataset pages also cannot be loaded.
Expected result
Batch jobs and requests to delete tables should not freeze the user's account.
Browser and version
Chrome 61.0.3163.91 (Official Build) (64-bit)
macOS 10.12.5
.carto file
None, but you can get a dataset to test here:
https://eschbacher.carto.com/api/v2/sql?q=select+*+from+batch_sql_viz_api_lock_copy&format=csv&filename=batch_sql_viz_api_lock
Additional info
Discovered while developing cartoframes
cc @juanignaciosl