
dashboard locked if api/v1/viz DELETE requests are sent on table involved in batch sql jobs #12829

Closed
andy-esch opened this issue Sep 20, 2017 · 22 comments

@andy-esch
Contributor

andy-esch commented Sep 20, 2017

Context

I get 504 gateway timeouts (and a frozen dashboard) when a delete-table request is sent for a table involved in a Batch SQL operation.

Steps to Reproduce

  1. Have a table with ~1M rows (download one here)
  2. Run a batch operation, then try to delete the table. Here are the operations in Python:
from carto.auth import APIKeyAuthClient
from carto.sql import BatchSQLClient

auth = APIKeyAuthClient('https://eschbacher.carto.com/',
                        'my api key')
# table with 950000 rows
table = 'batch_sql_viz_api_lock'

# update geometry from columns `lat` and `lng`
BatchSQLClient(auth).create([
    "UPDATE {table} SET the_geom = cdb_latlng(lat, lng)".format(table=table)
])
# this viz DELETE is the request that ends up blocked behind the batch job
auth.send('api/v1/viz/{table}'.format(table=table),
          http_method='DELETE')

Current Result

Dashboard is frozen until the Batch SQL job completes. Map and dataset pages also cannot be loaded.

Expected result

Batch jobs and requests to delete tables should not freeze the user's account.

Browser and version

Chrome 61.0.3163.91 (Official Build) (64-bit)
macOS 10.12.5

.carto file

None, but you can get a dataset to test here:
https://eschbacher.carto.com/api/v2/sql?q=select+*+from+batch_sql_viz_api_lock_copy&format=csv&filename=batch_sql_viz_api_lock

Additional info

Discovered while developing cartoframes

cc @juanignaciosl

@javitonino
Contributor

This happens because the DROP TABLE tries to get an exclusive lock and blocks any task queued behind it. It will block until the batch job finishes, then the DROP finishes, and finally all the other queued queries run.

The solution is to set a lock_timeout so we only wait for a limited time. I would suggest doing this globally (setting it at the database level), so we cover all the cases. An alternative is to fix it in our deletion, but the problem could still happen via the SQL API. It's a bit :psycho:, but I feel it's the safer way to make sure this doesn't happen.
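
For illustration, a sketch of how that could look (the database name and the 5s value are made up):

-- Database-level default: every new session against the user DB inherits it
ALTER DATABASE cartodb_user_xxx_db SET lock_timeout = '5s';
-- It could also be scoped to a role, e.g. the superuser connections Rails uses
ALTER ROLE postgres SET lock_timeout = '5s';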

cc @dgaubert Could this break something engine-wise?

@ramiroaznar

cc @Algunenano @dgaubert

@ramiroaznar

We can leave this for the next shift.

@rafatower
Contributor

From the docs

Note that if statement_timeout is nonzero, it is rather pointless to set lock_timeout to the same or larger value, since the statement timeout would always trigger first.
Setting lock_timeout in postgresql.conf is not recommended because it would affect all sessions

So I'm not sure a lock_timeout is the best approach.
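
In other words, lock_timeout only ever triggers if it is strictly smaller than a nonzero statement_timeout, since the statement timeout also counts time spent waiting for locks (values here are just illustrative):

SET statement_timeout = '30s';  -- total budget for the statement, lock waits included
SET lock_timeout = '1s';        -- must be smaller than statement_timeout to have any effect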

Dashboard is frozen until the Batch SQL job completes. Map and dataset pages also cannot be loaded.

The point is, if we want consistency and the dashboard, map, and dataset pages need to peek into the catalog, they have to wait or break (lock timeout) during the DROP operation. The alternative is to assume some inconsistency for a while (i.e. you can see the table being dropped in the dashboard but cannot really perform operations there, or you could see it removed but then the drop fails and the table remains there).

To be clear, I don't think this is a problem that's easy to solve without adding extra complexity.

Also, Engine/Core can help, but the point is which queries get blocked from the Backend (and the wonderful Rails ORMs), and how to deal with them from there: resilience to statement/lock timeouts, relaxed consistency, etc.

@rafatower
Contributor

On second thought, the lock_timeout could be a solution if we could apply it just to the DELETE method. That way it either acquires the lock quickly or gives up equally quickly and returns an error. I can give it a try. What do you think about it @javitonino et al.?

Many people went through this ticket before. Comments from anyone are more than welcome.

@rafatower
Contributor

@andy-esch can you give me read perms on the table batch_sql_viz_api_lock_copy used above? Otherwise I guess I could use any other big table, or generate one just for testing purposes. The point, I guess, is to perform a long locking operation through the SQL Batch API.

@javitonino
Contributor

javitonino commented Jun 27, 2018

@rafatower We would not only need to apply lock_timeout to the delete, but also to other queries that might get locked, like other ALTERs, cartodbfication and some more. Still, it should be a small number, so it could be OK. Two comments:

  • We cannot do a straightforward SET lock_timeout because of pgbouncer (it might get applied to other sessions). However, TIL about SET LOCAL, which sets the parameter only for the current transaction, which could be a good solution.
  • Applying lock_timeout at the DB level could make sense in our case. There is a subtle trick you're missing here. We use the postgres user to do DROP TABLE, which is overridden to have statement_timeout = 0 (and so, it never gives up). However, it does not have an override for lock_timeout, so if we set it at the per-DB level, it would use that.

That's more or less why I proposed setting it at the database level, since it would help for all cases, including Rails, but also things like the Batch SQL API and analyses (which also skip timeouts). We have had problems with those components in the past with competing analyses or user deletion, i.e. this is not exclusive to Rails; you can trigger a similar situation just by not being careful with the Batch API. Although most cases are from Rails, since it's the main user of direct postgres user connections.

I agree that setting a timeout in the most problematic Rails queries is a solution to this particular case, but I still think setting a global lock_timeout could be beneficial for other cases we have not yet pinpointed as clearly as this one.

Of course, the dashboard is still going to break if there is something locked at that point (a long transaction with an exclusive lock), so we are only talking about mitigation: trying to avoid such long locks by limiting waiting queries.

In summary:

  • I still think a per-DB lock_timeout could be a good idea as a safety measure.
  • Even if we do the previous, we probably want to wrap some Rails code in a transaction with a SET LOCAL lock_timeout = 5 or something like that (see the sketch at the end of this comment).

Bonus: we may want to consider setting the lock_timeout in the db size function in the extension. That would also have helped here (the table would be locked, but the dashboard would still work).
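
A minimal sketch of that SET LOCAL approach around the table drop (table name taken from the reproduction above; the timeout value is illustrative):

BEGIN;
-- SET LOCAL only affects this transaction, so pgbouncer cannot leak the setting to other sessions
SET LOCAL lock_timeout = '5s';
DROP TABLE IF EXISTS public.batch_sql_viz_api_lock;
COMMIT;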

@andy-esch
Contributor Author

Forgive me if this is a really bad idea, as I do not know the software architecture very well, and I realize this is not an easy idea to carry out, but... What about having a database for dashboards (table metadata, etc.) that's separate from the user database (which stores the actual tables, runs analysis queries, etc.)?

@javitonino
Contributor

@andy-esch That is actually what we are doing right now. We still need to connect to the user database in order to gather some info, namely: the size of the database and the list of tables (for the ghost tables process, and also, if we ever want to show non-cartodbfied tables in the dashboard).

We could figure other ways to do some of that, but we would still need to connect to the user database at some point (although maybe less frequently if we cache it).

@rafatower
Contributor

@javitonino thanks a lot for your points, you've got me on board.

Then the simplest solution would be to add a global lock_timeout to both the user and metadata DBs. I'd go for that one because it is something that can be easily set up and reverted if needed.

I will run some tests locally, then set up the config changes for staging and production.

@rafatower
Contributor

These are the queries I got in my local setup:

postgres=# select * from pg_stat_activity where state is not NULL and state <> 'idle';
-[ RECORD 1 ]----+-------------------------------------------------------------------------------------------------------------
datid            | 12369
datname          | postgres
pid              | 13755
usesysid         | 10
usename          | postgres
application_name | psql
client_addr      | 
client_hostname  | 
client_port      | -1
backend_start    | 2018-06-28 13:50:05.001721+00
xact_start       | 2018-06-28 14:05:53.024664+00
query_start      | 2018-06-28 14:05:53.024664+00
state_change     | 2018-06-28 14:05:53.024667+00
wait_event_type  | 
wait_event       | 
state            | active
backend_xid      | 
backend_xmin     | 637106
query            | select * from pg_stat_activity where state is not NULL and state <> 'idle';
backend_type     | client backend
-[ RECORD 2 ]----+-------------------------------------------------------------------------------------------------------------
datid            | 16401
datname          | cartodb_dev_user_5659debf-7f98-4403-9c63-be961436eda9_db
pid              | 23144
usesysid         | 10
usename          | postgres
application_name | /home/rtorre/.rbenv/versions/2.2.6/bin/thin
client_addr      | 127.0.0.1
client_hostname  | 
client_port      | 41402
backend_start    | 2018-06-27 15:15:42.713173+00
xact_start       | 2018-06-28 14:05:38.185147+00
query_start      | 2018-06-28 14:05:38.185147+00
state_change     | 2018-06-28 14:05:38.18515+00
wait_event_type  | Lock
wait_event       | relation
state            | active
backend_xid      | 637109
backend_xmin     | 637106
query            | DROP TABLE IF EXISTS public.batch_sql_viz_api_lock
backend_type     | client backend
-[ RECORD 3 ]----+-------------------------------------------------------------------------------------------------------------
datid            | 16401
datname          | cartodb_dev_user_5659debf-7f98-4403-9c63-be961436eda9_db
pid              | 23181
usesysid         | 79795
usename          | development_cartodb_user_5659debf-7f98-4403-9c63-be961436eda9
application_name | cartodb_sqlapi
client_addr      | 127.0.0.1
client_hostname  | 
client_port      | 41430
backend_start    | 2018-06-27 15:15:44.145693+00
xact_start       | 2018-06-28 14:05:37.94842+00
query_start      | 2018-06-28 14:05:37.94842+00
state_change     | 2018-06-28 14:05:37.948422+00
wait_event_type  | 
wait_event       | 
state            | active
backend_xid      | 637106
backend_xmin     | 637106
query            | /* 37535288-089c-44c8-9887-80071c3e60a1 */ UPDATE batch_sql_viz_api_lock SET the_geom = cdb_latlng(lat, lng)
backend_type     | client backend
-[ RECORD 4 ]----+-------------------------------------------------------------------------------------------------------------
datid            | 16400
datname          | carto_db_development
pid              | 23182
usesysid         | 10
usename          | postgres
application_name | /home/rtorre/.rbenv/versions/2.2.6/bin/thin
client_addr      | 127.0.0.1
client_hostname  | 
client_port      | 41434
backend_start    | 2018-06-27 15:15:44.230431+00
xact_start       | 2018-06-28 14:05:37.966495+00
query_start      | 2018-06-28 14:05:38.159058+00
state_change     | 2018-06-28 14:05:38.159121+00
wait_event_type  | Client
wait_event       | ClientRead
state            | idle in transaction
backend_xid      | 637107
backend_xmin     | 
query            | DELETE FROM "user_tables" WHERE "user_tables"."id" = '6e7c72bb-455f-4dc1-8691-bf2bb63203c3'
backend_type     | client backend

@rafatower
Contributor

The locked queries look slightly more interesting:

# SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE psa.pid <> pg_backend_pid() and NOT granted;
-[ RECORD 1 ]------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
locktype           | transactionid
database           | 
relation           | 
page               | 
tuple              | 
virtualxid         | 
transactionid      | 637329
classid            | 
objid              | 
objsubid           | 
virtualtransaction | 20/772
pid                | 20987
mode               | ShareLock
granted            | f
fastpath           | f
datid              | 16400
datname            | carto_db_development
pid                | 20987
usesysid           | 10
usename            | postgres
application_name   | /home/rtorre/.rbenv/versions/2.2.6/bin/thin
client_addr        | 127.0.0.1
client_hostname    | 
client_port        | 34412
backend_start      | 2018-06-28 14:08:01.58358+00
xact_start         | 2018-06-28 15:38:18.99111+00
query_start        | 2018-06-28 15:38:19.011034+00
state_change       | 2018-06-28 15:38:19.011041+00
wait_event_type    | Lock
wait_event         | transactionid
state              | active
backend_xid        | 637332
backend_xmin       | 637328
query              | UPDATE "user_tables" SET "name" = 'batch_sql_viz_api_lock', "privacy" = 0, "updated_at" = '2018-06-28 19:31:26.617356+0200', "tags" = NULL, "geometry_columns" = NULL, "rows_counted" = 0, "rows_estimated" = 0, "indexes" = NULL, "database_name" = NULL, "description" = NULL, "table_id" = 1007641, "user_id" = '5659debf-7f98-4403-9c63-be961436eda9', "map_id" = '691bc7f4-b6ff-489a-9efd-eca6224a6803', "data_import_id" = NULL WHERE ("id" = '644e1f7a-7244-478e-a707-4389fa1d0d96')
backend_type       | client backend
-[ RECORD 2 ]------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
locktype           | relation
database           | 16401
relation           | 1007641
page               | 
tuple              | 
virtualxid         | 
transactionid      | 
classid            | 
objid              | 
objsubid           | 
virtualtransaction | 12/718
pid                | 23144
mode               | AccessExclusiveLock
granted            | f
fastpath           | f
datid              | 16401
datname            | cartodb_dev_user_5659debf-7f98-4403-9c63-be961436eda9_db
pid                | 23144
usesysid           | 10
usename            | postgres
application_name   | /home/rtorre/.rbenv/versions/2.2.6/bin/thin
client_addr        | 127.0.0.1
client_hostname    | 
client_port        | 41402
backend_start      | 2018-06-27 15:15:42.713173+00
xact_start         | 2018-06-28 15:37:01.715414+00
query_start        | 2018-06-28 15:37:01.715414+00
state_change       | 2018-06-28 15:37:01.715415+00
wait_event_type    | Lock
wait_event         | relation
state              | active
backend_xid        | 637331
backend_xmin       | 637328
query              | DROP TABLE IF EXISTS public.batch_sql_viz_api_lock
backend_type       | client backend

@rafatower
Contributor

There might also be locks due to autovacuum:

-[ RECORD 4 ]------+---------------------------------------------------------
locktype           | virtualxid
database           | 
relation           | 
page               | 
tuple              | 
virtualxid         | 5/5125
transactionid      | 
classid            | 
objid              | 
objsubid           | 
virtualtransaction | 5/5125
pid                | 23652
mode               | ExclusiveLock
granted            | t
fastpath           | t
datid              | 16401
datname            | cartodb_dev_user_5659debf-7f98-4403-9c63-be961436eda9_db
pid                | 23652
usesysid           | 
usename            | 
application_name   | 
client_addr        | 
client_hostname    | 
client_port        | 
backend_start      | 2018-06-28 15:44:14.196075+00
xact_start         | 2018-06-28 15:44:14.218102+00
query_start        | 2018-06-28 15:44:14.218102+00
state_change       | 2018-06-28 15:44:14.218104+00
wait_event_type    | 
wait_event         | 
state              | active
backend_xid        | 
backend_xmin       | 637364
query              | autovacuum: VACUUM public.batch_sql_viz_api_lock
backend_type       | autovacuum worker

@rafatower
Contributor

So, here's what happens:

  • The UPDATE locks the target table
  • The ORM gets a RowExclusiveLock on the metadata table and launches a DROP TABLE on the target table
  • The ORM neither deletes nor frees the metadata row, because it is waiting for the DROP to complete
  • When there's an access to the dashboard, an update on the metadata row is attempted (!) (to update the updated_at?), and it is blocked waiting for the RowExclusiveLock to be freed.

(please fill in the gaps I might have left out)

At the same time, there are other operations related to DB maintenance and Ghost Tables that may also try to acquire exclusive locks on the target table.

(BTW, I generate the table from scratch to ease "reproducibility" of the issue).

Let's imagine for a moment that we have a relatively short lock_timeout, what would happen then?

  • The ORM wouldn't be able to fulfill whatever query needs the RowExclusiveLock, and would return an error to the frontend. So it wouldn't really fix the problem at hand.
  • Some operations required for DB maintenance would give up as well. E.g. if there's just a VACUUM once a day, it may never happen (thanks @Algunenano for pointing that out). The worst part is that it is hard to know beforehand what exactly would be negatively affected by a lock_timeout.

@javitonino any thoughts? Revisiting some alternatives: how about..?

  • adding the lock_timeout just to the metadata DBs and serving a cached version of the list of tables if it times out?
  • trying to set the lock_timeout just for the delete triggered from the ORM? (I know it may not be obvious because of the layers of abstraction and pgbouncer, but there are ways, including bypassing pgbouncer. In the end, DROP is not the most common operation.)

@javitonino
Contributor

When there's an access to the dashboard, an update on the metadata row is attempted (!) (to update the updated_at?), and is blocked waiting for the RowExclusiveLock to be freed.

Uhm. I was thinking about the lock at the user DB (DROP TABLE blocks CDB_UserDataSize, which is called from the dashboard). Maybe there is something in metadata as well, but I don't think so: a RowExclusiveLock does not affect SELECTs (they will just see an older version of the table until the transaction finishes). I don't think we have a locking problem in metadata, but rather at the user DB.

This is the scenario I have in mind (which I've seen in production), without lock_timeout.

  • Session A (let's say an analysis) is running an UPDATE. I emulate this by opening a transaction, running an UPDATE and never closing the transaction. It holds a RowExclusiveLock.
  • Session B (Rails) runs DROP TABLE t. It tries to get an AccessExclusiveLock, but it cannot until A finishes.
  • Session C (dashboard) tries to run CDB_UserDataSize, which is blocked behind the pending AccessExclusiveLock.

We stay like this until A finishes.

With lock_timeout:

  • Session A (let's say an analysis) is running an UPDATE. I emulate this by opening a transaction, running an UPDATE and never closing the transaction. It holds a RowExclusiveLock.
  • Session B (Rails) runs DROP TABLE t. It tries to get an AccessExclusiveLock, but it cannot until A finishes.
  • Session C (dashboard) tries to run CDB_UserDataSize, which is blocked behind the pending AccessExclusiveLock.
  • Session B's lock_timeout expires and the DROP is aborted.
  • Session C continues and displays the dashboard.

The deletion of the table will keep failing due to the timeout until A finishes. lock_timeout has helped in the sense that only one operation (the table deletion) fails, instead of the rest of the dashboard being blocked.

So, the lock I had in mind was happening at the user DB, not in the metadata DB. I think we have different scenarios in mind.
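
A rough way to emulate those three sessions from psql, reusing the table from the reproduction above (the 1s timeout is illustrative):

-- Session A (analysis): hold a RowExclusiveLock by leaving a transaction open
BEGIN;
UPDATE batch_sql_viz_api_lock SET the_geom = cdb_latlng(lat, lng);
-- ...no COMMIT yet...

-- Session B (Rails): queues behind A for an AccessExclusiveLock, but gives up after 1s
BEGIN;
SET LOCAL lock_timeout = '1s';
DROP TABLE batch_sql_viz_api_lock;  -- ERROR: canceling statement due to lock timeout
ROLLBACK;

-- Session C (dashboard): once B has given up, this no longer queues behind the pending DROP
SELECT cartodb.CDB_UserDataSize();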


About VACUUM: it needs a SHARE UPDATE EXCLUSIVE lock, so it would only be affected by higher levels of locking (see this table: https://www.postgresql.org/docs/10/static/explicit-locking.html#TABLE-LOCK-COMPATIBILITY). This means it will only see a difference if there is a pending lock of one of these levels (typical usages in parentheses):

  • SHARE UPDATE EXCLUSIVE (vacuum, analyze, some alter table, etc)
  • SHARE (create index)
  • SHARE ROW EXCLUSIVE (some alter table)
  • EXCLUSIVE (materialized views)
  • ACCESS EXCLUSIVE (drop, truncate, vacuum full)

So, in any case, VACUUM won't get locked by INSERT/UPDATE/DELETE/SELECT.


About alternatives:

adding the lock_timeout just to metadata DB's and serving a cache version of the list of tables if it times out?

We could add a lock timeout to ghost tables (Rails) and the DB size function (extension). As discussed previously, we can also add it for dropping tables. All of these will help. My only concern is that catching all the cases can be complicated.

trying to set the lock_timeout just for the delete triggered from the ORM? (I know it may not be obvious because of the layers of abstraction and pgbouncer, but there are ways, including bypassing pgbouncer. In the end DROP shall not be the most common operation)

As I mentioned, I discovered that a transaction with SET LOCAL works great for this, so we could go for this. As previously discussed, the only problem is catching them all.


Another alternative from me: I think we can work around the bulk of this issue by setting a timeout in the CDB_UserDataSize function (either at the function level or the caller level).
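
For example, something along these lines (the exact CDB_UserDataSize signatures depend on the extension version, so treat this as a sketch):

-- Function level: the setting applies for the duration of each call and is reset afterwards
ALTER FUNCTION cartodb.CDB_UserDataSize() SET lock_timeout = '1s';
-- Caller level: only the surrounding transaction is affected, which plays nicely with pgbouncer
BEGIN;
SET LOCAL lock_timeout = '1s';
SELECT cartodb.CDB_UserDataSize();
COMMIT;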

@rafatower
Contributor

But maybe there is something in metadata as well. But I don't think so

There is indeed, check the lock traces I posted above.

About VACUUM: It needs a SHARE UPDATE EXCLUSIVE LOCK

Look at the trace I posted above. It seems to have gotten an ExclusiveLock. Or am I misinterpreting something? I'll keep digging into it...

As I mentioned, I discovered that a transaction with SET LOCAL works great for this, so we could go for this. As previously discussed, the only problem is catching them all.

Oh! there's hope! 😄

At least initially, I prefer that approach. It has clear semantics and no surprising side effects down the road.

Another alternative for me. I think we can workaround the bulk of this issue by settign a timeout in the CDB_UserDataSize function (either at function level or caller level).

Eventually we're going to have to reengineer a little, not just CDB_UserDataSize but the quota checks in general, with a view to the new dashboard and all that. Side note: the endpoint api/v1/viz/{table} is not a public one, but the fact that Andy resorts to it reveals some weak spots API-wise. cc'ing @inigomedina and @alonsogarciapablo

@javitonino
Contributor

I think there is something wrong in the way you gather pg_locks or in the way we are reading them, because it contradicts the docs. That, or the docs are completely wrong, which is also a possibility. 🤔

I think it might be the lock_type field. In autovacuum, it's holding an ExclusiveLock over a transaction id, but there is no relation-level lock. Reading this table, it looks like:

  • Autovacuum is holding an exclusive lock on a transaction. It should be also holding a lock on the relation (depending on the phase of the vacuum).
  • DROP is holding an AccessExclusiveLock on a relation, which makes sense
  • UPDATE is waiting for transactionid (a transaction to finish). But I don't see the actual transaction that could be locking it (in your listing, there are no other locks in the metadata DB).

I'm thinking we should get together and go step by step, since there is certainly something we are missing here.
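
One thing that may help when we go through it: pg_blocking_pids() (available since PostgreSQL 9.6) resolves the waiter/blocker relationship directly, instead of joining pg_locks by hand:

SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;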


Yes, SET LOCAL in DROPs should be less risky. The reason I don't like it is that it's still possible to lock your database in other ways. It would help in this case, though it's not very frequent (I've only seen it 3-4 times in a couple of years). So my worry is that we will fix this specific case, but who knows how many will remain. I mean, I still think we should do it (all of our parts should play nicely), but I would like a DB-level failsafe anyway.


About the missing endpoints, we will most likely consider it in one of our next projects, since we want users to be able to access it. We are aware that we do not have a good way to list tables. In fact, the only public official way is to rely on a query to pg_catalog. Still, it's a problem we need to clearly understand, since it interacts with ghost tables, and it might be confusing for API users to deal with such concepts that are more closely related to Builder than anything else.

@inigomedina

My opinion here is similar to what I have said in other issues related to dashboard components (ghost tables, etc.): I think we are still thinking about how to add a stone to the mountain, instead of rethinking it. The problem with anything related to what @andy-esch pointed out is that the user's mental model does not fit ours, which is pretty much designed from an engineering point of view.

Users just want to do an action (delete a table, for instance) and continue working on their things. They do not expect that while trying to list, let's say, their data, they cannot do so because there is a hidden relationship between the first action and the second one. Just as we do not expect that uploading a file to Google Drive will stop us from continuing to work on a document or searching for something.

From my perspective, that is the theme here. It is not about database internals, it is about how we have to design the entire experience assuming those goals. I believe the new dashboard is a good opportunity to rethink these things. I do not know if @jorgesancha and @alonsogarciapablo, leading that initiative, share my view though.

rafatower pushed a commit that referenced this issue Jun 29, 2018
rafatower pushed a commit that referenced this issue Jun 29, 2018
@rafatower
Contributor

Since this is a rather complex scenario, I created a script to reproduce it: https://gist.github.com/rafatower/e80b159d0fd66ccd6e7d573470c18604

rafatower pushed a commit that referenced this issue Jul 2, 2018
The DROP SEQUENCE is made only IF EXISTS, so no exception is raised
here. Just a NOTICE from the DB if it does not exist.

That was added a long time ago. It is not really needed, but nobody's
touched it, probably out of fear of breaking things.
@rafatower
Contributor

Working on a fix here: #14127

This is the result of the execution of the test after the fix:

$ pipenv run delete_lock_test -s :8080 -r :3000 http://development.localhost.lan $API_KEY
2018-07-02 18:12:42,999 - INFO - Dropping table if it exists... 
2018-07-02 18:12:43,141 - INFO - dropped
2018-07-02 18:12:43,142 - INFO - Creating table...
2018-07-02 18:12:43,211 - INFO - table created
2018-07-02 18:12:43,211 - INFO - Populating table...
2018-07-02 18:12:50,505 - INFO - table populated
2018-07-02 18:12:50,510 - INFO - Cartodbfy'ing the table
2018-07-02 18:12:56,343 - INFO - Table cartodbfy'ed
2018-07-02 18:12:56,343 - INFO - Sending UPDATE to the Batch API...
2018-07-02 18:12:56,353 - INFO - Waiting for it to start...
2018-07-02 18:12:56,353 - INFO - UPDATE running
2018-07-02 18:12:56,353 - INFO - Making sure table is in the dashboard...
2018-07-02 18:12:58,729 - INFO - Table is now in the dashboard
2018-07-02 18:12:58,729 - INFO - Sending DELETE...
Traceback (most recent call last):
  File "./delete_lock_test.py", line 163, in <module>
    main()
  File "./delete_lock_test.py", line 159, in main
    delete_canonical_viz(auth_rails)
  File "./delete_lock_test.py", line 124, in delete_canonical_viz
    resp.raise_for_status()
  File "/home/rtorre/.local/share/virtualenvs/drop-table-dashboard-locks-BWL7GOYF/local/lib/python2.7/site-packages/requests/models.py", line 939, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://development.localhost.lan:3000/api/v1/viz/batch_sql_viz_api_lock?api_key=730162d337594b567fa5e51919cac5e3061a7197

So, instead of the DROP TABLE blocking on the lock and freezing the whole dashboard, the DELETE request gives up after 1s and returns an error back to the client; the dashboard is not blocked.

As a bonus, I added the lock_timeouts to the CDB_UserDataSize() functions as well, as Javi asked. Since that is a bit out of scope, I'm ready to remove those upon request.

rafatower pushed a commit that referenced this issue Jul 4, 2018
…-update

Update cartodb PG extension to 0.23.0 #12829
@rafatower
Contributor

rafatower commented Jul 4, 2018

And fixed in production:

https://gist.github.com/rafatower/e80b159d0fd66ccd6e7d573470c18604

$ pipenv run delete_lock_test https://rtorre.carto.com $API_KEY
2018-07-04 13:04:06,902 - INFO - Dropping table if it exists... 
2018-07-04 13:04:08,019 - INFO - dropped
2018-07-04 13:04:08,020 - INFO - Creating table...
2018-07-04 13:04:08,655 - INFO - table created
2018-07-04 13:04:08,655 - INFO - Populating table...
2018-07-04 13:04:34,970 - INFO - table populated
2018-07-04 13:04:34,976 - INFO - Cartodbfy'ing the table
2018-07-04 13:05:02,010 - INFO - Table cartodbfy'ed
2018-07-04 13:05:02,010 - INFO - Sending UPDATE to the Batch API...
2018-07-04 13:05:02,160 - INFO - Waiting for it to start...
2018-07-04 13:05:02,160 - INFO - UPDATE running
2018-07-04 13:05:02,160 - INFO - Making sure table is in the dashboard...
2018-07-04 13:05:07,943 - INFO - Table is now in the dashboard
2018-07-04 13:05:07,944 - INFO - Sending DELETE...
2018-07-04 13:05:12,430 - INFO - Could not delete canonical viz, status_code = 400
2018-07-04 13:05:12,431 - INFO - Cancelling UPDATE...
2018-07-04 13:05:12,589 - INFO - UPDATE canceled.

Obviously the "Could not delete canonical viz" is because of the lock timeout, and it is logged as such in Rollbar: https://rollbar.com/carto/CartoDB/items/36087/
