
dashboard locked if api/v1/viz DELETE requests are sent on table involved in batch sql jobs #12829

Closed
andy-esch opened this issue Sep 20, 2017 · 22 comments

@andy-esch
Contributor

andy-esch commented Sep 20, 2017

Context

I get 504 gateway timeouts (and a frozen dashboard) when a delete-table request is sent for a table involved in a Batch SQL operation.

Steps to Reproduce

  1. Have a table with ~1M rows (download one here)
  2. Run a batch operation, then try to delete the table. Here are the operations in Python:
from carto.auth import APIKeyAuthClient
from carto.sql import BatchSQLClient

auth = APIKeyAuthClient('https://eschbacher.carto.com/',
                        'my api key')
# table with 950000 rows
table = 'batch_sql_viz_api_lock'

# update geometry from columns `lat` and `lng`
BatchSQLClient(auth).create([
    "UPDATE {table} SET the_geom = cdb_latlng(lat, lng)".format(table=table)
])
# this viz DELETE is the request that ends up blocked behind the batch job
auth.send('api/v1/viz/{table}'.format(table=table),
          http_method='DELETE')

Current Result

Dashboard is frozen until the Batch SQL job completes. Map and dataset pages also cannot be loaded.

Expected result

Batch jobs and requests to delete tables should not freeze the user's account.

Browser and version

Chrome 61.0.3163.91 (Official Build) (64-bit)
macOS 10.12.5

.carto file

None, but you can get a dataset to test here:
https://eschbacher.carto.com/api/v2/sql?q=select+*+from+batch_sql_viz_api_lock_copy&format=csv&filename=batch_sql_viz_api_lock

Additional info

Discovered while developing cartoframes

cc @juanignaciosl

@javitonino
Contributor

This happens because the DROP TABLE tries to get an exclusive lock and blocks any task queued behind it. It will block until the batch job finishes, then the DROP finishes, and finally all the other queued queries run.

The solution is to set a lock_timeout so we only wait for a limited time. I would suggest doing this globally (setting it at the database level), so we cover all the cases. An alternative is to fix it in our deletion, but the problem could still happen via the SQL API. It's a bit :psycho:, but I feel it's the safer way to make sure this doesn't happen.
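
For illustration, a sketch of how that could look (the database name and the 5s value are made up):

-- Database-level default: every new session against the user DB inherits it
ALTER DATABASE cartodb_user_xxx_db SET lock_timeout = '5s';
-- It could also be scoped to a role, e.g. the superuser connections Rails uses
ALTER ROLE postgres SET lock_timeout = '5s';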

cc @dgaubert Could this break something engine-wise?

@ramiroaznar

cc @Algunenano @dgaubert

@ramiroaznar

We can leave this for the next shift.

@rafatower
Contributor

From the docs

Note that if statement_timeout is nonzero, it is rather pointless to set lock_timeout to the same or larger value, since the statement timeout would always trigger first.
Setting lock_timeout in postgresql.conf is not recommended because it would affect all sessions

So I'm not sure a lock_timeout is the best approach.
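
In other words, lock_timeout only ever triggers if it is strictly smaller than a nonzero statement_timeout, since the statement timeout also counts time spent waiting for locks (values here are just illustrative):

SET statement_timeout = '30s';  -- total budget for the statement, lock waits included
SET lock_timeout = '1s';        -- must be smaller than statement_timeout to have any effect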

Dashboard is frozen until the Batch SQL job completes. Map and dataset pages also cannot be loaded.

The point is, if we want consistency and the dashboard, map, and dataset pages need to peek into the catalog, they have to wait or break (lock timeout) during the DROP operation. The alternative is to assume some inconsistency for a while (i.e. you can see the table being dropped in the dashboard but cannot really perform operations there, or you could see it removed but then the drop fails and the table remains there).

To be clear, I don't think this is a problem that's easy to solve without adding extra complexity.

Also, Engine/Core can help, but the point is which queries get blocked from the Backend (and the wonderful Rails ORMs), and how to deal with them from there: resilience to statement/lock timeouts, relaxed consistency, etc.

@rafatower
Contributor

On second thought, the lock_timeout could be a solution if we could apply it just to the DELETE method. That way it either acquires the lock quickly or gives up equally quickly and returns an error. I can give it a try. What do you think about it @javitonino et al.?

Many people went through this ticket before. Comments from anyone are more than welcome.

@rafatower
Contributor

@andy-esch can you give me read perms on the table batch_sql_viz_api_lock_copy used above? Otherwise I guess I could use any other big table, or generate one just for testing purposes. The point, I guess, is to perform a long locking operation through the SQL Batch API.

@javitonino
Contributor

javitonino commented Jun 27, 2018

@rafatower We would not only need to apply lock_timeout to the delete, but also to other queries that might get locked, like other ALTERs, cartodbfication and some more. Still, it should be a small number, so it could be OK. Two comments:

  • We cannot do a straightforward SET lock_timeout because of pgbouncer (it might get applied to other sessions). However, TIL about SET LOCAL, which sets the parameter only for the current transaction, which could be a good solution.
  • Applying lock_timeout at the DB level could make sense in our case. There is a subtle trick you're missing here. We use the postgres user to do DROP TABLE, which is overridden to have statement_timeout = 0 (and so, it never gives up). However, it does not have an override for lock_timeout, so if we set it at the per-DB level, it would use that.

That's more or less why I proposed setting it at the database level, since it would help for all cases, including Rails, but also things like the Batch SQL API and analyses (which also skip timeouts). We have had problems with those components in the past with competing analyses or user deletion, i.e. this is not exclusive to Rails; you can trigger a similar situation just by not being careful with the Batch API. Although most cases are from Rails, since it's the main user of direct postgres user connections.

I agree that setting a timeout in the most problematic Rails queries is a solution to this particular case, but I still think setting a global lock_timeout could be beneficial for other cases we have not yet pinpointed as clearly as this one.

Of course, the dashboard is still going to break if there is something locked at that point (a long transaction with an exclusive lock), so we are only talking about mitigation: trying to avoid such long locks by limiting waiting queries.

In summary:

  • I still think a per-DB lock_timeout could be a good idea as a safety measure.
  • Even if we do the previous, we probably want to wrap some Rails code in a transaction with a SET LOCAL lock_timeout = 5 or something like that (see the sketch at the end of this comment).

Bonus: we may want to consider setting the lock_timeout in the db size function in the extension. That would also have helped here (the table would be locked, but the dashboard would still work).
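
A minimal sketch of that SET LOCAL approach around the table drop (table name taken from the reproduction above; the timeout value is illustrative):

BEGIN;
-- SET LOCAL only affects this transaction, so pgbouncer cannot leak the setting to other sessions
SET LOCAL lock_timeout = '5s';
DROP TABLE IF EXISTS public.batch_sql_viz_api_lock;
COMMIT;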

@andy-esch
Contributor Author

Forgive me if this is a really bad idea, as I do not know the software architecture very well, and I realize this is not an easy idea to carry out, but... What about having a database for dashboards (table metadata, etc.) that's separate from the user database (which stores the actual tables, runs analysis queries, etc.)?

@javitonino
Contributor

@andy-esch That is actually what we are doing right now. We still need to connect to the user database in order to gather some info, namely: the size of the database and the list of tables (for the ghost tables process, and also, if we ever want to show non-cartodbfied tables in the dashboard).

We could figure other ways to do some of that, but we would still need to connect to the user database at some point (although maybe less frequently if we cache it).

@rafatower
Contributor

@javitonino thanks a lot for your points, you've got me on board.

Then the simplest solution would be to add a global lock_timeout to both the user and metadata DBs. I'd go for that one because it is something that can be easily set up and reverted if needed.

I will run some tests locally, then set up the config changes for staging and production.

@rafatower
Contributor

These are the queries I got in my local setup:

postgres=# select * from pg_stat_activity where state is not NULL and state <> 'idle';
-[ RECORD 1 ]----+-------------------------------------------------------------------------------------------------------------
datid            | 12369
datname          | postgres
pid              | 13755
usesysid         | 10
usename          | postgres
application_name | psql
client_addr      | 
client_hostname  | 
client_port      | -1
backend_start    | 2018-06-28 13:50:05.001721+00
xact_start       | 2018-06-28 14:05:53.024664+00
query_start      | 2018-06-28 14:05:53.024664+00
state_change     | 2018-06-28 14:05:53.024667+00
wait_event_type  | 
wait_event       | 
state            | active
backend_xid      | 
backend_xmin     | 637106
query            | select * from pg_stat_activity where state is not NULL and state <> 'idle';
backend_type     | client backend
-[ RECORD 2 ]----+-------------------------------------------------------------------------------------------------------------
datid            | 16401
datname          | cartodb_dev_user_5659debf-7f98-4403-9c63-be961436eda9_db
pid              | 23144
usesysid         | 10
usename          | postgres
application_name | /home/rtorre/.rbenv/versions/2.2.6/bin/thin
client_addr      | 127.0.0.1
client_hostname  | 
client_port      | 41402
backend_start    | 2018-06-27 15:15:42.713173+00
xact_start       | 2018-06-28 14:05:38.185147+00
query_start      | 2018-06-28 14:05:38.185147+00
state_change     | 2018-06-28 14:05:38.18515+00
wait_event_type  | Lock
wait_event       | relation
state            | active
backend_xid      | 637109
backend_xmin     | 637106
query            | DROP TABLE IF EXISTS public.batch_sql_viz_api_lock
backend_type     | client backend
-[ RECORD 3 ]----+-------------------------------------------------------------------------------------------------------------
datid            | 16401
datname          | cartodb_dev_user_5659debf-7f98-4403-9c63-be961436eda9_db
pid              | 23181
usesysid         | 79795
usename          | development_cartodb_user_5659debf-7f98-4403-9c63-be961436eda9
application_name | cartodb_sqlapi
client_addr      | 127.0.0.1
client_hostname  | 
client_port      | 41430
backend_start    | 2018-06-27 15:15:44.145693+00
xact_start       | 2018-06-28 14:05:37.94842+00
query_start      | 2018-06-28 14:05:37.94842+00
state_change     | 2018-06-28 14:05:37.948422+00
wait_event_type  | 
wait_event       | 
state            | active
backend_xid      | 637106
backend_xmin     | 637106
query            | /* 37535288-089c-44c8-9887-80071c3e60a1 */ UPDATE batch_sql_viz_api_lock SET the_geom = cdb_latlng(lat, lng)
backend_type     | client backend
-[ RECORD 4 ]----+-------------------------------------------------------------------------------------------------------------
datid            | 16400
datname          | carto_db_development
pid              | 23182
usesysid         | 10
usename          | postgres
application_name | /home/rtorre/.rbenv/versions/2.2.6/bin/thin
client_addr      | 127.0.0.1
client_hostname  | 
client_port      | 41434
backend_start    | 2018-06-27 15:15:44.230431+00
xact_start       | 2018-06-28 14:05:37.966495+00
query_start      | 2018-06-28 14:05:38.159058+00
state_change     | 2018-06-28 14:05:38.159121+00
wait_event_type  | Client
wait_event       | ClientRead
state            | idle in transaction
backend_xid      | 637107
backend_xmin     | 
query            | DELETE FROM "user_tables" WHERE "user_tables"."id" = '6e7c72bb-455f-4dc1-8691-bf2bb63203c3'
backend_type     | client backend

@rafatower
Contributor

The locked queries look slightly more interesting:

# SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE psa.pid <> pg_backend_pid() and NOT granted;
-[ RECORD 1 ]------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
locktype           | transactionid
database           | 
relation           | 
page               | 
tuple              | 
virtualxid         | 
transactionid      | 637329
classid            | 
objid              | 
objsubid           | 
virtualtransaction | 20/772
pid                | 20987
mode               | ShareLock
granted            | f
fastpath           | f
datid              | 16400
datname            | carto_db_development
pid                | 20987
usesysid           | 10
usename            | postgres
application_name   | /home/rtorre/.rbenv/versions/2.2.6/bin/thin
client_addr        | 127.0.0.1
client_hostname    | 
client_port        | 34412
backend_start      | 2018-06-28 14:08:01.58358+00
xact_start         | 2018-06-28 15:38:18.99111+00
query_start        | 2018-06-28 15:38:19.011034+00
state_change       | 2018-06-28 15:38:19.011041+00
wait_event_type    | Lock
wait_event         | transactionid
state              | active
backend_xid        | 637332
backend_xmin       | 637328
query              | UPDATE "user_tables" SET "name" = 'batch_sql_viz_api_lock', "privacy" = 0, "updated_at" = '2018-06-28 19:31:26.617356+0200', "tags" = NULL, "geometry_columns" = NULL, "rows_counted" = 0, "rows_estimated" = 0, "indexes" = NULL, "database_name" = NULL, "description" = NULL, "table_id" = 1007641, "user_id" = '5659debf-7f98-4403-9c63-be961436eda9', "map_id" = '691bc7f4-b6ff-489a-9efd-eca6224a6803', "data_import_id" = NULL WHERE ("id" = '644e1f7a-7244-478e-a707-4389fa1d0d96')
backend_type       | client backend
-[ RECORD 2 ]------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
locktype           | relation
database           | 16401
relation           | 1007641
page               | 
tuple              | 
virtualxid         | 
transactionid      | 
classid            | 
objid              | 
objsubid           | 
virtualtransaction | 12/718
pid                | 23144
mode               | AccessExclusiveLock
granted            | f
fastpath           | f
datid              | 16401
datname            | cartodb_dev_user_5659debf-7f98-4403-9c63-be961436eda9_db
pid                | 23144
usesysid           | 10
usename            | postgres
application_name   | /home/rtorre/.rbenv/versions/2.2.6/bin/thin
client_addr        | 127.0.0.1
client_hostname    | 
client_port        | 41402
backend_start      | 2018-06-27 15:15:42.713173+00
xact_start         | 2018-06-28 15:37:01.715414+00
query_start        | 2018-06-28 15:37:01.715414+00
state_change       | 2018-06-28 15:37:01.715415+00
wait_event_type    | Lock
wait_event         | relation
state              | active
backend_xid        | 637331
backend_xmin       | 637328
query              | DROP TABLE IF EXISTS public.batch_sql_viz_api_lock
backend_type       | client backend

@rafatower
Contributor

There might also be locks due to autovacuum:

-[ RECORD 4 ]------+---------------------------------------------------------
locktype           | virtualxid
database           | 
relation           | 
page               | 
tuple              | 
virtualxid         | 5/5125
transactionid      | 
classid            | 
objid              | 
objsubid           | 
virtualtransaction | 5/5125
pid                | 23652
mode               | ExclusiveLock
granted            | t
fastpath           | t
datid              | 16401
datname            | cartodb_dev_user_5659debf-7f98-4403-9c63-be961436eda9_db
pid                | 23652
usesysid           | 
usename            | 
application_name   | 
client_addr        | 
client_hostname    | 
client_port        | 
backend_start      | 2018-06-28 15:44:14.196075+00
xact_start         | 2018-06-28 15:44:14.218102+00
query_start        | 2018-06-28 15:44:14.218102+00
state_change       | 2018-06-28 15:44:14.218104+00
wait_event_type    | 
wait_event         | 
state              | active
backend_xid        | 
backend_xmin       | 637364
query              | autovacuum: VACUUM public.batch_sql_viz_api_lock
backend_type       | autovacuum worker

@rafatower
Contributor

So, here's what happens:

  • The UPDATE locks the target table
  • The ORM gets a RowExclusiveLock on the metadata table and launches a DROP TABLE on the target table
  • The ORM neither deletes nor frees the metadata row, because it is waiting for the DROP to complete
  • When there's an access to the dashboard, an update on the metadata row is attempted (!) (to update the updated_at?), and it is blocked waiting for the RowExclusiveLock to be freed.

(please fill in the gaps I might have left out)

At the same time, there are other operations related to DB maintenance and Ghost Tables that may also try to acquire exclusive locks on the target table.

(BTW, I generate the table from scratch to ease "reproducibility" of the issue).

Let's imagine for a moment that we have a relatively short lock_timeout, what would happen then?

  • The ORM wouldn't be able to fulfill whatever query needs the RowExclusiveLock, and would return an error to the frontend. So it wouldn't really fix the problem at hand.
  • Some operations required for DB maintenance would give up as well. E.g. if there's just a VACUUM once a day, it may never happen (thanks @Algunenano for pointing that out). The worst part is that it is hard to know beforehand what exactly would be negatively affected by a lock_timeout.

@javitonino any thoughts? Revisiting some alternatives: how about..?

  • adding the lock_timeout just to the metadata DBs and serving a cached version of the list of tables if it times out?
  • trying to set the lock_timeout just for the delete triggered from the ORM? (I know it may not be obvious because of the layers of abstraction and pgbouncer, but there are ways, including bypassing pgbouncer. In the end, DROP is not the most common operation.)

@javitonino
Contributor

When there's an access to the dashboard, an update on the metadata row is attempted (!) (to update the updated_at?), and is blocked waiting for the RowExclusiveLock to be freed.

Uhm. I was thinking about the lock at the user DB (DROP TABLE blocks CDB_UserDataSize, which is called from the dashboard). Maybe there is something in metadata as well, but I don't think so: a RowExclusiveLock does not affect SELECTs (they will just see an older version of the table until the transaction finishes). I don't think we have a locking problem in metadata, but rather at the user DB.

This is the scenario I have in mind (which I've seen in production), without lock_timeout.

  • Session A (let's say an analysis) is running an UPDATE. I emulate this by opening a transaction, running an UPDATE and never closing the transaction. It holds a RowExclusiveLock.
  • Session B (Rails) runs DROP TABLE t. It tries to get an AccessExclusiveLock, but it cannot until A finishes.
  • Session C (dashboard) tries to run CDB_UserDataSize, which is blocked behind the pending AccessExclusiveLock.

We stay like this until A finishes.

With lock_timeout:

  • Session A (let's say an analysis) is running an UPDATE. I emulate this by opening a transaction, running an UPDATE and never closing the transaction. It holds a RowExclusiveLock.
  • Session B (Rails) runs DROP TABLE t. It tries to get an AccessExclusiveLock, but it cannot until A finishes.
  • Session C (dashboard) tries to run CDB_UserDataSize, which is blocked behind the pending AccessExclusiveLock.
  • Session B's lock_timeout expires and the DROP is aborted.
  • Session C continues and displays the dashboard.

The deletion of the table will keep failing due to the timeout until A finishes. lock_timeout has helped in the sense that only one operation (the table deletion) fails, instead of the rest of the dashboard being blocked.

So, the lock I had in mind was happening at the user DB, not in the metadata DB. I think we have different scenarios in mind.
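
A rough way to emulate those three sessions from psql, reusing the table from the reproduction above (the 1s timeout is illustrative):

-- Session A (analysis): hold a RowExclusiveLock by leaving a transaction open
BEGIN;
UPDATE batch_sql_viz_api_lock SET the_geom = cdb_latlng(lat, lng);
-- ...no COMMIT yet...

-- Session B (Rails): queues behind A for an AccessExclusiveLock, but gives up after 1s
BEGIN;
SET LOCAL lock_timeout = '1s';
DROP TABLE batch_sql_viz_api_lock;  -- ERROR: canceling statement due to lock timeout
ROLLBACK;

-- Session C (dashboard): once B has given up, this no longer queues behind the pending DROP
SELECT cartodb.CDB_UserDataSize();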


About VACUUM: it needs a SHARE UPDATE EXCLUSIVE lock, so it would only be affected by higher levels of locking (see this table: https://www.postgresql.org/docs/10/static/explicit-locking.html#TABLE-LOCK-COMPATIBILITY). This means it will only see a difference if there is a pending lock of one of these levels (typical usages in parentheses):

  • SHARE UPDATE EXCLUSIVE (vacuum, analyze, some alter table, etc)
  • SHARE (create index)
  • SHARE ROW EXCLUSIVE (some alter table)
  • EXCLUSIVE (materialized views)
  • ACCESS EXCLUSIVE (drop, truncate, vacuum full)

So, in any case, VACUUM won't get locked by INSERT/UPDATE/DELETE/SELECT.


About alternatives:

adding the lock_timeout just to metadata DB's and serving a cache version of the list of tables if it times out?

We could add a lock timeout to ghost tables (Rails) and the DB size function (extension). As discussed previously, we can also add it for dropping tables. All of these will help. My only concern is that catching all the cases can be complicated.

trying to set the lock_timeout just for the delete triggered from the ORM? (I know it may not be obvious because of the layers of abstraction and pgbouncer, but there are ways, including bypassing pgbouncer. In the end DROP shall not be the most common operation)

As I mentioned, I discovered that a transaction with SET LOCAL works great for this, so we could go for this. As previously discussed, the only problem is catching them all.


Another alternative from me: I think we can work around the bulk of this issue by setting a timeout in the CDB_UserDataSize function (either at the function level or the caller level).
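
For example, something along these lines (the exact CDB_UserDataSize signatures depend on the extension version, so treat this as a sketch):

-- Function level: the setting applies for the duration of each call and is reset afterwards
ALTER FUNCTION cartodb.CDB_UserDataSize() SET lock_timeout = '1s';
-- Caller level: only the surrounding transaction is affected, which plays nicely with pgbouncer
BEGIN;
SET LOCAL lock_timeout = '1s';
SELECT cartodb.CDB_UserDataSize();
COMMIT;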

@rafatower
Contributor

But maybe there is something in metadata as well. But I don't think so

There is indeed, check the lock traces I posted above.

About VACUUM: It needs a SHARE UPDATE EXCLUSIVE LOCK

Look at the trace I posted above. It seems to have gotten an ExclusiveLock. Or am I misinterpreting something? I'll keep digging into it...

As I mentioned, I discovered that a transaction with SET LOCAL works great for this, so we could go for this. As previously discussed, the only problem is catching them all.

Oh! there's hope! 😄

At least initially, I prefer that approach. It has clear semantics and no surprising side effects down the road.

Another alternative for me. I think we can workaround the bulk of this issue by settign a timeout in the CDB_UserDataSize function (either at function level or caller level).

Eventually we're going to have to reengineer a little, not just CDB_UserDataSize but the quota checks in general, with a view to the new dashboard and all that. Side note: the endpoint api/v1/viz/{table} is not a public one, but the fact that Andy resorts to it reveals some weak spots API-wise. cc'ing @inigomedina and @alonsogarciapablo

@javitonino
Contributor

I think there is something wrong in the way you gather pg_locks or in the way we are reading them, because it contradicts the docs. That, or the docs are completely wrong, which is also a possibility. 🤔

I think it might be the lock_type field. In autovacuum, it's holding an ExclusiveLock over a transaction id, but there is no relation-level lock. Reading this table, it looks like:

  • Autovacuum is holding an exclusive lock on a transaction. It should be also holding a lock on the relation (depending on the phase of the vacuum).
  • DROP is holding an AccessExclusiveLock on a relation, which makes sense
  • UPDATE is waiting for transactionid (a transaction to finish). But I don't see the actual transaction that could be locking it (in your listing, there are no other locks in the metadata DB).

I'm thinking we should get together and go step by step, since there is certainly something we are missing here.
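
One thing that may help when we go through it: pg_blocking_pids() (available since PostgreSQL 9.6) resolves the waiter/blocker relationship directly, instead of joining pg_locks by hand:

SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;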


Yes, SET LOCAL in DROPs should be less risky. The reason I don't like it is that it's still possible to lock your database in other ways. It would help in this case, though it's not very frequent (I've only seen it 3-4 times in a couple of years). So my worry is that we will fix this specific case, but who knows how many will remain. I mean, I still think we should do it (all of our parts should play nicely), but I would like a DB-level failsafe anyway.


About the missing endpoints, we will most likely consider it in one of our next projects, since we want users to be able to access it. We are aware that we do not have a good way to list tables. In fact, the only public official way is to rely on a query to pg_catalog. Still, it's a problem we need to clearly understand, since it interacts with ghost tables, and it might be confusing for API users to deal with such concepts that are more closely related to Builder than anything else.

@inigomedina

My opinion here is similar to what I have said in other issues related to dashboard components (ghost tables, etc.): I think we are still thinking about how to add a stone to the mountain, instead of rethinking it. The problem with anything related to what @andy-esch pointed out is that the user's mental model does not fit ours, which is pretty much designed from an engineering point of view.

Users just want to do an action (delete a table, for instance) and continue working on their things. They do not expect that while trying to list, let's say, their data, they cannot do so because there is a hidden relationship between the first action and the second one. Just as we do not expect that uploading a file to Google Drive will stop us from continuing to work on a document or searching for something.

From my perspective, that is the theme here. It is not about database internals, it is about how we have to design the entire experience assuming those goals. I believe the new dashboard is a good opportunity to rethink these things. I do not know if @jorgesancha and @alonsogarciapablo, leading that initiative, share my view though.

rafatower pushed a commit that referenced this issue Jun 29, 2018
rafatower pushed a commit that referenced this issue Jun 29, 2018
@rafatower
Contributor

Since this is a rather complex scenario, I created a script to reproduce it: https://gist.github.com/rafatower/e80b159d0fd66ccd6e7d573470c18604

rafatower pushed a commit that referenced this issue Jul 2, 2018
The DROP SEQUENCE is made only IF EXISTS, so no exception is raised
here. Just a NOTICE from the DB if it does not exist.

That was added a long time ago. It is not really needed, but nobody's
touched it, probably out of fear of breaking things.
@rafatower
Contributor

Working on a fix here: #14127

This is the result of the execution of the test after the fix:

$ pipenv run delete_lock_test -s :8080 -r :3000 http://development.localhost.lan $API_KEY
2018-07-02 18:12:42,999 - INFO - Dropping table if it exists... 
2018-07-02 18:12:43,141 - INFO - dropped
2018-07-02 18:12:43,142 - INFO - Creating table...
2018-07-02 18:12:43,211 - INFO - table created
2018-07-02 18:12:43,211 - INFO - Populating table...
2018-07-02 18:12:50,505 - INFO - table populated
2018-07-02 18:12:50,510 - INFO - Cartodbfy'ing the table
2018-07-02 18:12:56,343 - INFO - Table cartodbfy'ed
2018-07-02 18:12:56,343 - INFO - Sending UPDATE to the Batch API...
2018-07-02 18:12:56,353 - INFO - Waiting for it to start...
2018-07-02 18:12:56,353 - INFO - UPDATE running
2018-07-02 18:12:56,353 - INFO - Making sure table is in the dashboard...
2018-07-02 18:12:58,729 - INFO - Table is now in the dashboard
2018-07-02 18:12:58,729 - INFO - Sending DELETE...
Traceback (most recent call last):
  File "./delete_lock_test.py", line 163, in <module>
    main()
  File "./delete_lock_test.py", line 159, in main
    delete_canonical_viz(auth_rails)
  File "./delete_lock_test.py", line 124, in delete_canonical_viz
    resp.raise_for_status()
  File "/home/rtorre/.local/share/virtualenvs/drop-table-dashboard-locks-BWL7GOYF/local/lib/python2.7/site-packages/requests/models.py", line 939, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://development.localhost.lan:3000/api/v1/viz/batch_sql_viz_api_lock?api_key=730162d337594b567fa5e51919cac5e3061a7197

So, instead of the DROP TABLE blocking on the lock and freezing the whole dashboard, the DELETE request gives up after 1s and returns an error back to the client; the dashboard is not blocked.

As a bonus, I added the lock_timeouts to the CDB_UserDataSize() functions as well, as Javi asked. Since that is a bit out of scope, I'm ready to remove those upon request.

rafatower pushed a commit that referenced this issue Jul 4, 2018
…-update

Update cartodb PG extension to 0.23.0 #12829
@rafatower
Contributor

rafatower commented Jul 4, 2018

And fixed in production:

https://gist.github.com/rafatower/e80b159d0fd66ccd6e7d573470c18604

$ pipenv run delete_lock_test https://rtorre.carto.com $API_KEY
2018-07-04 13:04:06,902 - INFO - Dropping table if it exists... 
2018-07-04 13:04:08,019 - INFO - dropped
2018-07-04 13:04:08,020 - INFO - Creating table...
2018-07-04 13:04:08,655 - INFO - table created
2018-07-04 13:04:08,655 - INFO - Populating table...
2018-07-04 13:04:34,970 - INFO - table populated
2018-07-04 13:04:34,976 - INFO - Cartodbfy'ing the table
2018-07-04 13:05:02,010 - INFO - Table cartodbfy'ed
2018-07-04 13:05:02,010 - INFO - Sending UPDATE to the Batch API...
2018-07-04 13:05:02,160 - INFO - Waiting for it to start...
2018-07-04 13:05:02,160 - INFO - UPDATE running
2018-07-04 13:05:02,160 - INFO - Making sure table is in the dashboard...
2018-07-04 13:05:07,943 - INFO - Table is now in the dashboard
2018-07-04 13:05:07,944 - INFO - Sending DELETE...
2018-07-04 13:05:12,430 - INFO - Could not delete canonical viz, status_code = 400
2018-07-04 13:05:12,431 - INFO - Cancelling UPDATE...
2018-07-04 13:05:12,589 - INFO - UPDATE canceled.

Obviously the "Could not delete canonical viz" is because of the lock timeout, and it is logged as such in Rollbar: https://rollbar.com/carto/CartoDB/items/36087/
