kv: bound the size of in-flight write tracking in txnPipeliner #32522

Closed · manigandham opened this issue Nov 21, 2018 · 10 comments
Labels: A-sql-mutations (Mutation statements: UPDATE/INSERT/UPSERT/DELETE.) · C-investigation (Further steps needed to qualify. C-label will change.) · O-community (Originated from the community.) · S-2-temp-unavailability (Temp crashes or other availability problems. Can be worked around or resolved by restarting.)

Comments

@manigandham

manigandham commented Nov 21, 2018

We have a table with 26M rows and 20 columns. Issuing a delete for roughly half the rows in the table crashes the node:

DELETE FROM accounts WHERE datasetversion = 4;

Limiting the deletes works; I tried up to 500k without issue.

DELETE FROM accounts WHERE datasetversion = 4 LIMIT 500000;
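
In the meantime, the batching can be driven from the client until no matching rows remain. A minimal sketch in Go, assuming the lib/pq driver; the connection string and the 10,000-row batch size are illustrative placeholders, not from this report:

```
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
	// Placeholder connection string; adjust for your cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	for {
		// Each statement runs as its own implicit transaction, so the
		// amount of in-flight write tracking stays bounded per batch.
		res, err := db.Exec(`DELETE FROM accounts WHERE datasetversion = 4 LIMIT 10000`)
		if err != nil {
			log.Fatal(err)
		}
		n, err := res.RowsAffected()
		if err != nil {
			log.Fatal(err)
		}
		if n == 0 {
			break // no matching rows left
		}
	}
}
```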

3 nodes × n1-standard-2 (2 vCPU / 7.5 GB) with local NVMe SSD on GCE, 25% cache/memory settings, 3x replication factor; this is the only table in the database.

Tried it again with a single node with 4 vCPU / 15 GB; same result.

[screenshot attached]

To Reproduce

Start a single- or multi-node cluster. Run an import of a 6 GB CSV file containing 26M rows. Run a DELETE filtering on a single column with 2 possible values, which should remove about half of the data. The node connected to the client will slowly run up memory usage until it crashes.

Expected behavior
Delete around 12M out of 26M rows in the table without crashing.

Environment:

  • CockroachDB version 2.1.1
  • Server OS: Ubuntu 18.04 LTS on GCE
  • Client app: psql client on a separate VM

Debug output from 3 node cluster:
debug.zip

@tbg
Member

tbg commented Nov 21, 2018

Thanks @manigandham. As the node crashes, does it leave anything in logs/heap_profiler? We'd be interested in seeing these files if you have them. (They don't contain sensitive information).

@manigandham
Author

Are you talking about the folder on the node? I already deleted the cluster but can run the process again...

@manigandham
Author

It looks like the debug output contains that? Can you check the attached zip file and see if it's valid?

@tbg
Member

tbg commented Nov 21, 2018

Oh, I had missed that attached file. Apologies! Taking a look.

@tbg
Member

tbg commented Nov 21, 2018

@nvanbenschoten this looks like your wheelhouse. We don't unconditionally support large transactions, but the profile indicates something we need to fix regarding the txn pipeliner.

[heap profile screenshot]

@tbg tbg assigned nvanbenschoten and unassigned petermattis Nov 21, 2018
@knz knz added A-sql-mutations Mutation statements: UPDATE/INSERT/UPSERT/DELETE. C-investigation Further steps needed to qualify. C-label will change. O-community Originated from the community S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. labels Nov 21, 2018
@jseldess
Contributor

@tbg, @nvanbenschoten, back in v1.0, we had a known limitation about truncating a large table. But that was resolved, and we no longer have a stated limitation about large transactions. At the very end of a current limitation about "Write and update limits for a single statement", we say:

> Note that even though not directly restricted any more, large transactions can have performance implications on the cluster.

Given that large transactions can go beyond just impacting performance (and crash nodes, as in this case), should we create a separate known limitation about this?

This might also be another sign that we need to revive the docs project to identify and publish "production limits" (nodes, dbs, tables, rows, transactions, etc.), even if they are just recommendations based on our current understanding.

@tbg
Member

tbg commented Nov 21, 2018

Large transactions are still a bad idea, though we weren't aware of them causing an outright memory blowup as seen here. My assumption right now is that this is a regression.

@manigandham
Author

Another update: reading the table with a SELECT COUNT(*) with a filter on that column after deleting millions of rows in smaller batches also crashes the node. From the documentation, is that due to scanning through tombstones?

Our expectation is that updates and queries will always complete, even if slow. Is this inaccurate? Will there be hard limits to data updates? And if so, how would we approach such large transactions?

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 26, 2018
Informs cockroachdb#32522.

There is a tradeoff here between the overhead of waiting for consensus
for a batch if we don't pipeline and proving that all of the writes in
the batch succeed if we do pipeline. We set this default to a value which
experimentally strikes a balance between the two costs.

To determine the best value for this setting, I ran a three-node single-AZ AWS
cluster with 4 vCPU nodes (`m5d.xlarge`). I modified KV to perform writes in an
explicit txn and to run multiple statements. I then ran `kv0` with 8 DML
statements per txn (a reasonable estimate for the average number of statements
that an **explicit** txn runs) and adjusted the batch size of these statements
from 1 to 256. This resulted in the following graph:

<see graph in PR>

We can see that the cross-over point where txn pipelining stops being beneficial
is with batch sizes somewhere between 128 and 256 rows. Given this information,
I set the default for `kv.transaction.write_pipelining_max_batch_size` to 128.

Of course, there are a lot of variables at play here: storage throughput,
replication latency, node size, etc. I think the setup I used hits a reasonable
middle ground with these.

Release note: None
@nvanbenschoten
Member

The memory blowup in the txn pipeliner is due to two issues. The first is that `kv.transaction.write_pipelining_max_batch_size` currently defaults to 0, which means that a key-value batch will be pipelined regardless of the number of operations it performs. This doesn't play well with `sql.max{Insert/Update/Upsert/Delete}BatchSize`, which breaks large implicit SQL txns into non-1PC kv txns. The intention was for these to never be pipelined, but they currently will be. This should be fixed by #32606.
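
In sketch form, the gate that the 0 default effectively disabled might look like this (hypothetical and simplified, not the actual txnPipeliner code):

```
// Hypothetical, simplified sketch of the batch-size gate. A zero
// maxBatchSize is treated as "no limit", so arbitrarily large batches
// still qualify for pipelining.
func shouldPipeline(writesInBatch int, maxBatchSize int64) bool {
	if maxBatchSize == 0 {
		return true // old default: pipeline unconditionally
	}
	return int64(writesInBatch) <= maxBatchSize
}
```

Until the new default ships, operators can cap it themselves, e.g. `SET CLUSTER SETTING kv.transaction.write_pipelining_max_batch_size = 128;`.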

The other issue is more general - we need to address this TODO:

```
// TODO(nvanbenschoten): Consider placing an upper bound on the size of the
// outstandingWrites tree. Once this limit is hit, we'll either need to
// proactively prove outstanding writes or stop pipelining new writes.
```

I suspect that the first issue is the cause of almost all cases where this memory blows up, but we should have a hard memory bound here regardless.
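
A hard bound could look roughly like the following sketch (hypothetical names; the real tracker is the `outstandingWrites` tree mentioned in the TODO rather than a map, but the budget logic is the same idea):

```
// Hypothetical sketch of a byte-bounded in-flight write set.
type outstandingWrites struct {
	keys     map[string]struct{} // in-flight, unproven pipelined writes
	bytes    int64               // cumulative size of tracked keys
	maxBytes int64               // budget, e.g. from a cluster setting
}

func newOutstandingWrites(maxBytes int64) *outstandingWrites {
	return &outstandingWrites{keys: make(map[string]struct{}), maxBytes: maxBytes}
}

// add records a pipelined write, or reports that the budget is exhausted
// and the caller should fall back to waiting for consensus synchronously.
func (ow *outstandingWrites) add(key []byte) (pipelined bool) {
	if ow.bytes+int64(len(key)) > ow.maxBytes {
		return false // stop pipelining until some writes are proven
	}
	ow.keys[string(key)] = struct{}{}
	ow.bytes += int64(len(key))
	return true
}

// prove removes a write whose success has been verified, freeing budget.
func (ow *outstandingWrites) prove(key []byte) {
	if _, ok := ow.keys[string(key)]; ok {
		delete(ow.keys, string(key))
		ow.bytes -= int64(len(key))
	}
}
```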

> Another update: reading the table with a SELECT COUNT(*) with a filter on that column after deleting millions of rows in smaller batches also crashes the node. From the documentation, is that due to scanning through tombstones?

This sounds like a different issue. Is it reproducible for you?

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 26, 2018
craig bot pushed a commit that referenced this issue Nov 26, 2018
32606: kv: set sane default for kv.transaction.write_pipelining_max_batch_size r=nvanbenschoten a=nvanbenschoten

Informs #32522. (Same commit message as above; the graph from the experiment:)

![image](https://user-images.githubusercontent.com/5438456/49038443-fc91e200-f18a-11e8-810d-1172821e63ea.png)

Co-authored-by: Nathan VanBenschoten <[email protected]>
@manigandham
Author

manigandham commented Nov 26, 2018

@nvanbenschoten

Somewhat reproducible. After crashing from the delete query, I restart the node, then run the SELECT COUNT(*), which either crashes the node again or takes an extremely long time. A single bigger node seems to finish the query, whereas the smaller clustered nodes don't.

Here's a screenshot of a 17-minute query time. Running the exact same query again takes 44 seconds.

[screenshot]

@nvanbenschoten nvanbenschoten changed the title Node crashes on large table delete kv: node crashes on large table delete Nov 27, 2018
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 27, 2018
Informs cockroachdb#32522. (Same commit message as above, now with a release note.)

Release note (performance improvement): Change the default value for the
cluster setting kv.transaction.write_pipelining_max_batch_size to 128.
This speeds up bulk write operations.
@nvanbenschoten nvanbenschoten changed the title kv: node crashes on large table delete kv: bound the size of in-flight write tracking in txnPipeliner Feb 15, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 15, 2019
Fixes cockroachdb#32522.

This change creates a new cluster setting called:
```
kv.transaction.write_pipelining_max_outstanding_size
```

It limits the size in bytes that can be dedicated to tracking in-flight
writes that have been pipelined. Once this limit is hit, no more writes
will be pipelined by a transaction until some of the writes are proven
and removed from the outstanding write set.

This change once again illustrates the need for periodic proving of
outstanding writes. We touch upon that in the type definition's comment
and in cockroachdb#35009.

Release note: None
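
Assuming the setting lands as described (a byte-size cluster setting), tuning it would look like any other setting change; an illustrative snippet with an arbitrary value, not the shipped default:

```
// Illustrative only: adjust the in-flight write budget from a SQL
// session. The value is an arbitrary example, not the shipped default.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := db.Exec(
		`SET CLUSTER SETTING kv.transaction.write_pipelining_max_outstanding_size = '256 KiB'`,
	); err != nil {
		log.Fatal(err)
	}
}
```
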
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 26, 2019
craig bot pushed a commit that referenced this issue Feb 26, 2019
35014: kv: introduce new "max outstanding size" setting to txnPipeliner r=nvanbenschoten a=nvanbenschoten

Fixes #32522. (Same commit message as above.)

35199: log: fix remaining misuse of runtime.Callers/runtime.FuncForPC r=nvanbenschoten a=nvanbenschoten

Fixes #17770.

This commit fixes the last user of `runtime.Callers` that misused
the stdlib function by translating the PC values it returned directly
into symbolic information (see https://golang.org/pkg/runtime/#Callers) [1].
Go's documentation warns that this is a recipe for disaster when mixed
with mid-stack inlining.

The other concern in #17770 was this comment: #17770 (comment).
This was discussed in golang/go#29582 and addressed in golang/go@956879d.

An alternative would be to use `runtime.Caller` here, but that would
force an allocation that #29017 worked hard to eliminate. Instead, this
commit avoids any performance hit.

```
name                 old time/op    new time/op    delta
Header-4                267ns ± 1%     268ns ± 0%    ~     (p=0.584 n=10+10)
VDepthWithVModule-4     260ns ± 3%     255ns ± 1%  -1.87%  (p=0.018 n=10+9)

name                 old alloc/op   new alloc/op   delta
Header-4                0.00B          0.00B         ~     (all equal)
VDepthWithVModule-4     0.00B          0.00B         ~     (all equal)

name                 old allocs/op  new allocs/op  delta
Header-4                 0.00           0.00         ~     (all equal)
VDepthWithVModule-4      0.00           0.00         ~     (all equal)
```

[1] I went through and verified that this was still correct.
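
For reference, the inlining-safe pattern from the `runtime` docs resolves PCs through `runtime.CallersFrames` rather than calling `runtime.FuncForPC` on each raw PC; a minimal sketch:

```
package main

import (
	"fmt"
	"runtime"
)

// caller reports the file and line of the function that invoked it,
// resolving PCs via CallersFrames so mid-stack inlining is handled.
func caller() (file string, line int) {
	var pcs [1]uintptr
	// Skip runtime.Callers itself and this function.
	n := runtime.Callers(2, pcs[:])
	frames := runtime.CallersFrames(pcs[:n])
	frame, _ := frames.Next()
	return frame.File, frame.Line
}

func main() {
	file, line := caller()
	fmt.Printf("called from %s:%d\n", file, line)
}
```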

Release note: None

35203: closedts: react faster to config changes r=danhhz a=tbg

This approximately halves the duration of

`make test PKG=./pkg/ccl/changefeedccl TESTS=/sinkless`,

from ~60s to ~30s.

Touches #34455.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
@craig craig bot closed this as completed in #35014 Feb 26, 2019