roachtest: 2023/04/30 kv0/nodes=3/cpu=96 regression [optimizer_use_histograms in internal executor] #102954

Open
kvoli opened this issue May 9, 2023 · 85 comments
Labels
A-roachperf-investigation Issue opened as a result of roachperf triage branch-master Failures and bugs on the master branch. regression Regression from a release. T-sql-queries SQL Queries Team

Comments

@kvoli
Collaborator

kvoli commented May 9, 2023

kv0/nodes=3/cpu=96 had a serious (66%) regression in terms of ops/s since the Apr 30th run. The regression occurs on both GCE and AWS.

[roachperf graphs showing the ops/s drop on GCE and AWS]

Jira issue: CRDB-27754

@kvoli kvoli added the T-testeng TestEng Team label May 9, 2023
@blathers-crl

blathers-crl bot commented May 9, 2023

Hi @kvoli, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl

blathers-crl bot commented May 9, 2023

cc @cockroachdb/test-eng

@kvoli kvoli added the regression Regression from a release. label May 9, 2023
@smg260 smg260 self-assigned this May 9, 2023
@smg260 smg260 added the A-roachperf-investigation Issue opened as a result of roachperf triage label May 9, 2023
@smg260
Contributor

smg260 commented May 9, 2023

Bisection results, PR #101486 (cc: @rafiss )

2023-05-09T16:46:45Z   Bisecting regression in [kv0/enc=false/nodes=3/cpu=96] using commit range [4619767bb0b (known good),7d02e061759 (known bad)]
2023-05-09T16:46:45Z   Thresholds [good >= 43000, bad <= 27000]
2023-05-09T17:04:12Z   [1ee30a1fe7e] median ops/s: [26590]. Auto marked as bad.
2023-05-09T17:33:17Z   [2f03212bd1f] median ops/s: [24438]. Auto marked as bad.
2023-05-09T18:02:09Z   [608f03a999a] median ops/s: [23428]. Auto marked as bad.
2023-05-09T18:32:17Z   [da48b19bcc2] median ops/s: [25457]. Auto marked as bad.
2023-05-09T18:32:17Z   Bisection complete. Suspect commit:
commit da48b19bcc2f962dda531d075502682b64326bf4
Author: Rafi Shamim <[email protected]>
Date:   Thu Apr 13 15:12:45 2023 -0400

    sql: use global defaults for internal session data rather than zero-values

    Release note: None

@rafiss
Collaborator

rafiss commented May 9, 2023

Thanks for tracking that down. I was wondering if my PR would turn anything up. If you didn't get a chance to look at it, the summary is that we were previously using zero-valued defaults for all session variables used in internal operations (jobs, leases, delegated queries for some builtin functions, etc). After #101486, the default values for session variables will be used instead.

One notable change is that previously, internal operations would use distsql=off but now they will use distsql=auto. Another one is distsql_workmem, which previously was unbounded, but now uses the default limit of 64 MiB.

To track this down further, I'd be curious to see if there are any specific queries that are slower. Then we can look at their traces.
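
For anyone comparing behavior before and after the change, here is a minimal sketch of inspecting and overriding these variables in a regular SQL session (assuming the standard session variable names):

-- Inspect the current session values.
SHOW distsql;
SHOW distsql_workmem;

-- Force the old internal-executor behavior for this session, e.g. to compare plans.
SET distsql = off;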

@smg260
Contributor

smg260 commented May 22, 2023

@rafiss Thanks for the change overview. I might be of limited help here, but can you link me any documentation on obtaining query stats for this particular workload during a roachtest run?

@irfansharif
Contributor

This regression also appears for kv0/enc=false/nodes=3/cpu=32: https://roachperf.crdb.dev/?filter=&view=kv0%2Fenc%3Dfalse%2Fnodes%3D3%2Fcpu%3D32&tab=aws.

[roachperf graph for kv0/enc=false/nodes=3/cpu=32 on AWS]

@irfansharif irfansharif assigned irfansharif and unassigned smg260 May 30, 2023
@irfansharif
Contributor

kv0-traces.zip

stmt-bundle-870525437835771905/ - slow trace
stmt-bundle-870526672146366465/ - fast trace
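
For reference, a sketch of how such a bundle can be captured by hand for the workload's statement (the kv.kv schema is shown later in the thread; the placeholder values are arbitrary):

EXPLAIN ANALYZE (DEBUG) UPSERT INTO kv (k, v) VALUES (1, '\x01'::BYTES);

This produces a downloadable zip containing the plan, trace, schema, and table statistics.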

@irfansharif
Contributor

Partway through a full kv0 run, there's a step change in latency distributions. The distribution becomes bi-modal, and then throughput suffers:

[latency distribution graph showing the bi-modal step change]


@irfansharif
Contributor

Another one is distsql_workmem, which previously was unbounded, but now uses the default limit of 64 MiB.

It wasn't previously unbounded, I don't think:

if flowCtx.EvalCtx.SessionData().WorkMemLimit <= 0 {
	// If for some reason workmem limit is not set, use the default value.
	return DefaultMemoryLimit
}

@irfansharif
Contributor

irfansharif commented Jun 4, 2023

Here's another slow trace that doesn't straddle multiple nodes. But it's still the higher KV exec latencies (around proposal replication/acknowledgement) that are affecting top-line throughput.

[slow trace screenshots: elevated KV exec latency around proposal replication/acknowledgement]

Here's a fast trace that straddles two nodes (captured in the first 5 minutes of kv0/nodes=3/cpu=96, when throughput is still high and latencies are low) that also has low replication latency. Why the step change a few minutes in?

[fast trace screenshot spanning two nodes]

@irfansharif
Contributor

I also kicked off two manual runs using 1ee30a1 (which includes the merge commit from #101486) and 2c9bdbb (the merge commit from before). The bisection is indeed correct. Though I don't understand why it's caused the regression we're seeing.

@irfansharif irfansharif added branch-master Failures and bugs on the master branch. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jun 4, 2023
@irfansharif
Contributor

Two runs of kv0/nodes=3/cpu=96 from a recent master and then a master with all commits from #101486 reverted. Just to sanity-check that it really is this PR that's causing the change.

[throughput comparison: recent master vs. master with #101486 reverted]

@irfansharif
Contributor

irfansharif commented Jun 4, 2023

Of the commits in that PR, it's da48b19 in particular where the regression appears. Confirmed this by isolating the SHA (reverting everything else from master and re-running the kv0 experiments, which still showed the degradation).

@irfansharif
Contributor

irfansharif commented Jun 4, 2023

I looked through the debug zips for clues in the logs around when one of these step changes happened. I saw things like:

I230604 16:01:34.939333 5256528 sql/stats/automatic_stats.go:833 ⋮ [T1,n3] 424 automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230604 16:02:34.945126 5721242 sql/stats/automatic_stats.go:833 ⋮ [T1,n3] 425 automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230604 16:03:34.952456 6151338 sql/stats/automatic_stats.go:833 ⋮ [T1,n3] 426 automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230604 16:04:34.960182 6524605 sql/stats/automatic_stats.go:833 ⋮ [T1,n3] 427 automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230604 16:05:27.405226 96 util/admission/work_queue.go:512 ⋮ [T1,n3] 428 ‹sql-kv-response›: FIFO threshold for tenant 1 is -128
I230604 16:05:27.505068 94 util/admission/work_queue.go:512 ⋮ [T1,n3] 429 ‹kv›: FIFO threshold for tenant 1 is -128
I230604 16:05:27.605032 395 util/admission/work_queue.go:512 ⋮ [T1,n3] 430 ‹kv›: FIFO threshold for tenant 1 is -128
I230604 16:05:35.038644 6855831 sql/stats/automatic_stats.go:833 ⋮ [T1,n3] 431 automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230604 16:06:35.044587 7161519 sql/stats/automatic_stats.go:833 ⋮ [T1,n3] 432 automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230604 16:07:35.051411 7464761 sql/stats/automatic_stats.go:833 ⋮ [T1,n3] 433 automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230604 16:08:35.058832 7760751 sql/stats/automatic_stats.go:833 ⋮ [T1,n3] 434 automatically executing ‹"CREATE STATISTICS __auto__ FROM [106] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"›
I230604 16:08:35.067411 7910897 jobs/registry.go:617 ⋮ [T1,n3,intExec=‹create-stats›] 435 active version is 23.1-6

Table ID 106 corresponds to kv.kv, the table this workload is upserting into. I next tried disabling the automatic stats collection mechanism:

SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false;

It's made all the difference. Here are two runs of kv0/nodes=3/cpu=96 with and without automatic SQL stats collection.

[throughput for kv0/nodes=3/cpu=96 with vs. without automatic SQL stats collection]

I did the same at 1ee30a1 but running kv0/enc=false/nodes=3/cpu=32. Here are the runs with and without automatic SQL stats collection, respectively.

[throughput for kv0/enc=false/nodes=3/cpu=32 with vs. without automatic SQL stats collection]

(All the pink Grafana annotations correspond to jobs running, including the auto stats one.)
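
As a less invasive variant of the cluster-wide toggle above, the same experiment could be scoped to just the workload table via a storage parameter; a sketch, assuming the standard parameter name:

-- Disable automatic stats collection only for the table this workload upserts into.
ALTER TABLE kv.kv SET (sql_stats_automatic_collection_enabled = false);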

@irfansharif
Contributor

stmt-bundle-871053367241342979.zip is the statement bundle for the trace in #102954 (comment). I can look further (I'm generally interested in understanding how the PR above + automatic table statistics could cause an increase in what looks like KV replication latencies), but I'd like some pointers from @cockroachdb/sql-queries first. It's a simple UPSERT INTO kv (k, v) VALUES ($1:::INT8, $2:::BYTES) into the following schema:

CREATE TABLE public.kv (
	k INT8 NOT NULL,
	v BYTES NOT NULL,
	CONSTRAINT kv_pkey PRIMARY KEY (k ASC)
);
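
A sketch of inspecting the plan for that statement directly (placeholder values are arbitrary); given the primary-key-only schema, anything other than a trivial single-table upsert plan would itself be a clue:

EXPLAIN UPSERT INTO kv (k, v) VALUES (1, '\x01'::BYTES);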

@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Jun 4, 2023
@michae2
Collaborator

michae2 commented Sep 19, 2023

I see, interesting. Thank you for digging into this so deeply. I can see why this has proved so elusive.

@erikgrinaker
Contributor

doubling the COCKROACH_SCHEDULER_SHARD_SIZE to 32

I wonder if this is because it increases Raft scheduler mutex contention, which alleviates pressure on some pathologically contended resource elsewhere in the system. Did Raft scheduler latency increase or decrease with this change?

@srosenberg
Member

This is quite a puzzle! Admission control kicks in at 14:45 (see KV Admission Slots); shortly thereafter, concurrency (as measured by runnable goroutines) drops.

Any insights from the goroutine profiles? Also, was the workload running on n4? I couldn't find any system metrics (node_exporter) for it or the cluster nodes. We should enable node_exporter if only to rule out anything lower in the stack.

@cucaroach
Contributor

I wonder if this is because it increases Raft scheduler mutex contention, which alleviates pressure on some pathologically contended resource elsewhere in the system. Did Raft scheduler latency increase or decrease with this change?

Raft scheduler latency is high when performance is degraded and low when it isn't. With histograms disabled, P99 is around 25ms; when they are enabled, it's >300ms. COCKROACH_SCHEDULER_SHARD_SIZE doesn't seem to change that.

Here are the time bounds for the 32-shard experiment I ran this morning:

http://treilly-32-0001.roachprod.crdb.io:26258/#/metrics/overview/cluster?start=1695132979&end=1695134779

@DrewKimball
Collaborator

@cucaroach If we want to make sure that this regression is due to plan changes and not something more exotic like mutex contention, maybe we could try running a build that reads/builds the histogram, but otherwise ignores it when calculating stats.

@mgartner
Collaborator

@cucaroach If we want to make sure that this regression is due to plan changes and not something more exotic like mutex contention, maybe we could try running a build that reads/builds the histogram, but otherwise ignores it when calculating stats.

My understanding is that this is exactly what optimizer_use_histograms is controlling. When it is off, we are still collecting histograms for system tables, but we're not considering those histograms when calculating memo statistics.

Looking at where that setting is queried, though, it's not immediately clear whether this would prove or disprove query plan changes vs. some exotic mutex contention as the culprit here.
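
A minimal sketch of toggling that variable by hand for comparison, assuming the standard session variable name:

SHOW optimizer_use_histograms;      -- defaults to on for regular sessions
SET optimizer_use_histograms = off; -- histograms are still collected, just not used for memo statistics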

@cucaroach
Contributor

I don't think mutex contention is the issue, but I could be wrong. I see roughly the same amount of mutex contention in the good and bad cases, but it is squarely centered around RWMutexes, with a lot related to the ReplicaMutex. Here's the good case:

[mutex profile, good case]

Here's the bad case:

[mutex profile, bad case]

I feel like with 32 CPUs and 64 client sessions, 3-4s of contention isn't terrible; these are 3s mutex profile deltas. An experiment with DRWMutex is certainly warranted, but I'm focused on examining plans right now; the changing plans must change the KV request demographics somehow. As an experiment to gauge whether this amount of contention was problematic, I completely commented out the statsCollector Start/EndTransaction calls and it made no difference.

@mgartner
Collaborator

One thing that's suspicious in the trace from the slow statement bundle here is that there's a significant amount of time spent between steps that usually take <1ms, from what I've seen in typical statement bundles (maybe I'm just not very worldly w.r.t. traces?):

[trace screenshot showing the ~30ms gap]

Is the ~30ms spent here a clue?

There's still a big unexplained gap in the trace—maybe it makes sense to insert additional tracing logs and try to pin down where the majority of that time is spent?

@srosenberg
Member

Is the ~30ms spent here a clue?

I'd say the first ~7ms is suspect. It suggests the request is being held in some queue(s).

@DrewKimball
Collaborator

I'd say the first ~7ms is suspect. It suggests the request is being held in some queue(s).

Admission control is the obvious suspect there. It happens right after the first line.

@cucaroach
Contributor

I tried turning off admission control because I suspected it was letting in too much work, but when we're in degraded mode, disabling admission control seems to make things worse.

One weird thing about this is that if you turn histogram disabling for internal queries back on, performance doesn't return to normal; we're stuck in degraded mode. But I discovered that if I call reset_sql_stats and reset_activity_tables, the throughput collapse goes away. So there's some bad interaction between how those tables work and histograms. TRUNCATE on the statement_statistics and transaction_statistics tables does nothing, so it must be the in-memory cache for these things that is the issue (the reset functions also reset the in-memory state).
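
For reference, a sketch of the reset calls referred to above (builtin names as used in this thread); per the observation above, these clear the in-memory state as well as the persisted rows:

SELECT crdb_internal.reset_sql_stats();       -- statement/transaction statistics
SELECT crdb_internal.reset_activity_tables(); -- statement/transaction activity tables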

So the next thing I tried was removing this lock: https://github.com/cockroachdb/cockroach/blob/release-23.1/pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go#L359

Not positive that's completely sound, but it seems to remove over half the throughput collapse; we bottom out around 30k instead of 20k. Progress!

[throughput graph: bottoming out around 30k ops/s with the lock removed]

Making all those heap allocations while holding a lock is silly, but I need to draw a line between the histogram setting change and the cache performance; I suspect the cardinality of the maps goes way up. Interestingly, the Raft replication P99 is cut in half with this change; I still have no idea how that's connected to these front-end happenings.

@cucaroach
Contributor

Not positive that's completely sound, but it seems to remove over half the throughput collapse; we bottom out around 30k instead of 20k. Progress!

This result didn't replicate: across 3 runs, only the first looked like this; the last two fell into the <20k QPS hole.

@srosenberg
Member

It's probably been considered already... what about just logging EXPLAIN (OPT) for every issued query in both runs? We'd expect no material difference, but what if there is?
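
A sketch of what that looks like for a single statement, using the workload's UPSERT as an example (placeholder values are arbitrary):

EXPLAIN (OPT, VERBOSE) UPSERT INTO kv (k, v) VALUES (1, '\x01'::BYTES);

Comparing this output between the good and bad runs should make any plan difference obvious.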

@cucaroach
Contributor

By saving off the statement_statistics between runs and doing some dogfooding I was able to narrow it down:

https://docs.google.com/spreadsheets/d/16M8MeaLlWr1UWyGlkn9IXa5fUC8g9Bfy9QVfsCWZ5TI

By disabling optimizer histogram usage for just the first 10 queries, I was able to get the good performance. Now I need to analyze what goes wrong with these queries/plans and narrow it down further.
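
For anyone repeating that comparison, a hedged sketch of pulling per-statement execution counts and latencies out of the persisted stats table (exact JSON paths may vary by version):

SELECT
    metadata->>'query'                                    AS query,
    (statistics->'statistics'->>'cnt')::INT               AS exec_count,
    (statistics->'statistics'->'runLat'->>'mean')::FLOAT  AS mean_run_lat_s
FROM system.statement_statistics
ORDER BY exec_count DESC
LIMIT 10;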

@yuzefovich yuzefovich moved this from Active to Triage in SQL Queries Oct 12, 2023
@michae2 michae2 moved this from Triage to 24.1 Release in SQL Queries Oct 17, 2023
@michae2
Collaborator

michae2 commented Oct 17, 2023

[triage] Our plan is to leave histograms disabled for internal queries in 23.2 and pick this back up during 24.1 development.

@mgartner mgartner removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Oct 25, 2023
@mgartner mgartner moved this from 24.1 Release to 24.2 Release in SQL Queries Nov 28, 2023
@michae2
Collaborator

michae2 commented Jan 25, 2024

[queries collab session] Based on Tommy's analysis, one idea is to use a table-level setting to disable histograms for system.statement_statistics, system.transaction_statistics, and system.statement_activity.
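
A sketch of what that could look like; the storage parameter name here is illustrative/hypothetical (and altering system tables may need special handling), so treat it as a sketch of the idea rather than a concrete proposal:

-- Hypothetical parameter name: keep histograms for user tables, drop them for the
-- SQL stats system tables implicated in this regression.
ALTER TABLE system.statement_statistics SET (sql_stats_histogram_collection_enabled = false);
ALTER TABLE system.transaction_statistics SET (sql_stats_histogram_collection_enabled = false);
ALTER TABLE system.statement_activity SET (sql_stats_histogram_collection_enabled = false);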

craig bot pushed a commit that referenced this issue Jan 30, 2024
118289: sql/tests: remove stale skipped error r=yuzefovich a=yuzefovich

We now have key-encoding of JSONs.

Epic: None

Release note: None

118330: ttl: use better defaults for some session variables r=yuzefovich a=yuzefovich

This commit fixes some problems that we've seen around internally-executed queries issued by the TTL jobs, present before 23.2. In particular, on 23.1 and prior releases we used Go defaults for the session data parameters for internally-executed queries, which can differ from the defaults set by CRDB. As a result, query plans could be suboptimal as we've seen in a couple of recent escalations. This bug has been fixed on 23.2 proper (we now use CRDB defaults for all session variables except for one), and on 23.1 and prior this commit applies a targeted fix only to the TTL queries. In particular, the following overrides are now set:
- `reorder_joins_limit` to the default value of 8
- `optimizer_use_histograms` and `optimizer_use_multi_col_stats` are both set to `true`.

On 23.2 and later only `optimizer_use_histograms` needs to be updated since it's the only exception mentioned above. However, I chose to keep all 3 in the same commit so that it can be backported more easily to 23.1, and the following commit will revert the other 2 on 23.2 and later.

Touches: #102954.
Fixes: #118129.

Release note (bug fix): Internal queries issued by the TTL jobs should now use optimal plans. The bug has been present since at least version 22.2.

118479: sql: enable read committed cluster setting by default r=rafiss a=rafiss

Epic CRDB-34172

Release note (sql change): The sql.txn.read_committed_isolation.enabled cluster setting is now true by default. This means that any syntax and settings that configure the READ COMMITTED isolation level will now cause the transaction to use that isolation level, rather than automatically upgrading the transaction to SERIALIZABLE.

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
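
For reference, the three overrides applied by the TTL fix in #118330 above, expressed as plain session settings (a sketch using the standard session variable names):

SET reorder_joins_limit = 8;
SET optimizer_use_histograms = on;
SET optimizer_use_multi_col_stats = on;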
@mgartner mgartner moved this from 24.2 Release to 24.3 Release in SQL Queries May 16, 2024
@mgartner
Collaborator

mgartner commented Aug 1, 2024

The most likely explanation is that this makes some internal query plans worse, but we were unable to confirm that last time we looked at it.

This is currently preventing histograms from being used by internal executor queries. However, there's not a strong motivation to enable them—we're not aware of any query plans that would improve with the usage of histograms—so we're going to backlog this for now.

@mgartner mgartner moved this from 24.3 Release to Backlog in SQL Queries Aug 1, 2024