roachtest: 2023/04/30 kv0/nodes=3/cpu=96 regression [optimizer_use_histograms in internal executor] #102954
cc @cockroachdb/test-eng
Bisection results, PR #101486 (cc: @rafiss)
Thanks for tracking that down. I was wondering if my PR would turn anything up. If you didn't get a chance to look at it, the summary is that we previously used zero-valued defaults for all session variables in internal operations (jobs, leases, delegated queries for some builtin functions, etc.). After #101486, the configured default values for those session variables are used instead. One notable change is that previously, internal operations would use

To track this down further, I'd be curious to see if there are any specific queries that are slower. Then we can look at their traces.
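To make the change concrete, here is a sketch of the two regimes for one affected variable (the variable and its CRDB default of 8 are taken from the commit message later in this thread; using it as a session-level illustration is my own framing):

```sql
-- Old behavior: internal operations effectively ran with zero-valued session
-- variables, e.g. join reordering disabled.
SET reorder_joins_limit = 0;

-- After #101486: internal operations pick up the regular CRDB defaults instead.
SET reorder_joins_limit = 8;   -- CRDB default
SHOW reorder_joins_limit;      -- verify the session value
```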
@rafiss Thanks for the change overview. I might be of limited help here, but can you link me to any documentation on obtaining query stats for this particular workload during a roachtest run?
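I'm not sure which doc covers this for roachtest specifically, but one way to pull per-statement stats from the cluster during or after the workload is the `crdb_internal.statement_statistics` virtual table. A sketch (the JSON field names `query`, `svcLat`, and `cnt` are from memory and may differ by version):

```sql
-- Slowest statement fingerprints by mean service latency.
SELECT
  metadata->>'query'                          AS query,
  statistics->'statistics'->'svcLat'->>'mean' AS mean_svc_lat_s,
  statistics->'statistics'->>'cnt'            AS exec_count
FROM crdb_internal.statement_statistics
ORDER BY (statistics->'statistics'->'svcLat'->>'mean')::FLOAT DESC
LIMIT 10;
```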
This regression also appears for
stmt-bundle-870525437835771905/ - slow trace |
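For anyone reproducing this: a bundle like the one above can also be collected manually with `EXPLAIN ANALYZE (DEBUG)`. The statement below is a stand-in for the workload's query, not the exact one from the bundle:

```sql
-- Produces a downloadable bundle (trace, plan, schema, table stats) for one execution.
EXPLAIN ANALYZE (DEBUG) UPSERT INTO kv (k, v) VALUES (1, b'payload');
```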
It wasn't previously unbounded, I don't think: see pkg/sql/execinfra/server_config.go, lines 340 to 343 at 84ed8ac.
Two runs of
Of the commits in that PR, it's da48b19 in particular where the regression appears. Confirmed this by isolating that SHA (reverted everything else from master and re-ran the kv0 experiments while still observing the degradation).
I looked through debug zips, looking for clues in logs around when one of these step changes happened. I saw things like:
The 106 table ID corresponds to
It's made all the difference. Here are two runs of
I did the same at 1ee30a1 but running
(All the pink Grafana annotations correspond to jobs running, including the auto stats one.)
stmt-bundle-871053367241342979.zip is the statement bundle for the trace in #102954 (comment). I can look further (I'm generally interested in understanding how the PR above + automatic table statistics could cause an increase in what looks like KV replication latencies), but I'd like some pointers from @cockroachdb/sql-queries first. It's a simple
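For context, the kv0 workload's schema and write path look roughly like this (reconstructed from the workload's description; exact column options and batching may differ):

```sql
-- Approximate schema used by the kv workload.
CREATE TABLE kv (
    k BIGINT NOT NULL PRIMARY KEY,
    v BYTES NOT NULL
);

-- kv0 is the 0%-reads variant, issuing upserts of this shape.
UPSERT INTO kv (k, v) VALUES ($1, $2);
```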
I see, interesting. Thank you for digging into this so deeply. I can see why this has proved so elusive.
I wonder if this is because it increases Raft scheduler mutex contention, which alleviates pressure on some pathologically contended resource elsewhere in the system. Did Raft scheduler latency increase or decrease with this change?
This is quite a puzzle! Admission control kicks in at 14:45 (see KV Admission Slots); shortly thereafter, concurrency (as measured by runnable goroutines) drops. Any insights from the goroutine profiles? Also, was the workload running on
Raft scheduler latency is high when performance is degraded and low when it isn't. With histograms disabled, the P99 is around 25ms; when they are enabled, it's >300ms. COCKROACH_SCHEDULER_SHARD_SIZE doesn't seem to change that. Here are the time bounds for the 32-shard experiment I did this morning:
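For anyone checking this without Grafana, the underlying metric should also be visible from SQL. A sketch (the metric name prefix `raft.scheduler.latency` and the exposed quantile suffixes are from memory):

```sql
-- Per-store raft scheduler latency metrics on the node being queried.
SELECT store_id, name, value
FROM crdb_internal.node_metrics
WHERE name LIKE 'raft.scheduler.latency%'
ORDER BY store_id, name;
```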
@cucaroach If we want to make sure that this regression is due to plan changes and not something more exotic like mutex contention, maybe we could try running a build that reads/builds the histogram but otherwise ignores it when calculating stats.
My understanding is that this is exactly what
It's not immediately clear, looking at where that setting is queried, whether it would prove or disprove query plan changes versus some exotic mutex contention as the culprit here, though.
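If it helps, my understanding of the existing toggle is that histograms are still built and stored either way; the session variable only controls whether the optimizer consults them. A sketch:

```sql
-- Histograms continue to be collected for the table regardless of the toggle.
SHOW STATISTICS FOR TABLE system.statement_statistics;

-- Ignore histograms during planning, for this session only.
SET optimizer_use_histograms = off;
```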
I don't think mutex contention is the issue, but I could be wrong. I see roughly the same amount of mutex contention in the good and bad cases, and it is squarely centered around RWMutexes, with a lot of it related to the ReplicaMutex.

Here's the good case:

Here's the bad case:

I feel like with 32 CPUs and 64 client sessions, 3-4s of contention isn't terrible; these are 3s mutex profile deltas. An experiment with DRWMutex is certainly warranted, but I'm focused on examining plans right now; the changing plans must change the KV request demographics somehow. As an experiment to gauge whether this amount of contention was problematic, I completely commented out the statsCollector Start/EndTransaction calls and it made no difference.
One thing that's suspicious in the trace from the slow statement bundle here is that there's a significant amount of time spent between steps that usually take <1ms, from what I've seen in common statement bundles (maybe I'm just not very worldly w.r.t. traces?): Is the ~30ms spent here a clue? There's still a big unexplained gap in the trace; maybe it makes sense to insert additional tracing logs and try to pin down where the majority of that time is spent?
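One low-tech way to narrow down such gaps on an ad-hoc run (outside the statement bundle itself) is session tracing; a sketch, with the UPSERT standing in for the slow statement:

```sql
SET tracing = on;
UPSERT INTO kv (k, v) VALUES (1, b'payload');  -- hypothetical stand-in for the slow statement
SET tracing = off;

-- Inspect per-span messages and their ages to see where the time went.
SELECT age, operation, message
FROM [SHOW TRACE FOR SESSION]
ORDER BY age;
```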
I'd say the first ~7ms is suspect. It suggests the request is being held in some queue(s).
Admission control is the obvious suspect there. It happens right after the first line.
I tried turning off admission control because I suspected it was letting in too much work; in degraded mode, disabling admission control actually seems to make things worse.

One weird thing about this is that if you turn the histogram disabling for internal queries back on, performance doesn't return to normal; we're stuck in degraded mode. But I discovered that if I call reset_sql_stats and reset_activity_tables, the throughput collapse goes away. So there's some bad interaction between how those tables work and histograms. TRUNCATE on the statement_statistics and transaction_statistics tables does nothing, so it must be that the in-memory cache for these things is the issue (the reset functions also reset the in-memory state).

So the next thing I tried was removing this lock: https://github.com/cockroachdb/cockroach/blob/release-23.1/pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go#L359

Not positive that's completely sound, but it seems to remove over half the throughput collapse; we bottom out around 30K instead of 20K. Progress! Making all those heap allocations while holding a lock is silly, but I need to draw a line between the histogram setting change and the cache performance; I suspect the cardinality of the maps goes way up. Interestingly, the Raft replication P99 is cut in half with this change. I still have no idea how that's connected to these frontend happenings.
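For reference, a sketch of the knobs mentioned above, with setting and builtin names as I recall them (in particular, `crdb_internal.reset_activity_tables` may not exist on all versions):

```sql
-- Disable KV admission control.
SET CLUSTER SETTING admission.kv.enabled = false;

-- Clear persisted and in-memory SQL stats.
SELECT crdb_internal.reset_sql_stats();

-- Clear the statement/transaction activity tables.
SELECT crdb_internal.reset_activity_tables();
```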
This result didn't replicate after 3 runs; only the first looked like this, and the last two fell into the <20k QPS hole.
It's probably been considered already... what about just logging
By saving off the statement_statistics between runs and doing some dogfooding, I was able to narrow it down: https://docs.google.com/spreadsheets/d/16M8MeaLlWr1UWyGlkn9IXa5fUC8g9Bfy9QVfsCWZ5TI
By disabling the optimizer histogram usage for just the first 10 queries, I was able to get the good performance. Now I need to analyze what goes wrong with these queries/plans and narrow it down further.
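A sketch of the "saving off" step described above (the snapshot table name is mine; column names assumed from the system table's schema):

```sql
-- Snapshot the persisted statement stats into a scratch table after each run,
-- so fingerprints and plans can be diffed between a good run and a bad one.
CREATE TABLE defaultdb.stmt_stats_run_1 AS
  SELECT aggregated_ts, fingerprint_id, app_name, metadata, statistics
  FROM system.statement_statistics;
```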
[triage] Our plan is to leave histograms disabled for internal queries in 23.2 and pick this back up during 24.1 development.
[queries collab session] Based on Tommy's analysis, one idea is to use a table-level setting to disable histograms for system.statement_statistics, system.transaction_statistics, and system.statement_activity.
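For context, the table-level mechanism that already exists for statistics is the storage-parameter form below; the histogram-specific parameter is the proposal here, shown commented out with an illustrative name only (and whether system tables accept these parameters is a separate question):

```sql
-- Existing storage parameter: turn off automatic stats collection for one table.
ALTER TABLE system.statement_statistics
  SET (sql_stats_automatic_collection_enabled = false);

-- Proposed/hypothetical analogue for histograms (name illustrative only).
-- ALTER TABLE system.statement_statistics
--   SET (sql_stats_histogram_collection_enabled = false);
```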
118289: sql/tests: remove stale skipped error r=yuzefovich a=yuzefovich

We now have key-encoding of JSONs.

Epic: None
Release note: None

118330: ttl: use better defaults for some session variables r=yuzefovich a=yuzefovich

This commit fixes some problems that we've seen around internally-executed queries issued by the TTL jobs, present before 23.2. In particular, on 23.1 and prior releases we used Go defaults for the session data parameters for internally-executed queries, which can differ from the defaults set by CRDB. As a result, query plans could be suboptimal, as we've seen in a couple of recent escalations. This bug has been fixed on 23.2 proper (we now use CRDB defaults for all session variables except for one), and on 23.1 and prior this commit applies a targeted fix only to the TTL queries. In particular, the following overrides are now set:
- `reorder_joins_limit` to the default value of 8
- `optimizer_use_histograms` and `optimizer_use_multi_col_stats` are both set to `true`.

On 23.2 and later only `optimizer_use_histograms` needs to be updated, since it's the only exception mentioned above. However, I chose to keep all 3 in the same commit so that it can be backported more easily to 23.1, and the following commit will revert the other 2 on 23.2 and later.

Touches: #102954.
Fixes: #118129.

Release note (bug fix): Internal queries issued by the TTL jobs should now use optimal plans. The bug has been present since at least the 22.2 version.

118479: sql: enable read committed cluster setting by default r=rafiss a=rafiss

Epic: CRDB-34172

Release note (sql change): The sql.txn.read_committed_isolation.enabled cluster setting is now true by default. This means that any syntax and settings that configure the READ COMMITTED isolation level will now cause the transaction to use that isolation level, rather than automatically upgrading the transaction to SERIALIZABLE.

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
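For reference, the per-session equivalents of the TTL overrides listed in that commit message (a sketch; the TTL job applies these internally rather than via SQL):

```sql
SET reorder_joins_limit = 8;
SET optimizer_use_histograms = true;
SET optimizer_use_multi_col_stats = true;
```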
The most likely explanation is that this makes some internal query plans worse, but we were unable to confirm that the last time we looked at it. This is currently preventing histograms from being used by internal executor queries. However, there's not a strong motivation to enable them (we're not aware of any query plans that would improve with the use of histograms), so we're going to backlog this for now.
kv0/nodes=3/cpu=96 had a serious (66%) regression in terms of ops/s since the Apr 30th run. The regression occurs on both GCE and AWS.

Jira issue: CRDB-27754