sql: fix flake in TestTxnContentionEventsTable #118372

michae2 · 2024-01-26T22:15:05Z

In causeContention we deliberately hold a transaction open using pg_sleep to block an update statement. The timing we're trying to achieve is:

1. transaction insert
2. update starts and blocks
3. transaction held open using pg_sleep

We were using a WaitGroup to order (2) after (1), but there was no synchronization to ensure (3) came after (2).

This commit adds a retry loop that checks crdb_internal.cluster_queries to ensure (3) comes after (2).

Fixes: #118236

Release note: None

cockroach-teamcity · 2024-01-26T22:15:15Z


yuzefovich · 2024-01-26T22:48:10Z

Shouldn't we fix the test proper? I think the problem is that start is computed after the concurrent goroutine might have started, so in some very rare cases we begin executing the pg_sleep query before start is evaluated, and the measured contention ends up being less than 500ms. Perhaps the right moment to evaluate the start of contention (which is what start is about, IIUC) is after executing the INSERT but before unblocking the main goroutine.

We should also bump the value back to 500ms - this was changed recently in bd9d545.

I'm thinking about something like this (haven't tested it, though):

-       // Create a new connection, and then in a go routine have it start a
-       // transaction, update a row, sleep for a time, and then complete the
-       // transaction. With original connection attempt to update the same row
-       // being updated concurrently in the separate go routine, this will be
-       // blocked until the original transaction completes.
-       var wgTxnStarted sync.WaitGroup
-       wgTxnStarted.Add(1)
-
-       // Lock to wait for the txn to complete to avoid the test finishing
-       // before the txn is committed.
-       var wgTxnDone sync.WaitGroup
-       wgTxnDone.Add(1)
+       // contentionStartCh is used to communicate to the main goroutine when the
+       // INSERT has been performed by the worker goroutine, meaning that the
+       // contention can now start.
+       contentionStartCh := make(chan time.Time, 1)
+       // wg ensures that the function blocks until the concurrent goroutine exits.
+       var wg sync.WaitGroup
+       wg.Add(1)
+       defer wg.Wait()
 
        go func() {
-               defer wgTxnDone.Done()
+               defer wg.Done()
                tx, errTxn := conn.BeginTx(ctx, &gosql.TxOptions{})
                require.NoError(t, errTxn)
                _, errTxn = tx.ExecContext(ctx,
                        fmt.Sprintf("INSERT INTO %s (id, s) VALUES ('test', $1);", table),
                        insertValue)
                require.NoError(t, errTxn)
-               wgTxnStarted.Done()
+               contentionStartCh <- timeutil.Now()
                _, errTxn = tx.ExecContext(ctx, "select pg_sleep(.5);")
                require.NoError(t, errTxn)
                errTxn = tx.Commit()
                require.NoError(t, errTxn)
        }()
 
-       start := timeutil.Now()
-
-       // Need to wait for the txn to start to ensure lock contention.
-       wgTxnStarted.Wait()
+       // Need to wait for the INSERT to be performed to ensure lock contention.
+       start := <-contentionStartCh
        // This will be blocked until the updateRowWithDelay finishes.
        _, errUpdate := conn.ExecContext(
                ctx, fmt.Sprintf("UPDATE %s SET s = $1 where id = 'test';", table), updateValue)
        require.NoError(t, errUpdate)
        end := timeutil.Now()
-       require.GreaterOrEqual(t, end.Sub(start), 499*time.Millisecond)
-
-       wgTxnDone.Wait()
+       require.GreaterOrEqual(t, end.Sub(start), 500*time.Millisecond)
 }

Thoughts?

michae2

Yes, you are right, I just got lazy. Good call.

I tried your idea, but it was still possible for the select pg_sleep to start executing before the UPDATE. I think we need to wait until the UPDATE has actually started executing. We can't detect that from Go, because conn.ExecContext blocks, so I've added a retry loop that repeatedly queries crdb_internal.cluster_queries until it sees the UPDATE. WDYT?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @yuzefovich)

yuzefovich

Nice!

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mgartner and @michae2)

pkg/sql/crdb_internal_test.go line 1009 at r2 (raw file):

		// Wait for the update to show up in cluster_queries.
		var seen bool
		for i := 1; i <= 5 && !seen; i++ {

nit: consider using SucceedsSoon or SucceedsWithin helpers.

pkg/sql/crdb_internal_test.go line 1012 at r2 (raw file):

			time.Sleep(time.Duration(i) * 10 * time.Millisecond)
			row := tx.QueryRowContext(
				ctx, "SELECT EXISTS(SELECT * FROM crdb_internal.cluster_queries WHERE query LIKE '%/* shuba */')",

ISWYDT 😃


michae2 · 2024-02-05T22:36:50Z

TFTR!

bors r=yuzefovich

craig · 2024-02-06T01:48:07Z

Build succeeded:

Bazel Essential CI (Cockroach)

michae2 requested review from mgartner and yuzefovich January 26, 2024 22:15

michae2 force-pushed the b118236 branch from 434094c to 373310d Compare February 1, 2024 23:14

michae2 commented Feb 1, 2024


yuzefovich approved these changes Feb 2, 2024


michae2 force-pushed the b118236 branch from 373310d to 199a586 Compare February 3, 2024 07:23

michae2 added the backport-23.1.x and backport-23.2.x labels and removed the backport-23.1.x label Feb 5, 2024

craig bot merged commit 804d37e into cockroachdb:master Feb 6, 2024
9 checks passed

blathers-crl bot mentioned this pull request Feb 6, 2024

release-23.2: sql: fix flake in TestTxnContentionEventsTable #118808

Merged

michae2 deleted the b118236 branch February 6, 2024 23:33
