Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: fix flake in TestTxnContentionEventsTable #118372

Merged
merged 1 commit into from
Feb 6, 2024

Conversation

michae2
Copy link
Collaborator

@michae2 michae2 commented Jan 26, 2024

In causeContention we deliberately hold a transaction open using pg_sleep to block an update statement. The timing we're trying to achieve is:

  1. transaction insert
  2. update starts and blocks
  3. transaction held open using pg_sleep

We were using a WaitGroup to order (2) after (1), but there was no synchronization to ensure (3) came after (2).

This commit adds a retry loop that checks crdb_internal.cluster_queries to ensure (3) comes after (2).

Fixes: #118236

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@yuzefovich
Copy link
Member

Shouldn't we fix the test proper? I think the problem is that start is computed after the concurrent goroutine might have started, so in some rare cases we begin executing pg_sleep query before start is evaluated, so overall contention ends up being less than 500ms in some very rare cases. Perhaps the right moment to evaluate the start of contention (which is what start is about IIUC) is after executing the INSERT but before unblocking the main goroutine.

We should also bump the value back to 500ms - this was changed recently in bd9d545.

I'm thinking about something like this (haven't tested it though)

-       // Create a new connection, and then in a go routine have it start a
-       // transaction, update a row, sleep for a time, and then complete the
-       // transaction. With original connection attempt to update the same row
-       // being updated concurrently in the separate go routine, this will be
-       // blocked until the original transaction completes.
-       var wgTxnStarted sync.WaitGroup
-       wgTxnStarted.Add(1)
-
-       // Lock to wait for the txn to complete to avoid the test finishing
-       // before the txn is committed.
-       var wgTxnDone sync.WaitGroup
-       wgTxnDone.Add(1)
+       // contentionStartCh is used to communicate to the main goroutine when the
+       // INSERT has been performed by the worker goroutine, meaning that the
+       // contention can now start.
+       contentionStartCh := make(chan time.Time, 1)
+       // wg ensures that the function blocks until the concurrent goroutine exits.
+       var wg sync.WaitGroup
+       wg.Add(1)
+       defer wg.Wait()
 
        go func() {
-               defer wgTxnDone.Done()
+               defer wg.Done()
                tx, errTxn := conn.BeginTx(ctx, &gosql.TxOptions{})
                require.NoError(t, errTxn)
                _, errTxn = tx.ExecContext(ctx,
                        fmt.Sprintf("INSERT INTO %s (id, s) VALUES ('test', $1);", table),
                        insertValue)
                require.NoError(t, errTxn)
-               wgTxnStarted.Done()
+               contentionStartCh <- timeutil.Now()
                _, errTxn = tx.ExecContext(ctx, "select pg_sleep(.5);")
                require.NoError(t, errTxn)
                errTxn = tx.Commit()
                require.NoError(t, errTxn)
        }()
 
-       start := timeutil.Now()
-
-       // Need to wait for the txn to start to ensure lock contention.
-       wgTxnStarted.Wait()
+       // Need to wait for the INSERT to be performed to ensure lock contention.
+       start := <-contentionStartCh
        // This will be blocked until the updateRowWithDelay finishes.
        _, errUpdate := conn.ExecContext(
                ctx, fmt.Sprintf("UPDATE %s SET s = $1 where id = 'test';", table), updateValue)
        require.NoError(t, errUpdate)
        end := timeutil.Now()
-       require.GreaterOrEqual(t, end.Sub(start), 499*time.Millisecond)
-
-       wgTxnDone.Wait()
+       require.GreaterOrEqual(t, end.Sub(start), 500*time.Millisecond)
 }
 

Thoughts?

Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right, I just got lazy. Good call.

I tried your idea, but it was still possible for the select pg_sleep to start executing before the UPDATE. I think we need to wait until the UPDATE has actually started executing. We can't figure that out in go, because conn.ExecContext is blocking, so I've added a retry loop that repeatedly queries crdb_internal.cluster_queries until it has seen the query. WDYT?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @yuzefovich)

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! :lgtm:

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/sql/crdb_internal_test.go line 1009 at r2 (raw file):

		// Wait for the update to show up in cluster_queries.
		var seen bool
		for i := 1; i <= 5 && !seen; i++ {

nit: consider using SucceedsSoon or SucceedsWithin helpers.


pkg/sql/crdb_internal_test.go line 1012 at r2 (raw file):

			time.Sleep(time.Duration(i) * 10 * time.Millisecond)
			row := tx.QueryRowContext(
				ctx, "SELECT EXISTS(SELECT * FROM crdb_internal.cluster_queries WHERE query LIKE '%/* shuba */')",

ISWYDT 😃

In causeContention we deliberately hold a transaction open using
pg_sleep to block an update statement. The timing we're trying to
achieve is:

1. transaction insert
2. update starts and blocks
3. transaction held open using pg_sleep

We were using a WaitGroup to order (2) after (1), but there was no
synchronization to ensure (3) came after (2).

This commit adds a retry loop that checks
`crdb_internal.cluster_queries` to ensure (3) comes after (2).

Fixes: cockroachdb#118236

Release note: None
@michae2 michae2 added backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2. and removed backport-23.1.x Flags PRs that need to be backported to 23.1 labels Feb 5, 2024
@michae2
Copy link
Collaborator Author

michae2 commented Feb 5, 2024

TFTR!

bors r=yuzefovich

@craig craig bot merged commit 804d37e into cockroachdb:master Feb 6, 2024
9 checks passed
@craig
Copy link
Contributor

craig bot commented Feb 6, 2024

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.2.x Flags PRs that need to be backported to 23.2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: TestTxnContentionEventsTable failed
3 participants