Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestDeadlock fails intermittently in local environment with go test -races #508

Closed
IronCore864 opened this issue Oct 10, 2024 · 3 comments · Fixed by #511
Closed

TestDeadlock fails intermittently in local environment with go test -races #508

IronCore864 opened this issue Oct 10, 2024 · 3 comments · Fixed by #511
Labels
Tests Test failures (usually intermittent)

Comments

@IronCore864
Copy link
Contributor

We have a test case TestDeadlock in internals/daemon/api_services_test.go where we randomly issue a bunch of pebble start/stop commands (c.a. 50 in total or so) within a second and expect all the changes to be done. This test case fails intermittently with go test -race because not all changes can be ready in certain cases. In my local Ubuntu 22.04 machine it's relatively easy to reproduce, failing once every five times or so, but weirdly it never occurs in GitHub Action runs.

Error message:

----------------------------------------------------------------------
FAIL: api_services_test.go:367: apiSuite.TestDeadlock

api_services_test.go:450:
    c.Fatal("timed out waiting for final request -- deadlock!")
... Error: timed out waiting for final request -- deadlock!

==================
WARNING: DATA RACE
Write at 0x0000012d2198 by goroutine 774:
  github.com/canonical/pebble/internals/daemon.(*Daemon).addRoutes()
      /home/ubuntu/pebble/internals/daemon/daemon.go:390 +0x1bb
  github.com/canonical/pebble/internals/daemon.(*apiSuite).daemon()
      /home/ubuntu/pebble/internals/daemon/api_test.go:77 +0x17c
  github.com/canonical/pebble/internals/daemon.(*apiSuite).TestGetPlan()
      /home/ubuntu/pebble/internals/daemon/api_plan_test.go:63 +0xaa
  runtime.call16()
      /snap/go/10730/src/runtime/asm_amd64.s:775 +0x42
  reflect.Value.Call()
      /snap/go/10730/src/reflect/value.go:365 +0xb5
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/check.go:775 +0x965
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/check.go:669 +0xe9

Previous read at 0x0000012d2198 by goroutine 761:
  github.com/canonical/pebble/internals/daemon.v1PostServices()
      /home/ubuntu/pebble/internals/daemon/api_services.go:74 +0x313
  github.com/canonical/pebble/internals/daemon.(*apiSuite).TestDeadlock.func2()
      /home/ubuntu/pebble/internals/daemon/api_services_test.go:405 +0x277
  github.com/canonical/pebble/internals/daemon.(*apiSuite).TestDeadlock.func4()
      /home/ubuntu/pebble/internals/daemon/api_services_test.go:424 +0x58

Goroutine 774 (running) created at:
  gopkg.in/check%2ev1.(*suiteRunner).forkCall()
      /home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/check.go:666 +0x5ba
  gopkg.in/check%2ev1.(*suiteRunner).forkTest()
      /home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/check.go:757 +0x155
  gopkg.in/check%2ev1.(*suiteRunner).runTest()
      /home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/check.go:812 +0x419
  gopkg.in/check%2ev1.(*suiteRunner).run()
      /home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/check.go:618 +0x3c6
  gopkg.in/check%2ev1.Run()
      /home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/run.go:92 +0x44
  gopkg.in/check%2ev1.RunAll()
      /home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/run.go:84 +0x124
  gopkg.in/check%2ev1.TestingT()
      /home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/run.go:72 +0x5d3
  github.com/canonical/pebble/internals/daemon.Test()
      /home/ubuntu/pebble/internals/daemon/daemon_test.go:52 +0x26
  testing.tRunner()
      /snap/go/10730/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /snap/go/10730/src/testing/testing.go:1743 +0x44

Goroutine 761 (running) created at:
  github.com/canonical/pebble/internals/daemon.(*apiSuite).TestDeadlock()
      /home/ubuntu/pebble/internals/daemon/api_services_test.go:422 +0x5d4
  runtime.call16()
      /snap/go/10730/src/runtime/asm_amd64.s:775 +0x42
  reflect.Value.Call()
      /snap/go/10730/src/reflect/value.go:365 +0xb5
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/check.go:775 +0x965
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/ubuntu/go/pkg/mod/gopkg.in/[email protected]/check.go:669 +0xe9

I tried to use the Go client to do the same as the test case with a manually started Pebble then check the changes' status but I didn't reproduce this issue successfully. Code used:

package main

import (
	"context"
	"fmt"
	"log"
	"math/rand"
	"time"

	"github.com/canonical/pebble/client"
)

func getServices(pebble *client.Client) {
	_, err := pebble.Services(&client.ServicesOptions{})
	if err != nil {
		panic(fmt.Sprintf("getServices error: %v", err))
	}
}

func start(pebble *client.Client) {
	_, err := pebble.Start(&client.ServiceOptions{Names: []string{"test"}})
	if err != nil {
		panic(err)
	}
}

func stop(pebble *client.Client) {
	_, err := pebble.Stop(&client.ServiceOptions{Names: []string{"test"}})
	if err != nil {
		panic(err)
	}
}

func main() {
	pebble, err := client.New(&client.Config{Socket: "/home/ubuntu/PEBBLE_HOME/.pebble.socket"})
	if err != nil {
		log.Fatal(err)
	}

	ctx, cancel := context.WithCancel(context.Background())

	go func() {
		for ctx.Err() == nil {
			getServices(pebble)
			time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond)
		}
	}()

	go func() {
		for ctx.Err() == nil {
			start(pebble)
			time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond)
		}
	}()

	go func() {
		for ctx.Err() == nil {
			stop(pebble)
			time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond)
		}
	}()

	time.Sleep(time.Second)
	cancel()

	// Try to hit GET /v1/services once more; if it times out -- deadlock!
	done := make(chan struct{})
	go func() {
		getServices(pebble)
		done <- struct{}{}
	}()
	select {
	case <-done:
	case <-time.After(time.Second):
		log.Fatal("timed out waiting for final request -- deadlock!")
	}
}
@IronCore864 IronCore864 added the Tests Test failures (usually intermittent) label Oct 10, 2024
@dimaqq
Copy link

dimaqq commented Oct 11, 2024

my 2c: extend the timeout. a deadlock would still be detected.

@benhoyt
Copy link
Contributor

benhoyt commented Oct 11, 2024

Tiexin and I rubber-ducked this a bit, and it's not clear what's causing the issue, but it may be a real problem. I had thought this was unrelated, but it's likely related to the "allow stopping services in 'starting' state" change we just merged.

We've already determined that increasing that "timed out waiting for final request" timeout from 1s to 5s doesn't help (and this is not on a loaded machine), so there's clearly something going on.

It could be a buggy test, or it could be buggy code, so let's dig further. Tiexin's going to 1) try to reproduce the failure before fix #503, and 2) add logging to try to determine where the problem is.

@IronCore864
Copy link
Contributor Author

IronCore864 commented Oct 14, 2024

After some debugging and testing, I figured out that it was a deadlock issue with Ringbuffer:

Ringbuffer HeadIterator() (and TailIterator(), for that matter) acquires the iteratorMutex lock first then the rwlock (inexplicitly in rb.Closed()), but Close() (to be differentiated with Closed() acquires the rwlock first then the iteratorMutex lock (inexplicitly in releaseIterators), causing a deadlock.

It can be reproduced with a test like this:

func BenchmarkRingBufferDeadlock(b *testing.B) {
	payload := []byte("foobar")

	var rb *servicelog.RingBuffer

	ctx, cancel := context.WithCancel(context.Background())

	go func() {
		for ctx.Err() == nil {
			fmt.Println("A")
			rb = servicelog.NewRingBuffer(10 * len(payload))
			_ = rb.HeadIterator(0)
			time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond)
		}
	}()
	go func() {
		for ctx.Err() == nil {
			if rb != nil {
				fmt.Println("B")
				rb.Close()
			}
			time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond)
		}
	}()

	time.Sleep(10 * time.Second)
	cancel()
}

And after instrumenting ringbuffer with some logs for locking/unlocking, we can observe the deadlock:

B
=== in Close, rwlock locked
=== in Close, iteratorMutex locked
=== in Close, iteratorMutex unlocked
=== Close done!
=== in Close, rwlock unlocked
A
=== in HeadIterator, iteratorMutex locked
=== in Closed, rwlock locked
=== in Closed, rwlock unlocked
=== HeadIterator done!
=== in HeadIterator, iteratorMutex unlocked
A
=== in HeadIterator, iteratorMutex locked
=== in Closed, rwlock locked
=== in Closed, rwlock unlocked
=== HeadIterator done!
=== in HeadIterator, iteratorMutex unlocked
A
B
=== in Close, rwlock locked
=== in HeadIterator, iteratorMutex locked

(hangs here)

IronCore864 added a commit that referenced this issue Oct 14, 2024
…rting state (#510)

In [this PR](#503), we
introduced a feature to allow stopping services that are in the starting
state / within the okayDelay. However, this makes [a deadlock issue in
ringbuffer a real
issue](#508). So we are
reverting this change now and will redo it after the deadlock is
resolved.
IronCore864 added a commit that referenced this issue Oct 21, 2024
Previously, we introduced [a fix to allow stopping services in the
starting state](#503).

Because of this fix, we discovered [another deadlock
issue](#508), so we rolled it
back. Now that the deadlock issue is [fixed by this
PR](https://github.com/canonical/pebble/pull/511/files), we are
reintroducing the fix about "allowing stopping services in the starting
state".

Benchmark test, manual test to reproduce the 3-way deadlock issue, and
race test are all done and passed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Tests Test failures (usually intermittent)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants