
tests/integration: Fix flaky TestScaleoutClusterSuite #242

Merged

Conversation

@anoopcs9
Collaborator

anoopcs9 commented Aug 27, 2022

TestScaleoutClusterSuite fails more frequently with the following error:

=== RUN   TestIntegration/reconciliation/scaleoutCluster/TestScaleoutClusterSuite
    reconcile_test.go:151:
        Error Trace:    reconcile_test.go:151
        Error:          Not equal:
                        expected: 3
                        actual  : 2
        Test:           TestIntegration/reconciliation/scaleoutCluster/TestScaleoutClusterSuite
        Messages:       Clustersize not as expected

The above check is to make sure that the number of replicas within the StatefulSet reflects the updated SmbShare.Spec.Scaling.MinClusterSize. But an immediate check on StatefulSet.Spec.Replicas might not always give us the desired (updated) value.

Therefore we retry this check within a brief 3-second timeout to account for any delay in the field update. In addition, we at least wait for the existence of the extra pods corresponding to the updated replica count. Considering the increased overall test time, we further raise the timeout from 20m to 30m.
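
For illustration only, here is a minimal sketch of that retry using plain client-go; the test itself uses the repo's own poll helpers, and the package name, clientset and ns/name arguments below are placeholders rather than the merged code:

package integration

import (
    "context"
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// waitForReplicas is a hypothetical helper (not the merged code) sketching
// the retry described above: re-read the StatefulSet until Spec.Replicas
// matches the updated MinClusterSize, giving up after a 3-second timeout.
func waitForReplicas(ctx context.Context, c kubernetes.Interface, ns, name string, want int32) error {
    ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
    defer cancel()
    for {
        sts, err := c.AppsV1().StatefulSets(ns).Get(ctx, name, metav1.GetOptions{})
        if err == nil && sts.Spec.Replicas != nil && *sts.Spec.Replicas == want {
            return nil // replica count now reflects the updated cluster size
        }
        select {
        case <-ctx.Done():
            return fmt.Errorf("replicas did not reach %d: %w", want, ctx.Err())
        case <-time.After(time.Second):
            // retry roughly once per second until the timeout expires
        }
    }
}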

@anoopcs9
Collaborator Author

Hmm... we were already on the verge of the test timeout (20m), and now with this change the possibility is higher, as we can see with the very first run.

=== RUN   TestIntegration/reconciliation/scaleoutCluster
    reconcile_test.go:142: test ID: gtxg2j
=== RUN   TestIntegration/reconciliation/scaleoutCluster/TestScaleoutClusterSuite
=== RUN   TestIntegration/resourceUpdate
=== RUN   TestIntegration/resourceUpdate/SmbShareUpdateSuite
    resource_update_test.go:78: test ID: 7v842d
=== RUN   TestIntegration/resourceUpdate/SmbShareUpdateSuite/TestEditReadOnly
    resource_update_test.go:149: checking smbclient login to share
    resource_update_test.go:186: Setting readonly=true for SmbShare samba-operator-system/tshare1-7v842d
    resource_update_test.go:227: checking smbclient write to share
    resource_update_test.go:186: Setting readonly=false for SmbShare samba-operator-system/tshare1-7v842d
panic: test timed out after 20m0s

Collaborator

@phlogistonjohn left a comment

The only concern I have with the change is that it seems like we're checking things in reverse order. In other words, the operator is going to update the StatefulSet based on changes to the SmbShare, then the kube built-in controllers will update the pods based on the StatefulSet. Now the test waits for the expected number of pods before checking the values of the StatefulSet, which strikes me as a bit inverted.

I guess the test already had that flaw though. Maybe what we really want is (pseudocode):

updateSmbShare();
ctx2 := contextWithTimeout()
poll(ctx2, func() {
  l, err := StatefulSets.List(...)
  checkStatefulSet(l)
})
require.NoError(waitForPodExist(ctx, s), "smb server pod does not exist")

What do you think?

tests/integration/reconcile_test.go (outdated inline review comment, resolved)
@phlogistonjohn
Collaborator

phlogistonjohn commented Aug 27, 2022

Hmm... we were already on the verge of the test timeout (20m), and now with this change the possibility is higher, as we can see with the very first run.

On Thursday I was saying that I noticed the non-clustered runs were taking a while; when I had some networking issues the suite was timing out. I have a change to increase both by 10 mins (to 20 minutes and 30 minutes). e038c17 is clearly not yet ready for prime time, but feel free to take and adapt it for this PR if you want to.

@anoopcs9
Collaborator Author

Hmm... we were already on the verge of the test timeout (20m), and now with this change the possibility is higher, as we can see with the very first run.

On Thursday I was saying that I noticed the non-clustered runs were taking a while; when I had some networking issues the suite was timing out. I have a change to increase both by 10 mins (to 20 minutes and 30 minutes). e038c17 is clearly not yet ready for prime time, but feel free to take and adapt it for this PR if you want to.

I added the change increasing the timeout for clustered test runs from 20m to 30m.

@anoopcs9
Collaborator Author

I guess the test already had that flaw though. Maybe what we really want is (pseudocode):

updateSmbShare();
ctx2 := contextWithTimeout()
poll(ctx2, func() {
  l, err := StatefulSets.List(...)
  checkStatefulSet(l)
})
require.NoError(waitForPodExist(ctx, s), "smb server pod does not exist")

What do you think?

Ok, sounds reasonable.
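
And for the second half of that order (waiting for the extra pods once the StatefulSet has been bumped), a similarly hypothetical sketch; the helper name and label selector are assumptions for illustration, not the repo's actual waitForPodExist helper, and it relies on the same imports as the earlier sketch:

// waitForPodCount is a hypothetical sketch, not the merged code: it polls
// until at least `want` server pods exist or the caller's context expires.
func waitForPodCount(ctx context.Context, c kubernetes.Interface, ns, selector string, want int) error {
    for {
        pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
        if err == nil && len(pods.Items) >= want {
            return nil // the extra pods for the new replicas now exist
        }
        select {
        case <-ctx.Done():
            return fmt.Errorf("expected at least %d pods: %w", want, ctx.Err())
        case <-time.After(time.Second):
            // keep polling until the caller's context times out
        }
    }
}

Either way, checking the StatefulSet first and only then waiting for the pods matches the order in which the operator and the built-in controllers actually act.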

@anoopcs9 force-pushed the fix-scaleout-cluster-test branch 2 times, most recently from f9f736b to 42dc35a on August 29, 2022 09:01
@anoopcs9
Collaborator Author

Networking issues on the infra where CentOS CI is hosted: https://lists.centos.org/pipermail/ci-users/2022-August/004605.html

The Rook setup is most likely unsuccessful due to the above outage. But here is a new make check failure from golangci-lint:

internal/resources/metrics.go:14:2: could not import k8s.io/apimachinery/pkg/types (-: could not load export
data: cannot import "k8s.io/apimachinery/pkg/types" (unstable iexport format version 2, just rebuild compiler
and std library), export data is newer version - update tool) (typecheck)
	"k8s.io/apimachinery/pkg/types"
	^

@phlogistonjohn
Collaborator

Wow, that's quite the error. Oddly, that's not what I see when I look in the CI logs. Rather, I see:

go: downloading google.golang.org/protobuf v1.27.1
go: downloading github.com/matttproud/golang_protobuf_extensions v1.0.1
golangci-lint installed in /home/runner/work/samba-operator/samba-operator/.bin
/home/runner/work/samba-operator/samba-operator/.bin/golangci-lint -c .golangci.yaml run ./...
! gofmt -e -s -l . | sed 's,^,formatting error: ,' | grep 'go$'
make: *** [Makefile:214: check-format] Error 1
formatting error: tests/integration/reconcile_test.go
Error: Process completed with exit code 2.

@phlogistonjohn
Collaborator

I looked at the patch, and the CI is "right": there are formatting issues in your patch (spacing in the struct fields in the poll.Prober).
I'm wondering if the other error is a local issue on your setup with that linter?

@anoopcs9 force-pushed the fix-scaleout-cluster-test branch from 42dc35a to 0847fb2 on August 30, 2022 07:22
@anoopcs9
Collaborator Author

I looked at the patch, and the CI is "right": there are formatting issues in your patch (spacing in the struct fields in the poll.Prober).

Running gofmt with -d reported the following diff:

--- tests/integration/reconcile_test.go.orig	2022-08-30 12:49:55.592909058 +0530
+++ tests/integration/reconcile_test.go	2022-08-30 12:49:55.592909058 +0530
@@ -202,7 +202,7 @@
 		ctx, smbShare)
 	require.NoError(err)
 
-	ctx2, cancel := context.WithTimeout(s.defaultContext(), 3 * time.Second)
+	ctx2, cancel := context.WithTimeout(s.defaultContext(), 3*time.Second)
 	defer cancel()
 	s.Require().NoError(poll.TryUntil(ctx2, &poll.Prober{
 		RetryInterval: time.Second,

The -d option, which is present even in older versions, can better pinpoint the formatting error.

I'm wondering if the other error is a local issue on your setup with that linter?

internal/resources/metrics.go:14:2: could not import k8s.io/apimachinery/pkg/types (-: could not load export
data: cannot import "k8s.io/apimachinery/pkg/types" (unstable iexport format version 2, just rebuild compiler
and std library), export data is newer version - update tool) (typecheck)
	"k8s.io/apimachinery/pkg/types"
	^

I could see a few issues reported online, and it has to do with the installed versions of Go and golangci-lint. I encountered the above error on a system with the following versions installed:

# go version
go version go1.18.4 linux/amd64

# .bin/golangci-lint --version
golangci-lint has version v1.43.0 built from (unknown, mod sum: "h1:SLwZFEmDgopqZpfP495zCtV9REUf551JJlJ51Ql7NZA=") on (unknown)

Our install script restricts golangci-lint to v1.43.0, and GitHub CI runs on Go 1.16. In order to get rid of this new error I updated to the latest golangci-lint.

@phlogistonjohn
Collaborator

GitHub CI runs on Go 1.16

Oh, we should definitely update that at some point soonish.

phlogistonjohn previously approved these changes Aug 31, 2022
Collaborator

@phlogistonjohn left a comment

Looks OK to me, thanks!

@anoopcs9
Collaborator Author

anoopcs9 commented Sep 1, 2022

/test centos-ci/sink-clustered/mini-k8s-1.24

Collaborator

@phlogistonjohn left a comment

Please make the test.sh change a separate commit and then we should be all good.

TestScaleoutClusterSuite fails more frequently with the following error:

=== RUN   TestIntegration/reconciliation/scaleoutCluster/TestScaleoutClusterSuite
    reconcile_test.go:151:
        Error Trace:    reconcile_test.go:151
        Error:          Not equal:
                        expected: 3
                        actual  : 2
        Test:           TestIntegration/reconciliation/scaleoutCluster/TestScaleoutClusterSuite
        Messages:       Clustersize not as expected

The above check is to make sure that the number of replicas within the
StatefulSet reflects the updated SmbShare.Spec.Scaling.MinClusterSize.
But an immediate check on StatefulSet.Spec.Replicas might not always
give us the desired (updated) value.

Therefore we retry this check within a brief 3-second timeout to
account for any delay in the field update. In addition, we at least
wait for the existence of the extra pods corresponding to the updated
replica count.

Signed-off-by: Anoop C S <[email protected]>
@anoopcs9 force-pushed the fix-scaleout-cluster-test branch from db9da7a to b64253f on September 21, 2022 15:47
Collaborator

@phlogistonjohn left a comment

LGTM.
I was fine with it being a separate commit rather than a full-blown separate PR, but either way this now looks fine AFAIAC.

@phlogistonjohn merged commit fd5ef50 into samba-in-kubernetes:master on Sep 21, 2022
@anoopcs9 deleted the fix-scaleout-cluster-test branch on September 22, 2022 05:45
@anoopcs9
Collaborator Author

LGTM. I was fine with it being a separate commit rather than a full-blown separate PR, but either way this now looks fine AFAIAC.

Ah... my bad!
