
Fix close in use certificate providers after double Close() method call on wrapper object #7128

Merged
merged 5 commits into grpc:master on May 29, 2024

Conversation

@bozaro (Contributor) commented Apr 16, 2024

wrappedProvider adds a reference counter to certprovider.Provider, but there are no guards to prevent multiple Close() calls after a single Build() call.

The situation is complicated by:

  • the error does not surface where the double close happens;
  • the problem may not manifest until certificate rotation, so the effect is greatly delayed in time;
  • certprovider.Provider implementations don't panic or return an error on multiple Close() calls.

Fix errors like:

```
[core][Channel #39 SubChannel #941] grpc: addrConn.createTransport failed to connect to ...
Err: connection error: desc = "transport: authentication handshake failed:
xds: fetching trusted roots from CertificateProvider failed: provider instance is closed"
```

RELEASE NOTES:

  • Fix `xds: fetching trusted roots from CertificateProvider failed: provider instance is closed` errors

…call on wrapper object

@aranjans aranjans added this to the 1.64 Release milestone Apr 17, 2024
@arvindbr8 (Member) commented Apr 17, 2024

@bozaro -- Calling Close() twice feels more like a user error to me, and I wouldn't categorize this as a bug. I agree that the API could have been better by not providing a Close() method that can be called multiple times, which causes incorrect accounting with respect to wp.refCount.

That being said, in order to fix this, I would rather make Provider.Close a grpcsync.OnceFunc. That way we can cleanly ensure that each holder decrements the reference count at most once when calling Close().

@bozaro (Contributor, Author) commented Apr 18, 2024

Calling Close() feels more like a User error to me and wouldnt categorize this as a bug.

I think that even if the double Close() call is considered a user error, it is worth adding a check here to localize its consequences and avoid playing whack-a-mole.

The situation when all neighboring uses must be written correctly for your code to work is very fragile.

I have found at least two places through which double Close() is possible:

That being said, inorder to fix this, I would rather make Provider.Close a grpcsync.OnceFunc.

I can also wrap Close in grpcsync.OnceFunc, but this will not make the code simpler: I still need to track the closed state so that KeyMaterial on a closed wrapper returns an error.

@easwars (Contributor) commented May 7, 2024

@bozaro : Thank you for figuring this out and sending us a PR.

There are a few options available to fix this issue:

  • Make breaking API changes to this package
    • Remove the Close method from the Provider interface
    • Change the signature of the starter argument passed to NewBuildableConfig to return a tuple Provider, func(). The second return value is a close function to be invoked when the provider is to be closed.
    • Change BuildableConfig.Build() to return three values: (Provider, func(), error). The second return value would be the close function returned by the provider, wrapped in a grpcsync.OnceFunc.
    • With this approach, the user of a provider will be able to close it as many times as they want, and things will still continue to work cleanly. But this approach could hide potential bugs in the user's code that never show up in their tests and only happen in production. Therefore we want to avoid this approach.
  • The approach taken by you, where we return a separate per-instantiation wrapper that wraps the current wrappedProvider.
    • With this approach, the user of a provider instance that has called Close will start seeing errProviderClosed on subsequent invocations of KeyMaterial()
    • While this approach is a significant improvement over the existing code, this still will not help the user easily debug when and where they are double closing.

Instead the approach that we like (which is quite similar to your approach) is as follows:

  • Have a per-instantiation wrapper that wraps the current wrappedProvider. Call it something else, wrappedProviderCloser is too close to wrappedProvider. Maybe something like singleCloseWrappedProvider or singleRefWrappedProvider or something better.
```go
type singleCloseWrappedProvider struct {
	mu       sync.Mutex
	provider Provider
}

func (s *singleCloseWrappedProvider) KeyMaterial(ctx context.Context) (*KeyMaterial, error) {
	s.mu.Lock()
	p := s.provider
	s.mu.Unlock()
	return p.KeyMaterial(ctx)
}

func (s *singleCloseWrappedProvider) Close() {
	s.mu.Lock()
	s.provider.Close()
	s.provider = panickingProvider{}
	s.mu.Unlock()
}
```
  • Have a provider implementation that panics in its Close() method.
```go
type panickingProvider struct{}

func (panickingProvider) Close() {
	panic("Close called more than once on certificate provider")
}

func (panickingProvider) KeyMaterial(context.Context) (*KeyMaterial, error) {
	return nil, errProviderClosed
}
```
  • Change BuildableConfig.Build() to always return an instance of singleCloseWrappedProvider wrapping a newly created wrappedProvider or an existing one (in which case, the ref count on it would have to be incremented, as is done today)

Please let me know your thoughts on this approach.

@bozaro force-pushed the fix-certificate-provider-close branch from fd7af42 to 64fdac1 on May 11, 2024 07:17
@bozaro (Contributor, Author) commented May 11, 2024

Make breaking API changes to this package...

This approach has the opposite problem: you can keep using a Provider after closing it, as long as someone else still holds a reference.

The approach taken by you, where we return a separate per-instantiation wrapper that wraps the current wrappedProvider...

I don't think double closure is a problem that needs to be dealt with:

  1. Go often uses a pattern like:

```go
func foo() error {
	x, err := os.Open(fileName)
	if err != nil {
		return err
	}
	defer x.Close()
	// ...
	if err := x.Close(); err != nil {
		return err
	}
	// ...
	return nil
}
```
  2. Errors will be localized:
    • Close of one wrapper will not break neighboring ones;
    • an unexpected Close of a wrapper will only break code that continues to use that same wrapper.
  3. The current Provider implementation ignores double Close calls.

Instead the approach that we like (which is quite similar to your approach) is as follows:

I have updated the PR accordingly, except that I do not panic on a double Close() call.

@bozaro (Contributor, Author) commented May 15, 2024

@easwars, as a result, I'm trying to make the wrappedProvider have the same behavior as the certprovider.Provider itself.

@easwars (Contributor) commented May 15, 2024

I don't think double closure is a problem that needs to be dealt with:

I agree that there are things like files and conns, where Go supports multiple calls to close. But there are things like channels, which panic when closed more than once.

But by making a double close panic in our case, we can ensure that our users are notified right at the moment they are doing something wrong (i.e. closing the same provider more than once). This also ensures that the problem can be caught in their testing.

I agree that both approaches (whether we panic or not on double close) result in KeyMaterial returning errProviderClosed when the user uses a closed provider, and that this can also be caught in their testing.

@bozaro (Contributor, Author) commented May 16, 2024

@easwars, I tried adding panic when closing twice: https://github.com/bozaro/grpc-go/pull/2/files (all unit tests green).

But:

  • This is a breaking change.
  • I don't really understand what this panic buys us in practice.
  • Unfortunately, I am not able to reproduce the double Close() problem in a test environment, only in production (reproducing the issue requires xds:///).
  • When using the server in production with xds:/// + mTLS, a double Close() call at the shutdown stage is guaranteed (or at least very frequent).

@easwars (Contributor) commented May 16, 2024

@dfawley for a second set of eyes.

@dfawley (Member) left a comment

When using the server in production with xds:/// + mTLS, there is guaranteed (or very often) to be a double Close() call at the shutdown stage.

This is concerning, as it indicates there may be a problem with either the basic usage of the provider itself, or the ownership model surrounding the provider. Can you provide any more information about how this happens?

```go
}

// singleCloseWrappedProvider wraps a provider instance with a reference count to avoid double
// close still in use provider.
```
@dfawley (Member):

Nit:

... with a reference count to properly handle multiple calls to Close.

```go
	w.mu.Lock()
	defer w.mu.Unlock()

	w.provider.Close()
```
@dfawley (Member):

Can we do this call without the lock held?

@bozaro (Contributor, Author):

I have replaced the use of sync.RWMutex with atomic.Pointer.

@bozaro force-pushed the fix-certificate-provider-close branch from eccee0e to 527266f on May 17, 2024 05:09
@bozaro (Contributor, Author) commented May 17, 2024

Can you provide any more information about how this happens?

The most common example of double Close() (I don't think it's the only one):

First Close() call:

```
google.golang.org/grpc/xds/internal/server.(*connWrapper).Close(0x4008382d80)
	go/pkg/mod/google.golang.org/[email protected]/xds/internal/server/conn_wrapper.go:162 +0x50
crypto/tls.(*Conn).Close(0x40031f6e08?)
	GOROOT/src/crypto/tls/conn.go:1429 +0xf4
google.golang.org/grpc/credentials/xds.(*credsImpl).ServerHandshake(0x40048b0030, {0x36bf288, 0x4008382d80})
	go/pkg/mod/google.golang.org/[email protected]/credentials/xds/xds.go:250 +0x3f8
google.golang.org/grpc/internal/transport.NewServerTransport({0x36bf288, 0x4008382d80}, 0x400170be58)
	go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:146 +0x84
google.golang.org/grpc.(*Server).newHTTP2Transport(0x40019f2a00, {0x36bf288, 0x4008382d80})
	go/pkg/mod/google.golang.org/[email protected]/server.go:974 +0xfc
google.golang.org/grpc.(*Server).handleRawConn(0x40019f2a00, {0x4001db7f80, 0x9}, {0x36bf288, 0x4008382d80})
	go/pkg/mod/google.golang.org/[email protected]/server.go:933 +0x94
google.golang.org/grpc.(*Server).Serve.func3()
	go/pkg/mod/google.golang.org/[email protected]/server.go:917 +0x64
created by google.golang.org/grpc.(*Server).Serve in goroutine 429
	go/pkg/mod/google.golang.org/[email protected]/server.go:916 +0x4f4
```

Second Close() call:

```
google.golang.org/grpc/xds/internal/server.(*connWrapper).Close(0x4008382d80)
	go/pkg/mod/google.golang.org/[email protected]/xds/internal/server/conn_wrapper.go:162 +0x50
google.golang.org/grpc.(*Server).newHTTP2Transport(0x40019f2a00, {0x36bf288, 0x4008382d80})
	go/pkg/mod/google.golang.org/[email protected]/server.go:986 +0x37c
google.golang.org/grpc.(*Server).handleRawConn(0x40019f2a00, {0x4001db7f80, 0x9}, {0x36bf288, 0x4008382d80})
	go/pkg/mod/google.golang.org/[email protected]/server.go:933 +0x94
google.golang.org/grpc.(*Server).Serve.func3()
	go/pkg/mod/google.golang.org/[email protected]/server.go:917 +0x64
created by google.golang.org/grpc.(*Server).Serve in goroutine 429
	go/pkg/mod/google.golang.org/[email protected]/server.go:916 +0x4f4
```

@easwars (Contributor) commented May 17, 2024

@bozaro : Thank you very much for the stack traces.

I think this is what is happening:

  • xDS enabled gRPC server gets a connection
  • This server uses a custom net.Listener which accepts the connection and returns a connWrapper here:
    cw := &connWrapper{Conn: conn, filterChain: fc, parent: l, urc: fc.UsableRouteConfiguration}
  • The gRPC server then spawns a goroutine to handle this connection here:
    s.handleRawConn(lis.Addr().String(), rawConn)
  • As part of handling the new connection, an HTTP/2 server transport is created here:
    st, err := transport.NewServerTransport(c, config)
    • If the server transport creation fails and the returned error value is not credentials.ErrConnDispatched, the rawConn is closed, which ends up calling Close on the connWrapper (this will be the second close)
  • As part of creating the HTTP/2 server transport, the server TLS handshake is initiated here:
    conn, authInfo, err = config.Credentials.ServerHandshake(rawConn)
  • This call to ServerHandshake is handled by the xDS credentials implementation here:
    func (c *credsImpl) ServerHandshake(rawConn net.Conn) (net.Conn, credentials.AuthInfo, error) {
  • It wraps the rawConn in a tls.Conn here:
    conn := tls.Server(rawConn, cfg)
  • It then performs the TLS handshake by calling the Handshake method on the tls.Conn here:
    if err := conn.Handshake(); err != nil {
    • And if the handshake fails, then it calls Close on the tls.Conn, which ends up calling Close on the wrapped conn, which essentially lands in connWrapper (this will be the first close)
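The sequence above can be reduced to a minimal sketch in which every name (countingConn, tlsHandshake, newServerTransport) is a hypothetical stand-in for the corresponding gRPC component; it only demonstrates how the two error paths each call Close on the same conn:

```go
package main

import "fmt"

// countingConn stands in for the xDS connWrapper; it counts Close calls.
type countingConn struct{ closes int }

func (c *countingConn) Close() { c.closes++ }

// tlsHandshake stands in for the TLS server handshake: on failure it closes
// the conn it wraps, just as tls.Conn.Close closes the underlying conn.
func tlsHandshake(c *countingConn, fail bool) error {
	if fail {
		c.Close() // first close, from inside the handshake path
		return fmt.Errorf("handshake failed")
	}
	return nil
}

// newServerTransport stands in for the gRPC server's transport creation:
// on a handshake error it closes the raw conn itself.
func newServerTransport(c *countingConn) error {
	if err := tlsHandshake(c, true); err != nil {
		c.Close() // second close, from the error-handling path
		return err
	}
	return nil
}

func main() {
	conn := &countingConn{}
	_ = newServerTransport(conn)
	fmt.Println(conn.closes) // 2
}
```

Neither layer is individually wrong to close the conn it owns, which is why the double close has to be tolerated somewhere below them.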

Also, I see that we have defined the error sentinel ErrConnDispatched here:

```go
var ErrConnDispatched = errors.New("credentials: rawConn is dispatched out of gRPC")
```

but none of the credentials implementations return this error.

I feel that the xDS credentials implementation should return this error in the case where the server handshake failed, and it ended up calling Close on the tls.Conn.

@dfawley
Do you know some history behind this error sentinel? What do you think about returning this value from the xDS creds implementation in the above case? Thanks.

@dfawley (Member) commented May 20, 2024

Dispatched is used to allow another protocol to handle the connection. That's not intended for this kind of use case.

Interesting sequence of events. It seems if we use TLS to wrap a connection, we need to allow for double-close (either in connWrapper or certprovider), because:

  1. we can't start requiring that a custom ServerHandshake implementation close the connection when it errors (that would be a behavior change), and
  2. we probably shouldn't remove the conn.Close from our TLS handshaker, because not calling the Close method on a tls.Conn is probably a bad idea (even if it doesn't leak anything today).

@dfawley dfawley assigned easwars and unassigned dfawley May 20, 2024
@bozaro (Contributor, Author) commented May 21, 2024

Do I need to do anything with this PR (e.g. rebase)?

@easwars (Contributor) commented May 28, 2024

@dfawley : Based on our offline discussion, the changes here reflect mostly what we want. We can track the double close in the TLS code path separately (if required, and if we can do anything actionable there).

This is waiting for your second review, apart from the minor nit that I have above.

@dfawley dfawley assigned easwars and unassigned dfawley May 29, 2024
@easwars easwars merged commit 24e9024 into grpc:master May 29, 2024
12 checks passed
bozaro added a commit to bozaro/grpc-go that referenced this pull request May 30, 2024
…call on wrapper object (grpc#7128)

(cherry picked from commit 24e9024)