v3rpc, mvcc: Guarantee order of requested progress notifications
Progress notifications requested using ProgressRequest were sent
directly using the ctrlStream, which means that they could race
against watch responses in the watchStream.

This would especially happen when the stream was not synced - e.g. if
you requested a progress notification on a freshly created unsynced
watcher, the notification would typically arrive indicating a revision
for which not all watch responses had been sent.

This changes the behaviour so that v3rpc always goes through the watch
stream, using a new RequestProgressAll function that closely matches
the behaviour of the v3rpc code - i.e.

1. Generate a message with WatchId -1, indicating the revision for
   *all* watchers in the stream

2. Guarantee that a response is (eventually) sent

The latter might require us to defer the response until all watchers
are synced, which is likely as it should be. Note that we do *not*
guarantee that the number of progress notifications matches the number
of requests, only that eventually at least one gets sent.
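The two guarantees above can be sketched with a small single-threaded model. This is an illustration only, not the etcd implementation: `modelStore`, `requestProgressAll` and `syncOne` are hypothetical names, locking is omitted, and the model assumes at least one watcher exists to carry the response.

```go
package main

import "fmt"

// response is a stand-in for the watch response; only the fields used here.
type response struct {
	WatchID  int64
	Revision int64
}

// modelStore is a hypothetical model of the store's progress bookkeeping.
type modelStore struct {
	rev            int64
	unsynced       int  // watchers that are still catching up
	progressOnSync bool // a forced progress request is still owed
	sent           []response
}

// requestProgressAll answers at once if every watcher is synced. Otherwise it
// records, at most once, that a notification is owed when force is set.
func (s *modelStore) requestProgressAll(force bool) {
	if s.unsynced > 0 {
		if force {
			s.progressOnSync = true
		}
		return
	}
	s.sent = append(s.sent, response{WatchID: -1, Revision: s.rev})
}

// syncOne marks one watcher as caught up and sends the deferred notification
// once no unsynced watchers remain.
func (s *modelStore) syncOne() {
	if s.unsynced > 0 {
		s.unsynced--
	}
	if s.progressOnSync && s.unsynced == 0 {
		s.sent = append(s.sent, response{WatchID: -1, Revision: s.rev})
		s.progressOnSync = false
	}
}

func main() {
	s := &modelStore{rev: 42, unsynced: 2}
	for i := 0; i < 3; i++ {
		s.requestProgressAll(true) // three requests while unsynced
	}
	fmt.Println(len(s.sent)) // 0: nothing is sent before the watchers are synced
	s.syncOne()
	s.syncOne()
	fmt.Println(s.sent) // [{-1 42}]: one notification answers all three requests
}
```

In this model three requests collapse into a single notification, which matches the note that the number of notifications need not equal the number of requests.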
scpmw committed Jan 31, 2023
1 parent 772dfbf commit 8de2177
Showing 3 changed files with 64 additions and 14 deletions.
22 changes: 13 additions & 9 deletions etcdserver/api/v3rpc/watch.go
@@ -316,10 +316,9 @@ func (sws *serverWatchStream) recvLoop() error {
}

serathius Feb 2, 2023

This commit branches from some commit around the v3.4.10 release. Any fix should be applied to the latest affected etcd version first, and we can backport it later. Please cherry-pick it to the main branch.

 case *pb.WatchRequest_ProgressRequest:
 	if uv.ProgressRequest != nil {
-		sws.ctrlStream <- &pb.WatchResponse{
-			Header:  sws.newResponseHeader(sws.watchStream.Rev()),
-			WatchId: -1, // response is not associated with any WatchId and will be broadcast to all watch channels
-		}
+		// Request progress for all watchers,
+		// force generation of a response
+		sws.watchStream.RequestProgressAll(true)
 	}
default:
// we probably should not shutdown the entire stream when
@@ -363,6 +362,7 @@ func (sws *serverWatchStream) sendLoop() {
// either return []*mvccpb.Event from the mvcc package
// or define protocol buffer with []mvccpb.Event.
evs := wresp.Events
progressNotify := len(evs) == 0

ahrtr Feb 1, 2023

This isn't needed, because wresp.WatchID is always equal to -1 for progress notifications.

scpmw Feb 2, 2023

Author Owner

I don't think that's true - the "normal" timed progress notifications from progress are tied to a particular watch:

w.send(WatchResponse{WatchID: w.id, Revision: s.rev()})

events := make([]*mvccpb.Event, len(evs))
sws.mu.RLock()
needPrevKV := sws.prevKV[wresp.WatchID]
@@ -387,11 +387,15 @@ func (sws *serverWatchStream) sendLoop() {
Canceled: canceled,
}

-if _, okID := ids[wresp.WatchID]; !okID {
-	// buffer if id not yet announced
-	wrs := append(pending[wresp.WatchID], wr)
-	pending[wresp.WatchID] = wrs
-	continue
+// Progress notifications can have WatchID -1
+// if they announce on behalf of multiple watchers
+if !progressNotify || wresp.WatchID != -1 {

ahrtr Feb 1, 2023

Change this to

if wresp.WatchID != clientv3.InvalidWatchID {
if _, okID := ids[wresp.WatchID]; !okID {
// buffer if id not yet announced
wrs := append(pending[wresp.WatchID], wr)
pending[wresp.WatchID] = wrs
continue
}
}

mvcc.ReportEventReceived(len(evs))
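
The buffering rule in this hunk can be illustrated in isolation. The sketch below is not the v3rpc code: `router`, `handle` and `announce` are hypothetical names, and `broadcastID` is assumed to be the same -1 used in the diff. Responses for a watch ID the client has not yet been told about are held back, while an empty response with ID -1 skips the buffer.

```go
package main

import "fmt"

// broadcastID is an assumption mirroring the -1 used in the diff for
// notifications that speak for every watcher on the stream.
const broadcastID int64 = -1

// resp is a stand-in for a watch response; Events is just an event count.
type resp struct {
	WatchID int64
	Events  int
}

// router is a hypothetical model of the buffering part of sendLoop.
type router struct {
	announced map[int64]bool   // watch IDs already announced to the client
	pending   map[int64][]resp // responses held until their ID is announced
	sent      []resp
}

func newRouter() *router {
	return &router{announced: map[int64]bool{}, pending: map[int64][]resp{}}
}

// handle buffers a response whose watch ID is not yet announced, unless it is
// an empty (progress) response carrying the broadcast ID.
func (r *router) handle(wr resp) {
	progressNotify := wr.Events == 0
	if !progressNotify || wr.WatchID != broadcastID {
		if !r.announced[wr.WatchID] {
			r.pending[wr.WatchID] = append(r.pending[wr.WatchID], wr)
			return
		}
	}
	r.sent = append(r.sent, wr)
}

// announce marks an ID as known to the client and flushes its buffered responses.
func (r *router) announce(id int64) {
	r.announced[id] = true
	r.sent = append(r.sent, r.pending[id]...)
	delete(r.pending, id)
}

func main() {
	r := newRouter()
	r.handle(resp{WatchID: 1, Events: 2})           // buffered: ID 1 not announced yet
	r.handle(resp{WatchID: 2, Events: 0})           // buffered: per-watcher progress, ID 2 not announced
	r.handle(resp{WatchID: broadcastID, Events: 0}) // sent directly
	fmt.Println(len(r.sent), len(r.pending))        // 1 2
	r.announce(1)
	fmt.Println(len(r.sent), len(r.pending)) // 2 1
}
```

The second call shows the case raised in the review thread: a timed progress notification is tied to one watcher, so it is still subject to buffering.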
45 changes: 40 additions & 5 deletions mvcc/watchable_store.go
@@ -41,6 +41,7 @@ var (
type watchable interface {
watch(key, end []byte, startRev int64, id WatchID, ch chan<- WatchResponse, fcs ...FilterFunc) (*watcher, cancelFunc)
progress(w *watcher)
progress_all(force bool)
rev() int64
}

@@ -62,6 +63,9 @@ type watchableStore struct {
// The key of the map is the key that the watcher watches on.
synced watcherGroup

// Whether to generate a progress notification once all watchers are synchronised
progressOnSync bool

ahrtr Feb 1, 2023

I don't think we need this. If the watchStream isn't synced, we can just skip the progress notification.

scpmw Feb 2, 2023

Author Owner

Yeah, I expected this to be somewhat contentious - this is trying to keep the old behaviour where every request guarantees at least one answer. Without this, we would have to change the documented semantics of WatchProgressRequest from

Requests the [sic] a watch stream progress status be sent in the watch response stream as soon as possible.

to

Requests that a (possibly empty) watch response be sent in the watch response stream as soon as possible.

And I can see the point - though personally I would prefer to keep the old semantics. The reason we stumbled across this in the first place is in attempting to accurately reconstruct the database state from the watch stream. The issue is that when there are multiple watchers, you have no way of telling when all watch responses for a particular revision have been delivered. From that perspective, requesting a progress notification would be a (slightly underhanded?) way of forcing a synchronisation point in the stream.

Anyway, this could still be made to work without progressOnSync - when in doubt we would just need to keep re-requesting until a progress notification is received. Or we could try a PR adding an "all watchers synced" flag to WatchResponse, which I suppose would be the direct way to support the use case.
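
To make the use case concrete, here is a minimal sketch of a consumer that treats a progress notification as a synchronisation point. It assumes the ordering guarantee this commit aims for (every response up to the notified revision has already been delivered). The `event`, `message` and `replica` types are hypothetical and are not the clientv3 API.

```go
package main

import (
	"fmt"
	"sort"
)

// event and message are hypothetical stand-ins for watch events and responses.
type event struct {
	Key, Value string
	Rev        int64
}

type message struct {
	WatchID  int64
	Revision int64 // revision reported by the response header
	Events   []event
}

// replica rebuilds key-value state from several watchers on one stream.
type replica struct {
	state      map[string]string
	buffered   []event
	appliedRev int64
}

// handle buffers events and applies them only when a stream-wide progress
// notification (WatchID -1, no events) confirms that the revision is complete.
func (r *replica) handle(m message) {
	if m.WatchID != -1 || len(m.Events) > 0 {
		r.buffered = append(r.buffered, m.Events...)
		return
	}
	// Apply in revision order, since watchers may interleave arbitrarily.
	sort.SliceStable(r.buffered, func(i, j int) bool {
		return r.buffered[i].Rev < r.buffered[j].Rev
	})
	rest := r.buffered[:0]
	for _, ev := range r.buffered {
		if ev.Rev <= m.Revision {
			r.state[ev.Key] = ev.Value
		} else {
			rest = append(rest, ev) // newer than the sync point: keep for later
		}
	}
	r.buffered = rest
	r.appliedRev = m.Revision
}

func main() {
	r := &replica{state: map[string]string{}}
	r.handle(message{WatchID: 1, Revision: 5, Events: []event{{"a", "1", 5}}})
	r.handle(message{WatchID: 2, Revision: 7, Events: []event{{"b", "2", 4}, {"a", "3", 7}}})
	r.handle(message{WatchID: -1, Revision: 6}) // sync point at revision 6
	fmt.Println(r.state["a"], r.state["b"], len(r.buffered), r.appliedRev) // 1 2 1 6
}
```

Without the ordering guarantee, the notification at revision 6 could arrive before the event at revision 5, and the replica would wrongly consider revision 6 complete.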


stopc chan struct{}
wg sync.WaitGroup
}
@@ -79,11 +83,12 @@ func newWatchableStore(lg *zap.Logger, b backend.Backend, le lease.Lessor, ci ci
lg = zap.NewNop()
}
 s := &watchableStore{
-	store:    NewStore(lg, b, le, ci, cfg),
-	victimc:  make(chan struct{}, 1),
-	unsynced: newWatcherGroup(),
-	synced:   newWatcherGroup(),
-	stopc:    make(chan struct{}),
+	store:          NewStore(lg, b, le, ci, cfg),
+	victimc:        make(chan struct{}, 1),
+	unsynced:       newWatcherGroup(),
+	synced:         newWatcherGroup(),
+	stopc:          make(chan struct{}),
+	progressOnSync: false,
 }
s.store.ReadView = &readView{s}
s.store.WriteView = &writeView{s}
@@ -406,6 +411,15 @@ func (s *watchableStore) syncWatchers() int {
}
slowWatcherGauge.Set(float64(s.unsynced.size() + vsz))

// Deferred progress notification left to send when synced?
if s.progressOnSync && s.unsynced.size() == 0 {
for w := range s.synced.watchers {
w.send(WatchResponse{WatchID: -1, Revision: s.rev()})
break
}
s.progressOnSync = false
}

return s.unsynced.size()
}

@@ -484,6 +498,27 @@ func (s *watchableStore) progress(w *watcher) {
}
}

func (s *watchableStore) progress_all(force bool) {
	// Take the write lock rather than the read lock: this method may
	// modify progressOnSync below.
	s.mu.Lock()
	defer s.mu.Unlock()

	// Any watcher unsynced?
	if s.unsynced.size() > 0 {
		// If forced: Defer progress until successfully synced
		if force {
			s.progressOnSync = true
		}

	} else {
		// If all watchers are synchronised, send out progress
		// watch response on first watcher (if any)
		for w := range s.synced.watchers {
			w.send(WatchResponse{WatchID: -1, Revision: s.rev()})
			break
		}
	}

ahrtr Feb 1, 2023

This is a little overkill to me; we should only check whether a specific watchID is synced or not.

}

type watcher struct {
// the watcher key
key []byte
11 changes: 11 additions & 0 deletions mvcc/watcher.go
@@ -61,6 +61,13 @@ type WatchStream interface {
// of the watchers since the watcher is currently synced.
RequestProgress(id WatchID)

// Requests a progress notification for the entire watcher
// group. The response will only be sent if all watchers are
// synced - or once they become synced, if forced. The
// responses will be sent through the WatchResponse Chan of the
// first watcher of this stream, if any.
RequestProgressAll(force bool)

// Cancel cancels a watcher by giving its ID. If watcher does not exist, an error will be
// returned.
Cancel(id WatchID) error
@@ -191,3 +198,7 @@ func (ws *watchStream) RequestProgress(id WatchID) {
}
ws.watchable.progress(w)
}

func (ws *watchStream) RequestProgressAll(force bool) {
ws.watchable.progress_all(force)
}
