
feat: rpc reconnection on failure #149

Merged: iand merged 3 commits into master from iand/api-reconnect on Oct 29, 2020
Conversation

iand (Contributor) commented Oct 27, 2020

Tasks are now passed an APIOpener which they use to acquire a lens. On
encountering a fatal error the task exits and closes the lens. The scheduler
restarts the task after a delay allowing the task to attempt a new connection
to the lens.

The lotus lens opener returns a new rpc connection. The lotusrepo and carrepo
openers return a shared instance of their repo since they hold exclusive locks
over the store.

Fixes #98
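
For reference, the opener contract this description introduces looks roughly like the following — a minimal sketch assembled from the diff hunks quoted later in the thread, with API's method set elided:

package lens

import "context"

// API is the node interface tasks work against (method set elided here).
type API interface{}

// APICloser releases whatever resources Open acquired.
type APICloser func()

// APIOpener is handed to each task. The task calls Open when it starts and
// invokes the returned APICloser when it exits, so a restarted task comes
// back with a fresh connection.
type APIOpener interface {
	Open(context.Context) (API, APICloser, error)
}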

iand force-pushed the iand/api-reconnect branch from bc361f8 to c9800a8 on October 27, 2020 11:25
iand self-assigned this Oct 27, 2020
-func GetFullNodeAPI(cctx *cli.Context) (context.Context, lens.API, lens.APICloser, error) {
+type APIOpener struct {
+	// TODO: replace dependency on cli.Context with tokenMaddr and repo path
+	cctx *cli.Context
willscott (Contributor):

if open takes a context, why do we need a context on NewAPIOpener? we can take it for the length of opening, but i'm not seeing what the value of storing it on the opener struct is

iand (Contributor, Author):

Are you referring to the cli Context? It's stored because we pass it to visor which adds a tracing key and (now I look closely) sets up a signal handler. I don't know why we're doing this. I'd prefer not to return the context at all from NewAPIOpener but I was trying to limit the scope of this change to connection handling.

willscott (Contributor):

more that you also pass a ctx to the Open function, and that's the one that should probably get tagged / live for the length of the open connection rather than the one passed in constructing the opener object itself.

iand (Contributor, Author):

I'm not sure I follow.

NewAPIOpener accepts a cli.Context (and uses it for reading flags/settings), Open accepts a context.Context (and uses it for calls to the lotus API)

willscott (Contributor):

i think it's just the disjoint between the lifetime of the singleton APIOpener structs and the duration of connections from the Open call; the semantics of how the cctx on the opener object is used are non-obvious.

iand (Contributor, Author):

cctx is required (unfortunately) because we have a hard dependency on passing a *cli.Context to github.com/filecoin-project/lotus/cli.GetFullNodeAPI where it expects to be able to read flags and environment to locate the repo on disk. We could unpick that dependency.

iand (Contributor, Author):

Updated so we don't return the context from the creation of api openers. Does this address your concerns @willscott?

Comment on lines +57 to +61
node, closer, err := p.opener.Open(ctx)
if err != nil {
return xerrors.Errorf("open lens: %w", err)
}
defer closer()
willscott (Contributor):

how does this relate to the lifetime of the API?
if the api fails, this whole task will need to restart, right?

iand (Contributor, Author):

It does. But tasks are designed to be restarted by the scheduler. When processBatch returns an error then wait.RepeatUntil will return and closer will be called. For the lotus api that closes the rpc client so next time the task starts (after a 1 minute delay by default) it will acquire a new connection to the lotus api.

The repo and sql lenses manage a single reference to the repo and calling closer here is a noop.
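
A hedged sketch of that task lifecycle (Processor and processBatch are illustrative stand-ins, and the plain loop stands in for wait.RepeatUntil):

package chain

import (
	"context"

	"golang.org/x/xerrors"

	"github.com/filecoin-project/sentinel-visor/lens"
)

// Processor is an illustrative task; opener is the lens.APIOpener that the
// scheduler hands to it.
type Processor struct {
	opener lens.APIOpener
}

func (p *Processor) Run(ctx context.Context) error {
	node, closer, err := p.opener.Open(ctx)
	if err != nil {
		return xerrors.Errorf("open lens: %w", err)
	}
	// For the lotus lens this tears down the rpc client; for the repo and
	// sql lenses it is a no-op on a shared instance.
	defer closer()

	// The first error ends this run; closer fires, and the scheduler
	// restarts the task (after a 1 minute delay by default), so the next
	// run opens a fresh connection.
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
		}
		if err := p.processBatch(ctx, node); err != nil {
			return err
		}
	}
}

func (p *Processor) processBatch(ctx context.Context, node lens.API) error {
	// batch-processing work elided
	return nil
}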

@@ -14,3 +16,7 @@ type API interface {
 }

 type APICloser func()
+
+type APIOpener interface {
willscott (Contributor):

would a pattern where there's a wrapped 'reconnecting API' that presents the lens interface, but which internally will re-open the underlying lens after a delay upon failures be an easier abstraction than this one?

having each task have to worry about the lifetime / reconnecting to the lens on failures seems like it may result in duplicate reconnecting code in more places
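
Roughly, the proposed wrapper might look like this (hypothetical sketch of the alternative design, not what the PR implements; the per-method error handling that would call reset is elided):

package chain

import (
	"context"
	"sync"
	"time"

	"github.com/filecoin-project/sentinel-visor/lens"
)

// reconnectingAPI is a hypothetical lens wrapper: callers hold one value for
// the process lifetime, and it re-opens the underlying lens after a failure
// instead of each task managing reconnection itself.
type reconnectingAPI struct {
	opener lens.APIOpener
	delay  time.Duration

	mu     sync.Mutex
	api    lens.API
	closer lens.APICloser
}

// current returns the live connection, opening one if needed.
func (r *reconnectingAPI) current(ctx context.Context) (lens.API, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.api == nil {
		api, closer, err := r.opener.Open(ctx)
		if err != nil {
			return nil, err
		}
		r.api, r.closer = api, closer
	}
	return r.api, nil
}

// reset drops a failed connection so the next call re-opens it; every method
// of the lens interface would call reset when it sees a fatal rpc error.
func (r *reconnectingAPI) reset() {
	r.mu.Lock()
	if r.closer != nil {
		r.closer()
	}
	r.api, r.closer = nil, nil
	r.mu.Unlock()
	time.Sleep(r.delay) // crude backoff before the next Open
}

The reply below explains why the PR keeps restart logic in the scheduler instead.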

iand (Contributor, Author) commented Oct 27, 2020:

Each task needs to be able to close and reopen its connection independently. Otherwise we have to manage concurrent access to the lens state. There's not really any specific reconnecting code in the tasks: they open a connection and close it when they exit. The scheduler ensures that tasks are restarted after they fail. The connection logic itself is encapsulated in the opener that is passed to the task.
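
The scheduler side of that contract, as a sketch (restartOnFailure and its signature are illustrative, not the actual scheduler code):

package schedule

import (
	"context"
	"time"
)

// restartOnFailure runs a task until the context is cancelled, waiting
// restartDelay between failed runs so the task's next Open can establish a
// fresh connection.
func restartOnFailure(ctx context.Context, run func(context.Context) error, restartDelay time.Duration) {
	for {
		if err := run(ctx); err == nil || ctx.Err() != nil {
			return
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(restartDelay):
			// fall through and run the task again
		}
	}
}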

iand force-pushed the iand/api-reconnect branch from c9800a8 to d0a1263 on October 28, 2020 13:52
Comment on lines 56 to 61

 api, closer, err = getFullNodeAPIUsingCredentials(ctx, toks[1], toks[0])
 if err != nil {
-	return nil, nil, nil, xerrors.Errorf("get full node api with credentials: %w", err)
+	return nil, nil, xerrors.Errorf("get full node api with credentials: %w", err)
 }
 } else {
-	api, closer, err = lcli.GetFullNodeAPI(cctx)
+	api, closer, err = lcli.GetFullNodeAPI(o.cctx)
willscott (Contributor):

why when we have token credentials do we use the ctx passed in to Open as the lifetime of the API, versus when we don't (and look on disk for the api endpoint) we use the context passed to create the APIOpener instead?

iand (Contributor, Author):

As mentioned above they are different types. As far as I can tell GetFullNodeAPI expects the cli context so it can read the flags and environment.

iand (Contributor, Author):

Perhaps @frrist or @mg could explain further since this code comes from the original spike

willscott (Contributor):

but in the first case you could provide o.cctx.Context for consistency though

iand (Contributor, Author):

We should always prefer the context.Context passed to a function because it may have tracing or metrics keys in it. o.cctx.Context would be the original context derived from context.Background by the CLI package.

willscott (Contributor):

ah. i see the issue now 😢

we could do something like:

o.cctx.Context = ctx
api, closer, err = lcli.GetFullNodeAPI(o.cctx)

maybe would want to save/restore the full context to make sure it doesn't get lost.
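
Spelled out, the save/restore variant of that suggestion might be (hypothetical snippet inside Open; note it mutates shared opener state, so it would not be safe for concurrent Open calls):

// Point the cli context at the per-call context so the connection picks up
// its tags/cancellation, then restore the original when Open returns.
saved := o.cctx.Context
o.cctx.Context = ctx
defer func() { o.cctx.Context = saved }()
api, closer, err = lcli.GetFullNodeAPI(o.cctx)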

iand (Contributor, Author):

Can you expand on "the decision to pass in credentials as a flag shouldn't change the lifetime of the API" - I don't think I understand why it would change it.

iand (Contributor, Author):

(since the lifetime of the API is always the duration of a single run of the task that is using it)

willscott (Contributor):

i guess i meant more "which context is used" - if an extra tag (or a cancel context) is added to the context passed to the Open method, it will be applied when there is a --api flag and the first case of instantiation is used, but not when there isn't and only the original app context goes to the API.

Having consistency of the context used for the API node / json RPC seems desirable

Member:

> Perhaps frrist or mg could explain further since this code comes from the original spike

Not 100% sure on the question, but it looks like the methods called internally by lcli.GetFullNodeAPI are all exported, meaning lcli.GetFullNodeAPI could be re-implemented here for simpler context handling.

iand force-pushed the iand/api-reconnect branch from d0a1263 to eaef48d on October 28, 2020 16:40
iand added 2 commits October 29, 2020 10:20
iand force-pushed the iand/api-reconnect branch from eaef48d to 2c6082e on October 29, 2020 10:26
iand (Contributor, Author) commented Oct 29, 2020:

@willscott I refactored the lotus lens to remove the dependency on lotus lcli and the long-term reference to cli.Context. The opener now holds only the address and token, which it uses to reconnect to the api when Open is called.
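
The refactored opener's shape is then roughly the following (a sketch: the field names are assumptions, and dialFullNode is a placeholder for the lotus JSON-RPC client constructor):

package lotus

import (
	"context"
	"net/http"

	"golang.org/x/xerrors"

	"github.com/filecoin-project/sentinel-visor/lens"
)

// dialFullNode is a placeholder for the lotus JSON-RPC client constructor,
// which returns the node api and a closer for the rpc connection.
var dialFullNode func(ctx context.Context, addr string, headers http.Header) (lens.API, func(), error)

// APIOpener reconnects on every Open call using only the stored address and
// token, with no long-lived reference to cli.Context.
type APIOpener struct {
	addr  string // api endpoint of the lotus node
	token string // jwt presented in the Authorization header
}

func (o *APIOpener) Open(ctx context.Context) (lens.API, lens.APICloser, error) {
	headers := http.Header{"Authorization": []string{"Bearer " + o.token}}
	node, closer, err := dialFullNode(ctx, o.addr, headers)
	if err != nil {
		return nil, nil, xerrors.Errorf("connect to lotus api: %w", err)
	}
	return node, lens.APICloser(closer), nil
}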

iand merged commit d2d803a into master Oct 29, 2020
iand deleted the iand/api-reconnect branch October 29, 2020 17:17
placer14 added a commit that referenced this pull request Nov 4, 2020
Successfully merging this pull request may close these issues:

Visor failed to handle daemon restarts