
feat: rpc reconnection on failure #149

Merged: iand merged 3 commits into master from iand/api-reconnect on Oct 29, 2020
Conversation

iand (Contributor) commented Oct 27, 2020

Tasks are now passed an APIOpener which they use to acquire a lens. On
encountering a fatal error the task exits and closes the lens. The scheduler
restarts the task after a delay allowing the task to attempt a new connection
to the lens.

The lotus lens opener returns a new rpc connection. The lotusrepo and carrepo
openers return a shared instance of their repo since they hold exclusive locks
over the store.

Fixes #98
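
For reference, the opener contract this description introduces looks roughly like the following — a minimal sketch assembled from the diff hunks quoted later in the thread, with API's method set elided:

package lens

import "context"

// API is the node interface tasks work against (method set elided here).
type API interface{}

// APICloser releases whatever resources Open acquired.
type APICloser func()

// APIOpener is handed to each task. The task calls Open when it starts and
// invokes the returned APICloser when it exits, so a restarted task comes
// back with a fresh connection.
type APIOpener interface {
	Open(context.Context) (API, APICloser, error)
}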

iand force-pushed the iand/api-reconnect branch from bc361f8 to c9800a8 on October 27, 2020 11:25
iand self-assigned this Oct 27, 2020
-func GetFullNodeAPI(cctx *cli.Context) (context.Context, lens.API, lens.APICloser, error) {
+type APIOpener struct {
+	// TODO: replace dependency on cli.Context with tokenMaddr and repo path
+	cctx *cli.Context
willscott (Contributor):

if open takes a context, why do we need a context on NewAPIOpener? we can take it for the length of opening, but i'm not seeing what the value of storing it on the opener struct is

iand (Contributor, Author):

Are you referring to the cli Context? It's stored because we pass it to visor which adds a tracing key and (now I look closely) sets up a signal handler. I don't know why we're doing this. I'd prefer not to return the context at all from NewAPIOpener but I was trying to limit the scope of this change to connection handling.

willscott (Contributor):

more that you also pass a ctx to the Open function, and that's the one that should probably get tagged / live for the length of the open connection rather than the one passed in constructing the opener object itself.

iand (Contributor, Author):

I'm not sure I follow.

NewAPIOpener accepts a cli.Context (and uses it for reading flags/settings), Open accepts a context.Context (and uses it for calls to the lotus API)

willscott (Contributor):

i think it's just the disjoint between the lifetime of the singleton APIOpener structs and the duration of connections from the Open call; the semantics of how the cctx on the opener object is used are non-obvious.

iand (Contributor, Author):

cctx is required (unfortunately) because we have a hard dependency on passing a *cli.Context to github.com/filecoin-project/lotus/cli.GetFullNodeAPI where it expects to be able to read flags and environment to locate the repo on disk. We could unpick that dependency.

iand (Contributor, Author):

Updated so we don't return the context from the creation of api openers. Does this address your concerns @willscott?

Comment on lines +57 to +61
node, closer, err := p.opener.Open(ctx)
if err != nil {
return xerrors.Errorf("open lens: %w", err)
}
defer closer()
willscott (Contributor):

how does this relate to the lifetime of the API?
if the api fails, this whole task will need to restart, right?

iand (Contributor, Author):

It does. But tasks are designed to be restarted by the scheduler. When processBatch returns an error then wait.RepeatUntil will return and closer will be called. For the lotus api that closes the rpc client so next time the task starts (after a 1 minute delay by default) it will acquire a new connection to the lotus api.

The repo and sql lenses manage a single reference to the repo and calling closer here is a noop.
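
A hedged sketch of that task lifecycle (Processor and processBatch are illustrative stand-ins, and the plain loop stands in for wait.RepeatUntil):

package chain

import (
	"context"

	"golang.org/x/xerrors"

	"github.com/filecoin-project/sentinel-visor/lens"
)

// Processor is an illustrative task; opener is the lens.APIOpener that the
// scheduler hands to it.
type Processor struct {
	opener lens.APIOpener
}

func (p *Processor) Run(ctx context.Context) error {
	node, closer, err := p.opener.Open(ctx)
	if err != nil {
		return xerrors.Errorf("open lens: %w", err)
	}
	// For the lotus lens this tears down the rpc client; for the repo and
	// sql lenses it is a no-op on a shared instance.
	defer closer()

	// The first error ends this run; closer fires, and the scheduler
	// restarts the task (after a 1 minute delay by default), so the next
	// run opens a fresh connection.
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
		}
		if err := p.processBatch(ctx, node); err != nil {
			return err
		}
	}
}

func (p *Processor) processBatch(ctx context.Context, node lens.API) error {
	// batch-processing work elided
	return nil
}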

@@ -14,3 +16,7 @@ type API interface {
 }

 type APICloser func()
+
+type APIOpener interface {
willscott (Contributor):

would a pattern where there's a wrapped 'reconnecting API' that presents the lens interface, but which internally will re-open the underlying lens after a delay upon failures be an easier abstraction than this one?

having each task have to worry about the lifetime / reconnecting to the lens on failures seems like it may result in duplicate reconnecting code in more places
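
Roughly, the proposed wrapper might look like this (hypothetical sketch of the alternative design, not what the PR implements; the per-method error handling that would call reset is elided):

package chain

import (
	"context"
	"sync"
	"time"

	"github.com/filecoin-project/sentinel-visor/lens"
)

// reconnectingAPI is a hypothetical lens wrapper: callers hold one value for
// the process lifetime, and it re-opens the underlying lens after a failure
// instead of each task managing reconnection itself.
type reconnectingAPI struct {
	opener lens.APIOpener
	delay  time.Duration

	mu     sync.Mutex
	api    lens.API
	closer lens.APICloser
}

// current returns the live connection, opening one if needed.
func (r *reconnectingAPI) current(ctx context.Context) (lens.API, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.api == nil {
		api, closer, err := r.opener.Open(ctx)
		if err != nil {
			return nil, err
		}
		r.api, r.closer = api, closer
	}
	return r.api, nil
}

// reset drops a failed connection so the next call re-opens it; every method
// of the lens interface would call reset when it sees a fatal rpc error.
func (r *reconnectingAPI) reset() {
	r.mu.Lock()
	if r.closer != nil {
		r.closer()
	}
	r.api, r.closer = nil, nil
	r.mu.Unlock()
	time.Sleep(r.delay) // crude backoff before the next Open
}

The reply below explains why the PR keeps restart logic in the scheduler instead.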

iand (Contributor, Author) commented Oct 27, 2020:

Each task needs to be able to close and reopen its connection independently. Otherwise we have to manage concurrent access to the lens state. There's not really any specific reconnecting code in the tasks: they open a connection and close it when they exit. The scheduler ensures that tasks are restarted after they fail. The connection logic itself is encapsulated in the opener that is passed to the task.
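
The scheduler side of that contract, as a sketch (restartOnFailure and its signature are illustrative, not the actual scheduler code):

package schedule

import (
	"context"
	"time"
)

// restartOnFailure runs a task until the context is cancelled, waiting
// restartDelay between failed runs so the task's next Open can establish a
// fresh connection.
func restartOnFailure(ctx context.Context, run func(context.Context) error, restartDelay time.Duration) {
	for {
		if err := run(ctx); err == nil || ctx.Err() != nil {
			return
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(restartDelay):
			// fall through and run the task again
		}
	}
}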

iand force-pushed the iand/api-reconnect branch from c9800a8 to d0a1263 on October 28, 2020 13:52
Comment on lines 56 to 61

 api, closer, err = getFullNodeAPIUsingCredentials(ctx, toks[1], toks[0])
 if err != nil {
-	return nil, nil, nil, xerrors.Errorf("get full node api with credentials: %w", err)
+	return nil, nil, xerrors.Errorf("get full node api with credentials: %w", err)
 }
 } else {
-	api, closer, err = lcli.GetFullNodeAPI(cctx)
+	api, closer, err = lcli.GetFullNodeAPI(o.cctx)
willscott (Contributor):

why when we have token credentials do we use the ctx passed in to Open as the lifetime of the API, versus when we don't (and look on disk for the api endpoint) we use the context passed to create the APIOpener instead?

iand (Contributor, Author):

As mentioned above they are different types. As far as I can tell GetFullNodeAPI expects the cli context so it can read the flags and environment.

iand (Contributor, Author):

Perhaps @frrist or @mg could explain further since this code comes from the original spike

willscott (Contributor):

but in the first case you could provide o.cctx.Context for consistency though

iand (Contributor, Author):

We should always prefer the context.Context passed to a function because it may have tracing or metrics keys in it. o.cctx.Context would be the original context derived from context.Background by the CLI package.

willscott (Contributor):

ah. i see the issue now 😢

we could do something like:

o.cctx.Context = ctx
api, closer, err = lcli.GetFullNodeAPI(o.cctx)

maybe would want to save/restore the full context to make sure it doesn't get lost.
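
Spelled out, the save/restore variant of that suggestion might be (hypothetical snippet inside Open; note it mutates shared opener state, so it would not be safe for concurrent Open calls):

// Point the cli context at the per-call context so the connection picks up
// its tags/cancellation, then restore the original when Open returns.
saved := o.cctx.Context
o.cctx.Context = ctx
defer func() { o.cctx.Context = saved }()
api, closer, err = lcli.GetFullNodeAPI(o.cctx)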

iand (Contributor, Author):

Can you expand on "the decision to pass in credentials as a flag shouldn't change the lifetime of the API" - I don't think I understand why it would change it.

iand (Contributor, Author):

(since the lifetime of the API is always the duration of a single run of the task that is using it)

willscott (Contributor):

i guess i meant more "which context is used" - if an extra tag (or a cancel context) is added to the context passed to the Open method, it will be applied when there is a --api flag and the first case of instantiation is used, but not when there isn't and only the original app context goes to the API.

Having consistency of the context used for the API node / json RPC seems desirable

Member:

> Perhaps frrist or mg could explain further since this code comes from the original spike

Not 100% sure on the question, but it looks like the methods called internally by lcli.GetFullNodeAPI are all exported, meaning lcli.GetFullNodeAPI could be re-implemented here for simpler context handling.

iand force-pushed the iand/api-reconnect branch from d0a1263 to eaef48d on October 28, 2020 16:40
iand added 2 commits October 29, 2020 10:20
iand force-pushed the iand/api-reconnect branch from eaef48d to 2c6082e on October 29, 2020 10:26
iand (Contributor, Author) commented Oct 29, 2020:

@willscott I refactored the lotus lens to remove the dependency on lotus lcli and the long-term reference to cli.Context. The opener now holds only the address and token, which it uses to reconnect to the api when Open is called.
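
The refactored opener's shape is then roughly the following (a sketch: the field names are assumptions, and dialFullNode is a placeholder for the lotus JSON-RPC client constructor):

package lotus

import (
	"context"
	"net/http"

	"golang.org/x/xerrors"

	"github.com/filecoin-project/sentinel-visor/lens"
)

// dialFullNode is a placeholder for the lotus JSON-RPC client constructor,
// which returns the node api and a closer for the rpc connection.
var dialFullNode func(ctx context.Context, addr string, headers http.Header) (lens.API, func(), error)

// APIOpener reconnects on every Open call using only the stored address and
// token, with no long-lived reference to cli.Context.
type APIOpener struct {
	addr  string // api endpoint of the lotus node
	token string // jwt presented in the Authorization header
}

func (o *APIOpener) Open(ctx context.Context) (lens.API, lens.APICloser, error) {
	headers := http.Header{"Authorization": []string{"Bearer " + o.token}}
	node, closer, err := dialFullNode(ctx, o.addr, headers)
	if err != nil {
		return nil, nil, xerrors.Errorf("connect to lotus api: %w", err)
	}
	return node, lens.APICloser(closer), nil
}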

iand merged commit d2d803a into master Oct 29, 2020
iand deleted the iand/api-reconnect branch October 29, 2020 17:17
placer14 added a commit that referenced this pull request Nov 4, 2020
Successfully merging this pull request may close these issues:

Visor failed to handle daemon restarts