-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: rpc reconnection on failure #149
Conversation
bc361f8
to
c9800a8
Compare
lens/lotus/lotus.go
Outdated
func GetFullNodeAPI(cctx *cli.Context) (context.Context, lens.API, lens.APICloser, error) { | ||
type APIOpener struct { | ||
// TODO: replace dependency on cli.Context with tokenMaddr and repo path | ||
cctx *cli.Context |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if open takes a context, why do we need a context on NewAPIOpener? we can take it for the length of opening, but i'm not seeing what the value of storing it on the opener struct is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you referring to the cli Context? It's stored because we pass it to visor which adds a tracing key and (now I look closely) sets up a signal handler. I don't know why we're doing this. I'd prefer not to return the context at all from NewAPIOpener but I was trying to limit the scope of this change to connection handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more that you also pass a ctx to the Open
function, and that's the one that should probably get tagged / live for the length of the open connection rather than the one passed in constructing the opener object itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm note sure I follow
NewAPIOpener accepts a cli.Context (and uses it for reading flags/settings), Open accepts a context.Context (and uses it for calls to the lotus API)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it's just the disjoint between the singleton APIOpener lifetime of the structs vs the duration of connections from the Open
call where the semantics of how the cctx on the opener object are used is non-obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cctx is required (unfortunately) because we have a hard dependency on passing a *cli.Context
to github.com/filecoin-project/lotus/cli.GetFullNodeAPI
where it expects to be able to read flags and environment to locate the repo on disk. We could unpick that dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated so we don't return the context from the creation of api openers. Does this address your concerns @willscott
node, closer, err := p.opener.Open(ctx) | ||
if err != nil { | ||
return xerrors.Errorf("open lens: %w", err) | ||
} | ||
defer closer() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does this relate to the lifetime of the API?
if the api fails, this whole task will need to restart, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does. But tasks are designed to be restarted by the scheduler. When processBatch returns an error then wait.RepeatUntil
will return and closer will be called. For the lotus api that closes the rpc client so next time the task starts (after a 1 minute delay by default) it will acquire a new connection to the lotus api.
The repo and sql lenses manage a single reference to the repo and calling closer here is a noop.
@@ -14,3 +16,7 @@ type API interface { | |||
} | |||
|
|||
type APICloser func() | |||
|
|||
type APIOpener interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would a pattern where there's a wrapped 'reconnecting API' that presents the lens interface, but which internally will re-open the underlying lens after a delay upon failures be an easier abstraction than this one?
having each task have to worry about the lifetime / reconnecting to the lens on failures seems like it may result in duplicate reconnecting code in more places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each task needs to be able to close and reopen its connection independently. Otherwise we have to manage concurrent access to the lens state. There's not really any specific reconnecting code in the tasks: they open a connection and close it when they exit. The scheduler ensures that tasks are restarted after they fail. The connection logic itself is encapsulated in the opener that is passed to the task.
c9800a8
to
d0a1263
Compare
lens/lotus/lotus.go
Outdated
api, closer, err = getFullNodeAPIUsingCredentials(ctx, toks[1], toks[0]) | ||
if err != nil { | ||
return nil, nil, nil, xerrors.Errorf("get full node api with credentials: %w", err) | ||
return nil, nil, xerrors.Errorf("get full node api with credentials: %w", err) | ||
} | ||
} else { | ||
api, closer, err = lcli.GetFullNodeAPI(cctx) | ||
api, closer, err = lcli.GetFullNodeAPI(o.cctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why when we have token credentials do we use the ctx
passed in to Open
as the lifetime of the API, versus when we don't (and look on disk for the api endpoint) we use the context passed to create the APIOpener instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned above they are different types. As far as I can tell GetFullNodeAPI expects the cli context so it can read the flags and environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but in the first case you could provide o.cctx.Context
for consistency though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should always prefer the context.Context passed to a function because it may have tracing or metrics keys in it. o.cctx.Context would be the original context derived from context.Background by the CLI package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah. i see the issue now 😢
we could do something like:
o.cctx.Context = ctx
api, closer, err = lcli.GetFullNodeAPI(o.cctx)
maybe would want to save/restore the full context to make sure it doesn't get lost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you expand on "the decision to pass in credentials as a flag shouldn't change the lifetime of the API" - I don't think I understand why it would change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(since the lifetime of the API is always the duration of a single run of the task that is using it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess i meant more "which context is used" - if an extra tag (or a cancel context) is added to the context passed to the Open
method, it will be applied if there is a --api
flag and the upper case of instantiation is used, but not if there isn't and only the original app context goes to the API.
Having consistency of the context used for the API node / json RPC seems desirable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps frrist or mg could explain further since this code comes from the original spike
Not 100% sure on the question, but it looks like the methods called interanlly by lcli.GetFullNodeAPI
are all exported meaning lcli.GetFullNodeAPI
could be re-implemented here for simpler context handling.
d0a1263
to
eaef48d
Compare
Tasks are now passed an APIOpener which they use to acquire a lens. On encountering a fatal error the task exits and closes the lens. The scheduler restarts the task after a delay allowing the task to attempt a new connection to the lens. The lotus lens opener returns a new rpc connection. The lotusrepo and carrepo openers return a shared instance of their repo since they hold exclusive locks over the store.
eaef48d
to
2c6082e
Compare
@willscott I refactored the lotus lens to remove the dependency on lotus lcli and the long term reference to cli.Context. The opener now holds only the address and token which is uses to reconnect to the api when Open is called. |
By Ian Davis (25) and others Via GitHub origin/master: (66 commits) feat: add changelog generator (#199) feat: disable leasing when gas outputs lease time is zero (#198) feat: add power actor claim extraction chore: Include RC releases in push docker images (#195) fix(metrics): export the completion and batch selection views (#197) fix: fix actor completion query chore: add metrics to leasing and work completion queries continue message parsing task when cbor parse fails (#185) perf: ensure processing updates always include height in criteria (#192) perf: include height restrictions in update clauses of leasing queries (#189) open lotusrepo lens read only feat: allow actor state processor to run without leasing (#178) feat: rpc reconnection on failure (#149) chore: add changelog (#150) polish: update miner processing logic feat: add dynamic panel creation based on tags (#159) don't codeql on every commit to pr. only on PR merge and cron (#168) track statechange persist duration feat: add dynamic panel creation based on tags Replace filecoin-ffi with stubs to simplify build process (#166) ... Conflicts: commands/run.go It looks like you may be committing a merge. If this is not correct, please remove the file /Users/mg/work/pl/sentinel/.git/modules/sentinel-visor/MERGE_HEAD and try again.
Tasks are now passed an APIOpener which they use to acquire a lens. On
encountering a fatal error the task exits and closes the lens. The scheduler
restarts the task after a delay allowing the task to attempt a new connection
to the lens.
The lotus lens opener returns a new rpc connection. The lotusrepo and carrepo
openers return a shared instance of their repo since they hold exclusive locks
over the store.
Fixes #98