feat: add etcd store auto re-initialize #2650
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2650 +/- ##
==========================================
+ Coverage 74.06% 74.30% +0.24%
==========================================
Files 201 201
Lines 7781 7823 +42
Branches 872 872
==========================================
+ Hits 5763 5813 +50
+ Misses 1723 1707 -16
- Partials 295 303 +8
Force-pushed from 48dc988 to 10efb10

Updated: The active check uses an etcd sync request that runs every 10 seconds (it is already used to update cluster endpoints). If it does not respond within 5 seconds, I assume the connection has failed; if it fails several times in a row, I assume the watch has probably hung and died silently, at which point I re-run the initialization, i.e. list + watch.
select {
case <-time.Tick(10 * time.Second):
	sCtx, sCancel := context.WithTimeout(context.TODO(), 5*time.Second)
	err := etcdClient.Sync(sCtx)
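For context, here is a minimal sketch of the health-check loop described above; the failure threshold, the reInit callback, and the import path are assumptions for illustration, not the exact code in this PR.

package store

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3" // the import path differs on older etcd client versions
)

// healthCheck probes etcd every 10 seconds via Sync (which also refreshes the
// cluster endpoint list). After several consecutive failures it assumes the
// watch has silently died and asks the caller to redo list + watch.
func healthCheck(ctx context.Context, etcdClient *clientv3.Client, reInit func() error) {
	const failureThreshold = 3 // assumed value, not taken from the PR
	failures := 0
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			sCtx, sCancel := context.WithTimeout(context.TODO(), 5*time.Second)
			err := etcdClient.Sync(sCtx)
			sCancel()
			if err == nil {
				failures = 0
				continue
			}
			failures++
			if failures >= failureThreshold {
				// The watch may have hung without exiting; redo list + watch.
				if rErr := reInit(); rErr == nil {
					failures = 0
				}
			}
		}
	}
}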
Oh, I think it is unnecessary. The client has a retry mechanism, so even if the connection is interrupted during an operation, reusing the client recovers it automatically; only the watch cannot recover automatically in some cases.
The connection can be restored automatically, but the behavior of the watch is unobservable: sometimes the watch stops working yet never actively exits. There is no API to find out which watchers currently exist or what state they are in.
This is a complete black box for its users. The etcd documentation says that connections are self-healing and that a watch will not lose any data, but that is not what happens in our environment.
I think this check is really necessary. If you don't think the sync request is appropriate, we can use a cheaper approach: the etcd client's internal connection state monitor, client.getActiveConnection(), which does not initiate a request but still reports the connection state.
IMHO, given the uncertain behavior of the etcd client (it is indeed a black box), where all the gRPC stream details are hidden and the documentation does not match the actual behavior, an active health-checking mechanism independent of the client's internals is a must.
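As a reference for the cheaper alternative mentioned above, a probe can read the client's connection state without sending any request. This is only a sketch based on the public clientv3 and gRPC APIs (ActiveConnection / GetState); method names and import paths may differ between client versions.

package store

import (
	"google.golang.org/grpc/connectivity"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// connectionHealthy reports whether the client's underlying gRPC connection is
// usable, without issuing a request to etcd.
func connectionHealthy(c *clientv3.Client) bool {
	state := c.ActiveConnection().GetState()
	return state == connectivity.Ready || state == connectivity.Idle
}

Note that even a Ready connection only tells you the transport is up; it cannot confirm that an individual watch stream is still delivering events, which is exactly the unobservable behavior described above.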
sleep 20

[ "$(grep -c "etcd connection recovered" ${LOG_FILE})" -ge '1' ]
[ "$(grep -c "etcd store reinitializing" ${LOG_FILE})" -ge '1' ]
Better to add an e2e test to ensure that data is not lost.
Data loss is not possible in theory, because each connection exception triggers a re-initialization of the entire abstract storage layer, which reloads the list in full and creates a new watcher. As soon as the full load succeeds, the existing data in the cache is overwritten by the new data, and anything missing is added.
If the load fails, an error is written to the console and the log.
And in that case, the timer will still detect the fault and run the initialization again after the connection recovers.
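A minimal sketch of the re-initialization behavior described in these comments: a full list that overwrites and extends the local cache, with the failure path left to the periodic timer to retry. All names here are hypothetical and do not come from the project's actual code.

package store

import (
	"context"
	"log"
	"sync"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// reInit reloads the full key space under prefix and merges it into the cache:
// existing entries are overwritten by the fresh data, missing ones are added.
func reInit(ctx context.Context, c *clientv3.Client, prefix string, cache *sync.Map) error {
	resp, err := c.Get(ctx, prefix, clientv3.WithPrefix())
	if err != nil {
		// On failure, only log; the health-check timer will detect the fault
		// and run reInit again once the connection recovers.
		log.Printf("etcd store reinitialize failed: %v", err)
		return err
	}
	for _, kv := range resp.Kvs {
		cache.Store(string(kv.Key), string(kv.Value))
	}
	// A new watcher would then be created from resp.Header.Revision+1 so that
	// changes made during the outage are not missed.
	return nil
}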
Co-authored-by: Peter Zhu <[email protected]>
I checked out your branch and tried to reproduce the issue, but could not.
I think we could merge it first.
Before that, it is recommended to run it with APISIX for a long time to make sure it does not cause other hidden issues. @bzp2010
@nic-chen Thanks for your support. I have added a new case to the CLI test to ensure that data changed during a connection outage is resynchronized after the store is reinitialized. We can merge and continue to watch the impact of this fix.
Co-authored-by: Peter Zhu <[email protected]>
(cherry picked from commit f64372f)
* upstream/master:
  fix: change default CSP value (apache#2601)
  fix: ant-table unable to request (apache#2641)
  fix: plugin_config missing on service exist (apache#2657)
  feat: add etcd store auto re-initialize (apache#2650)
  feat: add login filter of OpenID-Connect (apache#2608)
  feat: Configure plug-ins to support this feature (apache#2647)
  feat: Adding a Loading state to buttons (apache#2630)
  feat: dashboard support windows (apache#2619)
  Feat: add tip and preset model for plugin editor, improve e2e stability (apache#2581)
  docs: add Slack invitation link badge (apache#2617)

# Conflicts:
#	.github/workflows/backend-cli-test.yml
#	Dockerfile
#	api/test/shell/cli_test.sh
#	web/src/components/Footer/index.tsx
#	web/src/components/RightContent/index.tsx
#	web/src/pages/ServerInfo/List.tsx
Please answer these questions before submitting a pull request, or your PR will get closed.
Why submit this pull request?
What changes will this PR take into?
Resynchronize when the etcd local cache does not match the data source.
Related issues
fix #2461, #2360, and other occurrences of the same problem
Checklist: