Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add etcd store auto re-initialize #2650

Merged
merged 20 commits into from
Oct 29, 2022
Merged

Conversation

bzp2010
Copy link
Contributor

@bzp2010 bzp2010 commented Oct 24, 2022

Please answer these questions before submitting a pull request, or your PR will get closed.

Why submit this pull request?

  • Bugfix
  • New feature provided
  • Improve performance
  • Backport patches

What changes will this PR take into?

Resynchronize when the etcd local cache does not match the data source.

Related issues

fix #2461 #2360 and more same problem

Checklist:

  • Did you explain what problem does this PR solve? Or what new features have been added?
  • Have you added corresponding test cases?
  • Have you modified the corresponding document?
  • Is this PR backward compatible? If it is not backward compatible, please discuss on the mailing list first

@bzp2010 bzp2010 self-assigned this Oct 24, 2022
@codecov-commenter
Copy link

codecov-commenter commented Oct 24, 2022

Codecov Report

Merging #2650 (92d65ca) into master (831df82) will increase coverage by 0.24%.
The diff coverage is 69.38%.

@@            Coverage Diff             @@
##           master    #2650      +/-   ##
==========================================
+ Coverage   74.06%   74.30%   +0.24%     
==========================================
  Files         201      201              
  Lines        7781     7823      +42     
  Branches      872      872              
==========================================
+ Hits         5763     5813      +50     
+ Misses       1723     1707      -16     
- Partials      295      303       +8     
Flag Coverage Δ
backend-e2e-test-ginkgo 65.34% <57.14%> (+0.50%) ⬆️
backend-unit-test 50.07% <76.00%> (+0.08%) ⬆️
frontend-e2e-test 76.49% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
api/internal/core/server/store.go 25.00% <0.00%> (ø)
api/internal/core/storage/etcd.go 54.38% <44.44%> (+4.77%) ⬆️
api/internal/utils/closer.go 71.42% <50.00%> (+38.09%) ⬆️
api/cmd/root.go 69.56% <62.16%> (+7.06%) ⬆️
api/internal/core/store/store.go 89.47% <83.33%> (+0.41%) ⬆️
api/internal/core/server/server.go 62.50% <0.00%> (+8.33%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@bzp2010 bzp2010 requested review from nic-chen and starsz October 27, 2022 02:48
@bzp2010 bzp2010 marked this pull request as ready for review October 27, 2022 03:03
@bzp2010
Copy link
Contributor Author

bzp2010 commented Oct 27, 2022

Updated

  • Update etcd dependencies to v3.5.5 (old still v3.3.x)
  • Add passive live check for etcd watch (I have not observed active exits of watch due to connection exceptions in many tests and I am not sure if it works properly, so there is an additional active check mechanism)
  • Add active connection check for all etcd stream
  • Add cli test for active check
  • Improve some codes

The active check uses an etcd sync request that runs every 10 seconds (it is used to update cluster endpoints), if it does not respond within 5 seconds then I will assume it has a connection failure, if it fails multiple times then I assume that watch is likely to have hung and faked death, at which point I will re-execute the initialization, i.e. list + watch.

api/cmd/root.go Outdated Show resolved Hide resolved
select {
case <-time.Tick(10 * time.Second):
sCtx, sCancel := context.WithTimeout(context.TODO(), 5*time.Second)
err := etcdClient.Sync(sCtx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I think it is unnecessary. Because the client has a retry mechanism, even if the connection is interrupted during operation, reusing the client will automatically recover, only the watch cannot recover automatically in some cases

Copy link
Contributor Author

@bzp2010 bzp2010 Oct 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nic-chen

The connection can be automatically restored, but the behavior of watch is unobservable, and sometimes watch is incapacitated, but it still does not actively quit. There is no API to actively know which watchers are currently available and what is the state of those watchers.
This is a complete black box for its applicators, the entire documentation of etcd mentions that its connections are self-healing and watch will not lose any data, but it does happen in our environment.

I think this is really necessary, and if you don't think it's appropriate, we can use a less expensive way than sync requests, which is to implement it using etcd client's internal connection state monitor, client.getActiveConnection(), which will not initiate the request, but still get the connection state.

IMHO, for the uncertain behavior in the etcd client (it is indeed a black box), where all the grpc stream details are hidden and the documentation does not match the actual behavior, an active health checking mechanism independent of the internal mechanism is a must.

@bzp2010 bzp2010 requested a review from nic-chen October 27, 2022 03:57
@bzp2010
Copy link
Contributor Author

bzp2010 commented Oct 28, 2022

ping @nic-chen @starsz again

api/cmd/root.go Outdated Show resolved Hide resolved
api/cmd/root.go Outdated Show resolved Hide resolved
sleep 20

[ "$(grep -c "etcd connection recovered" ${LOG_FILE})" -ge '1' ]
[ "$(grep -c "etcd store reinitializing" ${LOG_FILE})" -ge '1' ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to add e2e test to ensure the data is not loss.

Copy link
Contributor Author

@bzp2010 bzp2010 Oct 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not possible in theory, because each connection exception triggers a re-initialization of the entire abstract storage layer, which involves reloading the "list" in full, and creating a new watcher. As soon as the full load succeeds, the existing data in the cache will be overwritten by the new data, and what is not there will be added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the load fails, an error will be output to the console and log.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the timer will still detect the fault at this point and do it again after recovery.

@bzp2010 bzp2010 requested a review from starsz October 28, 2022 09:11
Copy link
Member

@nic-chen nic-chen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked out your branch and tried to reproduce it, but failed.
I think we could merge it first.
Before that, it is recommended to run it with APISIX for a long time to avoid causing other hidden issues. @bzp2010

@bzp2010
Copy link
Contributor Author

bzp2010 commented Oct 29, 2022

@nic-chen Thanks to your support, I have added a new use case to the CLI test to ensure that data changes during a connection outage will be resynchronized after the data is reinitialized. We can merge and continue to see the impact of this fix.

Co-authored-by: Peter Zhu <[email protected]>
@bzp2010 bzp2010 merged commit f64372f into apache:master Oct 29, 2022
bzp2010 added a commit to bzp2010/apisix-dashboard that referenced this pull request Oct 31, 2022
hongbinhsu added a commit to fitphp/apix-dashboard that referenced this pull request Nov 7, 2022
* upstream/master:
  fix: change default CSP value (apache#2601)
  fix: ant-table unable to request (apache#2641)
  fix: plugin_config missing on service exist (apache#2657)
  feat: add etcd store auto re-initialize (apache#2650)
  feat: add login filter of OpenID-Connect (apache#2608)
  feat:Configure plug-ins to support this feature (apache#2647)
  feat: Adding a Loading state to buttons (apache#2630)
  feat: dashboard support windows (apache#2619)
  Feat: add tip and preset model for plugin editor, improve e2e stability (apache#2581)
  docs: add Slack invitation link badge (apache#2617)

# Conflicts:
#	.github/workflows/backend-cli-test.yml
#	Dockerfile
#	api/test/shell/cli_test.sh
#	web/src/components/Footer/index.tsx
#	web/src/components/RightContent/index.tsx
#	web/src/pages/ServerInfo/List.tsx
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cache failed to update when etcd restarted
4 participants