Program auto exits after a few seconds #33

Closed
Du-Li opened this issue Mar 21, 2023 · 20 comments

@Du-Li

Du-Li commented Mar 21, 2023

Describe the bug
About 5 seconds after starting the go-kafka-connect-couchbase connector, the health check failed and the connector exited on its own.

To Reproduce
Steps to reproduce the behavior:

  1. Extend the example main.go to print the received cb-dcp events
  2. Copy the example main.go and config.yml to an Ubuntu box
  3. Run the code with go run main.go
  4. See the error

{"level":"debug","time":"2023-03-21T04:39:33Z","message":"vbucket discovery opened with membership type: static"}
{"level":"info","time":"2023-03-21T04:39:33Z","message":"member: 1/1, vbucket range: 0-1023"}
{"level":"debug","time":"2023-03-21T04:39:33Z","message":"loaded checkpoint"}
{"level":"debug","time":"2023-03-21T04:39:33Z","message":"stream started"}
{"level":"debug","time":"2023-03-21T04:39:33Z","message":"started checkpoint schedule"}
{"level":"info","time":"2023-03-21T04:39:33Z","message":"dcp stream started"}
{"level":"info","time":"2023-03-21T04:39:33Z","message":"metric middleware registered on path /metrics"}
{"level":"info","time":"2023-03-21T04:39:33Z","message":"api starting on port 8080"}
{"level":"debug","time":"2023-03-21T04:39:43Z","message":"no need to save checkpoint"}
{"level":"error","error":"context deadline exceeded","time":"2023-03-21T04:39:48Z","message":"health check failed"}
{"level":"debug","time":"2023-03-21T04:39:48Z","message":"vbucket discovery closed"}
{"level":"debug","time":"2023-03-21T04:39:48Z","message":"no need to save checkpoint"}
{"level":"debug","time":"2023-03-21T04:39:48Z","message":"stopped checkpoint schedule"}
{"level":"debug","time":"2023-03-21T04:39:49Z","message":"stream stopped"}
{"level":"debug","time":"2023-03-21T04:39:49Z","message":"api stopped"}
{"level":"debug","time":"2023-03-21T04:39:49Z","message":"dcp connection closed"}
{"level":"debug","time":"2023-03-21T04:39:49Z","message":"connections closed"}
{"level":"info","time":"2023-03-21T04:39:49Z","message":"dcp stream closed"}

Expected behavior
It's expected to continuously print the cb dcp events as long as a workload is fed into cb.

Screenshots
N/A

Version (please complete the following information):

  • OS: Linux ubuntu-box 5.4.231-137.341.amzn2.x86_64
  • Golang version: go1.18.1 linux/amd64
  • Couchbase: Community Edition 6.5.0 build 4966
  • Kafka: 3.3.1

@mhmtszr
Member

mhmtszr commented Mar 21, 2023

Hello @Du-Li, it seems there was a dcp checkpoint timeout. Could you please try increasing the checkpoint timeout? Here is an example config. Also, which version of go-kafka-connect-couchbase are you using?

dcp:
  ...
checkpoint:
  timeout: 100s
  type: manual

@erayarslan
Member

Hi @Du-Li, did you see any log like connected to ..., bucket: ..., meta bucket: ...? And can you make sure your Couchbase is still up?
Thank you.

@Du-Li
Author

Du-Li commented Mar 21, 2023

@erayarslan Thanks for your reply. Yes, my CB is always up and those messages were printed correctly.

My go-kafka-connect-cb was 0.0.20 and go-dcp-client was 0.0.22. I added the suggested type: manual, but the program still exited by itself, albeit after 15 seconds.

@Du-Li
Author

Du-Li commented Mar 21, 2023

It looks like the 15 seconds is the default checkpoint interval (10s) plus the timeout (5s), so the configured checkpoint.timeout: 100s and checkpoint.type: manual didn't take effect. I need to be able to control when the program exits before seriously trying out other things.

@erayarslan
Member

@Du-Li it's not about the checkpoint. I think it's about our default healthCheck timeout, which is 5 seconds. You can check it from here.
Can you increase it and retry?
For example:

healthCheck:
  timeout: 30s

Thank you for the feedback.

@Du-Li
Author

Du-Li commented Mar 21, 2023

Thanks @erayarslan.
Yes, the program ran a bit longer with the extended healthCheck.timeout. But why do we want the health check to fail if nothing goes wrong? Is there a way I can control the exit of the program? Ideally it should just keep running even if there are no dcp events.

@erayarslan
Member

The health check periodically sends a ping request to Couchbase to ensure it is alive.
If the ping fails or times out, we shut everything down.
The logic is here.
Somehow your cluster didn't respond to the client's ping request.
Can you provide any additional configuration details about your cluster, so I can try to reproduce it in that environment?

@Du-Li
Author

Du-Li commented Mar 21, 2023

My CB cluster is a local setup from community version 6.5. It worked fine with Kafka Connect and the CB connector. I am not sure why the dcp client's ping request failed.

@Du-Li
Author

Du-Li commented Mar 21, 2023

The program exited once the healthCheck interval+timeout had elapsed, despite a workload continuously updating CB and the program itself printing the right dcp events.

@erayarslan
Member

I think I understand the issue. Your cluster is running on AWS, right? When we send a ping request, we check all kinds of Couchbase services, such as memd, mgmt, n1ql, fts, capi ... If your network policy does not allow one of these, the ping will probably fail.

@Du-Li
Author

Du-Li commented Mar 21, 2023

I see. I am running the connector and Kafka in EKS, while the CB cluster is outside EKS. What kind of network policy is needed to make the ping work?

@erayarslan
Member

I don't think we should check them all, because I think you have the right policy to listen to dcp.
We need to ignore unnecessary services for the ping. Let me change this, and after that we can retry.

@Du-Li
Author

Du-Li commented Mar 21, 2023

Yep, I can't agree more. As long as it is able to receive dcp events, it should be considered healthy. No need for comprehensive checking.

erayarslan added a commit that referenced this issue Mar 21, 2023
@erayarslan
Member

With v0.0.22, the ping only needs the memd (default port: 11210) and mgmt (default port: 8091) services.
If your network policy allows these ports, the problem should be solved. Can you retry?

@Du-Li
Author

Du-Li commented Mar 21, 2023

@erayarslan The health check still failed after the interval+timeout time. From the same machine where the program was run, I tried telnet on the two ports to one of the CB servers and both connected.

{"level":"info","time":"2023-03-21T18:35:41Z","message":"api starting on port 8080"}
{"level":"error","error":"context deadline exceeded","time":"2023-03-21T18:35:56Z","message":"health check failed"}

@erayarslan
Member

Are you sure your versions were upgraded to go-kafka-connect-couchbase v0.0.22 and go-dcp-client v0.0.24?
I ask because I reproduced this issue on my VM with firewall configurations, and with the new version it was solved.

@Du-Li
Author

Du-Li commented Mar 21, 2023

Yes, I am very sure. The following is part of my go.mod file.

require (
    github.com/Trendyol/go-kafka-connect-couchbase v0.0.22
    github.com/gookit/config/v2 v2.2.1
)

require (
    github.com/Trendyol/go-dcp-client v0.0.24 // indirect

@erayarslan
Member

I cannot reproduce your behaviour, so I exposed a config option to enable/disable the health check. With v0.0.23 you can disable the health check like this:

healthCheck:
  enabled: false

@Du-Li
Author

Du-Li commented Mar 22, 2023

@erayarslan Thanks. That worked.

@erayarslan
Member

Thank you so much for your contribution 🥳
