Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] 使用 flink-doris-connector 版本写入doris 2.1.6时,当一个be节点长时间失联时,flink job失败且无法从checkpoint恢复 #504

Open
2 of 3 tasks
thunderlike opened this issue Oct 25, 2024 · 3 comments

Comments

@thunderlike
Copy link

Search before asking

  • I had searched in the issues and found no similar issues.

Version

flink1.14.4 flink-doris-connector-1.14 1.1.1 版本写入doris 2.1.6

What's Wrong?

使用 flink1.14.4 flink-doris-connector-1.14 1.1.1 版本写入doris 2.1.6当一个be节点长时间失联时,flink job失败且无法从checkpoint恢复。
核心报错如下:
2024-10-12 14:13:48,049 ERROR org.apache.doris.flink.sink.committer.DorisCommitter [] - commit transaction failed:
org.apache.http.conn.HttpHostConnectException: Connect to 10.126.72.64:8040 [/10.126.72.64] failed: Connection refused (Connection
refused)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156) ~[
blob_p-61f0320fe7cc111cc844c6575ef7cc48af8a4b1f-139883e44a5481fda2030ca9306e6ec8:?]
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374) ~[bl
ob_p-61f0320fe7cc111cc844c6575ef7cc48af8a4b1f-139883e44a5481fda2030ca9306e6ec8:?]
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[blob_p-61f0320fe7cc111cc844c65
75ef7cc48af8a4b1f-139883e44a5481fda2030ca9306e6ec8:?]
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[blob_p-61f0320fe7cc111cc844c6575ef7cc
48af8a4b1f-139883e44a5481fda2030ca9306e6ec8:?]
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[blob_p-61f0320fe7cc111cc844c6575ef7cc48af
8a4b1f-139883e44a5481fda2030ca9306e6ec8:?]
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[blob_p-61f0320fe7cc111cc844c6575ef7cc48af8a4b1f-
139883e44a5481fda2030ca9306e6ec8:?]
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[blob_p-61f0320fe7cc111cc844c6575ef7cc48af
8a4b1f-139883e44a5481fda2030ca9306e6ec8:?]
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[blob_p-61f0320fe7cc111cc844c65
75ef7cc48af8a4b1f-139883e44a5481fda2030ca9306e6ec8:?]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[blob_p-61f0320fe7cc111cc844c657
5ef7cc48af8a4b1f-139883e44a5481fda2030ca9306e6ec8:?]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[blob_p-61f0320fe7cc111cc844c65
75ef7cc48af8a4b1f-139883e44a5481fda2030ca9306e6ec8:?]
at org.apache.doris.flink.sink.committer.DorisCommitter.commitTransaction(DorisCommitter.java:91) ~[blob_p-61f0320fe7cc11
1cc844c6575ef7cc48af8a4b1f-139883e44a5481fda2030ca9306e6ec8:?]
at org.apache.doris.flink.sink.committer.DorisCommitter.commit(DorisCommitter.java:71) ~[blob_p-61f0320fe7cc111cc844c6575
ef7cc48af8a4b1f-139883e44a5481fda2030ca9306e6ec8:?]
at org.apache.flink.streaming.runtime.operators.sink.StreamingCommitterHandler.commit(StreamingCommitterHandler.java:54)
~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.streaming.runtime.operators.sink.AbstractStreamingCommitterHandler.retry(AbstractStreamingCommitterHa
ndler.java:99) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.streaming.runtime.operators.sink.AbstractCommitterHandler.retry(AbstractCommitterHandler.java:66) ~[f
link-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.streaming.runtime.operators.sink.CommitRetrier.retry(CommitRetrier.java:80) ~[flink-dist_2.11-1.14.4.
jar:1.14.4]
at org.apache.flink.streaming.runtime.operators.sink.CommitRetrier.lambda$retryAt$0(CommitRetrier.java:63) ~[flink-dist_2
.11-1.14.4.jar:1.14.4]
at org.apache.flink.streaming.runtime.tasks.StreamTask.invokeProcessingTimeCallback(StreamTask.java:1693) ~[flink-dist_2.
11-1.14.4.jar:1.14.4]
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$null$22(StreamTask.java:1684) ~[flink-dist_2.11-1.14.4.jar:
1.14.4]
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50) ~[fl
ink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProc
essor.java:338) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:324) ~[flink-dist_
2.11-1.14.4.jar:1.14.4]
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:201) ~[flink-di
st_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:809) ~[flink-dist_2.11-1.14.4.jar:1
.14.4]
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:761) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958) [flink-dist_2.11-1.14.4.jar:1.14.
4]
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) [flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) [flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) [flink-dist_2.11-1.14.4.jar:1.14.4]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_322]
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_322]
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_322]
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_322]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_322]
at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_322]
at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75) ~[blob_p-
61f0320fe7cc111cc844c6575ef7cc48af8a4b1f-139883e44a5481fda2030ca9306e6ec8:?]
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) ~[
blob_p-61f0320fe7cc111cc844c6575ef7cc48af8a4b1f-139883e44a5481fda2030ca9306e6ec8:?]

看起来是从checkpoint恢复时,读取的host还是故障节点ip,所以访问连接不上

What You Expected?

我们没法升级flink1.14这个环境,请问 flink-doris-connector-1.14 1.1.1版本这个问题可以修复下吗?系统对高可用较高,be节点当磁盘故障时,无法短时间恢复,那么这个问题就肯定会复现。

How to Reproduce?

No response

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@JNSimba
Copy link
Member

JNSimba commented Oct 30, 2024

Can upgrade to the latest version to try

@MaoMiMao
Copy link
Contributor

MaoMiMao commented Nov 1, 2024

I encountered a similar issue where, when setting sink.enable.batch-mode=true and restarting the Doris cluster, the program was unable to resume writing.
driver version:24.0.0
doris version : 2.1.X

@JNSimba
Copy link
Member

JNSimba commented Nov 4, 2024

I encountered a similar issue where, when setting sink.enable.batch-mode=true and restarting the Doris cluster, the program was unable to resume writing. driver version:24.0.0 doris version : 2.1.X

What is the specific error? Because when you turn on this parameter, sink.enable.batch-mode=true, you should not reach the DorisCommiter method

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants