RDMA coredump #2213

Closed
trevor211 opened this issue Apr 19, 2023 · 5 comments

@trevor211

Describe the bug
When running the example/rdma_performance test, the client core dumps.
[Screenshot: 1681720740452-image]

To Reproduce

  1. Prepare two machines that support RDMA: host_a and host_b.
  2. Deploy the server from example/rdma_performance on host_a and the client from example/rdma_performance on host_b.
  3. On host_a, inject an RDMA fault using the fault-injection script down_rdma.sh:
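#!/bin/bash
source /etc/profile
# Collect the names of all eth* interfaces that are currently UP.
eth_list=$(ip a | grep "eth" | grep "UP" | awk -F ': ' '{print $2}')
for eth in $eth_list
do
    # Disable PFC on all eight priorities.
    mlnx_qos -i "$eth" --pfc 0,0,0,0,0,0,0,0
    # Give the whole receive buffer to buffer 0; every other buffer gets size 0.
    mlnx_qos -i "$eth" --buffer_size=262016,0,0,0,0,0,0,0
    # Map priority 3 (the RoCE queue here) to buffer 4, which now has size 0.
    mlnx_qos -i "$eth" --prio2buffer=0,0,0,4,0,0,0,0
done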

Expected behavior
The client does not core dump.

Versions
OS: CentOS Linux, VERSION="7 (Core)"
Compiler: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
brpc: master branch
protobuf: 2.5.0-8.el7

Additional context/screenshots
With --rdma_prepared_qp_cnt=1 or --rdma_prepared_qp_cnt=4000 the client does not core dump; with any other value it does.
The CQ involved in the crash had not been released ahead of time.

@lorinlee
Contributor

@Tuvie, could you please take a look at this issue?

@Tuvie
Contributor

Tuvie commented Apr 20, 2023

How does this fault injection work? I could not inject a fault with the script provided; everything still runs normally.

@jiangzhuti

jiangzhuti commented Apr 21, 2023

> How does this fault injection work? I could not inject a fault with the script provided; everything still runs normally.

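Hi. The script closes the NIC buffer for queue 3. Our RoCE traffic runs on queue 3, so once that buffer is closed, RDMA traffic on the host side is effectively cut off; in a perftest traffic test this shows up as timeout CQEs. We use this script to inject RDMA faults without affecting TCP.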

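If RoCE in your environment does not run on queue 3 but on, say, queue 5, you can change
mlnx_qos -i $eth --prio2buffer=0,0,0,4,0,0,0,0
to
mlnx_qos -i $eth --prio2buffer=0,0,0,0,0,4,0,0
to inject the fault. It is also best to run mlnx_qos -i eth0 beforehand and record the correct receive buffer size, PFC enable, and buffer settings, so that RoCE communication can be restored afterwards.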
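
For anyone adapting the script to a different RoCE queue, the mapping string can be generated rather than edited by hand. This is only a rough sketch: build_prio2buffer is a hypothetical helper name that is not part of the original script, and it assumes, as in the two examples above, that the target priority is mapped to buffer 4 while every other priority stays on buffer 0.

```shell
#!/bin/bash
# Hypothetical helper (not in the original down_rdma.sh): build the value for
# --prio2buffer that maps one priority to a target buffer and leaves the
# other seven priorities on buffer 0.
build_prio2buffer() {
    local prio="$1"          # priority (queue) that carries RoCE traffic, 0-7
    local buffer="${2:-4}"   # target buffer; 4 mirrors the script in this issue
    local out="" i

    case "$prio" in
        [0-7]) ;;
        *) echo "priority must be between 0 and 7" >&2; return 1 ;;
    esac

    for i in 0 1 2 3 4 5 6 7; do
        if [ "$i" -eq "$prio" ]; then
            out+="$buffer"
        else
            out+="0"
        fi
        # Separate the eight entries with commas, with no trailing comma.
        if [ "$i" -lt 7 ]; then
            out+=","
        fi
    done
    echo "$out"
}

build_prio2buffer 3   # prints 0,0,0,4,0,0,0,0
build_prio2buffer 5   # prints 0,0,0,0,0,4,0,0
```

The result could then be passed along the lines of mlnx_qos -i "$eth" --prio2buffer="$(build_prio2buffer 5)", after the current settings have been recorded for later recovery.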

@Tuvie
Contributor

Tuvie commented Apr 22, 2023

#2220
You can try this PR. The pre-allocated resources were probably being released too early.

@trevor211
Author

I tried it, and the core dump is fixed.
