【Hackathon No.112】 PR.md #4463

Merged 2 commits on Oct 25, 2022

ImNoBadBoy (Contributor)
[Team name]: xd_no-bad

[Task number]: 112

[Status]: PR submitted
paddle-bot-old bot commented Apr 3, 2022

Thanks for your contribution!

```

## (3) Comparing the usability of the two frameworks and their differences
Installing the PyTorch distributed environment on the Sugon platform requires compiling torchvision manually, which makes PyTorch more cumbersome to set up. However, the PyTorch environment is fairly stable on the Sugon platform, whereas the Paddle environment is often unstable there: sometimes it runs and sometimes it does not.
```

Review comment:

"The Paddle environment is often unstable on the Sugon platform: sometimes it runs and sometimes it does not." To help Paddle improve its usability, please add a more detailed description of the instability symptoms and the problems encountered.

Contributor Author:

Hello, I have added some error screenshots and collected the errors in the final chapter.

Added some error screenshots.

For "The Paddle environment is often unstable on the Sugon platform: sometimes it runs and sometimes it does not", provided error screenshots showing the instability.

ImNoBadBoy (Contributor Author) left a comment:

Some error screenshots have been added.

ImNoBadBoy commented Apr 20, 2022 via email

![image](https://user-images.githubusercontent.com/102226413/164143166-cde2793b-eb06-43a3-92d1-bfa68c2f1558.png)


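In addition, we use Paddle on Sugon by launching a container image, but the Sugon platform's support for Docker images is limited: an image instance can be kept running for at most 72 hours, and once it has been shut down the original instance cannot be started again. For ease of use, we hope the Paddle distributed framework can be supported through job submission. Job submission also makes it easier to manage multi-node runs.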
Contributor Author:

OK.
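Before testing the installation, you can run the salloc command to request a DCU for the installation test.

Request a node:
salloc -p kshdtest -N 1 -c 8 --gres=dcu:1

Log in to the node:
ssh j17r2n04

Switch the ROCm version:
module switch compiler/rocm/4.0.1

Verify that the installation succeeded:
python -c "import paddle; paddle.utils.run_check()"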
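The steps above can be sketched as a small Python helper that assembles the same commands so they can be reviewed before being run by hand. This is only an illustrative sketch: the function names `build_salloc_command` and `build_check_command` are hypothetical, and the partition `kshdtest` and the ROCm module version are simply the values quoted above, which may differ on other clusters.

```python
import shlex


def build_salloc_command(partition="kshdtest", nodes=1, cpus=8, dcus=1):
    """Return the salloc argument list for requesting DCU resources.

    Hypothetical helper: the defaults mirror the command quoted above.
    """
    return [
        "salloc",
        "-p", partition,
        "-N", str(nodes),
        "-c", str(cpus),
        f"--gres=dcu:{dcus}",
    ]


def build_check_command(rocm_module="compiler/rocm/4.0.1"):
    """Return the shell line to run on the allocated node.

    It switches the ROCm module and then runs Paddle's own self-check.
    """
    check = 'python -c "import paddle; paddle.utils.run_check()"'
    return f"module switch {rocm_module} && {check}"


if __name__ == "__main__":
    # Print the commands instead of executing them, so nothing is
    # submitted to the scheduler by accident.
    print(shlex.join(build_salloc_command()))
    print(build_check_command())
```

The helper only builds strings; logging in to the allocated node (the node name varies per allocation) is left as a manual step.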

5. Unresolved issues (problems that prevent using Paddle on Sugon)
![image](https://user-images.githubusercontent.com/102226413/164143125-70d0e4ff-46d7-4461-8cb0-72c14e98b8e0.png)

![image](https://user-images.githubusercontent.com/102226413/164143166-cde2793b-eb06-43a3-92d1-bfa68c2f1558.png)
Review comment:

For the second problem, try setting
export NCCL_IB_HCA=mlx5_0
export NCCL_SOCKET_IFNAME=eno1
export NCCL_IB_DISABLE=0

or use export NCCL_IB_DISABLE=1 to disable the InfiniBand transport.
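Since NCCL reads these variables from the process environment, the same settings could also be applied from Python before any distributed initialization happens. A minimal sketch, assuming the names `mlx5_0` and `eno1` from the suggestion above match the node's actual hardware; the helper names `nccl_env` and `apply_nccl_env` are hypothetical.

```python
import os


def nccl_env(use_infiniband=True, ib_hca="mlx5_0", socket_ifname="eno1"):
    """Build the NCCL environment variables discussed above.

    Hypothetical helper; the device and interface names are the values
    from the suggestion above and must match the actual node.
    """
    if not use_infiniband:
        # Disable the InfiniBand transport so NCCL falls back to sockets.
        return {"NCCL_IB_DISABLE": "1"}
    return {
        "NCCL_IB_HCA": ib_hca,
        "NCCL_SOCKET_IFNAME": socket_ifname,
        "NCCL_IB_DISABLE": "0",
    }


def apply_nccl_env(settings, environ=os.environ):
    """Copy the settings into the environment before training starts."""
    environ.update(settings)
    return environ


if __name__ == "__main__":
    apply_nccl_env(nccl_env(use_infiniband=True))
    print({k: v for k, v in os.environ.items() if k.startswith("NCCL_")})
```

Passing `use_infiniband=False` corresponds to the second option in the suggestion and is an easy way to check whether the failure is specific to the InfiniBand path.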

Contributor Author:

OK.

@TCChenlong TCChenlong requested a review from xymyeah May 13, 2022 08:25
@Ligoml Ligoml merged commit a0811ec into PaddlePaddle:develop Oct 25, 2022