picodet crashes during training #9064

Open
3 tasks done
Liuuuu54 opened this issue Jul 17, 2024 · 0 comments

Liuuuu54 commented Jul 17, 2024

Search before asking

  • I have searched the issues and found no similar bug report.

Bug Component

Training

Describe the Bug

Currently, when training picodet, the run errors out and aborts after 10-20 epochs. How can this be resolved?

[07/16 13:59:29] ppdet.engine INFO: Epoch: [50] [ 9280/13865] learning_rate: 0.115867 loss_vfl: 0.726240 loss_bbox: 0.376503 loss_dfl: 0.373168 loss: 1.441112 eta: 9 days, 10:05:33 batch_cost: 0.3815 data_cost: 0.0012 ips: 20.9692 images/s
[07/16 13:59:41] ppdet.engine INFO: Epoch: [50] [ 9300/13865] learning_rate: 0.115867 loss_vfl: 0.941561 loss_bbox: 0.426528 loss_dfl: 0.395401 loss: 1.785857 eta: 9 days, 10:06:08 batch_cost: 0.5066 data_cost: 0.0013 ips: 15.7928 images/s


--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   egr::Backward(std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, bool)
1   egr::RunBackward(std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, bool, bool, std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, bool, std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&)
2   egr::GradNodeAccumulation::operator()(paddle::small_vector<std::vector<paddle::Tensor, std::allocator<paddle::Tensor> >, 15u>&, bool, bool)
3   egr::GradNodeAccumulation::ApplyReduceHooks()
4   paddle::distributed::EagerReducer::AddDistHook(unsigned long)
5   paddle::distributed::EagerReducer::MarkVarReady(unsigned long, bool)
6   paddle::distributed::EagerReducer::FinalizeBackward()
7   paddle::distributed::EagerReducer::ProcessUnusedDenseVars()
8   void paddle::framework::TensorToVector<int>(phi::DenseTensor const&, phi::DeviceContext const&, std::vector<int, std::allocator<int> >*)
9   void paddle::memory::Copy<phi::CPUPlace, phi::Place>(phi::CPUPlace, void*, phi::Place, void const*, unsigned long, void*)
10  void paddle::memory::Copy<phi::Place, phi::Place>(phi::Place, void*, phi::Place, void const*, unsigned long, void*)
11  void paddle::memory::Copy<phi::CPUPlace, phi::GPUPlace>(phi::CPUPlace, void*, phi::GPUPlace, void const*, unsigned long, void*)
12  phi::backends::gpu::GpuMemcpyAsync(void*, void const*, unsigned long, cudaMemcpyKind, CUstream_st*)

----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1721138387 (unix time) try "date -d @1721138387" if you are using GNU date ***]
  [SignalInfo: *** SIGTERM (@0x3d6a) received by PID 15776 (TID 0x7f45c035b2c0) from PID 15722 ***]
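
A hedged reading of the trace, not a confirmed diagnosis: the SIGTERM arrives from a different PID, which usually means a sibling worker under paddle.distributed.launch died first (for example from host or GPU out-of-memory) and the launcher then terminated this rank, so the other ranks' logs from the same run are worth checking. The C++ frames also pass through paddle::distributed::EagerReducer::ProcessUnusedDenseVars, a path that is only exercised when the model is wrapped with find_unused_parameters=True (some PaddleDetection ymls set this key). Below is a minimal stand-alone sketch of that switch, assuming no picodet parameter is genuinely unused so it can be turned off; the model here is a stand-in, not ppdet's trainer code.

```python
# Minimal sketch (not PaddleDetection's trainer) of the DataParallel flag
# that gates EagerReducer::ProcessUnusedDenseVars in the traceback above.
# Launch with: python -m paddle.distributed.launch this_script.py
import paddle
import paddle.nn as nn

paddle.distributed.init_parallel_env()

# Stand-in model; in PaddleDetection the picodet model is built from the yml config.
model = paddle.DataParallel(nn.Linear(16, 16), find_unused_parameters=False)

opt = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())
x = paddle.randn([4, 16])
loss = model(x).mean()
loss.backward()  # with the flag off, the unused-vars GPU-to-CPU copy is skipped
opt.step()
opt.clear_grad()
```

In PaddleDetection itself the equivalent change would be setting find_unused_parameters: false in the model's yml, assuming the picodet config exposes that key; this only avoids the code path seen in the trace and does not rule out the sibling-worker crash as the root cause.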
  

Environment

  • OS: Linux
  • PaddlePaddle: 2.6.1
  • PaddleDetection: release/2.7
  • Python: 3.9.12
  • CUDA: 11.6
  • CUDNN: 9.2.1
  • GCC: 7.5.0

Bug description confirmation

  • I confirm that the bug reproduction steps, code change instructions, and environment information have been provided, and that the problem can be reproduced.

Are you willing to submit a PR?

  • I'd like to help by submitting a PR!