Training MindYolo on a Huawei 910B card fails with EZ9999: Inner Error! #319

Open
MuDaMuDaMuDa2 opened this issue Nov 28, 2024 · 2 comments

@MuDaMuDaMuDa2

Environment

Hardware Environment(Ascend):

/device ascend
Huawei 910B card

Software Environment:

  • MindSpore version: 2.3.1 and 2.3.0-rc1 (both were tried)
  • Python version: 3.8.20 and 3.9.11 (both were tried)
  • OS platform and distribution: Ubuntu 22.04.5 LTS
  • GCC/Compiler version: gcc (Ubuntu 13.2.0-23ubuntu4) 13.2.0
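
The version fields above can be gathered programmatically, which avoids transcription slips when several environments were tried. A minimal sketch, assuming MindSpore is installed under the distribution name `mindspore`; the helper name `collect_env` is hypothetical:

```python
import platform
from importlib import metadata


def collect_env(packages=("mindspore",)):
    """Return a dict of interpreter, OS, and package versions for a bug report."""
    info = {
        "python": platform.python_version(),
        "machine": platform.machine(),    # e.g. aarch64 on many Ascend hosts
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report the gap instead of failing, so the summary is always printable.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

MindSpore also provides `mindspore.run_check()`, which runs a small computation on the configured device and can confirm that the installation works before a long training job is started.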

Describe the current behavior

The error output is shown below:
(md_py3911) root@6163cbabedd4:/home/data/jupyter/ai_project/scripts/workcloth# python RunTrainCommand.py
/home/data/jupyter/ai_project/scripts/workcloth
2024-11-28 09:43:46,562 [INFO] parse_args:
2024-11-28 09:43:46,562 [INFO] task detect
2024-11-28 09:43:46,562 [INFO] device_target Ascend
2024-11-28 09:43:46,562 [INFO] save_dir ./runs/2024.11.28-09.43.46
2024-11-28 09:43:46,562 [INFO] log_level INFO
2024-11-28 09:43:46,562 [INFO] is_parallel False
2024-11-28 09:43:46,562 [INFO] ms_mode 0
2024-11-28 09:43:46,562 [INFO] ms_amp_level O0
2024-11-28 09:43:46,562 [INFO] keep_loss_fp32 True
2024-11-28 09:43:46,562 [INFO] anchor_base True
2024-11-28 09:43:46,562 [INFO] ms_loss_scaler static
2024-11-28 09:43:46,562 [INFO] ms_loss_scaler_value 1024.0
2024-11-28 09:43:46,562 [INFO] ms_jit True
2024-11-28 09:43:46,562 [INFO] ms_enable_graph_kernel False
2024-11-28 09:43:46,562 [INFO] ms_datasink False
2024-11-28 09:43:46,562 [INFO] overflow_still_update True
2024-11-28 09:43:46,562 [INFO] clip_grad False
2024-11-28 09:43:46,562 [INFO] clip_grad_value 10.0
2024-11-28 09:43:46,562 [INFO] ema True
2024-11-28 09:43:46,562 [INFO] weight ../../models/yolov7-tiny_300e.ckpt
2024-11-28 09:43:46,562 [INFO] ema_weight
2024-11-28 09:43:46,562 [INFO] freeze []
2024-11-28 09:43:46,562 [INFO] epochs 2
2024-11-28 09:43:46,562 [INFO] per_batch_size 8
2024-11-28 09:43:46,562 [INFO] img_size 640
2024-11-28 09:43:46,562 [INFO] nbs 64
2024-11-28 09:43:46,562 [INFO] accumulate 1
2024-11-28 09:43:46,562 [INFO] auto_accumulate False
2024-11-28 09:43:46,562 [INFO] log_interval 10
2024-11-28 09:43:46,562 [INFO] single_cls False
2024-11-28 09:43:46,562 [INFO] sync_bn False
2024-11-28 09:43:46,562 [INFO] keep_checkpoint_max 100
2024-11-28 09:43:46,562 [INFO] run_eval False
2024-11-28 09:43:46,562 [INFO] conf_thres 0.001
2024-11-28 09:43:46,562 [INFO] iou_thres 0.65
2024-11-28 09:43:46,562 [INFO] conf_free False
2024-11-28 09:43:46,562 [INFO] rect False
2024-11-28 09:43:46,562 [INFO] nms_time_limit 20.0
2024-11-28 09:43:46,562 [INFO] recompute False
2024-11-28 09:43:46,562 [INFO] recompute_layers 0
2024-11-28 09:43:46,562 [INFO] seed 2
2024-11-28 09:43:46,562 [INFO] summary True
2024-11-28 09:43:46,562 [INFO] profiler False
2024-11-28 09:43:46,562 [INFO] profiler_step_num 1
2024-11-28 09:43:46,562 [INFO] opencv_threads_num 2
2024-11-28 09:43:46,562 [INFO] strict_load False
2024-11-28 09:43:46,562 [INFO] enable_modelarts False
2024-11-28 09:43:46,562 [INFO] data_url
2024-11-28 09:43:46,562 [INFO] ckpt_url
2024-11-28 09:43:46,562 [INFO] multi_data_url
2024-11-28 09:43:46,562 [INFO] pretrain_url
2024-11-28 09:43:46,562 [INFO] train_url
2024-11-28 09:43:46,562 [INFO] data_dir /cache/data/
2024-11-28 09:43:46,562 [INFO] ckpt_dir /cache/pretrain_ckpt/
2024-11-28 09:43:46,562 [INFO] data.dataset_name WorkCloth
2024-11-28 09:43:46,562 [INFO] data.train_set ../../dataset/WorkCloth/train.txt
2024-11-28 09:43:46,562 [INFO] data.val_set ../../dataset/WorkCloth/val.txt
2024-11-28 09:43:46,562 [INFO] data.test_set ../../dataset/WorkCloth/test.txt
2024-11-28 09:43:46,562 [INFO] data.nc 2
2024-11-28 09:43:46,562 [INFO] data.names ['work_clothes', '']
2024-11-28 09:43:46,562 [INFO] data.num_parallel_workers 4
2024-11-28 09:43:46,562 [INFO] data.train_transforms [{'func_name': 'mosaic', 'prob': 1.0, 'mosaic9_prob': 0.2}, {'func_name': 'resample_segments'}, {'func_name': 'random_perspective', 'prob': 1.0, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0}, {'func_name': 'mixup', 'alpha': 8.0, 'beta': 8.0, 'prob': 0.05, 'pre_transform': [{'func_name': 'mosaic', 'prob': 1.0, 'mosaic9_prob': 0.2}, {'func_name': 'resample_segments'}, {'func_name': 'random_perspective', 'prob': 1.0, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0}]}, {'func_name': 'hsv_augment', 'prob': 1.0, 'hgain': 0.015, 'sgain': 0.7, 'vgain': 0.4}, {'func_name': 'pastein', 'prob': 0.05, 'num_sample': 30}, {'func_name': 'fliplr', 'prob': 0.5}, {'func_name': 'label_norm', 'xyxy2xywh_': True}, {'func_name': 'label_pad', 'padding_size': 160, 'padding_value': -1}, {'func_name': 'image_norm', 'scale': 255.0}, {'func_name': 'image_transpose', 'bgr2rgb': True, 'hwc2chw': True}]
2024-11-28 09:43:46,562 [INFO] data.test_transforms [{'func_name': 'letterbox', 'scaleup': False, 'only_image': True}, {'func_name': 'image_norm', 'scale': 255.0}, {'func_name': 'image_transpose', 'bgr2rgb': True, 'hwc2chw': True}]
2024-11-28 09:43:46,562 [INFO] optimizer.lr_init 0.01
2024-11-28 09:43:46,562 [INFO] optimizer.optimizer momentum
2024-11-28 09:43:46,562 [INFO] optimizer.momentum 0.937
2024-11-28 09:43:46,562 [INFO] optimizer.nesterov True
2024-11-28 09:43:46,562 [INFO] optimizer.loss_scale 1.0
2024-11-28 09:43:46,562 [INFO] optimizer.warmup_epochs 3
2024-11-28 09:43:46,562 [INFO] optimizer.warmup_momentum 0.8
2024-11-28 09:43:46,562 [INFO] optimizer.warmup_bias_lr 0.1
2024-11-28 09:43:46,562 [INFO] optimizer.min_warmup_step 1000
2024-11-28 09:43:46,562 [INFO] optimizer.group_param yolov7
2024-11-28 09:43:46,562 [INFO] optimizer.gp_weight_decay 0.0005
2024-11-28 09:43:46,562 [INFO] optimizer.start_factor 1.0
2024-11-28 09:43:46,562 [INFO] optimizer.end_factor 0.01
2024-11-28 09:43:46,562 [INFO] optimizer.epochs 2
2024-11-28 09:43:46,562 [INFO] optimizer.nbs 64
2024-11-28 09:43:46,562 [INFO] optimizer.accumulate 1
2024-11-28 09:43:46,562 [INFO] optimizer.total_batch_size 8
2024-11-28 09:43:46,562 [INFO] loss.name YOLOv7Loss
2024-11-28 09:43:46,562 [INFO] loss.box 0.05
2024-11-28 09:43:46,562 [INFO] loss.cls 0.5
2024-11-28 09:43:46,562 [INFO] loss.cls_pw 1.0
2024-11-28 09:43:46,562 [INFO] loss.obj 1.0
2024-11-28 09:43:46,562 [INFO] loss.obj_pw 1.0
2024-11-28 09:43:46,562 [INFO] loss.fl_gamma 0.0
2024-11-28 09:43:46,562 [INFO] loss.anchor_t 4.0
2024-11-28 09:43:46,562 [INFO] loss.label_smoothing 0.0
2024-11-28 09:43:46,562 [INFO] network.model_name yolov7
2024-11-28 09:43:46,562 [INFO] network.depth_multiple 1.0
2024-11-28 09:43:46,562 [INFO] network.width_multiple 1.0
2024-11-28 09:43:46,562 [INFO] network.stride [8, 16, 32]
2024-11-28 09:43:46,562 [INFO] network.anchors [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]]
2024-11-28 09:43:46,562 [INFO] network.backbone [[-1, 1, 'ConvNormAct', [32, 3, 2, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [64, 3, 2, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [32, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [32, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [32, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [32, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [64, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [64, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [128, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [128, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [256, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [256, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [256, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [256, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [512, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']]]
2024-11-28 09:43:46,562 [INFO] network.head [[-1, 1, 'ConvNormAct', [256, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [256, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'SP', [5]], [-2, 1, 'SP', [9]], [-3, 1, 'SP', [13]], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -7], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'Upsample', ['None', 2, 'nearest']], [21, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [64, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [64, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'Upsample', ['None', 2, 'nearest']], [14, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [32, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [32, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [32, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [32, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [128, 3, 2, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, 47], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [64, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], 
[-1, 1, 'ConvNormAct', [64, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [256, 3, 2, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, 37], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [128, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [128, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [57, 1, 'ConvNormAct', [128, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [65, 1, 'ConvNormAct', [256, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [73, 1, 'ConvNormAct', [512, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[74, 75, 76], 1, 'YOLOv7Head', ['nc', 'anchors', 'stride']]]
2024-11-28 09:43:46,562 [INFO] config ../../configs/yolov7/yolov7-tiny_xyz_workcloth.yaml
2024-11-28 09:43:46,562 [INFO] rank 0
2024-11-28 09:43:46,562 [INFO] rank_size 1
2024-11-28 09:43:46,562 [INFO] total_batch_size 8
2024-11-28 09:43:46,562 [INFO] callback []
2024-11-28 09:43:46,562 [INFO]
2024-11-28 09:43:46,564 [INFO] Please check the above information for the configurations
2024-11-28 09:43:54,359 [WARNING] Parse Model, args: nearest, keep str type
2024-11-28 09:43:54,468 [WARNING] Parse Model, args: nearest, keep str type
2024-11-28 09:43:54,882 [INFO] number of network params, total: 6.032533M, trainable: 6.017694M
2024-11-28 09:43:56,393 [WARNING] Parse Model, args: nearest, keep str type
2024-11-28 09:43:56,495 [WARNING] Parse Model, args: nearest, keep str type
2024-11-28 09:43:56,964 [INFO] number of network params, total: 6.032533M, trainable: 6.017694M
2024-11-28 09:43:57,925 [WARNING] Dropping checkpoint parameter model.model.77.m.0.weight with shape (255, 128, 1, 1), which is inconsistent with cell shape (21, 128, 1, 1)
2024-11-28 09:43:57,925 [WARNING] Dropping checkpoint parameter model.model.77.m.0.bias with shape (255,), which is inconsistent with cell shape (21,)
2024-11-28 09:43:57,925 [WARNING] Dropping checkpoint parameter model.model.77.m.1.weight with shape (255, 256, 1, 1), which is inconsistent with cell shape (21, 256, 1, 1)
2024-11-28 09:43:57,926 [WARNING] Dropping checkpoint parameter model.model.77.m.1.bias with shape (255,), which is inconsistent with cell shape (21,)
2024-11-28 09:43:57,926 [WARNING] Dropping checkpoint parameter model.model.77.m.2.weight with shape (255, 512, 1, 1), which is inconsistent with cell shape (21, 512, 1, 1)
2024-11-28 09:43:57,926 [WARNING] Dropping checkpoint parameter model.model.77.m.2.bias with shape (255,), which is inconsistent with cell shape (21,)
2024-11-28 09:43:57,926 [WARNING] Dropping checkpoint parameter model.model.77.im.0.implicit with shape (1, 255, 1, 1), which is inconsistent with cell shape (1, 21, 1, 1)
2024-11-28 09:43:57,926 [WARNING] Dropping checkpoint parameter model.model.77.im.1.implicit with shape (1, 255, 1, 1), which is inconsistent with cell shape (1, 21, 1, 1)
2024-11-28 09:43:57,926 [WARNING] Dropping checkpoint parameter model.model.77.im.2.implicit with shape (1, 255, 1, 1), which is inconsistent with cell shape (1, 21, 1, 1)
[WARNING] ME(337942:281469431465216,MainProcess):2024-11-28-09:43:57.945.768 [mindspore/train/serialization.py:1560] For 'load_param_into_net', 9 parameters in the 'net' are not loaded, because they are not in the 'parameter_dict', please check whether the network structure is consistent when training and loading checkpoint.
[WARNING] ME(337942:281469431465216,MainProcess):2024-11-28-09:43:57.945.909 [mindspore/train/serialization.py:1564] ['model.model.77.m.0.weight', 'model.model.77.m.0.bias', 'model.model.77.m.1.weight', 'model.model.77.m.1.bias', 'model.model.77.m.2.weight', 'model.model.77.m.2.bias', 'model.model.77.im.0.implicit', 'model.model.77.im.1.implicit', 'model.model.77.im.2.implicit'] are not loaded.
2024-11-28 09:43:57,946 [INFO] Pretrain model load from "../../models/yolov7-tiny_300e.ckpt" success.
2024-11-28 09:44:10,634 [INFO] ema_weight not exist, default pretrain weight is currently used.
2024-11-28 09:44:10,693 [INFO] Dataset Cache file hash/version check success.
2024-11-28 09:44:10,694 [INFO] Load dataset cache from [../../dataset/WorkCloth/train.cache.npy] success.
Scanning '../../dataset/WorkCloth/train.cache.npy' images and labels... 2395 found, 0 missing, 0 empty, 0 corrupted: 100%|█| 2395/2395 [00:00<?, ?it/s]
2024-11-28 09:44:10,724 [INFO] Dataloader num parallel workers: [4]
2024-11-28 09:44:12,185 [INFO] Registry(name=callback, total=4)
2024-11-28 09:44:12,185 [INFO] (0): YoloxSwitchTrain in mindyolo/utils/callback.py
2024-11-28 09:44:12,185 [INFO] (1): EvalWhileTrain in mindyolo/utils/callback.py
2024-11-28 09:44:12,185 [INFO] (2): SummaryCallback in mindyolo/utils/callback.py
2024-11-28 09:44:12,185 [INFO] (3): ProfilerCallback in mindyolo/utils/callback.py
2024-11-28 09:44:12,185 [INFO]
2024-11-28 09:44:12,299 [INFO] got 1 active callback as follows:
2024-11-28 09:44:12,300 [INFO] SummaryCallback()
2024-11-28 09:44:12,300 [WARNING] The first epoch will be compiled for the graph, which may take a long time; You can come back later :).
Warning: tiling offset out of range, index: 32
Warning: tiling offset out of range, index: 32
Warning: tiling offset out of range, index: 32
Warning: tiling offset out of range, index: 32
Warning: tiling offset out of range, index: 32
[ERROR] DEVICE(337942,fffcf347f1a0,python):2024-11-28-09:53:18.692.462 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:232] TaskExceptionCallback] Run Task failed, task_id: 0, stream_id: 548, tid: 338667, device_id: 0, retcode: 507018 (aicpu exception)
[ERROR] GE_ADPT(337942,fffcf347f1a0,python):2024-11-28-09:53:18.795.720 [mindspore/ccsrc/transform/graph_ir/graph_runner.cc:371] RunGraphWithStreamAsync] Call GE RunGraphWithStreamAsync Failed, ret is: 4294967295
Traceback (most recent call last):
  File "../../mindyolo-master/train.py", line 330, in <module>
    train(args)
  File "../../mindyolo-master/train.py", line 285, in train
    trainer.train(
  File "/home/data/jupyter/ai_project/mindyolo-master/mindyolo/utils/trainer_factory.py", line 170, in train
    run_context.loss, run_context.lr = self.train_step(imgs, labels, segments,
  File "/home/data/jupyter/ai_project/mindyolo-master/mindyolo/utils/trainer_factory.py", line 366, in train_step
    loss, loss_item, _, grads_finite = self.train_step_fn(imgs, labels, True)
  File "/usr/local/python3.8/lib/python3.8/site-packages/mindspore/common/api.py", line 941, in staging_specialize
    out = _MindsporeFunctionExecutor(func, hash_obj, dyn_args, process_obj, jit_config)(*args, **kwargs)
  File "/usr/local/python3.8/lib/python3.8/site-packages/mindspore/common/api.py", line 185, in wrapper
    results = fn(*arg, **kwargs)
  File "/usr/local/python3.8/lib/python3.8/site-packages/mindspore/common/api.py", line 572, in __call__
    output = self._graph_executor(tuple(new_inputs), phase)
RuntimeError: Exec graph failed


  • Ascend Error Message:

EZ9999: Inner Error!
EZ9999: 2024-11-28-09:53:18.691.987 Kernel task happen error, retCode=0x2a, [aicpu exception].[FUNC:PreCheckTaskErr][FILE:task_info.cc][LINE:1776][THREAD:338667]
TraceBack (most recent call last):
Aicpu kernel execute failed, device_id=0, stream_id=548, task_id=0, errorCode=2a.[FUNC:PrintAicpuErrorInfo][FILE:task_info.cc][LINE:1579][THREAD:338667]
AICPU Kernel task happen error, retCode=0x2a.[FUNC:GetError][FILE:stream.cc][LINE:1512][THREAD:338667]
Aicpu kernel execute failed, device_id=0, stream_id=548, task_id=0, fault op_name=[FUNC:GetError][FILE:stream.cc][LINE:1512][THREAD:338667]
rtStreamSynchronize execute failed, reason=[aicpu exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53][THREAD:338667]
Call rtStreamSynchronize(stream) fail, ret: 0x7BC8A[FUNC:LaunchKernelCustAicpuSo][FILE:model_manager.cc][LINE:1698][THREAD:338667]
GraphManager RunGrapWithStreamhAsync failed,session id = 0, graph id = 2, stream = 0xaaad7b705a70.[FUNC:RunGraphWithStreamAsync][FILE:inner_session.cc][LINE:513][THREAD:338667]
[Run][Graph]Run graph with stream asyn failed, error code = 507018, session id = 0,graph id = 2, stream = 0xaaad7b705a70.[FUNC:RunGraphWithStreamAsync][FILE:ge_api.cc][LINE:800][THREAD:338667]

(Please search "CANN Common Error Analysis" at https://www.mindspore.cn for error code description)
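
The log reports this failure under several numeric forms, which can look like distinct errors. A quick check in Python, using only values copied from the messages above, compares them. Reading `4294967295` as a signed 32-bit `-1` (a generic failure return) is an assumption, not something the log states.

```python
# Values copied verbatim from the log messages.
RT_STREAM_SYNC_RET = 0x7BC8A   # "Call rtStreamSynchronize(stream) fail, ret: 0x7BC8A"
TASK_RETCODE = 507018          # "retcode: 507018 (aicpu exception)"
GE_RET = 4294967295            # "Call GE RunGraphWithStreamAsync Failed, ret is: 4294967295"
AICPU_RETCODE = 0x2A           # "AICPU Kernel task happen error, retCode=0x2a"


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return value - (1 << 32) if value >= (1 << 31) else value


# The hexadecimal runtime code and the decimal task retcode are the same number.
same_code = RT_STREAM_SYNC_RET == TASK_RETCODE
ge_ret_signed = as_signed32(GE_RET)

print(same_code, ge_ret_signed, AICPU_RETCODE)
```

So the `507018` and `0x7BC8A` entries describe one AICPU exception rather than two separate faults.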


  • C++ Call Stack: (For framework developers)

mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:1332 RunGraphRefMode
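
The "Dropping checkpoint parameter" warnings earlier in the log are most likely unrelated to the AICPU failure. They concern only the detection head: an anchor-based YOLO head emits `na * (nc + 5)` channels per scale, where `na` is the number of anchors per scale and the 5 covers four box coordinates plus objectness. A short calculation reproduces both shapes from the warnings, assuming the pretrained checkpoint was trained with 80 classes (inferred from the 255-channel shape, not stated in the log):

```python
def head_channels(num_classes, anchors_per_scale=3):
    """Output channels of one anchor-based YOLO detection layer."""
    return anchors_per_scale * (num_classes + 5)


# network.anchors from the log: 6 numbers per scale, i.e. 3 (width, height) pairs.
anchors = [
    [10, 13, 16, 30, 33, 23],
    [30, 61, 62, 45, 59, 119],
    [116, 90, 156, 198, 373, 326],
]
na = len(anchors[0]) // 2

checkpoint_channels = head_channels(80, na)  # assumption: 80-class pretraining
model_channels = head_channels(2, na)        # data.nc is 2 in the logged config

print(checkpoint_channels, model_channels)
```

The two results match the checkpoint shape (255) and the cell shape (21) in the warnings. With `strict_load False`, as in the logged configuration, the mismatched head parameters are skipped and the remaining weights still load.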

Describe the expected behavior

The error is resolved and the model trains normally.

Steps to reproduce the issue

  1. conda activate md_3911
  2. Change to the corresponding directory and run python RunTrainCommand.py
  3. The kernel error is triggered
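
The contents of `RunTrainCommand.py` are not shown in the report. A hypothetical equivalent, assuming it simply launches `train.py` with the config path that appears in the log, might look like the sketch below. The function name `build_train_command` is illustrative, and only the `--config` and `--device_target` options are used, both of which are echoed in the parsed arguments.

```python
import subprocess
import sys


def build_train_command(config, device_target="Ascend",
                        train_script="../../mindyolo-master/train.py"):
    """Assemble the argv list for a MindYOLO training run without executing it."""
    return [
        sys.executable, train_script,
        "--config", config,
        "--device_target", device_target,
    ]


if __name__ == "__main__":
    cmd = build_train_command("../../configs/yolov7/yolov7-tiny_xyz_workcloth.yaml")
    print(" ".join(cmd))
    # Uncomment to launch; check=True raises CalledProcessError on a non-zero exit.
    # subprocess.run(cmd, check=True)
```

Posting the exact command alongside the steps makes the report reproducible without access to the wrapper script.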

Related log / screenshot

image
image
image

Special notes for this issue

@MuDaMuDaMuDa2
Author

Reinstalling the development kit did not help either.
image
image

@zhouyifeng888

For this issue, see the following thread in the MindSpore section of the Ascend community forum:
https://www.hiascend.com/forum/thread-0281168172389447092-1-1.html

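In short, this is an environment setup problem: the CANN version, the Ascend driver version, and related components do not match. The Docker image below has been tested and can run mindyolo; it is mentioned in the thread above and is provided for reference: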
swr.cn-southwest-2.myhuaweicloud.com/atelier/mindspore_2_3_ascend:mindspore_2.3.0-cann_8.0.rc1-py_3.9-euler_2.10.7-aarch64-snt9b-20240525100222-259922e
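
The image tag encodes the component versions that were tested together, which is the information needed for a comparison with a local installation. A small sketch that splits the tag into its parts; the parsing convention (hyphen-separated `name_version` fields) is inferred from this one tag and may not hold for other images:

```python
IMAGE = ("swr.cn-southwest-2.myhuaweicloud.com/atelier/mindspore_2_3_ascend:"
         "mindspore_2.3.0-cann_8.0.rc1-py_3.9-euler_2.10.7-aarch64-snt9b-"
         "20240525100222-259922e")


def parse_tag(image):
    """Map component name to version for tag fields shaped like name_version."""
    tag = image.rsplit(":", 1)[1]
    versions = {}
    for field in tag.split("-"):
        name, sep, version = field.partition("_")
        # Keep only fields with an underscore followed by a numeric version.
        if sep and version[:1].isdigit():
            versions[name] = version
    return versions


pinned = parse_tag(IMAGE)
print(pinned)
```

The report lists MindSpore 2.3.1 and 2.3.0-rc1, while the tested image pins MindSpore 2.3.0 with CANN 8.0.rc1 and Python 3.9, so aligning those versions, or using the image directly, is a reasonable first thing to try.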
