MindYolo training on a Huawei 910B card fails with EZ9999: Inner Error! #319

MuDaMuDaMuDa2 · 2024-11-28T10:22:28Z

Environment

Hardware Environment(`Ascend`):

/device ascend
Huawei 910B card

Software Environment:

MindSpore version: 2.3.1 and 2.3.0-rc1 (both tried)
Python version: 3.8.20 and 3.9.11 (both tried)
OS platform and distribution: Ubuntu 22.04.5 LTS
GCC/Compiler version: gcc (Ubuntu 13.2.0-23ubuntu4) 13.2.0

Describe the current behavior

The following is the error output:
(md_py3911) root@6163cbabedd4:/home/data/jupyter/ai_project/scripts/workcloth# python RunTrainCommand.py
/home/data/jupyter/ai_project/scripts/workcloth
2024-11-28 09:43:46,562 [INFO] parse_args:
2024-11-28 09:43:46,562 [INFO] task detect
2024-11-28 09:43:46,562 [INFO] device_target Ascend
2024-11-28 09:43:46,562 [INFO] save_dir ./runs/2024.11.28-09.43.46
2024-11-28 09:43:46,562 [INFO] log_level INFO
2024-11-28 09:43:46,562 [INFO] is_parallel False
2024-11-28 09:43:46,562 [INFO] ms_mode 0
2024-11-28 09:43:46,562 [INFO] ms_amp_level O0
2024-11-28 09:43:46,562 [INFO] keep_loss_fp32 True
2024-11-28 09:43:46,562 [INFO] anchor_base True
2024-11-28 09:43:46,562 [INFO] ms_loss_scaler static
2024-11-28 09:43:46,562 [INFO] ms_loss_scaler_value 1024.0
2024-11-28 09:43:46,562 [INFO] ms_jit True
2024-11-28 09:43:46,562 [INFO] ms_enable_graph_kernel False
2024-11-28 09:43:46,562 [INFO] ms_datasink False
2024-11-28 09:43:46,562 [INFO] overflow_still_update True
2024-11-28 09:43:46,562 [INFO] clip_grad False
2024-11-28 09:43:46,562 [INFO] clip_grad_value 10.0
2024-11-28 09:43:46,562 [INFO] ema True
2024-11-28 09:43:46,562 [INFO] weight ../../models/yolov7-tiny_300e.ckpt
2024-11-28 09:43:46,562 [INFO] ema_weight
2024-11-28 09:43:46,562 [INFO] freeze []
2024-11-28 09:43:46,562 [INFO] epochs 2
2024-11-28 09:43:46,562 [INFO] per_batch_size 8
2024-11-28 09:43:46,562 [INFO] img_size 640
2024-11-28 09:43:46,562 [INFO] nbs 64
2024-11-28 09:43:46,562 [INFO] accumulate 1
2024-11-28 09:43:46,562 [INFO] auto_accumulate False
2024-11-28 09:43:46,562 [INFO] log_interval 10
2024-11-28 09:43:46,562 [INFO] single_cls False
2024-11-28 09:43:46,562 [INFO] sync_bn False
2024-11-28 09:43:46,562 [INFO] keep_checkpoint_max 100
2024-11-28 09:43:46,562 [INFO] run_eval False
2024-11-28 09:43:46,562 [INFO] conf_thres 0.001
2024-11-28 09:43:46,562 [INFO] iou_thres 0.65
2024-11-28 09:43:46,562 [INFO] conf_free False
2024-11-28 09:43:46,562 [INFO] rect False
2024-11-28 09:43:46,562 [INFO] nms_time_limit 20.0
2024-11-28 09:43:46,562 [INFO] recompute False
2024-11-28 09:43:46,562 [INFO] recompute_layers 0
2024-11-28 09:43:46,562 [INFO] seed 2
2024-11-28 09:43:46,562 [INFO] summary True
2024-11-28 09:43:46,562 [INFO] profiler False
2024-11-28 09:43:46,562 [INFO] profiler_step_num 1
2024-11-28 09:43:46,562 [INFO] opencv_threads_num 2
2024-11-28 09:43:46,562 [INFO] strict_load False
2024-11-28 09:43:46,562 [INFO] enable_modelarts False
2024-11-28 09:43:46,562 [INFO] data_url
2024-11-28 09:43:46,562 [INFO] ckpt_url
2024-11-28 09:43:46,562 [INFO] multi_data_url
2024-11-28 09:43:46,562 [INFO] pretrain_url
2024-11-28 09:43:46,562 [INFO] train_url
2024-11-28 09:43:46,562 [INFO] data_dir /cache/data/
2024-11-28 09:43:46,562 [INFO] ckpt_dir /cache/pretrain_ckpt/
2024-11-28 09:43:46,562 [INFO] data.dataset_name WorkCloth
2024-11-28 09:43:46,562 [INFO] data.train_set ../../dataset/WorkCloth/train.txt
2024-11-28 09:43:46,562 [INFO] data.val_set ../../dataset/WorkCloth/val.txt
2024-11-28 09:43:46,562 [INFO] data.test_set ../../dataset/WorkCloth/test.txt
2024-11-28 09:43:46,562 [INFO] data.nc 2
2024-11-28 09:43:46,562 [INFO] data.names ['work_clothes', '']
2024-11-28 09:43:46,562 [INFO] data.num_parallel_workers 4
2024-11-28 09:43:46,562 [INFO] data.train_transforms [{'func_name': 'mosaic', 'prob': 1.0, 'mosaic9_prob': 0.2}, {'func_name': 'resample_segments'}, {'func_name': 'random_perspective', 'prob': 1.0, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0}, {'func_name': 'mixup', 'alpha': 8.0, 'beta': 8.0, 'prob': 0.05, 'pre_transform': [{'func_name': 'mosaic', 'prob': 1.0, 'mosaic9_prob': 0.2}, {'func_name': 'resample_segments'}, {'func_name': 'random_perspective', 'prob': 1.0, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0}]}, {'func_name': 'hsv_augment', 'prob': 1.0, 'hgain': 0.015, 'sgain': 0.7, 'vgain': 0.4}, {'func_name': 'pastein', 'prob': 0.05, 'num_sample': 30}, {'func_name': 'fliplr', 'prob': 0.5}, {'func_name': 'label_norm', 'xyxy2xywh_': True}, {'func_name': 'label_pad', 'padding_size': 160, 'padding_value': -1}, {'func_name': 'image_norm', 'scale': 255.0}, {'func_name': 'image_transpose', 'bgr2rgb': True, 'hwc2chw': True}]
2024-11-28 09:43:46,562 [INFO] data.test_transforms [{'func_name': 'letterbox', 'scaleup': False, 'only_image': True}, {'func_name': 'image_norm', 'scale': 255.0}, {'func_name': 'image_transpose', 'bgr2rgb': True, 'hwc2chw': True}]
2024-11-28 09:43:46,562 [INFO] optimizer.lr_init 0.01
2024-11-28 09:43:46,562 [INFO] optimizer.optimizer momentum
2024-11-28 09:43:46,562 [INFO] optimizer.momentum 0.937
2024-11-28 09:43:46,562 [INFO] optimizer.nesterov True
2024-11-28 09:43:46,562 [INFO] optimizer.loss_scale 1.0
2024-11-28 09:43:46,562 [INFO] optimizer.warmup_epochs 3
2024-11-28 09:43:46,562 [INFO] optimizer.warmup_momentum 0.8
2024-11-28 09:43:46,562 [INFO] optimizer.warmup_bias_lr 0.1
2024-11-28 09:43:46,562 [INFO] optimizer.min_warmup_step 1000
2024-11-28 09:43:46,562 [INFO] optimizer.group_param yolov7
2024-11-28 09:43:46,562 [INFO] optimizer.gp_weight_decay 0.0005
2024-11-28 09:43:46,562 [INFO] optimizer.start_factor 1.0
2024-11-28 09:43:46,562 [INFO] optimizer.end_factor 0.01
2024-11-28 09:43:46,562 [INFO] optimizer.epochs 2
2024-11-28 09:43:46,562 [INFO] optimizer.nbs 64
2024-11-28 09:43:46,562 [INFO] optimizer.accumulate 1
2024-11-28 09:43:46,562 [INFO] optimizer.total_batch_size 8
2024-11-28 09:43:46,562 [INFO] loss.name YOLOv7Loss
2024-11-28 09:43:46,562 [INFO] loss.box 0.05
2024-11-28 09:43:46,562 [INFO] loss.cls 0.5
2024-11-28 09:43:46,562 [INFO] loss.cls_pw 1.0
2024-11-28 09:43:46,562 [INFO] loss.obj 1.0
2024-11-28 09:43:46,562 [INFO] loss.obj_pw 1.0
2024-11-28 09:43:46,562 [INFO] loss.fl_gamma 0.0
2024-11-28 09:43:46,562 [INFO] loss.anchor_t 4.0
2024-11-28 09:43:46,562 [INFO] loss.label_smoothing 0.0
2024-11-28 09:43:46,562 [INFO] network.model_name yolov7
2024-11-28 09:43:46,562 [INFO] network.depth_multiple 1.0
2024-11-28 09:43:46,562 [INFO] network.width_multiple 1.0
2024-11-28 09:43:46,562 [INFO] network.stride [8, 16, 32]
2024-11-28 09:43:46,562 [INFO] network.anchors [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]]
2024-11-28 09:43:46,562 [INFO] network.backbone [[-1, 1, 'ConvNormAct', [32, 3, 2, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [64, 3, 2, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [32, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [32, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [32, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [32, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [64, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [64, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [128, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [128, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [256, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [256, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [256, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [256, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [512, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']]]
2024-11-28 09:43:46,562 [INFO] network.head [[-1, 1, 'ConvNormAct', [256, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [256, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'SP', [5]], [-2, 1, 'SP', [9]], [-3, 1, 'SP', [13]], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -7], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'Upsample', ['None', 2, 'nearest']], [21, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [64, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [64, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'Upsample', ['None', 2, 'nearest']], [14, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [32, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [32, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [32, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [32, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [128, 3, 2, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, 47], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [64, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [64, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], 
[-1, 1, 'ConvNormAct', [64, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [256, 3, 2, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, 37], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-2, 1, 'ConvNormAct', [128, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [128, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [-1, 1, 'ConvNormAct', [128, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[-1, -2, -3, -4], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [57, 1, 'ConvNormAct', [128, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [65, 1, 'ConvNormAct', [256, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [73, 1, 'ConvNormAct', [512, 3, 1, 'None', 1, 1, 'nn.LeakyReLU(0.1)']], [[74, 75, 76], 1, 'YOLOv7Head', ['nc', 'anchors', 'stride']]]
2024-11-28 09:43:46,562 [INFO] config ../../configs/yolov7/yolov7-tiny_xyz_workcloth.yaml
2024-11-28 09:43:46,562 [INFO] rank 0
2024-11-28 09:43:46,562 [INFO] rank_size 1
2024-11-28 09:43:46,562 [INFO] total_batch_size 8
2024-11-28 09:43:46,562 [INFO] callback []
2024-11-28 09:43:46,562 [INFO]
2024-11-28 09:43:46,564 [INFO] Please check the above information for the configurations
2024-11-28 09:43:54,359 [WARNING] Parse Model, args: nearest, keep str type
2024-11-28 09:43:54,468 [WARNING] Parse Model, args: nearest, keep str type
2024-11-28 09:43:54,882 [INFO] number of network params, total: 6.032533M, trainable: 6.017694M
2024-11-28 09:43:56,393 [WARNING] Parse Model, args: nearest, keep str type
2024-11-28 09:43:56,495 [WARNING] Parse Model, args: nearest, keep str type
2024-11-28 09:43:56,964 [INFO] number of network params, total: 6.032533M, trainable: 6.017694M
2024-11-28 09:43:57,925 [WARNING] Dropping checkpoint parameter model.model.77.m.0.weight with shape (255, 128, 1, 1), which is inconsistent with cell shape (21, 128, 1, 1)
2024-11-28 09:43:57,925 [WARNING] Dropping checkpoint parameter model.model.77.m.0.bias with shape (255,), which is inconsistent with cell shape (21,)
2024-11-28 09:43:57,925 [WARNING] Dropping checkpoint parameter model.model.77.m.1.weight with shape (255, 256, 1, 1), which is inconsistent with cell shape (21, 256, 1, 1)
2024-11-28 09:43:57,926 [WARNING] Dropping checkpoint parameter model.model.77.m.1.bias with shape (255,), which is inconsistent with cell shape (21,)
2024-11-28 09:43:57,926 [WARNING] Dropping checkpoint parameter model.model.77.m.2.weight with shape (255, 512, 1, 1), which is inconsistent with cell shape (21, 512, 1, 1)
2024-11-28 09:43:57,926 [WARNING] Dropping checkpoint parameter model.model.77.m.2.bias with shape (255,), which is inconsistent with cell shape (21,)
2024-11-28 09:43:57,926 [WARNING] Dropping checkpoint parameter model.model.77.im.0.implicit with shape (1, 255, 1, 1), which is inconsistent with cell shape (1, 21, 1, 1)
2024-11-28 09:43:57,926 [WARNING] Dropping checkpoint parameter model.model.77.im.1.implicit with shape (1, 255, 1, 1), which is inconsistent with cell shape (1, 21, 1, 1)
2024-11-28 09:43:57,926 [WARNING] Dropping checkpoint parameter model.model.77.im.2.implicit with shape (1, 255, 1, 1), which is inconsistent with cell shape (1, 21, 1, 1)
[WARNING] ME(337942:281469431465216,MainProcess):2024-11-28-09:43:57.945.768 [mindspore/train/serialization.py:1560] For 'load_param_into_net', 9 parameters in the 'net' are not loaded, because they are not in the 'parameter_dict', please check whether the network structure is consistent when training and loading checkpoint.
[WARNING] ME(337942:281469431465216,MainProcess):2024-11-28-09:43:57.945.909 [mindspore/train/serialization.py:1564] ['model.model.77.m.0.weight', 'model.model.77.m.0.bias', 'model.model.77.m.1.weight', 'model.model.77.m.1.bias', 'model.model.77.m.2.weight', 'model.model.77.m.2.bias', 'model.model.77.im.0.implicit', 'model.model.77.im.1.implicit', 'model.model.77.im.2.implicit'] are not loaded.
2024-11-28 09:43:57,946 [INFO] Pretrain model load from "../../models/yolov7-tiny_300e.ckpt" success.
2024-11-28 09:44:10,634 [INFO] ema_weight not exist, default pretrain weight is currently used.
2024-11-28 09:44:10,693 [INFO] Dataset Cache file hash/version check success.
2024-11-28 09:44:10,694 [INFO] Load dataset cache from [../../dataset/WorkCloth/train.cache.npy] success.
Scanning '../../dataset/WorkCloth/train.cache.npy' images and labels... 2395 found, 0 missing, 0 empty, 0 corrupted: 100%|█| 2395/2395 [00:00<?, ?it/s]
2024-11-28 09:44:10,724 [INFO] Dataloader num parallel workers: [4]
2024-11-28 09:44:12,185 [INFO] Registry(name=callback, total=4)
2024-11-28 09:44:12,185 [INFO] (0): YoloxSwitchTrain in mindyolo/utils/callback.py
2024-11-28 09:44:12,185 [INFO] (1): EvalWhileTrain in mindyolo/utils/callback.py
2024-11-28 09:44:12,185 [INFO] (2): SummaryCallback in mindyolo/utils/callback.py
2024-11-28 09:44:12,185 [INFO] (3): ProfilerCallback in mindyolo/utils/callback.py
2024-11-28 09:44:12,185 [INFO]
2024-11-28 09:44:12,299 [INFO] got 1 active callback as follows:
2024-11-28 09:44:12,300 [INFO] SummaryCallback()
2024-11-28 09:44:12,300 [WARNING] The first epoch will be compiled for the graph, which may take a long time; You can come back later :).
Warning: tiling offset out of range, index: 32
Warning: tiling offset out of range, index: 32
Warning: tiling offset out of range, index: 32
Warning: tiling offset out of range, index: 32
Warning: tiling offset out of range, index: 32
[ERROR] DEVICE(337942,fffcf347f1a0,python):2024-11-28-09:53:18.692.462 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:232] TaskExceptionCallback] Run Task failed, task_id: 0, stream_id: 548, tid: 338667, device_id: 0, retcode: 507018 (aicpu exception)
[ERROR] GE_ADPT(337942,fffcf347f1a0,python):2024-11-28-09:53:18.795.720 [mindspore/ccsrc/transform/graph_ir/graph_runner.cc:371] RunGraphWithStreamAsync] Call GE RunGraphWithStreamAsync Failed, ret is: 4294967295
Traceback (most recent call last):
  File "../../mindyolo-master/train.py", line 330, in <module>
    train(args)
  File "../../mindyolo-master/train.py", line 285, in train
    trainer.train(
  File "/home/data/jupyter/ai_project/mindyolo-master/mindyolo/utils/trainer_factory.py", line 170, in train
    run_context.loss, run_context.lr = self.train_step(imgs, labels, segments,
  File "/home/data/jupyter/ai_project/mindyolo-master/mindyolo/utils/trainer_factory.py", line 366, in train_step
    loss, loss_item, _, grads_finite = self.train_step_fn(imgs, labels, True)
  File "/usr/local/python3.8/lib/python3.8/site-packages/mindspore/common/api.py", line 941, in staging_specialize
    out = _MindsporeFunctionExecutor(func, hash_obj, dyn_args, process_obj, jit_config)(*args, **kwargs)
  File "/usr/local/python3.8/lib/python3.8/site-packages/mindspore/common/api.py", line 185, in wrapper
    results = fn(*arg, **kwargs)
  File "/usr/local/python3.8/lib/python3.8/site-packages/mindspore/common/api.py", line 572, in __call__
    output = self._graph_executor(tuple(new_inputs), phase)
RuntimeError: Exec graph failed

Ascend Error Message:

EZ9999: Inner Error!
EZ9999: 2024-11-28-09:53:18.691.987 Kernel task happen error, retCode=0x2a, [aicpu exception].[FUNC:PreCheckTaskErr][FILE:task_info.cc][LINE:1776][THREAD:338667]
TraceBack (most recent call last):
Aicpu kernel execute failed, device_id=0, stream_id=548, task_id=0, errorCode=2a.[FUNC:PrintAicpuErrorInfo][FILE:task_info.cc][LINE:1579][THREAD:338667]
AICPU Kernel task happen error, retCode=0x2a.[FUNC:GetError][FILE:stream.cc][LINE:1512][THREAD:338667]
Aicpu kernel execute failed, device_id=0, stream_id=548, task_id=0, fault op_name=[FUNC:GetError][FILE:stream.cc][LINE:1512][THREAD:338667]
rtStreamSynchronize execute failed, reason=[aicpu exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53][THREAD:338667]
Call rtStreamSynchronize(stream) fail, ret: 0x7BC8A[FUNC:LaunchKernelCustAicpuSo][FILE:model_manager.cc][LINE:1698][THREAD:338667]
GraphManager RunGrapWithStreamhAsync failed,session id = 0, graph id = 2, stream = 0xaaad7b705a70.[FUNC:RunGraphWithStreamAsync][FILE:inner_session.cc][LINE:513][THREAD:338667]
[Run][Graph]Run graph with stream asyn failed, error code = 507018, session id = 0,graph id = 2, stream = 0xaaad7b705a70.[FUNC:RunGraphWithStreamAsync][FILE:ge_api.cc][LINE:800][THREAD:338667]

(Please search "CANN Common Error Analysis" at https://www.mindspore.cn for error code description)
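
The return codes in the messages above appear in both decimal and hexadecimal form, which makes them easy to misread as different errors. The sketch below pulls the codes out of three lines shortened from this log and normalizes them to integers. The `extract_codes` helper and its regular expression are illustrative assumptions, not part of MindSpore or CANN.

```python
import re

# Three lines shortened from the log above.
LOG_LINES = [
    "Run Task failed, task_id: 0, stream_id: 548, tid: 338667, device_id: 0, retcode: 507018 (aicpu exception)",
    "Call rtStreamSynchronize(stream) fail, ret: 0x7BC8A",
    "Kernel task happen error, retCode=0x2a, [aicpu exception].",
]

# Hypothetical helper: matches "ret", "retcode" or "retCode" followed by ":" or "=".
CODE_PATTERN = re.compile(r"\bret(?:code)?\s*[:=]\s*(0x[0-9a-fA-F]+|\d+)", re.IGNORECASE)


def extract_codes(lines):
    """Return every return code found in the lines as an int (hex or decimal)."""
    codes = []
    for line in lines:
        for raw in CODE_PATTERN.findall(line):
            codes.append(int(raw, 0))  # base 0 accepts both "507018" and "0x7BC8A"
    return codes


codes = extract_codes(LOG_LINES)
print(codes)          # [507018, 507018, 42]
print(hex(codes[0]))  # 0x7bc8a
```

So `retcode: 507018` and `ret: 0x7BC8A` are the same value, which the runtime labels "aicpu exception", while `retCode=0x2a` (42) is the separate code reported for the AICPU kernel task itself.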

C++ Call Stack: (For framework developers)

mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:1332 RunGraphRefMode
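
A side note on the "Dropping checkpoint parameter" warnings earlier in the log: the shapes 255 and 21 are what the head's output channel count would be for different class counts, so these warnings look like the expected result of fine-tuning a pretrained checkpoint on a 2-class dataset rather than a symptom of the AICPU failure. The arithmetic below is a minimal sketch, assuming the usual anchor-based YOLO layout in which each anchor predicts 4 box values, 1 objectness score and one score per class. The 80-class figure for the checkpoint is inferred from the 255 in the log, not stated in it.

```python
def head_out_channels(num_classes: int, anchors_per_scale: int) -> int:
    """Output channels of one detection layer: anchors * (4 box + 1 obj + classes)."""
    return anchors_per_scale * (num_classes + 5)


# network.anchors from the logged config: 6 numbers per scale = 3 (w, h) pairs.
anchors = [
    [10, 13, 16, 30, 33, 23],
    [30, 61, 62, 45, 59, 119],
    [116, 90, 156, 198, 373, 326],
]
na = len(anchors[0]) // 2

checkpoint_channels = head_out_channels(80, na)  # assumed 80-class pretrained weights
dataset_channels = head_out_channels(2, na)      # data.nc = 2 in the logged config

print(checkpoint_channels)  # 255
print(dataset_channels)     # 21
```

Only the nine head parameters differ in shape, which matches the nine parameters the log reports as not loaded.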

Describe the expected behavior

The error is resolved and the model trains normally.

Steps to reproduce the issue

1. conda activate md_py3911
2. Change into the corresponding directory and run python RunTrainCommand.py
3. The kernel error is triggered.
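
The contents of RunTrainCommand.py are not shown in this issue. As a rough, hypothetical reconstruction, a wrapper like it could assemble the train.py command line from the values visible in the log. The option names below mirror the keys printed by parse_args (config, device_target, ms_mode); `build_train_command` itself is an illustrative name. Exposing `ms_mode` makes it easy to retry in PyNative mode (1) instead of graph mode (0), which may help narrow down which operator fails.

```python
import shlex
import sys


def build_train_command(config, ms_mode=0, device_target="Ascend",
                        train_script="../../mindyolo-master/train.py"):
    """Build the argv list for train.py (hypothetical wrapper, paths taken from the log)."""
    return [
        sys.executable, train_script,
        "--config", config,
        "--device_target", device_target,
        "--ms_mode", str(ms_mode),  # 0 = graph mode (as in the log), 1 = PyNative mode
    ]


cmd = build_train_command("../../configs/yolov7/yolov7-tiny_xyz_workcloth.yaml", ms_mode=1)
print(shlex.join(cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.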

Related log / screenshot

Special notes for this issue


MuDaMuDaMuDa2 · 2024-11-28T10:29:47Z

Reinstalling the development toolkit did not help either.

zhouyifeng888 · 2024-12-17T09:33:31Z

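For this issue, see the following thread in the MindSpore section of the Ascend community forum:
https://www.hiascend.com/forum/thread-0281168172389447092-1-1.html

Broadly, it is an environment-setup problem: the CANN version, the Ascend driver version, and related components need to match. Below is a Docker image that has been tested to run mindyolo. It is mentioned in the thread above and is provided for reference:
swr.cn-southwest-2.myhuaweicloud.com/atelier/mindspore_2_3_ascend:mindspore_2.3.0-cann_8.0.rc1-py_3.9-euler_2.10.7-aarch64-snt9b-20240525100222-259922e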
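
The tag of the image above appears to encode the component versions it was built with. The sketch below extracts them so they can be compared against a local installation. The `name_version` convention is inferred from this one tag, and `parse_image_tag` is an illustrative helper, not an official tool.

```python
import re
import sys

IMAGE = (
    "swr.cn-southwest-2.myhuaweicloud.com/atelier/mindspore_2_3_ascend:"
    "mindspore_2.3.0-cann_8.0.rc1-py_3.9-euler_2.10.7-aarch64-snt9b-"
    "20240525100222-259922e"
)


def parse_image_tag(image: str) -> dict:
    """Extract the name_version pairs encoded in the tag (the part after ':')."""
    tag = image.rsplit(":", 1)[1]
    # Components are separated by "-"; each version starts with a digit.
    pairs = re.findall(r"(mindspore|cann|py|euler)_([0-9][0-9A-Za-z.]*)", tag)
    return dict(pairs)


versions = parse_image_tag(IMAGE)
print(versions)

# Compare the image's Python version with the interpreter running this script.
local_python = f"{sys.version_info.major}.{sys.version_info.minor}"
print("image Python:", versions["py"], "| local Python:", local_python)
```

For this tag the result is MindSpore 2.3.0, CANN 8.0.rc1, Python 3.9 and EulerOS 2.10.7, which differs from the Ubuntu 22.04 setup described in the report.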
