[Bug] Multi-GPU evaluation does not work #1828
Comments
When TP is enabled, the log does say ON GPU0, ... 7, but checking resource usage shows that only one GPU is actually running.
You can configure the sharding logic directly in the config file and then launch it from outside.
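A minimal sketch of what such config-side sharding might look like, based on the `NumWorkerPartitioner` mentioned later in this thread (the import paths and the `LocalRunner`/`OpenICLInferTask` names are assumptions following common OpenCompass conventions, not taken from this report):

```python
# Hedged sketch: data-parallel sharding declared in the eval config itself,
# rather than via command-line flags. Import paths are assumptions.
from opencompass.partitioners import NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

infer = dict(
    # Split the datasets into 8 shards, one per worker.
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(
        type=LocalRunner,
        max_num_workers=8,  # run up to 8 inference tasks in parallel
        task=dict(type=OpenICLInferTask),
    ),
)
```

Note this is data parallelism: each worker holds a full model copy and evaluates a shard of the data, which is distinct from the TP question raised below.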
Isn't this just DP set to 8? If my model is 70B or 33B, how do I enable TP? Adding --max-num-workers 8 --hf-num-gpus 8 to python run.py does not seem to do anything.
I added the following to the config file
TP is set in the model's config; for reference, see
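A hedged sketch of a per-model TP setting, using the `run_cfg=dict(num_gpus=...)` field mentioned in the reply below (the model class name, `abbr`, `path`, and other fields are illustrative assumptions, not the maintainer's exact reference):

```python
# Hedged sketch: tensor parallelism requested per model via run_cfg,
# so one worker spans several GPUs instead of one worker per GPU.
from opencompass.models import HuggingFacewithChatTemplate  # name is an assumption

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='llama-3-70b-instruct-hf',          # hypothetical
        path='meta-llama/Meta-Llama-3-70B-Instruct',  # hypothetical
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=8),  # one task sharded across 8 devices
    )
]
```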
Do you mean changing run_cfg=dict(num_gpus)? I changed that before and it made no difference. Also, the NumWorkerPartitioner from your previous reply raises an error for me.
Prerequisite
Type
I'm evaluating with the officially supported tasks/models/datasets.
Environment
The environment is an Ascend NPU.
{'CUDA available': False,
'GCC': 'gcc (GCC) 7.3.0',
'MMEngine': '0.9.1',
'OpenCV': '4.8.0',
'PyTorch': '2.1.0',
'PyTorch compiling details': 'PyTorch built with:\n'
' - GCC 10.2\n'
' - C++ Version: 201703\n'
' - Intel(R) MKL-DNN v3.1.1 (Git Hash '
'64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n'
' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
' - LAPACK is enabled (usually provided by '
'MKL)\n'
' - NNPACK is enabled\n'
' - CPU capability usage: NO AVX\n'
' - Build settings: BLAS_INFO=open, '
'BUILD_TYPE=Release, '
'CXX_COMPILER=/opt/rh/devtoolset-10/root/usr/bin/c++, '
'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
'-fabi-version=11 -fvisibility-inlines-hidden '
'-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
'-DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER '
'-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
'-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
'-O2 -fPIC -Wall -Wextra -Werror=return-type '
'-Werror=non-virtual-dtor -Werror=bool-operation '
'-Wnarrowing -Wno-missing-field-initializers '
'-Wno-type-limits -Wno-array-bounds '
'-Wno-unknown-pragmas -Wno-unused-parameter '
'-Wno-unused-function -Wno-unused-result '
'-Wno-strict-overflow -Wno-strict-aliasing '
'-Wno-stringop-overflow -Wno-psabi '
'-Wno-error=pedantic -Wno-error=old-style-cast '
'-Wno-invalid-partial-specialization '
'-Wno-unused-private-field '
'-Wno-aligned-allocation-unavailable '
'-Wno-missing-braces -fdiagnostics-color=always '
'-faligned-new -Wno-unused-but-set-variable '
'-Wno-maybe-uninitialized -fno-math-errno '
'-fno-trapping-math -Werror=format '
'-Werror=cast-function-type '
'-Wno-stringop-overflow, LAPACK_INFO=open, '
'TORCH_DISABLE_GPU_ASSERTS=ON, '
'TORCH_VERSION=2.1.0, USE_CUDA=OFF, '
'USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, '
'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
'USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=ON, '
'USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, '
'USE_OPENMP=ON, USE_ROCM=OFF, \n',
'Python': '3.9.10 | packaged by conda-forge | (main, Feb 1 2022, 21:53:27) '
'[GCC 9.4.0]',
'TorchVision': '0.16.0',
'lmdeploy': "not installed:No module named 'lmdeploy'",
'numpy_random_seed': 2147483648,
'opencompass': '0.3.9+',
'sys.platform': 'linux',
'transformers': '4.43.2'}
Reproduces the problem - code/configuration sample
from mmengine.config import read_base

with read_base():
    # dataset and model config imports (elided in the original report)
    ...

work_dir = "outputs/Llama3_1_and_Qwen2_5-7B-Instruct"
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
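As an aside on the aggregation idiom used above: `sum(..., [])` with an empty-list start value concatenates every list whose variable name matches the given suffix. A self-contained illustration with hypothetical variable names:

```python
# Two hypothetical dataset lists, standing in for what a config's
# read_base() imports would bring into scope.
gsm8k_datasets = [{'abbr': 'gsm8k'}]
humaneval_datasets = [{'abbr': 'humaneval'}]

# Snapshot locals() with list() before iterating, so assigning the
# result does not mutate the namespace mid-iteration.
datasets = sum(
    (v for k, v in list(locals().items()) if k.endswith('_datasets')),
    [],
)
print([d['abbr'] for d in datasets])  # concatenation of both lists
```

The `datasets` result variable itself does not end with `_datasets`, so the idiom never accidentally includes its own output on a re-run.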
Reproduces the problem - command or script
python run.py configs/eval_OC15_llama3.1_qwen2_custom_gen.py --max-num-workers 8 --num-gpus 8
Reproduces the problem - error message
Other information
I understand that a 7B model only needs a single GPU, but why does only one GPU actually run when I specify distributing across 8 GPUs? Also, setting DP to 8 seems to have no effect either.