[Bug] Cannot run multi-card evaluation #1828

Open
2 tasks done
GenerallyCovetous opened this issue Jan 16, 2025 · 6 comments

@GenerallyCovetous

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

The environment uses Ascend NPU cards.
{'CUDA available': False,
'GCC': 'gcc (GCC) 7.3.0',
'MMEngine': '0.9.1',
'OpenCV': '4.8.0',
'PyTorch': '2.1.0',
'PyTorch compiling details': 'PyTorch built with:\n'
' - GCC 10.2\n'
' - C++ Version: 201703\n'
' - Intel(R) MKL-DNN v3.1.1 (Git Hash '
'64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n'
' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
' - LAPACK is enabled (usually provided by '
'MKL)\n'
' - NNPACK is enabled\n'
' - CPU capability usage: NO AVX\n'
' - Build settings: BLAS_INFO=open, '
'BUILD_TYPE=Release, '
'CXX_COMPILER=/opt/rh/devtoolset-10/root/usr/bin/c++, '
'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
'-fabi-version=11 -fvisibility-inlines-hidden '
'-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
'-DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER '
'-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
'-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
'-O2 -fPIC -Wall -Wextra -Werror=return-type '
'-Werror=non-virtual-dtor -Werror=bool-operation '
'-Wnarrowing -Wno-missing-field-initializers '
'-Wno-type-limits -Wno-array-bounds '
'-Wno-unknown-pragmas -Wno-unused-parameter '
'-Wno-unused-function -Wno-unused-result '
'-Wno-strict-overflow -Wno-strict-aliasing '
'-Wno-stringop-overflow -Wno-psabi '
'-Wno-error=pedantic -Wno-error=old-style-cast '
'-Wno-invalid-partial-specialization '
'-Wno-unused-private-field '
'-Wno-aligned-allocation-unavailable '
'-Wno-missing-braces -fdiagnostics-color=always '
'-faligned-new -Wno-unused-but-set-variable '
'-Wno-maybe-uninitialized -fno-math-errno '
'-fno-trapping-math -Werror=format '
'-Werror=cast-function-type '
'-Wno-stringop-overflow, LAPACK_INFO=open, '
'TORCH_DISABLE_GPU_ASSERTS=ON, '
'TORCH_VERSION=2.1.0, USE_CUDA=OFF, '
'USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, '
'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
'USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=ON, '
'USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, '
'USE_OPENMP=ON, USE_ROCM=OFF, \n',
'Python': '3.9.10 | packaged by conda-forge | (main, Feb 1 2022, 21:53:27) '
'[GCC 9.4.0]',
'TorchVision': '0.16.0',
'lmdeploy': "not installed:No module named 'lmdeploy'",
'numpy_random_seed': 2147483648,
'opencompass': '0.3.9+',
'sys.platform': 'linux',
'transformers': '4.43.2'}

Reproduces the problem - code/configuration sample

from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.mmlu.mmlu_gen_4d595a import mmlu_datasets
    from opencompass.configs.models.hf_llama.hf_llama3_1_8b_instruct import models as hf_llama3_1_8b_instruct_model
    from opencompass.configs.models.qwen2_5.hf_qwen2_5_7b_instruct import models as hf_qwen2_5_7b_instruct_model

work_dir = "outputs/Llama3_1_and_Qwen2_5-7B-Instruct"

# Collect every imported dataset list and model list into flat lists.
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])

Reproduces the problem - command or script

python run.py configs/eval_OC15_llama3.1_qwen2_custom_gen.py --max-num-workers 8 --num-gpus 8

Reproduces the problem - error message

[image: screenshot of the error message, not available as text]

Other information

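I know a 7B model only needs a single card, but why is only one card actually running when I specify that the job be distributed across 8 cards? Also, setting DP to 8 doesn't seem to help either.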

@GenerallyCovetous
Author

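With TP enabled, the log says ON GPU0, ... 7, but when I check resource usage, still only one card is actually running.
I also tried enabling DP with --max-num-workers 8 and hit a similar problem: it does split the evaluation set, but it seems to run one 1/8 shard on a single card and then run the next 1/8 shard serially, so there is no speedup.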
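
For reference, the behaviour expected from DP here can be sketched in plain Python. This is a conceptual illustration only, not OpenCompass code: `split_into_shards`, `evaluate_shard`, and `run_data_parallel` are hypothetical names, and threads stand in for the per-device worker processes a real runner would launch.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_shards(items, num_shards):
    """Split items into num_shards contiguous, near-equal shards."""
    base, extra = divmod(len(items), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)
        shards.append(items[start:start + size])
        start += size
    return shards


def evaluate_shard(device_id, shard):
    """Hypothetical stand-in for running inference on one device."""
    # A real worker would pin itself to device_id before loading the model.
    return device_id, len(shard)


def run_data_parallel(items, device_ids):
    """Give each device its own shard and run all shards concurrently."""
    shards = split_into_shards(items, len(device_ids))
    with ThreadPoolExecutor(max_workers=len(device_ids)) as pool:
        futures = [pool.submit(evaluate_shard, d, s)
                   for d, s in zip(device_ids, shards)]
        return [f.result() for f in futures]


# 20 samples over 8 devices: every device gets a shard at the same time.
results = run_data_parallel(list(range(20)), device_ids=list(range(8)))
```

The behaviour described in this comment corresponds to the same shards being processed one after another on a single device, which removes the benefit of splitting.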

@MaiziXiao
Collaborator

infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)

You can configure the sharding logic directly in the config file, then simply run python run.py configs/eval_OC15_llama3.1_qwen2_custom_gen.py from the command line.

@GenerallyCovetous
Author

This enables DP with 8 workers, right? Then how do I enable TP once my model grows to 33B or 70B? Adding --max-num-workers 8 --hf-num-gpus 8 to python run.py doesn't seem to have any effect.

@GenerallyCovetous
Author

I added the following to the config file:
from opencompass.partitioners.num_worker import NumWorkerPartitioner
from opencompass.runners.local import LocalRunner
from opencompass.tasks.openicl_infer import OpenICLInferTask
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
and then got this error:
Traceback (most recent call last):
  File "/opencompass-main/run.py", line 4, in <module>
    main()
  File "/opencompass-main/opencompass/cli/main.py", line 231, in main
    cfg = get_config_from_arg(args)
  File "/opencompass-main/opencompass/utils/run.py", line 97, in get_config_from_arg
    config = Config.fromfile(args.config, format_python_code=False)
  File "/anaconda3/envs/opencompass/lib/python3.10/site-packages/mmengine/config/config.py", line 494, in fromfile
    raise e
  File "/anaconda3/envs/opencompass/lib/python3.10/site-packages/mmengine/config/config.py", line 492, in fromfile
    cfg_dict, imported_names = Config._parse_lazy_import(filename)
  File "/anaconda3/envs/opencompass/lib/python3.10/site-packages/mmengine/config/config.py", line 1081, in _parse_lazy_import
    _base_cfg_dict, _base_imported_names = Config._parse_lazy_import( # noqa: E501
  File "/anaconda3/envs/opencompass/lib/python3.10/site-packages/mmengine/config/config.py", line 1109, in _parse_lazy_import
    exec(
  File "/opencompass-main/opencompass/partitioners/num_worker.py", line 16, in <module>
    @PARTITIONERS.register_module()
  File "/anaconda3/envs/opencompass/lib/python3.10/site-packages/mmengine/config/lazy.py", line 205, in __call__
    raise RuntimeError()
RuntimeError
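
One possible reading of this traceback, stated as an assumption and not confirmed anywhere in this thread: the recursive `_parse_lazy_import` call followed by `exec` on `num_worker.py` suggests the three component imports were placed inside the `with read_base():` block, so mmengine tried to load `num_worker.py` as a base config. Under that assumption, a layout like the sketch below keeps the component imports at module level and leaves only dataset and model configs inside `read_base()`. All names are taken from the snippets in this thread; whether this resolves the error on an Ascend NPU setup is untested.

```python
from mmengine.config import read_base

# Assumption: component classes are imported at module level, outside
# `with read_base():`, so they are not parsed as base config files.
from opencompass.partitioners.num_worker import NumWorkerPartitioner
from opencompass.runners.local import LocalRunner
from opencompass.tasks.openicl_infer import OpenICLInferTask

with read_base():
    # Only config modules (datasets, models) are imported inside this block.
    from opencompass.configs.datasets.mmlu.mmlu_gen_4d595a import mmlu_datasets
    from opencompass.configs.models.qwen2_5.hf_qwen2_5_7b_instruct import models as hf_qwen2_5_7b_instruct_model

datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])

# Sharding configuration exactly as suggested by the maintainer above.
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
```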

@MaiziXiao
Collaborator

TP is set in the model's config; see https://github.com/open-compass/opencompass/blob/main/opencompass/configs/models/qwen2_5/lmdeploy_qwen2_5_72b_instruct.py for reference.
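
To make the DP/TP distinction in this thread concrete, here is a small conceptual sketch of tensor parallelism in plain Python. It is not OpenCompass or lmdeploy code; `matmul`, `split_columns`, and `tensor_parallel_matmul` are hypothetical names. The idea illustrated is column-parallel splitting: each device holds a slice of one weight matrix, computes its part of the output, and the parts are concatenated.

```python
def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


def split_columns(w, tp):
    """Split the columns of w into tp contiguous blocks, one per device."""
    n = len(w[0])
    assert n % tp == 0, "column count must be divisible by tp in this sketch"
    step = n // tp
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(tp)]


def tensor_parallel_matmul(x, w, tp):
    """Each 'device' multiplies x by its column block; outputs are concatenated."""
    partial_outputs = [matmul(x, block) for block in split_columns(w, tp)]
    return [value for part in partial_outputs for value in part]


x = [1, 2]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
full = matmul(x, w)                            # single-device result
sharded = tensor_parallel_matmul(x, w, tp=2)   # same result from two shards
```

Unlike DP, where each device holds a full model copy and a different data shard, every device here sees the same input and holds only part of the weights, which is why TP is a property of how the model is loaded.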

@GenerallyCovetous
Author

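Do you mean changing run_cfg=dict(num_gpus)? I already tried changing that and it had no effect. Also, the NumWorkerPartitioner from your previous reply raises an error for me.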
