Add Ascend NPU support #975

Merged 1 commit on Sep 20, 2023
Conversation

ji-huazhong (Contributor)

What does this PR do?

This PR integrates Ascend NPU hardware support into LLaMA-Efficient-Tuning, enabling users to leverage NPUs for training, inference, and serving of LLMs.

Ascend NPU is an AI processor that supports AI frameworks such as PyTorch and TensorFlow, and it has already been integrated by Hugging Face, DeepSpeed, and other popular LLM-related software and tools.

As mentioned above, Ascend NPU already works with libraries such as Transformers and Accelerate, so you can follow the fine-tuning instructions in the README to run LLM fine-tuning tasks directly on Ascend NPU.
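For illustration only, here is a minimal sketch of running a Transformers model on an Ascend NPU, assuming the torch_npu plugin is installed; the model name is a placeholder, and the actual fine-tuning entry points are the scripts described in the README:

import torch
import torch_npu  # Ascend plugin; registers the "npu" device type on import
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.to("npu:0")  # same .to() API as CUDA, just a different device string

inputs = tokenizer("Hello, Ascend!", return_tensors="pt").to("npu:0")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))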

The model export phase requires adding Ascend NPU adaptation logic, which is addressed in this patch.

hiyouga (Owner) commented Sep 20, 2023

Why can't the NPU use the bfloat16 data type?

hiyouga added the pending label on Sep 20, 2023
hiyouga self-requested a review on Sep 20, 2023
ji-huazhong (Contributor, Author)

Support for bfloat16 is still in the works and will likely be available by the end of the year 😃

hiyouga (Owner) commented Sep 20, 2023

I have added torch.cuda.is_bf16_supported() to determine whether bfloat16 is supported by the computation hardware. Do you mean it does not work in the NPU environment?

if is_torch_npu_available():
    infer_dtype = torch.float16
else:
    infer_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16  # detect cuda capability
hiyouga (Owner)
This if statement evaluates whether the current computing environment supports the bf16 data type.

ji-huazhong (Contributor, Author)

torch.cuda.is_bf16_supported will cause a runtime error in an NPU environment.
FYI, to use PyTorch on an NPU, users need to install both the CPU build of torch and torch-npu; as a result, CUDA-related code is not included.
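To make the failure mode concrete, here is a minimal sketch of a dtype check that never touches torch.cuda on NPU machines; it assumes the is_torch_npu_available helper from transformers.utils:

import torch
from transformers.utils import is_torch_npu_available

def pick_infer_dtype() -> torch.dtype:
    # Short-circuit on NPU first: the CUDA backend is absent there, so
    # calling torch.cuda.is_bf16_supported() raises a runtime error.
    if is_torch_npu_available():
        return torch.float16  # NPU bf16 support was still in development
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16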

hiyouga merged commit ac8648b into hiyouga:main on Sep 20, 2023
hiyouga added the solved label and removed the pending label on Sep 21, 2023
shirwy commented Nov 10, 2023

If I select MULTI_NPU in accelerate, can I train on NPUs directly without modifying any code? On my side, training fails with a timeout error:

npuSynchronizeDevice:/usr1/02/workspace/j_vqN6BFvg/pytorch/torch_npu/csrc/core/npu/NPUStream.cpp:370 NPU error, error code is 107020.
EI0002: The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank [4294967295].base information: [streamID:[1], taskID[39], taskType[Notify Wait], tag[AllReduce_192.168.0.116%enp69s0_60000_0_1694500287022627].] task information: [notify id:[0x0000000500000010], stage:[0], remote rank:[local].]
Possible Cause:
1. An exception occurs during the execution on some NPUs in the cluster. As a result, the collective communication operation failed.
2. The execution speed on some NPUs in the cluster is too slow to complete a communication operation within the timeout interval (default 1800s; you can set the interval using HCCL_EXEC_TIMEOUT).
3. The number of training samples on each NPU is inconsistent.
4. Packet loss or other connectivity problems occur on the communication link.
Solution:
1. If this error is reported on part of these ranks, check other ranks to see whether other errors have been reported earlier.
2. If this error is reported for all ranks, check whether the error reporting time is consistent (the maximum difference must not exceed 1800s). If not, locate the cause, or set the HCCL_EXEC_TIMEOUT environment variable to a larger value.
3. Check whether the completion queue element (CQE) of the error exists in the plog (grep -rn 'error cqe'). If so, check the network connection status. (For details, see the TLS command and HCCN connectivity check examples.)
4. Ensure that the number of training samples on each NPU is consistent. For details: https://www.hiascend.com/document
TraceBack (most recent call last):
Task execute failed, device_id=5, stream_id=4, task_id=4, flip_num=0, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1483]
The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank [4294967295].base information: [streamID:[1], taskID[39], taskType[Notify Wait], tag[AllReduce_192.168.0.116%enp69s0_60000_0_1694500287022627].] task information: [notify id:[0x0000000500000010], stage:[0], remote rank:[local].]
Notify wait execute failed, device_id=5, stream_id=1, task_id=39, flip_num=0, notify_id=2[FUNC:GetError][FILE:stream.cc][LINE:1483]
The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank [4].base information: [streamID:[63], taskID[17], taskType[Notify Wait], tag[].] task information: [notify id:[0x0000000500000030], stage:[ffffffff], remote rank:[4].]
Notify wait execute failed, device_id=5, stream_id=63, task_id=17, flip_num=0, notify_id=6[FUNC:GetError][FILE:stream.cc][LINE:1483]
The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank [4].base information: [streamID:[62], taskID[14], taskType[Notify Wait], tag[].] task information: [notify id:[0x0000000500000020], stage:[ffffffff], remote rank:[4].]
Notify wait execute failed, device_id=5, stream_id=62, task_id=14, flip_num=0, notify_id=4[FUNC:GetError][FILE:stream.cc][LINE:1483]
The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank [4294967295].base information: [streamID:[61], taskID[36], taskType[Notify Wait], tag[].] task information: [notify id:[0x0000000500000048], stage:[2], remote rank:[local].]
Notify wait execute failed, device_id=5, stream_id=61, task_id=36, flip_num=0, notify_id=9[FUNC:GetError][FILE:stream.cc][LINE:1483]
rtDeviceSynchronize execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
wait for compute device to finish failed, runtime result = 107020.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
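As the log itself suggests, one workaround for slow ranks is to raise the HCCL timeout before the NPU runtime initializes; a minimal sketch (3600 is an arbitrary example value in seconds, the default being 1800):

import os

# Must be set before torch_npu / distributed initialization so that
# HCCL picks it up; applies to every rank in the job.
os.environ.setdefault("HCCL_EXEC_TIMEOUT", "3600")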

hiyouga (Owner) commented Nov 10, 2023

@shirwy I suggest trying a single card first.
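For instance, a minimal single-card sanity check before scaling out to MULTI_NPU, assuming the torch_npu plugin is installed:

import torch
import torch_npu  # registers the "npu" device type with PyTorch on import

torch.npu.set_device(0)             # pin a single card
x = torch.ones(2, 2, device="npu")  # allocate on the pinned NPU
print((x * 2).cpu())                # a tensor of twos means the card works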
