Add Ascend NPU support #975

Merged 1 commit on Sep 20, 2023
Conversation

ji-huazhong (Contributor)

What does this PR do?

This PR integrates Ascend NPU hardware support into LLaMA-Efficient-Tuning, enabling users to leverage NPUs for training, inference, and serving of LLMs.

Ascend NPU is an AI processor that supports AI frameworks such as PyTorch and TensorFlow, and it has already been integrated by Hugging Face, DeepSpeed, and other popular LLM-related software and tools.

As mentioned above, Ascend NPU already works with libraries such as Transformers and Accelerate, so you can follow the fine-tuning instructions in the README to run LLM fine-tuning tasks directly on Ascend NPU.
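For illustration only, here is a minimal sketch of running a Transformers model on an Ascend NPU, assuming the torch_npu plugin is installed; the model name is a placeholder, and the actual fine-tuning entry points are the scripts described in the README:

import torch
import torch_npu  # Ascend plugin; registers the "npu" device type on import
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.to("npu:0")  # same .to() API as CUDA, just a different device string

inputs = tokenizer("Hello, Ascend!", return_tensors="pt").to("npu:0")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))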

The model export phase requires adding Ascend NPU adaptation logic, which is addressed in this patch.

hiyouga (Owner) commented Sep 20, 2023

Why can't the NPU use the bfloat16 data type?

hiyouga added the pending label on Sep 20, 2023
hiyouga self-requested a review on Sep 20, 2023
ji-huazhong (Contributor, Author)

Support for bfloat16 is still in the works and will likely be available by the end of the year 😃

hiyouga (Owner) commented Sep 20, 2023

I have added torch.cuda.is_bf16_supported() to determine whether bfloat16 is supported by the computation hardware. Do you mean it does not work in the NPU environment?

if is_torch_npu_available():
    infer_dtype = torch.float16
else:
    infer_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16  # detect cuda capability
hiyouga (Owner)
This if statement evaluates whether the current computing environment supports the bf16 data type.

ji-huazhong (Contributor, Author)

torch.cuda.is_bf16_supported will cause a runtime error in an NPU environment.
FYI, to use PyTorch on an NPU, users need to install both the CPU build of torch and torch-npu; as a result, CUDA-related code is not included.
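To make the failure mode concrete, here is a minimal sketch of a dtype check that never touches torch.cuda on NPU machines; it assumes the is_torch_npu_available helper from transformers.utils:

import torch
from transformers.utils import is_torch_npu_available

def pick_infer_dtype() -> torch.dtype:
    # Short-circuit on NPU first: the CUDA backend is absent there, so
    # calling torch.cuda.is_bf16_supported() raises a runtime error.
    if is_torch_npu_available():
        return torch.float16  # NPU bf16 support was still in development
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16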

hiyouga merged commit ac8648b into hiyouga:main on Sep 20, 2023
hiyouga added the solved label and removed the pending label on Sep 21, 2023
shirwy commented Nov 10, 2023

If I select MULTI_NPU in accelerate, can I train on NPUs directly without modifying any code? On my side, training fails with a timeout error:

npuSynchronizeDevice:/usr1/02/workspace/j_vqN6BFvg/pytorch/torch_npu/csrc/core/npu/NPUStream.cpp:370 NPU error, error code is 107020.
EI0002: The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank [4294967295].base information: [streamID:[1], taskID[39], taskType[Notify Wait], tag[AllReduce_192.168.0.116%enp69s0_60000_0_1694500287022627].] task information: [notify id:[0x0000000500000010], stage:[0], remote rank:[local].]
Possible Cause:
1. An exception occurs during the execution on some NPUs in the cluster. As a result, the collective communication operation failed.
2. The execution speed on some NPUs in the cluster is too slow to complete a communication operation within the timeout interval (default 1800s; you can set the interval using HCCL_EXEC_TIMEOUT).
3. The number of training samples on each NPU is inconsistent.
4. Packet loss or other connectivity problems occur on the communication link.
Solution:
1. If this error is reported on part of these ranks, check other ranks to see whether other errors have been reported earlier.
2. If this error is reported for all ranks, check whether the error reporting time is consistent (the maximum difference must not exceed 1800s). If not, locate the cause, or set the HCCL_EXEC_TIMEOUT environment variable to a larger value.
3. Check whether the completion queue element (CQE) of the error exists in the plog (grep -rn 'error cqe'). If so, check the network connection status. (For details, see the TLS command and HCCN connectivity check examples.)
4. Ensure that the number of training samples on each NPU is consistent. For details: https://www.hiascend.com/document
TraceBack (most recent call last):
Task execute failed, device_id=5, stream_id=4, task_id=4, flip_num=0, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1483]
The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank [4294967295].base information: [streamID:[1], taskID[39], taskType[Notify Wait], tag[AllReduce_192.168.0.116%enp69s0_60000_0_1694500287022627].] task information: [notify id:[0x0000000500000010], stage:[0], remote rank:[local].]
Notify wait execute failed, device_id=5, stream_id=1, task_id=39, flip_num=0, notify_id=2[FUNC:GetError][FILE:stream.cc][LINE:1483]
The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank [4].base information: [streamID:[63], taskID[17], taskType[Notify Wait], tag[].] task information: [notify id:[0x0000000500000030], stage:[ffffffff], remote rank:[4].]
Notify wait execute failed, device_id=5, stream_id=63, task_id=17, flip_num=0, notify_id=6[FUNC:GetError][FILE:stream.cc][LINE:1483]
The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank [4].base information: [streamID:[62], taskID[14], taskType[Notify Wait], tag[].] task information: [notify id:[0x0000000500000020], stage:[ffffffff], remote rank:[4].]
Notify wait execute failed, device_id=5, stream_id=62, task_id=14, flip_num=0, notify_id=4[FUNC:GetError][FILE:stream.cc][LINE:1483]
The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank [4294967295].base information: [streamID:[61], taskID[36], taskType[Notify Wait], tag[].] task information: [notify id:[0x0000000500000048], stage:[2], remote rank:[local].]
Notify wait execute failed, device_id=5, stream_id=61, task_id=36, flip_num=0, notify_id=9[FUNC:GetError][FILE:stream.cc][LINE:1483]
rtDeviceSynchronize execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
wait for compute device to finish failed, runtime result = 107020.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
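As the log itself suggests, one workaround for slow ranks is to raise the HCCL timeout before the NPU runtime initializes; a minimal sketch (3600 is an arbitrary example value in seconds, the default being 1800):

import os

# Must be set before torch_npu / distributed initialization so that
# HCCL picks it up; applies to every rank in the job.
os.environ.setdefault("HCCL_EXEC_TIMEOUT", "3600")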

hiyouga (Owner) commented Nov 10, 2023

@shirwy I suggest trying a single card first.
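For instance, a minimal single-card sanity check before scaling out to MULTI_NPU, assuming the torch_npu plugin is installed:

import torch
import torch_npu  # registers the "npu" device type with PyTorch on import

torch.npu.set_device(0)             # pin a single card
x = torch.ones(2, 2, device="npu")  # allocate on the pinned NPU
print((x * 2).cpu())                # a tensor of twos means the card works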
