How to use pipeline parallelism to serve a BLOOM model? #3013
gaoxt1983 asked this question in Community | Q&A · Unanswered
- Hi @gaoxt1983, in fact, as the BLOOM example demonstrates, we recommend using TP (tensor parallelism): PP (pipeline parallelism) is inefficient for generation tasks because of the pipeline bubble.
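The bubble argument above can be made concrete with a back-of-the-envelope sketch. The formula below is the standard GPipe-style idle fraction, and `bubble_fraction` is an illustrative helper, not part of EnergonAI:

```python
# Sketch of why pipeline parallelism (PP) suffers during autoregressive
# generation: with p pipeline stages and m micro-batches in flight, the
# GPipe-style bubble (idle) fraction is (p - 1) / (m + p - 1).

def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Fraction of time pipeline stages sit idle under a GPipe schedule."""
    return (stages - 1) / (micro_batches + stages - 1)

# Training with many micro-batches amortizes the bubble:
print(bubble_fraction(stages=4, micro_batches=32))  # ~0.086 (about 8.6% idle)

# Token-by-token decoding is effectively one micro-batch per step,
# so most stages sit idle while a single token flows through:
print(bubble_fraction(stages=4, micro_batches=1))   # 0.75 (75% idle)
```

This is why a schedule that works well for training can leave most GPUs idle when generating one token at a time.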
- Is it normal that generating one token takes 100-120 ms on a node with 8 A100s?
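For context on the reported numbers, a quick arithmetic sketch (illustrative only; the helper name is ours) converts per-token latency into throughput and end-to-end response time:

```python
# 100-120 ms per token corresponds to roughly 8-10 tokens/second,
# so a 256-token completion takes on the order of 26-31 seconds.

def tokens_per_second(ms_per_token: float) -> float:
    """Throughput implied by a fixed per-token decode latency."""
    return 1000.0 / ms_per_token

def response_seconds(ms_per_token: float, num_tokens: int) -> float:
    """End-to-end time for a completion of `num_tokens` tokens."""
    return ms_per_token * num_tokens / 1000.0

print(tokens_per_second(100))        # 10.0 tokens/s
print(tokens_per_second(120))        # ~8.3 tokens/s
print(response_seconds(110, 256))    # ~28.2 s for a 256-token reply
```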
- I have a pretrained BLOOM-175B model. I want to serve it with EnergonAI on a single-node machine with 4 A100 GPUs, so I modified example/bloom/run.sh:
What I observed afterwards was that the 4 GPUs were mostly idle; the processes I monitored looked like this:
So what have I done wrong, and what should I do to achieve pipeline parallelism?
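One way to back up the "GPUs were quite idle" observation with numbers is to sample utilization via `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits` and check it programmatically. Below is a small sketch that parses such CSV output; the `sample` data is hypothetical, and in practice you would capture it with `subprocess.run(...)`:

```python
# Hypothetical output of:
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
# Each line is "<gpu index>, <utilization percent>".
sample = """\
0, 3
1, 0
2, 1
3, 2
"""

def idle_gpus(csv_text: str, threshold: int = 10) -> list:
    """Return indices of GPUs whose utilization is below `threshold` percent."""
    idle = []
    for line in csv_text.strip().splitlines():
        index, util = (int(field) for field in line.split(","))
        if util < threshold:
            idle.append(index)
    return idle

print(idle_gpus(sample))  # [0, 1, 2, 3] -> all four GPUs are nearly idle
```

Sampling this in a loop while a request is in flight distinguishes "GPUs genuinely idle" (a parallelism misconfiguration) from "GPUs briefly idle between pipeline stages".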