The following two tables compare the performance of LightSeq and Faster Transformer (FT), tested on a Tesla T4 with a Transformer-base model. The first table uses beam search decoding; the second uses topk/topp sampling. We also provide TensorFlow (TF) and PyTorch baselines; the TF implementation comes from Faster Transformer. A short sketch of how the speedup columns are derived follows the two tables.

| batch_size | beam_size | seq_len | TF(ms) | FT(ms) | lightseq(ms) | PyTorch(ms) | FT speedup | lightseq speedup | PyTorch speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 4 | 32 | 419.53 | 26.25 | 29.66 | 385.23 | 15.98 | 14.14 | 1.09 |
| 1 | 4 | 64 | 806.38 | 54.02 | 63.04 | 760.77 | 14.93 | 12.79 | 1.06 |
| 8 | 4 | 32 | 439.64 | 35.99 | 34.77 | 416.06 | 12.22 | 12.64 | 1.06 |
| 8 | 4 | 64 | 891.54 | 79.82 | 79.43 | 835.79 | 11.17 | 11.22 | 1.07 |
| 32 | 4 | 32 | 536 | 82.82 | 59.49 | 429.78 | 6.47 | 9.01 | 1.25 |
| 32 | 4 | 64 | 1116.74 | 198.95 | 155.08 | 929.97 | 5.61 | 7.20 | 1.20 |
| 64 | 4 | 32 | 668.45 | 144.53 | 101.54 | 520.66 | 4.62 | 6.58 | 1.28 |
| 64 | 4 | 64 | 1476.17 | 351.14 | 277.4 | 1237.79 | 4.20 | 5.32 | 1.19 |
| 128 | 4 | 32 | 996.88 | 271.8 | 200.49 | 721.66 | 3.67 | 4.97 | 1.38 |
| 128 | 4 | 64 | 2157.85 | 671.76 | 502.91 | 2158.81 | 3.21 | 4.29 | 1.00 |

| batch_size | topk/topp | seq_len | FT(ms) | lightseq(ms) | lightseq speedup |
| --- | --- | --- | --- | --- | --- |
| 1 | 0.75 | 32 | 34.4 | 29.66 | 1.16 |
| 1 | 0.75 | 64 | 71.45 | 59.72 | 1.20 |
| 32 | 0.75 | 32 | 56.61 | 40.40 | 1.40 |
| 32 | 0.75 | 64 | 120.39 | 100.36 | 1.20 |
| 128 | 0.75 | 32 | 111.4 | 94.68 | 1.18 |
| 128 | 0.75 | 64 | 246.97 | 270.55 | 0.91 |
| 1 | 32 | 32 | 34.35 | 28.06 | 1.22 |
| 1 | 32 | 64 | 72.48 | 56.4 | 1.29 |
| 32 | 32 | 32 | 40.15 | 39.23 | 1.02 |
| 32 | 32 | 64 | 87.46 | 98.62 | 0.89 |
| 128 | 32 | 32 | 99 | 90.83 | 1.09 |
| 128 | 32 | 64 | 222.62 | 262 | 0.85 |
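
The speedup columns above are plain latency ratios: in the first table each engine's speedup is the TF baseline latency divided by that engine's latency, and in the second table (which has no TF column) the LightSeq speedup is measured against FT. The minimal Python sketch below reproduces a few of the reported values from rows of the tables above; it only illustrates the arithmetic and is not part of any benchmark code.

```python
# Minimal illustration of how the speedup columns are derived:
# speedup = baseline latency (ms) / engine latency (ms).

def speedup(baseline_ms: float, engine_ms: float) -> float:
    """Return how many times faster the engine is than the baseline."""
    return baseline_ms / engine_ms

# Row (batch_size=1, beam_size=4, seq_len=32) from the beam-search table:
tf_ms, ft_ms, lightseq_ms, pytorch_ms = 419.53, 26.25, 29.66, 385.23
print(round(speedup(tf_ms, ft_ms), 2))        # 15.98 -> "FT speedup"
print(round(speedup(tf_ms, lightseq_ms), 2))  # 14.14 -> "lightseq speedup"
print(round(speedup(tf_ms, pytorch_ms), 2))   # 1.09  -> "PyTorch speedup"

# Row (batch_size=1, topk/topp=0.75, seq_len=32) from the sampling table,
# where the baseline is FT rather than TF:
print(round(speedup(34.4, 29.66), 2))         # 1.16  -> "lightseq speedup"
```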

The following table compares performance on a French-to-English (fr2en) translation model: a Transformer-big decoded with a beam size of 4 and a target vocabulary of approximately 30k. FP32 models are tested on a Tesla P4, and FP16 models on a Tesla T4. A short consistency check on the three speedup columns is shown after the table.

| batch_size | seq_len | tf-fp32 (ms) | lightseq-fp32 (ms) | lightseq-fp16 (ms) | lightseq-fp32 vs tf-fp32 speedup | lightseq-fp16 vs lightseq-fp32 speedup | lightseq-fp16 vs tf-fp32 speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 6 | 303 | 47 | 27 | 6.44 | 1.74 | 11.22 |
| 1 | 12 | 399 | 63 | 38 | 6.33 | 1.66 | 10.5 |
| 1 | 18 | 702 | 108 | 59 | 6.5 | 1.83 | 11.9 |
| 1 | 24 | 1071 | 167 | 82 | 6.41 | 2.04 | 13.06 |
| 1 | 36 | 1234 | 192 | 105 | 6.42 | 1.83 | 11.75 |
| 1 | 46 | 1445 | 227 | 110 | 6.36 | 2.06 | 13.14 |
| 1 | 58 | 1887 | 303 | 142 | 6.22 | 2.13 | 13.29 |
| 1 | 70 | 2771 | 428 | 197 | 6.47 | 2.17 | 14.07 |
| 2 | 6 | 317 | 57 | 32 | 5.56 | 1.78 | 9.91 |
| 2 | 12 | 418 | 73 | 39 | 5.72 | 1.87 | 10.72 |
| 2 | 18 | 723 | 131 | 66 | 5.51 | 1.98 | 10.95 |
| 2 | 24 | 1113 | 201 | 91 | 5.53 | 2.21 | 12.23 |
| 2 | 36 | 1276 | 234 | 104 | 5.45 | 2.25 | 12.27 |
| 2 | 46 | 1521 | 282 | 121 | 5.39 | 2.33 | 12.57 |
| 2 | 58 | 2004 | 371 | 159 | 5.4 | 2.33 | 12.6 |
| 2 | 70 | 2965 | 542 | 221 | 5.47 | 2.45 | 13.42 |
| 4 | 6 | 326 | 61 | 39 | 5.34 | 1.56 | 8.36 |
| 4 | 12 | 433 | 85 | 47 | 5.09 | 1.81 | 9.21 |
| 4 | 18 | 761 | 154 | 77 | 4.94 | 2 | 9.88 |
| 4 | 24 | 1195 | 245 | 113 | 4.87 | 2.17 | 10.58 |
| 4 | 36 | 1391 | 282 | 128 | 4.93 | 2.2 | 10.87 |
| 4 | 46 | 1679 | 339 | 153 | 4.95 | 2.22 | 10.97 |
| 4 | 58 | 2232 | 455 | 199 | 4.9 | 2.29 | 11.22 |
| 4 | 70 | 3406 | 673 | 285 | 5.06 | 2.36 | 11.95 |
| 8 | 6 | 364 | 76 | 43 | 4.78 | 1.77 | 8.47 |
| 8 | 12 | 470 | 110 | 56 | 4.27 | 1.96 | 8.39 |
| 8 | 18 | 854 | 205 | 91 | 4.16 | 2.25 | 9.38 |
| 8 | 24 | 1381 | 318 | 139 | 4.34 | 2.29 | 9.94 |
| 8 | 36 | 1628 | 378 | 156 | 4.3 | 2.42 | 10.44 |
| 8 | 46 | 1989 | 459 | 193 | 4.33 | 2.38 | 10.31 |
| 8 | 58 | 2683 | 617 | 254 | 4.34 | 2.43 | 10.56 |
| 8 | 70 | 4251 | 949 | 382 | 4.47 | 2.48 | 11.13 |
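
The three speedup columns are not independent: because the intermediate lightseq-fp32 latency cancels, the lightseq-fp16 vs tf-fp32 speedup is, up to rounding of the reported latencies, the product of the other two columns. A minimal check against the first fr2en row:

```python
# The fp16-over-tf speedup is the product of the fp32-over-tf and the
# fp16-over-fp32 speedups, because
# (tf / ls_fp16) == (tf / ls_fp32) * (ls_fp32 / ls_fp16).

# First fr2en row (batch_size=1, seq_len=6): latencies in ms.
tf_fp32, ls_fp32, ls_fp16 = 303, 47, 27

fp32_vs_tf = tf_fp32 / ls_fp32    # ~6.45 (table reports 6.44, rounded differently)
fp16_vs_fp32 = ls_fp32 / ls_fp16  # ~1.74
fp16_vs_tf = tf_fp32 / ls_fp16    # ~11.22

assert abs(fp32_vs_tf * fp16_vs_fp32 - fp16_vs_tf) < 1e-9
print(round(fp32_vs_tf, 2), round(fp16_vs_fp32, 2), round(fp16_vs_tf, 2))
```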

The following table compares performance on an English-to-Chinese (en2zh) translation model: a Transformer-deep (the same configuration as Transformer-big except that the encoder has 16 layers), decoded with a beam size of 4 and a target vocabulary of approximately 30k. FP32 models are tested on a Tesla P4, and FP16 models on a Tesla T4.

| batch_size | seq_len | tf-fp32 (ms) | lightseq-fp32 (ms) | lightseq-fp16 (ms) | lightseq-fp32 vs tf-fp32 speedup | lightseq-fp16 vs lightseq-fp32 speedup | lightseq-fp16 vs tf-fp32 speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 12 | 544 | 86 | 43 | 6.32 | 2 | 12.65 |
| 1 | 24 | 914 | 131 | 66 | 6.97 | 1.98 | 13.85 |
| 1 | 36 | 1290 | 200 | 93 | 6.45 | 2.15 | 13.87 |
| 1 | 48 | 1836 | 233 | 106 | 7.89 | 2.2 | 17.32 |
| 1 | 72 | 3456 | 482 | 212 | 7.17 | 2.27 | 16.3 |
| 1 | 84 | 2626 | 431 | 193 | 6.09 | 2.23 | 13.61 |
| 2 | 12 | 566 | 100 | 50 | 5.66 | 2 | 11.32 |
| 2 | 24 | 842 | 158 | 70 | 5.32 | 2.26 | 12.03 |
| 2 | 36 | 1287 | 247 | 103 | 5.21 | 2.4 | 12.5 |
| 2 | 48 | 1504 | 288 | 118 | 5.22 | 2.44 | 12.75 |
| 2 | 72 | 3131 | 611 | 240 | 5.12 | 2.55 | 13.05 |
| 2 | 84 | 2789 | 546 | 217 | 5.1 | 2.52 | 12.85 |
| 4 | 12 | 590 | 118 | 58 | 5 | 2.03 | 10.17 |
| 4 | 24 | 885 | 187 | 89 | 4.73 | 2.1 | 9.94 |
| 4 | 36 | 1380 | 301 | 127 | 4.58 | 2.37 | 10.87 |
| 4 | 48 | 1622 | 352 | 149 | 4.6 | 2.36 | 10.89 |
| 4 | 72 | 3492 | 763 | 311 | 4.57 | 2.45 | 11.23 |
| 4 | 84 | 3145 | 687 | 282 | 4.57 | 2.44 | 11.15 |
| 8 | 12 | 631 | 150 | 66 | 4.2 | 2.27 | 9.56 |
| 8 | 24 | 979 | 248 | 103 | 3.94 | 2.41 | 9.5 |
| 8 | 36 | 1584 | 412 | 156 | 3.84 | 2.64 | 10.15 |
| 8 | 48 | 1880 | 477 | 186 | 3.94 | 2.56 | 10.11 |
| 8 | 72 | 4218 | 1069 | 404 | 3.94 | 2.65 | 10.44 |
| 8 | 84 | 3831 | 976 | 373 | 3.92 | 2.62 | 10.27 |

The following table compares the Hugging Face BERT-base model with LightSeq on a Tesla T4 using FP16. A sketch of how such a Hugging Face baseline can be timed is shown after the table.

| batch_size | seq_len | Hugging Face (ms) | lightseq (ms) | lightseq speedup |
| --- | --- | --- | --- | --- |
| 1 | 16 | 15.23 | 2.19 | 6.95 |
| 1 | 32 | 16.24 | 1.99 | 8.16 |
| 1 | 64 | 19.32 | 2.35 | 8.22 |
| 1 | 128 | 16.57 | 2.98 | 5.56 |
| 1 | 256 | 23.99 | 4.60 | 5.22 |
| 8 | 16 | 13.06 | 3.47 | 3.76 |
| 8 | 32 | 13.27 | 4.46 | 2.98 |
| 8 | 64 | 23.02 | 7.43 | 3.10 |
| 8 | 128 | 59.35 | 17.27 | 3.44 |
| 8 | 256 | 117.06 | 40.74 | 2.87 |
| 32 | 16 | 29.27 | 12.38 | 2.36 |
| 32 | 32 | 54.90 | 17.68 | 3.11 |
| 32 | 64 | 109.13 | 36.20 | 3.01 |
| 32 | 128 | 260.13 | 66.03 | 3.94 |
| 32 | 256 | 498.84 | 145.57 | 3.43 |
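
For reference, a latency like the Hugging Face column above can be measured with a simple warmup-then-time loop around FP16 PyTorch inference. The sketch below is an assumed setup, not the script used to produce the table: the warmup and iteration counts, the `bert-base-uncased` checkpoint name, and the omission of an attention mask are our own choices, and the LightSeq side is left out.

```python
import time

import torch
from transformers import BertModel

# Assumed measurement setup (not the original benchmark script):
# time FP16 BERT-base forward passes on GPU for one (batch_size, seq_len) pair.
device = torch.device("cuda")
model = BertModel.from_pretrained("bert-base-uncased").half().to(device).eval()

batch_size, seq_len = 8, 128  # one configuration from the table above
input_ids = torch.randint(
    0, model.config.vocab_size, (batch_size, seq_len), device=device
)

with torch.no_grad():
    for _ in range(10):                # warmup iterations
        model(input_ids)
    torch.cuda.synchronize()

    iters = 100
    start = time.perf_counter()
    for _ in range(iters):
        model(input_ids)
    torch.cuda.synchronize()           # wait for all GPU kernels to finish
    elapsed_ms = (time.perf_counter() - start) * 1000 / iters

print(f"batch={batch_size} seq_len={seq_len}: {elapsed_ms:.2f} ms per forward pass")
```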