Improve DeepSpeed Stage 3 Throughput #16

Open
SeanNaren opened this issue Jul 11, 2022 · 0 comments

On 8 A100s with this DeepSpeed config, the measured TFLOPs are below:

deepspeed --num_gpus 8 train.py --batch_size_per_gpu 36
Estimates: 129.32TFLOPs Avg Iteration Time: 8.01s
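For reference, this estimate can be sanity-checked with the usual ~6·N FLOPs-per-token approximation for forward + backward (roughly ~8·N when activation checkpointing replays the forward pass). A minimal sketch, where the parameter count, sequence length, and checkpointing flag are placeholders rather than this repo's actual values:

```python
def estimated_tflops_per_gpu(
    n_params: float,            # total model parameters (placeholder, not this repo's value)
    batch_size_per_gpu: int,    # e.g. 36, as in the command above
    seq_len: int,               # sequence length (placeholder)
    iter_time_s: float,         # measured average iteration time, e.g. 8.01
    checkpoint_activations: bool = True,
) -> float:
    # ~6*N FLOPs per token for forward + backward; ~8*N if the forward pass is
    # recomputed during activation checkpointing.
    flops_per_token = (8 if checkpoint_activations else 6) * n_params
    tokens_per_gpu = batch_size_per_gpu * seq_len
    return flops_per_token * tokens_per_gpu / iter_time_s / 1e12
```

This drops the attention term that Megatron's exact formula includes, so it will understate the figure somewhat at long sequence lengths.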

In the Megatron-LM paper, the 175B model's throughput is reported as 113 teraFLOP/s per GPU without fused operators versus 135 teraFLOP/s per GPU with them. Considering we're missing some fused kernels (#14), we might be getting close to comparable TFLOPs!

There is also the question of why sparse attention isn't allowing us to push compute further, but this will remain a separate variable.
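Beyond kernels and attention, a few ZeRO Stage 3 settings in the DeepSpeed config tend to have the biggest effect on throughput (prefetch bucket size, parameter persistence threshold, max live parameters, and communication overlap). A sketch of where they live, written as the Python-dict equivalent of the JSON config, with illustrative values rather than the settings actually used here:

```python
# Illustrative ZeRO Stage 3 section of a DeepSpeed config -- values are
# placeholders for tuning, not the settings used in this repo.
ds_config = {
    "train_micro_batch_size_per_gpu": 36,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,                        # overlap all-gather/reduce-scatter with compute
        "contiguous_gradients": True,
        "reduce_bucket_size": 5e8,
        "stage3_prefetch_bucket_size": 5e8,          # how far ahead parameters are gathered
        "stage3_param_persistence_threshold": 1e6,   # small params kept un-partitioned on each GPU
        "stage3_max_live_parameters": 1e9,           # cap on gathered params held at once
        "stage3_max_reuse_distance": 1e9,
    },
}
```

Larger prefetch buckets and live-parameter limits generally trade memory headroom for fewer all-gather stalls.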

cc @tjruwase @jeffra
