Improve DeepSpeed Stage 3 Throughput #16

Open
SeanNaren opened this issue Jul 11, 2022 · 0 comments

On 8 A100s with this DeepSpeed config, the measured TFLOPs are below:

deepspeed --num_gpus 8 train.py --batch_size_per_gpu 36
Estimates: 129.32TFLOPs Avg Iteration Time: 8.01s
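For reference, this estimate can be sanity-checked with the usual ~6·N FLOPs-per-token approximation for forward + backward (roughly ~8·N when activation checkpointing replays the forward pass). A minimal sketch, where the parameter count, sequence length, and checkpointing flag are placeholders rather than this repo's actual values:

```python
def estimated_tflops_per_gpu(
    n_params: float,            # total model parameters (placeholder, not this repo's value)
    batch_size_per_gpu: int,    # e.g. 36, as in the command above
    seq_len: int,               # sequence length (placeholder)
    iter_time_s: float,         # measured average iteration time, e.g. 8.01
    checkpoint_activations: bool = True,
) -> float:
    # ~6*N FLOPs per token for forward + backward; ~8*N if the forward pass is
    # recomputed during activation checkpointing.
    flops_per_token = (8 if checkpoint_activations else 6) * n_params
    tokens_per_gpu = batch_size_per_gpu * seq_len
    return flops_per_token * tokens_per_gpu / iter_time_s / 1e12
```

This drops the attention term that Megatron's exact formula includes, so it will understate the figure somewhat at long sequence lengths.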

In the Megatron-LM paper, the 175B model's throughput is reported as 113 teraFLOP/s per GPU without fused operators versus 135 teraFLOP/s per GPU with them. Considering we're missing some fused kernels (#14), we might be getting close to comparable TFLOPs!

There is also the question of why sparse attention isn't allowing us to push compute further, but this will remain a separate variable.
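Beyond kernels and attention, a few ZeRO Stage 3 settings in the DeepSpeed config tend to have the biggest effect on throughput (prefetch bucket size, parameter persistence threshold, max live parameters, and communication overlap). A sketch of where they live, written as the Python-dict equivalent of the JSON config, with illustrative values rather than the settings actually used here:

```python
# Illustrative ZeRO Stage 3 section of a DeepSpeed config -- values are
# placeholders for tuning, not the settings used in this repo.
ds_config = {
    "train_micro_batch_size_per_gpu": 36,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,                        # overlap all-gather/reduce-scatter with compute
        "contiguous_gradients": True,
        "reduce_bucket_size": 5e8,
        "stage3_prefetch_bucket_size": 5e8,          # how far ahead parameters are gathered
        "stage3_param_persistence_threshold": 1e6,   # small params kept un-partitioned on each GPU
        "stage3_max_live_parameters": 1e9,           # cap on gathered params held at once
        "stage3_max_reuse_distance": 1e9,
    },
}
```

Larger prefetch buckets and live-parameter limits generally trade memory headroom for fewer all-gather stalls.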

cc @tjruwase @jeffra
