On 8 A100s with this DeepSpeed config, the measured TFLOP/s are below:
In the Megatron-LM paper, the reported throughput for the 175B-parameter model is 113 teraFLOP/s per GPU without fused operators versus 135 teraFLOP/s per GPU with them. Considering we're missing some fused kernels (#14), we might be getting close to comparable TFLOPs!
There is also the question of why sparse attention isn't letting us push compute further, but that will remain a separate variable.
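For reference, a rough way to sanity-check numbers like these is the model-FLOPs formula from the Megatron-LM paper, F = 96·B·s·l·h²·(1 + s/(6h) + V/(16lh)) (which assumes activation recomputation), divided by iteration time and GPU count. The sketch below is just that back-of-the-envelope calculation with placeholder batch size, sequence length, and model dimensions; it is not the actual config or measurement method used for the numbers above.

```python
# Rough estimate of achieved TFLOP/s per GPU using the Megatron-LM
# model-FLOPs formula (Narayanan et al., 2021), which assumes
# activation recomputation. All values below are placeholders.

def megatron_flops_per_iteration(batch_size, seq_len, num_layers, hidden_size, vocab_size):
    """F = 96 * B * s * l * h^2 * (1 + s/(6h) + V/(16*l*h))"""
    return (
        96 * batch_size * seq_len * num_layers * hidden_size**2
        * (1 + seq_len / (6 * hidden_size)
             + vocab_size / (16 * num_layers * hidden_size))
    )

def achieved_tflops_per_gpu(flops_per_iter, iter_time_s, num_gpus):
    # Divide total model FLOPs per iteration by wall-clock time and GPU count.
    return flops_per_iter / (iter_time_s * num_gpus) / 1e12

# Example with hypothetical settings (not the config from this issue):
flops = megatron_flops_per_iteration(
    batch_size=32, seq_len=2048, num_layers=24, hidden_size=2048, vocab_size=50304
)
print(f"{achieved_tflops_per_gpu(flops, iter_time_s=1.5, num_gpus=8):.1f} TFLOP/s per GPU")
```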
cc @tjruwase @jeffra