
[hybrid] pp+dp support fp16 allreduce #34762

Merged
merged 4 commits into PaddlePaddle:develop on Aug 11, 2021

Conversation

wangxicoding (Contributor) commented Aug 10, 2021

PR types

Performance optimization

PR changes

Others

Describe

Hybrid parallelism (pp+dp) now supports fp16 allreduce. Usage:

import paddle.distributed.fleet as fleet

strategy = fleet.DistributedStrategy()
strategy.sharding = True
strategy.sharding_configs = {
    "sharding_degree": 1,   # no sharding
    "mp_degree": 1,         # no tensor (model) parallelism
    "pp_degree": 2,         # 2-stage pipeline parallelism
    "dp_degree": 2,         # 2-way data parallelism
}
strategy.pipeline = True
strategy.pipeline_configs = {
    "schedule_mode": "1F1B",     # one-forward-one-backward schedule
    "micro_batch_size": 2,
    "accumulate_steps": 4,       # micro-batches accumulated per step
}
strategy.amp = True              # mixed-precision (fp16) training
strategy.fp16_allreduce = True   # allreduce gradients in fp16 across dp ranks
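
For context, here is a minimal sketch (not part of this PR) of how such a strategy is typically hooked up with fleet in static-graph mode. The optimizer choice is arbitrary, and building the network, annotating pipeline stages, and launching one process per GPU are omitted:

import paddle
import paddle.distributed.fleet as fleet

# Hybrid sharding/pipeline configs take effect in static-graph training.
paddle.enable_static()
fleet.init(is_collective=True, strategy=strategy)

optimizer = paddle.optimizer.Momentum(learning_rate=0.01)
optimizer = fleet.distributed_optimizer(optimizer, strategy=strategy)
# optimizer.minimize(loss) would then rewrite the program with the
# pipeline/data-parallel passes, including fp16 gradient allreduce
# when fp16_allreduce is enabled.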

Test

Tested on 16 nodes with 8 32GB V100 cards each (128 GPUs), with the Ernie3.0 model.

Model config:

config                        value
hidden size                   8192
num attention heads           128
num hidden layers             76
num sharing layers            64
branch hidden size            768
branch num attention heads    16

Hybrid configs, with a fused allreduce buffer size of 128MB:

dp   mp   pp   micro bsz   global bsz
2    8    8    2           256

Performance:

fp16_allreduce   throughput (tokens/s)   improve
false            13394                   -
true             15224                   13.6%
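
Here improve denotes the relative throughput gain, (15224 - 13394) / 13394 × 100%.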

@JZ-LIANG (Contributor) left a comment:
LGTM

"""
Create a new merged gradient for each parameter and accumulate the
corresponding gradient to it.
"""
merged_gradient_names = []
first_opt_op_idx = None

merged_suffix = '@MERGED@FP16' if fp16_allreduce else '@MERGED'
Contributor commented:

We should explain the gradient-name suffixes for later maintainers; we now have too many suffixes for grads: the immediate grad, accumulated grad, casted grad, etc.

wangxicoding (Contributor, Author) replied:

OK, will add it in the next PR.
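
For readers following the suffix discussion above, a rough illustrative sketch of how the merged-gradient names compose (the helper below is hypothetical, not code from this PR):

def merged_grad_name(param_name, fp16_allreduce):
    # Immediate gradient of a parameter, e.g. "linear_0.w_0@GRAD".
    grad_name = param_name + "@GRAD"
    # Accumulated (merged) gradient across micro-batches; kept in fp16 when
    # fp16_allreduce is on, so the data-parallel allreduce moves half the bytes.
    suffix = "@MERGED@FP16" if fp16_allreduce else "@MERGED"
    return grad_name + suffix

print(merged_grad_name("linear_0.w_0", True))   # linear_0.w_0@GRAD@MERGED@FP16
print(merged_grad_name("linear_0.w_0", False))  # linear_0.w_0@GRAD@MERGED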

@sandyhouse left a comment:

LGTM

@wangxicoding wangxicoding merged commit 4d7af37 into PaddlePaddle:develop Aug 11, 2021
@wangxicoding wangxicoding deleted the hybird_fp16_allreduce branch August 11, 2021 07:31