In the case of distributed training, e.g. DDP, each GPU only processes a minibatch, so the BatchNorm statistics computed on each GPU are different.
When SWA is adopted, we need to run one more epoch for bn_update. In this epoch, should we use SyncBatchNorm to average the BN statistics across all GPUs? A sketch of what I mean is below.
Are there any other modifications we need to make for DDP training?
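For context, in the non-distributed case the extra bn_update epoch is just a forward-only sweep over the training data that recomputes the BN running statistics for the averaged weights. A minimal sketch using PyTorch's `torch.optim.swa_utils` (this repo ships its own `bn_update` utility, so this is only illustrative; `swa_model`, `train_loader`, and `device` are placeholders):

```python
import torch
from torch.optim.swa_utils import update_bn

def recompute_bn_stats(swa_model, train_loader, device):
    """One extra forward-only pass over the training data.

    The averaged SWA weights invalidate the BN running mean/var accumulated
    during training, so update_bn resets the BN buffers and recomputes them
    before evaluation.
    """
    update_bn(train_loader, swa_model, device=device)
```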
Hi @milliema, I'd say you should do the same thing that is normally done with the BatchNorm statistics at the end of parallel training; I imagine you are syncing the statistics between the copies of the model? I personally haven't looked into distributed SWA much, but here is a potentially useful reference: https://openreview.net/forum?id=rygFWAEFwS.
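One possible way to sync the statistics during the bn_update pass is sketched below. This is not something from this repo, just an illustration: it assumes `torch.nn.SyncBatchNorm` and PyTorch's `torch.optim.swa_utils.update_bn`, an already-initialized process group, and placeholder names `swa_model`, `train_loader`, and `local_rank`.

```python
import torch
from torch.nn import SyncBatchNorm
from torch.optim.swa_utils import update_bn

def recompute_bn_stats_ddp(swa_model, train_loader, local_rank):
    """Recompute BN statistics after SWA weight averaging under DDP.

    SyncBatchNorm averages the per-batch mean/var across all ranks during the
    forward pass, so every rank accumulates identical running statistics even
    though each rank only sees its own shard of the data (e.g. via
    DistributedSampler).
    """
    device = torch.device(f"cuda:{local_rank}")
    # Convert BatchNorm layers to SyncBatchNorm; affine parameters and running
    # buffers are copied, and the default process group is used for the sync.
    sync_model = SyncBatchNorm.convert_sync_batchnorm(swa_model).to(device)
    # Forward-only pass over this rank's shard; no backward is needed, so the
    # model does not have to be wrapped in DistributedDataParallel here.
    update_bn(train_loader, sync_model, device=device)
    return sync_model
```

Whether converting to SyncBatchNorm only for this final pass (versus training with it from the start) is the right choice probably depends on the per-GPU batch size; with reasonably large per-GPU batches, plain per-rank BN followed by a synced bn_update may be sufficient.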