add micro batches #148

Merged: 4 commits merged into main on May 17, 2022
Conversation

lukas-blecher (Owner)

Proposed in issue #147

@lukas-blecher lukas-blecher linked an issue May 13, 2022 that may be closed by this pull request
lukas-blecher (Owner, Author)

Maybe we should scale all gradients by args.micro_batchsize/args.batchsize, because right now the gradients are summed over the micro batches, resulting in a larger gradient norm on average.
But I've tested it out on a toy model, and this constant factor did not hinder convergence.
It can also be compensated for by choosing different betas and a different initial learning rate in the case of the Adam optimizer.

Still, scaling might be better for consistency.
Add

for p in model.parameters():
    if p.grad is not None:
        p.grad *= args.micro_batchsize / args.batchsize

before

opt.step()
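
For context, a minimal sketch of how the accumulation and rescaling could fit together (hypothetical names dataloader, loss_fn, model, opt; assumes args.batchsize is a multiple of args.micro_batchsize and each dataloader item is one micro batch, not the exact code in this PR):

steps = args.batchsize // args.micro_batchsize   # micro batches per optimizer step
opt.zero_grad()
for i, (x, y) in enumerate(dataloader):
    loss = loss_fn(model(x), y)                  # mean loss of one micro batch
    loss.backward()                              # gradients are summed into p.grad
    if (i + 1) % steps == 0:
        for p in model.parameters():             # rescale the summed gradients
            if p.grad is not None:
                p.grad *= args.micro_batchsize / args.batchsize
        opt.step()
        opt.zero_grad()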

TITC (Collaborator) commented May 14, 2022

As always, I've learned from your code and comments.

I found some code on other websites where they do not directly average the gradients but instead average the loss before backpropagation.

I think both approaches lead to the same result. Although it is called gradient descent, in fact it's a directional derivative, because the direction is fixed once the architecture is determined.

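A sketch of the equivalence I have in mind, assuming K = args.batchsize / args.micro_batchsize micro batches of equal size and writing L_k for the mean loss of micro batch k (my notation, not from the code):

$$
\nabla_\theta \Bigl( \tfrac{1}{K} \sum_{k=1}^{K} L_k \Bigr)
  = \tfrac{1}{K} \sum_{k=1}^{K} \nabla_\theta L_k
  = \frac{\texttt{micro\_batchsize}}{\texttt{batchsize}} \sum_{k=1}^{K} \nabla_\theta L_k
$$

so averaging the loss before backward and rescaling the summed gradients before opt.step() give the same gradient.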

I'm not sure if this is the correct understanding; I'd like to hear your opinion. @lukas-blecher

lukas-blecher (Owner, Author)

Yes, I think you're right, it is equivalent.
Scaling the loss would be computationally more efficient.
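
For reference, a minimal sketch of the loss-scaling variant under the same hypothetical names as above (dataloader, loss_fn, opt; not the code merged in this PR): each micro-batch loss is divided by the number of micro batches, so no per-parameter rescaling is needed before opt.step().

steps = args.batchsize // args.micro_batchsize   # micro batches per optimizer step
opt.zero_grad()
for i, (x, y) in enumerate(dataloader):
    loss = loss_fn(model(x), y) / steps          # scale the loss instead of the gradients
    loss.backward()                              # accumulated gradients are already averaged
    if (i + 1) % steps == 0:
        opt.step()
        opt.zero_grad()

Only one scalar per micro batch is scaled, instead of every parameter's gradient, which is why it is slightly cheaper.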

@lukas-blecher lukas-blecher merged commit 6a91f0f into main May 17, 2022
@lukas-blecher lukas-blecher deleted the micro-batch branch May 17, 2022 08:53