Issue with migrating to pytorch 0.4 #29

Open
mingyuliutw opened this issue Jul 27, 2018 · 5 comments

Comments

@mingyuliutw
Collaborator

We discovered a couple of issues (slower training and degraded output image quality) when migrating our code from pytorch 0.3 to 0.4. We are working on fixing them. For now, we recommend using munit_pytorch0.3.

@mingyuliutw
Collaborator Author

The speed issue is now fixed in commit f972e42.

@Cuky88

Cuky88 commented Jul 28, 2018

After making the changes you did, I get the following error when resuming the training:

Traceback (most recent call last):
  File "train.py", line 64, in <module>
    iterations = trainer.resume(checkpoint_directory, hyperparameters=config) if opts.resume else 0
  File "/devel/MUNIT-master/MUNIT-master/trainer.py", line 186, in resume
    self.gen_a.load_state_dict(state_dict['a'])
  File "/opt/anaconda/lib/python2.7/site-packages/torch/nn/modules/module.py", line 721, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for AdaINGen:
        Unexpected key(s) in state_dict: "enc_content.model.0.norm.running_mean", "enc_content.model.0.norm.running_var", "enc_content.model.1.norm.running_mean", "enc_content.model.1.norm.running_var", "enc_content.model.2.norm.running_mean", "enc_content.model.2.norm.running_var", "enc_content.model.3.model.0.model.0.norm.running_mean", "enc_content.model.3.model.0.model.0.norm.running_var", "enc_content.model.3.model.0.model.1.norm.running_mean", "enc_content.model.3.model.0.model.1.norm.running_var", "enc_content.model.3.model.1.model.0.norm.running_mean", "enc_content.model.3.model.1.model.0.norm.running_var", "enc_content.model.3.model.1.model.1.norm.running_mean", "enc_content.model.3.model.1.model.1.norm.running_var", "enc_content.model.3.model.2.model.0.norm.running_mean", "enc_content.model.3.model.2.model.0.norm.running_var", "enc_content.model.3.model.2.model.1.norm.running_mean", "enc_content.model.3.model.2.model.1.norm.running_var", "enc_content.model.3.model.3.model.0.norm.running_mean", "enc_content.model.3.model.3.model.0.norm.running_var", "enc_content.model.3.model.3.model.1.norm.running_mean", "enc_content.model.3.model.3.model.1.norm.running_var".

How can I use an already trained model with this modification? Training from scratch works fine.
I guess this issue comes from the changed Layer Normalization?

Do you have any idea why output quality is degraded?
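
One possible workaround for the unexpected keys in the traceback above, offered as a sketch rather than a fix confirmed in this thread: drop the `running_mean` and `running_var` entries from the saved state dict before loading it. The helper name `drop_running_stats` and the `conv.weight` key names are hypothetical, and the string values stand in for tensors so the example runs without PyTorch.

```python
from collections import OrderedDict


def drop_running_stats(state_dict):
    """Return a copy of state_dict without running_mean / running_var entries."""
    suffixes = (".running_mean", ".running_var")
    return OrderedDict(
        (key, value)
        for key, value in state_dict.items()
        if not key.endswith(suffixes)  # str.endswith accepts a tuple of suffixes
    )


# Stand-in for state_dict['a']; a real checkpoint maps names to tensors.
# The conv.weight names are assumptions made up for this illustration.
checkpoint = OrderedDict([
    ("enc_content.model.0.conv.weight", "w0"),
    ("enc_content.model.0.norm.running_mean", "m0"),
    ("enc_content.model.0.norm.running_var", "v0"),
    ("enc_content.model.1.conv.weight", "w1"),
])

cleaned = drop_running_stats(checkpoint)
print(list(cleaned.keys()))
```

The cleaned dictionary could then be passed to `self.gen_a.load_state_dict(...)` in place of `state_dict['a']`. Calling `load_state_dict(..., strict=False)` also skips unexpected keys, but it silences missing keys as well, so explicit filtering is the more conservative choice. Whether a checkpoint loaded this way reproduces the original output quality is not established here.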

@mingyuliutw
Collaborator Author

@Cuky88 The degraded performance after migrating to pytorch 0.4 is likely caused by the instance normalization parameter. We accidentally set track_running_stats=True in networks.py. This means that the layer uses the tracked means and variances at test time. However, this is NOT what we used when we developed the code. In the new commit we have set this argument to False. I think this will resolve the issue. I am verifying the hypothesis. Once it is verified, I will add more details.
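
To illustrate the behavior described above, here is a small sketch using `torch.nn.InstanceNorm2d` from a recent PyTorch release. It is not the code from networks.py; the channel count, tensor shape, and variable names are arbitrary choices for the demonstration. It shows that a layer with `track_running_stats=True` stores running statistics in its state dict and uses them in eval mode, while a layer with `track_running_stats=False` always normalizes with the statistics of the input itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two instance-norm layers that differ only in track_running_stats.
norm_tracked = nn.InstanceNorm2d(3, track_running_stats=True)
norm_plain = nn.InstanceNorm2d(3, track_running_stats=False)

# The tracked layer registers running_mean / running_var buffers,
# which is why such keys show up in a saved checkpoint.
tracked_keys = list(norm_tracked.state_dict().keys())
plain_keys = list(norm_plain.state_dict().keys())
print(tracked_keys)
print(plain_keys)

# Input whose statistics are far from mean 0 / variance 1.
x = torch.randn(2, 3, 8, 8) * 5.0 + 2.0

norm_tracked.eval()
norm_plain.eval()
with torch.no_grad():
    # Uses the stored running statistics (still at their initial values
    # of mean 0 and variance 1 here), so the input is barely changed.
    y_tracked = norm_tracked(x)
    # Uses per-instance, per-channel statistics of x itself.
    y_plain = norm_plain(x)

print(y_tracked.mean().item(), y_plain.mean().item())
```

In this sketch the tracked layer leaves the input almost unchanged in eval mode, whereas the untracked layer produces zero-mean output per instance and channel. A mismatch of this kind between training and test normalization is consistent with the hypothesis stated in the comment, though the thread does not yet confirm it as the cause.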

@qilimk

qilimk commented Jul 31, 2018

@mingyuliutw I trained this model for 200,000 iterations several days ago and it took almost 4 days. Your work looks very good and I really want to reproduce the results.

  • How many images should the training set contain?

  • How long should I expect training this model for 1M iterations to take with the new code?

My GPU is a Tesla V100-SXM2 16GB.

@qilimk

qilimk commented Aug 1, 2018

@Cuky88 I used 2500 images as the training set and the results did not look good. I am now trying a new dataset with 50,000 images and hope to get a better result.
How about your results? It looks like your iteration count is small.
