Specify map location when loading model #272
Conversation
✅ Deploy Preview for silly-keller-664934 ready!
Codecov Report
```
@@           Coverage Diff           @@
##           master    #272   +/-   ##
=======================================
  Coverage    87.6%    87.6%
=======================================
  Files          26       26
  Lines        2178     2178
=======================================
  Hits         1908     1908
  Misses        270      270
```
Tests fail for me on a GPU machine with these logs:

```
===================================================================== FAILURES ======================================================================
_________________________________________________________ test_not_use_default_model_labels _________________________________________________________

dummy_trained_model_checkpoint = PosixPath('/tmp/pytest-of-bull/pytest-0/dummy-model-dir3/my_model/version_0/dummy_model.ckpt')

    def test_not_use_default_model_labels(dummy_trained_model_checkpoint):
        """Tests that training a model using labels that are a subset of the model species but
        with use_default_model_labels=False replaces the model head."""
        original_model = DummyZambaVideoClassificationLightningModule.from_disk(
            dummy_trained_model_checkpoint
        )
        model = instantiate_model(
            checkpoint=dummy_trained_model_checkpoint,
            scheduler_config="default",
            labels=pd.DataFrame([{"filepath": "gorilla.mp4", "species_gorilla": 1}]),
            use_default_model_labels=False,
        )
>       assert (model.head.weight != original_model.head.weight).all()
E       RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```

I checked with pdb, and here is the device each model ends up on. Looks like the freshly instantiated model stays on the CPU while the original model is loaded onto the GPU:

```
(Pdb) model.device
device(type='cpu')
(Pdb) original_model.device
device(type='cuda', index=0)
```
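For reference, here is a minimal standalone sketch (illustrative, not from the zamba codebase) of the same failure mode: an element-wise comparison of two tensors that live on different devices raises exactly this RuntimeError, and moving the operands onto a common device first resolves it.

```python
import torch

# Minimal sketch of the failure mode above: comparing tensors that live on
# different devices raises a RuntimeError in torch.
cpu_weight = torch.randn(2, 4)  # stays on CPU

if torch.cuda.is_available():
    gpu_weight = torch.randn(2, 4, device="cuda")
    try:
        (cpu_weight != gpu_weight).all()
    except RuntimeError as err:
        print(err)  # Expected all tensors to be on the same device ...

    # Moving both operands onto the same device makes the comparison valid.
    print((cpu_weight.to(gpu_weight.device) != gpu_weight).all())
```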
```diff
@@ -110,10 +110,8 @@ def instantiate_model(
     return resume_training(
         scheduler_config=scheduler_config,
         hparams=hparams,
-        species=species,
```
[cleanup] removed unused params
@pjbull I learned a couple of useful things in debugging this:

1) A LightningModule already keeps track of whether a GPU is available.
2) We need to specify a `map_location` when loading from a checkpoint.

Our options are to:
I've gone ahead and made some simplifications.
Tests are passing locally in a fresh Python 3.9 environment on both a GPU and a non-GPU machine.
It still seems a little weird to force cpu in all the `from_disk` calls, and it seems maybe preferable to avoid that if possible. If not, the rest of the change looks great to me!
```diff
     def from_disk(cls, path: os.PathLike, **kwargs):
-        return cls.load_from_checkpoint(path)
+        # note: we always load models onto CPU; moving to GPU is handled by `devices` in pl.Trainer
+        return cls.load_from_checkpoint(path, map_location="cpu", **kwargs)
```
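To illustrate the intent of that comment, here is a hedged sketch of how a CPU-loaded module ends up on an accelerator: the Trainer's `accelerator`/`devices` arguments decide placement, so `from_disk` never has to. The class below is a made-up stand-in, not the real zamba model.

```python
import torch
import pytorch_lightning as pl

class ExampleModule(pl.LightningModule):
    # Made-up stand-in for a model loaded via `from_disk` (i.e. onto CPU).
    def __init__(self):
        super().__init__()
        self.head = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.head(x)

model = ExampleModule()  # imagine this came from `from_disk`, so it lives on CPU
print(model.device)      # cpu

# The Trainer, not the loading code, decides where the model runs.
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)
# trainer.fit(model, train_dataloaders=...)  # Trainer moves `model` to the accelerator
```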
Maybe this is fine, but forcing cpu here still feels a little weird to me. What if someone isn't using the trainer? Then they have to know about and move devices themselves?
Just wondering if letting things happen automatically is a problem?
> Just wondering if letting things happen automatically is a problem?
With the recent torch changes, we need to specify a map location when loading from a checkpoint (`map_location` cannot be None; that is the source of the failing tests). Trying to figure out whether the user has a GPU and wants to use it at this point in the code just adds redundancy and complexity without any real benefit (it's hard to summarize succinctly here, but that is the route we initially tried and it was not a good option).
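As a quick illustration of what specifying a map location means at the torch level (a hedged sketch; the path and tensors below are made up, and pytorch-lightning's `load_from_checkpoint` exposes the same `map_location` argument):

```python
import torch

# Pin where checkpoint tensors land at load time, rather than letting them
# default to the device they were saved from.
torch.save({"head.weight": torch.randn(2, 4)}, "/tmp/example.ckpt")

state = torch.load("/tmp/example.ckpt", map_location="cpu")
print(state["head.weight"].device)  # cpu, regardless of where it was saved
```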
> What if someone isn't using the trainer?
Whether they're using our `train_model` code or writing their own PTL code, they'll still be using the trainer. If for some reason they're using our code to load the model but then not using PTL, then yes, they'll have to know about devices and move things themselves.
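For completeness, a small self-contained sketch of what "know about devices and move things themselves" looks like outside of a Trainer (the tiny model below is illustrative):

```python
import torch

# Without a Trainer, the caller is responsible for device placement.
model = torch.nn.Linear(4, 2)               # stand-in for a CPU-loaded model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)                    # move parameters explicitly
batch = torch.randn(8, 4, device=device)    # inputs must live on the same device
print(model(batch).device)
```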
Fixes #270
Fixes #271