A Unified Checkpoint System Design #3339
-
Great! We received a lot of issues about it :(
-
@YuliangLiu0306 perhaps you want to suggest some methods to manage the …
-
One remaining question is how to save optimizer states for sharded tensors, e.g. in auto parallel and ZeRO.
-
Related issue #3250: the optimizer weights should be gathered before the save function.
-
Assume we only have 4 linear layers, none of which has a bias. The linear layer in yellow is a DTensor. The checkpoint will look like this (index.json is not shown):
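A hypothetical layout, assuming the un-gathered DTensor format described in the design section below; all file names are illustrative:

```
pytorch_model-00001.bin   # the three ordinary linear layers
linear_weight/            # the yellow DTensor layer, one file per shard
  linear_weight.bin.0
  linear_weight.bin.1
```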
-
The implementation of this design will be tracked on the Kanban board: https://github.com/orgs/hpcaitech/projects/19.
-
A possible sharded optimizer checkpoint

A state dict of an optimizer may look like the example below. There are three types of file:

Index file

The file content may be like:

```json
{
  "param_groups": "pytorch_optim_group.bin",
  "weight_map": {
    "0": "pytorch_optim-00001.bin",
    "1": "pytorch_optim-00002.bin"
  }
}
```

Group file

Generally speaking, the file name may be like `pytorch_optim_group.bin` (as referenced by the index file above). The file content may be like:

```python
[{'lr': 0.001,
  'betas': (0.9, 0.999),
  'eps': 1e-08,
  'weight_decay': 0,
  'amsgrad': False,
  'maximize': False,
  'foreach': None,
  'capturable': False,
  'params': [0]},
 {'lr': 0.001,
  'betas': (0.9, 0.999),
  'eps': 1e-08,
  'weight_decay': 0,
  'amsgrad': False,
  'maximize': False,
  'foreach': None,
  'capturable': False,
  'params': [1]}]
```

It saves the `param_groups` of the optimizer.

State files

Optimizer states may be large and we need to shard them. The file name may be like `pytorch_optim-00001.bin`. The file content may be like:

```python
{1: {'step': tensor(1.),
     'exp_avg': tensor([0.0750, 0.0381, 0.0591, 0.0473, 0.0298, 0.0659, 0.0052, 0.0653, 0.0714,
                        0.0618, 0.0388, 0.0288, 0.0140, 0.0349, 0.0391, 0.0459, 0.0867, 0.0453,
                        0.0629, 0.0130]),
     'exp_avg_sq': tensor([5.6205e-04, 1.4529e-04, 3.4900e-04, 2.2345e-04, 8.8624e-05, 4.3449e-04,
                           2.6612e-06, 4.2593e-04, 5.0988e-04, 3.8246e-04, 1.5073e-04, 8.2851e-05,
                           1.9700e-05, 1.2154e-04, 1.5302e-04, 2.1050e-04, 7.5246e-04, 2.0536e-04,
                           3.9539e-04, 1.6980e-05])}}
```

It saves part of the optimizer `state`, keyed by param ID.

Shortcomings

As the key of each param is a number, we don't know which model each param belongs to. If using pipeline parallelism, it's hard to recover the sharded state dict.
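Putting the three file types together, a minimal writer sketch, assuming one parameter's states per state file; the helper name and the `pytorch_optim.bin.index.json` index file name are assumptions, not part of the proposal:

```python
import json
import torch

def save_sharded_optimizer(optimizer: torch.optim.Optimizer,
                           params_per_file: int = 1) -> None:
    # Hypothetical helper: split an optimizer state dict into the group
    # file, numbered state files, and an index file described above.
    sd = optimizer.state_dict()

    # Group file: hyperparameters and the param IDs of each group.
    torch.save(sd["param_groups"], "pytorch_optim_group.bin")

    # State files: shard the per-parameter states across numbered files.
    index = {"param_groups": "pytorch_optim_group.bin", "weight_map": {}}
    items = list(sd["state"].items())
    for i in range(0, len(items), params_per_file):
        fname = f"pytorch_optim-{i // params_per_file + 1:05d}.bin"
        shard = dict(items[i:i + params_per_file])
        torch.save(shard, fname)
        for param_id in shard:
            index["weight_map"][str(param_id)] = fname

    # Index file name is an assumption, mirroring pytorch_model.bin.index.json.
    with open("pytorch_optim.bin.index.json", "w") as f:
        json.dump(index, f, indent=2)
```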
-
Hello @FrankLeeeee, I'm trying to enable the Hugging Face remote class in ColossalAI. Everything is OK so far, but I just found that the 3D-parallel version of CheckpointIO is not implemented in example/llama (booster/plugin/three_dim_parallel.py). I want to implement it, so which example or tutorial can I follow?
-
What's the mechanism of checkpoint save & load during multi-node training? Does rank 1 collect all the data and save it, or do all nodes save the sharded model while rank 1 saves only the index JSON?
-
Overview
As we are developing new features for the Colossal-AI system, we find it difficult to save/load model/optimizer checkpoints. This is because each feature handles the save/load logic on its own, without a common protocol. That is undesirable: we don't want a model trained with one feature to be loadable only with that same feature, as this limits the usage of the checkpoint and hinders integration with the community.
Therefore, it is important to design a unified checkpoint system as a protocol. Several important factors should be considered.
Background
First of all, we should understand which use cases this unified system will cater to. Let's assume we have a model like this:
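A minimal sketch of such a model, assuming four bias-free linear layers as in the DTensor example earlier in this thread; the class name is illustrative:

```python
import torch.nn as nn

# An illustrative stand-in for the model in the figure: four bias-free
# linear layers, one of which may later be sharded as a DTensor.
class ToyModel(nn.Module):
    def __init__(self, dim: int = 20):
        super().__init__()
        self.layers = nn.Sequential(
            *[nn.Linear(dim, dim, bias=False) for _ in range(4)]
        )

    def forward(self, x):
        return self.layers(x)
```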
When we train this model, it can be placed over the GPUs in different ways. Let's assume we are training with 2 GPUs.
Therefore, the unified checkpoint system should support at least the features mentioned above.
Currently, there are mainly two ways to save/load model checkpoints:

1. Save the whole `state_dict()` into a single file. For example, PyTorch saves the whole model into a single file such as `model.pth`.
2. Shard the checkpoint into multiple files, where an `index.json` file is used to specify which parameter goes into which file. An example is https://huggingface.co/facebook/opt-66b/blob/main/pytorch_model.bin.index.json.

Design
According to the information mentioned above, what we need to support can be expressed as a matrix:
Therefore, we will have the following APIs to cover these usages:
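A hypothetical sketch of what these APIs could look like; only `save_model`/`load_model` and their `shard`/`gather_dtensor` arguments are described in this post, so the optimizer methods and the default values are assumptions:

```python
from pathlib import Path
from typing import Union

import torch.nn as nn
from torch.optim import Optimizer

class CheckpointIO:
    # Hypothetical signatures covering the model/optimizer save-load matrix.
    def save_model(self, model: nn.Module, checkpoint: Union[str, Path],
                   shard: bool = False, gather_dtensor: bool = True) -> None:
        ...

    def load_model(self, model: nn.Module, checkpoint: Union[str, Path]) -> None:
        ...

    def save_optimizer(self, optimizer: Optimizer, checkpoint: Union[str, Path],
                       shard: bool = False, gather_dtensor: bool = True) -> None:
        ...

    def load_optimizer(self, optimizer: Optimizer, checkpoint: Union[str, Path]) -> None:
        ...
```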
We can focus on the `load_model` and `save_model` methods. `save_model` has two arguments to define its checkpoint format:

- `shard`: the checkpoint will be sharded and indexed by an `index.json` file if True. Otherwise, the model weight will be saved in a single file.
- `gather_dtensor`: distributed tensors (DTensors) will be gathered into full global tensors before saving if True. Otherwise, each DTensor shard is saved separately.

To better explain the outcomes of the different cases, I will use file structures to illustrate:
1. shard = False, gather_dtensor = True
2. shard = True, gather_dtensor = True
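Hypothetical layouts for these two cases, following the single-file and Hugging-Face-style sharded formats from the Background section; file names are illustrative:

```
# Case 1: the whole gathered state_dict in a single file
model.pth

# Case 2: sharded checkpoint indexed by index.json
pytorch_model.bin.index.json
pytorch_model-00001.bin
pytorch_model-00002.bin
```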
3. shard = False, gather_dtensor = False. The dtensors will be stored in an individual folder and each tensor shard is numbered. The saved tensor format and file structure will look like:
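A hedged sketch of that structure, assuming one folder per DTensor and the `linear_weight.bin.*` naming described below:

```
index.json
model.bin                 # non-distributed tensors in a single file (shard = False)
linear_weight/            # folder for the DTensor, one numbered file per shard
  linear_weight.bin.0
  linear_weight.bin.1
```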
The weight-map key-value pair for a dtensor in the `index.json` file will look like `linear_weight: linear_weight.bin.*`.
4. shard = True, gather_dtensor = False
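A similar hedged sketch for this case, combining the sharded, indexed layout of case 2 with the per-shard DTensor files of case 3; file names are illustrative:

```
pytorch_model.bin.index.json
pytorch_model-00001.bin   # ordinary tensors, sharded across files
pytorch_model-00002.bin
linear_weight/            # DTensor shards kept un-gathered
  linear_weight.bin.0
  linear_weight.bin.1
```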
When loading models, we only support the cases of `shard = True, gather_dtensor = False` and `shard = False, gather_dtensor = False`. Therefore, merging dtensors into a global tensor can be done offline via our CLI.
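A minimal sketch of such an offline merge, assuming 1-D sharding along dim 0; the function name is hypothetical, and a real CLI would consult the saved sharding spec instead of hard-coding the dimension:

```python
import glob
import torch

def merge_dtensor_shards(prefix: str, dim: int = 0) -> torch.Tensor:
    # Hypothetical helper: merge numbered shard files
    # (e.g. linear_weight.bin.0, linear_weight.bin.1, ...) into one tensor.
    shard_files = sorted(
        glob.glob(f"{prefix}.bin.*"),
        key=lambda p: int(p.rsplit(".", 1)[-1]),  # order by shard number
    )
    shards = [torch.load(f) for f in shard_files]
    # Assumption: 1-D sharding along `dim`; a real tool would read the
    # sharding spec (device mesh + shard dim) saved alongside the shards.
    return torch.cat(shards, dim=dim)

# e.g. merged = merge_dtensor_shards("linear_weight")
```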