ValueError: Default process group has not been initialized, please make sure to call init_process_group #3315

Open
4 tasks done
wangyu-ustc opened this issue Dec 27, 2024 · 1 comment

wangyu-ustc commented Dec 27, 2024

System Info

accelerate == 1.2.0
deepspeed == 0.16.2

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

  1. I copied examples/nlp_example.py into a new folder as main.py, then replaced the optimizer and scheduler in the file with DummyOptim and DummyScheduler to make it compatible with DeepSpeed configs. Up to this point the code runs properly. Once I add the following lines, it raises the error in the title:
accelerator.wait_for_everyone()
state_dict = accelerator.get_state_dict()
accelerator.unwrap_model(model).save_pretrained(
    f"{args.output_dir}",
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=state_dict,
)
  2. Put the following file, stage2.json, into the same folder:
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "weight_decay": "auto",
            "torch_adam": true,
            "adam_w_mode": true
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": "auto",
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
  3. Put the following content into config.yaml:
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  deepspeed_config_file: stage2.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: INDUCTOR
  dynamo_mode: default
  dynamo_use_dynamic: true
  dynamo_use_fullgraph: true
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
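
A quick sanity check on the two config files may help narrow things down. The sketch below lists every DeepSpeed entry left as "auto" (values that Accelerate is expected to fill in at prepare time) and flags the combination of zero3_init_flag: true with a ZeRO stage other than 3, since that flag is documented as applying only to stage 3. The helper names find_auto_fields and zero3_init_mismatch are hypothetical, the dicts are hand-copied excerpts of the files above, and whether the flag mismatch has anything to do with the reported error is not established here.

```python
def find_auto_fields(config, prefix=""):
    """Return dotted paths of every entry whose value is the string "auto"."""
    paths = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            paths.extend(find_auto_fields(value, path))
        elif value == "auto":
            paths.append(path)
    return paths


def zero3_init_mismatch(accelerate_cfg, deepspeed_cfg):
    """Return True when zero3_init_flag is set but the ZeRO stage is not 3."""
    flag = accelerate_cfg.get("deepspeed_config", {}).get("zero3_init_flag", False)
    stage = deepspeed_cfg.get("zero_optimization", {}).get("stage")
    return bool(flag) and stage != 3


# Excerpts of stage2.json and config.yaml, written as Python dicts for illustration.
deepspeed_cfg = {
    "optimizer": {"type": "AdamW", "params": {"lr": "auto", "weight_decay": "auto"}},
    "zero_optimization": {"stage": 2, "reduce_bucket_size": "auto"},
    "train_batch_size": "auto",
}
accelerate_cfg = {
    "deepspeed_config": {
        "deepspeed_config_file": "stage2.json",
        "zero3_init_flag": True,
    }
}

print(find_auto_fields(deepspeed_cfg))
print(zero3_init_mismatch(accelerate_cfg, deepspeed_cfg))
```

With the excerpts above, the second call reports a mismatch because the stage is 2 while zero3_init_flag is true.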

Then run accelerate launch --config_file config.yaml main.py. It raises ValueError: Default process group has not been initialized, please make sure to call init_process_group.
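
To see what each process believes about the distributed setup just before the failing call, a small diagnostic like the one below could be dropped into main.py. It is a minimal sketch, not part of Accelerate: describe_distributed_state is a hypothetical helper, and the environment variables it reads are the ones torchrun-style launchers typically set, which is an assumption about this launch path. It uses only torch.distributed.is_available() and torch.distributed.is_initialized().

```python
import importlib.util
import os


def describe_distributed_state():
    """Collect facts relevant to 'Default process group has not been initialized'."""
    info = {
        # Variables usually exported by torchrun-style launchers (assumption).
        "RANK": os.environ.get("RANK"),
        "WORLD_SIZE": os.environ.get("WORLD_SIZE"),
        "LOCAL_RANK": os.environ.get("LOCAL_RANK"),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "dist_available": None,
        "dist_initialized": None,
    }
    if info["torch_installed"]:
        import torch.distributed as dist

        info["dist_available"] = dist.is_available()
        if info["dist_available"]:
            # False here means no default process group exists in this process.
            info["dist_initialized"] = dist.is_initialized()
    return info


for key, value in describe_distributed_state().items():
    print(f"{key}: {value}")
```

If dist_initialized prints False on every rank right before accelerator.wait_for_everyone(), the process group was never created (or was already destroyed) in that process, which matches the wording of the error.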

Expected behavior

I expect accelerator.wait_for_everyone to work fine.

parasurama commented Jan 12, 2025

I'm running into the same issue, but without DeepSpeed. It started after I updated accelerate.
