
train_dreambooth_lora_flux validation RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same #9476

Open
squewel opened this issue Sep 19, 2024 · 9 comments
Labels: bug (Something isn't working), stale (Issues that haven't received updates)

Comments

squewel commented Sep 19, 2024

Describe the bug

When train_dreambooth_lora_flux attempts to generate images during validation, it throws: RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same

Reproduction

Just follow the steps from README_flux.md for DreamBooth LoRA with text-encoder training:

export OUTPUT_DIR="trained-flux-dev-dreambooth-lora"

accelerate launch train_dreambooth_lora_flux.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="bf16" \
  --train_text_encoder \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --guidance_scale=1 \
  --gradient_accumulation_steps=4 \
  --optimizer="prodigy" \
  --learning_rate=1. \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --seed="0" \
  --push_to_hub

Logs

09/19/2024 23:08:58 - INFO - __main__ - Running validation... ███████████████████████████████████████████████████| 7/7 [00:00<00:00, 13.76it/s]
 Generating 4 images with prompt: a photo of sks dog
W0919 23:12:39.471000 139969377689600 torch/fx/experimental/symbolic_shapes.py:4449] [0/3] xindex is not in var_ranges, defaulting to unknown range.
W0919 23:17:03.532000 139969377689600 torch/fx/experimental/symbolic_shapes.py:4449] [0/4] xindex is not in var_ranges, defaulting to unknown range.
Traceback (most recent call last):
  File "/workspace/flux-diffusers/diffusers/examples/dreambooth/train_dreambooth_lora_flux.py", line 1890, in <module>
    main(args)
  File "/workspace/flux-diffusers/diffusers/examples/dreambooth/train_dreambooth_lora_flux.py", line 1810, in main
    images = log_validation(
  File "/workspace/flux-diffusers/diffusers/examples/dreambooth/train_dreambooth_lora_flux.py", line 189, in log_validation
    images = [pipeline(**pipeline_args, generator=generator).images[0] for _ in range(args.num_validation_images)]
  File "/workspace/flux-diffusers/diffusers/examples/dreambooth/train_dreambooth_lora_flux.py", line 189, in <listcomp>
    images = [pipeline(**pipeline_args, generator=generator).images[0] for _ in range(args.num_validation_images)]
  File "/workspace/flux-diffusers/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/flux-diffusers/diffusers/src/diffusers/pipelines/flux/pipeline_flux.py", line 762, in __call__
    image = self.vae.decode(latents, return_dict=False)[0]
  File "/workspace/flux-diffusers/diffusers/src/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/workspace/flux-diffusers/diffusers/src/diffusers/models/autoencoders/autoencoder_kl.py", line 321, in decode
    decoded = self._decode(z).sample
  File "/workspace/flux-diffusers/diffusers/src/diffusers/models/autoencoders/autoencoder_kl.py", line 292, in _decode
    dec = self.decoder(z)
  File "/workspace/flux-diffusers/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/flux-diffusers/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/flux-diffusers/diffusers/src/diffusers/models/autoencoders/vae.py", line 291, in forward
    sample = self.conv_in(sample)
  File "/workspace/flux-diffusers/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/flux-diffusers/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/flux-diffusers/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 458, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/workspace/flux-diffusers/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same

System Info

Diffusers:

- Platform: Linux-5.4.0-167-generic-x86_64-with-glibc2.35
- Running on Google Colab?: No
- Python version: 3.10.12
- PyTorch version (GPU?): 2.4.1+cu121 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.25.0
- Transformers version: 4.44.2
- Accelerate version: 0.34.2
- PEFT version: 0.12.0
- Bitsandbytes version: not installed
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: NVIDIA L40, 46068 MiB

Accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: INDUCTOR
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Who can help?

@sayakpaul @linoytsaban

squewel added the bug label and updated the issue title on Sep 19, 2024
icsl-Jeon (Contributor) commented

Even the line below leads to OOM:

# autocast_ctx = torch.autocast(accelerator.device.type) if not is_final_validation else nullcontext()

kishlaykumar1995 commented

Had the same issue. As a temporary fix, I added code to cast the latents to bfloat16 at line 762 of pipeline_flux.py and it worked, but I don't know whether that is the correct fix.
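
For reference, the change described above would look roughly like the following (a minimal sketch; the surrounding code and exact line number may differ between diffusers versions). It casts the latents to the VAE's dtype right before the decode call that fails in the traceback:

# in FluxPipeline.__call__ (pipeline_flux.py), just before decoding
latents = latents.to(self.vae.dtype)  # cast fp32 latents to match the bf16 VAE weights
image = self.vae.decode(latents, return_dict=False)[0]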

xngli commented Sep 26, 2024

> Even the line below leads to OOM:
>
> # autocast_ctx = torch.autocast(accelerator.device.type) if not is_final_validation else nullcontext()

Uncommenting this line and commenting out the one below it resolved the issue for me.
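
In other words, after the swap the relevant lines in log_validation would read roughly like this (a sketch, using the variable names from the training script):

autocast_ctx = torch.autocast(accelerator.device.type) if not is_final_validation else nullcontext()
# autocast_ctx = nullcontext()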

linoytsaban (Collaborator) commented

@sayakpaul do you recall why we have this line commented out in log_validation? It might also be an issue with other scripts and/or related to #9419.

# autocast_ctx = torch.autocast(accelerator.device.type) if not is_final_validation else nullcontext()
autocast_ctx = nullcontext()
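
For context, log_validation generates the validation images under that context manager; a rough sketch of how the pipeline call is wrapped, based on the list comprehension in the traceback above, is:

with autocast_ctx:
    images = [pipeline(**pipeline_args, generator=generator).images[0] for _ in range(args.num_validation_images)]

With autocast_ctx = nullcontext(), nothing downcasts the fp32 activations, so they reach the bf16 VAE unconverted and the conv layer raises the dtype mismatch.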

linoytsaban (Collaborator) commented

Seems to be the same issue as #9548 and #9549.

github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label on Oct 26, 2024
luchaoqi (Contributor) commented

Not sure this problem is fully resolved. I was trying to use

    autocast_ctx = torch.autocast(accelerator.device.type) if not is_final_validation else nullcontext()
    # autocast_ctx = nullcontext()

but got an error and black images similar to #9549:

12/16/2024 01:05:28 - INFO - __main__ - Running validation...
 Generating 4 images with prompt: a photo of sks person at 50 years old.
/playpen-nas-ssd/luchao/software/miniconda3/envs/diffuser/lib/python3.10/site-packages/diffusers/image_processor.py:147: RuntimeWarning: invalid value encountered in cast
  images = (images * 255).round().astype("uint8")

Then I tried the fix from PR #9565 but still get the error:

RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same

The command I run:

accelerate launch train_dreambooth_lora_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev"  \
  --instance_data_dir="xxx" \
  --output_dir="xxx" \
  --mixed_precision="bf16" \
  --instance_prompt="a photo of sks person" \
  --resolution=512 \
  --train_batch_size=1 \
  --guidance_scale=1 \
  --gradient_accumulation_steps=4 \
  --optimizer="prodigy" \
  --learning_rate=1. \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="a photo of sks person at 50 years old" \
  --validation_epochs=25 \
  --seed="0" \
  --lora_layers="attn.to_k,attn.to_q,attn.to_v,attn.to_out.0"

github-actions bot removed the stale label on Dec 16, 2024

github-actions bot commented Jan 9, 2025

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label on Jan 9, 2025