Now you can use this image mixing pipeline with the official diffusers repo.
This approach lets you combine two images using standard diffusion models, without any prior models.
The existing CLIP-guided stable diffusion algorithm was modified and extended to work with images.
WARNING: It is hard to get a good image mixing result on the first try.
All examples can be found in the ./jupyters folder:
File Name | Description |
---|---|
example-no-CoCa.ipynb | Short minimal example of image mixing. The drawback of this approach is that you have to write a prompt for each image. |
example-stable-diffusion-2-base.ipynb | Example with stable-diffusion-2-base. CoCa is used for prompt generation. |
example-load-by-parts.ipynb | Example where each diffusers module is loaded separately. |
example-find-best-mix-result.ipynb | Step-by-step explanation of how to find good mixing parameters. (By brute-force enumeration of each parameter. xD) |
example-as-augmentation.ipynb | Using image mixing for image augmentation. Summer-to-winter example. |
The algorithm is based on the idea of CLIP-guided stable diffusion img2img, but with some modifications:
- Two images and (optionally) two prompts (a description of each image) are expected.
- An interpolated (content-style) CLIP image embedding is used for guidance (the original uses a CLIP text embedding).
- An interpolated (content-style) text embedding is used for guidance (the original uses a single text embedding).
- (Optionally) A CoCa model is used to generate the image descriptions.
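For the optional CoCa captioning step, here is a minimal sketch using open_clip. The model and checkpoint names and the decoding snippet follow the open_clip documentation and may change between versions, so treat them as assumptions rather than this repo's exact code:

```python
import open_clip
import torch
from PIL import Image

# Load a CoCa model from open_clip (checkpoint name is an assumption, check the open_clip docs).
model, _, transform = open_clip.create_model_and_transforms(
    "coca_ViT-L-14", pretrained="mscoco_finetuned_laion2B-s13B-b90k"
)
model.eval()

image = transform(Image.open("./images/boromir.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    generated = model.generate(image)

# Strip special tokens to get a plain-text caption usable as content_prompt / style_prompt.
caption = (
    open_clip.decode(generated[0])
    .split("<end_of_text>")[0]
    .replace("<start_of_text>", "")
    .strip()
)
print(caption)
```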
Using different coefficients you can select the type of mixing: from style to content or from content to style. See the parameter descriptions below.
Style-to-prompt and prompt-to-style mixing give different results (see the example).
```bash
git clone https://github.com/TheDenk/images_mixing.git
cd images_mixing
pip install -r requirements.txt
```
```python
import torch
from PIL import Image
from diffusers import DiffusionPipeline
from transformers import CLIPFeatureExtractor, CLIPModel

# Loading additional models
feature_extractor = CLIPFeatureExtractor.from_pretrained(
    "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
)
clip_model = CLIPModel.from_pretrained(
    "laion/CLIP-ViT-B-32-laion2B-s34B-b79K", torch_dtype=torch.float16
)

# Creating the pipeline
mixing_pipeline = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="./images_mixing.py",
    clip_model=clip_model,
    feature_extractor=feature_extractor,
    torch_dtype=torch.float16,
)
mixing_pipeline.enable_attention_slicing()
mixing_pipeline = mixing_pipeline.to("cuda")

# Running the pipeline
generator = torch.Generator(device="cuda").manual_seed(117)

content_image = Image.open('./images/boromir.jpg').convert("RGB")
style_image = Image.open('./images/gigachad.jpg').convert("RGB")

pipe_images = mixing_pipeline(
    content_prompt='boromir',
    style_prompt='gigachad',
    num_inference_steps=50,
    content_image=content_image,
    style_image=style_image,
    noise_strength=0.6,
    slerp_latent_style_strength=0.8,
    slerp_prompt_style_strength=0.2,
    slerp_clip_image_style_strength=0.2,
    guidance_scale=9.0,
    batch_size=1,
    clip_guidance_scale=100,
    generator=generator,
).images

pipe_images[0]
```
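Outside a notebook the last line will not display anything, so you will probably want to save the result instead; the output path here is just an example:

```python
# Write the mixed image to disk (the file name is arbitrary).
pipe_images[0].save("./mixed_result.png")
```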
With Segment Anything you can effectively augment an image dataset (see the Jupyter notebook example).
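A rough sketch of one way to do this, assuming the official segment-anything package and a downloaded ViT-B checkpoint; the point prompt, file paths, and compositing step are illustrative and not the notebook's exact code (it reuses `content_image` and `pipe_images` from the example above):

```python
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

# Load SAM and segment the object of interest in the content image.
sam = sam_model_registry["vit_b"](checkpoint="./sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

content_np = np.array(content_image)
predictor.set_image(content_np)

# A single foreground point prompt (coordinates are illustrative).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),
    point_labels=np.array([1]),
)
mask = masks[np.argmax(scores)]  # best-scoring binary mask, shape (H, W)

# Composite: keep the mixed result inside the mask, the original image elsewhere.
mixed_np = np.array(pipe_images[0].resize(content_image.size))
augmented = np.where(mask[..., None], mixed_np, content_np).astype(np.uint8)
Image.fromarray(augmented).save("./augmented_example.png")
```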
Each `slerp_` parameter affects both images, style and content (more style means less content and vice versa); a short slerp sketch after the example parameter sets below illustrates how such a coefficient works.
Parameter Name | Description |
---|---|
slerp_latent_style_strength | Affects the initial noised latents. The starting latents are computed as the spherical interpolation (slerp) between the latents of the style image and the content image. |
slerp_prompt_style_strength | Affects each diffusion iteration, both as the usual prompt conditioning and in the CLIP-guided step. The text embedding is computed with the CLIP text model as the spherical interpolation between the embeddings of the style prompt and the content prompt. |
slerp_clip_image_style_strength | Affects each diffusion iteration in the CLIP-guided step. The image embedding is computed with the CLIP image model as the spherical interpolation between the embeddings of the style image and the content image. |
noise_strength | Just the noise coefficient. A lower value keeps more of the original information from the starting latents. Recommended range: 0.5 to 0.7. |
```python
noise_strength=0.5
slerp_latent_style_strength=0.8
slerp_prompt_style_strength=0.2
slerp_clip_image_style_strength=0.2
```

```python
noise_strength=0.5
slerp_latent_style_strength=0.2
slerp_prompt_style_strength=0.8
slerp_clip_image_style_strength=0.8
```
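As a rough illustration of what these strength coefficients mean, here is a minimal slerp sketch; the function name and the toy tensors are placeholders, not the pipeline's internals:

```python
import torch


def slerp(t, v0, v1, eps=1e-7):
    """Spherically interpolate between two embeddings.

    t is the style strength: t = 0.0 returns v0 (content), t = 1.0 returns v1 (style).
    """
    v0_n = v0 / v0.norm()
    v1_n = v1 / v1.norm()
    dot = (v0_n * v1_n).sum().clamp(-1 + eps, 1 - eps)
    theta = torch.acos(dot)       # angle between the two embeddings
    sin_theta = torch.sin(theta)
    return (torch.sin((1 - t) * theta) / sin_theta) * v0 + (torch.sin(t * theta) / sin_theta) * v1


# Toy example: blend two fake CLIP-sized embeddings 20% towards the style one.
content_emb = torch.randn(512)
style_emb = torch.randn(512)
mixed_emb = slerp(0.2, content_emb, style_emb)
```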
Issues should be raised directly in the repository. For professional support and recommendations please contact [email protected].