WIP guide on style training with [filewords]. #443
-
I'll try to tell you what I know.
-
Has the UI changed since this was written? I can't find some of the fields, and there seem to be others.
-
It's all good, man. I followed yours and another one, and between the two I managed to hack together a model! I'll upload it to Hugging Face later. Cheers, man.
On Mon, 12 Dec 2022, 11:03 pm, GunnarQuist wrote:
Yes, the UI seems to get an overhaul at least once a week. This guide
needs to be updated.
I like the improvements. But until the updates stabilize and the UI
remains unchanged, I can't write my own guide. I started writing one but
the UI changed. And then it changed again. So for now I've stopped writing
a guide until I can learn how to train using v2-1 at 512 and 768.
Keep up the good work, folks! I appreciate what you are doing.
-
That was pretty quickly outdated. Be careful when cropping your images. I gave regularization/class images a shot. Keen to try some of the new features, but don't know what they do.
-
My personal experience was that regularization images made my model more flexible, but I have found that if you want full flexibility, you need to mix a strong DB model (weight 0.75) with a base model that is close to your theme. (Model A is the base model, Model B is yours.) So for general content use SD 1.5, and for adult content use NovelAI.
-
@piyarsquare, I'm still in the process of researching the impact of all the standard settings, but I would like you to run one experiment: try raising batch_size to 2, 4 or 8, leave the rest of the parameters unchanged, and then compare the two models. When I increased this parameter in my tests, I got the best quality, especially in faces and anatomy, although training takes longer.
-
@piyarsquare, first, thank you for the guide! For the Dataset Directory field under the Concepts tab, what is the format of the directory path we put into the field?
1: "C:\_CODE_Github\stable-diffusion-webui\__inputs\images_and_captions" OR
2: "__inputs\images_and_captions" OR
3: "/__inputs/images_and_captions"
I'm on Windows 11.
-
@piyarsquare From my understanding, you're using [filewords] incorrectly. As described in the hints for the filewords and prompt boxes, when using [filewords] the Filewords box and the Prompts box should each be filled in as shown in the screenshots (not reproduced here). Also, the following should help: if you're using the 7GB EMA file you should check "Extract EMA", and leave it unchecked if you're using the 4GB checkpoint; mirror this in the Advanced tab (i.e. "Use EMA"). You can generate the filewords descriptions in the Train tab under "Preprocess images" and choose your flavour of interrogation (deepbooru seems to give the best results), then double-check them and edit where necessary.
-
Just wondering, does anyone use the Instance token and the Class token? How exactly do they work? I don't really understand what I'm replacing with them yet. Is it that if I type in a keyword, the model interprets it as the text I enter here? I'd be interested to see how much it improves editability.
-
In the last few days I have been looking for logical relationships by adding extra parameters to the @cerega66 formula. I have taken on a difficult task, because I am not training a person or a style but a given pose, so there are several different people in each picture and the pose is shown from several different angles. I trained on 16 images at a learning rate of 2.5e-6 using the formula, saving a checkpoint every 512 steps. In general, 1.5x the steps worked best, so 1536 and 2048 steps were about right; 1024 was undertrained and higher counts were overtrained. Let the unique keyword be xyz. I tried the following options:
I was looking for a correlation between the following goals:
I tried several different poses with different data sets. Since this involves a lot of variation, I will highlight the context. I'm still not completely clear on how the Instance token and Class token work, but what I have noticed is that they have a big impact on editability. Here are the variations I tried, with the following parameters held constant (the rest varied):
Shuffle tags and Horizontal flip were off in most cases; where I turned them on, I'll note it. Here are the variations:
For the other variations, the models did not perform well:
That's it for now. It looks like I get the most flexible model when I don't use Horizontal flip, I use keywords in the Instance prompt, I use class images for the general appearance, and the tokens are filled in. The [filewords] help a lot with editability if they are tagged correctly and are the words you want to use later. If it's an object, tag it; if it's a feature, tag it too, but mix in more related words. That's why it's better to have lots of images: if a tag is used only once, it will be tied to that image (see this post of mine). So it's worth taking the time to do this, or, instead of using [filewords], enter the keywords in the Instance prompt separated by commas. Another tip: a further way to improve editability is to train a stronger model (so 2048 steps is good in our example) and combine it with the base model at 0.75 and 0.8 weights. This way the mixed model keeps the keywords, yet remains flexible and editable.
-
I have modified this [filewords] post; I now know with almost complete certainty what it is for, and I have tried to explain clearly what the advantages and disadvantages are. It is definitely recommended to use it together with class images!
-
Has anyone figured out how to increase the max token length beyond 300?
-
NEWLY UPDATED (so probably already outdated...)
I tried to train a Simpsons style with best practices gleaned from this site. This is an outline of my workflow. Anything that was uncertain, I marked with a ❓. I would be happy to hear any insights. I will amend/have amended this post with your feedback, denoted with a ❗.
Thanks to @d8ahazard for all of this unbelievable fun we've been having. But do you think you could type a little slower? It's getting hard for us to keep up.
Also, there is an excellent post from @cerega66 detailing the effects of a wide range of parameters. I will update this post to reflect their results soon.
Note:
I use xyz as the new keyword in this example.
❗ This is a bad keyword since it is composed of two tokens ❗
You should pick a short keyword that is rarely used and is a single token. For my project, I used the asim keyword from the keyword lists in this reddit post. I do not know if the ordering of the list matters as the post suggests (❓). I selected a four-letter word from this github list by text-searching for pairs of letters that have something to do with my subject, "The Simpsons." I searched for sim and found asim was a single token near the bottom of this list. I tried asim out in the Tokenizer tab and it confirmed that it is a single token. I also generated a few test images with just "asim" as the prompt and they struck me as pretty vague. Be warned, if you do not test your token, you may end up with the Tottenham Hotspur Football Club. (thfc, also not a great token...)
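If you want to check candidate keywords outside the UI, the Tokenizer tab check can be approximated with a few lines of Python using the Hugging Face transformers library. This is only a sketch: it assumes the standard SD 1.x CLIP tokenizer, and the word list is just an example.

```python
# Rough equivalent of the WebUI "Tokenizer" tab check, using the CLIP
# tokenizer that Stable Diffusion 1.x models are built around.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

for word in ["asim", "xyz", "thfc"]:
    pieces = tokenizer.tokenize(word)  # a good keyword yields exactly one piece
    print(word, "->", pieces, f"({len(pieces)} tokens)")
```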
DO NOT USE "xyz" as your keyword.
However, as a variable name we will use xyz as a placeholder in this post.
❗ According to @cerega66, using a two-token keyword will double your training time.
The image dataset.
I did this in Photoshop.
Here I made some mistakes. I used a crop preset set to 512 px by 512 px. Even though I selected regions that were larger than that, Photoshop's downscaling algorithm would often add a "halo" to the edges, particularly the black edges common to cartoon characters. I think the best thing to do is to crop without scaling, but failing that you can choose a downscaler that minimizes artifacts. GIMP has a "NoHalo" option. In Photoshop, I got the following results for Professor Frink. I think all of them show some halo effect, though nearest neighbor has the least and bilinear looks like the best trade-off:
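If you want to compare downscalers outside of Photoshop, a small Pillow sketch like the one below writes one 512x512 version of a crop per resampling filter so you can eyeball the halos side by side. The crop file name here is a placeholder.

```python
# Compare downscaling filters on one crop to see which adds the least halo.
# "frink_crop.png" is a placeholder for a crop larger than 512x512.
from PIL import Image

src = Image.open("frink_crop.png")
filters = {
    "nearest": Image.NEAREST,
    "bilinear": Image.BILINEAR,
    "bicubic": Image.BICUBIC,
    "lanczos": Image.LANCZOS,
}
for name, resample in filters.items():
    src.resize((512, 512), resample=resample).save(f"frink_512_{name}.png")
```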
Select images and crop them so that you can describe them easily in the next section.
In Windows, select all the files in the image directory.
Select “rename.”
Type xyz and hit return.
All the files should be labeled xyz (1).png through xyz (100).png.
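If you would rather script the renaming (or are not on Windows), a short Python sketch does the same thing. The directory path is a placeholder, and it assumes the files are not already using the target names.

```python
# Script equivalent of the Windows bulk-rename trick:
# name every image "xyz (1).png" ... "xyz (N).png".
from pathlib import Path

image_dir = Path(r"C:\path\to\xyz")  # placeholder path
for i, src in enumerate(sorted(image_dir.glob("*.png")), start=1):
    src.rename(image_dir / f"xyz ({i}).png")
```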
The captions.
I followed this reddit post.
Example caption: "A smiling Caucasian 40-year-old man with thick glasses and buck teeth dressed in a blue NASA jumpsuit with brown boots standing next to a garden with birds of paradise plants and ferns growing inside a large room. A robot arm picks flowers in the foreground."
Important hang-up: I needed a “utf-8” output file, so I used:
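The exact snippet matters less than passing encoding="utf-8" when writing the files. A minimal Python sketch of the idea, where the captions list and the output directory are placeholders:

```python
# Write one caption per image as "xyz (1).txt", "xyz (2).txt", ...
# The important part is encoding="utf-8"; otherwise Windows may default to a
# legacy code page and fail on characters like curly quotes.
from pathlib import Path

out_dir = Path(r"C:\path\to\xyz")                        # placeholder path
captions = ["A smiling Caucasian 40-year-old man ..."]   # placeholder captions

for i, caption in enumerate(captions, start=1):
    (out_dir / f"xyz ({i}).txt").write_text(caption, encoding="utf-8")
```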
At the end, each caption is matched to a file xyz (1).txt through xyz (100).txt and describes the correspondingly numbered .png file.
4. Put them all into the same directory for training: /path/to/xyz
Training.
For now, we are going to stick with the old-fashioned non-LORA stuff.
I am testing the LORA encoding out now. I don't really know what I'm doing, but I'm doing it much faster.
Go to the Create Model tab.
3. Extract EMA Weights = unchecked. ❓
Select your new model
Go to the Concepts tab.
We will be training one concept "xyz style" so we will only use the Concept 1 tab. Set maximum training steps to -1.
Dataset Directory = /path/to/xyz (directory containing xyz (n).png and .txt files.)
Classification Dataset Directory = blank
Instance Token and Class Token = leave these empty. These are useful when swapping a keyword into your captions. For example, if your captions were "A closeup photograph of a woman eating a pie." and you are training that particular woman to the keyword xyz, you would put xyz as the "Instance token" and woman as the "Class token."
Instance Prompt = xyz style [filewords]
Class Prompt = empty, we are not using prior preservation (this time❓)
NOTE: ❗ I tried running the same model but with prior preservation.
See final section for notes on Prior Preservation. For style training, it is the current wisdom that prior preservation is not needed. Without prior preservation, training runs 50% faster. But it may be that you get better results in half the steps. For more details, see below.
Sample Image Prompt = xyz style. Mark Twain.
Sample Prompt Template File = path to file with a list of prompts to randomly select from during sample generation. Leave blank to use Sample Image Prompt.
Sample Image Negative Prompt = watermark, text, signature, cross-eyed. (What you do not want to see in your samples.)
Go to the Parameters tab.
For Settings
Training steps per image (Epochs) = 100
Max Training Steps = 0 (let epochs determine the number of steps)
Pause After N Epochs = 0
Amount of time to pause between Epochs, in seconds = 0
Use Lifetime Steps/Epochs when Saving: checked
Save Preview/Ckpt Every epoch: unchecked (that would be 100 ckpt in this case.)
Save checkpoint frequency = 1000
Save Preview(s) frequency = 1000
Batch Size = 1
Class Batch Size = 1
My 3090 Ti refuses anything higher.
Learning rate = 1.72e-6 (thanks @cerega66!). I will explore this more in time.
Lora unet Learning Rate = does not matter because we're not using lora.
Lora Text Encoder Learning Rate = does not matter because we're not using lora.
Scale learning rate = unchecked ❓
❗ This section allows you to vary the learning rate over the course of training. One may wish to start with a high learning rate and then lower it as training proceeds. This is similar to annealing, where the temperature of glass or steel is reduced over time to obtain a more stable product. I may experiment with this in the future, but it is not high on my list. (See the short scheduler sketch after this settings list.)
Learning rate scheduler = constant (does not matter if SLR unchecked)
Learning rate warmup steps = 0 (does not matter if SLR unchecked)
Resolution 512
Center crop = off
Apply Horizontal flip = off (do not want to sacrifice asymmetry of my images)
Pretrained VAE Name or Path = blank
Use concept list = unchecked
Concept List = blank
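For anyone wondering how the three learning-rate settings above fit together, here is a rough sketch of the underlying idea using the diffusers scheduler helper. This is an illustration, not the extension's actual code; the optimizer and step count are stand-ins.

```python
# Conceptual illustration of "Learning rate scheduler" / "warmup steps".
# Not the extension's actual code; the optimizer and step count are stand-ins.
import torch
from diffusers.optimization import get_scheduler

params = [torch.nn.Parameter(torch.zeros(1))]      # stand-in for model parameters
optimizer = torch.optim.AdamW(params, lr=1.72e-6)  # the learning rate set above

# The guide uses "constant"; swapping in "cosine" or "polynomial" would anneal
# the rate downward over training, as described in the note above.
lr_scheduler = get_scheduler(
    "constant",
    optimizer=optimizer,
    num_warmup_steps=0,         # "Learning rate warmup steps"
    num_training_steps=10_000,  # total steps (100 epochs x 100 images here)
)

for step in range(10_000):
    # ...forward/backward pass would go here...
    optimizer.step()
    lr_scheduler.step()
```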
For Advanced
Use CPU only (SLOW) = unchecked
Use LORA = unchecked
Use EMA = checked (can't say why or what this does)
Use 8 bit Adam = ❗ checked (uses less memory)
Mixed Precision = fp16 (experimenting with bf16, which seems to be better suited to this type of work.)
Memory Attention = ❗ xformers!! I can now run a batch size of 4 (maybe more, waiting impatiently)
Don’t Cache Latents = checked
Train Text Encoder = checked
Prior loss weight = 1
Pad Tokens = checked
Shuffle tags = unchecked (but should it be❓)
Max token length = 300, because some of my captions are very long.
Gradient Checkpointing = checked, but probably not needed if more than 4GB of VRAM still free. I will try without this next time.
Hit Train!
On my 3090 Ti, it takes about 1.25 hours for 10,000 training steps = 100 epochs.
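For reference, and assuming the roughly 100 images implied by the dataset section: total steps = images × epochs ÷ batch size = 100 × 100 ÷ 1 = 10,000.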
Preliminary results look good:
However, some generations give double pupils or no pupils.
I included some images in the training data with multiple characters and I think that may have been a mistake.
Prior Preservation.
I have tried a run with prior preservation turned on.
To do that, we have the following minor deviations from the above notes:
Our only changes are to the Concepts tab.
That is all. When the training begins, if there are no images in your classification dataset, the program will generate them from the text files in your training directory. That way, your classification images reflect "A smiling Caucasian 40-year-old man with thick glasses and buck teeth dressed in a blue NASA jumpsuit with brown boots standing next to a garden with birds of paradise plants and ferns growing inside a large room. A robot arm picks flowers in the foreground."
And Stable Diffusion tries to match asim style to Frink without changing this guy too much.
Testing
The results are difficult to judge. I have not yet found a process for comparing models that makes me happy. I think the right tool is infinity grid generator. It generates an html with a nice interface where you can explore different seeds, parameters and models.
Screenshot of the web-interface:
Here are some images at 10K steps without prior preservation.
Same prompts and seeds at 10K steps with PP.
Which are better? The prior preservation model has a "locked-in" feeling where different seeds seem to generate very similar outputs. Another user suggested mixing the model back into the base.
These are mixed in at 75%. The cicadas defy the tag, but the others do have greater variation.
Mixed at 85%, we start getting cartoon bugs.
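For reference, the 75% and 85% mixes above correspond to the WebUI Checkpoint Merger's weighted sum, which is just a linear interpolation of the two models' weights. A rough sketch of that idea; the file names are placeholders, and it assumes plain .ckpt files with a top-level "state_dict" key.

```python
# Weighted-sum merge of two checkpoints: merged = alpha*tuned + (1 - alpha)*base.
# File names are placeholders; assumes .ckpt files with a "state_dict" key.
import torch

alpha = 0.75
base = torch.load("v1-5-pruned.ckpt", map_location="cpu")["state_dict"]
tuned = torch.load("asim_style.ckpt", map_location="cpu")["state_dict"]

merged = {}
for key, base_tensor in base.items():
    tuned_tensor = tuned.get(key)
    if (
        tuned_tensor is not None
        and tuned_tensor.shape == base_tensor.shape
        and base_tensor.is_floating_point()
    ):
        merged[key] = alpha * tuned_tensor + (1.0 - alpha) * base_tensor
    else:
        merged[key] = base_tensor  # keep the base weight if the key can't be blended

torch.save({"state_dict": merged}, "asim_style_mixed_75.ckpt")
```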
I also compared across people and landscapes, but I do not have a satisfying metric for comparison.
I would like to hear what others do.