[Pre-Final] A large study of the main training parameters *Update* #547
19 comments · 68 replies
-
Is there also an instruction for LoRA?
-
Thanks for writing the summary; it makes the whole process much clearer.
-
So, class images should be a more specific approximation of your character than the class prompt? Could you clarify what you mean by not the same "type"? Basically, you're just trying to abstract/generalize your training material into a greater variety with the class images?
-
I did not know how gradient accumulation and batch size interact. Thank you for sharing your research.
-
Have you tried with and without Extract EMA?
-
Thanks a lot for this. I am trying to use the same parameters I used on 1.5 to train on 2.1, but the results are pretty bad on 2.1: I lose a lot of flexibility even with low learning rates, and I can't seem to get the same results as the default 2.1 model after training, even if I don't invoke my prompt. I am trying to train the face of a person with 40 images at 100 steps per image. I tried increasing and decreasing the learning rate, which didn't seem to matter, and tried adding filewords to describe each image, which also didn't seem to help. The model loses a lot of accuracy and the images look very low quality.
-
This is awesome, thank you! Would you mind posting the git hash of the repo you used for this testing? What diffusers version was it as well, if you don't mind?
-
So, FWIW, the issue with "it's not saving" is that it is saving, but checkpoint generation is not using the saved snapshot version. I have this fixed in the ImageBuilder+ branch; I'm just trying to tidy up a few more things, then it'll be pushed to main. My apologies for the confusion.
-
I have rewritten this post. The use of [filewords] now makes much more sense and helps a lot with editing. Most important: use keywords and concepts in your TXT file. The keywords used in the [filewords] TXT are linked to the sample image, so keep that in mind.
In summary: if you don't have to, do NOT use filewords. The more you have, the more you have to type in the prompt to get the result. This is only good if you want to control a lot of parameters on an image, and only specific ones, because otherwise they become bound to the image! So only use filewords that you actually want to specify in the prompt (e.g. front view, bottom view, top view, etc.). That's it; I hope this makes clear how to use TXT files and [filewords].
-
Here is a Google Sheets document I made for myself: https://docs.google.com/spreadsheets/d/1AGOfcC_IPuiZq6nhjStvKoTDwJKFRxifIProHpMwysw/edit?usp=sharing Make a copy and edit the values.
-
When you say "frame," what exactly do you mean? Is that a dataset image? One of the various images of the subject that one tries to train? I searched through this thread and did not find a stated definition. Forgive me for asking such an extremely basic question; I've never seen the term "frame" used in any description of Dreambooth training, and this is the only thread (that I'm currently aware of) where it is used. I just want to have a clear understanding. I read the earlier version of your research and thought it was interesting. I am eager to absorb your most recent update.
-
Your formulas worked very well for my SD 1.5 trainings. Thank you for your hard work; it is very helpful. I hope you (or someone with your perseverance) will investigate similar formulas for SD 2.1 training at 768 resolution.
-
Finally finished reading this page. Great stuff. First, if I understand the paper correctly, I found a couple of small typos:
Example: you wrote "hard = (1e-5 / LR) * 2", therefore hard = 20, not 15.
These written values don't match the table: FP16, BF16. Also, one thing I noticed about the table with auto-calculations: you used "max_training_steps", which is how many steps of actual training take place. It might be easier for new users to use epochs, since that is what is in the extension. On my copy of your Excel sheet, I quickly added a converter to epochs, based on batch size. A couple of questions:
Your work is amazing. Unfortunately it doesn't quite transfer to what I'm working on at the moment. Your work seems to focus on training Dreambooth concepts, and all the math is for training a concept. I'm currently doing more of a major fine-tune, so the math doesn't quite work for me. My fine-tune currently uses hundreds of images and modifies many concepts. It includes: boy/girl/male/female, adult/child/teenager, student/teacher, preschool/elementary student, class level, facial/hair/clothing descriptions, body part/posture descriptions, environment descriptions, and probably more. I plan to expand to thousands of images. The goal is to feed the model pictures of my school/classes/students/teachers/etc., and have it produce images that look like they came from my school. I'm currently working on a way to establish a baseline for comparing my model with your math, so that I can use your math. I also need to decide how I want to use unique tokens, class tokens, and class frames. I'm also looking for a guide on fine-tuning a large model, but I can't find one.
-
Great information elaborating the parameters for Dreambooth training; thank you for sharing. I also have two more questions. If you can help, it would be highly appreciated.
Thanks.
-
@cerega66 Another question: are there any extra tricks for training on SD 2.1-512 or 2.1-768? I used the same approach I used for SD 1.5 to train a style on SD 2.1, and the results turned out badly compared with SD 1.5. A lot of the characteristics of the style disappeared...
-
I've been trying so hard to wrap my head around the formulas, but my attention span and learning are wonky. I need a place to start: if I have 50 training images, where do I begin with the formulas? I can train 512 at batch 5 and GA 5 max on my 3090 Ti. I just need assistance understanding and applying the formulas and/or the math charts provided. Traditionally, I've used polynomial on a single concept at a time at LR 1e-6. It's been more plug and play and hope for the best. I'd appreciate it if anyone is willing to assist.
-
I really appreciate this gold mine of knowledge; it saves us a ton of time vainly experimenting in this severely underdocumented hobby. We should have more collective knowledge bases, like a wiki. I'd like to share a neat trick: you can turn on gradient checkpointing to save a ton of VRAM and fit a batch size of 64 with 256 images on a 24 GB GPU. It's awesome, and it trains in about the same time as BS 10 with the option turned off. However, the part about increasing BS decreasing learning doesn't seem to apply to my LoRA training in Kohya SS. It actually increased it, and now the output is way overblown. E.g. I have 256 images and 64 BS * 4 GA. I tried decreasing the learning rate by a factor of 1000 and it still isn't enough. Maybe we can find a formula for LoRA that involves altering the LR instead of the max training steps. Edit: I'm pretty sure the LR parameter is broken in Kohya SS. 1e-4 gives the same result as 1e-30 and 1e0. Going to need to tweak the UNet and text encoder LRs separately. (Why is there even that param? Is it supposed to scale the other two in some weird way?)
-
Hi, is there an updated version of this guide? It has some outdated advice, like the recommended gradient accumulation setting, for example. On Google Colab, which kills the connection after 6 hours, you will want your batch size to be as high as it can get, with accumulation around 20. With a batch size of 6 and 20 accumulations, you can generate a single checkpoint in about 6 hours; you want as few checkpoints generated in a single day as possible. The more slowly it learns and the larger the batch size, the more effective the training run is: it helps avoid catastrophic loss and retains model flexibility. Additionally, freezing the text encoder layers from the end helps the most when you freeze all but the last two layers. If you freeze all but the last 7, you can train textures more easily, but it's too easy to overfit. Running a special training session with a more thawed text encoder and then freezing it to continue training is a strategy that has worked well for me.
-
Very nice job, thanks a lot!
-
NEW UPDATE
Table with mathematics:
https://docs.google.com/spreadsheets/d/1rUx7bPS0CKdJhb012uoGKjcqWdrpcXjv/edit?usp=share_link&ouid=118139618916251275759&rtpof=true&sd=true
Custom lr scheduler:
https://drive.google.com/file/d/1KiDQZvOdHy2lvad4BFZUW4wDiDdIl4Lr/view?usp=share_link
I have some questions, and if someone can help me with them, I will be grateful.
Everything listed below was tested by me on various models of versions 1.4, 1.5, 2.1-512 and 2.1-768.
FOREWORD
This research is an attempt to explore all the possible influences of the parameters in Dreambooth training, to understand the principles of training, and to identify formulas for "correct" training: minimizing overfitting, maximizing the concept, and optimizing training time. That is, to find formulas into which we simply plug our number of images and get the number of steps we need. This saves time and disk space, since there is no need to create savepoints and make comparisons.
Notes:
Main parameters
instance_prompt
This parameter determines which tokens will be used to invoke your concept.
class_prompt
Used to specify a class when using Prior_Preservation_Loss.
learning_rate
Affects how detailed your concept will be. That is, the lower the learning_rate, the more details can be captured and the more can be edited. But the more details you capture, the more steps you need to learn all the details from the frame.
Number of training steps (max_train_steps)
Specifies the overall length of the training pass.
Full pass steps and compensation
The number of full-pass steps is the number of steps after which we get a full reproduction of our concept. That is, if we train for this number of steps with 1 image in the concept, then when generating the concept we should get that image back. For 1e-5 this is 200 steps, or 100 if you train the Text Encoder (more about the Text Encoder in a separate section). For other LRs this value scales in inverse proportion to the LR; for 1e-6 it is 2000.
Recalculating for other LRs is not difficult; the formula looks like:
SPF = 200 * (1e-5 / LR)
for text encoder:
SPF = 100 * (1e-5 / LR)
where:
LR - learning_rate at which you are going to train
SPF - number of steps per frame
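To make this concrete, here is a minimal Python sketch of the recalculation (the function name `steps_per_frame` is mine, purely for illustration):

```python
def steps_per_frame(lr: float, train_text_encoder: bool = False) -> float:
    """Steps per frame (SPF) for a given learning rate, per the formulas above."""
    base = 100 if train_text_encoder else 200  # SPF at the reference LR of 1e-5
    return base * (1e-5 / lr)

print(steps_per_frame(1e-6))        # ~2000 steps per frame without TE
print(steps_per_frame(1e-5, True))  # 100.0 with TE
```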
However, when changing the LR there is a problem: when generating with high CFG values, images contain distortions, that is, elements in them begin to break, and the lower the LR, the lower the CFG value at which this begins to appear. This is most likely due to the deviation of the training LR from the base LR used for the initial training of the model, since the same problem is observed when the frame resolution is increased; there it is expressed differently, but the principle is similar.
Improving quality -> more steps needed -> problems
The basic parameters for training the model are as follows:
When we improve any of these parameters (reduce LR, increase resolution), we have to go through more steps to completely reproduce the concept, and this leads to problems. These problems start to appear as soon as the step count for 1e-5 is exceeded, and the more steps we go through, the stronger they get. But such things do not occur at 1e-5 even if we go through a very large number of steps. Therefore, to solve such problems, I propose compensating by increasing the number of frames rather than the number of steps per frame. The method itself is discussed in more detail in the "Advanced Dreambooth Formula" section.
Making one of these parameters worse (increasing LR, decreasing the resolution) does not give anything positive.
Formulas for training
Dreambooth Standard Formula
So let's first take a look at the standard training formula listed in the article: https://arxiv.org/abs/2208.12242
To begin with, we will consider the following form of instance_prompt:
"instance_prompt": "[identifier]"
This is what the formula looks like:
max_train_steps = FN * 200 (with LR fixed at 1e-5)
Where:
FN - your number of frames for training (limited to no more than 5)
max_train_steps - number of training steps
This formula has limitations: the number of frames is no more than 5 (3-5 recommended, although it works with 1), and the LR is a constant 1e-5.
When generating with our [identifier], we get copies of our images with some variation and editability. If you increase the number of frames, the variability and editability improve. However, if we simply keep increasing the number of frames, we run into the problem that generated images come out blurry. This is because we oversaturate our token with information. The frame limit I found at which the oversaturation is not strong is 5.
If we scale the LR with the same number of frames and recalculate the number of steps accordingly, we get output distortion at high CFG values, and the lower the LR, the lower the CFG value at which distortion appears.
Class based training
instance_prompt looks like this:
"instance_prompt": "a [identifier] [class noun]"
[identifier] is your unique identifier
[class noun] - the class on the basis of which we will train; it indicates a more general idea of the training object.
Example:
"instance_prompt": "sks woman"
[identifier] - sks
[class noun] – woman
However, when training, both with and without a class, 2 problems arise: Overfitting and language drift.
Both of these problems are partially solved by using Prior Preservation Loss.
Prior Preservation Loss
Using this option allows us to partially solve the problems listed above, but only for class-based training, and as my experience showed, neither problem is solved completely, nor in all cases. To use it, we need to generate a certain number of images of our class. For this, class_prompt is used, which looks like:
"class_prompt": "a [class noun]"
Example:
"instance_prompt": "a sks woman"
"class_prompt": "a woman"
Formulas for calculating the number of class images
num_class_images = FN * 200
For Text Encoder:
num_class_images = FN * 100
where:
FN - your number of frames for training
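As a tiny sketch of this calculation (the function name is illustrative only):

```python
def num_class_images(fn: int, train_text_encoder: bool = False) -> int:
    """Number of class images for Prior Preservation Loss, per the formulas above."""
    return fn * (100 if train_text_encoder else 200)

print(num_class_images(5))        # 1000 class images without TE
print(num_class_images(5, True))  # 500 class images with TE
```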
prior_loss_weight
Sets the weight of the class images. Default = 1. Still under research.
Advanced Dreambooth Formula
Attention! The formulas below are nothing more than an experiment, although they show fairly good results.
I created this formula for training with any number of frames, but there are still some limitations. It can still be improved.
The idea is that we rely on the starting LR value of 1e-5 and a constant number of steps per frame (SPF), 200 without TE (TE is covered in a separate section). As the number of frames increases, we scale the LR and leave the SPF the same. This lets us use any number of frames without introducing distortion. For regulation, we introduce another coefficient, F.
Coefficient F determines the strength of training. Its values are based on the maximum number of frames for LR = 1e-5, so it ranges from 0.2 to 5, where 1 is the middle value. The higher the value, the more strongly the concept is repeated, but the more the overfitting problem is expressed; correspondingly, the lower the value, the worse the concept is repeated, but the less the overfitting problem is expressed.
List of F:
low_end 0.20
ultra_soft 0.25
very_soft 0.33
soft 0.50
medium 1.00
hard 2.00
very_hard 3.00
ultra_hard 4.00
high_end 5.00
All formulas below are calculated with batch_size=1 and gradient_accumulation_steps=1, there will be separate sections with separate formulas about them.
This is how the formulas for calculating the number of frames for training with a certain F for a certain LR look like:
very_soft = (1e-5 / LR) * 0.33
soft = (1e-5 / LR) * 0.5
medium = (1e-5 / LR) * 1
hard = (1e-5 / LR) * 2
very_hard = (1e-5 / LR) * 3
where:
LR - learning_rate at which you are going to train
very_soft ... very_hard - the number of frames corresponding to that training strength
Example:
LR=1e-6
medium = 10
hard = 20
The number of steps for training is determined simply:
max_train_steps = FN * 200
where:
FN - your number of frames per training
max_train_steps - number of training steps
We can also calculate the lr we need based on our number of frames:
LR = 1e-5 / (FN / F)
where:
FN - your number of frames per training
F - training strength value
LR - learning_rate at which you are going to train
Example:
FN = 125
F = 1
LR = 8e-8
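To make the formulas above easier to apply, here is a minimal sketch that puts them together (all names are mine and purely illustrative, not from the extension):

```python
# F coefficients from the list above
F_VALUES = {
    "low_end": 0.20, "ultra_soft": 0.25, "very_soft": 0.33, "soft": 0.50,
    "medium": 1.00, "hard": 2.00, "very_hard": 3.00, "ultra_hard": 4.00,
    "high_end": 5.00,
}

def frames_for_strength(lr: float, f: float) -> float:
    """Number of frames for a given LR and training strength: FN = (1e-5 / LR) * F."""
    return (1e-5 / lr) * f

def lr_for_frames(fn: int, f: float) -> float:
    """Learning rate for a given frame count and strength: LR = 1e-5 / (FN / F)."""
    return 1e-5 / (fn / f)

def max_train_steps(fn: int, train_text_encoder: bool = False) -> int:
    """max_train_steps = FN * SPF (200 without TE, 100 with TE), at BS=1, GA=1."""
    return fn * (100 if train_text_encoder else 200)

print(frames_for_strength(1e-6, F_VALUES["hard"]))  # ~20 frames ("hard" at LR=1e-6)
print(lr_for_frames(125, F_VALUES["medium"]))       # ~8e-08, the LR from the example
print(max_train_steps(125))                         # 25000 steps without TE
```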
Now a little about editability. At low F values it is achieved through the model itself, and at high F values through the variety of the training images themselves, so it only makes sense to use a high F with more images if editability and repeatability are both important to you.
You can use the table with auto-calculations, which I posted here:
https://docs.google.com/spreadsheets/d/1rUx7bPS0CKdJhb012uoGKjcqWdrpcXjv/edit?usp=share_link&ouid=118139618916251275759&rtpof=true&sd=true
Text encoder
Allows you to train the text encoder, but requires much more video memory. When we train it, we get many benefits, but the requirements for its use become much stricter.
Advantages:
Training with and without the Text encoder cannot be compared in terms of results; that is, training with the Text encoder will always repeat the concept better than training without it.
Increasing the number of frames is likewise handled by scaling LR and F.
Formulas:
max_train_steps = FN * 100
where:
FN - your number of frames per training
max_train_steps - number of training steps
Resolution and training on the 768 model and at resolutions other than the base resolution
Specifies what resolution is used for the training frames. Allows you to train at a resolution different from the model's base value. Increasing it above the model's base value reduces training speed and also consumes more video memory.
Both for training on a model with a resolution higher than 512, and for training at a higher resolution on a model with a lower base resolution, it is necessary to increase the SPF.
To train on the 768 model:
SPF = 450 without TE
SPF = 225 with TE
And then you can carry out all the calculations as for the 512 model, only with these SPFs.
For training above the base resolution, the SPF can also be calculated separately. To do this, first calculate the magnification factor.
RC = (RES / RES_B) ^ 2
Where:
RES_B - base model resolution
RES - the resolution at which you train
RC - resolution magnification factor
Then we calculate the SPF:
SPF = SPF_B * RC
Where:
SPF_B - standard number of steps per frame for the model
RC - resolution magnification factor
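A short sketch of this compensation, assuming SPF_B = 200 without TE and 100 with TE as above:

```python
def spf_for_resolution(res: int, res_base: int, train_text_encoder: bool = False) -> float:
    """SPF scaled for a non-base training resolution: SPF = SPF_B * (RES / RES_B)^2."""
    spf_base = 100 if train_text_encoder else 200
    rc = (res / res_base) ** 2          # resolution magnification factor
    return spf_base * rc

# Examples from the text: training at 768 on a 512 model
print(spf_for_resolution(768, 512))        # 450.0 (without TE)
print(spf_for_resolution(768, 512, True))  # 225.0 (with TE)
```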
Training above baseline resolution has its pros and cons.
Pros:
Cons:
You can also use the table with auto calculations, which I posted here:
https://docs.google.com/spreadsheets/d/1rUx7bPS0CKdJhb012uoGKjcqWdrpcXjv/edit?usp=share_link&ouid=118139618916251275759&rtpof=true&sd=true
Simply enter your training resolution in the custom field, regardless of the model's base resolution.
(Experimental) Dynamic Resolution
Attention! This method does not guarantee 100% artifact-free generation at higher resolutions. The further the resolution is from the base one, the more likely artifacts are to appear.
Training at a resolution higher than the base one allows you to generate images at that resolution, but there is a problem with concept generation at the base resolution: the images come out cropped. This can be corrected by training the model in the base and higher resolutions at once.
To do this, we need to make training frames in the base resolution and in the resolution up to which we will train. The only problem is that higher-resolution frames need a higher SPF. Since we cannot set how many steps to spend on a particular frame, we use a little trick: simply make copies of the training frames in the amount of RC.
For example, to train 10 images at 512 and 768 resolution, you need 10 frames at 512 and 20 at 768 (10 x 2 copies).
To train at 512, 768 and 1024 you need 10 frames at 512, 20 at 768 (10 x 2 copies) and 40 at 1024 (10 x 4 copies).
It is also necessary to disable resizing in the training script so that the images do not change resolution.
The number of steps is calculated as an average over the resolutions in which we train. For example, for 10 frames at 512 and 768 with TE it will be: (1000 + 2250) / 2 = 1625.
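Here is a rough sketch of how the frame copies and the averaged step count could be computed for this trick (my own helper, assuming SPF with TE = 100 as in the example above, with RC rounded to whole copies):

```python
def dynamic_resolution_plan(fn, resolutions, res_base=512, spf_base=100):
    """Copies per resolution and averaged max_train_steps for dynamic-resolution training.
    spf_base=100 assumes training with the text encoder, as in the example above."""
    copies, steps = {}, []
    for res in resolutions:
        rc = (res / res_base) ** 2         # resolution magnification factor
        copies[res] = fn * round(rc)       # e.g. 10 frames -> 20 copies at 768
        steps.append(fn * spf_base * rc)   # steps this resolution would need on its own
    return copies, round(sum(steps) / len(steps))

# Example from the text: 10 frames trained at 512 and 768 -> (1000 + 2250) / 2 = 1625 steps
print(dynamic_resolution_plan(10, [512, 768]))  # ({512: 10, 768: 20}, 1625)
```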
However, this method only works with batch size = 1. To use values greater than 1, you need to remove all shuffling in the script, and also arrange the frames so that resolutions are grouped and not mixed. This is necessary because if frames with different resolutions end up in the same batch, you will get a tensor-dimension error.
After training with dynamic resolution, it becomes possible to generate images without hires fix not only at these resolutions, but also at intermediate ones, as well as in combinations.
Generation examples can be found here: #547 (reply in thread)
Attention types
At the moment there are 4 types: default, xformers, flash_attention and sub_quad. So far I can speak about default, xformers and flash_attention; I haven't configured sub_quad yet.
Number formats
FP64
Double-precision data type.
Range: 2.23e-308 to 1.80e+308
Epsilon: 0.0000000000000002
Significant digits: 15-17
FP32
Standard training data type. Has the best parameters for training (if you do not count FP64 and FP80).
Range: 1e-38 to 3e38
Epsilon: 0.00000012
Significant digits: 6-9
TF32
A version of FP32 stripped down to 19 bits.
Range: 1e-38 to 3e38
Epsilon: 0.00097656
Significant digits: 3-4
Cannot be used with mixed_precision=FP16 or BF16.
mixed_precision
FP16
Standard 16-bit float.
Range: 5.9e-8 to 6.5e4
Epsilon: 0.00097656
Significant digits: 3-4
BF16
A version of FP32 stripped down even further, to 16 bits.
Range: 1e-38 to 3e38
Epsilon: 0.00781250
Significant digits: 2-3
Comparison of number formats
In descending order of quality: FP64, FP32, TF32, FP16, BF16.
In descending order of speed: BF16, FP16, TF32, FP32, FP64.
The minimum of the range defines the minimum accepted value for the LR.
The significant digits determine the accuracy of training for a specific data type.
For example, for LR=1e-2 the accuracy is 100% for all types; for LR=1e-3, BF16 will already do internal rounding; at 1e-4, only FP32 and FP64 are not rounded. FP32 starts rounding below 1e-6, FP64 below 1e-15.
The better the type, the more correctly it reproduces the small details of the concept at very low LR.
If you try to compare these data types visually, you will most likely not see a difference unless you know what and how to compare. Most likely you will only be able to see the difference between FP16 and BF16; it is quite difficult to notice the difference between FP16 and TF32 or FP32.
xformers problem
UPDATE! This problem is fixed in xformers 0.0.17!
xformers has a problem with training in FP64, FP32 and TF32. It refuses to run training with these data types on RTX 30xx and RTX 40xx graphics cards, but works fine on the A100. This issue is in xformers itself and, at the time of writing, was still unresolved and not expected to be resolved in future updates. Therefore, if you want to train in FP64, FP32 or TF32, use a different attention type.
gradient_checkpointing
Creates gradient pre-accumulation checkpoints. Reduces video memory consumption but reduces speed. Does not affect the quality of training in any way, but may allow the use of other additional parameters.
train_batch_size
Specifies how many frames are used per step. Requires more memory and also reduces speed.
It should take a value by which FN is evenly divisible.
This requirement is due to the fact that when the value is greater than 1, your images are divided into groups, and if the division is not even (one group ends up short), this leads to uneven training.
It cannot be more than FN.
So, now about how it works. The higher the value (BS), the better the concept trains; the best value is train_batch_size = FN. By increasing its value, the training of our concept is achieved in fewer steps. But the dependence is not linear. For example, when training 5 images with BS=1 I take 1000 steps; with BS=5 the same number of steps gives me distortion at high CFG values, meaning the number of steps should be lower. But if I simply reduce the count by 5 times to 200, there is clear undertraining.
Formulas:
First, let's determine the value of incremental steps per frame (SPF_A).
Without TE:
SPF_A = SPF / 1.5
Example: for SPF=200, SPF_A ≈ 134.
With TE:
SPF_A = SPF
Then we define the incremental number of steps:
ADS = SPF_A * (BS - 1)
where:
SPF_A - incremental steps per frame
BS is the value of train_batch_size
ADS - incremental number of steps
After counting the number of training steps:
max_train_steps = (FN * SPF + ADS) / √BS
where:
FN - your number of frames per training
SPF - number of steps per frame
ADS - incremental number of steps
BS is the value of train_batch_size
max_train_steps - number of training steps
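A minimal sketch of this calculation (the rounding to whole steps is my own assumption; the text only gives the formula):

```python
import math

def max_train_steps_bs(fn: int, bs: int, train_text_encoder: bool = False) -> int:
    """max_train_steps for train_batch_size = BS, per the formulas above."""
    spf = 100 if train_text_encoder else 200
    spf_a = spf if train_text_encoder else spf / 1.5   # incremental steps per frame
    ads = spf_a * (bs - 1)                             # incremental number of steps
    return round((fn * spf + ads) / math.sqrt(bs))

# Examples from the text (10 frames, with TE):
print(max_train_steps_bs(10, 1, True))   # 1000
print(max_train_steps_bs(10, 2, True))   # 778
print(max_train_steps_bs(10, 5, True))   # 626
print(max_train_steps_bs(10, 10, True))  # 601
```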
Even though we ended up reducing the number of steps, training takes longer due to a large drop in speed. However, this gives us much more repeatability of the concept without increasing the impact of overfitting. For example, training at BS=4 with the LR calculated for F=1 is equal to training at F=2, but the overfitting effect remains as at F=1. That is, we got one more tuning tool.
You also need to understand that increasing BS does not scale efficiency linearly.
For example, 10 frames training with TE.
With BS=1, max_train_steps=1000, and it will take me 10 minutes to train.
With BS=2, max_train_steps=778, and it will take me 11 minutes to train.
With BS=5, max_train_steps=626, and it will take me 16 minutes to train.
With BS=10, max_train_steps=601, and it will take me 37 minutes to train.
As for me, the best ratio here is BS=5.
Now let's look at 100 frames.
With BS=1, max_train_steps=10000, and it will take me 1 hour 45 minutes to train.
With BS=2, max_train_steps=7142, and it will take me 1 hour 39 minutes to train.
With BS=4, max_train_steps=5150, and it will take me 1 hour 57 minutes to train.
With BS=5, max_train_steps=4651, and it will take me 2 hours and 1 minute to train.
With BS=10, max_train_steps=3447, and it will take me 3 hours and 32 minutes to train.
As you can see, with BS=2 I even gained time, and the time increase between BS=1 and BS=10 became noticeably lower: this time it only doubled, whereas with 10 frames it was 3-4x.
gradient_accumulation_steps
Works as an emulator of train_batch_size.
If, while increasing train_batch_size, you find that you do not have enough video memory, this parameter can help you. It creates a gradient accumulation area and then adds it to the total weights, allowing you to emulate a larger train_batch_size value. It also requires a slightly larger number of steps than you calculate, but there is no need to recalculate them; just specify what you got and the accelerator itself will add the necessary extra steps (at least for me).
For example, train_batch_size=1 and gradient_accumulation_steps=4 gives us an emulation of train_batch_size=4.
With train_batch_size=4 and gradient_accumulation_steps=4 we get an emulation of train_batch_size=16.
But I call it emulation, not a replacement, for a reason. The problem is that accumulation does not reproduce 100% the same gradient as train_batch_size. For example, train_batch_size=1 with gradient_accumulation_steps=4 is not equal to train_batch_size=4, but is very close to it. Therefore, I do not recommend using this setting above 1.
The rules for choosing a value are as follows:
Increasing the value of gradient_accumulation_steps increases the total number of training steps, unlike train_batch_size, but the effect is the same as with train_batch_size. This is because, for values greater than 1, not all steps are training steps.
For example, with gradient_accumulation_steps=10, out of every 10 steps, 9 are accumulation steps and the 10th is a training step. So for 10 frames and 1000 steps, we go through 1900 steps, but only 190 of them are training steps. This also means that in the end we need to go through more steps than with a value of 1.
Formulas:
max_train_steps = FN * SPF * √GA
where:
FN - your number of frames per training
SPF - number of steps per frame
GA is the value of gradient_accumulation_steps
max_train_steps - number of training steps
If you use train_batch_size and gradient_accumulation_steps at the same time, then the formulas will be as follows:
ADS = SPF_A * (BS - 1)
max_train_steps = (FN * SPF + ADS) / √BS * √GA
where:
SPF_A - incremental steps per frame
BS is the value of train_batch_size
ADS - incremental number of steps
FN - your number of frames per training
SPF - number of steps per frame
GA is the value of gradient_accumulation_steps
max_train_steps - number of training steps
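And a sketch of the combined formula (again, the rounding is my own assumption):

```python
import math

def max_train_steps_bs_ga(fn: int, bs: int, ga: int, train_text_encoder: bool = False) -> int:
    """max_train_steps for train_batch_size = BS and gradient_accumulation_steps = GA."""
    spf = 100 if train_text_encoder else 200
    spf_a = spf if train_text_encoder else spf / 1.5   # incremental steps per frame
    ads = spf_a * (bs - 1)                             # incremental number of steps
    return round((fn * spf + ads) / math.sqrt(bs) * math.sqrt(ga))

# e.g. 10 frames, BS=1, GA=4, with TE: 1000 * sqrt(4) = 2000 steps
print(max_train_steps_bs_ga(10, 1, 4, True))  # 2000
```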
You can use the table with auto-calculations, which I posted here:
https://docs.google.com/spreadsheets/d/1rUx7bPS0CKdJhb012uoGKjcqWdrpcXjv/edit?usp=share_link&ouid=118139618916251275759&rtpof=true&sd=true
sample_batch_size
Affects how many class images are generated per batch. Does not affect training at all.
lr_scheduler
A parameter that determines by which function your LR changes during training. There are 6 standard types.
But first, a little about warmup.
warmup (lr_warmup_steps)
Determines over how many steps from the start the LR increases linearly from the minimum to the value you specified. On all the interactive charts linked below, this is parameter a. Every scheduler has it except constant.
A little more about the charts: s is 100% of your steps, l is the lr specified in learning_rate, the red graph is the warmup, the blue one is the function.
Now about lr_scheduler types.
STANDARD
constant
Your speed does not change throughout the entire training period.
constant_with_warmup
https://www.desmos.com/Calculator/jjl4wx9gsz
Same thing, but there is warmup.
linear
https://www.desmos.com/Calculator/tzugnzgyge
Standard linear method.
cosine
https://www.desmos.com/Calculator/ezfyo3rvav
Change through cosine.
cosine_with_restarts
https://www.desmos.com/Calculator/5wfa3rqdyy
Works as cosine by default. However, there are parameters that change the length of the period.
polynomial
https://www.desmos.com/Calculator/lwdt9uqsc5
Changing through a polynomial, in our case, is not much different from linear.
CUSTOM
The custom schedulers can be found in this doc:
https://drive.google.com/file/d/1KiDQZvOdHy2lvad4BFZUW4wDiDdIl4Lr/view?usp=share_link
So, the question arises: what is this and how do you use it? In short, it is something like anti-aliasing in terms of editability of details and in terms of quality: we refine some small details by lowering the lr during training.
I recommend using it only when you have few frames for training and there is no way to increase their number, for example if you start training at 1e-5 (3-5 frames). Otherwise it is mostly useless, since with a large number of frames the required number of steps increases quite a lot while the effect is rather minimal.
If you compare constant with any other lr_scheduler, constant always wins with a lower lr than the others, but takes longer to train.
To use them, we need to know the number of steps for constant and then multiply that number by the multiplier of the specific lr_scheduler (listed below). In this case, the same repeatability as with constant is achieved.
max_train_steps = s * m
where:
s - steps for constant calculated by the formulas above
m - a multiplier of another lr_scheduler
Attention! Using an lr_scheduler with too high a multiplier can produce distortion at high and even low CFG values.
lr_scheduler step multipliers
constant - 1
linear / cosine / cosine_with_restarts - 2
polynomial - ~2
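A tiny sketch applying these multipliers to the constant-scheduler step count (the value for constant_with_warmup is my assumption; the text does not list it):

```python
# Step multipliers relative to constant, from the list above
SCHEDULER_MULTIPLIER = {
    "constant": 1.0,
    "constant_with_warmup": 1.0,   # assumption: same as constant
    "linear": 2.0,
    "cosine": 2.0,
    "cosine_with_restarts": 2.0,
    "polynomial": 2.0,             # "~2" in the text
}

def steps_for_scheduler(constant_steps: int, scheduler: str) -> int:
    """max_train_steps = s * m, where s is the constant-scheduler step count."""
    return round(constant_steps * SCHEDULER_MULTIPLIER[scheduler])

print(steps_for_scheduler(1000, "cosine"))  # 2000
```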
not_cache_latents
Disables caching of your training data in video memory, reducing memory consumption but also reducing training speed. It makes no sense if train_batch_size = FN, since then we get 1 batch per training step, and this option only disables caching of batches not involved in the current step.
set_grads_to_none
Sets set_to_none=True in optimizer.zero_grad. This reduces video memory consumption and also increases training speed. I did not find the places in the code where this option would have an effect, and when making comparisons I also did not notice any impact on training quality.
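For reference, in plain PyTorch this option corresponds to something like the following (a generic sketch, not the extension's actual code):

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the trainable UNet / text encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

loss = model(torch.randn(1, 4)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)  # what set_grads_to_none toggles
```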
use_8bit_adam
During training, 8-bit Adam is used instead of 32-bit Adam. This significantly reduces the amount of video memory used. 8-bit Adam gives a small loss in accuracy because its calculations are made in uint8 while 32-bit Adam uses float32; however, the loss is quite low and in places not noticeable at all.
By default, 32-bit Adam comes from torch and 8-bit Adam from bitsandbytes, but bitsandbytes also has a 32-bit Adam which is faster than torch's and also uses slightly less memory.
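As an illustration, switching the optimizer typically looks something like this (a sketch assuming bitsandbytes is installed; not the extension's exact code):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4, 4)  # stand-in for the trainable parameters

# 32-bit AdamW from torch vs. 8-bit AdamW from bitsandbytes
optimizer_32bit = torch.optim.AdamW(model.parameters(), lr=1e-5)
optimizer_8bit = bnb.optim.AdamW8bit(model.parameters(), lr=1e-5)
```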
scale_lr
Scales the learning rate according to the following formula:
learning_rate = learning_rate * gradient_accumulation_steps * train_batch_size * num_processes
num_processes - number of gpu
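A quick worked example of that scaling (the numbers are arbitrary):

```python
learning_rate = 1e-6
gradient_accumulation_steps = 2
train_batch_size = 4
num_processes = 1  # number of GPUs

scaled_lr = learning_rate * gradient_accumulation_steps * train_batch_size * num_processes
print(scaled_lr)  # 8e-06
```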
pad_tokens
If your training tokens do not fill the tensor, this parameter pads it with the end-of-text token up to the model's maximum length. It does not have a strong effect, but when generating images, the frames submitted for training will be repeated more strongly.
num_train_epochs
If max_train_steps is not specified, the number of steps is calculated from this parameter. The number of steps in an epoch is calculated as follows:
num_update_steps_per_epoch = FN / gradient_accumulation_steps
max_train_steps = num_train_epochs * num_update_steps_per_epoch
If you are using Prior Preservation Loss, then FN = the number of class images.
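A small sketch of that conversion (rounding up per epoch is my own assumption; the formula as written uses plain division):

```python
import math

def steps_from_epochs(num_train_epochs: int, fn: int, gradient_accumulation_steps: int = 1) -> int:
    """max_train_steps derived from num_train_epochs, per the formulas above.
    With Prior Preservation Loss, pass the number of class images as fn."""
    num_update_steps_per_epoch = math.ceil(fn / gradient_accumulation_steps)
    return num_train_epochs * num_update_steps_per_epoch

print(steps_from_epochs(100, 10))  # 1000
```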
save_interval
The interval, in steps, at which intermediate saves are made.
save_min_steps
The minimum step from which the model starts being saved according to the save_interval parameter.
Concepts and filewords
Concepts are frames united by the same set of tokens.
Filewords are frames that each have their own set of tokens.
Filewords are actually the same as concepts, or more precisely multi-concepts, only each concept contains just one frame. When training a concept, all the information from each frame is trained into one "area", and when there are too many frames at a static learning rate, the problem of information oversaturation arises. That is why the advanced formula scales the rate depending on the number of frames. In the case of filewords, there is no need to change the rate; it stays static at 1e-5, and the number of steps is simply calculated. However, you need to understand that if, when using filewords, two or more frames have the same set of tokens, they will be implicitly combined into a concept.
Comparison between concept and Filewords:
Also, you can look more info about use filewords in this threads #443, #844 or in these comments #547 (comment) #443 (comment).
concept_list
By default it looks like:
In the extension from d8ahazard it looks like:
But I will consider only option 1, since we are interested in the parameters that affect training.
When you use multiple concepts, you should calculate the steps based on the total number of frames across these concepts.
Example:
I have two concepts; if I have 5 frames in the first and 5 in the second, then I have to calculate the number of steps based on their sum: (5 + 5) * 200 = 2000 steps.
But the lr for training is calculated based on the number of frames in one concept, that is, in this case, for 5 frames.
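For instance, a tiny sketch of that multi-concept calculation (assuming F = 1, SPF = 200 without TE, and equal-sized concepts as recommended below):

```python
def multi_concept_plan(frames_per_concept, spf=200):
    """Steps use the total frame count; LR uses the per-concept frame count (F = 1)."""
    total_steps = sum(frames_per_concept) * spf
    lr = 1e-5 / frames_per_concept[0]   # frames in one concept, all concepts equal size
    return total_steps, lr

# Example from the text: two concepts with 5 frames each
print(multi_concept_plan([5, 5]))  # 2000 steps, LR ≈ 2e-06
```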
Also, as you can see, I use different directories for class images. This is to prevent class images from repeating during training.
Many people have encountered the problem of concepts mixing when using concept_list, that is, details of one concept being present in another. This happens for several reasons:
For example, if the first concept has 6 frames and the second has 4, then when generating, the second will give us images with elements of the first.
To avoid this problem, use the same number of frames in each concept.
Concepts will still mix to some extent; accept it. What I described above will only help mitigate the consequences.
Various additional information
Next is a section with various additional bits of information that, for various reasons, I did not include in the previous sections.
Additional information on prompt and concept_list
So, let's figure out in which case a token is a class token and in which it is unique.
Example: "instance_prompt": "A sks woman"
Here "sks" is a unique token and "woman" is a class token. But how is this determined?
It's simple: if what the token describes is found in the training frames, it is a class token; if not, it is unique. But there are many more class tokens than we specify in instance_prompt. If you are training a woman with black hair, then "black hair" becomes an obvious class token. That is why I wrote that using Prior Preservation Loss does not completely fix the problems that arise, since there are many more classes found in the images.
Also, if you specify a class token in instance_prompt, for example "instance_prompt": "A sks woman with black hair", and then call generation with "A sks woman", your generated images will contain your concept, but she will not necessarily have black hair. This is because the classes specified in instance_prompt must be included in the call in order to fully repeat the concept.
All class tokens found in the images but not specified in instance_prompt are automatically invoked when your unique token is called.
For example, for the concept of a woman with black hair and "instance_prompt": "sks", calling generation with the token "sks" will automatically invoke the other tokens found in the images, in our case "woman with black hair", but only if we do not specify opposing tokens ourselves, for example "sks with blonde hair".
instance_prompt can also contain several unique tokens.
Example: "instance_prompt": "A sks woman, by artistname"
In this case, in addition to "sks", "by artistname" is also a unique token. Although for us this is clearly an indication of the author's drawing style, for the model it is not: with either token it will give us the same concept of a woman. For us to train the style, we need to use either a concept with images of different characters from the same author, with instance_prompt looking like "instance_prompt": "woman, by artistname" or simply "by artistname", or, if we want to invoke both particular characters and the author's style, several concepts.
Example:
In this case, we have two unique tokens for calling the characters, "sks" and "wss", and a token for calling the author's drawing style, "by artistname". When generating the characters we do not need to specify the token "by artistname", while the call "woman, by artistname" will not necessarily give the characters from the concept, but will try to repeat the author's style.