[Pre-Final] A large study of the main training parameters *Update* #547
19 comments · 68 replies
-
Is there also an instruction for LoRA?
-
Thanks for writing the summary; it makes the whole process much clearer.
-
So, class images should be a more specific approximation of your character than the class prompt? Could you clarify what you mean by not the same "type"? Basically, you're just trying to abstract/generalize your training material into a greater variety with the class images?
-
I did not know how gradient accumulation and batch size interact. Thank you for sharing your research.
-
Have you tried with and without Extract EMA?
-
Thanks a lot for this. I am trying to use the same parameters I used on 1.5 to train on 2.1, but the results are pretty bad on 2.1: I lose a lot of flexibility even with low learning rates, and I can't seem to get the same results as the default 2.1 model after training, even if I don't invoke my prompt. I am trying to train the face of a person with 40 images at 100 steps per image. I tried increasing and decreasing the learning rate, which didn't seem to matter, and tried adding filewords to describe each image, which also didn't seem to help. The model loses a lot of accuracy and the images look very low quality.
-
This is awesome, thank you! Would you mind posting the git hash of the repo you used for this testing? What diffusers version was it as well, if you don't mind?
-
So, FWIW, the issue with "it's not saving" is that it is saving, but checkpoint generation is not using the saved snapshot version. I have this fixed in the ImageBuilder+ branch; I'm just trying to tidy up a few more things, then it'll be pushed to main. My apologies for the confusion.
-
I have rewritten this post. The use of [filewords] now makes much more sense and helps a lot with editing. Most important: use keywords and concepts in your TXT file. The keywords used in the [filewords] TXT are linked to the sample image, so keep that in mind.
In summary: if you don't have to, do NOT use filewords. The more you have, the more you have to type in the prompt to get the result. This is only good if you want to control a lot of parameters on an image, and only specific ones, because otherwise they become bound to the image! So only use filewords that you actually want to specify in the prompt (e.g. front view, bottom view, top view, etc.). That's it; I hope this makes clear how to use TXT files and [filewords].
-
Here is a Google Sheets document I made for myself: https://docs.google.com/spreadsheets/d/1AGOfcC_IPuiZq6nhjStvKoTDwJKFRxifIProHpMwysw/edit?usp=sharing Make a copy and edit the values.
-
When you say "frame," what exactly do you mean? Is that a dataset image? One of the various images of the subject that one tries to train? I searched through this thread and did not find a stated definition. Forgive me for asking such an extremely basic question; I've never seen the term "frame" used in any description of Dreambooth training, and this is the only thread (that I'm currently aware of) where it is used. I just want to have a clear understanding. I read the earlier version of your research and thought it was interesting. I am eager to absorb your most recent update.
-
Your formulas worked very well for my SD 1.5 trainings. Thank you for your hard work; it is very helpful. I hope you (or someone with your perseverance) will investigate similar formulas for SD 2.1 training at 768 resolution.
-
Finally finished reading this page. Great stuff. First, if I understand the paper correctly, I found a couple of small typos:
Example: you wrote "hard = (1e-5 / LR) * 2", therefore hard = 20, not 15.
These written values don't match the table: FP16, BF16. Also, one thing I noticed about the table with auto-calculations: you used "max_training_steps", which is how many steps of actual training take place. It might be easier for new users to use epochs, since that is what is in the extension. On my copy of your Excel sheet, I quickly added a converter to epochs, based on batch size. A couple of questions:
Your work is amazing. Unfortunately it doesn't quite transfer to what I'm working on at the moment. Your work seems to focus on training Dreambooth concepts, and all the math is for training a concept. I'm currently doing more of a major fine-tune, so the math doesn't quite work for me. My fine-tune currently uses hundreds of images and modifies many concepts. It includes: boy/girl/male/female, adult/child/teenager, student/teacher, preschool/elementary student, class level, facial/hair/clothing descriptions, body part/posture descriptions, environment descriptions, and probably more. I plan to expand to thousands of images. The goal is to feed the model pictures of my school/classes/students/teachers/etc., and have it produce images that look like they came from my school. I'm currently working on a way to establish a baseline for comparing my model with your math, so that I can use your math. I also need to decide how I want to use unique tokens, class tokens, and class frames. I'm also looking for a guide on fine-tuning a large model, but I can't find one.
-
Great information elaborating the parameters for Dreambooth training; thank you for sharing. I also have two more questions. If you can help, it would be highly appreciated.
Thanks.
-
@cerega66 Another question: are there any extra tricks for training on SD 2.1-512 or 2.1-768? I used the same approach I used for SD 1.5 to train a style on SD 2.1, and the results turned out badly compared with SD 1.5. A lot of the characteristics of the style disappeared...
-
I've been trying so hard to wrap my head around the formulas, but my attention span and learning are wonky. I need a place to start: if I have 50 training images, where do I begin with the formulas? I can train 512 at batch 5 and GA 5 max on my 3090 Ti. I just need assistance understanding and applying the formulas and/or the math charts provided. Traditionally, I've used polynomial on a single concept at a time at LR 1e-6. It's been more plug and play and hope for the best. I'd appreciate it if anyone is willing to assist.
-
I really appreciate this gold mine of knowledge; it saves us a ton of time vainly experimenting in this severely underdocumented hobby. We should have more collective knowledge bases, like a wiki. I'd like to share a neat trick: you can turn on gradient checkpointing to save a ton of VRAM and fit a batch size of 64 with 256 images on a 24 GB GPU. It's awesome, and it trains in about the same time as BS 10 with the option turned off. However, the part about increasing BS decreasing learning doesn't seem to apply to my LoRA training in Kohya SS. It actually increased it, and now the output is way overblown. E.g. I have 256 images and 64 BS * 4 GA. I tried decreasing the learning rate by a factor of 1000 and it still isn't enough. Maybe we can find a formula for LoRA that involves altering the LR instead of the max training steps. Edit: I'm pretty sure the LR parameter is broken in Kohya SS. 1e-4 gives the same result as 1e-30 and 1e0. Going to need to tweak the UNet and text encoder LRs separately. (Why is there even that param? Is it supposed to scale the other two in some weird way?)
-
Hi, is there an updated version of this guide? It has some outdated advice, like the recommended gradient accumulation setting, for example. On Google Colab, which kills the connection after 6 hours, you will want your batch size to be as high as it can get, with accumulation around 20. With a batch size of 6 and 20 accumulations, you can generate a single checkpoint in about 6 hours; you want as few checkpoints generated in a single day as possible. The more slowly it learns and the larger the batch size, the more effective the training run is: it helps avoid catastrophic loss and retains model flexibility. Additionally, freezing the text encoder layers from the end helps the most when you freeze all but the last two layers. If you freeze all but the last 7, you can train textures more easily, but it's too easy to overfit. Running a special training session with a more thawed text encoder and then freezing it to continue training is a strategy that has worked well for me.
-
Very nice job, thanks a lot!
-
NEW UPDATE
Table with mathematics:
https://docs.google.com/spreadsheets/d/1rUx7bPS0CKdJhb012uoGKjcqWdrpcXjv/edit?usp=share_link&ouid=118139618916251275759&rtpof=true&sd=true
Custom lr scheduler:
https://drive.google.com/file/d/1KiDQZvOdHy2lvad4BFZUW4wDiDdIl4Lr/view?usp=share_link
I have some questions, and if someone can help me with them, I will be grateful.
Everything listed below was tested by me on various models of versions 1.4, 1.5, 2.1-512 and 2.1-768.
FOREWORD
This research is an attempt to explore all the possible influences of the parameters in Dreambooth training, to understand the principles of training, and to identify formulas for "correct" training: minimizing overfitting, maximizing the concept, and optimizing training time. That is, to find formulas into which we simply plug our number of images and get the number of steps we need. This saves time and disk space, since there is no need to create savepoints and make comparisons.
Notes:
Main parameters
instance_prompt
This parameter determines which tokens will be used to invoke your concept.
class_prompt
Used to specify a class when using Prior_Preservation_Loss.
learning_rate
Affects how detailed your concept will be. That is, the lower the learning_rate, the more details can be captured and the more can be edited. But the more details you capture, the more steps you need to learn all the details from the frame.
Number of training steps (max_train_steps)
Specifies the overall length of the training pass.
Full pass steps and compensation
The number of full-pass steps is the number of steps after which we get a full reproduction of our concept. That is, if we train for this number of steps with 1 image in the concept, then when generating the concept we should get that image back. For 1e-5 this is 200 steps, or 100 if you train the Text Encoder (more about the Text Encoder in a separate section). For other LRs this value scales in inverse proportion to the LR; for 1e-6 it is 2000.
Recalculating for other LRs is not difficult; the formula looks like:
SPF = 200 * (1e-5 / LR)
for text encoder:
SPF = 100 * (1e-5 / LR)
where:
LR - learning_rate at which you are going to train
SPF - number of steps per frame
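To make this concrete, here is a minimal Python sketch of the recalculation (the function name `steps_per_frame` is mine, purely for illustration):

```python
def steps_per_frame(lr: float, train_text_encoder: bool = False) -> float:
    """Steps per frame (SPF) for a given learning rate, per the formulas above."""
    base = 100 if train_text_encoder else 200  # SPF at the reference LR of 1e-5
    return base * (1e-5 / lr)

print(steps_per_frame(1e-6))        # ~2000 steps per frame without TE
print(steps_per_frame(1e-5, True))  # 100.0 with TE
```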
However, when changing the LR there is a problem: when generating with high CFG values, images contain distortions, that is, elements in them begin to break, and the lower the LR, the lower the CFG value at which this begins to appear. This is most likely due to the deviation of the training LR from the base LR used for the initial training of the model, since the same problem is observed when the frame resolution is increased; there it is expressed differently, but the principle is similar.
Improving quality -> more steps needed -> problems
The basic parameters for training the model are as follows:
When we improve any of these parameters (reduce LR, increase resolution), we have to go through more steps to completely reproduce the concept, and this leads to problems. These problems start to appear as soon as the step count for 1e-5 is exceeded, and the more steps we go through, the stronger they get. But such things do not occur at 1e-5 even if we go through a very large number of steps. Therefore, to solve such problems, I propose compensating by increasing the number of frames rather than the number of steps per frame. The method itself is discussed in more detail in the "Advanced Dreambooth Formula" section.
Making one of these parameters worse (increasing LR, decreasing the resolution) does not give anything positive.
Formulas for training
Dreambooth Standard Formula
So let's first take a look at the standard training formula listed in the article: https://arxiv.org/abs/2208.12242
To begin with, we will consider the following form of instance_prompt:
"instance_prompt": "[identifier]"
This is what the formula looks like:
max_train_steps = FN * 200 (with LR fixed at 1e-5)
Where:
FN - your number of frames for training (limited to no more than 5)
max_train_steps - number of training steps
This formula has limitations: the number of frames is no more than 5 (3-5 recommended, although it works with 1), and the LR is a constant 1e-5.
When generating with our [identifier], we get copies of our images with some variation and editability. If you increase the number of frames, the variability and editability improve. However, if we simply keep increasing the number of frames, we run into the problem that generated images come out blurry. This is because we oversaturate our token with information. The frame limit I found at which the oversaturation is not strong is 5.
If we scale the LR with the same number of frames and recalculate the number of steps accordingly, we get output distortion at high CFG values, and the lower the LR, the lower the CFG value at which distortion appears.
Class based training
instance_prompt looks like this:
"instance_prompt": "a [identifier] [class noun]"
[identifier] is your unique identifier
[class noun] - the class on the basis of which we will train; it indicates a more general idea of the training object.
Example:
"instance_prompt": "sks woman"
[identifier] - sks
[class noun] – woman
However, when training, both with and without a class, 2 problems arise: Overfitting and language drift.
Both of these problems are partially solved by using Prior Preservation Loss.
Prior Preservation Loss
Using this option allows us to partially solve the problems listed above, but only for class-based training, and as my experience showed, neither problem is solved completely, nor in all cases. To use it, we need to generate a certain number of images of our class. For this, class_prompt is used, which looks like:
"class_prompt": "a [class noun]"
Example:
"instance_prompt": "a sks woman"
"class_prompt": "a woman"
Formulas for calculating the number of class images
num_class_images = FN * 200
For Text Encoder:
num_class_images = FN * 100
where:
FN - your number of frames for training
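As a tiny sketch of this calculation (the function name is illustrative only):

```python
def num_class_images(fn: int, train_text_encoder: bool = False) -> int:
    """Number of class images for Prior Preservation Loss, per the formulas above."""
    return fn * (100 if train_text_encoder else 200)

print(num_class_images(5))        # 1000 class images without TE
print(num_class_images(5, True))  # 500 class images with TE
```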
prior_loss_weight
Sets the weight of the class images. Default = 1. Still under research.
Advanced Dreambooth Formula
Attention! The formulas below are nothing more than an experiment, although they show fairly good results.
I created this formula for training with any number of frames, but there are still some limitations. It can still be improved.
The idea is that we rely on the starting LR value of 1e-5 and a constant number of steps per frame (SPF), 200 without TE (TE is covered in a separate section). As the number of frames increases, we scale the LR and leave the SPF the same. This lets us use any number of frames without introducing distortion. For regulation, we introduce another coefficient, F.
Coefficient F determines the strength of training. Its values are based on the maximum number of frames for LR = 1e-5, so it ranges from 0.2 to 5, where 1 is the middle value. The higher the value, the more strongly the concept is repeated, but the more the overfitting problem is expressed; correspondingly, the lower the value, the worse the concept is repeated, but the less the overfitting problem is expressed.
List of F:
low_end 0.20
ultra_soft 0.25
very_soft 0.33
soft 0.50
medium 1.00
hard 2.00
very_hard 3.00
ultra_hard 4.00
high_end 5.00
All formulas below are calculated with batch_size=1 and gradient_accumulation_steps=1, there will be separate sections with separate formulas about them.
This is how the formulas for calculating the number of frames for training with a certain F for a certain LR look like:
very_soft = (1e-5 / LR) * 0.33
soft = (1e-5 / LR) * 0.5
medium = (1e-5 / LR) * 1
hard = (1e-5 / LR) * 2
very_hard = (1e-5 / LR) * 3
where:
LR - learning_rate at which you are going to train
very_soft ... very_hard - the number of frames corresponding to that training strength
Example:
LR=1e-6
medium = 10
hard = 20
The number of steps for training is determined simply:
max_train_steps = FN * 200
where:
FN - your number of frames per training
max_train_steps - number of training steps
We can also calculate the lr we need based on our number of frames:
LR = 1e-5 / (FN / F)
where:
FN - your number of frames per training
F - training strength value
LR - learning_rate at which you are going to train
Example:
FN = 125
F = 1
LR = 8e-8
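To make the formulas above easier to apply, here is a minimal sketch that puts them together (all names are mine and purely illustrative, not from the extension):

```python
# F coefficients from the list above
F_VALUES = {
    "low_end": 0.20, "ultra_soft": 0.25, "very_soft": 0.33, "soft": 0.50,
    "medium": 1.00, "hard": 2.00, "very_hard": 3.00, "ultra_hard": 4.00,
    "high_end": 5.00,
}

def frames_for_strength(lr: float, f: float) -> float:
    """Number of frames for a given LR and training strength: FN = (1e-5 / LR) * F."""
    return (1e-5 / lr) * f

def lr_for_frames(fn: int, f: float) -> float:
    """Learning rate for a given frame count and strength: LR = 1e-5 / (FN / F)."""
    return 1e-5 / (fn / f)

def max_train_steps(fn: int, train_text_encoder: bool = False) -> int:
    """max_train_steps = FN * SPF (200 without TE, 100 with TE), at BS=1, GA=1."""
    return fn * (100 if train_text_encoder else 200)

print(frames_for_strength(1e-6, F_VALUES["hard"]))  # ~20 frames ("hard" at LR=1e-6)
print(lr_for_frames(125, F_VALUES["medium"]))       # ~8e-08, the LR from the example
print(max_train_steps(125))                         # 25000 steps without TE
```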
Now a little about editability. At low F values it is achieved through the model itself, and at high F values through the variety of the training images themselves, so it only makes sense to use a high F with more images if editability and repeatability are both important to you.
You can use the table with auto-calculations, which I posted here:
https://docs.google.com/spreadsheets/d/1rUx7bPS0CKdJhb012uoGKjcqWdrpcXjv/edit?usp=share_link&ouid=118139618916251275759&rtpof=true&sd=true
Text encoder
Allows you to train the text encoder, but requires much more video memory. When we train it, we get many benefits, but the requirements for its use become much stricter.
Advantages:
Training with and without the Text encoder cannot be compared in terms of results; that is, training with the Text encoder will always repeat the concept better than training without it.
Increasing the number of frames is likewise handled by scaling LR and F.
Formulas:
max_train_steps = FN * 100
where:
FN - your number of frames per training
max_train_steps - number of training steps
Resolution and training on the 768 model and at resolutions other than the base resolution
Specifies what resolution is used for the training frames. Allows you to train at a resolution different from the model's base value. Increasing it above the model's base value reduces training speed and also consumes more video memory.
Both for training on a model with a resolution higher than 512, and for training at a higher resolution on a model with a lower base resolution, it is necessary to increase the SPF.
To train on the 768 model:
SPF = 450 without TE
SPF = 225 with TE
And then you can carry out all the calculations as for the 512 model, only with these SPFs.
For training above the base resolution, the SPF can also be calculated separately. To do this, first calculate the magnification factor.
RC = (RES / RES_B) ^ 2
Where:
RES_B - base model resolution
RES - the resolution at which you train
RC - resolution magnification factor
Then we calculate the SPF:
SPF = SPF_B * RC
Where:
SPF_B - standard number of steps per frame for the model
RC - resolution magnification factor
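A short sketch of this compensation, assuming SPF_B = 200 without TE and 100 with TE as above:

```python
def spf_for_resolution(res: int, res_base: int, train_text_encoder: bool = False) -> float:
    """SPF scaled for a non-base training resolution: SPF = SPF_B * (RES / RES_B)^2."""
    spf_base = 100 if train_text_encoder else 200
    rc = (res / res_base) ** 2          # resolution magnification factor
    return spf_base * rc

# Examples from the text: training at 768 on a 512 model
print(spf_for_resolution(768, 512))        # 450.0 (without TE)
print(spf_for_resolution(768, 512, True))  # 225.0 (with TE)
```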
Training above baseline resolution has its pros and cons.
Pros:
Cons:
You can also use the table with auto calculations, which I posted here:
https://docs.google.com/spreadsheets/d/1rUx7bPS0CKdJhb012uoGKjcqWdrpcXjv/edit?usp=share_link&ouid=118139618916251275759&rtpof=true&sd=true
Simply enter your training resolution in the custom field, regardless of the model's base resolution.
(Experimental) Dynamic Resolution
Attention! This method does not guarantee 100% artifact-free generation at higher resolutions. The further the resolution is from the base one, the more likely artifacts are to appear.
Training at a resolution higher than the base one allows you to generate images at that resolution, but there is a problem with concept generation at the base resolution: the images come out cropped. This can be corrected by training the model in the base and higher resolutions at once.
To do this, we need to make training frames in the base resolution and in the resolution up to which we will train. The only problem is that higher-resolution frames need a higher SPF. Since we cannot set how many steps to spend on a particular frame, we use a little trick: simply make copies of the training frames in the amount of RC.
For example, to train 10 images at 512 and 768 resolution, you need 10 frames at 512 and 20 at 768 (10 x 2 copies).
To train at 512, 768 and 1024 you need 10 frames at 512, 20 at 768 (10 x 2 copies) and 40 at 1024 (10 x 4 copies).
It is also necessary to disable resizing in the training script so that the images do not change resolution.
The number of steps is calculated as an average over the resolutions in which we train. For example, for 10 frames at 512 and 768 with TE it will be: (1000 + 2250) / 2 = 1625.
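Here is a rough sketch of how the frame copies and the averaged step count could be computed for this trick (my own helper, assuming SPF with TE = 100 as in the example above, with RC rounded to whole copies):

```python
def dynamic_resolution_plan(fn, resolutions, res_base=512, spf_base=100):
    """Copies per resolution and averaged max_train_steps for dynamic-resolution training.
    spf_base=100 assumes training with the text encoder, as in the example above."""
    copies, steps = {}, []
    for res in resolutions:
        rc = (res / res_base) ** 2         # resolution magnification factor
        copies[res] = fn * round(rc)       # e.g. 10 frames -> 20 copies at 768
        steps.append(fn * spf_base * rc)   # steps this resolution would need on its own
    return copies, round(sum(steps) / len(steps))

# Example from the text: 10 frames trained at 512 and 768 -> (1000 + 2250) / 2 = 1625 steps
print(dynamic_resolution_plan(10, [512, 768]))  # ({512: 10, 768: 20}, 1625)
```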
However, this method only works with batch size = 1. To use values greater than 1, you need to remove all shuffling in the script, and also arrange the frames so that resolutions are grouped and not mixed. This is necessary because if frames with different resolutions end up in the same batch, you will get a tensor-dimension error.
After training with dynamic resolution, it becomes possible to generate images without hires fix not only at these resolutions, but also at intermediate ones, as well as in combinations.
Generation examples can be found here: #547 (reply in thread)
Attention types
At the moment there are 4 types: default, xformers, flash_attention and sub_quad. So far I can speak about default, xformers and flash_attention; I haven't configured sub_quad yet.
Number formats
FP64
Double-precision data type.
Range: 2.23e-308 to 1.80e+308
Epsilon: 0.0000000000000002
Significant digits: 15-17
FP32
Standard training data type. Has the best parameters for training (if you do not count FP64 and FP80).
Range: 1e-38 to 3e38
Epsilon: 0.00000012
Significant digits: 6-9
TF32
A version of FP32 stripped down to 19 bits.
Range: 1e-38 to 3e38
Epsilon: 0.00097656
Significant digits: 3-4
Cannot be used with mixed_precision=FP16 or BF16.
mixed_precision
FP16
Standard 16-bit float.
Range: 5.9e-8 to 6.5e4
Epsilon: 0.00097656
Significant digits: 3-4
BF16
A version of FP32 stripped down even further, to 16 bits.
Range: 1e-38 to 3e38
Epsilon: 0.00781250
Significant digits: 2-3
Comparison of number formats
In descending order of quality: FP64, FP32, TF32, FP16, BF16.
In descending order of speed: BF16, FP16, TF32, FP32, FP64.
The minimum of the range defines the minimum accepted value for the LR.
The significant digits determine the accuracy of training for a specific data type.
For example, for LR=1e-2 the accuracy is 100% for all types; for LR=1e-3, BF16 will already do internal rounding; at 1e-4, only FP32 and FP64 are not rounded. FP32 starts rounding below 1e-6, FP64 below 1e-15.
The better the type, the more correctly it reproduces the small details of the concept at very low LR.
If you try to compare these data types visually, you will most likely not see a difference unless you know what and how to compare. Most likely you will only be able to see the difference between FP16 and BF16; it is quite difficult to notice the difference between FP16 and TF32 or FP32.
xformers problem
UPDATE! This problem is fixed in xformers 0.0.17!
xformers has a problem with training in FP64, FP32 and TF32. It refuses to run training with these data types on RTX 30xx and RTX 40xx graphics cards, but works fine on the A100. This issue is in xformers itself and, at the time of writing, was still unresolved and not expected to be resolved in future updates. Therefore, if you want to train in FP64, FP32 or TF32, use a different attention type.
gradient_checkpointing
Creates gradient pre-accumulation checkpoints. Reduces video memory consumption but reduces speed. Does not affect the quality of training in any way, but may allow the use of other additional parameters.
train_batch_size
Specifies how many frames are used per step. Requires more memory and also reduces speed.
It should take a value by which FN is evenly divisible.
This requirement is due to the fact that when the value is greater than 1, your images are divided into groups, and if the division is not even (one group ends up short), this leads to uneven training.
It cannot be more than FN.
So, now about how it works. The higher the value (BS), the better the concept trains; the best value is train_batch_size = FN. By increasing its value, the training of our concept is achieved in fewer steps. But the dependence is not linear. For example, when training 5 images with BS=1 I take 1000 steps; with BS=5 the same number of steps gives me distortion at high CFG values, meaning the number of steps should be lower. But if I simply reduce the count by 5 times to 200, there is clear undertraining.
Formulas:
First, let's determine the value of incremental steps per frame (SPF_A).
Without TE:
SPF_A = SPF / 1.5
Example: for SPF=200, SPF_A ≈ 134.
With TE:
SPF_A = SPF
Then we define the incremental number of steps:
ADS = SPF_A * (BS - 1)
where:
SPF_A - incremental steps per frame
BS is the value of train_batch_size
ADS - incremental number of steps
After counting the number of training steps:
max_train_steps = (FN * SPF + ADS) / √BS
where:
FN - your number of frames per training
SPF - number of steps per frame
ADS - incremental number of steps
BS is the value of train_batch_size
max_train_steps - number of training steps
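A minimal sketch of this calculation (the rounding to whole steps is my own assumption; the text only gives the formula):

```python
import math

def max_train_steps_bs(fn: int, bs: int, train_text_encoder: bool = False) -> int:
    """max_train_steps for train_batch_size = BS, per the formulas above."""
    spf = 100 if train_text_encoder else 200
    spf_a = spf if train_text_encoder else spf / 1.5   # incremental steps per frame
    ads = spf_a * (bs - 1)                             # incremental number of steps
    return round((fn * spf + ads) / math.sqrt(bs))

# Examples from the text (10 frames, with TE):
print(max_train_steps_bs(10, 1, True))   # 1000
print(max_train_steps_bs(10, 2, True))   # 778
print(max_train_steps_bs(10, 5, True))   # 626
print(max_train_steps_bs(10, 10, True))  # 601
```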
Even though we ended up reducing the number of steps, training takes longer due to a large drop in speed. However, this gives us much more repeatability of the concept without increasing the impact of overfitting. For example, training at BS=4 with the LR calculated for F=1 is equal to training at F=2, but the overfitting effect remains as at F=1. That is, we got one more tuning tool.
You also need to understand that increasing BS does not scale efficiency linearly.
For example, 10 frames training with TE.
With BS=1, max_train_steps=1000, and it will take me 10 minutes to train.
With BS=2, max_train_steps=778, and it will take me 11 minutes to train.
With BS=5, max_train_steps=626, and it will take me 16 minutes to train.
With BS=10, max_train_steps=601, and it will take me 37 minutes to train.
As for me, the best ratio here is BS=5.
Now let's look at 100 frames.
With BS=1, max_train_steps=10000, and it will take me 1 hour 45 minutes to train.
With BS=2, max_train_steps=7142, and it will take me 1 hour 39 minutes to train.
With BS=4, max_train_steps=5150, and it will take me 1 hour 57 minutes to train.
With BS=5, max_train_steps=4651, and it will take me 2 hours and 1 minute to train.
With BS=10, max_train_steps=3447, and it will take me 3 hours and 32 minutes to train.
As you can see, with BS=2 I even gained time, and the time increase between BS=1 and BS=10 became noticeably lower: this time it only doubled, whereas with 10 frames it was 3-4x.
gradient_accumulation_steps
Works as an emulator of train_batch_size.
If, while increasing train_batch_size, you find that you do not have enough video memory, this parameter can help you. It creates a gradient accumulation area and then adds it to the total weights, allowing you to emulate a larger train_batch_size value. It also requires a slightly larger number of steps than you calculate, but there is no need to recalculate them; just specify what you got and the accelerator itself will add the necessary extra steps (at least for me).
For example, train_batch_size=1 and gradient_accumulation_steps=4 gives us an emulation of train_batch_size=4.
With train_batch_size=4 and gradient_accumulation_steps=4 we get an emulation of train_batch_size=16.
But I call it emulation, not a replacement, for a reason. The problem is that accumulation does not reproduce 100% the same gradient as train_batch_size. For example, train_batch_size=1 with gradient_accumulation_steps=4 is not equal to train_batch_size=4, but is very close to it. Therefore, I do not recommend using this setting above 1.
The rules for choosing a value are as follows:
Increasing the value of gradient_accumulation_steps increases the total number of training steps, unlike train_batch_size, but the effect is the same as with train_batch_size. This is because, for values greater than 1, not all steps are training steps.
For example, with gradient_accumulation_steps=10, out of every 10 steps, 9 are accumulation steps and the 10th is a training step. So for 10 frames and 1000 steps, we go through 1900 steps, but only 190 of them are training steps. This also means that in the end we need to go through more steps than with a value of 1.
Formulas:
max_train_steps = FN * SPF * √GA
where:
FN - your number of frames per training
SPF - number of steps per frame
GA is the value of gradient_accumulation_steps
max_train_steps - number of training steps
If you use train_batch_size and gradient_accumulation_steps at the same time, then the formulas will be as follows:
ADS = SPF_A * (BS - 1)
max_train_steps = (FN * SPF + ADS) / √BS * √GA
where:
SPF_A - incremental steps per frame
BS is the value of train_batch_size
ADS - incremental number of steps
FN - your number of frames per training
SPF - number of steps per frame
GA is the value of gradient_accumulation_steps
max_train_steps - number of training steps
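And a sketch of the combined formula (again, the rounding is my own assumption):

```python
import math

def max_train_steps_bs_ga(fn: int, bs: int, ga: int, train_text_encoder: bool = False) -> int:
    """max_train_steps for train_batch_size = BS and gradient_accumulation_steps = GA."""
    spf = 100 if train_text_encoder else 200
    spf_a = spf if train_text_encoder else spf / 1.5   # incremental steps per frame
    ads = spf_a * (bs - 1)                             # incremental number of steps
    return round((fn * spf + ads) / math.sqrt(bs) * math.sqrt(ga))

# e.g. 10 frames, BS=1, GA=4, with TE: 1000 * sqrt(4) = 2000 steps
print(max_train_steps_bs_ga(10, 1, 4, True))  # 2000
```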
You can use the table with auto-calculations, which I posted here:
https://docs.google.com/spreadsheets/d/1rUx7bPS0CKdJhb012uoGKjcqWdrpcXjv/edit?usp=share_link&ouid=118139618916251275759&rtpof=true&sd=true
sample_batch_size
Affects how many class images are generated per batch. Does not affect training at all.
lr_scheduler
A parameter that determines by which function your LR changes during training. There are 6 standard types.
But first, a little about warmup.
warmup (lr_warmup_steps)
Determines over how many steps from the start the LR increases linearly from the minimum to the value you specified. On all the interactive charts linked below, this is parameter a. Every scheduler has it except constant.
A little more about the charts: s is 100% of your steps, l is the lr specified in learning_rate, the red graph is the warmup, the blue one is the function.
Now about lr_scheduler types.
STANDARD
constant
Your speed does not change throughout the entire training period.
constant_with_warmup
https://www.desmos.com/Calculator/jjl4wx9gsz
Same thing, but there is warmup.
linear
https://www.desmos.com/Calculator/tzugnzgyge
Standard linear method.
cosine
https://www.desmos.com/Calculator/ezfyo3rvav
Change through cosine.
cosine_with_restarts
https://www.desmos.com/Calculator/5wfa3rqdyy
Works as cosine by default. However, there are parameters that change the length of the period.
polynomial
https://www.desmos.com/Calculator/lwdt9uqsc5
Changing through a polynomial, in our case, is not much different from linear.
CUSTOM
The custom schedulers can be found in this doc:
https://drive.google.com/file/d/1KiDQZvOdHy2lvad4BFZUW4wDiDdIl4Lr/view?usp=share_link
So, the question arises: what is this and how do you use it? In short, it is something like anti-aliasing in terms of editability of details and in terms of quality: we refine some small details by lowering the lr during training.
I recommend using it only when you have few frames for training and there is no way to increase their number, for example if you start training at 1e-5 (3-5 frames). Otherwise it is mostly useless, since with a large number of frames the required number of steps increases quite a lot while the effect is rather minimal.
If you compare constant with any other lr_scheduler, constant always wins with a lower lr than the others, but takes longer to train.
To use them, we need to know the number of steps for constant and then multiply that number by the multiplier of the specific lr_scheduler (listed below). In this case, the same repeatability as with constant is achieved.
max_train_steps = s * m
where:
s - steps for constant calculated by the formulas above
m - a multiplier of another lr_scheduler
Attention! Using an lr_scheduler with too high a multiplier can produce distortion at high and even low CFG values.
lr_scheduler step multipliers
constant - 1
linear / cosine / cosine_with_restarts - 2
polynomial - ~2
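A tiny sketch applying these multipliers to the constant-scheduler step count (the value for constant_with_warmup is my assumption; the text does not list it):

```python
# Step multipliers relative to constant, from the list above
SCHEDULER_MULTIPLIER = {
    "constant": 1.0,
    "constant_with_warmup": 1.0,   # assumption: same as constant
    "linear": 2.0,
    "cosine": 2.0,
    "cosine_with_restarts": 2.0,
    "polynomial": 2.0,             # "~2" in the text
}

def steps_for_scheduler(constant_steps: int, scheduler: str) -> int:
    """max_train_steps = s * m, where s is the constant-scheduler step count."""
    return round(constant_steps * SCHEDULER_MULTIPLIER[scheduler])

print(steps_for_scheduler(1000, "cosine"))  # 2000
```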
not_cache_latents
Disables caching of your training data in video memory, reducing memory consumption but also reducing training speed. It makes no sense if train_batch_size = FN, since then we get 1 batch per training step, and this option only disables caching of batches not involved in the current step.
set_grads_to_none
Sets set_to_none=True in optimizer.zero_grad. This reduces video memory consumption and also increases training speed. I did not find the places in the code where this option would have an effect, and when making comparisons I also did not notice any impact on training quality.
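For reference, in plain PyTorch this option corresponds to something like the following (a generic sketch, not the extension's actual code):

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the trainable UNet / text encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

loss = model(torch.randn(1, 4)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)  # what set_grads_to_none toggles
```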
use_8bit_adam
During training, 8-bit Adam is used instead of 32-bit Adam. This significantly reduces the amount of video memory used. 8-bit Adam gives a small loss in accuracy because its calculations are made in uint8 while 32-bit Adam uses float32; however, the loss is quite low and in places not noticeable at all.
By default, 32-bit Adam comes from torch and 8-bit Adam from bitsandbytes, but bitsandbytes also has a 32-bit Adam which is faster than torch's and also uses slightly less memory.
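As an illustration, switching the optimizer typically looks something like this (a sketch assuming bitsandbytes is installed; not the extension's exact code):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4, 4)  # stand-in for the trainable parameters

# 32-bit AdamW from torch vs. 8-bit AdamW from bitsandbytes
optimizer_32bit = torch.optim.AdamW(model.parameters(), lr=1e-5)
optimizer_8bit = bnb.optim.AdamW8bit(model.parameters(), lr=1e-5)
```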
scale_lr
Scales the learning rate according to the following formula:
learning_rate = learning_rate * gradient_accumulation_steps * train_batch_size * num_processes
num_processes - number of gpu
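A quick worked example of that scaling (the numbers are arbitrary):

```python
learning_rate = 1e-6
gradient_accumulation_steps = 2
train_batch_size = 4
num_processes = 1  # number of GPUs

scaled_lr = learning_rate * gradient_accumulation_steps * train_batch_size * num_processes
print(scaled_lr)  # 8e-06
```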
pad_tokens
If your training tokens do not fill the tensor, this parameter pads it with the end-of-text token up to the model's maximum length. It does not have a strong effect, but when generating images, the frames submitted for training will be repeated more strongly.
num_train_epochs
If max_train_steps is not specified, the number of steps is calculated from this parameter. The number of steps in an epoch is calculated as follows:
num_update_steps_per_epoch = FN / gradient_accumulation_steps
max_train_steps = num_train_epochs * num_update_steps_per_epoch
If you are using Prior Preservation Loss, then FN = the number of class images.
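A small sketch of that conversion (rounding up per epoch is my own assumption; the formula as written uses plain division):

```python
import math

def steps_from_epochs(num_train_epochs: int, fn: int, gradient_accumulation_steps: int = 1) -> int:
    """max_train_steps derived from num_train_epochs, per the formulas above.
    With Prior Preservation Loss, pass the number of class images as fn."""
    num_update_steps_per_epoch = math.ceil(fn / gradient_accumulation_steps)
    return num_train_epochs * num_update_steps_per_epoch

print(steps_from_epochs(100, 10))  # 1000
```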
save_interval
The interval, in steps, at which intermediate saves are made.
save_min_steps
The minimum step from which the model starts being saved according to the save_interval parameter.
Concepts and filewords
Concepts are frames united by the same set of tokens.
Filewords are frames that each have their own set of tokens.
Filewords are actually the same as concepts, or more precisely multi-concepts, only each concept contains just one frame. When training a concept, all the information from each frame is trained into one "area", and when there are too many frames at a static learning rate, the problem of information oversaturation arises. That is why the advanced formula scales the rate depending on the number of frames. In the case of filewords, there is no need to change the rate; it stays static at 1e-5, and the number of steps is simply calculated. However, you need to understand that if, when using filewords, two or more frames have the same set of tokens, they will be implicitly combined into a concept.
Comparison between concept and Filewords:
Also, you can look more info about use filewords in this threads #443, #844 or in these comments #547 (comment) #443 (comment).
concept_list
By default it looks like:
In the extension from d8ahazard it looks like:
But I will consider only option 1, since we are interested in the parameters that affect training.
When you use multiple concepts, you should calculate the steps based on the total number of frames across these concepts.
Example:
I have two concepts; if I have 5 frames in the first and 5 in the second, then I have to calculate the number of steps based on their sum: (5 + 5) * 200 = 2000 steps.
But the lr for training is calculated based on the number of frames in one concept, that is, in this case, for 5 frames.
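For instance, a tiny sketch of that multi-concept calculation (assuming F = 1, SPF = 200 without TE, and equal-sized concepts as recommended below):

```python
def multi_concept_plan(frames_per_concept, spf=200):
    """Steps use the total frame count; LR uses the per-concept frame count (F = 1)."""
    total_steps = sum(frames_per_concept) * spf
    lr = 1e-5 / frames_per_concept[0]   # frames in one concept, all concepts equal size
    return total_steps, lr

# Example from the text: two concepts with 5 frames each
print(multi_concept_plan([5, 5]))  # 2000 steps, LR ≈ 2e-06
```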
Also, as you can see, I use different directories for class images. This is to prevent class images from repeating during training.
Many people have encountered the problem of concepts mixing when using concept_list, that is, details of one concept being present in another. This happens for several reasons:
For example, if the first concept has 6 frames and the second has 4, then when generating, the second will give us images with elements of the first.
To avoid this problem, use the same number of frames in each concept.
Concepts will still mix to some extent; accept it. What I described above will only help mitigate the consequences.
Various additional information
Next is a section with various additional bits of information that, for various reasons, I did not include in the previous sections.
Additional information on prompt and concept_list
So, let's figure out in which case a token is a class token and in which it is unique.
Example: "instance_prompt": "A sks woman"
Here "sks" is a unique token and "woman" is a class token. But how is this determined?
It's simple: if what the token describes is found in the training frames, it is a class token; if not, it is unique. But there are many more class tokens than we specify in instance_prompt. If you are training a woman with black hair, then "black hair" becomes an obvious class token. That is why I wrote that using Prior Preservation Loss does not completely fix the problems that arise, since there are many more classes found in the images.
Also, if you specify a class token in instance_prompt, for example "instance_prompt": "A sks woman with black hair", and then call generation with "A sks woman", your generated images will contain your concept, but she will not necessarily have black hair. This is because the classes specified in instance_prompt must be included in the call in order to fully repeat the concept.
All class tokens found in the images but not specified in instance_prompt are automatically invoked when your unique token is called.
For example, for the concept of a woman with black hair and "instance_prompt": "sks", calling generation with the token "sks" will automatically invoke the other tokens found in the images, in our case "woman with black hair", but only if we do not specify opposing tokens ourselves, for example "sks with blonde hair".
instance_prompt can also contain several unique tokens.
Example: "instance_prompt": "A sks woman, by artistname"
In this case, in addition to "sks", "by artistname" is also a unique token. Although for us this is clearly an indication of the author's drawing style, for the model it is not: with either token it will give us the same concept of a woman. For us to train the style, we need to use either a concept with images of different characters from the same author, with instance_prompt looking like "instance_prompt": "woman, by artistname" or simply "by artistname", or, if we want to invoke both particular characters and the author's style, several concepts.
Example:
In this case, we have two unique tokens for calling the characters, "sks" and "wss", and a token for calling the author's drawing style, "by artistname". When generating the characters we do not need to specify the token "by artistname", while the call "woman, by artistname" will not necessarily give the characters from the concept, but will try to repeat the author's style.