Reducing YOLOv5 Overfitting on COCO from scratch #746
After re-reading Karpathy's recipe, he points out that dropout does not play nicely with batchnorm, which makes perfect sense but had not occurred to me to consider. My previous dropout experiments were not on the final output but 1 or 2 layers prior, with the usual batchnorm and activations following, so later output layers inherited the dropouts of earlier layers. I think I will try an experiment with direct dropout on each of the 3 outputs, which do not pass through batchnorm layers and do not accumulate. I'll try this with nn.Dropout2d(0.1) on YOLOv5l initially. EDIT: This would be on the inputs to Detect(), prior to the output nn.Conv2d().
Hi @glenn-jocher. You are doing a great job! I'll try to help. Firstly, I think you should pin this question to the top of the issues so it has more visibility.
I'm not an expert in object detection, but dropout doesn't play nicely with ConvNets when it is applied in the middle of the network, at least for classification problems. Dropout should only be used on the last linear layer. Use DropPath or DropBlock instead of dropout to add regularization in the middle of the network. Rwightman's repository has both of them implemented; he uses them to train all the EfficientNet variants. On the data augmentation side, you could also change image lighting and contrast, as is done by fastai and Rwightman's image models repository; Albumentations has a lot of these transforms and more. You could try CutMix instead of MixUp; I thought MixUp wasn't the best option for object detectors, at least for training the backbone. One more option is trying to synchronize overfitting between losses: if you reduce the contribution of the objectness loss to the total loss, you may delay its overfitting (a small sketch follows this comment). Finally, not related to regularization, you could try SWA (stochastic weight averaging), introduced by PyTorch in 1.6; they just released a great post about it. I don't know how it applies to object detection, but it could be worth trying. EDIT: you may want to take a look at the Kaggle Global Wheat competition. YOLOv5 was popular there before it was banned from winning the competition due to the GPL license.
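For reference, a minimal sketch of the per-component loss-gain idea above: lowering the objectness gain reduces its contribution to the total loss, which may delay its overfitting. The gain values and loss values here are purely illustrative.

```python
import torch

def total_loss(lbox, lobj, lcls, hyp):
    # Weighted sum of the three loss components; lowering hyp['obj'] reduces
    # the objectness contribution to the total loss
    return hyp['box'] * lbox + hyp['obj'] * lobj + hyp['cls'] * lcls

hyp = {'box': 0.05, 'obj': 1.0, 'cls': 0.5}   # illustrative gains
loss = total_loss(torch.tensor(0.04), torch.tensor(0.08), torch.tensor(0.02), hyp)
print(loss)
```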
@hal-314 thank you for the comments! I've implemented dropout like this, on the Detect() layer inputs:

```python
self.drop = nn.Dropout2d(0.1)  # in Detect.__init__()

def forward(self, x):
    # x = x.copy()  # for profiling
    z = []  # inference output
    self.training |= self.export
    for i in range(self.nl):
        x[i] = self.m[i](self.drop(x[i]))  # conv
```

I did try cutmix as well, but saw a small mAP drop, though it's infinitely tuneable, so it's possible I adopted poor settings for it. Finetuning is very different from training from scratch though, and does seem to benefit from nearly all of these changes (mixup, scale jitter, dropout, etc.). I've also observed the same in regards to loss components: reductions in a loss component's gain can delay overfitting in that component.

SWA (now officially supported in torch 1.6) looks very interesting. We already use EMA (officially supported in TF), which performs a nearly identical function, though in a different manner. I think it's definitely worth trying to swap it for SWA to see the effect (a minimal sketch follows this comment).

Yes, the Kaggle competition really brought YOLOv5 into the limelight. A lot of people were surprised that it outperformed EfficientDet. A primary factor is probably image size. Here we train and test at 640, which is far, far below what the larger EfficientDet models train at, 1200-1500 pixels. So I think that in trying to finetune a D6, D7, etc. model you will only get optimal results at these inflated image sizes, which a lot of the competitors were probably reticent to try due to speed/memory constraints. In the end our main goal is not to make the most accurate architecture though, our main goal is to strike the best compromise that would make this the most usable (and user friendly :) architecture for most tasks. There is a great ablation study on YOLOv5 wheat detection here: https://www.kaggle.com/c/global-wheat-detection/discussion/172436
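For reference, a minimal sketch of the torch 1.6 SWA utilities mentioned above, using a stand-in model and loader in place of the detector; the learning rates, momentum, and start epoch are illustrative, not a recommended recipe.

```python
import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

model = nn.Sequential(nn.Linear(10, 10))               # stand-in for the detector
loader = [(torch.randn(8, 10), torch.randn(8, 10))]    # stand-in for the train loader
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)

swa_model = AveragedModel(model)          # running average of the weights (the SWA analogue of EMA)
swa_scheduler = SWALR(optimizer, swa_lr=0.005)
swa_start = 75                            # begin averaging late in training (illustrative)

for epoch in range(100):
    for x, y in loader:
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)
        swa_scheduler.step()

update_bn(loader, swa_model)              # final BN statistics pass for the averaged model
```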
@glenn-jocher I think that using dropout like this in Detect is fine. However, you may be using too little for the bigger networks. For reference, MobileNetV2 uses a dropout of 0.2 in the head (see the sketch after this comment), and the EfficientNet and EfficientNet-Lite families use 0.2 in the smaller nets plus DropPath in all the residual blocks. Rwightman's repository includes the training commands for the EfficientNet family that achieve SOTA results (DropPath goes by a different name there). Finally, other new nets like TResNet don't use any type of dropout, but they employ strong data augmentation, and ImageNet has more images than COCO.
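For reference, a minimal sketch of the head-only dropout pattern referenced above: dropout sits after global pooling, just before the final linear layer, so it never interacts with BatchNorm statistics. The feature width and class count are illustrative.

```python
from torch import nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Dropout(p=0.2),        # the 0.2 head dropout referenced above
    nn.Linear(1280, 1000),    # feature width / class count are illustrative
)
```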
@hal-314 that's really interesting, I did not know about DropPath. Does this just zero some of the residual channels, i.e. is it the same as nn.Dropout2d() on the bottleneck residuals? I also realized yesterday (after considering the dropout impact on BN) that our default training regime is very different from the default validation regime.
I tried to modify the mosaic shape to also apply to random rectangular batches with 640 on their long side. The code to do this is at lines 231 to 234 in 5e0b90d; it acts once per epoch.
In general though, it may make sense to do a BN pass on the validation set, or on the training set in --rect mode, to construct BN statistics after training is complete. I believe SWA includes this final step as well. This may help BN better match the validation space rather than the training space.
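For reference, this final BN statistics pass can be sketched with the same torch 1.6 SWA utility used above; here a toy model and loader stand in for the trained detector and a --rect-style dataloader of letterboxed batches.

```python
import torch
from torch import nn
from torch.optim.swa_utils import update_bn

# Stand-ins: `model` would be the trained detector and `rect_loader` a dataloader
# of letterboxed (--rect) training batches yielding (imgs, targets) tuples.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
rect_loader = [(torch.randn(4, 3, 384, 640), None) for _ in range(8)]

# Re-estimate the BatchNorm running statistics on rectangular batches so the
# BN buffers better match the validation-time input space.
with torch.no_grad():
    update_bn(rect_loader, model, device=next(model.parameters()).device)
```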
@glenn-jocher DropPath removes the bottleneck (it becomes an identity) with probability p for each sample in the batch, so you are effectively reducing the network depth for that sample. For example, with a batch size of N, a bottleneck layer will be an identity for about p*N samples in the batch (a minimal sketch follows this comment). About the mosaic transform, I don't know much about the technique; I saw some images but didn't pay close attention. BN mismatch is a known problem, as discussed in "Fixing the train-test resolution discrepancy"; there the issue was the different image sizes, mainly after the global pooling. So, I see several options to solve the domain shift.
I think that only updating the BN statistics with the training images isn't as optimal as finetuning, but if mAP increases with that simple trick, it's a good sign. I wouldn't use validation images, as you should never use them for anything more than validation.
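For reference, a minimal sketch of the per-sample DropPath (stochastic depth) behaviour described above; this is a hypothetical module, not code from either repository.

```python
import torch
from torch import nn

class DropPath(nn.Module):
    """Stochastic-depth sketch: with probability p the wrapped residual branch
    is skipped (becomes an identity) independently for each sample."""
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training or self.p == 0.:
            return x
        keep = 1 - self.p
        # one Bernoulli draw per sample, broadcast over C, H, W
        mask = torch.empty(x.shape[0], 1, 1, 1, dtype=x.dtype, device=x.device).bernoulli_(keep)
        return x * mask / keep  # rescale so the expected activation is unchanged

# Inside a residual bottleneck the branch output would be wrapped as:
#   out = identity + drop_path(branch(x))
```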
@hal-314 I just finished a run of 7 YOLOv5l trainings: 1 was a baseline, and the other 6 tried various overfitting suppressions.
Unfortunately none of my experiments produced higher best.pt mAPs, but two of them provided extremely interesting results. First, changing the scale hyp from 0.5 to 0.8 resulted in much less overfitting across all losses (orange curve). This seems like a huge win, except that mAP did not benefit as a result. I don't know the cause, as typically lower val losses correlate with higher mAPs, so this is very confusing. The second conclusion is that reducing momentum from 0.937 to 0.90 (green curve) helps in early training but causes earlier overfitting and lower final mAP. This is a negative result, but we can visualize the implied momentum gradient in our heads and see that increasing the momentum hyp may produce the opposite effect. I've started two new runs at 0.95 and 0.97 momentum to test this theory.
@glenn-jocher I agree with you on momentum, but you may need to train longer with the scale hyp set to 0.8 to be able to overfit. I'm also confused by the scale results. It may be due to lower precision, or to failing on some object sizes (small or large?) while improving in general. Reading utils/datasets.py, I don't understand how you are doing zoom out (scale < 1) and assigning pixels and bounding boxes outside the original image pixels. You could easily do this by creating a mosaic image bigger than the target size and then resizing it to the target size. Seeing the improvement with scale, I would suggest trying more data augmentation; try some of the Albumentations transforms as I said before. Finally, it's interesting that the dropout at the end doesn't affect the overfitting and only penalizes the net. I don't understand this result. If you have time, it would be interesting to see if using DropPath in the backbone makes any difference.
Doing so much zoom out could hurt, as you are introducing a lot of constant padding, and I think that could hurt network performance. Usually it's better to use padding="reflection" in torchvision transforms instead of padding="border". You could easily fake reflection padding when using mosaic: make a mosaic bigger than the target size and then resize it to the wanted size, so you avoid adding so much padding (see the sketch after this comment). Note on dropout: it's interesting that it only affects objectness and that it reduces the classification and GIoU error.
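A minimal sketch of the oversized-mosaic idea suggested above: build the 4-image mosaic on a canvas larger than the target size, then resize the whole canvas down, so zoom-out comes from the final resize rather than from large constant-padding borders. This is a hypothetical helper, not the YOLOv5 loader; the label format, tile layout, and ratios are illustrative.

```python
import cv2
import numpy as np

def oversized_mosaic(images, labels, s=640, extra=1.5):
    """images: list of 4 HxWx3 uint8 arrays
    labels: list of 4 (n, 5) float arrays [class, x1, y1, x2, y2] in pixels"""
    tile = int(s * extra)                                   # oversized tile side
    canvas = np.full((2 * tile, 2 * tile, 3), 114, np.uint8)
    out = []
    for i, (im, lb) in enumerate(zip(images, labels)):
        h, w = im.shape[:2]
        xo, yo = (i % 2) * tile, (i // 2) * tile            # tile origin on the canvas
        canvas[yo:yo + tile, xo:xo + tile] = cv2.resize(im, (tile, tile))
        lb = lb.copy()
        lb[:, [1, 3]] = lb[:, [1, 3]] * (tile / w) + xo     # scale + shift x coordinates
        lb[:, [2, 4]] = lb[:, [2, 4]] * (tile / h) + yo     # scale + shift y coordinates
        out.append(lb)
    boxes = np.concatenate(out)
    r = s / (2 * tile)                                      # final downscale ratio
    boxes[:, 1:] *= r
    return cv2.resize(canvas, (s, s)), boxes

# Toy usage with dummy images and one box per image
ims = [np.full((480, 640, 3), i * 60, np.uint8) for i in range(4)]
lbs = [np.array([[0, 50., 40., 200., 160.]]) for _ in range(4)]
mosaic, boxes = oversized_mosaic(ims, lbs)
```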
I just read the PP-YOLO paper. It is based on YOLOv3 but with a ResNet backbone plus multiple simple tricks not related to data augmentation, for example DropBlock to avoid overfitting. It's faster than YOLOv4 (and YOLOv5) at the same accuracy. So, you may want to try DropBlock or DropPath to avoid overfitting (sketch below). I think YOLOv5 already applies some of these tricks but not all of them; it could greatly benefit from the others :)
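For reference, a minimal DropBlock sketch along the lines of what the PP-YOLO paper describes; this is a hypothetical module, not the PP-YOLO or YOLOv5 implementation, and the drop probability and block size are illustrative.

```python
import torch
import torch.nn.functional as F
from torch import nn

class DropBlock2d(nn.Module):
    """DropBlock sketch: zeroes contiguous block_size x block_size regions
    of the feature map during training, then rescales the survivors."""
    def __init__(self, drop_prob=0.1, block_size=7):
        super().__init__()
        self.drop_prob, self.block_size = drop_prob, block_size

    def forward(self, x):
        if not self.training or self.drop_prob == 0.:
            return x
        n, c, h, w = x.shape
        # rate of block centres chosen so the expected dropped area matches drop_prob
        gamma = (self.drop_prob * h * w / self.block_size ** 2
                 / ((h - self.block_size + 1) * (w - self.block_size + 1)))
        mask = (torch.rand(n, c, h, w, device=x.device) < gamma).float()
        # grow each sampled centre into a block_size x block_size block
        mask = F.max_pool2d(mask, kernel_size=self.block_size, stride=1, padding=self.block_size // 2)
        keep = 1 - mask
        return x * keep * keep.numel() / keep.sum().clamp(min=1)  # re-normalize activations

# Toy usage on a random feature map
y = DropBlock2d(drop_prob=0.1, block_size=7)(torch.randn(2, 64, 40, 40))
```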
Yes, higher scale jitter seems to help, though it helps when finetuning more than when training from scratch.
@hal-314 good news 😃! Your original issue may now be fixed ✅ in PR #3882. This PR implements a YOLOv5 🚀 + Albumentations integration. The integration will automatically apply Albumentations transforms during YOLOv5 training if the albumentations package (>= 1.0.0) is installed in your environment.

**Get Started**

To use Albumentations, simply install the package; the class below is then used automatically:

```python
class Albumentations:
    # YOLOv5 Albumentations class (optional, used if package is installed)
    def __init__(self):
        self.transform = None
        try:
            import albumentations as A
            check_version(A.__version__, '1.0.0')  # version requirement

            self.transform = A.Compose([
                A.Blur(p=0.1),
                A.MedianBlur(p=0.1),
                A.ToGray(p=0.01)],
                bbox_params=A.BboxParams(format='yolo', label_fields=['class_labels']))

            logging.info(colorstr('albumentations: ') + ', '.join(f'{x}' for x in self.transform.transforms))
        except ImportError:  # package not installed, skip
            pass
        except Exception as e:
            logging.info(colorstr('albumentations: ') + f'{e}')

    def __call__(self, im, labels, p=1.0):
        if self.transform and random.random() < p:
            new = self.transform(image=im, bboxes=labels[:, 1:], class_labels=labels[:, 0])  # transformed
            im, labels = new['image'], np.array([[c, *b] for c, b in zip(new['class_labels'], new['bboxes'])])
        return im, labels
```

**Update**

To receive this YOLOv5 update, pull the latest code or re-clone the repository.
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
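For reference, a standalone usage sketch of what the pipeline above does, written directly against the albumentations package (assumed installed) rather than the YOLOv5 class; the dummy image and single YOLO-format box are illustrative.

```python
import albumentations as A
import numpy as np

# Same Compose pipeline as the class above, applied to one image + one box
transform = A.Compose([
    A.Blur(p=0.1),
    A.MedianBlur(p=0.1),
    A.ToGray(p=0.01)],
    bbox_params=A.BboxParams(format='yolo', label_fields=['class_labels']))

im = np.zeros((640, 640, 3), dtype=np.uint8)                     # dummy HWC image
labels = np.array([[0, 0.5, 0.5, 0.2, 0.3]], dtype=np.float32)   # [class, x, y, w, h], normalized

new = transform(image=im, bboxes=labels[:, 1:], class_labels=labels[:, 0])
im, labels = new['image'], np.array([[c, *b] for c, b in zip(new['class_labels'], new['bboxes'])])
```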
@glenn-jocher is there any dropblock or droppath regularization utilized in the current yolov5 implementation?
@fcakyon no
Have you tried using DropBlock before? Does it perform well? Hope to see your reply!
@glenn-jocher
@saifmassoudsaif to plot all models in the same figure, you can use matplotlib. For example:

```python
import matplotlib.pyplot as plt

# Define your models and their corresponding data
# For example:
models = ['YOLOv5s', 'YOLOv5m', 'YOLOv5l', 'YOLOv5x']
losses = [0.1, 0.3, 0.2, 0.15]

# Create a figure and axis
fig, ax = plt.subplots()

# Plot the models and losses
ax.plot(models, losses)

# Set labels and title
ax.set_xlabel('Models')
ax.set_ylabel('Loss')
ax.set_title('Losses for YOLOv5 models')

# Show the plot
plt.show()
```

This code will create a figure with the YOLOv5 models on the x-axis and the corresponding losses on the y-axis. You can customize the plot as per your requirements. Hope this helps!
🚀 Feature
Andrej Karpathy lists overfitting as one of the main steps in a training pipeline, and a prerequisite to regularization and obtaining the best final result:
http://karpathy.github.io/2019/04/25/recipe/
On that front, the larger models are overfitting well, while the smallest just about completes training with no overfitting. Here you can observe overfitting in the validation losses (objectness in particular) across the 4 models. The progression of greater overfitting for larger models is observed, as expected.
![results](https://user-images.githubusercontent.com/26833433/90301178-1e496600-de53-11ea-9bfb-ee0a55ea0ee5.png)
Interestingly, I've noticed in the switch from nn.LeakyReLU(0.1) to nn.Hardswish() that the overfitting has increased. For example, here is YOLOv5l v2.0 vs v3.0. Very interestingly, val Objectness actually performs worse with the Hardswish() activations; the mAP gain must originate from better box regressions and better classifications. I do not know why this is, but I see a similar pattern in YOLOv5x, with worse val Objectness loss there also.
![results](https://user-images.githubusercontent.com/26833433/90301429-abd98580-de54-11ea-8080-381acd5014f9.png)
Since we have a multi-component loss function, made up of 3 individual losses (box, obj, cls), we would ideally like to reduce overfitting on a per-loss-component basis (starting with Objectness), but this is not common practice as far as I'm aware. L1 and L2 regularization techniques that might help in this situation can be targeted to individual parameters by assigning parameter groupings to the optimizer, but I don't think there's a precedent for targeting loss components with different regularization techniques (anyone correct me if I'm wrong).
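For reference, a minimal sketch of targeting weight decay to parameter groups via the optimizer; the grouping rule (1-D parameters get no decay) and the values are illustrative, not the exact YOLOv5 setup.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.Conv2d(16, 32, 3))

# Separate parameters into groups so L2 weight decay can be applied selectively
decay, no_decay = [], []
for p in model.parameters():
    (no_decay if p.ndim == 1 else decay).append(p)   # 1-D params: BN weights/biases, conv biases

optimizer = torch.optim.SGD(
    [{'params': decay, 'weight_decay': 5e-4},    # convolution weights
     {'params': no_decay, 'weight_decay': 0.0}], # BN parameters and biases
    lr=0.01, momentum=0.937)
```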
It may be possible to create separate optimizers for each loss component and assign them different weight_decays, though I don't know if this would involve a hit to training speed or memory consumption.
In any case, these are my thoughts on the matter. Other things I've tried are increased augmentation, including rotation, mixup, scale and perspective, nn.Dropout2d(0.1), and increased weight_decay using the existing structure, but in all of these cases the training results (on COCO from scratch) become worse. If anyone has any ideas for Objectness regularization, or other techniques to reduce overfitting, please let us know!