pix2Seq models #2
Hi there! For pix2seq, I recommend checking issue #1 and their paper, especially the algorithm chart. My reconstructed version is in Pix2Seq.py. The core idea is using the predicted mask as guidance (i.e., only conditioning on the past predicted frame), which may differ slightly from the original version. You may also find some insight in lucidrains' bit-diffusion. I'm afraid I can't provide full details, since I've been very busy with other projects recently. I have not yet tried other multi-class VOS tasks with Pix2Seq and my approach, so I'm not sure about the performance there. However, my group members have reproduced my pipeline on ultrasound segmentation, and I'm confident it works well for binary segmentation. So if your task is binary segmentation, my code should be suitable for both TBGDiff and Pix2Seq. Sorry about that! When I'm free I will put together a more detailed tutorial. Thanks for your attention! Cheers :) Rydeen
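To illustrate the "past predicted frame as guidance" idea in very rough terms, here is a minimal toy sketch. Everything in it is hypothetical: `denoise_step`, the list-of-floats mask representation, the blending schedule, and the guidance weight `GUIDANCE_W` are illustrative stand-ins, not the actual Pix2Seq.py implementation.

```python
GUIDANCE_W = 0.3  # made-up weight for how strongly the past mask steers denoising

def denoise_step(noisy_mask, frame_feat, guidance, t, total_steps):
    """One toy reverse-diffusion step: blend the noisy mask toward a target
    that mixes the current frame's evidence with the previous frame's
    predicted mask (the guidance)."""
    alpha = (total_steps - t) / total_steps  # grows as t counts down
    target = [
        (1 - GUIDANCE_W) * f + GUIDANCE_W * g
        for f, g in zip(frame_feat, guidance)
    ]
    return [(1 - alpha) * m + alpha * tgt for m, tgt in zip(noisy_mask, target)]

def segment_video(frame_feats, steps=4):
    """Autoregressive loop: each frame's denoising is conditioned only on
    the PREVIOUS frame's *predicted* mask, never on ground truth."""
    prev_mask = [0.0] * len(frame_feats[0])  # cold start: empty guidance
    preds = []
    for feat in frame_feats:
        mask = [0.5] * len(feat)  # start from "noise"
        for t in range(steps, 0, -1):
            mask = denoise_step(mask, feat, prev_mask, t, steps)
        mask = [1.0 if m > 0.5 else 0.0 for m in mask]  # binarize
        preds.append(mask)
        prev_mask = mask  # the prediction becomes the next frame's guidance
    return preds

# Two frames, two "pixels": foreground evidence on the first pixel only.
print(segment_video([[0.9, 0.1], [0.8, 0.2]]))  # → [[1.0, 0.0], [1.0, 0.0]]
```

The point is only the control flow: the loop never touches a ground-truth mask after initialization, which is what distinguishes this from the semi-supervised setting discussed below.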
I see.
Sorry for another dumb question.
Hi there! Never mind. The semi-supervised VOS methods you mention require the first-frame ground truth as initialization; that's why they are called semi-supervised. We simply add an auxiliary decoder to predict the first frame as the initialization, instead of using the GT. Using the first-frame GT would be very unfair in the unsupervised VOS setting, because this initialization plays a key role. In my experiments, for example, STCN with the first-frame GT yields about 0.75 IoU even with ResNet-50 (very surprising, right?). The two settings belong to different streams; you can refer to survey papers for more on their differences. Feel free to ask me any questions if you still have some problems. Cheers :) Rydeen
Does it mean using an auxiliary decoder module to predict the first frame for semi-supervised VOS, similar to the aux_head in TBGDiff which predicts pseudo masks? This auxiliary module can load some pretrained weights (e.g., SegFormer) and is trained jointly, specifically for predicting the first frame. Am I getting this right?
Yeah, you're right. But the auxiliary head can be very simple, e.g., just conv layers or an MLP projection like SegFormer's. My aux_head predicts all the frames; the predictions are coarse but sufficient for diffusion guidance. For SSVOS, I just use the aux_head to output the first frame, so no other code needs to be modified. The main difference between SSVOS and UVOS is whether the first-frame GT prior is used. I reproduced these methods just because I'm familiar with them, and I was too lazy to try other UVOS methods :p In my experience, the backbone/encoder already extracts enough information, especially for binary segmentation (a heavy decoder also matters for multi-class tasks, like Mask2Former). Feel free to ask me any questions if you still have some problems.
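As a rough illustration of how lightweight such a head can be, here is a toy per-pixel linear-projection head. The function names, the flat list-of-features input, and the weights are all made up for illustration; the real aux_head's architecture and training are in the repository.

```python
import math

def aux_head(pixel_feats, weights, bias):
    """Toy MLP-style head: project each pixel's feature vector to a
    foreground probability with a single linear layer + sigmoid.
    All names and weights here are illustrative, not the repo's code."""
    out = []
    for feat in pixel_feats:
        logit = sum(w * x for w, x in zip(weights, feat)) + bias
        out.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
    return out

def init_mask_for_ssvos(first_frame_feats, weights, bias, thresh=0.5):
    """Semi-supervised VOS initialization as described above: binarize the
    aux_head's coarse prediction of the FIRST frame, rather than reading
    the ground-truth first-frame mask."""
    probs = aux_head(first_frame_feats, weights, bias)
    return [1.0 if p > thresh else 0.0 for p in probs]

# Two "pixels" with 2-D features; positive logit -> foreground.
print(init_mask_for_ssvos([[2.0, 0.0], [-2.0, 0.0]], [1.0, 1.0], 0.0))
# → [1.0, 0.0]
```

Even a head this simple can work when the encoder features are already discriminative, which matches the comment above that the backbone does most of the work for binary segmentation.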
I completely understand now, thank you for your patient and detailed explanation!
First of all, thanks for your awesome code. It's really helpful to me.
I am new to this field. The official pix2seq code is based on TensorFlow 2 and comes without guidance for training on a custom VOS task, which seems too hard for me.
I wonder if I can train and evaluate the pix2seq model on my custom dataset based on your released code?