
pix2Seq models #2

Open
YingJGuo opened this issue Nov 28, 2024 · 7 comments
Labels
fixed addressed issue reproduction

Comments

@YingJGuo

First of all, thanks for your awesome code. It's really helpful to me.

I am new to this field. The official code of pix2seq is based on TensorFlow 2 and comes without guidance for training on a custom VOS task, so it seems too hard for me.

I wonder if I can train and evaluate the pix2seq model on my custom dataset based on your released code?

@haipengzhou856
Owner

Hi there!

For pix2seq, I recommend checking issue #1 and their paper, especially the algorithm chart. My reconstructed version is at Pix2Seq.py. The core idea is using the predicted mask as guidance (i.e., only conditioning on the previously predicted frame), which may be a little different from the original version. You may also find some insight in lucidrains' bit-diffusion.
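To make the "only considering the past predicted frame" part concrete, here is a minimal, purely illustrative sketch of that autoregressive loop. All names (`segment`, `run_sequence`) are hypothetical stand-ins, not the repo's actual API; the real network would condition a diffusion/segmentation model on `prev_mask`.

```python
# Toy sketch (hypothetical names): each frame is segmented using only the
# mask *predicted* for the previous frame, never the ground-truth masks.

def segment(frame, prev_mask):
    """Stand-in for the real model: records what guided this prediction."""
    return {"frame": frame, "guided_by": prev_mask}

def run_sequence(frames, init_mask=None):
    masks, prev = [], init_mask
    for f in frames:
        pred = segment(f, prev)  # condition only on the predicted past
        masks.append(pred)
        prev = pred              # autoregressive: feed the prediction forward
    return masks

masks = run_sequence(["f0", "f1", "f2"])
```

Here `masks[1]` is guided by the prediction for `f0`, not by any label, which is the difference from teacher-forced setups that feed ground truth forward.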

I'm afraid I cannot provide full details for you since I've recently been very busy with other projects. I have not yet tried other multi-class VOS tasks with Pix2Seq and my approach, so I'm not sure about the performance there. But my group members have reproduced my pipeline on ultrasound segmentation, and I'm confident it works well for binary segmentation.

Thus, my code should be suitable for both TBGDiff and pix2Seq if your task is binary segmentation.

Sorry about that! When I'm free I will write up more detailed tutorials.

Thanks for your attention!

Cheers :)

Rydeen

@YingJGuo
Author

I see.
Thanks again for your quick response!
I will close this issue.
:)

@haipengzhou856 haipengzhou856 added the fixed addressed issue label Nov 29, 2024
@YingJGuo
Author

Sorry for another dumb question.
The paper mentions that for STCN, STM, and ShadowSAM, "we directly predict the first frame instead of using the label."
I am confused about how this is implemented. (Maybe train the model without giving the prompt?)

@haipengzhou856
Owner

Hi there!

No worries. The mentioned Semi-Supervised VOS methods require the first frame's ground truth as initialization; that's why they are called semi-supervised. We simply add an auxiliary decoder to predict the first-frame mask and use that prediction as the initialization instead of the gt.
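In case it helps, here is a hedged sketch of what "predict the first frame instead of using the label" might look like. `initialize` and `aux_decoder` are hypothetical names for illustration, not the repo's actual functions.

```python
# Toy sketch: choose the initialization mask for a memory-based VOS model
# (STCN/STM-style). With a first-frame gt we are in the classic
# semi-supervised setting; without it, an auxiliary decoder predicts it.

def initialize(first_frame, first_gt=None, aux_decoder=None):
    """Return the mask used to initialize the VOS memory."""
    if first_gt is not None:            # semi-supervised: use the label
        return first_gt
    return aux_decoder(first_frame)     # otherwise: predict it ourselves

aux = lambda frame: f"pred({frame})"    # stand-in auxiliary decoder
semi = initialize("frame0", first_gt="gt0")        # -> "gt0"
ours = initialize("frame0", aux_decoder=aux)       # -> "pred(frame0)"
```

The rest of the pipeline is unchanged; only the source of the first mask differs.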

Using the first-frame gt would be very unfair when comparing against Unsupervised VOS, since that initialization plays a key role. In my experiments, for example, STCN with the first-frame gt yields about 0.75 IoU even with a ResNet50 backbone (very surprising, right?). The two settings belong to different streams; you can refer to some survey papers for more on the differences.

Feel free to ask me any questions if you still have some problems.

Cheers :)

Rydeen

@YingJGuo
Author

YingJGuo commented Dec 1, 2024

Does it mean that an auxiliary decoder module predicts the first frame for the semi-supervised VOS methods, similar to the aux_head in TBGDiff which predicts pseudo masks?

This auxiliary module can load some pretrained weights (e.g., SegFormer) and is trained jointly, specifically for predicting the first frame.

Am I getting this right?

@haipengzhou856
Owner

Yeah, you're right. But the auxiliary head can be very simple, e.g., just conv layers or an MLP projection like SegFormer's. My aux_head predicts all the frames; the outputs are coarse but enough for diffusion guidance. For SSVOS, I just take the aux_head's output for the first frame, so no other code needs to be modified.
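A minimal sketch of that reuse, with hypothetical names (`aux_head` here is a toy stand-in for the real conv/MLP head):

```python
# Toy sketch: the aux_head already produces a coarse mask for every frame;
# for SSVOS we simply keep the first frame's output as the initialization,
# so nothing else in the pipeline has to change.

def aux_head(frame_features):
    # A real head would be a 1x1 conv or a SegFormer-style MLP projection;
    # this stand-in just tags each frame's features.
    return [f"coarse_mask({f})" for f in frame_features]

all_masks = aux_head(["feat0", "feat1", "feat2"])  # used as diffusion guidance
first_frame_init = all_masks[0]                    # SSVOS initialization
```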

The main difference between SSVOS and UVOS is whether the first-frame gt prior is used. I reproduced these methods just because I'm familiar with them, and I was too lazy to run other UVOS methods :p

In my experience, the backbone/encoder extracts enough information, especially for binary segmentation (a heavy decoder also matters for multi-class tasks, as in Mask2Former).

Feel free to ask me any questions if you still have some problems.
Cheers :)
Rydeen

@YingJGuo
Author

YingJGuo commented Dec 1, 2024

I completely understand now. Thank you for your patient and detailed explanation!
:D
