pix2Seq models #2
Hi there! For pix2seq, I recommend checking issue #1 and their paper, especially the algorithm chart. My reconstructed version is in Pix2Seq.py. The core idea is using the predicted mask as guidance (i.e., only conditioning on the past predicted frame), which may differ slightly from the original version. You may also find some insight in lucidrains' bit-diffusion. I'm afraid I can't provide full details, since I've been very busy with other projects recently. I have not yet tried other multi-class VOS tasks with Pix2Seq and my approach, so I'm not sure about the performance there. However, my group members have reproduced my pipeline on ultrasound segmentation, and I'm confident it works well for binary segmentation. So if your task is binary segmentation, my code should be suitable for both TBGDiff and Pix2Seq. Sorry about that! When I'm free I will put together a more detailed tutorial. Thanks for your attention! Cheers :) Rydeen
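To illustrate the "past predicted frame as guidance" idea in very rough terms, here is a minimal toy sketch. Everything in it is hypothetical: `denoise_step`, the list-of-floats mask representation, the blending schedule, and the guidance weight `GUIDANCE_W` are illustrative stand-ins, not the actual Pix2Seq.py implementation.

```python
GUIDANCE_W = 0.3  # made-up weight for how strongly the past mask steers denoising

def denoise_step(noisy_mask, frame_feat, guidance, t, total_steps):
    """One toy reverse-diffusion step: blend the noisy mask toward a target
    that mixes the current frame's evidence with the previous frame's
    predicted mask (the guidance)."""
    alpha = (total_steps - t) / total_steps  # grows as t counts down
    target = [
        (1 - GUIDANCE_W) * f + GUIDANCE_W * g
        for f, g in zip(frame_feat, guidance)
    ]
    return [(1 - alpha) * m + alpha * tgt for m, tgt in zip(noisy_mask, target)]

def segment_video(frame_feats, steps=4):
    """Autoregressive loop: each frame's denoising is conditioned only on
    the PREVIOUS frame's *predicted* mask, never on ground truth."""
    prev_mask = [0.0] * len(frame_feats[0])  # cold start: empty guidance
    preds = []
    for feat in frame_feats:
        mask = [0.5] * len(feat)  # start from "noise"
        for t in range(steps, 0, -1):
            mask = denoise_step(mask, feat, prev_mask, t, steps)
        mask = [1.0 if m > 0.5 else 0.0 for m in mask]  # binarize
        preds.append(mask)
        prev_mask = mask  # the prediction becomes the next frame's guidance
    return preds

# Two frames, two "pixels": foreground evidence on the first pixel only.
print(segment_video([[0.9, 0.1], [0.8, 0.2]]))  # → [[1.0, 0.0], [1.0, 0.0]]
```

The point is only the control flow: the loop never touches a ground-truth mask after initialization, which is what distinguishes this from the semi-supervised setting discussed below.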
I see.
Sorry for another dumb question.
Hi there! Never mind. The semi-supervised VOS methods you mention require the first-frame ground truth as initialization; that's why they are called semi-supervised. We simply add an auxiliary decoder to predict the first frame as the initialization, instead of using the GT. Using the first-frame GT would be very unfair in the unsupervised VOS setting, because this initialization plays a key role. In my experiments, for example, STCN with the first-frame GT yields about 0.75 IoU even with ResNet-50 (very surprising, right?). The two settings belong to different streams; you can refer to survey papers for more on their differences. Feel free to ask me any questions if you still have some problems. Cheers :) Rydeen
Does it mean using an auxiliary decoder module to predict the first frame for semi-supervised VOS, similar to the aux_head in TBGDiff which predicts pseudo masks? This auxiliary module can load some pretrained weights (e.g., SegFormer) and is trained jointly, specifically for predicting the first frame. Am I getting this right?
Yeah, you're right. But the auxiliary head can be very simple, e.g., just conv layers or an MLP projection like SegFormer's. My aux_head predicts all the frames; the predictions are coarse but sufficient for diffusion guidance. For SSVOS, I just use the aux_head to output the first frame, so no other code needs to be modified. The main difference between SSVOS and UVOS is whether the first-frame GT prior is used. I reproduced these methods just because I'm familiar with them, and I was too lazy to try other UVOS methods :p In my experience, the backbone/encoder already extracts enough information, especially for binary segmentation (a heavy decoder also matters for multi-class tasks, like Mask2Former). Feel free to ask me any questions if you still have some problems.
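As a rough illustration of how lightweight such a head can be, here is a toy per-pixel linear-projection head. The function names, the flat list-of-features input, and the weights are all made up for illustration; the real aux_head's architecture and training are in the repository.

```python
import math

def aux_head(pixel_feats, weights, bias):
    """Toy MLP-style head: project each pixel's feature vector to a
    foreground probability with a single linear layer + sigmoid.
    All names and weights here are illustrative, not the repo's code."""
    out = []
    for feat in pixel_feats:
        logit = sum(w * x for w, x in zip(weights, feat)) + bias
        out.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
    return out

def init_mask_for_ssvos(first_frame_feats, weights, bias, thresh=0.5):
    """Semi-supervised VOS initialization as described above: binarize the
    aux_head's coarse prediction of the FIRST frame, rather than reading
    the ground-truth first-frame mask."""
    probs = aux_head(first_frame_feats, weights, bias)
    return [1.0 if p > thresh else 0.0 for p in probs]

# Two "pixels" with 2-D features; positive logit -> foreground.
print(init_mask_for_ssvos([[2.0, 0.0], [-2.0, 0.0]], [1.0, 1.0], 0.0))
# → [1.0, 0.0]
```

Even a head this simple can work when the encoder features are already discriminative, which matches the comment above that the backbone does most of the work for binary segmentation.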
I completely understand now, thank you for your patient and detailed explanation!
First of all, thanks for your awesome code. It's really helpful to me.
I am new to this field. The official pix2seq code is based on TensorFlow 2 and comes without guidance for training on a custom VOS task, which seems too hard for me.
I wonder if I can train and evaluate the pix2seq model on my custom dataset based on your released code?