Question on "content representation extraction task" of Q former #11

Open

WonwoongCho opened this issue May 14, 2024 · 1 comment

@WonwoongCho
Hi authors, thank you for sharing the awesome work.

As far as I understand, only the style representation from the Q-Former is used during inference.
If that is correct, why is content training needed?
Does it help the Q-Former learn a better disentangled representation of "style"?

I may have missed some parts of the paper. I'd appreciate it if somebody could let me know.
Thanks!

@Tianhao-Qi
Collaborator

The goal of dual content training is to help the model better distinguish between the style and the semantics of the reference image. It therefore reduces the influence of the reference image's semantics and leads to better text alignment, as shown in Table 2 of our paper.
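To make the mechanism concrete, here is a rough, framework-agnostic sketch of my reading of the discussion: two sets of learnable queries cross-attend to the same reference-image features, each set is supervised by a different training task (style vs. content), and at inference only the style representation is kept. The query counts, dimensions, and function names below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features):
    """Single-head cross-attention: queries (Q, d) attend over features (N, d)."""
    d = queries.shape[-1]
    attn = softmax(queries @ features.T / np.sqrt(d), axis=-1)  # (Q, N)
    return attn @ features  # (Q, d)

rng = np.random.default_rng(0)
d = 64
style_queries = rng.normal(size=(8, d))    # hypothetical learnable "style" queries
content_queries = rng.normal(size=(8, d))  # hypothetical learnable "content" queries
image_feats = rng.normal(size=(197, d))    # e.g. ViT patch features of the reference image

# Training: both query sets attend to the same image features, but each is
# supervised by its own task, so they learn complementary (disentangled) information.
style_repr = cross_attention(style_queries, image_feats)
content_repr = cross_attention(content_queries, image_feats)

# Inference: only the style representation conditions generation; the content
# branch exists to push reference-image semantics out of the style queries.
conditioning = style_repr
print(conditioning.shape)  # (8, 64)
```

Because both query sets share the attention mechanism but not their supervision, the style queries are discouraged from encoding semantics that the content queries already capture, which matches the "reduce the impact of reference image semantics" explanation above.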
