Question on "content representation extraction task" of Q former #11

Open

WonwoongCho opened this issue May 14, 2024 · 1 comment

@WonwoongCho
Hi authors, thank you for sharing the awesome work.

As far as I understand, only the style representation from the Q-Former is used during inference.
If that is correct, why is content training needed?
Does it help the Q-Former learn a better disentangled representation of "style"?

I may have missed some parts of the paper. I'd appreciate it if somebody could let me know.
Thanks!

@Tianhao-Qi
Collaborator

The goal of dual content training is to help the model better distinguish between the style and the semantics of the reference image. It therefore reduces the influence of the reference image's semantics and leads to better text alignment, as shown in Table 2 of our paper.
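To make the mechanism concrete, here is a rough, framework-agnostic sketch of my reading of the discussion: two sets of learnable queries cross-attend to the same reference-image features, each set is supervised by a different training task (style vs. content), and at inference only the style representation is kept. The query counts, dimensions, and function names below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features):
    """Single-head cross-attention: queries (Q, d) attend over features (N, d)."""
    d = queries.shape[-1]
    attn = softmax(queries @ features.T / np.sqrt(d), axis=-1)  # (Q, N)
    return attn @ features  # (Q, d)

rng = np.random.default_rng(0)
d = 64
style_queries = rng.normal(size=(8, d))    # hypothetical learnable "style" queries
content_queries = rng.normal(size=(8, d))  # hypothetical learnable "content" queries
image_feats = rng.normal(size=(197, d))    # e.g. ViT patch features of the reference image

# Training: both query sets attend to the same image features, but each is
# supervised by its own task, so they learn complementary (disentangled) information.
style_repr = cross_attention(style_queries, image_feats)
content_repr = cross_attention(content_queries, image_feats)

# Inference: only the style representation conditions generation; the content
# branch exists to push reference-image semantics out of the style queries.
conditioning = style_repr
print(conditioning.shape)  # (8, 64)
```

Because both query sets share the attention mechanism but not their supervision, the style queries are discouraged from encoding semantics that the content queries already capture, which matches the "reduce the impact of reference image semantics" explanation above.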
