Ability to use a mixture of encoders in MLLMs? #236

Open
Rkyzzy opened this issue Jan 2, 2025 · 0 comments
Comments


Rkyzzy commented Jan 2, 2025

Hi! Thanks for the great work. Since Depth-Anything-V2 is capable of depth estimation, I assume its features carry spatial information (especially depth). I wonder whether the Depth-Anything-V2 backbone could serve as one of the vision encoders in a mixture-of-encoders paradigm for MLLMs. Intuitively, it could enhance the spatial or 3D awareness of an MLLM. (The common pairing has been SigLIP + DINOv2, but your model may be better at predicting depth.) Such capabilities could be a great fit for application scenarios like autonomous driving or vision-and-language navigation (VLN) for robots.

Have you tried using it to enhance MLLM ability? Do you have any quantitative results on this? Thanks in advance!
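
For concreteness, here is a minimal sketch of the kind of fusion I have in mind. It assumes both towers emit patch tokens of shape `[B, N, C]` over the same token grid; the `MixtureOfEncoders` class, the dimensions, and the LLaVA-style MLP projector are all illustrative stand-ins, not anything from this repo:

```python
# Hypothetical sketch: channel-wise fusion of two vision encoders for an MLLM.
# Either encoder argument can be any module returning patch tokens [B, N, C],
# e.g. a SigLIP vision tower and the DINOv2-based Depth-Anything-V2 backbone.
import torch
import torch.nn as nn


class MixtureOfEncoders(nn.Module):
    def __init__(self, semantic_encoder, depth_encoder,
                 semantic_dim=1152, depth_dim=1024, llm_dim=4096):
        super().__init__()
        self.semantic_encoder = semantic_encoder  # e.g. SigLIP tower
        self.depth_encoder = depth_encoder        # e.g. Depth-Anything-V2 backbone
        # LLaVA-style two-layer MLP projecting fused features into the LLM space.
        self.projector = nn.Sequential(
            nn.Linear(semantic_dim + depth_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, images):
        sem = self.semantic_encoder(images)    # [B, N, semantic_dim]
        dep = self.depth_encoder(images)       # [B, N, depth_dim]
        # Channel-wise concat assumes matching token grids; with different
        # grids one sequence would need to be resampled first.
        fused = torch.cat([sem, dep], dim=-1)  # [B, N, semantic_dim + depth_dim]
        return self.projector(fused)           # visual tokens for the LLM


class DummyTower(nn.Module):
    """Stand-in encoder: a single patch-embedding conv producing [B, N, dim]."""
    def __init__(self, dim, patch=14):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        return self.patch_embed(x).flatten(2).transpose(1, 2)


if __name__ == "__main__":
    moe = MixtureOfEncoders(DummyTower(1152), DummyTower(1024))
    tokens = moe(torch.randn(2, 3, 224, 224))
    print(tokens.shape)  # torch.Size([2, 256, 4096])
```

In a real setup the dummy towers would be replaced by the pretrained encoders (typically frozen), with only the projector, and optionally the LLM, being trained.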
