Ability to use a mixture of encoders in MLLMs? #236

Open
Rkyzzy opened this issue Jan 2, 2025 · 0 comments
Comments


Rkyzzy commented Jan 2, 2025

Hi! Thanks for the great work. Since Depth-Anything-V2 is capable of depth estimation, I assume its features carry spatial information (especially depth). I wonder whether the Depth-Anything-V2 backbone could serve as one of the vision encoders in a mixture-of-encoders paradigm for MLLMs. Intuitively, it could enhance the spatial or 3D awareness of an MLLM. (The common pairing has been SigLIP + DINOv2, but your model may be better at predicting depth.) Such capabilities could be a great fit for application scenarios like autonomous driving or vision-and-language navigation (VLN) for robots.

Have you tried using it to enhance MLLM ability? Do you have any quantitative results on this? Thanks in advance!
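
For concreteness, here is a minimal sketch of the kind of fusion I have in mind. It assumes both towers emit patch tokens of shape `[B, N, C]` over the same token grid; the `MixtureOfEncoders` class, the dimensions, and the LLaVA-style MLP projector are all illustrative stand-ins, not anything from this repo:

```python
# Hypothetical sketch: channel-wise fusion of two vision encoders for an MLLM.
# Either encoder argument can be any module returning patch tokens [B, N, C],
# e.g. a SigLIP vision tower and the DINOv2-based Depth-Anything-V2 backbone.
import torch
import torch.nn as nn


class MixtureOfEncoders(nn.Module):
    def __init__(self, semantic_encoder, depth_encoder,
                 semantic_dim=1152, depth_dim=1024, llm_dim=4096):
        super().__init__()
        self.semantic_encoder = semantic_encoder  # e.g. SigLIP tower
        self.depth_encoder = depth_encoder        # e.g. Depth-Anything-V2 backbone
        # LLaVA-style two-layer MLP projecting fused features into the LLM space.
        self.projector = nn.Sequential(
            nn.Linear(semantic_dim + depth_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, images):
        sem = self.semantic_encoder(images)    # [B, N, semantic_dim]
        dep = self.depth_encoder(images)       # [B, N, depth_dim]
        # Channel-wise concat assumes matching token grids; with different
        # grids one sequence would need to be resampled first.
        fused = torch.cat([sem, dep], dim=-1)  # [B, N, semantic_dim + depth_dim]
        return self.projector(fused)           # visual tokens for the LLM


class DummyTower(nn.Module):
    """Stand-in encoder: a single patch-embedding conv producing [B, N, dim]."""
    def __init__(self, dim, patch=14):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        return self.patch_embed(x).flatten(2).transpose(1, 2)


if __name__ == "__main__":
    moe = MixtureOfEncoders(DummyTower(1152), DummyTower(1024))
    tokens = moe(torch.randn(2, 3, 224, 224))
    print(tokens.shape)  # torch.Size([2, 256, 4096])
```

In a real setup the dummy towers would be replaced by the pretrained encoders (typically frozen), with only the projector, and optionally the LLM, being trained.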
