Hi! Thanks for the great work. Since Depth-Anything-V2 is capable of depth estimation, I assume its features carry spatial information (especially depth). I wonder whether the Depth-Anything-V2 backbone could be used as one of the vision encoders in a mixture-of-encoders paradigm for MLLMs. Intuitively, it might enhance the spatial or 3D awareness of an MLLM. (The common pairing has been SigLIP + DINOv2, but your backbone may encode depth better.) That ability could be a great fit for application scenarios like autonomous driving or VLN for robots. Have you tried using it to enhance MLLM abilities? Do you have any quantitative results on this? Thanks in advance!
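For concreteness, here is a minimal, untested sketch of the mixture-of-encoders idea I have in mind: patch tokens from a semantic encoder (e.g. SigLIP) and a depth-aware encoder (e.g. the Depth-Anything-V2 DINOv2 backbone) are concatenated per patch and projected into the LLM embedding space. All class names, feature dimensions (1152 / 1024 / 4096), and the toy patch-embed stubs are placeholders for illustration, not the actual Depth-Anything-V2 or any MLLM codebase API.

```python
# Hypothetical mixture-of-encoders sketch; encoder stubs stand in for real
# pretrained ViTs, and all dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class StubEncoder(nn.Module):
    """Stand-in for a frozen ViT encoder (e.g. SigLIP or the
    Depth-Anything-V2 DINOv2 backbone) that returns patch tokens."""

    def __init__(self, dim: int):
        super().__init__()
        # Toy patch embedding: 14x14 patches, like a ViT-L/14.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=14, stride=14)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: [B, 3, H, W] -> patch tokens [B, N, dim]
        x = self.patch_embed(images)            # [B, dim, H/14, W/14]
        return x.flatten(2).transpose(1, 2)     # [B, N, dim]


class MixtureOfEncoders(nn.Module):
    """Concatenate per-patch features from a semantic encoder and a
    depth-aware encoder, then project into the LLM embedding space."""

    def __init__(self, semantic_dim=1152, depth_dim=1024, llm_dim=4096):
        super().__init__()
        self.semantic_enc = StubEncoder(semantic_dim)   # e.g. SigLIP
        self.depth_enc = StubEncoder(depth_dim)         # e.g. DA-V2 backbone
        self.projector = nn.Sequential(                 # LLaVA-style MLP projector
            nn.Linear(semantic_dim + depth_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        sem = self.semantic_enc(images)                 # [B, N, semantic_dim]
        dep = self.depth_enc(images)                    # [B, N, depth_dim]
        fused = torch.cat([sem, dep], dim=-1)           # channel-wise fusion per patch
        return self.projector(fused)                    # [B, N, llm_dim] visual tokens


if __name__ == "__main__":
    model = MixtureOfEncoders()
    imgs = torch.randn(2, 3, 224, 224)
    print(model(imgs).shape)  # torch.Size([2, 256, 4096])
```

In practice one would swap the stubs for the real frozen encoders and feed the projected tokens to the LLM alongside the text tokens; whether the depth-aware branch actually improves spatial reasoning is exactly what I'm asking about.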