Depth Anything: update conversion script for V2 #31522
Conversation
Amazing - thanks for adding!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@pcuenca, thank you for your conversion. Have you compared the predictions of the converted model with those of the original implementation?
Hi @LiheYoung, thanks for checking! Yes, I could exactly replicate the results from the small version of the model, applying the same inputs to both the original and the transformers implementations. The reference implementation I used was the one from your demo Space. I saved the depth output from the second image example (the sunflowers) as a numpy array, and verified transformers inference with the following code:

```python
from transformers import AutoModelForDepthEstimation, AutoProcessor
from PIL import Image
import torch
import torch.nn.functional as F
import numpy as np
import cv2  # needed for cv2.INTER_CUBIC below
from torchvision.transforms import Compose

# Copied from the original source code
from depth_anything_transform import *

model_id = "pcuenq/Depth-Anything-V2-Small-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForDepthEstimation.from_pretrained(model_id).eval()

image = Image.open("space/Depth-Anything-V2/examples/demo02.jpg")
w, h = image.size

# Manually pre-process to match the original source code.
# The transformers pre-processor produces slightly different values for some reason.
transform = Compose([
    Resize(
        width=518,
        height=518,
        resize_target=False,
        keep_aspect_ratio=True,
        ensure_multiple_of=14,
        resize_method="lower_bound",
        image_interpolation_method=cv2.INTER_CUBIC,
    ),
    NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    PrepareForNet(),
])
pixel_values = np.array(image) / 255.0
pixel_values = transform({"image": pixel_values})["image"]
pixel_values = torch.from_numpy(pixel_values).unsqueeze(0)

with torch.inference_mode():
    # Original (DA2) pre-processing
    outputs = model(pixel_values=pixel_values, output_hidden_states=False)

    # transformers processor
    inputs = processor(images=image, return_tensors="pt")
    outputs_transformers = model(**inputs, output_hidden_states=False)

# Compare with results from the same image obtained with
# https://huggingface.co/spaces/depth-anything/Depth-Anything-V2
def compare_with_reference(outputs, reference_depth, filename):
    depth = outputs["predicted_depth"]
    depth = F.interpolate(depth[:, None], (h, w), mode="bilinear", align_corners=True)[0, 0]
    max_diff = np.abs(depth - reference_depth).max()
    mean_diff = np.abs(depth - reference_depth).mean()
    print(f"Sum of absolute differences vs baseline: {np.sum(np.abs(depth.numpy() - reference_depth))}")
    print(f"Difference vs reference, max: {max_diff}, mean: {mean_diff}")
    # raw_depth = Image.fromarray(depth.numpy().astype('uint16'))
    depth = (depth - depth.min()) / (depth.max() - depth.min()) * 255.0
    depth = depth.numpy().astype(np.uint8)
    # colored_depth = (cmap(depth)[:, :, :3] * 255).astype(np.uint8)
    gray_depth = Image.fromarray(depth)
    gray_depth.save(filename)

reference_depth = np.load("space/Depth-Anything-V2/depth_gradio.npy")
compare_with_reference(outputs, reference_depth, "gray_depth.png")
compare_with_reference(outputs_transformers, reference_depth, "gray_depth_transformers.png")
```

Results are identical when the same pre-processing steps are used, but not when using the transformers pre-processor. I assume most of the difference comes from the resampling algorithms (the original code uses OpenCV, while transformers uses PIL). I also assume (but didn't check) that the same processor differences affect the v1 version as well.

cc @NielsRogge in case he has additional insight
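For completeness, the resampling gap can be seen in isolation, independently of the model. A minimal sketch, reusing the same example image and assuming plain bicubic resizing on both sides purely for illustration:

```python
import cv2
import numpy as np
from PIL import Image

# Same example image as above; any RGB image works for this comparison.
image = Image.open("space/Depth-Anything-V2/examples/demo02.jpg").convert("RGB")
target = (518, 518)  # (width, height); aspect-ratio handling is ignored here

# OpenCV bicubic resize, as used by the original Depth Anything pre-processing
resized_cv = cv2.resize(np.array(image), target, interpolation=cv2.INTER_CUBIC)

# PIL bicubic resize, to illustrate how a PIL-based resize differs
resized_pil = np.array(image.resize(target, resample=Image.BICUBIC))

diff = np.abs(resized_cv.astype(np.float32) - resized_pil.astype(np.float32))
print(f"max abs difference: {diff.max():.2f}, mean abs difference: {diff.mean():.4f}")
```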
```diff
@@ -14,7 +14,7 @@ rendered properly in your Markdown viewer.
 
 -->
 
-# Depth Anything
+# Depth Anything and Depth Anything V2
```
I'm in favor of not polluting these docs and instead adding a new doc just for v2, as there's also a new paper: https://arxiv.org/abs/2406.09414.
This can be done in a similar way to how we did it for Flan-T5 compared to the original T5: https://github.com/huggingface/transformers/blob/main/docs/source/en/model_doc/flan-t5.md
Agreed here - I'm happy for updates to the script if it's just a few lines so we can convert the checkpoints, but if the model is being added to the library it should have its own model page.
I wasn't sure how to deal with this. There are no modelling changes, the conversion script is inside the same directory as the previous version, and I felt it was weird to have a documentation page about a new model that actually refers to the same implementation as before. In my opinion, it's clearer to mention both in the same page so readers understand it's the same model architecture. We can use a single name in the title if that's preferred, and maybe improve the description in the body of the page, making sure we mention both papers.
Happy to work on another solution if there's consensus. These are the options I see:
- Remove the doc updates, as in the original version of this PR that was approved.
- Create a new documentation page for Depth Anything V2. It'd be essentially a duplicate of the Depth Anything page, except the paper would be updated and the snippets would use the new model ids.
- Use the same page for both, as in the current version of this PR, maybe tweaking as needed.
No need to add a whole new model - we can just add a new modeling page (so option 2) :)
It's fine if the modeling pages are quite similar for the code examples, this is true for a lot of text models too.
There are some models which have checkpoints that load into another architecture, without a new architecture being added. For example, BARTPho loads into the MBart model.
I'm in favor of option 2 since we did the same for other models in the past
Hi @pcuenca, thank you for your clarification and efforts! I checked the sample code and also found slight differences between transformers' bicubic interpolation and the OpenCV cubic interpolation used by our original code. It seems inevitable in the current transformers implementation, so I am okay with this pull request. Thank you.
Thank you @LiheYoung! Can we move the transformers checkpoints to your organization?
Sure @pcuenca, thank you all!
This reverts commit be0ca47.
Thanks for updating the model pages!
Depth Anything V2 was introduced in [the paper of the same name](https://arxiv.org/abs/2406.09414) by Lihe Yang et al. It uses the same architecture as the original [Depth Anything model](depth_anything), but uses synthetic data and a larger-capacity teacher model to achieve much finer and more robust depth predictions.

The abstract from the paper is the following:
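Such a page could also carry a short usage example. A minimal sketch using the depth-estimation pipeline, where the checkpoint id is the temporary one from this PR (expected to move to the depth-anything organization) and the image URL is just a placeholder example:

```python
from transformers import pipeline
from PIL import Image
import requests

# Temporary checkpoint id used in this PR; may change after the transfer to the depth-anything org.
pipe = pipeline(task="depth-estimation", model="pcuenq/Depth-Anything-V2-Small-hf")

# Placeholder example image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

result = pipe(image)
result["predicted_depth"]           # raw depth tensor
result["depth"].save("depth.png")   # depth map rendered as a PIL image
```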
Perhaps we can also add a note to the docs of v1, stating "there's a v2 available", in red so that it's visible?
Thanks @amyeroberts @NielsRogge for the guidance! The test failure seems unrelated, but happy to revisit if necessary. @LiheYoung I transferred the models to your organization and updated the model cards, feel free to make changes or create a collection :)
Merging, as the failing test is unrelated to the changes in this PR.
Thank you for all your efforts! I will link our repository to these models.
What does this PR do?
Update the Depth Anything conversion script to support V2 models.
The only architectural change is the use of intermediate features instead of the outputs from the last 4 layers.
This is already supported in the backbone configuration, so the change simply involves updating the configuration in the conversion script.
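A quick way one might sanity-check the converted configuration, assuming the layer selection is reflected in the backbone's `out_indices` (the V2 model id below is the temporary one used in this PR):

```python
from transformers import AutoConfig

# V1 checkpoint (existing) vs. converted V2 checkpoint (temporary id from this PR).
config_v1 = AutoConfig.from_pretrained("LiheYoung/depth-anything-small-hf")
config_v2 = AutoConfig.from_pretrained("pcuenq/Depth-Anything-V2-Small-hf")

# Assumption: the backbone layer selection is exposed via out_indices on the
# DINOv2 backbone config; print both to compare which features feed the DPT neck.
print("V1 backbone out_indices:", config_v1.backbone_config.out_indices)
print("V2 backbone out_indices:", config_v2.backbone_config.out_indices)
```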
Converted models (no model card or license information):
Pending to do, if this approach is accepted:
- Move the checkpoints to the https://huggingface.co/depth-anything organization, assuming the authors agree to it.
Who can review?
@NielsRogge, @amyeroberts
cc @LiheYoung, @bingykang