Depth Anything: update conversion script for V2 #31522
Conversation
Amazing - thanks for adding!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@pcuenca, thank you for your conversion. Have you compared the predictions of the converted model with those of the original implementation?
Hi @LiheYoung, thanks for checking! Yes, I could exactly replicate the results from the small version of the model, applying the same inputs to both the original and the transformers implementations. The reference implementation I used was the one from your demo Space. I saved the depth output from the second image example (the sunflowers) as a numpy array, and verified transformers inference with the following code:

```python
from transformers import AutoModelForDepthEstimation, AutoProcessor
from PIL import Image
import torch
import torch.nn.functional as F
import numpy as np
import cv2  # needed for cv2.INTER_CUBIC below
from torchvision.transforms import Compose

# Copied from the original source code
from depth_anything_transform import *

model_id = "pcuenq/Depth-Anything-V2-Small-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForDepthEstimation.from_pretrained(model_id).eval()

image = Image.open("space/Depth-Anything-V2/examples/demo02.jpg")
w, h = image.size

# Manually pre-process to match the original source code.
# The transformers pre-processor produces slightly different values for some reason.
transform = Compose([
    Resize(
        width=518,
        height=518,
        resize_target=False,
        keep_aspect_ratio=True,
        ensure_multiple_of=14,
        resize_method="lower_bound",
        image_interpolation_method=cv2.INTER_CUBIC,
    ),
    NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    PrepareForNet(),
])
pixel_values = np.array(image) / 255.0
pixel_values = transform({"image": pixel_values})["image"]
pixel_values = torch.from_numpy(pixel_values).unsqueeze(0)

with torch.inference_mode():
    # Original (DA2) pre-processing
    outputs = model(pixel_values=pixel_values, output_hidden_states=False)

    # transformers processor
    inputs = processor(images=image, return_tensors="pt")
    outputs_transformers = model(**inputs, output_hidden_states=False)

# Compare with results from the same image obtained with
# https://huggingface.co/spaces/depth-anything/Depth-Anything-V2
def compare_with_reference(outputs, reference_depth, filename):
    depth = outputs["predicted_depth"]
    depth = F.interpolate(depth[:, None], (h, w), mode="bilinear", align_corners=True)[0, 0]
    max_diff = np.abs(depth - reference_depth).max()
    mean_diff = np.abs(depth - reference_depth).mean()
    print(f"Sum of absolute differences vs baseline: {np.sum(np.abs(depth.numpy() - reference_depth))}")
    print(f"Difference vs reference, max: {max_diff}, mean: {mean_diff}")
    # raw_depth = Image.fromarray(depth.numpy().astype('uint16'))
    depth = (depth - depth.min()) / (depth.max() - depth.min()) * 255.0
    depth = depth.numpy().astype(np.uint8)
    # colored_depth = (cmap(depth)[:, :, :3] * 255).astype(np.uint8)
    gray_depth = Image.fromarray(depth)
    gray_depth.save(filename)

reference_depth = np.load("space/Depth-Anything-V2/depth_gradio.npy")
compare_with_reference(outputs, reference_depth, "gray_depth.png")
compare_with_reference(outputs_transformers, reference_depth, "gray_depth_transformers.png")
```

Results are identical when the same pre-processing steps are used, but not when using the transformers pre-processor. I assume most of the difference comes from the resampling algorithms (the original code uses OpenCV, while transformers uses PIL). I also assume (but didn't check) that the same processor differences affect the v1 version as well.

cc @NielsRogge in case he has additional insight
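For completeness, the resampling gap can be seen in isolation, independently of the model. A minimal sketch, reusing the same example image and assuming plain bicubic resizing on both sides purely for illustration:

```python
import cv2
import numpy as np
from PIL import Image

# Same example image as above; any RGB image works for this comparison.
image = Image.open("space/Depth-Anything-V2/examples/demo02.jpg").convert("RGB")
target = (518, 518)  # (width, height); aspect-ratio handling is ignored here

# OpenCV bicubic resize, as used by the original Depth Anything pre-processing
resized_cv = cv2.resize(np.array(image), target, interpolation=cv2.INTER_CUBIC)

# PIL bicubic resize, to illustrate how a PIL-based resize differs
resized_pil = np.array(image.resize(target, resample=Image.BICUBIC))

diff = np.abs(resized_cv.astype(np.float32) - resized_pil.astype(np.float32))
print(f"max abs difference: {diff.max():.2f}, mean abs difference: {diff.mean():.4f}")
```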
```diff
@@ -14,7 +14,7 @@ rendered properly in your Markdown viewer.
 
 -->
 
-# Depth Anything
+# Depth Anything and Depth Anything V2
```
I'm in favor of not polluting these docs and instead adding a new doc just for v2, as there's also a new paper: https://arxiv.org/abs/2406.09414.
This can be done in a similar way to how we did it for Flan-T5 compared to the original T5: https://github.com/huggingface/transformers/blob/main/docs/source/en/model_doc/flan-t5.md
Agreed here - I'm happy for updates to the script if it's just a few lines so we can convert the checkpoints, but if the model is being added to the library it should have its own model page.
I wasn't sure how to deal with this. There are no modelling changes, the conversion script is inside the same directory as the previous version, and I felt it was weird to have a documentation page about a new model that actually refers to the same implementation as before. In my opinion, it's clearer to mention both in the same page so readers understand it's the same model architecture. We can use a single name in the title if that's preferred, and maybe improve the description in the body of the page, making sure we mention both papers.
Happy to work on another solution if there's consensus. These are the options I see:
- Remove the doc updates, as in the original version of this PR that was approved.
- Create a new documentation page for Depth Anything V2. It'd be essentially a duplicate of the Depth Anything page, except the paper would be updated and the snippets would use the new model ids.
- Use the same page for both, as in the current version of this PR, maybe tweaking as needed.
No need to add a whole new model - we can just add a new modeling page (so option 2) :)
It's fine if the modeling pages are quite similar for the code examples, this is true for a lot of text models too.
There are some models which have checkpoints that load into another architecture, without a new architecture being added. For example, BARTPho loads into the MBart model.
I'm in favor of option 2 since we did the same for other models in the past
Hi @pcuenca, thank you for your clarification and efforts! I checked the sample code and also found slight differences between transformers' bicubic interpolation and the OpenCV cubic interpolation used by our original code. It seems inevitable in the current transformers implementation, so I am okay with this pull request. Thank you.
Thank you @LiheYoung! Can we move the transformers checkpoints to your organization?
Sure @pcuenca, thank you all!
This reverts commit be0ca47.
Thanks for updating the model pages!
Depth Anything V2 was introduced in [the paper of the same name](https://arxiv.org/abs/2406.09414) by Lihe Yang et al. It uses the same architecture as the original [Depth Anything model](depth_anything), but uses synthetic data and a larger-capacity teacher model to achieve much finer and more robust depth predictions.

The abstract from the paper is the following:
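Such a page could also carry a short usage example. A minimal sketch using the depth-estimation pipeline, where the checkpoint id is the temporary one from this PR (expected to move to the depth-anything organization) and the image URL is just a placeholder example:

```python
from transformers import pipeline
from PIL import Image
import requests

# Temporary checkpoint id used in this PR; may change after the transfer to the depth-anything org.
pipe = pipeline(task="depth-estimation", model="pcuenq/Depth-Anything-V2-Small-hf")

# Placeholder example image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

result = pipe(image)
result["predicted_depth"]           # raw depth tensor
result["depth"].save("depth.png")   # depth map rendered as a PIL image
```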
Perhaps we can also add a note to the docs of v1, stating "there's a v2 available", in red so that it's visible?
Thanks @amyeroberts @NielsRogge for the guidance! The test failure seems unrelated, but happy to revisit if necessary. @LiheYoung I transferred the models to your organization and updated the model cards, feel free to make changes or create a collection :)
Merging, as the failing test is unrelated to the changes in this PR.
Thank you for all your efforts! I will link our repository to these models.
What does this PR do?
Update the Depth Anything conversion script to support V2 models.
The only architectural change is the use of intermediate features instead of the outputs from the last 4 layers.
This is already supported in the backbone configuration, so the change simply involves updating the configuration in the conversion script.
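A quick way one might sanity-check the converted configuration, assuming the layer selection is reflected in the backbone's `out_indices` (the V2 model id below is the temporary one used in this PR):

```python
from transformers import AutoConfig

# V1 checkpoint (existing) vs. converted V2 checkpoint (temporary id from this PR).
config_v1 = AutoConfig.from_pretrained("LiheYoung/depth-anything-small-hf")
config_v2 = AutoConfig.from_pretrained("pcuenq/Depth-Anything-V2-Small-hf")

# Assumption: the backbone layer selection is exposed via out_indices on the
# DINOv2 backbone config; print both to compare which features feed the DPT neck.
print("V1 backbone out_indices:", config_v1.backbone_config.out_indices)
print("V2 backbone out_indices:", config_v2.backbone_config.out_indices)
```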
Converted models (no model card or license information):
Pending to do, if this approach is accepted:
- Move the checkpoints to the https://huggingface.co/depth-anything organization, assuming the authors agree to it.
Who can review?
@NielsRogge, @amyeroberts
cc @LiheYoung, @bingykang