Depth Anything: update conversion script for V2 #31522

Merged
8 commits merged on Jul 5, 2024


pcuenca
Member

@pcuenca pcuenca commented Jun 20, 2024

What does this PR do?

Update the Depth Anything conversion script to support V2 models.

The only architectural change is the use of intermediate features instead of the outputs from the last 4 layers.

This is already supported in the backbone configuration, so the change simply involves updating the configuration.
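
As a rough illustration of that configuration change, the sketch below contrasts the two ways of selecting backbone stages. The 12-layer depth, the helper names, and the even spacing of the intermediate layers are assumptions made for illustration only; the indices actually used for each checkpoint size are set in the conversion script and may differ.

```python
def last_n_layers(num_layers: int, n: int = 4) -> list[int]:
    """V1-style selection: the final n layers (1-indexed)."""
    return list(range(num_layers - n + 1, num_layers + 1))


def evenly_spaced_layers(num_layers: int, n: int = 4) -> list[int]:
    """V2-style selection, assuming even spacing: n intermediate layers ending at the last one."""
    step = num_layers // n
    return [step * (i + 1) for i in range(n)]


num_layers = 12  # hypothetical backbone depth
v1_out_indices = last_n_layers(num_layers)
v2_out_indices = evenly_spaced_layers(num_layers)

print("last four layers:   ", v1_out_indices)
print("intermediate layers:", v2_out_indices)
```

In transformers, a list like this would typically be passed as the `out_indices` of the backbone configuration, which is why no modeling code needs to change.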

Converted models (no model card or license information):

Still to do, if this approach is accepted:

  • Complete the model cards and transfer the models to the https://huggingface.co/depth-anything organization, assuming the authors agree to it.
  • Update docs.
  • Update tests, if necessary.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@NielsRogge, @amyeroberts
cc @LiheYoung, @bingykang

Collaborator

@amyeroberts amyeroberts left a comment

Amazing - thanks for adding!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@LiheYoung

LiheYoung commented Jun 20, 2024

@pcuenca, thank you for the conversion. Have you compared the predictions of the converted transformers model with our original V2 codebase? I previously made a similar modification to the one in this PR in my clone of transformers, but I found that the results could not be exactly aligned at this verification line: there was a gap of around 1e-2 between the two models' predictions.
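
For readers unfamiliar with this kind of check, here is a minimal sketch of a tolerance-based comparison between a converted model's output and a reference. The helper names, the sample values, and the `atol=1e-4` threshold are all assumptions for illustration and are not taken from the conversion script.

```python
def max_abs_diff(converted, reference):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(converted) != len(reference):
        raise ValueError("sequences must have the same length")
    return max(abs(c - r) for c, r in zip(converted, reference))


def outputs_match(converted, reference, atol=1e-4):
    """True when every element agrees with the reference within atol (assumed tolerance)."""
    return max_abs_diff(converted, reference) <= atol


reference = [1.0, 2.0, 3.0]                # made-up reference depth values
converted = [r + 0.01 for r in reference]  # simulate a gap of about 1e-2

print("max abs diff:", max_abs_diff(converted, reference))
print("within tolerance:", outputs_match(converted, reference))
```

A gap of about 1e-2 fails a check with a tolerance this tight, so a mismatch of that size would stop the verification step unless its source is identified.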

@pcuenca
Member Author

pcuenca commented Jun 22, 2024

Hi @LiheYoung, thanks for checking!

Yes, I could replicate exactly the results from the small version of the model, applying the same inputs to both the original and the transformers implementations. The reference implementation I used was the one from your demo Space. I saved the depth output from the second image example (the sunflowers) as a numpy array, and verified transformers inference with the following code:

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision.transforms import Compose
from transformers import AutoModelForDepthEstimation, AutoProcessor

# Transform classes copied from the original source code
from depth_anything_transform import NormalizeImage, PrepareForNet, Resize

model_id = "pcuenq/Depth-Anything-V2-Small-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForDepthEstimation.from_pretrained(model_id).eval()

image = Image.open("space/Depth-Anything-V2/examples/demo02.jpg")
w, h = image.size

# Manually pre-process to match the original source code.
# The transformers pre-processor produces slightly different values for some reason.
transform = Compose([
    Resize(
        width=518,
        height=518,
        resize_target=False,
        keep_aspect_ratio=True,
        ensure_multiple_of=14,
        resize_method="lower_bound",
        image_interpolation_method=cv2.INTER_CUBIC,
    ),
    NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    PrepareForNet(),
])
pixel_values = np.array(image) / 255.0
pixel_values = transform({"image": pixel_values})["image"]
pixel_values = torch.from_numpy(pixel_values).unsqueeze(0)

with torch.inference_mode():
    # Inputs pre-processed with the original Depth Anything V2 transform
    outputs = model(pixel_values=pixel_values, output_hidden_states=False)

    # Inputs pre-processed with the transformers processor
    inputs = processor(images=image, return_tensors="pt")
    outputs_transformers = model(**inputs, output_hidden_states=False)


# Compare with results from the same image obtained with
# https://huggingface.co/spaces/depth-anything/Depth-Anything-V2
def compare_with_reference(outputs, reference_depth, filename):
    depth = outputs["predicted_depth"]
    # Upsample the prediction to the original image size
    depth = F.interpolate(depth[:, None], (h, w), mode="bilinear", align_corners=True)[0, 0]
    # Convert once to numpy so that all comparisons use the same array type
    depth = depth.numpy()

    abs_diff = np.abs(depth - reference_depth)
    print(f"[{filename}] sum of absolute differences vs. reference: {abs_diff.sum()}")
    print(f"[{filename}] max difference: {abs_diff.max()}, mean difference: {abs_diff.mean()}")

    # Normalize to 0-255 and save as a grayscale image for visual inspection
    depth = (depth - depth.min()) / (depth.max() - depth.min()) * 255.0
    gray_depth = Image.fromarray(depth.astype(np.uint8))
    gray_depth.save(filename)


reference_depth = np.load("space/Depth-Anything-V2/depth_gradio.npy")
compare_with_reference(outputs, reference_depth, "gray_depth.png")
compare_with_reference(outputs_transformers, reference_depth, "gray_depth_transformers.png")
```

Results are identical when the same pre-processing steps are used, but are not equal when using the transformers pre-processor. I assume most of the difference will come from the resampling algorithms (the original code uses OpenCV, while transformers uses PIL). I also assume (but didn't check) that the same processor differences will affect the v1 version as well.
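
To make the resampling hypothesis concrete, the sketch below evaluates a cubic convolution kernel with two different values of its free parameter `a`. It assumes, as is commonly reported but not stated in this thread, that OpenCV's `INTER_CUBIC` uses `a = -0.75` while Pillow's bicubic filter uses `a = -0.5`. It also ignores other possible differences, such as antialiasing when downscaling, so it illustrates only one plausible source of the mismatch. The sample intensities are made up.

```python
def cubic_kernel(x: float, a: float) -> float:
    """Cubic convolution kernel with free parameter a."""
    x = abs(x)
    if x <= 1.0:
        return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
    if x < 2.0:
        return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
    return 0.0


def cubic_weights(t: float, a: float) -> list[float]:
    """Weights of the four neighbouring samples for a point at offset t (0 <= t < 1) past the second one."""
    return [
        cubic_kernel(1.0 + t, a),
        cubic_kernel(t, a),
        cubic_kernel(1.0 - t, a),
        cubic_kernel(2.0 - t, a),
    ]


def interpolate(samples: list[float], t: float, a: float) -> float:
    """Interpolate between samples[1] and samples[2] using all four samples."""
    return sum(w * s for w, s in zip(cubic_weights(t, a), samples))


samples = [0.0, 0.2, 0.9, 0.3]  # made-up pixel intensities in [0, 1]
value_a050 = interpolate(samples, 0.5, a=-0.5)   # assumed Pillow-style kernel
value_a075 = interpolate(samples, 0.5, a=-0.75)  # assumed OpenCV-style kernel

print("a = -0.50:", value_a050)
print("a = -0.75:", value_a075)
print("difference:", abs(value_a050 - value_a075))
```

Both kernels are valid bicubic interpolators, since their weights sum to one, yet they produce different pixel values from the same input, and the model then propagates that difference to the predicted depth.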

cc @NielsRogge in case he has additional insight

@@ -14,7 +14,7 @@ rendered properly in your Markdown viewer.

-->

-# Depth Anything
+# Depth Anything and Depth Anything V2
Contributor

@NielsRogge NielsRogge Jun 28, 2024

I'm in favor of not polluting these docs and instead adding a new doc page just for v2, as there's also a new paper: https://arxiv.org/abs/2406.09414.

This can be done in a similar way to how we did it for Flan-T5 compared to the original T5: https://github.com/huggingface/transformers/blob/main/docs/source/en/model_doc/flan-t5.md

Collaborator

Agreed here. I'm happy with updates to the script if it's just a few lines so we can convert the checkpoints, but if the model is being added to the library, it should have its own model page.

Member Author

I wasn't sure how to deal with this. There are no modelling changes, the conversion script is inside the same directory as the previous checkpoints, and I felt it was weird to have a documentation page about a new model that actually refers to the same implementation as before. In my opinion, it's clearer to mention both in the same page so readers understand it's the same model architecture. We can use a single name in the title if that's preferred, and maybe improve the description in the body of the page making sure we mention both papers.

Happy to work on another solution if there's consensus. These are the options I see:

  1. Remove the doc updates, as in the original version of this PR that was approved.
  2. Create a new documentation page for Depth Anything V2. It'd be essentially a duplicate of the Depth Anything page, except the paper would be updated and the snippets would use the new model ids.
  3. Use the same page for both, as in the current version of this PR, maybe tweaking as needed.

Collaborator

No need to add a whole new model - we can just add a new modeling page (so option 2) :)

It's fine if the modeling pages are quite similar for the code examples, this is true for a lot of text models too.

There are some models whose checkpoints load into another architecture, with no new architecture added. For example, BARTPho loads into the MBart model.

Contributor

I'm in favor of option 2, since we did the same for other models in the past.

@LiheYoung

Hi @pcuenca, thank you for your clarification and efforts! I checked the sample code and also found slight differences between the bicubic interpolation in transformers and the OpenCV cubic interpolation used by our original code. This seems unavoidable in the current version of transformers, so I am okay with this pull request. Thank you.

@pcuenca
Member Author

pcuenca commented Jul 1, 2024

Thank you @LiheYoung! Can we move the transformers checkpoints to your https://huggingface.co/depth-anything organization? (I can update the model cards before we do).

@LiheYoung

Sure @pcuenca, thank you all!

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for updating the model pages!


Depth Anything V2 was introduced in [the paper of the same name](https://arxiv.org/abs/2406.09414) by Lihe Yang et al. It uses the same architecture as the original [Depth Anything model](depth_anything), but uses synthetic data and a larger capacity teacher model to achieve much finer and more robust depth predictions.

The abstract from the paper is the following:

Contributor

Perhaps we can also add a note to the docs of v1, stating "there's a v2 available", in red so that it's visible?

@pcuenca
Member Author

pcuenca commented Jul 5, 2024

Thanks @amyeroberts @NielsRogge for the guidance! The test failure seems unrelated, but happy to revisit if necessary.

@LiheYoung I transferred the models to your organization and updated the model cards, feel free to make changes or create a collection :)

@amyeroberts
Collaborator

Merging as the failing checks are unrelated to this PR.

@amyeroberts amyeroberts merged commit 1082361 into huggingface:main Jul 5, 2024
20 of 22 checks passed
@LiheYoung

Thank you for all your efforts! I will link our repository to these models.

@pcuenca pcuenca deleted the depth-anything-v2 branch July 6, 2024 21:03