
Always return the same photos #15

Closed
kwea123 opened this issue May 21, 2023 · 4 comments

Comments

@kwea123

kwea123 commented May 21, 2023

No matter what I ask, it always returns photos from the same small set of maybe 5-6 images. It's not always the same photo, but it's always from that set.
Like this one for "office". I intentionally gave a prompt that can't be answered from this image, "show me the ceiling", but it still shows this same photo:
[image]

@jedyang97
Contributor

jedyang97 commented May 21, 2023

Hi @kwea123 thanks for your feedback!

This is a small optimization we made just for the demo page: we pre-rendered 6 images for each scene (standing in the middle of the room and rotating the camera 6 times) to speed up the image-rendering step, so demo users won't wait too long while the agent is reasoning (rendering pictures in real time with nerfstudio + LERF takes about 30 seconds).

We are aware this is a big pain point for the pipeline, so the next item on our TODO list is a "smarter" rendering step. Specifically:

  • We will pre-compute the 3D semantic embeddings (using something like a CLIP vision encoder) of all "key points" in the scene.
  • When a user's text query comes in, we compute a relevancy metric between the query and all key-point embeddings to determine the "hot areas" of the room where we should render pictures for downstream reasoning.
  • Only photos of these "hot areas" are passed to the downstream multimodal reasoning pipeline.
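The relevancy step described above can be sketched as follows. This is a minimal toy sketch, not the actual implementation: the embeddings here are tiny hand-made vectors standing in for CLIP outputs, and `hot_areas` is a hypothetical helper name.

```python
import numpy as np

def relevancy_scores(query_emb: np.ndarray, keypoint_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between a query embedding and each key-point embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    k = keypoint_embs / np.linalg.norm(keypoint_embs, axis=1, keepdims=True)
    return k @ q

def hot_areas(query_emb: np.ndarray, keypoint_embs: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Indices of the top-k most relevant key points (the "hot areas")."""
    scores = relevancy_scores(query_emb, keypoint_embs)
    return np.argsort(scores)[::-1][:top_k]

# Toy example: 4 key points in a 3-D embedding space (real CLIP embeddings
# would be 512-D or larger).
keypoints = np.array([
    [1.0, 0.0, 0.0],   # e.g. near the door
    [0.0, 1.0, 0.0],   # e.g. near the desk
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],   # e.g. the ceiling
])
query = np.array([0.0, 0.0, 1.0])      # a query about the ceiling
print(hot_areas(query, keypoints, top_k=1))  # -> [3]
```

Only the camera poses around the returned indices would then be rendered, instead of all 6 fixed viewpoints.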

Hopefully this will speed up the process enough that we can use real-time rendered images instead of the 6 pre-rendered photos in the demo. The photos should also be taken from better camera poses, because they will be conditioned on the text query instead of a fixed point in the room.

We are also open to better ideas and implementations for this rendering step. Would love to hear what you think!

@mu-cai

mu-cai commented May 22, 2023

NeRF rendering is slow, and GPT-4 matching and searching are also slow, so I think real-time feedback is nearly impossible here.

@jedyang97
Contributor

@mu-cai I believe it should be doable with the pipeline proposed above. In our experience, it takes 3-5 seconds to render a 512 × 512 picture with NeRF. So if we calculate the relevancy scores and only take pictures around the relevant areas (only 1-2 pics per text query), the experience should be near real-time. We can also stream the rendering process (i.e., display a picture once it's done instead of waiting for all images to finish), which further reduces the perceived latency on the user's side.
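The "display each picture as soon as it's done" idea can be sketched with standard concurrency primitives. This is a generic illustration, not the project's code: `render_view` is a hypothetical stand-in for a 3-5 s NeRF render call, and the sleep just simulates latency.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def render_view(pose: str) -> str:
    """Hypothetical stand-in for a 3-5 s NeRF render of one camera pose."""
    time.sleep(0.01)  # simulate render latency
    return f"image@{pose}"

poses = ["hot_area_1", "hot_area_2"]  # the 1-2 query-conditioned viewpoints
shown = []
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(render_view, p) for p in poses]
    # Handle each image the moment its render finishes, rather than
    # blocking until the whole batch is done.
    for fut in as_completed(futures):
        shown.append(fut.result())  # a real UI would display it here

print(sorted(shown))
```

Whether the renders actually overlap depends on the renderer (GPU contention may serialize them), but even sequential rendering benefits from showing the first image early.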

@jedyang97
Contributor

Hi @kwea123 @mu-cai, we have implemented the above features in #17. You can try out these new features with our latest demo to see them in action!

Some quick screenshots:

"How many doors are there in this room?"
[screenshot]

"find all the chairs"
[screenshot]
