
Always return the same photos #15

Closed
kwea123 opened this issue May 21, 2023 · 4 comments

Comments

@kwea123

kwea123 commented May 21, 2023

No matter what I ask, it always returns photos from the same small set of maybe 5-6 images. It's not always the same photo, but it's always from that set.
Like this one for "office". I intentionally gave a prompt that can't be answered from this image, "show me the ceiling", but it still shows this same photo:
[image]

@jedyang97
Contributor

jedyang97 commented May 21, 2023

Hi @kwea123 thanks for your feedback!

This is a small optimization we made just for the demo page: we pre-rendered 6 images for each scene (standing in the middle of the room and rotating the camera 6 times) to speed up the image-rendering step, so demo users won't wait too long while the agent is reasoning (rendering pictures in real time with nerfstudio + LERF takes about 30 seconds).

We are aware this is a big pain point for the pipeline, so the next item on our TODO list is a "smarter" rendering step. Specifically:

  • We will pre-compute the 3D semantic embeddings (using something like a CLIP vision encoder) of all "key points" in the scene.
  • When a user's text query comes in, we compute a relevancy metric between the query and all key-point embeddings to determine the "hot areas" of the room where we should render pictures for downstream reasoning.
  • Only photos of these "hot areas" are passed to the downstream multimodal reasoning pipeline.
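The relevancy step described above can be sketched as follows. This is a minimal toy sketch, not the actual implementation: the embeddings here are tiny hand-made vectors standing in for CLIP outputs, and `hot_areas` is a hypothetical helper name.

```python
import numpy as np

def relevancy_scores(query_emb: np.ndarray, keypoint_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between a query embedding and each key-point embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    k = keypoint_embs / np.linalg.norm(keypoint_embs, axis=1, keepdims=True)
    return k @ q

def hot_areas(query_emb: np.ndarray, keypoint_embs: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Indices of the top-k most relevant key points (the "hot areas")."""
    scores = relevancy_scores(query_emb, keypoint_embs)
    return np.argsort(scores)[::-1][:top_k]

# Toy example: 4 key points in a 3-D embedding space (real CLIP embeddings
# would be 512-D or larger).
keypoints = np.array([
    [1.0, 0.0, 0.0],   # e.g. near the door
    [0.0, 1.0, 0.0],   # e.g. near the desk
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],   # e.g. the ceiling
])
query = np.array([0.0, 0.0, 1.0])      # a query about the ceiling
print(hot_areas(query, keypoints, top_k=1))  # -> [3]
```

Only the camera poses around the returned indices would then be rendered, instead of all 6 fixed viewpoints.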

Hopefully this will speed up the process enough that we can use real-time rendered images instead of the 6 pre-rendered photos in the demo. The photos should also be taken from better camera poses, because they will be conditioned on the text query instead of a fixed point in the room.

We are also open to better ideas and implementations for this rendering step. Would love to hear what you think!

@mu-cai

mu-cai commented May 22, 2023

NeRF rendering is slow, and GPT-4 matching and searching are also slow, so I think real-time feedback is nearly impossible here.

@jedyang97
Contributor

@mu-cai I believe it should be doable with the pipeline proposed above. In our experience, it takes 3-5 seconds to render a 512 × 512 picture with NeRF. So if we calculate the relevancy scores and only take pictures around the relevant areas (only 1-2 pics per text query), the experience should be near real-time. We can also stream the rendering process (i.e., display a picture once it's done instead of waiting for all images to finish), which further reduces the perceived latency on the user's side.
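The "display each picture as soon as it's done" idea can be sketched with standard concurrency primitives. This is a generic illustration, not the project's code: `render_view` is a hypothetical stand-in for a 3-5 s NeRF render call, and the sleep just simulates latency.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def render_view(pose: str) -> str:
    """Hypothetical stand-in for a 3-5 s NeRF render of one camera pose."""
    time.sleep(0.01)  # simulate render latency
    return f"image@{pose}"

poses = ["hot_area_1", "hot_area_2"]  # the 1-2 query-conditioned viewpoints
shown = []
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(render_view, p) for p in poses]
    # Handle each image the moment its render finishes, rather than
    # blocking until the whole batch is done.
    for fut in as_completed(futures):
        shown.append(fut.result())  # a real UI would display it here

print(sorted(shown))
```

Whether the renders actually overlap depends on the renderer (GPU contention may serialize them), but even sequential rendering benefits from showing the first image early.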

@jedyang97
Contributor

Hi @kwea123 @mu-cai, we have implemented the above features in #17. You can try out these new features with our latest demo to see them in action!

Some quick screenshots:

"How many doors are there in this room?"
[screenshot]

"find all the chairs"
[screenshot]
