Authors: Evžen Wybitul, Evan Ryan Gunter, Mikhail Seleznyov, David Lindner
Full paper: https://arxiv.org/abs/2411.13211
ViSTa is a dataset for evaluating Vision-based understanding of Sequential Tasks. ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual home, Minecraft, and real-world environments. It has a hierarchical structure: basic single-step tasks compose into increasingly complex sequential tasks. ViSTa is intended to allow precise quantification of models' sequential reasoning capabilities across sequences of different lengths, with a focus on the capabilities required for using vision-language models (VLMs) as reward models in reinforcement learning.
The recommended way to use ViSTa is to download it as a standalone dataset and build your own evaluation code on top of it; more on this below. If you would rather use our own evaluation code, we supply it in `evaluation/`.
Downloading the dataset: The dataset has three parts: the videos, the problem sets, and the metadata table connecting them. The videos can be downloaded here; the metadata table and the problem sets (also called `tasks`) are in the repository under `data/`.
Using the dataset: Download the videos and load the metadata table. The metadata table has the complete information about each video, allowing you to iterate through all the videos and load them as needed:

- `video`: the path to the video file, relative to the root of the downloaded videos directory
- `description`: the description of the video
- `level`: the level of the video, determining the number of actions in the video
- `environment`: the environment in which the video was recorded
- `problem_set_type`: human-readable description of the problem set type the video belongs to, e.g. "objects", "actions", or "permutation"
- `problem_set_id`: identifier of the problem set the video belongs to
Some videos belong to multiple problem sets, and some videos have multiple valid descriptions. In these cases, the table contains multiple rows for the same video.
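For illustration, here is a minimal sketch of iterating over the dataset with pandas. The metadata file name and format, and the directory paths, are assumptions; point them at the table shipped under `data/` and at your downloaded videos.

```python
from pathlib import Path

import pandas as pd

# Assumed paths and file name: adjust to where you placed the repository
# checkout and the downloaded videos.
VIDEOS_ROOT = Path("videos")
METADATA_PATH = Path("data/metadata.csv")

metadata = pd.read_csv(METADATA_PATH)

for row in metadata.itertuples():
    video_path = VIDEOS_ROOT / row.video  # path relative to the videos root
    description = row.description         # a valid description of the video
    level = row.level                      # number of actions in the video
    # Load the video with your library of choice (e.g. decord or torchvision)
    # and feed (video, description) pairs to the model you are evaluating.
```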
If you want to evaluate your model on the problem sets we supply, take a look at the YAML files in `data/tasks/[[problem_set_id]]` for our GPT prompts. You can also programmatically load each problem set through the `Task` class in `evaluation/src/vlm/objects.py`.
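If you do not want to depend on our evaluation code, you can also read the problem-set files yourself. The sketch below only assumes that the files under `data/tasks/` are valid YAML; it does not assume anything about their schema or about the `Task` class's API.

```python
from pathlib import Path

import yaml  # pip install pyyaml

TASKS_DIR = Path("data/tasks")  # assumed path to the problem sets in the repository

# Collect every YAML file, keyed by its path relative to data/tasks/.
problem_sets = {
    str(path.relative_to(TASKS_DIR)): yaml.safe_load(path.read_text())
    for path in sorted(TASKS_DIR.rglob("*.yaml"))
}

print(f"Loaded {len(problem_sets)} problem-set files")
```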
Fundamental tasks (level 1): These test whether a model can identify fundamental actions like "mine a wooden block", "open a door", or "put a banana into the closet". The actions are sometimes quite complex: for example, the video for "heat up an apple" shows the agent putting the apple in a microwave, turning it on, waiting, and then picking the apple back up.
Sequential tasks (levels 2 and above): These use sequences of the fundamental actions, like "pick up an apple, then put the apple in the drawer", to test whether a model understands action order and whether it notices when actions are swapped out for different ones. A sequence in level n consists of n fundamental actions.
ViSTa groups the video-description pairs into problem sets: classification problems testing specific capabilities. When evaluated on a problem set, a model gets a video and must score how well the video matches each description in the problem set.
Object problem sets: These are Level 1 (single-action) problem sets which test object recognition; they contain videos such as "We pick up an apple" and "We pick up a hammer".
Object property problem sets: These are Level 1 (single-action) problem sets which test detection of specific object properties (open/closed, turned on/turned off, etc.). They contain videos such as "We observe an open drawer" and "We observe a closed drawer".
Action problem sets: These are Level 1 (single-action) problem sets which test understanding of particular actions (heating, cooling, cleaning, etc.). The videos include "We heat up a banana" and "We put a banana in a microwave without turning it on".
Higher-level problem sets (Levels 2–8): These test understanding of action sequences: the model must distinguish the correct description of a sequence from descriptions in which the actions are reordered or swapped out for different ones.
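To make the protocol concrete, here is a rough sketch of scoring one video against a problem set's descriptions with an off-the-shelf CLIP model. This is not our evaluation pipeline: the model choice, frame sampling, and score aggregation are all illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative model choice; any image-text scoring model could stand in here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_video(frames: list[Image.Image], descriptions: list[str]) -> torch.Tensor:
    """Return one matching score per description for a video given as sampled frames."""
    inputs = processor(text=descriptions, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_frames, num_descriptions); average over frames.
    return outputs.logits_per_image.mean(dim=0)

# A video is classified as the description with the highest score; problem-set
# accuracy is the fraction of videos for which the ground-truth description wins.
```

Averaging per-frame scores is the simplest way to turn an image-text model into a video scorer; video-native models such as ViCLIP score the sampled frames jointly instead.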
Some videos in ViSTa are from existing datasets; most are manually filmed or edited.
ViSTa contains more than 3,000 videos in the virtual home environment: around 2,000 in level 1 and the rest in levels 2–8. The videos are clips from ALFRED and combinations thereof.
ViSTa contains more than 1,100 videos in the real-world environment, of which 200 are sourced from Kinetics-700; the rest were created specifically for ViSTa.
- 810 videos (in levels 1–5 and 8) are directly analogous to the virtual home videos: they show the agent (us) doing tasks in the real world.
- 95 videos test object tracking: they show similar objects being shuffled.
- 18 videos test understanding of object interactions: they show us pinning fabric, or only appearing to do so.
- 200 videos test action recognition in complex contexts, sourced from Kinetics-700: they show either a door opening or a door closing.
In Minecraft, ViSTa has 53 videos in levels 1–3. Most were created manually; the rest were sourced from the BASALT benchmark.
In 2024, we used ViSTa to evaluate three current VLMs: CLIP, ViCLIP, and GPT-4o. Unsurprisingly, GPT-4o was significantly better than the open-source models. All models were good at recognizing objects, but had a harder time recognizing object properties and actions. None of the models were able to understand sequences of tasks well.