Authors: Evžen Wybitul, Evan Ryan Gunter, Mikhail Seleznyov, David Lindner
Full paper: https://arxiv.org/abs/2411.13211
ViSTa is a dataset for evaluating Vision-based understanding of Sequential Tasks. ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual home, Minecraft, and real-world environments. It has a hierarchical structure: basic single-step tasks compose into increasingly complex sequential tasks. ViSTa is intended to allow precise quantification of models' sequential reasoning capabilities across sequences of different lengths, with a focus on the capabilities required for using vision-language models (VLMs) as reward models in reinforcement learning.
The recommended way to use ViSTa is to download it as a standalone dataset and build your own evaluation code on top of it; more on this below. If you would rather use our own evaluation code, we supply it in `evaluation/`.
Downloading the dataset: The dataset has three parts: the videos, the problem sets, and the metadata table connecting them. The videos can be downloaded here; the metadata table and the problem sets (also called `tasks`) are in the repository under `data/`.
Using the dataset: Download the videos and load the metadata table. The metadata table has the complete information about each video, allowing you to iterate through all the videos and load them as needed:

- `video`: the path to the video file, relative to the root of the downloaded videos directory
- `description`: the description of the video
- `level`: the level of the video, determining the number of actions in the video
- `environment`: the environment in which the video was recorded
- `problem_set_type`: human-readable description of the problem set type the video belongs to, e.g. "objects", "actions", or "permutation"
- `problem_set_id`: identifier of the problem set the video belongs to
Some videos belong to multiple problem sets, and some videos have multiple valid descriptions. In these cases, the table contains multiple rows for the same video.
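For illustration, here is a minimal sketch of iterating over the dataset with pandas. The metadata file name and format, and the directory paths, are assumptions; point them at the table shipped under `data/` and at your downloaded videos.

```python
from pathlib import Path

import pandas as pd

# Assumed paths and file name: adjust to where you placed the repository
# checkout and the downloaded videos.
VIDEOS_ROOT = Path("videos")
METADATA_PATH = Path("data/metadata.csv")

metadata = pd.read_csv(METADATA_PATH)

for row in metadata.itertuples():
    video_path = VIDEOS_ROOT / row.video  # path relative to the videos root
    description = row.description         # a valid description of the video
    level = row.level                      # number of actions in the video
    # Load the video with your library of choice (e.g. decord or torchvision)
    # and feed (video, description) pairs to the model you are evaluating.
```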
If you want to evaluate your model on the problem sets we supply, take a look at the YAML files in `data/tasks/[[problem_set_id]]` for our GPT prompts. You can also programmatically load each problem set through the `Task` class in `evaluation/src/vlm/objects.py`.
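If you do not want to depend on our evaluation code, you can also read the problem-set files yourself. The sketch below only assumes that the files under `data/tasks/` are valid YAML; it does not assume anything about their schema or about the `Task` class's API.

```python
from pathlib import Path

import yaml  # pip install pyyaml

TASKS_DIR = Path("data/tasks")  # assumed path to the problem sets in the repository

# Collect every YAML file, keyed by its path relative to data/tasks/.
problem_sets = {
    str(path.relative_to(TASKS_DIR)): yaml.safe_load(path.read_text())
    for path in sorted(TASKS_DIR.rglob("*.yaml"))
}

print(f"Loaded {len(problem_sets)} problem-set files")
```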
Fundamental tasks (level 1): These test whether a model can identify fundamental actions like "mine a wooden block", "open a door", or "put a banana into the closet". The actions are sometimes quite complex: for example, the video for "heat up an apple" shows the agent putting the apple in a microwave, turning it on, waiting, and then picking the apple back up.
Sequential tasks (levels 2 and above): These use sequences of the fundamental actions, like "pick up an apple, then put the apple in the drawer", to test whether a model understands action order and whether it notices when actions are swapped out for different ones. A sequence in level n consists of n fundamental actions.
ViSTa groups the video-description pairs into problem sets: classification problems testing specific capabilities. When evaluated on a problem set, a model gets a video and must score how well the video matches each description in the problem set.
Object problem sets: These are Level 1 (single-action) problem sets which test object recognition; they contain videos such as "We pick up an apple" and "We pick up a hammer".
Object property problem sets: These are Level 1 (single-action) problem sets which test detection of specific object properties (open/closed, turned on/turned off, etc.). They contain videos such as "We observe an open drawer" and "We observe a closed drawer".
Action problem sets: These are Level 1 (single-action) problem sets which test understanding of particular actions (heating, cooling, cleaning, etc.). The videos include "We heat up a banana" and "We put a banana in a microwave without turning it on".
Higher-level problem sets (Levels 2–8): These test understanding of action sequences: the model must distinguish the correct description of a sequence from descriptions in which the actions are reordered or swapped out for different ones.
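To make the protocol concrete, here is a rough sketch of scoring one video against a problem set's descriptions with an off-the-shelf CLIP model. This is not our evaluation pipeline: the model choice, frame sampling, and score aggregation are all illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative model choice; any image-text scoring model could stand in here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_video(frames: list[Image.Image], descriptions: list[str]) -> torch.Tensor:
    """Return one matching score per description for a video given as sampled frames."""
    inputs = processor(text=descriptions, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_frames, num_descriptions); average over frames.
    return outputs.logits_per_image.mean(dim=0)

# A video is classified as the description with the highest score; problem-set
# accuracy is the fraction of videos for which the ground-truth description wins.
```

Averaging per-frame scores is the simplest way to turn an image-text model into a video scorer; video-native models such as ViCLIP score the sampled frames jointly instead.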
Some videos in ViSTa are from existing datasets; most are manually filmed or edited.
ViSTa contains more than 3,000 videos in the virtual home environment: around 2,000 in level 1 and the rest in levels 2–8. The videos are clips from ALFRED and combinations thereof.
ViSTa contains more than 1,100 videos in the real-world environment, of which 200 are sourced from Kinetics-700; the rest were created specifically for ViSTa.
- 810 videos (in levels 1–5 and 8) are directly analogous to the virtual home videos: they show the agent (us) doing tasks in the real world.
- 95 videos test object tracking: they show similar objects being shuffled.
- 18 videos test understanding of object interactions: they show us pinning fabric, or only appearing to do so.
- 200 videos test action recognition in complex contexts, sourced from Kinetics-700: they show either a door opening or a door closing.
In Minecraft, ViSTa has 53 videos in levels 1–3. Most were created manually; the rest were sourced from the BASALT benchmark.
In 2024, we used ViSTa to evaluate three current VLMs: CLIP, ViCLIP, and GPT-4o. Unsurprisingly, GPT-4o was significantly better than the open-source models. All models were good at recognizing objects, but had a harder time recognizing object properties and actions. None of the models were able to understand sequences of tasks well.