
Commit

Merge pull request #124 from shivank21/ICAL
⚡ Add Summary for ICAL
MansiGupta1603 authored Dec 8, 2024
2 parents 86bb43c + 3bc9144 commit c9ec93d
Showing 3 changed files with 84 additions and 0 deletions.
Binary file added images/ical1.jpg
Binary file added images/ical2.jpg
84 changes: 84 additions & 0 deletions summaries/ICAL.md
@@ -0,0 +1,84 @@
# VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs

Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki, **NeurIPS 2024**

## Summary

The paper introduces In-Context Abstraction Learning (ICAL), a method that improves the ability of large language and vision-language models (LLMs and VLMs) to learn from a few demonstrations. In-context learning traditionally relies on high-quality demonstrations; ICAL instead lets these models learn from sub-optimal demonstrations by generating their own in-context examples enriched with multimodal abstractions.

## Contributions

1. **In-Context Abstraction Learning (ICAL):** A novel method that enables LLMs and VLMs to generate their own examples from generic, sub-optimal demonstrations.

2. **Multimodal Abstractions:** ICAL focuses on four types of cognitive abstractions (task and causal abstractions, state changes, task decomposition and subgoals, and task construals) to correct and enrich the examples.


## Method and Pipeline

1. ICAL starts with a noisy trajectory, a sequence of observations and actions collected from non-expert humans or generated by an agent, denoted $\xi_{noisy} = \{ o_0, a_0, \dots, o_T, a_T \}$, in a new task domain $D$.

2. **Abstraction Phase:** The VLM identifies and corrects errors in the trajectory and enriches it with four types of language abstractions (a minimal code sketch of this step follows the definitions below):

- **Task and Causal Abstractions:** Explain the fundamental principles or actions needed to achieve a goal. Example: "Since the box is already open, there is no need to close it after placing the watches inside."

- **State Changes:** Describe how actions affect the form and conditions of objects in the scene.

- **Task Decomposition and Subgoals:** Break down a complex task into intermediate steps.

- **Task Construals:** Highlight essential visual details within a task.

   Mathematically,

   $F_{abstract} : (\xi_{noisy}, I, \{e_1, \dots, e_k\}) \rightarrow (\xi_{optimized}, L)$

   Where:
- $\xi_{noisy}$: Noisy Trajectory
- $I$: Task Instruction
- $\{e_1, \dots, e_k\}$: Top-k previously successful in-context examples
- $\xi_{optimized}$: Optimized Trajectory
- $L$: Language Abstractions
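
   To make the abstraction step concrete, here is a minimal Python sketch of $F_{abstract}$ as a single VLM call. The `vlm_complete` interface, the prompt wording, and the `act:`/`note:` response format are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the abstraction phase F_abstract (assumed interface).
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    observation: str  # text rendering of o_t (could also carry an image)
    action: str       # a_t


Trajectory = List[Step]


def f_abstract(
    vlm_complete: Callable[[str], str],  # hypothetical VLM text interface
    xi_noisy: Trajectory,
    instruction: str,
    examples: List[str],                 # top-k previously successful examples
) -> Tuple[Trajectory, List[str]]:
    """Ask the VLM to correct the noisy trajectory and emit language abstractions."""
    prompt = (
        f"Task instruction: {instruction}\n\n"
        "Reference examples:\n" + "\n---\n".join(examples) + "\n\n"
        "Noisy trajectory:\n"
        + "\n".join(f"obs: {s.observation} | act: {s.action}" for s in xi_noisy)
        + "\n\nCorrect any erroneous actions, then list abstractions "
        "(task/causal, state changes, subgoals, construals). "
        "Reply with one 'act: <action>' line per step, then 'note: <abstraction>' lines."
    )
    response = vlm_complete(prompt)

    act_lines = [l for l in response.splitlines() if l.startswith("act:")]
    optimized = [
        Step(step.observation, line.removeprefix("act:").strip())
        for step, line in zip(xi_noisy, act_lines)
    ]
    abstractions = [
        l.removeprefix("note:").strip()
        for l in response.splitlines()
        if l.startswith("note:")
    ]
    return optimized, abstractions
```

   The optimized trajectory $\xi_{optimized}$ then replaces the noisy one before execution in the environment (step 3).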

3. The optimized trajectory is executed in the environment, and a human observer provides natural language feedback when the agent fails. The VLM then uses this feedback to revise the trajectory and the abstractions (a sketch of this feedback loop follows the figure below).

   Mathematically,

   $\Xi_{update} : (\xi_{optimized}, H(a_t, o_t), L, I, \{e_1, \dots, e_k\}) \rightarrow (\xi'_{optimized}, L')$

   Where:
- $\Xi_{update}$: Update function
- $\xi_{optimized}$: Current trajectory
- $H(a_t, o_t)$: Human feedback on action $a_t$ at observation $o_t$
- $L$: Current annotations
- $I$: Task Instruction
- $\{e_1, \dots, e_k\}$: Top-k retrieved examples
- $\xi'_{optimized}$: Revised Trajectory
- $L'$: Updated annotations

4. If the execution is successful, the revised trajectory and abstractions are added to the agent's memory.

<img src='../images/ical1.jpg'>
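
A minimal sketch of the feedback-and-revision loop in steps 3 and 4 is below. The `env.execute`, `get_human_feedback`, and `xi_update` callables are hypothetical stand-ins for the environment, the human observer, and the VLM-backed update function $\Xi_{update}$; the paper does not define these interfaces.

```python
# Minimal sketch of the human-in-the-loop refinement loop (steps 3-4 above).
from typing import Callable, List, Tuple


def refine_with_feedback(
    env,                           # environment exposing an `execute` method (assumed)
    xi_update: Callable,           # VLM-backed update function, as in the text (assumed)
    get_human_feedback: Callable,  # human observer providing H(a_t, o_t) (assumed)
    xi_optimized: list,
    abstractions: List[str],
    instruction: str,
    examples: List[str],
    memory: List[Tuple[list, List[str]]],
    max_rounds: int = 3,
) -> bool:
    """Execute the trajectory; on failure, revise it with human feedback.

    On success, store (trajectory, abstractions) in the agent's memory (step 4).
    """
    for _ in range(max_rounds):
        success, failed_action, failed_obs = env.execute(xi_optimized)
        if success:
            memory.append((xi_optimized, abstractions))  # add to example memory
            return True
        # Step 3: the human observer describes what went wrong at (a_t, o_t).
        feedback = get_human_feedback(failed_action, failed_obs)
        xi_optimized, abstractions = xi_update(
            xi_optimized, feedback, abstractions, instruction, examples
        )
    return False
```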

5. **Retrieval-Augmented Generation at Deployment:** When presented with new instructions, the agent retrieves similar examples from its memory and uses them as context to generate actions (a retrieval sketch follows the figure below).

&emsp; Mathematically,

&emsp; $s = \lambda_I \cdot s_I + \lambda_{textual} \cdot s_{textual} + \lambda_{visual} \cdot s_{visual}$

&emsp; Where:
- $s$: Aggregated similarity score
- $s_I$, $s_{textual}$, $s_{visual}$: Similarity scores for the instruction, textual state, and visual state respectively, computed via cosine similarity
- $\lambda_I$, $\lambda_{textual}$, and $\lambda_{visual}$: Weighting hyperparameters

<img src='../images/ical2.jpg'>
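
Below is a small sketch of the deployment-time retrieval score. The dictionary fields and plain NumPy embeddings are assumptions for illustration; in the paper the instruction, textual-state, and visual-state similarities come from pretrained language and visual encoders.

```python
# Minimal sketch of weighted similarity retrieval over the example memory.
from typing import List

import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity with a small epsilon for numerical safety."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))


def aggregate_score(query: dict, example: dict,
                    lam_i: float = 1.0, lam_text: float = 1.0,
                    lam_vis: float = 1.0) -> float:
    """s = lambda_I * s_I + lambda_textual * s_textual + lambda_visual * s_visual."""
    s_i = cosine(query["instruction_emb"], example["instruction_emb"])
    s_text = cosine(query["text_state_emb"], example["text_state_emb"])
    s_vis = cosine(query["visual_state_emb"], example["visual_state_emb"])
    return lam_i * s_i + lam_text * s_text + lam_vis * s_vis


def retrieve_top_k(query: dict, memory: List[dict], k: int = 5) -> List[dict]:
    """Return the k stored examples most similar to the current query."""
    return sorted(memory, key=lambda ex: aggregate_score(query, ex), reverse=True)[:k]
```

The retrieved examples, together with their abstractions, are placed in the VLM's context when generating actions for the new instruction.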

## Results

1. On the TEACh household instruction-following benchmark (validation unseen split), ICAL examples significantly improve on the state of the art in goal-condition success, outperforming agents that use raw visual demonstrations as in-context examples without abstraction learning.

2. ICAL outperforms the previous state of the art on the VisualWebArena benchmark.

3. ICAL demonstrates superior performance on Ego4D action anticipation compared to hand-written few-shot GPT-4V prompts that use chain-of-thought reasoning.

## Our Two Cents

The paper introduces a promising approach for enhancing the learning capabilities of LLMs and VLMs from sub-optimal demonstrations. The concept of generating multimodal abstractions is highly applicable to real-world settings. However, a limitation of the approach is its reliance on a fixed action API, which may restrict the adaptability of the agent. Looking ahead, a valuable direction for future research would be to integrate richer forms of human feedback, such as demonstrations or direct corrections to the abstractions, to further improve the learning process.
