This is the PyTorch implementation of the paper "Egocentric Audio-Visual Object Localization."
We explore the task of egocentric audio-visual object localization, which aims to localize objects that emit sounds in first-person recordings. In this work, we propose a new framework that addresses the unique challenges of egocentric videos by answering two questions: (1) how to associate visual content with audio representations when out-of-view sounds may exist; and (2) how to persistently associate audio features with visual content captured under different viewpoints.
Note: some videos have since been filtered out, and some bounding boxes were recently updated.
- Download videos.
a. Download the Epic-Kitchens dataset from https://epic-kitchens.github.io/2023 (the website provides a script to download the videos).
- Preprocess videos.
a. Trim the videos using Epic-Kitchens' original annotations. For example, the test video timestamps can be found at https://github.com/epic-kitchens/epic-kitchens-100-annotations/blob/master/EPIC_100_test_timestamps.csv.
b. Extract waveforms at 11000 Hz for all the videos.
- Data splits. Please follow the train/test splits provided at https://github.com/epic-kitchens/epic-kitchens-100-annotations.
- Filter out silent clips. Because the action recognition splits are defined by action rather than audio, some video clips may be silent or contain no meaningful sound. We filter out some of these silent clips to obtain a better training set; please refer to ./code/script/filter_silent_clips.py. (Optional: you can use the newly released EPIC-SOUNDS dataset to obtain an audio-based training split.)
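The waveform extraction and silent-clip filtering steps above can be sketched as follows. This is a minimal illustration, not the repository's own script: the helper names, the mono down-mix, and the -50 dB threshold are assumptions, and the actual criterion in filter_silent_clips.py may differ. The first function only builds an ffmpeg command; the other two compute an RMS level on samples already scaled to [-1, 1].

```python
import math
from typing import List, Sequence


def extract_audio_cmd(video_path: str, start: str, stop: str, wav_path: str,
                      sample_rate: int = 11000) -> List[str]:
    """Build an ffmpeg command that trims [start, stop] and writes a WAV file."""
    return [
        "ffmpeg", "-y",
        "-ss", start,             # segment start, e.g. "00:05:26.32"
        "-to", stop,              # segment end, also given as an input option
        "-i", video_path,
        "-vn",                    # drop the video stream
        "-ac", "1",               # assumption: down-mix to mono
        "-ar", str(sample_rate),  # resample to 11000 Hz as in step b
        wav_path,
    ]


def rms_dbfs(samples: Sequence[float]) -> float:
    """RMS level in dB relative to full scale, for samples in [-1, 1]."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def is_silent(samples: Sequence[float], threshold_db: float = -50.0) -> bool:
    """Flag a clip as silent when its RMS level is below an assumed threshold."""
    return rms_dbfs(samples) < threshold_db
```

The command list can be passed to subprocess.run(cmd, check=True) once ffmpeg is installed. An energy threshold only catches near-silent clips; it cannot tell whether a sound is meaningful, which is where an audio-based split such as the optional one mentioned above may help.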
The annotations can be found at ./data/soundingobject.json. Each annotation has the following fields:
- video: the index used to locate the segment within a long video. For example, P04_105-00:05:26.32-00:05:28.01-16316-16400 encodes the video_id, start_timestamp, stop_timestamp, start_frame, and stop_frame from the test split csv file.
- frame: the exact frame index we used to annotate the sounding object.
- bbox: the relative coordinates of the bounding box, in [left, top, right, bottom] format.
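As an illustration of how these fields might be consumed, the sketch below splits the example key on hyphens and scales a relative box to pixels. Several things here are assumptions: the example key has five hyphen-separated parts, which the code reads as video_id, start_timestamp, stop_timestamp, start_frame, and stop_frame; "relative" is taken to mean values in [0, 1] as fractions of the frame width and height; and the entry, its frame and bbox values, and the 1920x1080 resolution are made up. The layout of soundingobject.json itself is not assumed.

```python
from typing import Dict, List, Tuple, Union


def parse_video_key(key: str) -> Dict[str, Union[str, int]]:
    """Split a key such as 'P04_105-00:05:26.32-00:05:28.01-16316-16400'."""
    video_id, start_ts, stop_ts, start_frame, stop_frame = key.split("-")
    return {
        "video_id": video_id,
        "start_timestamp": start_ts,
        "stop_timestamp": stop_ts,
        "start_frame": int(start_frame),
        "stop_frame": int(stop_frame),
    }


def bbox_to_pixels(bbox: List[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a relative [left, top, right, bottom] box to pixel coordinates."""
    left, top, right, bottom = bbox
    return (round(left * width), round(top * height),
            round(right * width), round(bottom * height))


# Hypothetical entry: field names follow the description above,
# but the frame and bbox values are invented for illustration.
entry = {
    "video": "P04_105-00:05:26.32-00:05:28.01-16316-16400",
    "frame": 16350,
    "bbox": [0.25, 0.5, 0.75, 1.0],
}
info = parse_video_key(entry["video"])
box = bbox_to_pixels(entry["bbox"], width=1920, height=1080)  # example resolution
```

Splitting on "-" works for this key because neither the video id nor the timestamps contain a hyphen; a key with a different layout would need a different parser.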
pip install -r requirements.txt
- Process videos and prepare the data.
a. Trim the video following https://github.com/epic-kitchens/epic-kitchens-100-annotations/blob/master/EPIC_100_train.csv and get the frames within [start_frame, stop_frame]. Store the data with the following directory structure:
folder_name (e.g., 'P01_01-00:00:00.14-00:00:03.37-8-202')
├── audio
│   └── P01_01-00:00:00.14-00:00:03.37-8-202.wav
└── rgb_frames
    ├── frame_0000000008.jpg
    ├── frame_0000000009.jpg
    ├── ...
    └── frame_0000000202.jpg
b. Create the index file train.csv. Each row stores the following information: participant_id, video_id, start_timestamp, stop_timestamp, start_frame, stop_frame, narration, folder_dir. Note that you can change the format and revise the dataloader accordingly. An example is given as follows:
participant_id, video_id, start_timestamp, stop_timestamp, start_frame, stop_frame, narration, folder_dir
P01, P01_01, 00:00:00.14, 00:00:03.37, 8, 202, open door, /YOUR_DIR/P01_01-00:00:00.14-00:00:03.37-8-202-open_door
- Train the localization model
bash ./scripts/train_localization.sh
- During training, checkpoints are saved to data/ckpt/MODEL_ID.
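The clip folder layout and index file from the data preparation step can be sketched as below. This is a rough illustration under stated assumptions, not the repository's dataloader code: the helper names are hypothetical, and the folder name follows the pattern in the directory tree example (video_id, timestamps, and frame range joined by hyphens), whereas the index row example also appends the narration to the folder name.

```python
import csv
import io
from typing import Dict, List

FIELDS = ["participant_id", "video_id", "start_timestamp", "stop_timestamp",
          "start_frame", "stop_frame", "narration", "folder_dir"]


def clip_folder_name(row: Dict[str, str]) -> str:
    """Folder name in the 'P01_01-00:00:00.14-00:00:03.37-8-202' pattern."""
    return "-".join([row["video_id"], row["start_timestamp"], row["stop_timestamp"],
                     str(row["start_frame"]), str(row["stop_frame"])])


def frame_paths(folder_dir: str, start_frame: int, stop_frame: int) -> List[str]:
    """Frame files under rgb_frames for start_frame..stop_frame, inclusive."""
    return [f"{folder_dir}/rgb_frames/frame_{i:010d}.jpg"
            for i in range(start_frame, stop_frame + 1)]


def write_index(rows: List[Dict[str, str]], root_dir: str, out_file) -> None:
    """Write one index row per clip, filling in folder_dir from the other fields."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        record = dict(row)
        record["folder_dir"] = f"{root_dir}/{clip_folder_name(row)}"
        writer.writerow(record)


# Example row taken from the index file description above.
example = {
    "participant_id": "P01", "video_id": "P01_01",
    "start_timestamp": "00:00:00.14", "stop_timestamp": "00:00:03.37",
    "start_frame": "8", "stop_frame": "202", "narration": "open door",
}
buffer = io.StringIO()  # stands in for open("train.csv", "w", newline="")
write_index([example], "/YOUR_DIR", buffer)
frames = frame_paths("/YOUR_DIR/" + clip_folder_name(example), 8, 202)
```

Deriving folder_dir from the other columns keeps the index and the directory names in sync; if you choose a different naming scheme, only clip_folder_name needs to change.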
If you find our work useful for your research, please consider citing our paper. 😄
@inproceedings{huang2023egocentric,
title={Egocentric Audio-Visual Object Localization},
author={Huang, Chao and Tian, Yapeng and Kumar, Anurag and Xu, Chenliang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={22910--22921},
year={2023}
}
We borrowed a lot of code from CCoL and CoSep, and we thank the authors for sharing their code. If you use our code, please also consider citing their work.