The Genshin Impact Dataset (GID) is collected from the Genshin Impact game[1] for visual SLAM. It currently consists of 60 individual sequences (over 3 hours in total) and covers a wide range of scenes that are rare, hard, or dangerous to capture by field collection in the real world (such as dull deserts, dim caves, and lush jungles). It provides great opportunities for SLAM evaluation and benchmarking. Moreover, it includes a large number of visual challenges (such as low-illumination and low-texture scenes) for testing the robustness of SLAM algorithms. It is part of our work How Challenging is a Challenge? CEMS: a Challenge Evaluation Module for SLAM Visual Perception.
If you use any resource from this dataset, please cite the paper as:
BibTeX
@article{Zhao2024CEMS,
title={How Challenging is a Challenge? CEMS: a Challenge Evaluation Module for SLAM Visual Perception},
author={Xuhui Zhao and Zhi Gao and Hao Li and Hong Ji and Hong Yang and Chenyang Li and Hao Fang and Ben M. Chen},
journal={Journal of Intelligent \& Robotic Systems},
year={2024},
volume={110},
number={42},
pages={1--19},
doi={10.1007/s10846-024-02077-4}
}
APA
Zhao, X., Gao, Z., Li, H., Ji, H., Yang, H., Li, C., Fang, H., & Chen, B. M. (2024). How Challenging is a Challenge? CEMS: a Challenge Evaluation Module for SLAM Visual Perception. Journal of Intelligent & Robotic Systems, 110(42), 1–19. https://doi.org/10.1007/s10846-024-02077-4
The dataset is generally composed of two parts: sequences (blue part) and support files (orange part), as the following figure shows.
In the sequences part, each sequence contains several files for convenience of use. We take Seq-001 as an example and elaborate below.
- Seq-001.mp4: The recorded video from the Genshin Impact game, which can be further processed according to different needs. It has a resolution of 1436 (width) × 996 (height) at 30 FPS.
- Seq-001.png: A content preview of the recorded video for a quick overview without playing it. It summarizes the resolution (width × height), duration (sec), FPS, and total number of frames.
- Frames-Sparse: A folder storing frames split from the recorded video. For the convenience of end users, we split the whole video in advance with a frame interval of 10 (extracting 1 frame every 10 frames).
- Groundtruth-EuRoC.txt: For the convenience of users, we provide groundtruth poses of the split frames in both EuRoC and TUM formats. This file records poses in the EuRoC[2] format: timestamp[ns], pos_x[m], pos_y[m], pos_z[m], quat_w, quat_x, quat_y, quat_z
- Groundtruth-TUM.txt: This file records poses in the TUM[3] format: timestamp[s] pos_x[m] pos_y[m] pos_z[m] quat_x quat_y quat_z quat_w
- Timestamps.txt: This file stores the corresponding timestamps of the split frames in the Frames-Sparse folder. The time unit is nanoseconds (10⁻⁹ s); see the loading sketch after this list.
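Both groundtruth files are plain text and straightforward to parse. The following is a minimal loading sketch with NumPy (the file paths are illustrative, following the Seq-001 example above):

```python
# Minimal sketch: load the TUM-format groundtruth and the frame timestamps with NumPy.
import numpy as np

# Each row: timestamp[s] pos_x pos_y pos_z quat_x quat_y quat_z quat_w
gt = np.loadtxt("Seq-001/Groundtruth-TUM.txt", comments="#")
timestamps_s = gt[:, 0]     # frame timestamps [s]
positions = gt[:, 1:4]      # camera positions [m]
quaternions = gt[:, 4:8]    # orientations as (qx, qy, qz, qw)

# Timestamps.txt stores nanoseconds; convert to seconds to match the TUM file.
frame_times_s = np.loadtxt("Seq-001/Timestamps.txt") * 1e-9
```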
The support files part contains the camera intrinsics and tool scripts.
- Intrinsics.yaml: This file records the focal lengths (fx and fy) and the principal point (cx and cy) of the pinhole camera model we use. It is organized in the standard YAML format, which makes data input and output easy.
- tool-splitVideo.py: This Python script splits the original video into separate frames according to user settings. Its only launch parameter is the path of the video you want to process; the other parameters are set interactively. All interactive parameters are summarized below (an illustrative sketch of the same functionality follows this list):
  - Clipping start time: start timestamp of clipping, unit: second, default: 0 s
  - Clipping end time: end timestamp of clipping, unit: second, default: the end of the whole video
  - Sampling interval N: sample one frame every N frames, default: output every frame
  - Scale for output frames: scale factor for output frame images, default: 1 (original size)
  - Type for output frames: file type for output frame images, default: .jpg
  - Name format for frames: naming format for output frame images, chosen from Timestamp format (12 digits representing the timestamp in nanoseconds) and Frame index format (4 digits representing the frame index in the original video), default: Timestamp format
- tool-resizeFrames.py: This Python script resizes existing frame images. It requires three launch parameters:
  - Search folder: the folder containing the frames to be processed
  - Image type: the file type of the images in the folder
  - Scale: the scale factor for resizing
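To make the splitting behavior concrete, here is an independent sketch of the same idea with OpenCV. It is not the bundled tool-splitVideo.py; the function name, defaults, and paths are chosen purely for illustration.

```python
# Illustrative sketch (not the bundled tool-splitVideo.py): sample one frame every
# N frames, optionally rescale it, and name it with a 12-digit nanosecond timestamp.
import os
import cv2

def split_video(video_path, out_dir, interval=10, scale=1.0, ext=".jpg"):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % interval == 0:
            if scale != 1.0:
                frame = cv2.resize(frame, None, fx=scale, fy=scale)
            t_ns = int(round(index / fps * 1e9))  # frame timestamp in nanoseconds
            cv2.imwrite(os.path.join(out_dir, f"{t_ns:012d}{ext}"), frame)
        index += 1
    cap.release()

split_video("Seq-001.mp4", "Frames-Sparse", interval=10)
```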
We collect sequences at different places in the Genshin Impact game to cover as wide a range of scenes as possible. Generally, each country in the game (Mondstadt, Liyue, Inazuma, and Sumeru) has 15 sequences to reflect its unique features. More specifically, the sequences are distributed as follows:
- Sequences 1–15 are collected in Mondstadt
- Sequences 16–30 are collected in Liyue
- Sequences 31–45 are collected in Inazuma
- Sequences 46–60 are collected in Sumeru
The following figure shows the distribution of sequences in different regions. You may click the figure and zoom in to see the details since the world map is very large.
Benefiting from the large and diverse game world, the sequences in GID also have great diversity, which we summarize in the following aspects.
Scene The dataset involves a wide range of scenes, including deserts, caves, jungles, and so on. The following figure shows some types of scenes. For example, users can test the robustness of their SLAM systems to low-light conditions in the dim cave scenes.
Time The sequences in GID generally cover a whole day, from morning to afternoon and night. This potentially enables experiments for SLAM in changing illumination conditions. The following figure shows the coverage of a whole day.
Weather The dataset includes various weather conditions, such as clear, cloudy, and rainy scenes. The following figure shows some examples of different weather conditions.
Visual Challenges for SLAM The dataset contains various visual challenges for SLAM algorithms, such as low-light and low-texture scenes. Sequences with these challenges may boost the development and benchmarking of visual SLAM in challenging environments. The following figure shows some representative challenges in the dataset.
Duration The sequences cover a wide range of durations, from 59 seconds (Seq-042) to 333 seconds (Seq-049 & Seq-058), which makes it possible to test the scalability of SLAM. The following figure shows the distribution of sequence durations.
We upload all 60 sequences and provide two ways to download the dataset: Google Drive and Baidu Netdisk. You can click Google Drive or Baidu Netdisk to download the whole dataset (about 22 GB in total), depending on your network environment. Alternatively, you can download individual sequences by clicking the corresponding links in the following table.
Seq. No | Region | Duration (sec) | Preview | Google Drive | Baidu Netdisk |
---|---|---|---|---|---|
Seq-001 | Mondstadt | 102 | Link | Link | |
Seq-002 | Mondstadt | 280 | Link | Link | |
Seq-003 | Mondstadt | 170 | Link | Link | |
Seq-004 | Mondstadt | 120 | Link | Link | |
Seq-005 | Mondstadt | 177 | Link | Link | |
Seq-006 | Mondstadt | 142 | Link | Link | |
Seq-007 | Mondstadt | 140 | Link | Link | |
Seq-008 | Mondstadt | 130 | Link | Link | |
Seq-009 | Mondstadt | 129 | Link | Link | |
Seq-010 | Mondstadt | 182 | Link | Link | |
Seq-011 | Mondstadt | 209 | Link | Link | |
Seq-012 | Mondstadt | 231 | Link | Link | |
Seq-013 | Mondstadt | 123 | Link | Link | |
Seq-014 | Mondstadt | 150 | Link | Link | |
Seq-015 | Mondstadt | 293 | Link | Link | |
Seq-016 | Liyue | 294 | Link | Link | |
Seq-017 | Liyue | 191 | Link | Link | |
Seq-018 | Liyue | 288 | Link | Link | |
Seq-019 | Liyue | 175 | Link | Link | |
Seq-020 | Liyue | 177 | Link | Link | |
Seq-021 | Liyue | 322 | Link | Link | |
Seq-022 | Liyue | 238 | Link | Link | |
Seq-023 | Liyue | 158 | Link | Link | |
Seq-024 | Liyue | 163 | Link | Link | |
Seq-025 | Liyue | 241 | Link | Link | |
Seq-026 | Liyue | 326 | Link | Link | |
Seq-027 | Liyue | 257 | Link | Link | |
Seq-028 | Liyue | 104 | Link | Link | |
Seq-029 | Liyue | 286 | Link | Link | |
Seq-030 | Liyue | 269 | Link | Link | |
Seq-031 | Inazuma | 172 | Link | Link | |
Seq-032 | Inazuma | 110 | Link | Link | |
Seq-033 | Inazuma | 249 | Link | Link | |
Seq-034 | Inazuma | 77 | Link | Link | |
Seq-035 | Inazuma | 268 | Link | Link | |
Seq-036 | Inazuma | 235 | Link | Link | |
Seq-037 | Inazuma | 152 | Link | Link | |
Seq-038 | Inazuma | 252 | Link | Link | |
Seq-039 | Inazuma | 231 | Link | Link | |
Seq-040 | Inazuma | 98 | Link | Link | |
Seq-041 | Inazuma | 129 | Link | Link | |
Seq-042 | Inazuma | 59 | Link | Link | |
Seq-043 | Inazuma | 133 | Link | Link | |
Seq-044 | Inazuma | 155 | Link | Link | |
Seq-045 | Inazuma | 64 | Link | Link | |
Seq-046 | Sumeru | 72 | Link | Link | |
Seq-047 | Sumeru | 191 | Link | Link | |
Seq-048 | Sumeru | 208 | Link | Link | |
Seq-049 | Sumeru | 333 | Link | Link | |
Seq-050 | Sumeru | 219 | Link | Link | |
Seq-051 | Sumeru | 146 | Link | Link | |
Seq-052 | Sumeru | 237 | Link | Link | |
Seq-053 | Sumeru | 147 | Link | Link | |
Seq-054 | Sumeru | 213 | Link | Link | |
Seq-055 | Sumeru | 79 | Link | Link | |
Seq-056 | Sumeru | 186 | Link | Link | |
Seq-057 | Sumeru | 150 | Link | Link | |
Seq-058 | Sumeru | 333 | Link | Link | |
Seq-059 | Sumeru | 200 | Link | Link | |
Seq-060 | Sumeru | 190 | Link | Link |
All the sequences are collected with fixed and consistent camera settings. The computer used for data collection is equipped with an Intel Core i9-9900K CPU, 64GB RAM, and an NVIDIA Titan RTX GPU. We first record videos from the Genshin Impact game, where the videos are saved in .mkv
format. The original resolution of the recorded video is 1920 (width) × 1200 (height) at 30 FPS, as the following figure shows.
Then, we write Python scripts to split the recorded videos into frames and save them in .jpg
format, where we sample 1 frame every 10 frames. Moreover, we simultaneously crop the frame images to 1436 × 996 to remove unrelated parts of the original videos. The following figure shows the cropped output frames of the Seq-046 sequence.
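As a rough illustration of the crop step, the sketch below assumes a centered 1436 × 996 window; the exact crop offsets used for the dataset are not specified here, so treat them as placeholders.

```python
# Hypothetical crop sketch: a centered 1436 x 996 window from a 1920 x 1200 frame.
import cv2

frame = cv2.imread("raw_frame.jpg")    # a 1920 x 1200 frame from the .mkv recording
h, w = 996, 1436
y0 = (frame.shape[0] - h) // 2         # 102 for a centered crop (assumed offset)
x0 = (frame.shape[1] - w) // 2         # 242 for a centered crop (assumed offset)
cv2.imwrite("cropped_frame.jpg", frame[y0:y0 + h, x0:x0 + w])
```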
To obtain precise camera poses, we use the ColMap software[4] for groundtruth estimation and 3D reconstruction. We input all the frames of a sequence into ColMap and obtain the camera poses and 3D points. We use the "automatic reconstruction" mode with the following parameters:
- Data type: Video frames
- Quality: Medium
- Shared intrinsics: Yes
- Sparse model: Yes
- Dense model: Yes
For the other parameters, we keep ColMap's defaults. The following figure shows the estimated camera poses and point cloud of the Seq-046 sequence in ColMap.
We can also visualize reconstructed 3D meshes with MeshLab[5] software, as the following figure shows.
After reconstruction, we export the estimated poses and trajectory from ColMap to an images.txt file, which contains the estimated camera poses. We then write Python scripts to convert the images.txt file into the aforementioned standard TUM and EuRoC formats. Moreover, we export the estimated camera intrinsics from ColMap to a cameras.txt file.
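Such a conversion can be sketched as follows (an independent illustration, not our exact script). It assumes frames are named by their 12-digit nanosecond timestamps and uses SciPy for the quaternion math; note that images.txt stores world-to-camera poses, which must be inverted into the camera-to-world poses used by the TUM format.

```python
# Illustrative converter: COLMAP-style images.txt -> TUM trajectory
# (not the exact script used for the dataset).
import numpy as np
from scipy.spatial.transform import Rotation

def colmap_images_to_tum(images_txt, out_txt):
    with open(images_txt) as f:
        lines = [l.strip() for l in f if l.strip() and not l.startswith("#")]
    rows = []
    # images.txt alternates a pose line with a 2D-point line; keep only the pose lines.
    for pose_line in lines[::2]:
        elems = pose_line.split()
        qw, qx, qy, qz = map(float, elems[1:5])        # world-to-camera rotation
        t = np.array([float(v) for v in elems[5:8]])   # world-to-camera translation
        name = elems[9]                                # e.g. 003333333333.jpg (assumed naming)
        R_wc = Rotation.from_quat([qx, qy, qz, qw]).inv()  # camera-to-world rotation
        center = -R_wc.apply(t)                            # camera position in world frame
        t_sec = int(name.split(".")[0]) * 1e-9             # timestamp from the file name
        ox, oy, oz, ow = R_wc.as_quat()
        rows.append((t_sec, *center, ox, oy, oz, ow))
    rows.sort()
    with open(out_txt, "w") as f:
        for r in rows:
            f.write(" ".join(f"{v:.9f}" for v in r) + "\n")

colmap_images_to_tum("images.txt", "Groundtruth-TUM.txt")
```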
The following figure briefly demonstrates the performance of ORB-SLAM2[6] (monocular), a classic and mature visual SLAM system, on our dataset. For the best understanding, you may click here to download and view the whole test video (50 s).
Generally, ORB-SLAM2 performs well in various scenes, even in some challenging ones, demonstrating that our dataset is suitable for running SLAM algorithms. For example, we compare the trajectory estimated for Seq-060 with the groundtruth poses using the EVO tool[7], as the following figure shows.
After scale and trajectory alignment, it can be seen that the estimated poses are generally consistent with the groundtruth. On the one hand, this demonstrates the feasibility of our dataset; on the other hand, it shows the high accuracy of the groundtruth estimated by ColMap.
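For reference, this scale-aware alignment and ATE computation can be reproduced with the EVO tool's Python API, as in the minimal sketch below (file names are placeholders; the equivalent evo_ape command with the -as flags performs the same alignment).

```python
# Minimal ATE evaluation sketch with the evo package (pip install evo);
# file names are placeholders.
from evo.core import metrics, sync
from evo.tools import file_interface

traj_ref = file_interface.read_tum_trajectory_file("Groundtruth-TUM.txt")
traj_est = file_interface.read_tum_trajectory_file("orbslam2_trajectory_tum.txt")

# Associate poses by timestamp, then align with scale correction (Sim(3)),
# since the groundtruth carries no absolute scale.
traj_ref, traj_est = sync.associate_trajectories(traj_ref, traj_est, max_diff=0.02)
traj_est.align(traj_ref, correct_scale=True)

ape = metrics.APE(metrics.PoseRelation.translation_part)
ape.process_data((traj_ref, traj_est))
print("ATE RMSE [m]:", ape.get_statistic(metrics.StatisticsType.rmse))
```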
Q1: What are the advantages of this dataset compared with field-collected sequences and sequences from simulation platforms?
Answer:
- Compared with field-collected sequences, our dataset contains more diverse scenes for SLAM tests. Moreover, many scenes in the dataset would be difficult or dangerous to capture in the real world, such as the deserts, caves, and snowy mountains.
- Compared with sequences collected in simulation environments, the proposed dataset has the following advantages.
  - The scenes in the Genshin Impact game are exquisite and beautiful. Few simulation platforms (such as Gazebo[8] and XTDrone[9]) provide such visual quality. Some sophisticated platforms (such as AirSim[10] and NVIDIA Omniverse[11]) may provide high quality, but they are usually difficult to get started with and to build your own world in.
  - It is time-consuming and laborious to build a high-quality scene in simulation software from scratch, especially a large one. In contrast, we can directly use the scenes already built in the game and collect sequences there, which is more efficient.
  - Existing simulation platforms struggle to simulate the photorealistic visual challenges we want for SLAM tests. For example, XTDrone typically cannot simulate different weather conditions, whereas we can easily record sequences containing photorealistic weather changes in the game, such as sunny, rainy, snowy, and foggy conditions.
Q2: How are the groundtruth poses estimated? What about their accuracy? How do you guarantee their reliability?
Answer:
- As mentioned before, we use the ColMap software for groundtruth pose estimation; it is a popular and mature tool for 3D reconstruction. We use the "automatic reconstruction" mode with medium quality to obtain the groundtruth poses. The estimated poses are generally accurate.
- Since we do not have the true camera poses, we evaluate the accuracy of the estimated groundtruth with the reprojection error, which is automatically calculated by ColMap. The reprojection error indicates the average distance between reprojected 3D points and the corresponding 2D points in the image. The following figure shows the reprojection error of each sequence in the dataset. The overall average over all sequences is 0.88 pixels (less than 1 pixel), which is very small.
- Since we cannot obtain the true groundtruth poses, we focus more on the consistency between the estimated trajectory and the reconstructed 3D points. We consider that if this consistency is high, the estimated trajectory is accurate. Of course, this is not absolute, and the estimated groundtruth may still contain errors. We will continue to explore and adopt more accurate methods for groundtruth estimation.
- Moreover, note that the estimated trajectory has no absolute scale due to scale ambiguity, so the groundtruth trajectory does not carry absolute scale information. Therefore, remember to perform scale alignment before evaluating the trajectories estimated by your SLAM. The scales of different sequences are not comparable.
- Step 1: Download the sequences and tools you need using the provided links.
- Step 2 (optional): Resample the downloaded video with the provided Python script according to your needs.
- Step 3: Run the visual odometry or SLAM algorithm of interest and save the estimated trajectory to a file.
- Step 4: Evaluate the performance of your algorithm against the provided groundtruth poses with tools such as EVO.
- [1] https://genshin.hoyoverse.com
- [2] https://projects.asl.ethz.ch/datasets/doku.php?id=kmavvisualinertialdatasets
- [3] https://cvg.cit.tum.de/data/datasets/rgbd-dataset
- [4] https://colmap.github.io
- [5] https://www.meshlab.net
- [6] https://github.com/raulmur/ORB_SLAM2
- [7] https://github.com/MichaelGrupp/evo
- [8] https://gazebosim.org
- [9] https://github.com/robin-shaun/XTDrone
- [10] https://microsoft.github.io/AirSim
- [11] https://developer.nvidia.com/omniverse