WikiDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types
This is the main page of the ACL 2022 paper: WikiDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types.
************************************* Updates *************************************
- 16/03/2022: We added a new version of the dataset (V2).
  - An annotator re-annotated the entire dataset based on the annotations in V1.
  - The Train and Valid splits were resampled to make their distributions more similar.
- 12/06/2022: We released more data:
  - Dataset with 10 retrieved candidates per mention.
  - The extracted Wikipedia data: entity2desc and entity2imgURLs.
  - The downloaded WikinewsImgs.
WikiDiverse is a high-quality, human-annotated multimodal entity linking (MEL) dataset with diversified contextual topics and entity types, built from Wikinews. It contains about 8K image-caption pairs and uses Wikipedia as the corresponding knowledge base.
- The annotated data: Google Drive
- The data with 10 retrieved candidates: Google Drive
- The Wikipedia data:
  - Entity2desc (filtered): Google Drive (delimited by '@@@@')
  - The original Wikipedia information: Quark Drive (TSV format, delimited by '@@@@')
  - The P(e|m) data: Google Drive
- The image files:
  - Wikinews: the downloaded and cleaned images are available on Google Drive
  - Wikipedia: the alignment between entity names and image URLs: Google Drive (delimited by '@@@@')
- The annotated data (passage level)
[
"The Lions versus the Packers (2007).",
"https://upload.wikimedia.org/wikipedia/commons/0/06/DetroitLionsRunningPlay-2007.jpg",
"sports",
[
[
"Lions",
"Organization",
4,
9,
"https://en.wikipedia.org/wiki/Detroit_Lions"
],
[
"Packers",
"Organization",
21,
28,
"https://en.wikipedia.org/wiki/Green_Bay_Packers"
]
]
]
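A minimal sketch of how a passage-level record might be read, assuming each record is a JSON list ordered as in the example above (caption, image URL, topic, mentions) and that the two integers are character offsets into the caption with an exclusive end. The helper name `iter_mentions` is hypothetical and not part of the released code.

```python
import json

# One passage-level record, copied from the example above.
record = json.loads("""
[
  "The Lions versus the Packers (2007).",
  "https://upload.wikimedia.org/wikipedia/commons/0/06/DetroitLionsRunningPlay-2007.jpg",
  "sports",
  [
    ["Lions", "Organization", 4, 9, "https://en.wikipedia.org/wiki/Detroit_Lions"],
    ["Packers", "Organization", 21, 28, "https://en.wikipedia.org/wiki/Green_Bay_Packers"]
  ]
]
""")


def iter_mentions(record):
    """Yield (surface, entity_type, start, end, entity_url) for each mention."""
    caption, img_url, topic, mentions = record
    for surface, entity_type, start, end, entity_url in mentions:
        # Assumption: offsets index characters of the caption, end exclusive.
        if caption[start:end] != surface:
            raise ValueError(f"offset mismatch for {surface!r}")
        yield surface, entity_type, start, end, entity_url


mentions = list(iter_mentions(record))
for surface, entity_type, start, end, entity_url in mentions:
    print(surface, entity_type, entity_url)
```

The offset check is only a sanity test on the assumed convention; it holds for the example record but has not been verified against the full dataset.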
- The data with cands (mention level)
[
"Bart writing \"HDTV is worth every cent\" in the \"chalkboard gag.\".", # sentence
"https://upload.wikimedia.org/wikinews/en/c/ca/Simpsons_new_title_sequence_screenshot.png", # img_url
"HDTV", # mention
"Other", # mention type
['bart', 'writing'], # list of left context
[' ', 'be', 'worth', 'every', 'cent', 'in', 'the', 'chalkboard', 'gag'], # list of right context
"https://en.wikipedia.org/wiki/High-definition_television", # entity url
[], # list of candidates
"entertainment", # topic
14, # mention start position
18, # mention end position
]
(To evaluate the recall of the candidates, please refer to eval_cands_recall.py)
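For orientation, candidate recall at k can be sketched as the fraction of mentions whose gold entity appears among the first k candidates. The snippet below is not the contents of eval_cands_recall.py; it assumes the field order of the mention-level example (index 6 is the entity URL, index 7 the candidate list) and further assumes that candidates are stored as entity URLs, which the example does not confirm. The toy records are made up.

```python
def recall_at_k(records, k=10):
    """Fraction of mentions whose gold entity is among the top-k candidates."""
    if not records:
        return 0.0
    hits = 0
    for record in records:
        gold_url = record[6]        # entity url
        candidates = record[7][:k]  # list of candidates (assumed to be URLs)
        if gold_url in candidates:
            hits += 1
    return hits / len(records)


# Toy records with made-up candidate lists; only fields 6 and 7 matter here.
toy = [
    ["s", "img", "HDTV", "Other", [], [],
     "https://en.wikipedia.org/wiki/High-definition_television",
     ["https://en.wikipedia.org/wiki/High-definition_television"],
     "entertainment", 14, 18],
    ["s", "img", "Bart", "Other", [], [],
     "https://en.wikipedia.org/wiki/Bart_Simpson",
     [],
     "entertainment", 0, 4],
]
print(recall_at_k(toy))
```

If the released candidates are entity names rather than URLs, the membership test would need to compare on the same key.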
- How to get the wikinews imgs
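import hashlib
import os
import re

img_dir = 'path to wikinewsImgs'  # directory holding the downloaded images
for item in data:
    # The file name is the last component of the image URL.
    m_img = item[1].split('/')[-1]
    # Local files are named <md5 of the original file name><extension>.
    prefix = hashlib.md5(m_img.encode()).hexdigest()
    suffix = re.sub(r'(\S+(?=\.(jpg|JPG|png|PNG|svg|SVG)))|(\S+(?=\.(jpeg|JPEG)))', '', m_img)
    m_img = os.path.join(img_dir, prefix + suffix)
    # SVG extensions are mapped to PNG.
    m_img = m_img.replace('.svg', '.png').replace('.SVG', '.png')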
- The original Wikipedia info (entity level)
[
# entity description with EL annotations
'The Ugaritic alphabet is a cuneiform{"title": "Cuneiform", "url": "https://en.wikipedia.org//wiki/Cuneiform", "text": "cuneiform"} abjad{"title": "Abjad", "url": "https://en.wikipedia.org//wiki/Abjad", "text": "abjad"} (consonantal alphabet) used from around either the fifteenth century BCE...',
# entity description without EL annotations
'The Ugaritic alphabet is a cuneiform abjad (consonantal alphabet) used from around either the fifteenth century BCE ....',
# main-image list, each item includes an image URL and the caption (may contain EL annotations).
'[["https://upload.wikimedia.org/wikipedia/commons/0/04/22_alphabet.jpg", "The Ugaritic Alphabet"], ["https://upload.wikimedia.org/wikipedia/en/8/8a/OOjs_UI_icon_edit-ltr-progressive.svg", ""]]',
# entity name
'Ugaritic alphabet',
# entity URL
'https://en.wikipedia.org/wiki/Ugaritic_alphabet',
# list of other images in the page, each item includes an image URL and the caption (may contain EL annotations).
[["https://upload.wikimedia.org/wikipedia/commons/b/b0/Literacy.PNG", "Dark green shows approximate spread of writing by 1300 BCE"], ["https://upload.wikimedia.org/wikipedia/commons/d/dd/Ugaritic-alphabet-chart.svg", "Ugaritic alphabet"]],
]
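The relation between the first two fields can be illustrated with a small sketch: removing the inline link objects from the annotated description should give the plain description. It assumes every annotation is a flat JSON object with the keys `title`, `url`, and `text`, placed directly after its anchor text as in the example above. The function name `split_annotations` is hypothetical.

```python
import json

_decoder = json.JSONDecoder()


def split_annotations(annotated):
    """Return (plain_text, links) from a description with inline JSON link objects."""
    plain_parts, links = [], []
    i = 0
    while i < len(annotated):
        j = annotated.find("{", i)
        if j == -1:
            plain_parts.append(annotated[i:])
            break
        plain_parts.append(annotated[i:j])
        try:
            # Try to parse one JSON value starting at the brace.
            obj, end = _decoder.raw_decode(annotated, j)
        except json.JSONDecodeError:
            obj, end = None, j + 1
        if isinstance(obj, dict) and {"title", "url", "text"} <= obj.keys():
            links.append(obj)
        else:
            # Not a link annotation: keep the characters as ordinary text.
            plain_parts.append(annotated[j:end])
        i = end
    return "".join(plain_parts), links


annotated = (
    'The Ugaritic alphabet is a cuneiform{"title": "Cuneiform", '
    '"url": "https://en.wikipedia.org//wiki/Cuneiform", "text": "cuneiform"} '
    'abjad{"title": "Abjad", "url": "https://en.wikipedia.org//wiki/Abjad", '
    '"text": "abjad"} (consonantal alphabet)'
)
plain, links = split_annotations(annotated)
print(plain)
print([link["title"] for link in links])
```

Parsing with `raw_decode` rather than a regular expression keeps braces that are part of ordinary text intact.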
### Statistics and Leaderboard
| | Train | Valid | Test |
|---|---|---|---|
| # instances (sentence level) | 6312 | 755 | 757 |
| # instances (mention level) | 13205 | 1552 | 1570 |
| # mentions per instance | 2.09 | 2.06 | 2.07 |
| R@10 of candidate retrieval | 88.62% | 89.17% | 88.01% |
| F1 of entity disambiguation | - | 74.19% | 73.34% |
Others to be added
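As a quick consistency check on the table, the mentions-per-instance row is the mention-level count divided by the sentence-level count for each split. The sketch below uses only the numbers reported above; the variable names are illustrative.

```python
# Counts copied from the statistics table above.
splits = {
    "Train": {"sentences": 6312, "mentions": 13205},
    "Valid": {"sentences": 755, "mentions": 1552},
    "Test": {"sentences": 757, "mentions": 1570},
}

mentions_per_instance = {
    name: round(c["mentions"] / c["sentences"], 2) for name, c in splits.items()
}
total_sentences = sum(c["sentences"] for c in splits.values())
total_mentions = sum(c["mentions"] for c in splits.values())

print(mentions_per_instance)
print(total_sentences, total_mentions)
```

The sentence-level total of 7824 matches the roughly 8K image-caption pairs mentioned in the overview.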
If you use WikiDiverse in your work, please cite our paper:
@inproceedings{wang2022wikidiverse,
title={WikiDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types},
author={Wang, Xuwu and Tian, Junfeng and Gui, Min and Li, Zhixu and Wang, Rui and Yan, Ming and Chen, Lihan and Xiao, Yanghua},
booktitle={ACL},
year={2022}
}
The WikiDiverse dataset is distributed under the CC BY-SA 4.0 license.