Annotated data produced for Project EPIC, including part-of-speech tagging, SRL, and CHIME annotation

Project-EPIC/epic-annotation

Epic Annotation Overview

Annotation data released under Project EPIC. For annotation related to the CHIME grant, see https://github.com/Project-EPIC/chime-annotation. We're still working on the collected data; here's what is available now and what will be available later:

| Dataset | # of Tweets |
| --- | --- |
| Part-of-speech tagging for a variety of events | 32,626 |
| Named Entity Annotation for 10 different events | 18,081 |
| Behavioral Annotation (from Verma et al. (ICWSM 2011)) for 3 events | 1,500 |
| Semantic role labeling (Gustav, Red River) | 32,912 lines |
| Territory of Information/Evidentiality/Speech act annotation | 500 tweets for 4 events |

For the Named Entity and Behavioral Annotation, we can provide only the annotations, not the original tweets, in order to honor privacy concerns about potentially sensitive information. The original tweets can be accessed through Twitter, and we've included tools to facilitate this: please see the Epic Tweet Documentation.

This annotation consists of simple part-of-speech tags for collections of tweets surrounding multiple events. The tweets were first tagged with an automatic POS tagger, and the output was then hand-corrected. The datasets we include, and the number of tweets for each, are as follows:

  • Dallas Tornado (2012) : 850
  • Haiti Earthquake : 487
  • Hurricane Gustav : 1,000
  • Highland Park Fire : 700
  • New Zealand Earthquake : 14,800
  • Oklahoma Fires : 449
  • Red River Floods (2009 and 2010): 14,340

Each event has a file, with each line containing a word and the corrected part of speech. Tweets are separated by blank lines.
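The layout described above (one word and tag per line, blank lines between tweets) can be read with a few lines of Python. This is only a minimal sketch: it assumes the word and its tag are separated by whitespace, which the description does not state, and the function name `read_pos_tweets` and the sample lines are hypothetical, not taken from the released files.

```python
from typing import Iterable, List, Tuple

def read_pos_tweets(lines: Iterable[str]) -> List[List[Tuple[str, str]]]:
    """Group (word, tag) pairs into tweets, using blank lines as separators.

    Assumption: each non-blank line is "<word><whitespace><tag>".
    """
    tweets: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current tweet, if there is one.
            if current:
                tweets.append(current)
                current = []
            continue
        parts = line.rsplit(None, 1)  # split on the last run of whitespace
        if len(parts) != 2:
            raise ValueError(f"expected 'word tag', got: {line!r}")
        current.append((parts[0], parts[1]))
    if current:
        tweets.append(current)  # the file may not end with a blank line
    return tweets

# Hypothetical sample in the assumed layout.
sample = """Thinking VBG
of IN
Gustav NNP

May MD
it PRP
""".splitlines()

tweets = read_pos_tweets(sample)
print(len(tweets), "tweets;", sum(len(t) for t in tweets), "tokens")
```

To read a real event file, pass an open file handle instead of `sample`; if the files turn out to use a different delimiter, only the `rsplit` call needs to change.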

This annotation is based on the paper Foundations of a Multilayer Annotation Framework for Twitter. The paper describes the collection of tweets for five events by searching on hand-curated keywords. These tweets were then filtered down into usable datasets. For a full description of the data collection process, see Anderson and Schram, 2009.

Based on these methods, named entities were tagged for the following events, listed with the number of tweets in each JSON file:

  • Colorado Wildfires (2012) : 741
  • Dallas Tornado (2012) : 475
  • Haiti Earthquake : 480
  • Highland Park Fire : 344
  • Hurricane Sandy : 716
  • Lower North Fork Fire : 239
  • New Mexico Fire : 122
  • New Zealand Earthquake : 1,227
  • Red River Flood (2009) : 12,885
  • Red River Flood (2010) : 450
  • Winter Storm Nemo : 402

Total : 18,081

Some of these datasets may not have been collected with accurate Tweet IDs, and thus may not be recoverable from the Twitter API. We are looking into possibilities for restoring accurate Tweet IDs or releasing the data with raw text.

These tweets are annotated with named entity tags based on the Automatic Content Extraction guidelines for entities. The tags annotated are:

  • PERSON
  • ARTIFACT
  • ORGANIZATION
  • LOCATION
  • FACILITY

These annotations are provided along with the span of text annotated in each tweet. For information on how to extract the original tweet texts, please see the Epic Tweet Documentation.

Semantic Role Labelling involves annotation of the important semantic entities within a sentence and the relations between them. More generally, we aim to identify who did what to whom. The SRL data annotated for Project EPIC covers two events: Hurricane Gustav and the Red River floods. This data is based on PropBank annotation and is presented in an Excel-style format. Each line contains a word, along with the word's index in the tweet, part of speech, dependency relation, and semantic role. The semantic roles are in the final column: they indicate the verb that the word is a role of (via its index), as well as the type of argument. These types are:

  • A0: ARG0
  • A1: ARG1
  • A2: ARG2
  • AM: Modifier - can be temporal (TMP), directional (DIR), and many others

For example, consider the following tweet:

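| Index | Word | Lemma | POS | - | Head | Dep. Relation | PB Verb | Semantic Role |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Thinking | think | VBG | _ | 5 | DEP | think.XX | _ |
| 2 | of | of | IN | _ | 1 | ADV | _ | 1:A1 |
| 3 | Gustav | gustav | NNP | _ | 2 | PMOD | _ | _ |
| 4 | . | . | . | _ | 1 | P | _ | _ |
| 5 | May | may | MD | _ | 0 | ROOT | _ | 7:AM-MOD |
| 6 | it | it | PRP | _ | 5 | SBJ | _ | 7:A0 |
| 7 | bring | bring | VB | _ | 5 | VC | bring.XX | _ |
| 8 | minimal | minimal | JJ | _ | 9 | NMOD | _ | _ |
| 9 | damage | damage | NN | _ | 7 | OBJ | _ | 7:A1 |
| 10 | . | . | . | _ | 5 | P | _ | _ |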

Here, the verbs are "think", indexed 1, and "bring", indexed 7. The phrase "of Gustav" is the ARG1 of "think", marked by the index of the verb on "of": 1:A1. "May" is a modal (MOD) modifier of "bring", marked 7:AM-MOD. The pronoun "it" is the ARG0 of bring (7:A0), and the phrase "minimal damage" is the ARG1 of bring (7:A1 on "damage").
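The reading of the example above can be reproduced programmatically. The following is a rough sketch, not an official loader: it assumes each line holds nine whitespace-separated columns in the order shown in the example (index, word, lemma, POS, an unused column, head, dependency relation, PropBank verb, semantic role), that `_` marks an empty cell, and that each word carries at most one role. The function name `extract_frames` is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# The example tweet, written out in the assumed nine-column layout.
SAMPLE = """\
1 Thinking think VBG _ 5 DEP think.XX _
2 of of IN _ 1 ADV _ 1:A1
3 Gustav gustav NNP _ 2 PMOD _ _
4 . . . _ 1 P _ _
5 May may MD _ 0 ROOT _ 7:AM-MOD
6 it it PRP _ 5 SBJ _ 7:A0
7 bring bring VB _ 5 VC bring.XX _
8 minimal minimal JJ _ 9 NMOD _ _
9 damage damage NN _ 7 OBJ _ 7:A1
10 . . . _ 5 P _ _
""".splitlines()

Frame = Tuple[str, List[Tuple[str, str]]]

def extract_frames(rows: Iterable[str]) -> Dict[int, Frame]:
    """Map each verb index to (PropBank verb, [(role label, head word), ...])."""
    tokens = [line.split() for line in rows if line.strip()]
    # Verbs are the rows whose PB Verb column (8th) is filled in.
    frames: Dict[int, Frame] = {
        int(t[0]): (t[7], []) for t in tokens if t[7] != "_"
    }
    for t in tokens:
        role = t[8]
        if role == "_":
            continue
        # A role such as "7:AM-MOD" points at verb 7 with label "AM-MOD".
        verb_index, label = role.split(":", 1)
        frame = frames.get(int(verb_index))
        if frame is not None:
            frame[1].append((label, t[1]))
    return frames

for index, (verb, args) in extract_frames(SAMPLE).items():
    print(index, verb, args)
```

Note that the role label sits on the head word of each phrase ("of" for "of Gustav", "damage" for "minimal damage"), so recovering the full argument span would additionally require following the Head column.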

This data is based on the paper Natural Language Processing to the Rescue? Extracting “Situational Awareness” Tweets During Mass Emergency. The authors collected four datasets of 500 tweets each. These datasets overlap with the named entity annotation, and include the two Red River Floods (2009, 2010), the Oklahoma wildfire, and the Haiti Earthquake. These tweets were annotated with 'behavioral' categories:

  • Situational Awareness: Whether the tweet contributes to the user's awareness of the event
  • Subjectivity: Whether the tweet is objective or subjective
  • Linguistic Register: Whether the tweet is in a formal or informal register
  • Personal/impersonal: Whether the tweet is expressed from a personal standpoint or not

These categories are annotated at the tweet level: each tweet has four annotations, one for each of the above categories. Like the named entity data, we include only tweet IDs and annotations. Unfortunately, the original IDs for the Oklahoma data were not maintained, and this data is currently unavailable. We are looking into ways of releasing it publicly in a consistent and ethical fashion.

This data was collected for Will Corvey's dissertation. It contains territory of information, evidentiality, and speech act annotations for four different events: the Oklahoma fires, the Haiti earthquake, and the Red River flooding of 2009 and 2010.

For any questions, please contact
Kevin Stowe
[email protected]
