Epic Annotation Overview

Annotation data released under Project Epic. For annotation related to the CHIME grant, see https://github.com/Project-EPIC/chime-annotation. We're still working on the collected data; here's what is and what will be available:

Dataset	# of Tweets
Part-of-speech tagging for a variety of events	32,626
Named Entity Annotation for 10 different events	18,081
Behavioral Annotation (from Verma et al (ICWSM 2011)) for 3 events	1,500
Semantic role labeling (Gustav, Red River)	32,912 lines
Territory of Information/Evidentiality/Speech act annotation	500 tweets for 4 events

For the Named Entity and Behavioral Annotation, we provide only the annotations, not the original tweets, in an attempt to honor privacy concerns around potentially sensitive information. The original tweets can be accessed through Twitter, and we've included tools to facilitate this: please see the Epic Tweet Documentation.

Part of Speech Annotation

This annotation consists of simple part-of-speech tags for collections of tweets surrounding multiple events. The tweets were first tagged with an automatic POS tagger, and the output was then hand-corrected. The datasets we include, and the number of tweets in each, are as follows:

Dallas Tornado (2012) : 850
Haiti Earthquake : 487
Hurricane Gustav : 1,000
Highland Park Fire : 700
New Zealand Earthquake : 14,800
Oklahoma Fires : 449
Red River Floods (2009 and 2010): 14,340

Each event has a file, with each line containing a word and the corrected part of speech. Tweets are separated by blank lines.
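The file layout described above can be read with a few lines of Python. This is a minimal sketch, not a tool shipped with the repository: it assumes the word and its tag are separated by whitespace and that the tag is the last field on the line, neither of which is stated here. The function name `read_pos_tweets` and the sample lines are made up for illustration.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def read_pos_tweets(lines: Iterable[str]) -> List[List[Token]]:
    """Group word/tag lines into tweets, splitting on blank lines."""
    tweets: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current tweet, if there is one.
            if current:
                tweets.append(current)
                current = []
            continue
        # Assumption: the tag is the last whitespace-separated field.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            raise ValueError(f"expected 'word tag', got: {line!r}")
        current.append((parts[0], parts[1]))
    if current:
        tweets.append(current)
    return tweets


# Illustrative input only; these lines are not taken from the released files.
sample = "Thinking VBG\nof IN\nGustav NNP\n\nStay VB\nsafe JJ\n"
tweets = read_pos_tweets(sample.splitlines())
print(len(tweets), tweets[0])
```

To read a real event file, pass an open file handle instead of `sample.splitlines()`, and adjust the split if the files turn out to use a different delimiter.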

Named Entity Annotation

This annotation is based on the paper Foundations of a Multilayer Annotation Framework for Twitter. The authors describe the collection of tweets for five events, found by searching on certain hand-curated keywords. These were then filtered down into usable datasets. For a full description of the data collection process, see Anderson and Schram, 2009.

Based on these methods, named entities were tagged for the following events, listed with the number of tweets in each JSON file:

Colorado Wildfires (2012) : 741
Dallas Tornado (2012) : 475
Haiti Earthquake : 480
Highland Park Fire : 344
Hurricane Sandy : 716
Lower North Fork Fire : 239
New Mexico Fire : 122
New Zealand Earthquake : 1,227
Red River Flood (2009) : 12,885
Red River Flood (2010) : 450
Winter Storm Nemo : 402

Total : 18,081

Some of these datasets may not have been collected with accurate tweet IDs, and thus they may not be recoverable from the Twitter API. We are looking into possibilities for restoring accurate tweet IDs, or releasing the data with raw text.

These tweets are annotated with named entity tags based on the Automatic Content Extraction guidelines for entities. The tags annotated are:

PERSON
ARTIFACT
ORGANIZATION
LOCATION
FACILITY

These annotations are provided along with the span of text for the tweet annotated. For information on how to extract the original tweet texts, please see the Epic Tweet Documentation.

Semantic Role Labelling

Semantic Role Labelling involves annotation of the important semantic entities within a sentence and the syntactic relations between them. More generally, we aim to identify who did what to whom. The SRL data annotated for Project EPIC covers two events: Hurricane Gustav and the Red River floods. This data is based on PropBank annotation, and is presented in an Excel-style format. Each line contains a word, along with the word's index in the tweet, part of speech, dependency relation and semantic role. The semantic roles are in the final column: they indicate the verb that the word is an argument of (via its index), as well as the type of argument. These types are:

A0: ARG0
A1: ARG1
A2: ARG2
AM: Modifier - can be temporal (TMP), directional (DIR), and many others

For example, consider the following tweet:

Index	Word	Lemma	POS	-	Head	Dep. Relation	PB Verb	Semantic Role
1	Thinking	think	VBG	_	5	DEP	think.XX	_
2	of	of	IN	_	1	ADV	_	1:A1
3	Gustav	gustav	NNP	_	2	PMOD	_	_
4	.	.	.	_	1	P	_	_
5	May	may	MD	_	0	ROOT	_	7:AM-MOD
6	it	it	PRP	_	5	SBJ	_	7:A0
7	bring	bring	VB	_	5	VC	bring.XX	_
8	minimal	minimal	JJ	_	9	NMOD	_	_
9	damage	damage	NN	_	7	OBJ	_	7:A1
10	.	.	.	_	5	P	_	_

Here, the verbs are "think", indexed 1, and "bring", indexed 7. The phrase "of Gustav" is the ARG1 of "think", marked by the index of the verb on "of": 1:A1. "May" is a modal (MOD) modifier of "bring", marked 7:AM-MOD. The pronoun "it" is the ARG0 of bring (7:A0), and the phrase "minimal damage" is the ARG1 of bring (7:A1 on "damage").
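The role column in this example can be resolved back to its verb programmatically. The sketch below is an illustration, not part of the released tools. It assumes the columns are tab-separated in the order shown in the example header, that the header row itself is not passed in, and that each token carries at most one `index:ROLE` entry, as in this tweet. The names `COLUMNS`, `parse_srl_tweet`, and `extract_roles` are made up for this example.

```python
from typing import Dict, List, Tuple

# Column names follow the example header; "unused" is the "-" column.
COLUMNS = ["index", "word", "lemma", "pos", "unused",
           "head", "deprel", "pb_verb", "role"]


def parse_srl_tweet(rows: List[str]) -> List[Dict[str, str]]:
    """Turn tab-separated token rows into one dict per token."""
    return [dict(zip(COLUMNS, row.split("\t"))) for row in rows if row.strip()]


def extract_roles(tokens: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (verb lemma, role label, role-bearing word) triples."""
    by_index = {tok["index"]: tok for tok in tokens}
    triples = []
    for tok in tokens:
        if tok["role"] == "_":
            continue  # this token fills no semantic role
        # Assumption: a single "verbIndex:LABEL" entry per token.
        verb_index, label = tok["role"].split(":", 1)
        triples.append((by_index[verb_index]["lemma"], label, tok["word"]))
    return triples


# The example tweet from above, one tab-separated row per token.
tweet = [
    "1\tThinking\tthink\tVBG\t_\t5\tDEP\tthink.XX\t_",
    "2\tof\tof\tIN\t_\t1\tADV\t_\t1:A1",
    "3\tGustav\tgustav\tNNP\t_\t2\tPMOD\t_\t_",
    "4\t.\t.\t.\t_\t1\tP\t_\t_",
    "5\tMay\tmay\tMD\t_\t0\tROOT\t_\t7:AM-MOD",
    "6\tit\tit\tPRP\t_\t5\tSBJ\t_\t7:A0",
    "7\tbring\tbring\tVB\t_\t5\tVC\tbring.XX\t_",
    "8\tminimal\tminimal\tJJ\t_\t9\tNMOD\t_\t_",
    "9\tdamage\tdamage\tNN\t_\t7\tOBJ\t_\t7:A1",
    "10\t.\t.\t.\t_\t5\tP\t_\t_",
]
roles = extract_roles(parse_srl_tweet(tweet))
for verb, label, word in roles:
    print(f"{verb}: {label} -> {word}")
```

Each role is attached to a single word, so "of" stands in for "of Gustav" and "damage" for "minimal damage". Recovering the full phrase would mean following the Head column to collect each word's dependents.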

Behavioral Annotation

This data is based on the paper Natural Language Processing to the Rescue? Extracting “Situational Awareness” Tweets During Mass Emergency. The authors collected four datasets of 500 tweets each. These datasets overlap with the named entity annotation, and include the two Red River Floods (2009, 2010), the Oklahoma wildfire, and the Haiti Earthquake. These tweets were annotated with 'behavioral' categories:

Situational Awareness: Whether the tweet contributes to the user's awareness of the event
Subjectivity: Whether the tweet is objective or subjective
Linguistic Register: Whether the tweet is in a formal or informal register
Personal/impersonal: Whether the tweet is expressed from a personal standpoint or not

These categories are annotated at the tweet level: each tweet has four annotations, one for each of the above categories. As with the named entity data, we include only tweet IDs and annotations. Unfortunately, the original IDs for the Oklahoma data were not maintained, and this data is currently unavailable. We are looking into ways of releasing it publicly in a consistent and ethical fashion.

Territory of Information/Evidentiality/Speech act annotation

This data was collected for Will Corvey's dissertation. It contains territory of information, evidentiality, and speech act annotations for four different events: the Oklahoma fires, the Haiti earthquake, and the Red River flooding of 2009 and 2010.

For any questions, please contact
Kevin Stowe
[email protected]
