Developing Resources
Author: Jannik Strötgen
eMail: [email protected]
The details of HeidelTime are described in the paper "Multilingual and Cross-domain Temporal Tagging", which is published and available online. This page gives a short description; for details, please refer to the paper.
HeidelTime is a rule-based system. Because its architecture strictly separates the source code from the resources (patterns, normalization information, and rules), one can develop resources for additional languages using HeidelTime's well-defined rule syntax. This rule syntax is described on this page.
HeidelTime 2.0 contains manually created resources for 13 languages and automatically created resources for more than 200 languages. Note that the automatically created resources result in lower extraction and normalization quality than manually developed ones. However, for many languages, temporal tagging has never been addressed before, and thus HeidelTime can be considered a baseline temporal tagger for all these languages. If you want to build HeidelTime resources for a language that has not yet been addressed manually (e.g., Swedish), you can use the automatically created resources (auto-swedish) as a starting point.
In HeidelTime 1.9, resource loading has been changed so that the previously necessary update of working resources via the printResourceInformation scripts is no longer needed.
By default, resources are looked up in the following order (in descending priority, i.e., locations listed first take precedence over those listed later).
HeidelTime Kit:
- heideltime-kit/resources/{german,english,...}
- heideltime-kit/class/{german,english,...}
HeidelTime Standalone:
- heideltime-standalone/{german,english,...}
- heideltime-standalone/de.unihd.dbs.heideltime.standalone.jar/{german,english,...} (inside the .jar file)
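The priority order above can be pictured as a "first existing location wins" search. The following is only a minimal sketch of that idea, not HeidelTime's loader: the function name find_resource_dir is hypothetical, it assumes the lookup is decided per language directory, and it ignores the case of resources packed inside the .jar file.

```python
import tempfile
from pathlib import Path

def find_resource_dir(language, search_roots):
    """Return the first existing <root>/<language> directory, or None.

    search_roots is ordered by descending priority (hypothetical helper).
    """
    for root in search_roots:
        candidate = Path(root) / language
        if candidate.is_dir():
            return candidate
    return None

# Demo on a throwaway directory tree that mimics the HeidelTime Kit layout.
with tempfile.TemporaryDirectory() as tmp:
    resources = Path(tmp) / "heideltime-kit" / "resources"
    classes = Path(tmp) / "heideltime-kit" / "class"
    (resources / "german").mkdir(parents=True)
    (classes / "german").mkdir(parents=True)
    (classes / "english").mkdir(parents=True)

    roots = [resources, classes]  # higher priority first
    german_from = find_resource_dir("german", roots).parent.name
    english_from = find_resource_dir("english", roots).parent.name
    swedish_dir = find_resource_dir("swedish", roots)

print(german_from)   # resources
print(english_from)  # class
print(swedish_dir)   # None
```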
HeidelTime contains three different types of resources.
- Pattern Resources: Pattern resources are used to create regular expressions, which can be accessed by any rule. This makes it possible to use category names (e.g., "month") instead of listing all items every time the category is needed in a rule.
- Normalization Resources: Normalization resources contain normalized values of expressions included in the pattern resources. They correspond to the ISO format for temporal information.
- Rule Resources: Rule resources contain rules to identify and normalize temporal expressions.
In all resource files of all types, you can add comments using //. All lines starting with // are ignored and not interpreted by HeidelTime's resource parser.
The resources are organized in a directory structure. For every language, three directories are used representing the three resource types. Thus, the resources are organized in the following directory structure:
+ heideltime-kit
+ resources
+ english
| + normalization
| + repattern
| + rules
+ german
| + normalization
| + repattern
| + rules
+ dutch
...
And so on. If you want to develop resources for an additional language, create the same directory structure for the new language and copy all files of an existing language into the directories of the new language. Then, adapt all the pattern, normalization, and rule files.
The pattern resources contain one expression per line. Assume the resource reMonthLong contains only the following three lines:
January
February
March
HeidelTime's resource interpreter reads the pattern resources and creates a regular expression for every pattern resource file. In this example, the expression looks like this:
reMonthLong="(January|February|March)"
The pattern resources can be accessed in the rules as described below.
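To make the step from file lines to a regular expression concrete, here is a small Python sketch. It is an illustration only, not HeidelTime's parser: the function name build_pattern is hypothetical, and it assumes that each line is used as a regular-expression fragment as-is (no escaping) and that // comment lines and blank lines are skipped.

```python
import re

def build_pattern(lines):
    """Join the entries of a pattern resource into one alternation group.

    Lines starting with // and empty lines are skipped (assumption based on
    the comment rule described for all resource files).
    """
    items = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("//"):
            items.append(line)
    return "(" + "|".join(items) + ")"

resource_lines = ["// long month names", "January", "February", "March"]
re_month_long = build_pattern(resource_lines)
print(re_month_long)  # (January|February|March)

match = re.search(re_month_long, "Due in February.")
print(match.group(1))  # February
```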
The normalization resources contain normalized values of expressions included in the pattern resources. E.g., for the pattern resource reMonthLong, there is a normalization resource called normMonth. Note that not every pattern resource needs its own normalization resource.
"January","01"
"February","02"
"March","03"
HeidelTime's resource interpreter reads the normalization resources and creates HashMaps for every file:
normMonth("January")="01"
The normalization resources can be accessed in the rules as described below.
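The mapping from normalization lines to a lookup table can be sketched in Python with a dict standing in for the HashMap. This is an assumption-laden illustration: load_normalization is a hypothetical name, and the sketch only handles lines of the exact form "key","value" shown above.

```python
import re

# One entry per line, in the form "expression","normalized value".
ENTRY = re.compile(r'^"(.*?)","(.*?)"$')

def load_normalization(lines):
    """Build a lookup table from normalization resource lines."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("//"):
            continue  # skip blank lines and comments
        entry = ENTRY.match(line)
        if entry:
            table[entry.group(1)] = entry.group(2)
    return table

norm_month = load_normalization([
    "// month names to two-digit month numbers",
    '"January","01"',
    '"February","02"',
    '"March","03"',
])
print(norm_month["January"])  # 01
```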
The rule resources contain rules for the extraction and the normalization of temporal expressions. There is one file for every type (date, time, duration, and set; see TimeML for details [1]).
Every rule contains at least the following three parts:
- RULENAME: name of the rule
- EXTRACTION: regular expression pattern that has to be matched
- NORM_VALUE: definition of how to normalize the matched expression
In the extraction part, one can use the pattern resources described above. In addition, one can use parentheses to group parts of the expression. Every reference to a pattern resource counts as one group. In the normalization part, one can use the normalization resources described above.
One can refer to parts of the expression using the groups defined in the extraction part of the rule.
In addition to the three parts of the rules just described (rulename, extraction, norm_value), for some linguistic phenomena, one needs to specify further constraints to correctly extract a specific temporal expression. For this, the following parts can be added to a rule:
- POS_CONSTRAINT(group(x):y): part-of-speech of group x is y
- OFFSET(group(x)-group(y)): offset begins at the beginning of group x and ends at the end of group y
- NORM_MOD: attribute mod is set here
- NORM_QUANT: attribute quant is set here
- NORM_FREQ: attribute freq is set here
For the normalization process (i.e., in NORM_VALUE, NORM_MOD, NORM_QUANT, and NORM_FREQ), the following functions can be applied:
- SUBSTRING(x,i,j): returns a substring of the string x starting at position i and having the length j
- LOWERCASE(x): converts all characters of the string x into lower case
- UPPERCASE(x): converts all characters of the string x into upper case
- SUM(x,y): adds the integer y to the integer x
To call these functions, they have to be surrounded by % signs, e.g., %LOWERCASE%("January")="january"
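As a rough illustration of what these four functions compute, the following Python equivalents follow the descriptions above literally. The Python names are hypothetical, and returning the sum as a string is an assumption made here because normalized values are strings.

```python
def substring(x, i, j):
    """SUBSTRING(x,i,j): j characters of x, starting at position i."""
    return x[i:i + j]

def lowercase(x):
    """LOWERCASE(x): x with all characters in lower case."""
    return x.lower()

def uppercase(x):
    """UPPERCASE(x): x with all characters in upper case."""
    return x.upper()

def sum_values(x, y):
    """SUM(x,y): integer addition; result returned as a string (assumption)."""
    return str(int(x) + int(y))

print(substring("1990", 0, 2))   # 19
print(lowercase("January"))      # january
print(uppercase("start"))        # START
print(sum_values("1990", "5"))   # 1995
```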
We will now use some example rules to explain the individual parts, the functions, and the syntax of the rule language:
RULENAME="date_simple1",EXTRACTION="%reMonthLong %reDayNumber, %reYear4Digit",NORM_VALUE="group(3)-%normMonth(group(1))-%normDay(group(2))"
Using the % sign, one can refer to a pattern resource or a normalization resource in the extraction and norm_value parts of a rule, respectively.
Assuming that reMonthLong contains all month names, reDayNumber a regular expression for the numbers between 1 and 31, and reYear4Digit a four-digit number, this rule matches expressions such as
- "January 9, 2011"
Since every pattern resource counts as one group, the following groups are available for the normalization process:
- group(1)="January"
- group(2)="9"
- group(3)="2011"
As defined by TimeML [1], the standard annotation language for temporal information, this expression has to be normalized to "2011-01-09". The normalized expression consists of the year as it is mentioned in the expression (group(3)) followed by a "-". The next part is the month. However, the month name January has to be normalized to "01". Thus, the normalization resource normMonth is called to normalize the group(1) expression. Finally, the day (9) is normalized to "09" using the normDay normalization resource.
The normalized expression is thus "2011-01-09".
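The walk-through of date_simple1 can be replayed with Python's re module. This is a sketch under stated assumptions: the concrete contents of the three pattern resources and of normDay are invented here for illustration (only the three month entries appear on this page), and the string substitution is done by hand rather than by HeidelTime's rule interpreter.

```python
import re

# Hypothetical stand-ins for the pattern and normalization resources.
RE_MONTH_LONG = "(January|February|March)"
RE_DAY_NUMBER = "([12][0-9]|3[01]|0?[1-9])"
RE_YEAR_4_DIGIT = r"(\d{4})"
NORM_MONTH = {"January": "01", "February": "02", "March": "03"}
NORM_DAY = {str(day): f"{day:02d}" for day in range(1, 32)}

# EXTRACTION="%reMonthLong %reDayNumber, %reYear4Digit"
extraction = f"{RE_MONTH_LONG} {RE_DAY_NUMBER}, {RE_YEAR_4_DIGIT}"
m = re.search(extraction, "The meeting on January 9, 2011 was short.")

# NORM_VALUE="group(3)-%normMonth(group(1))-%normDay(group(2))"
value = f"{m.group(3)}-{NORM_MONTH[m.group(1)]}-{NORM_DAY[m.group(2)]}"
print(m.groups())  # ('January', '9', '2011')
print(value)       # 2011-01-09
```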
RULENAME="date_simple2",EXTRACTION="(%reMonthLong|%reMonthShort) %reDayNumber, %reYear4Digit",NORM_VALUE="group(5)-%normMonth(group(1))-%normDay(group(4))"
Assuming that reMonthShort contains abbreviated month names, this rule additionally matches expressions such as:
- "Mar. 11, 1982"
Note that the group information changed since we added an additional pattern resource and a new pair of parentheses. This results in the following group information:
- group(1)="Mar."
- group(2)=null
- group(3)="Mar."
- group(4)="11"
- group(5)="1982"
Note that depending on which resource matches the expression, one group equals null: if reMonthLong matches, group(3) equals null; if reMonthShort matches, group(2) equals null. Regardless of which resource matches, group(1) contains the expression. Thus, in the norm_value part, group(1) is used with the %normMonth resource.
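The shift in group numbers can be checked with any regular-expression engine that numbers groups by opening parenthesis. The Python sketch below uses invented resource contents (the abbreviations and the day and year patterns are assumptions); Python reports a group that did not take part in the match as None, which corresponds to null above.

```python
import re

# Hypothetical stand-ins for the pattern resources.
RE_MONTH_LONG = "(January|February|March)"
RE_MONTH_SHORT = r"(Jan\.|Feb\.|Mar\.)"
RE_DAY_NUMBER = "([12][0-9]|3[01]|0?[1-9])"
RE_YEAR_4_DIGIT = r"(\d{4})"

# EXTRACTION="(%reMonthLong|%reMonthShort) %reDayNumber, %reYear4Digit"
extraction = (
    f"({RE_MONTH_LONG}|{RE_MONTH_SHORT}) {RE_DAY_NUMBER}, {RE_YEAR_4_DIGIT}"
)

short = re.search(extraction, "born Mar. 11, 1982")
print(short.groups())  # ('Mar.', None, 'Mar.', '11', '1982')

long_form = re.search(extraction, "born March 11, 1982")
print(long_form.groups())  # ('March', 'March', None, '11', '1982')
```

In both cases group(1) holds the month as written, which is why the rule passes group(1) to normMonth.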
RULENAME="date_complex2",EXTRACTION="%reYear4Digit(-| to )%reYear2Digit",NORM_VALUE="%SUBSTRING%(group(1),0,2)group(3)",OFFSET="group(3)-group(3)"
This rule matches expressions such as "1990-95". We want to use this rule to extract "95" as a temporal expression with the normalized value "1995". The following group information is available for the normalization process:
- group(1)="1990"
- group(2)="-"
- group(3)="95"
In the normalization part, the substring function is called to extract the century information ("19"), which is followed by group(3). Thus, the expression "1990-95" is normalized to "1995". Since the correct offset of this expression is just "95", we use the OFFSET part in the rule and let the offset begin at the beginning of group(3) and end at the end of group(3).
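The interplay of SUBSTRING and OFFSET in date_complex2 can be sketched as follows. The year patterns are assumed stand-ins for reYear4Digit and reYear2Digit, and the character positions are those reported by Python's re module, not HeidelTime's internal offsets.

```python
import re

# EXTRACTION="%reYear4Digit(-| to )%reYear2Digit" with assumed year patterns.
extraction = r"(\d{4})(-| to )(\d{2})"
text = "The 1990-95 period"
m = re.search(extraction, text)

# NORM_VALUE="%SUBSTRING%(group(1),0,2)group(3)": century of group(1) + group(3)
value = m.group(1)[0:2] + m.group(3)

# OFFSET="group(3)-group(3)": the annotation covers only group(3).
start, end = m.start(3), m.end(3)

print(value)            # 1995
print(text[start:end])  # 95
```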
RULENAME="date_complex3",EXTRACTION="%rePartWords([ ]?)%reYear4Digit",NORM_VALUE="group(3)",NORM_MOD="%normPartWords(group(1))"
Assuming rePartWords matches expressions such as "the beginning of", this rule matches temporal expressions such as "the beginning of 2001". The following group information is available:
- group(1)="the beginning of"
- group(2)=" "
- group(3)="2001"
This rule sets the value to "2001" and additionally sets the modification attribute MOD to "START" since the normalization resource normPartWords contains a line "the beginning of","START".
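A rule that fills two attributes can be pictured as producing two values from the same match. In this sketch only the entry "the beginning of","START" comes from the text above; the other rePartWords and normPartWords entries are hypothetical additions for illustration.

```python
import re

# Only "the beginning of" -> "START" is taken from the text; the rest is assumed.
RE_PART_WORDS = "(the beginning of|the middle of|the end of)"
NORM_PART_WORDS = {
    "the beginning of": "START",
    "the middle of": "MID",
    "the end of": "END",
}
RE_YEAR_4_DIGIT = r"(\d{4})"

# EXTRACTION="%rePartWords([ ]?)%reYear4Digit"
extraction = f"{RE_PART_WORDS}([ ]?){RE_YEAR_4_DIGIT}"
m = re.search(extraction, "It happened at the beginning of 2001.")

annotation = {
    "value": m.group(3),                # NORM_VALUE="group(3)"
    "mod": NORM_PART_WORDS[m.group(1)], # NORM_MOD="%normPartWords(group(1))"
}
print(annotation)  # {'value': '2001', 'mod': 'START'}
```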
RULENAME="date_complex_negative",EXTRACTION="%reYear4Digit ([\S]+)",NORM_VALUE="REMOVE",POS_CONSTRAINT="group(2):NNS:"
This rule is an example of a negative rule. The task of negative rules is to prevent phrases from being matched as temporal expressions. Assume a rule matching every four-digit number as a temporal expression of the type year. This results in many false matches since not every four-digit number refers to a year. The task of the negative rule is to block a four-digit number from being extracted as a temporal expression if it is unlikely to be one. This rule matches expressions starting with a four-digit number followed by any kind of token. Thus, expressions such as "2000 shoes" are matched by this rule and the following group information is available:
- group(1)="2000"
- group(2)="shoes"
In this rule, an additional constraint is set, namely that group(2) has the part-of-speech NNS (plural noun). This constraint is satisfied in the given example. In the norm value part of the rule, the value is set to "REMOVE". In HeidelTime's source code, this "REMOVE" is interpreted in such a way that this phrase and all sub-phrases are blocked as temporal expressions.
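The blocking effect of a negative rule can be sketched as a filter over candidate matches. Everything below is an assumption for illustration: the part-of-speech tags are hard-coded in a hypothetical dictionary instead of coming from a tagger, the year rule is a bare four-digit pattern, and "blocked" is modeled as "lies inside a REMOVE match", which is one plausible reading of the description above rather than HeidelTime's actual code.

```python
import re

text = "She bought 2000 shoes in 2001."

# Hypothetical part-of-speech tags; a real pipeline would use a POS tagger.
pos_tags = {"shoes": "NNS", "in": "IN"}

# A naive positive rule: every four-digit number is a year candidate.
candidates = [(m.start(), m.end()) for m in re.finditer(r"\b\d{4}\b", text)]

# Negative rule: EXTRACTION="%reYear4Digit ([\S]+)" with group(2) tagged NNS.
blocked = [
    (m.start(), m.end())
    for m in re.finditer(r"(\d{4}) ([\S]+)", text)
    if pos_tags.get(m.group(2)) == "NNS"
]

def inside(span, outer):
    """True if span lies completely within outer."""
    return outer[0] <= span[0] and span[1] <= outer[1]

kept = [
    text[start:end]
    for (start, end) in candidates
    if not any(inside((start, end), b) for b in blocked)
]
print(kept)  # ['2001']
```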
For underspecified and implicit expressions, one can set the value in an underspecified way.
UNDEF-(this|next|last)-%normUnit(x)
UNDEF-this-%normUnit(x)-REST
UNDEF-REF-%normUnit(x)-REST
UNDEF-REFUNIT-%normUnit(x)-REST
REST may either be a part of the expression, or additionally contain a calculation function.
RULENAME="date_relative_1",EXTRACTION="([\d]+) %reUnit ago",NORM_VALUE="UNDEF-this-%normUnit(group(2))-MINUS-group(1)"
This rule matches expressions such as "20 years ago" with the following group information being available for the normalization process:
- group(1)="20"
- group(2)="years"
The expression is thus normalized to "UNDEF-this-year-MINUS-20". The calculation is performed during HeidelTime's disambiguation phase.
A detailed description of the UNDEF values and the disambiguation phase will follow soon.
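Until that description is available, the following Python sketch shows one way such an underspecified value could be resolved. It is purely illustrative and not HeidelTime's disambiguation code: it assumes a known reference date (for example a document creation date), handles only the unit "year" and the MINUS calculation shown above, and the function name resolve_year_offset is hypothetical.

```python
import re
from datetime import date

def resolve_year_offset(value, reference):
    """Resolve 'UNDEF-this-year-MINUS-n' against a reference date.

    Values in any other form are returned unchanged (sketch only).
    """
    m = re.fullmatch(r"UNDEF-this-year-MINUS-(\d+)", value)
    if m is None:
        return value
    return str(reference.year - int(m.group(1)))

# Build the underspecified value as the rule date_relative_1 would.
norm_unit = {"years": "year"}  # assumed normUnit entry
match = re.search(r"([\d]+) (years) ago", "It was founded 20 years ago.")
undef_value = f"UNDEF-this-{norm_unit[match.group(2)]}-MINUS-{match.group(1)}"
print(undef_value)  # UNDEF-this-year-MINUS-20

resolved = resolve_year_offset(undef_value, date(2011, 1, 9))
print(resolved)  # 1991
```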
Since HeidelTime 1.8, it is also possible to specify so-called "empty values" in rules; these produce TimeML Timex annotations that have no extent, but are rather inserted right after the Timex the rule creates.
Consider this:
RULENAME="duration_r15f",EXTRACTION="(%reApproximate )?(\b[Ii] |\b[Gg]li |\b[Ll]e )?%reUnit ([Ss]cors[ie]|precedenti)",NORM_VALUE="PX%normUnit4Duration(group(4))",NORM_MOD="%normApprox4Durations(group(2))",EMPTY_VALUE="PAST_REF"
This will create a duration Timex annotation with the given extent and value PX%normUnit4Duration(group(4)), as well as an annotation of value PAST_REF with no extent, but a reference to the previous annotation's ID.
[1] TimeML, http://www.timeml.org/