FAQs
- Which model type (i.e. "basic", "standard" and "full") should I use?
- How fast is a basic/standard/full model? Can I tune the speed?
- I have a development set. Can I tune the performance?
- I see "Warning: couldn't find coarse POS map for this language" during training
- I see "Warning: unknow language" during training/testing

Which model type (i.e. "basic", "standard" and "full") should I use?
===========
A: The short answer is "full" for best accuracy, "standard" for a good accuracy-speed trade-off, and "basic" for best speed.
The basic model type uses 1st-order features only. The standard model type involves up to 3rd-order features. The full model type adds two additional types of 3rd-order features and some global features from the re-ranking literature, which makes it the most accurate but also the slowest model type.

How fast is a basic/standard/full model? Can I tune the speed?
======
A: Actual parsing speed varies depending on the sentence length and the size of the model (i.e. number of parameters). Typically, a basic model is about 2x~3x faster than a standard one, and the latter is about 2x faster than a full model.
Here are some options to obtain better parsing speed:
- use option "label:false" if dependency labels are not required
- use more threads in parallel for decoding (e.g. "thread:6")
- for the standard/full model types, change the decoding converge threshold to trade speed against accuracy (e.g. "converge-test:k"; k=1 is the fastest, but values in [20, 300] are more reasonable)
- use the basic model type (i.e. "model:basic")
To give a sense of how fast a typical model is, the table below shows the parsing speed (tokens/sec) on the CoNLL-2008 English dataset, which has an average sentence length of 24:
Model setting | label:true | label:false |
---|---|---|
basic | 2,431 | 4,811 |
standard, thread:4, converge:30 | 1,468 | 2,298 |
full, thread:4, converge:30 | 896 | 1,154 |
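
As a rough reading of the table, the figures can be converted into sentences per second and into the speedup gained by dropping labels. The snippet below is only an arithmetic sketch over the numbers above; the function names are illustrative and not part of RBGParser. It compares columns within a row only, since the "basic" row does not list a thread setting.

```python
# Throughput figures (tokens/sec) copied from the table above.
SPEED = {
    "basic": {"label:true": 2431, "label:false": 4811},
    "standard": {"label:true": 1468, "label:false": 2298},
    "full": {"label:true": 896, "label:false": 1154},
}
AVG_SENT_LEN = 24  # average sentence length reported for this dataset


def sentences_per_sec(tokens_per_sec, avg_len=AVG_SENT_LEN):
    """Convert tokens/sec into an approximate sentences/sec figure."""
    return tokens_per_sec / avg_len


def speedup_without_labels(model):
    """Ratio of unlabeled to labeled parsing speed for one table row."""
    row = SPEED[model]
    return row["label:false"] / row["label:true"]


for model, row in SPEED.items():
    labeled = sentences_per_sec(row["label:true"])
    gain = speedup_without_labels(model)
    print(f"{model:8s} {labeled:6.1f} sent/s labeled, x{gain:.2f} with label:false")
```

On these numbers, turning labels off helps the basic model most (roughly 2x) and the full model least (roughly 1.3x).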

I have a development set. Can I tune the performance?
=======
A: RBGParser can automatically tune the parsing speed of a standard/full model by searching for an optimal decoding converge threshold. If you are about to train a model, add the arguments "dev test-file:example.dev" to enable speed tuning; the parser will then tune the converge threshold right after training finishes. If you have already trained a model, you can also tune it via:
java -classpath "bin:lib/trove.jar" -Xmx32000m \
parser.DependencyParser \
model-file:example.model \
dev test-file:example.dev
This loads the model, searches for the optimal threshold, and overwrites the model file with the optimized configuration.
The speed tuning procedure prints some information like the following:
Tuning hill-climbing converge number on eval set...
converge=300 UAS=0.933531
converge=155 UAS=0.933133
converge=80 UAS=0.933551
converge=45 UAS=0.932753
converge=65 UAS=0.933113
converge=55 UAS=0.933093
converge=50 UAS=0.933113
final converge=50
The procedure uses binary search to find the smallest converge value k that causes no more than a 0.05% decrease in UAS.
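The search can be sketched roughly as follows. This is not the parser's actual code: the bounds (10 and 300), the step of 5, and the reading of 0.05% as an absolute drop of 0.0005 relative to the converge=300 run are all assumptions inferred from the sample log above, and `tune_converge` and `evaluate` are hypothetical names.

```python
def tune_converge(evaluate, upper=300, lower=10, step=5, max_drop=0.0005):
    """Binary-search the smallest converge value whose UAS stays within
    max_drop of the UAS measured at `upper`.

    evaluate(k) must return the dev-set UAS for converge value k.
    All default values are assumptions inferred from the sample log.
    """
    baseline = evaluate(upper)
    trace = [(upper, baseline)]
    lo, hi = lower // step, upper // step  # search in units of `step`
    while lo < hi:
        mid = (lo + hi) // 2
        k = mid * step
        uas = evaluate(k)
        trace.append((k, uas))
        if baseline - uas <= max_drop:
            hi = mid       # k is acceptable, so try smaller values
        else:
            lo = mid + 1   # too much accuracy lost, so go higher
    return hi * step, trace


# UAS values taken from the log above, used as a stand-in for real parsing runs.
LOG_UAS = {300: 0.933531, 155: 0.933133, 80: 0.933551, 45: 0.932753,
           65: 0.933113, 55: 0.933093, 50: 0.933113}

best, trace = tune_converge(LOG_UAS.__getitem__)
print([k for k, _ in trace])
print("final converge =", best)
```

Under these assumptions the sketch visits the same converge values as the log and settles on the same final value. Because UAS is not strictly monotone in k (converge=80 scores slightly higher than converge=300 here), the result is a heuristic choice rather than a guaranteed minimum.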
Currently, RBGParser does not provide an automatic procedure for tuning other hyper-parameters for parsing accuracy (UAS). If you do want to tune accuracy, train several models with different values of gamma, such as {0.1, 0.3, ..., 0.9}, and of the tensor rank R, such as {30, 50, 70, ...}, and use the one with the best UAS on the dev set.
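A manual search of this kind could be organized along the following lines. The sketch only enumerates the configurations and picks a winner from results you supply; the "gamma:..." and "R:..." option strings are an assumption based on the "name:value" pattern of the other options in this FAQ, and the helper names are hypothetical.

```python
from itertools import product

GAMMAS = [0.1, 0.3, 0.5, 0.7, 0.9]
RANKS = [30, 50, 70]  # the list in the text is open-ended; extend as needed


def grid():
    """Every (gamma, R) combination to train a model with."""
    return [{"gamma": g, "R": r} for g, r in product(GAMMAS, RANKS)]


def option_strings(cfg):
    """Format one configuration as parser options (assumed syntax)."""
    return [f"gamma:{cfg['gamma']}", f"R:{cfg['R']}"]


def pick_best(results):
    """results: list of (config, dev_uas) pairs; return the best config."""
    return max(results, key=lambda pair: pair[1])[0]


for cfg in grid():
    print(" ".join(option_strings(cfg)))
```

Each printed line corresponds to one training run; after evaluating every model on the dev set, pass the (configuration, UAS) pairs to `pick_best`.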

I see "Warning: couldn't find coarse POS map for this language" during training
=====
A: RBGParser uses the POS map to map fine POS tags to a universal set of 13 core POS tags (e.g. noun and verb). These core tags are used to create certain types of features (for example, a feature in the full model type that checks conjunction agreement).
The parser loads this mapping via the option "unimap-file:example.uni.map". If loading fails or no mapping is given, the parser issues the warning and falls back to heuristic rules to decide whether a fine POS tag is a noun, verb, conjunction, punctuation mark, adposition or some other tag. The heuristic rules look for certain patterns in the POS tag string (e.g. a tag that starts with "N" is assumed to be a noun).
The project directory /unimap/ contains such mappings for the CoNLL-2006 datasets and the CoNLL-2008 English dataset. For example, here is a portion of the English mapping file /unimap/english08.uni.map:
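The lookup-then-fallback behaviour can be illustrated with a small sketch. It is not the parser's implementation: only the "starts with N means noun" rule comes from the text above, the patterns for the other categories are not spelled out in this FAQ and are therefore left out, and the names `coarse_tag`, "NOUN" and "X" (for "other") are assumptions.

```python
# Prefix rules for the heuristic fallback. Only the "N" rule is taken from
# the FAQ; rules for verbs, conjunctions, punctuation and adpositions exist
# in the parser but are not documented here, so they are omitted.
HEURISTIC_PREFIXES = [("N", "NOUN")]


def coarse_tag(fine_tag, unimap=None):
    """Return a coarse tag for fine_tag.

    Uses the explicit mapping when one is available and contains the tag,
    otherwise falls back to prefix heuristics, otherwise returns "X".
    """
    if unimap and fine_tag in unimap:
        return unimap[fine_tag]
    for prefix, coarse in HEURISTIC_PREFIXES:
        if fine_tag.startswith(prefix):
            return coarse
    return "X"  # assumed label for "other"


print(coarse_tag("CC", {"CC": "CONJ"}))  # mapping available
print(coarse_tag("NNS"))                 # no mapping, heuristic applies
```

Supplying a proper mapping file avoids relying on such string patterns, which may not hold for every tagset.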
...
. .
: .
? .
CC CONJ
CD NUM
CD|RB X
DT DET
EX DET
FW X
IN ADP
...
The first tag on each line is a fine POS tag from the original CoNLL annotations, and the second tag is the corresponding universal tag.
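To inspect or validate a mapping file before handing it to the parser, the two-column format can be read with a few lines of code. This is a minimal sketch that assumes the columns are separated by whitespace; `load_unimap` is a hypothetical helper, not part of RBGParser.

```python
def load_unimap(lines):
    """Parse 'FINE UNIVERSAL' lines into a dict, skipping malformed lines."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # blank line, "..." placeholder, or unexpected format
        fine, universal = parts
        mapping[fine] = universal
    return mapping


# The excerpt shown above, used here in place of reading the file from disk.
SAMPLE = """\
. .
: .
? .
CC CONJ
CD NUM
CD|RB X
DT DET
EX DET
FW X
IN ADP
"""

unimap = load_unimap(SAMPLE.splitlines())
print(unimap["IN"])     # ADP
print(unimap["CD|RB"])  # X
```

For a real file, pass an open file object instead of `SAMPLE.splitlines()`, for example `load_unimap(open(path, encoding="utf-8"))`.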