In this repository we present general-purpose neural network models for sentence boundary detection. We report on a series of experiments with long short-term memory (LSTM), bidirectional long short-term memory (Bi-LSTM) and convolutional neural network (CNN) models. We show that these neural network architectures achieve state-of-the-art results both on multi-lingual benchmarks and in a zero-shot scenario.
The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging (Manning, 2011), dependency parsing (Yu and Vu, 2017), named entity recognition or machine translation.
Sentence boundary detection is a nontrivial task because of the ambiguity of the period sign `.`, which has several functions (Grefenstette and Tapanainen, 1994), e.g.:
- End of sentence
- Abbreviation
- Acronyms and initialisms
- Mathematical numbers
A sentence boundary detection system has to resolve the use of ambiguous punctuation characters to determine if the punctuation character is a true end-of-sentence marker. In this implementation we define `?!:;.` as potential end-of-sentence markers.
Various approaches have been employed to achieve sentence boundary detection in different languages. Recent research in sentence boundary detection focuses on machine learning techniques, such as hidden Markov models (Mikheev, 2002), maximum entropy (Reynar and Ratnaparkhi, 1997), conditional random fields (Tomanek et al., 2007), decision trees (Wong et al., 2014) and neural networks (Palmer and Hearst, 1997). Kiss and Strunk (2006) use an unsupervised sentence detection system called Punkt, which does not depend on any additional resources. The system uses collocation information from unannotated corpora as evidence to detect, e.g., abbreviations or ordinal numbers.
The sentence boundary detection task can be treated as a classification problem. Our work is similar to the SATZ system, proposed by Palmer and Hearst (1997), which uses a fully-connected feed-forward neural network. The SATZ system disambiguates a punctuation mark given a context of k surrounding words. This is different to our approach, as we use a char-based context window instead of a word-based context window.
In the present work, we train different architectures of neural networks, such as long short-term memory (LSTM), bidirectional long short-term memory (Bi-LSTM) and convolutional neural network (CNN) and compare the results with OpenNLP. OpenNLP is a state-of-the-art tool and uses a maximum entropy model for sentence boundary detection. To test the robustness of our models, we use the Europarl corpus for German and English and the SETimes corpus for nine different Balkan languages.
Additionally, we use a zero-shot scenario to test our model on unseen abbreviations. We show that our models outperform OpenNLP both for each language and on the zero-shot learning task. Therefore, we conclude that our trained models can be used for building a robust, language-independent state-of-the-art sentence boundary detection system.
Similar to Wong et al. (2014), we use the Europarl corpus (Koehn, 2005) for our experiments. The Europarl parallel corpus is extracted from the proceedings of the European Parliament and was originally created for research on statistical machine translation systems. We only use German and English from Europarl. Wong et al. (2014) do not mention that the Europarl corpus is not fully sentence-segmented. The Europarl corpus has a one-sentence-per-line data format. Unfortunately, in some cases more than one sentence appears on a line. Thus, we define the Europarl corpus as a "quasi"-sentence segmented corpus.
We use the SETimes corpus (Tyers and Alperen, 2010) as a second corpus for our experiments. The SETimes corpus is based on the content published on the SETimes.com news portal and contains parallel texts in ten languages. Aside from English, the languages contained in the SETimes corpus fall into several linguistic groups: Turkic (Turkish), Slavic (Bulgarian, Bosnian, Croatian, Macedonian and Serbian), Hellenic (Greek), Romance (Romanian) and Albanic (Albanian). The SETimes corpus is also a "quasi"-sentence segmented corpus. For our experiments we use all the mentioned languages except English, as we use an English corpus from Europarl. We do not use any additional data like abbreviation lists.
For a zero-shot scenario we extracted 80 German abbreviations including their context in a sentence from Wikipedia. These abbreviations do not exist in the German Europarl corpus.
Neither Europarl nor SETimes is tokenized. Text tokenization (or, equivalently, segmentation) is highly non-trivial for many languages (Schütze, 2017). It is problematic even for English, as word tokenizers are either manually designed or trained. For our proposed sentence boundary detection system we adopt an idea similar to that of Lee et al. (2016), who use a character-based approach without explicit segmentation for neural machine translation. We also use a character-based context window, so no explicit segmentation of the input text is necessary.
For both corpora we use the following preprocessing steps: (a) we remove duplicate sentences, (b) we extract only sentences that end with a potential end-of-sentence marker. For Europarl and SETimes, the text for each language is split into train, dev and test sets. The following table shows a detailed summary of the training, development and test sets used for each language.
Language | # Train | # Dev | # Test |
---|---|---|---|
German | 1,476,653 | 184,580 | 184,580 |
English | 1,474,819 | 184,352 | 184,351 |
Bulgarian | 148,919 | 18,615 | 18,614 |
Bosnian | 97,080 | 12,135 | 12,134 |
Greek | 159,000 | 19,875 | 19,874 |
Croatian | 143,817 | 17,977 | 17,976 |
Macedonian | 144,631 | 18,079 | 18,078 |
Romanian | 148,924 | 18,615 | 18,615 |
Albanian | 159,323 | 19,915 | 19,915 |
Serbian | 158,507 | 19,813 | 19,812 |
Turkish | 144,585 | 18,073 | 18,072 |
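The two preprocessing steps described above can be sketched in a few lines of Python. This is only a minimal illustration and not the code that was used to build the corpora: the function name `preprocess`, the stripping of surrounding whitespace and the choice to keep the first occurrence of a duplicate are assumptions.

```python
# Potential end-of-sentence markers as defined in this README.
EOS_MARKERS = "?!:;."


def preprocess(lines):
    """Hypothetical helper: (a) drop duplicate sentences,
    (b) keep only sentences that end with a potential end-of-sentence marker."""
    seen = set()
    kept = []
    for line in lines:
        sentence = line.strip()
        # Skip empty lines and sentences that were already seen.
        if not sentence or sentence in seen:
            continue
        seen.add(sentence)
        if sentence[-1] in EOS_MARKERS:
            kept.append(sentence)
    return kept


if __name__ == "__main__":
    sample = ["Hello world.", "Hello world.", "No marker here", "Why?\n"]
    print(preprocess(sample))
```

The order of the remaining sentences is preserved, so a later split into train, dev and test sets can be done on the returned list.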
A script for automatically downloading and extracting the datasets is available and can be used with:
./download_data.sh
Training, development and test data are located in the `data` folder.
We use three different architectures of neural networks: long short-term memory (LSTM), bidirectional long short-term memory (Bi-LSTM) and convolutional neural network (CNN). All three models capture information at the character level. Our models disambiguate potential end-of-sentence markers followed by a whitespace or line break given a context of k surrounding characters. The potential end-of-sentence marker is also included in the context window. The following table shows an example of a sentence and its extracted contexts: left context, middle context and right context. We also include the whitespace or line break after a potential end-of-sentence marker.
Input sentence | Left | Middle | Right |
---|---|---|---|
I go to Mr. Pete Tong | to Mr | . | _Pete |
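The context extraction shown in the table can be illustrated with a short Python sketch. It assumes a window of k = 5 characters on each side and treats only markers that are followed by a whitespace or line break as candidates. The function name `extract_contexts` is hypothetical, and the sketch is not taken from the repository's implementation.

```python
# Potential end-of-sentence markers as defined in this README.
EOS_CANDIDATES = "?!:;."


def extract_contexts(text, k=5):
    """Return (left, middle, right) triples for every potential
    end-of-sentence marker followed by a whitespace or line break."""
    contexts = []
    for i, ch in enumerate(text):
        if ch not in EOS_CANDIDATES:
            continue
        # Assumption: a marker at the very end of the text is skipped,
        # because no following whitespace or line break exists.
        if i + 1 >= len(text) or not text[i + 1].isspace():
            continue
        left = text[max(0, i - k):i]      # up to k characters before the marker
        right = text[i + 1:i + 1 + k]     # up to k characters after it, whitespace included
        contexts.append((left, ch, right))
    return contexts


if __name__ == "__main__":
    for left, middle, right in extract_contexts("I go to Mr. Pete Tong"):
        # The whitespace is printed as "_" to match the table above.
        print(repr(left), repr(middle), repr(right.replace(" ", "_")))
```

For the example sentence, the only candidate is the period after `Mr`, which yields the left context `to Mr`, the middle context `.` and the right context `_Pete`.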
We use a standard LSTM (Hochreiter and Schmidhuber, 1997; Gers et al., 2000) network with an embedding size of 128. The number of hidden states is 256. We apply dropout with probability of 0.2 after the hidden layer during training. We apply a sigmoid non-linearity before the prediction layer.
Our bidirectional LSTM network uses an embedding size of 128 and 256 hidden states. We apply dropout with a probability of 0.2 after the hidden layer during training, and we apply a sigmoid non-linearity before the prediction layer.
For the convolutional neural network we use a 1D convolution layer with 6 filters and a stride size of 1 (Waibel et al., 1989). The output of the convolution filter is fed through a global max pooling layer and the pooling output is concatenated to represent the context. We apply one 250-dimensional hidden layer with ReLU non-linearity before the prediction layer. We apply dropout with a probability of 0.2 during training.
Our proposed character-based model disambiguates a punctuation mark given a context of k surrounding characters. In our experiments we found that a context size of 5 surrounding characters gives the best results. We found that it is very important to include the end-of-sentence marker in the context, as this increases the F1-score by 2%. All models are trained with averaged stochastic gradient descent with a learning rate of 0.001 and a mini-batch size of 32. We use Adam for first-order gradient-based optimization. We use binary cross-entropy as the loss function. We do not tune hyperparameters for each language. Instead, we tune hyperparameters for one language (English) and use them across languages. The following table shows the number of trainable parameters for each model.
Model | # Parameters |
---|---|
LSTM | 420,097 |
Bi-LSTM | 814,593 |
CNN | 33,751 |
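The parameter counts in the table can be checked with some simple arithmetic. The sketch below uses the usual parameter formulas for embedding, LSTM, 1D convolution and dense layers. Two values are assumptions that are not stated in the text above: a character vocabulary of 200 entries and a CNN kernel size of 8. They are chosen here only because they reproduce the reported totals together with the documented settings (embedding size 128, 256 hidden states, 6 filters, a 250-dimensional hidden layer and a single sigmoid output unit).

```python
def embedding_params(vocab_size, emb_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * emb_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)


def conv1d_params(kernel_size, in_channels, filters):
    # One kernel of shape (kernel_size, in_channels) plus one bias per filter.
    return kernel_size * in_channels * filters + filters


def dense_params(input_dim, units):
    # Weight matrix plus bias vector.
    return input_dim * units + units


VOCAB_SIZE = 200   # assumption, not stated in the text
KERNEL_SIZE = 8    # assumption, not stated in the text
EMB_DIM = 128      # embedding size from the text
HIDDEN = 256       # number of hidden states from the text

embedding = embedding_params(VOCAB_SIZE, EMB_DIM)

lstm_total = embedding + lstm_params(EMB_DIM, HIDDEN) + dense_params(HIDDEN, 1)
# The bidirectional model has two LSTMs; their outputs are concatenated.
bilstm_total = embedding + 2 * lstm_params(EMB_DIM, HIDDEN) + dense_params(2 * HIDDEN, 1)
# Global max pooling has no parameters and leaves one value per filter.
cnn_total = (embedding + conv1d_params(KERNEL_SIZE, EMB_DIM, 6)
             + dense_params(6, 250) + dense_params(250, 1))

if __name__ == "__main__":
    print("LSTM:", lstm_total)
    print("Bi-LSTM:", bilstm_total)
    print("CNN:", cnn_total)
```

Under these assumptions, the embedding layer accounts for 25,600 parameters in every model, which explains why the CNN is so much smaller: almost all remaining parameters of the recurrent models sit in the LSTM gates.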
We train each model for a maximum of 5 epochs. For the German and English corpora (Europarl) the time per epoch is 54 minutes for the Bi-LSTM model, 35 minutes for the LSTM model and 7 minutes for the CNN model. For each language from the SETimes corpus the time per epoch is 6 minutes for the Bi-LSTM model, 4 minutes for the LSTM model and 50 seconds for the CNN model. Timings were measured on an NVIDIA DGX-1 with an NVIDIA P100 GPU.
The results on the development set for both Europarl and SETimes are shown in the following table. Download links for the model and vocab files for each language are included, as well as detailed evaluation results.
Language | LSTM | Bi-LSTM | CNN | OpenNLP |
---|---|---|---|---|
German | 0.9759 (model, vocab) | 0.9760 (model, vocab) | 0.9751 (model, vocab) | 0.9736 |
English | 0.9864 (model, vocab) | 0.9863 (model, vocab) | 0.9861 (model, vocab) | 0.9843 |
Bulgarian | 0.9928 (model, vocab) | 0.9926 (model, vocab) | 0.9924 (model, vocab) | 0.9900 |
Bosnian | 0.9953 (model, vocab) | 0.9958 (model, vocab) | 0.9952 (model, vocab) | 0.9921 |
Greek | 0.9959 (model, vocab) | 0.9964 (model, vocab) | 0.9959 (model, vocab) | 0.9911 |
Croatian | 0.9947 (model, vocab) | 0.9948 (model, vocab) | 0.9946 (model, vocab) | 0.9917 |
Macedonian | 0.9795 (model, vocab) | 0.9799 (model, vocab) | 0.9794 (model, vocab) | 0.9776 |
Romanian | 0.9906 (model, vocab) | 0.9904 (model, vocab) | 0.9903 (model, vocab) | 0.9888 |
Albanian | 0.9954 (model, vocab) | 0.9954 (model, vocab) | 0.9945 (model, vocab) | 0.9934 |
Serbian | 0.9891 (model, vocab) | 0.9890 (model, vocab) | 0.9886 (model, vocab) | 0.9838 |
Turkish | 0.9860 (model, vocab) | 0.9867 (model, vocab) | 0.9858 (model, vocab) | 0.9830 |
For each language the best neural network model outperforms OpenNLP. On average, the best neural network model is 0.32% better than OpenNLP. The worst neural network model also outperforms OpenNLP for each language. On average, the worst neural network model is 0.26% better than OpenNLP. In over 60% of the cases the bi-directional LSTM model is the best model. In almost all cases the CNN model performs worse than the LSTM and bi-directional LSTM model, but it still achieves better results than the OpenNLP model. This suggests that the CNN model still needs more hyperparameter tuning.
The results on the test set for both Europarl and SETimes are shown in the following table. Download links for the model and vocab files for each language are included, as well as detailed evaluation results.
Language | LSTM | Bi-LSTM | CNN | OpenNLP |
---|---|---|---|---|
German | 0.9750 (model, vocab) | 0.9760 (model, vocab) | 0.9751 (model, vocab) | 0.9738 |
English | 0.9861 (model, vocab) | 0.9860 (model, vocab) | 0.9858 (model, vocab) | 0.9840 |
Bulgarian | 0.9922 (model, vocab) | 0.9923 (model, vocab) | 0.9919 (model, vocab) | 0.9887 |
Bosnian | 0.9957 (model, vocab) | 0.9959 (model, vocab) | 0.9953 (model, vocab) | 0.9925 |
Greek | 0.9967 (model, vocab) | 0.9969 (model, vocab) | 0.9963 (model, vocab) | 0.9925 |
Croatian | 0.9946 (model, vocab) | 0.9948 (model, vocab) | 0.9943 (model, vocab) | 0.9907 |
Macedonian | 0.9810 (model, vocab) | 0.9811 (model, vocab) | 0.9794 (model, vocab) | 0.9786 |
Romanian | 0.9907 (model, vocab) | 0.9906 (model, vocab) | 0.9904 (model, vocab) | 0.9889 |
Albanian | 0.9953 (model, vocab) | 0.9949 (model, vocab) | 0.9940 (model, vocab) | 0.9934 |
Serbian | 0.9877 (model, vocab) | 0.9877 (model, vocab) | 0.9870 (model, vocab) | 0.9832 |
Turkish | 0.9858 (model, vocab) | 0.9854 (model, vocab) | 0.9854 (model, vocab) | 0.9808 |
For each language the best neural network model outperforms OpenNLP. On average, the best neural network model is 0.32% better than OpenNLP. The worst neural network model also outperforms OpenNLP for each language. On average, the worst neural network model is 0.25% better than OpenNLP. In half of the cases the bi-directional LSTM model is the best model. In almost all cases the CNN model performs worse than the LSTM and bi-directional LSTM model, but it still achieves better results than the OpenNLP model.
Model | Precision | Recall | F1-Score |
---|---|---|---|
LSTM | 0.6046 | 0.9750 | 0.7464 |
Bi-LSTM | 0.6341 | 0.9750 | 0.7684 |
CNN | 0.5735 | 0.9750 | 0.7222 |
OpenNLP | 0.5460 | 0.9625 | 0.6968 |
The table above shows the results for the zero-shot scenario. The bi-directional LSTM model outperforms OpenNLP by a large margin: its F1-score is about 7 percentage points higher. The bi-directional LSTM model also outperforms all other neural network models. This suggests that the bi-directional LSTM model generalizes better than the LSTM or CNN model to unseen abbreviations. The worst neural network model (CNN) still scores about 2.5 percentage points higher than OpenNLP.
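The F1-scores of the zero-shot scenario follow directly from precision and recall as their harmonic mean. The following sketch recomputes them from the reported values, with all numbers expressed as fractions between 0 and 1. The function name `f1_score` is chosen for illustration and is not part of the repository.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for the zero-shot scenario, as reported above.
zero_shot = {
    "LSTM": (0.6046, 0.9750),
    "Bi-LSTM": (0.6341, 0.9750),
    "CNN": (0.5735, 0.9750),
    "OpenNLP": (0.5460, 0.9625),
}

if __name__ == "__main__":
    for model, (precision, recall) in zero_shot.items():
        print(f"{model}: {f1_score(precision, recall):.4f}")
```

Because recall is nearly identical for all systems, the differences in F1-score are driven almost entirely by precision, i.e. by how often an abbreviation period is wrongly tagged as a sentence boundary.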
In this repository, we propose a general-purpose system for sentence boundary detection using different architectures of neural networks. We use the Europarl and SETimes corpora and compare our proposed models with OpenNLP. We achieve state-of-the-art results.
In a zero-shot scenario, in which no manifestation of the test abbreviations is observed during training, our system is also robust against unseen abbreviations.
The fact that our proposed neural network models perform well on different languages and on a zero-shot scenario leads us to the conclusion that our system is a general-purpose system.
To reproduce these results, the following scripts can be used:
- `benchmark_all.sh` - runs evaluation for various neural network models and all languages
- `benchmark_all_opennlp` - runs evaluation for OpenNLP for all languages
We use Keras and TensorFlow for the implementation of the neural network architectures.
The following command-line options are available:
$ python3 main.py --help
usage: main.py [-h] [--training-file TRAINING_FILE] [--test-file TEST_FILE]
[--input-file INPUT_FILE] [--epochs EPOCHS]
[--architecture ARCHITECTURE] [--window-size WINDOW_SIZE]
[--batch-size BATCH_SIZE] [--dropout DROPOUT]
[--min-freq MIN_FREQ] [--max-features MAX_FEATURES]
[--embedding-size EMBEDDING_SIZE] [--kernel-size KERNEL_SIZE]
[--filters FILTERS] [--pool-size POOL_SIZE]
[--hidden-dims HIDDEN_DIMS] [--strides STRIDES]
[--lstm_gru_size LSTM_GRU_SIZE] [--mlp-dense MLP_DENSE]
[--mlp-dense-units MLP_DENSE_UNITS]
[--model-filename MODEL_FILENAME]
[--vocab-filename VOCAB_FILENAME] [--eos-marker EOS_MARKER]
{train,test,tag,extract}
positional arguments:
{train,test,tag,extract}
optional arguments:
-h, --help show this help message and exit
--training-file TRAINING_FILE
Defines training data set
--test-file TEST_FILE
Defines test data set
--input-file INPUT_FILE
Defines input file to be tagged
--epochs EPOCHS Defines number of training epochs
--architecture ARCHITECTURE
Neural network architectures, supported: cnn, lstm,
bi-lstm, gru, bi-gru, mlp
--window-size WINDOW_SIZE
Defines number of window size (char-ngram)
--batch-size BATCH_SIZE
Defines number of batch_size
--dropout DROPOUT Defines number dropout
--min-freq MIN_FREQ Defines the min. freq. a char must appear in data
--max-features MAX_FEATURES
Defines number of features for Embeddings layer
--embedding-size EMBEDDING_SIZE
Defines Embeddings size
--kernel-size KERNEL_SIZE
Defines Kernel size of CNN
--filters FILTERS Defines number of filters of CNN
--pool-size POOL_SIZE
Defines pool size of CNN
--hidden-dims HIDDEN_DIMS
Defines number of hidden dims
--strides STRIDES Defines numer of strides for CNN
--lstm_gru_size LSTM_GRU_SIZE
Defines size of LSTM/GRU layer
--mlp-dense MLP_DENSE
Defines number of dense layers for mlp
--mlp-dense-units MLP_DENSE_UNITS
Defines number of dense units for mlp
--model-filename MODEL_FILENAME
Defines model filename
--vocab-filename VOCAB_FILENAME
Defines vocab filename
--eos-marker EOS_MARKER
Defines end-of-sentence marker used for tagging
A new model can be trained using the `train` parameter. The only mandatory argument in training mode is the `--training-file` parameter. This parameter specifies the training file with sentence-separated entries.
python3 main.py train --training-file <TRAINING_FILE>
A previously trained model can be evaluated using the `test` parameter. The only mandatory argument in testing mode is the `--test-file` parameter, which specifies the test file with sentence-separated entries.
python3 main.py test --test-file <TEST_FILE>
To tag an input text with a previously trained model, the `tag` parameter must be used in combination with the `--input-file` parameter, which specifies the input text to be tagged.
python3 main.py tag --input-file <INPUT_FILE>
An evaluation script can be found in the `eos-eval` folder. The main arguments for the `eval.py` script are:
$ python3 eval.py --help
usage: eval.py [-h] [-g GOLD] [-s SYSTEM] [-v]
optional arguments:
-h, --help show this help message and exit
-g GOLD, --gold GOLD Gold standard
-s SYSTEM, --system SYSTEM
System output
-v, --verbose Verbose output
The system and gold standard files must use `</eos>` as the end-of-sentence marker. The evaluation script then calculates precision, recall and F1-score. The `--verbose` parameter gives a detailed output of e.g. false negatives.
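To illustrate what such an evaluation involves, the following sketch compares the positions of `</eos>` markers in a gold standard text and a system output. It is not the implementation of `eval.py`. It assumes that both texts are identical once the markers are removed, and the function names `boundary_offsets` and `evaluate` are hypothetical.

```python
EOS = "</eos>"


def boundary_offsets(tagged_text, marker=EOS):
    """Return the character offsets (counted in the marker-free text)
    at which an end-of-sentence marker occurs."""
    offsets = set()
    position = 0
    parts = tagged_text.split(marker)
    # Every part except the last one is followed by a marker.
    for part in parts[:-1]:
        position += len(part)
        offsets.add(position)
    return offsets


def evaluate(gold_text, system_text):
    """Compute precision, recall and F1-score over boundary positions."""
    gold = boundary_offsets(gold_text)
    system = boundary_offsets(system_text)
    true_positives = len(gold & system)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = "I go to Mr. Pete Tong.</eos> He is a DJ.</eos>"
    system = "I go to Mr.</eos> Pete Tong.</eos> He is a DJ.</eos>"
    print(evaluate(gold, system))
```

In this example the system output contains one false positive (the boundary after `Mr.`) and no false negatives, so recall is perfect while precision drops to two thirds.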
We would like to thank the Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften (LRZ) for giving us access to the NVIDIA DGX-1 supercomputer.
For questions about deep-eos, please create a new issue here. If you want to contribute to the project please refer to the Contributing guide!
To respect the Free Software Movement and the enormous work of Dr. Richard Stallman, this implementation is released under the GNU Affero General Public License in version 3. More information can be found here and in `COPYING`.
S. Schweter and S. Ahmed, "Deep-EOS: General-Purpose Neural Networks for Sentence Boundary Detection," in Proceedings of the 15th Conference on Natural Language Processing (KONVENS), 2019.
You can use the following BibTeX entry:
@InProceedings{Schweter:Ahmed:2019,
author = {Stefan Schweter and Sajawel Ahmed},
title = {{Deep-EOS: General-Purpose Neural Networks for Sentence Boundary Detection}},
booktitle = {Proceedings of the 15th Conference on Natural Language Processing (KONVENS)},
location = {Erlangen, Germany},
year = 2019,
note = {accepted}
}
- Related work section: Elephant can be trained on data that is not tokenized (and that is only sentence-segmented), see issue #4.
A PyTorch fork of deep-eos was written by @m-stoeckel and is available here.