Lexical simplification (LS) aims to replace complex words in a given sentence with simpler alternatives of equivalent meaning. Recent unsupervised lexical simplification approaches rely only on the complex word itself, regardless of the given sentence, to generate candidate substitutions, which inevitably produces a large number of spurious candidates. We present a simple BERT-based LS approach that makes use of the pre-trained unsupervised deep bidirectional representations of BERT. We feed the given sentence, with the complex word masked, into the masked language model of BERT to generate candidate substitutions. Because the whole sentence is taken into account, the generated alternatives are more likely to preserve the cohesion and coherence of the sentence. Experimental results show that our approach obtains substantial improvements on standard LS benchmarks.
- FastText (pre-trained FastText word embeddings)
- BERT based on pytorch-transformers 1.0
We recommend Python 3.5 or higher. The model is implemented with PyTorch 1.0.1 using pytorch-transformers v1.0.0. We provide three versions: LSBert1.0 and LSBert2.0 need to be provided with a sentence and a complex word, while recursive_LSBert2 can directly simplify a sentence.
(1) Download pretrained BERT. In our experiments, we adopted the pretrained BERT-Large, Uncased (Whole Word Masking) model.
(2) Download the pre-trained FastText word embeddings (a loading sanity check is sketched after the steps below).
Run LSBert1.0 (published at AAAI 2020):
(3) run "./run_LSBert1.sh".
Run LSBert2.0 (published on arXiv):
(4) Download an English paraphrase database (PPDB), and set the path to PPDB in the ".sh" file.
(5) Download a pretrained sequence labeling model for complex word identification, and put it into the main directory of the code.
(6) run "./run_LSBert2.sh".
(7) run "./run_LSBert_TS.sh": Iteratively call LSBert2.0 to simplify one sentence
Suppose there is a sentence S = "the cat perched on the mat" and the complex word "perched". We first obtain S' by replacing the complex word in S with the [MASK] token. We then concatenate the original sequence S and S' as a sentence pair and feed the pair {S, S'} into BERT to obtain the probability distribution over the vocabulary for the masked position. Finally, we select the top words from this probability distribution as simplification candidates, excluding morphological derivations of the complex word. For this example, the top three simplification candidates are "sat, seated, hopped".
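The following is a minimal sketch of this candidate-generation step using pytorch-transformers. It is not the repository's exact implementation, and the crude prefix check at the end only approximates the morphological-derivation filter described above.

```python
import torch
from pytorch_transformers import BertTokenizer, BertForMaskedLM

MODEL = 'bert-large-uncased-whole-word-masking'
tokenizer = BertTokenizer.from_pretrained(MODEL)
model = BertForMaskedLM.from_pretrained(MODEL)
model.eval()

sentence, complex_word = "the cat perched on the mat", "perched"

# Build S' = S with the complex word replaced by [MASK], then feed the pair {S, S'}.
words = sentence.split()
idx = words.index(complex_word)
tokens_a = tokenizer.tokenize(sentence)
tokens_b = (tokenizer.tokenize(' '.join(words[:idx])) + ['[MASK]'] +
            tokenizer.tokenize(' '.join(words[idx + 1:])))

tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
mask_pos = tokens.index('[MASK]')

with torch.no_grad():
    logits = model(input_ids, token_type_ids=torch.tensor([segment_ids]))[0]

# Top predictions for the masked position, skipping derivations of the complex word
# (a crude prefix check stands in for the paper's morphological filter).
_, top_ids = torch.topk(logits[0, mask_pos], 15)
candidates = tokenizer.convert_ids_to_tokens(top_ids.tolist())
candidates = [c for c in candidates if not c.startswith(complex_word[:4])][:3]
print(candidates)  # the paper's example yields "sat, seated, hopped"
```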
Comparison of simplification candidates for complex words using three methods. Given the sentence "John composed these verses." and the complex words 'composed' and 'verses', the top three simplification candidates for each complex word are generated by our method BERT-LS and by the two state-of-the-art baselines based on word embeddings (Glavas and Paetzold-NE). The top three substitution candidates generated by BERT-LS are not only related to the complex words but also fit the original sentence well. Then, by considering the frequency or order of each candidate, we can easily choose 'wrote' as the replacement for 'composed' and 'poems' as the replacement for 'verses'. In this case, the simplified sentence 'John wrote these poems.' is easier to understand than the original sentence.
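As a rough, hedged illustration of how such a ranking could combine the BERT prediction order, FastText similarity, and word frequency, consider the sketch below. The repository's actual ranking features and weights differ; `fasttext` is the KeyedVectors object from the earlier loading sketch, and the third-party `wordfreq` package is a stand-in for the frequency resource, not a dependency of this code.

```python
# Illustrative ranking sketch (not the repository's implementation): combine the
# BERT prediction order, FastText cosine similarity to the complex word, and a
# corpus frequency score, then sort candidates by the combined score.
from wordfreq import zipf_frequency

def rank_candidates(candidates, complex_word, fasttext, lang='en'):
    scored = []
    for rank, cand in enumerate(candidates):
        bert_score = 1.0 / (rank + 1)                               # earlier BERT candidates score higher
        similarity = fasttext.similarity(complex_word, cand) if cand in fasttext else 0.0
        frequency = zipf_frequency(cand, lang) / 8.0                # roughly normalised to [0, 1]
        scored.append((bert_score + similarity + frequency, cand))
    return [cand for _, cand in sorted(scored, reverse=True)]

# e.g. rank_candidates(['wrote', 'made', 'penned'], 'composed', fasttext)
# would be expected to rank a frequent, well-fitting word such as 'wrote' first.
```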
@inproceedings{qiang2020BERTLS,
  title     = {Lexical Simplification with Pretrained Encoders},
  author    = {Qiang, Jipeng and
               Li, Yun and
               Zhu, Yi and
               Yuan, Yunhao and
               Wu, Xindong},
  booktitle = {Thirty-Fourth AAAI Conference on Artificial Intelligence},
  pages     = {8649--8656},
  year      = {2020}
}