- A student report that also covers BERT: http://web.stanford.edu/class/cs224n/reports/custom/15742249.pdf
- Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model
- Compressing Word Embeddings via Deep Compositional Code Learning
- The codebook trick offers an alternative approach to memory layers (a minimal sketch of the codebook idea follows after this list)
- Word2Bits - Quantized Word Vectors
- Impressive results, though they may reflect weaknesses of Word2Vec more than strengths of the quantization itself (a sketch of this kind of sign-based quantization follows after this list)
- Worth investigating whether the approach carries over to more up-to-date embeddings
- A Survey of Model Compression and Acceleration for Deep Neural Networks
- Training Quantized Nets: A Deeper Understanding
- Efficient and Effective Quantization for Sparse DNNs
- Scalable Methods for 8-bit Training of Neural Networks
- Convolutional neural network compression for natural language processing
- Improving text classification with vectors of reduced precision
- Neural Networks Compression for Language Modeling
- Natural Language Processing with Small Feed-Forward Networks
- Not quantization as such, but building small networks in the first place; such networks can then be quantized separately (see the last sketch after this list)
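
A minimal sketch of the codebook idea behind "Compressing Word Embeddings via Deep Compositional Code Learning": each word stores a handful of small integer codes, and its embedding is reconstructed by summing one vector from each shared codebook. The sizes and variable names below are illustrative assumptions, and the end-to-end learning of codes and codebooks from the paper is omitted.

```python
import numpy as np

# Illustrative sizes (not from the paper): M codebooks, each with K vectors of dimension d.
M, K, d = 8, 16, 300
vocab_size = 50_000

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(M, K, d)).astype(np.float32)          # shared, small
codes = rng.integers(0, K, size=(vocab_size, M), dtype=np.int8)    # per word: M code indices

def embed(word_id: int) -> np.ndarray:
    """Reconstruct a word embedding as the sum of one vector from each codebook."""
    return codebooks[np.arange(M), codes[word_id]].sum(axis=0)

# Storage per word: M * log2(K) bits for the codes (here 8 * 4 = 32 bits),
# versus d * 32 bits (9600 bits) for a dense float32 embedding.
vector = embed(42)
```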
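A sketch of the kind of 1-bit quantization Word2Bits describes: every dimension is mapped to a fixed positive or negative value based on its sign. The scale constant and the post-hoc application to pretrained vectors are assumptions for illustration; the paper applies the quantization function inside Word2Vec training.

```python
import numpy as np

def quantize_1bit(vectors: np.ndarray, scale: float = 1.0 / 3.0) -> np.ndarray:
    """Map every dimension to +/- scale based on its sign (1 bit per dimension)."""
    return np.where(vectors >= 0.0, scale, -scale).astype(np.float32)

# Toy example on random "embeddings"; real usage would load pretrained vectors.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 300)).astype(np.float32)
quantized = quantize_1bit(embeddings)

# Only the sign needs to be stored: 300 bits per word instead of 300 * 32 bits.
signs = quantized > 0
```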
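The last entry's point can be made concrete with a sketch: a deliberately tiny bag-of-embeddings classifier, loosely in the spirit of "Natural Language Processing with Small Feed-Forward Networks" (the sizes, names, and lack of feature hashing or training are illustrative assumptions), followed by simple post-training 8-bit weight quantization to show that building a small model and quantizing it are independent steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny text classifier: average word embedding -> one small ReLU layer -> softmax.
vocab_size, emb_dim, hidden, n_classes = 20_000, 32, 64, 4
E  = rng.normal(scale=0.1, size=(vocab_size, emb_dim)).astype(np.float32)
W1 = rng.normal(scale=0.1, size=(emb_dim, hidden)).astype(np.float32)
W2 = rng.normal(scale=0.1, size=(hidden, n_classes)).astype(np.float32)

def predict(token_ids: list[int]) -> np.ndarray:
    x = E[token_ids].mean(axis=0)        # bag-of-words: average the token embeddings
    h = np.maximum(x @ W1, 0.0)          # small hidden layer
    logits = h @ W2
    logits -= logits.max()               # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum()

# Separate, post-hoc 8-bit weight quantization (symmetric, per-tensor scale).
def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

E_q,  E_s  = quantize_int8(E)
W1_q, W1_s = quantize_int8(W1)
W2_q, W2_s = quantize_int8(W2)

# Toy usage: predict from a handful of token ids, then again with dequantized weights.
probs = predict([12, 405, 9981])
# int8 storage is roughly 4x smaller than float32; int8 kernels could use it directly.
```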