- Word mover's distance: paper, gensim example
- Extension of BOW based on word2vec: the sentence vector is built from the top-10 word-vector similarities between each word of the sentence and the vocabulary.
- LSTM + AutoEncoder
- SIF
- Mean-pooling / max-pooling of word vectors / LSTM hidden states (see the sketch after this list)
- Doc2vec
- LDA
- LSI
- Sent2vec
- Simhash
- Skip-Thought
- Quick thoughts
- Sentence-BERT: fine-tune BERT with a Siamese network structure on SNLI data
- https://hanxiao.github.io/2018/06/24/4-Encoding-Blocks-You-Need-to-Know-Besides-LSTM-RNN-in-Tensorflow/
- sentence-transformers: can be trained with sentence pairs (e.g., on a corpus from text classification)
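A minimal numpy sketch of mean-pooling / max-pooling of word vectors (referenced in the list above); `word_vectors` is a hypothetical token-to-vector lookup standing in for any pre-trained embedding table such as word2vec or GloVe.

# mean-pooling / max-pooling of word vectors -- minimal sketch
import numpy as np

word_vectors = {}  # hypothetical: token -> np.ndarray of shape (dim,)

def pool_sentence(tokens, pooling="mean", dim=300):
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)  # no in-vocabulary token: fall back to a zero vector
    mat = np.stack(vecs)  # (n_tokens, dim)
    return mat.mean(axis=0) if pooling == "mean" else mat.max(axis=0)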
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html
# http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([0, 1, 2, 3])
b = np.array([0, 1, 3, 3])
# scipy's cosine() returns a distance, so similarity = 1 - distance;
# sklearn's cosine_similarity expects 2-D inputs, hence the reshape
print('scipy cosine similarity: {}, sklearn similarity: {}'.format(
    1 - cosine(a, b), cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0][0]))
# scipy cosine similarity: 0.981022943176, sklearn similarity: 0.981022943176
# scipy vectorization: cosine distances between one query embedding and many candidates
# (query_embedding and candidate_embeddings are placeholders defined elsewhere;
#  candidate_embeddings is a 2-D array / list of vectors, one row per candidate)
from scipy.spatial.distance import cdist
distances = cdist(np.array([query_embedding]), candidate_embeddings, "cosine")[0]
tmp_sims = 1 - distances
- Two models are proposed: PV-DM (Paragraph Vector - Distributed Memory) and PV-DBOW (Paragraph Vector - Distributed Bag of Words). PV-DM is analogous to CBOW in word2vec, and PV-DBOW is analogous to Skip-gram. PV-DM is consistently better than PV-DBOW.
- Model principle: during training, the paragraph vector is concatenated with several word vectors from the paragraph to predict the next word in the given context. Paragraph vectors are unique to each paragraph, while word vectors are shared across all paragraphs. At prediction time, the paragraph vector of a new document is inferred by fixing the word vectors and training the new paragraph vector until convergence (see the gensim sketch after this list).
- For PV-DM: concatenating the context vectors is often better than summing them.
- For PV-DBOW: the paragraph vector alone is trained to predict words sampled from the paragraph, so word order is ignored and less data needs to be stored than in PV-DM.
- BOW features lose the ordering of the words and also ignore their semantics (the dot product of any two distinct one-hot word vectors is zero). Concatenating word vectors preserves word order.
- Weighted averaging of word vectors loses the word order in the same way as the standard bag-of-words models do.
- For long documents, bag-of-words models perform quite well.
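A minimal gensim sketch of the ideas above: train PV-DM (dm=1) on a toy corpus, then infer a vector for an unseen document with the word vectors held fixed; the corpus and hyperparameters are illustrative only.

# gensim Doc2Vec: train PV-DM, then infer a paragraph vector for a new document
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [["machine", "learning", "is", "fun"], ["deep", "learning", "for", "nlp"]]  # toy corpus
documents = [TaggedDocument(words, [i]) for i, words in enumerate(corpus)]

model = Doc2Vec(documents, dm=1, vector_size=50, window=2, min_count=1, epochs=40)

# infer_vector keeps word vectors fixed and optimizes only the new paragraph vector
new_vec = model.infer_vector(["machine", "learning", "for", "nlp"])
print(new_vec.shape)  # (50,)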
# from https://github.com/fxsjy/jieba/issues/575
# split Chinese text into sentences at 。!?; etc., optionally followed by closing
# quotes/brackets; a colon only splits when followed by an opening quote or end of string
import re

resentencesp = re.compile(r'([﹒﹔﹖﹗.;。!?]["’”」』]{0,2}|:(?=["‘“「『]{1,2}|$))')

def split_paragraph(sentence):
    s = sentence
    slist = []
    for i in resentencesp.split(s):
        if resentencesp.match(i) and slist:
            # piece is a sentence-ending delimiter: attach it to the previous sentence
            slist[-1] += i
        elif i:
            slist.append(i)
    return slist
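Example usage of the splitter above on a short Chinese paragraph (the sample text is illustrative):

# keeps the sentence-ending punctuation attached to each sentence
print(split_paragraph("今天天气不错。我们出去走走吧!你觉得呢?"))
# ['今天天气不错。', '我们出去走走吧!', '你觉得呢?']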