This is an implementation of the CIKM'18 paper Construction of Efficient V-Gram Dictionary for Sequential Data Analysis.
vgram is a Python package that provides an sklearn-like fit-transform interface for easy integration into your pipeline.
vgram is similar to BPE (Sennrich et al., 2016), but instead of raw frequencies it takes the informativeness of subwords into account.
This allows you to compress the dictionary to several thousand elements while increasing accuracy.
pip install vgram
You also need Python 3 and cmake. During installation you may see errors about pybind11 not being installed; these can be safely ignored.
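If the installation still fails, pre-installing the build dependencies from PyPI may help (this is a workaround suggestion based on the dependencies named above, not an official requirement of the package):
pip install cmake pybind11
pip install vgram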
Fit a vgram dictionary
from vgram import VGram
texts = ["hello world", "a cat sat on the mat"]
# build a dictionary of at most 20 v-grams in 300 construction iterations
vgram = VGram(size=20, iter_num=300)
vgram.fit(texts)
result = vgram.transform(texts)
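As a quick sanity check you can print the transformed texts; transform returns one v-gram representation per input document (the loop below is purely illustrative and uses only the calls shown above):
# print each text next to its v-gram representation
for original, transformed in zip(texts, result):
    print(original, "->", transformed)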
vgram can be applied not only to text but also to integer sequences.
This generalization lets you work with non-textual data or tokenize the text yourself.
The following example is equivalent to the previous one.
from vgram import IntVGram, CharTokenizer
texts = ["hello world", "a cat sat on the mat"]
tokenizer = CharTokenizer()
# transform the texts to token ids and pass them to IntVGram
tok_texts = tokenizer.fit_transform(texts)
vgram = IntVGram(size=10000, iter_num=10)
vgram.fit(tok_texts)
result = tokenizer.decode(vgram.transform(tok_texts))
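The explicit tokenizer also composes with an sklearn Pipeline, which is how the classification example below is organized. This is only a sketch: it assumes CharTokenizer exposes a plain transform method (not shown above) alongside fit_transform and decode:
from sklearn.pipeline import Pipeline
from vgram import IntVGram, CharTokenizer

# chain explicit tokenization and dictionary construction
int_vgram = Pipeline([
    ("tokenizer", CharTokenizer()),
    ("vgb", IntVGram(size=10000, iter_num=10))
])
int_vgram.fit(texts)
encoded = int_vgram.transform(texts)  # still token ids; decode with the tokenizer if needed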
By default, VGram lowercases all texts, removes all non-alphanumeric symbols, and splits them into characters.
This normalization is not suitable for many tasks, so you can fit a vgram dictionary with custom normalization and tokenization.
You only need to derive from BaseTokenizer and implement the normalize and tokenize methods.
Note: this feature is not stable; for custom tokenization prefer the previous variant (tokenize the texts yourself and pass them to IntVGram).
from vgram import VGram, BaseTokenizer
class MyTokenizer(BaseTokenizer):
    def normalize(self, s):
        return s

    def tokenize(self, s):
        return s.split(' ')
texts = ["hello world", "a cat sat on the mat"]
tokenizer = MyTokenizer()
tok_vgram = VGram(size=10000, iter_num=10, tokenizer=tokenizer)
tok_vgram.fit(texts)
tok_result = tok_vgram.transform(texts)
By changing the tokenization like this you can build a dictionary of v-grams in which whole words, rather than characters, act as symbols.
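To see the effect you can print the transformed texts; with this tokenizer each dictionary entry is a sequence of whole words (the loop is purely illustrative and uses only the calls shown above):
# print each text next to its word-level v-gram representation
for original, transformed in zip(texts, tok_result):
    print(original, "->", transformed)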
A basic example of classification on the 20 newsgroups dataset
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from vgram import VGram
# fetch data
train, test = fetch_20newsgroups(subset='train'), fetch_20newsgroups(subset='test')
data = train.data + test.data
# make vgram pipeline and fit it
vgram = Pipeline([
("vgb", VGram(size=10000, iter_num=10)),
("vect", CountVectorizer())
])
# it's ok to fit here: vgram is fitted only once and won't be refitted inside the classifier pipeline
vgram.fit(data)
# fit classifier and get score
pipeline = Pipeline([
("features", vgram),
('tfidf', TfidfTransformer(sublinear_tf=True)),
('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-4, max_iter=100, random_state=42))
])
pipeline.fit(train.data, train.target)
print("train accuracy: ", np.mean(pipeline.predict(train.data) == train.target))
print("test accuracy: ", np.mean(pipeline.predict(test.data) == test.target))
# show the first ten elements of the constructed v-gram dictionary
alpha = vgram.named_steps["vgb"].alphabet()
print("First 10 alphabet elements:", alpha[:10])
V-Gram is an unsupervised method, which is why we fit it on all available data (train and test together).
Once fitted, vgram is not refitted, so there is no need to worry about it being fitted a second time inside the classifier pipeline.
The last two lines show how to get the dictionary alphabet and print some of its elements.
Read the full documentation for more information.