This project is for text clustering using the Latent Dirichlet Allocation (LDA) algorithm. It can be adapted to many languages provided that the Snowball stemmer, a dependency of this project, supports it.
from artifici_lda.lda_service import train_lda_pipeline_default
FR_STOPWORDS = [
"le", "les", "la", "un", "de", "en",
"a", "b", "c", "s",
"est", "sur", "tres", "donc", "sont",
# even slang/texto stop words:
"ya", "pis", "yer"]
# Note: this list of stop words is poor and is just as an example.
fr_comments = [
"Un super-chat marche sur le trottoir",
"Les super-chats aiment ronronner",
"Les chats sont ronrons",
"Un super-chien aboie",
"Deux super-chiens",
"Combien de chiens sont en train d'aboyer?"
]
transformed_comments, top_comments, _1_grams, _2_grams = train_lda_pipeline_default(
fr_comments,
n_topics=2,
stopwords=FR_STOPWORDS,
language='french')
print(transformed_comments)
print(top_comments)
print(_1_grams)
print(_2_grams)
Output:
array([[0.14218195, 0.85781805],
[0.11032992, 0.88967008],
[0.16960695, 0.83039305],
[0.88967041, 0.11032959],
[0.8578187 , 0.1421813 ],
[0.83039303, 0.16960697]])
['Un super-chien aboie', 'Les super-chats aiment ronronner']
[[('chiens', 3.4911404011996545), ('super', 2.5000203653313933)],
[('chats', 3.4911393765493255), ('super', 2.499979634668601 )]]
[[('super chiens', 2.4921035508342464)],
[('super chats', 2.492102155345991 )]]
See Multilingual-LDA-Pipeline-Tutorial for an exhaustive example (intended to be read from top to bottom, not skimmed through). For more explanations on the Inverse Lemmatization, see Stemming-words-from-multiple-languages.
Those languages are supported:
- Danish
- Dutch
- English
- Finnish
- French
- German
- Hungarian
- Italian
- Norwegian
- Porter
- Portuguese
- Romanian
- Russian
- Spanish
- Swedish
- Turkish
You need to bring your own list of stop words. That could be achieved by computing the Term Frequencies on your corpus (or on a bigger corpus of the same language) and to use some of the most common words as stop words.
numpy==1.26.3 # BSD-3-Clause and BSD-2-Clause BSD-like and Zlib
scikit-learn==1.4.0 # BSD-3-Clause
PyStemmer==2.2.0.1 # BSD-3-Clause and MIT
snowballstemmer==2.2.0 # BSD-3-Clause and BSD-2-Clause
translitcodec==0.7.0 # MIT License
scipy==1.12.0 # BSD-3-Clause and MIT-like
Run pytest with ./run_tests.sh
. Coverage:
----------- coverage: platform linux, python 3.6.7-final-0 -----------
Name Stmts Miss Cover
--------------------------------------------------------------
artifici_lda/__init__.py 0 0 100%
artifici_lda/data_utils.py 39 0 100%
artifici_lda/lda_service.py 31 0 100%
artifici_lda/logic/__init__.py 0 0 100%
artifici_lda/logic/count_vectorizer.py 9 0 100%
artifici_lda/logic/lda.py 23 7 70%
artifici_lda/logic/letter_splitter.py 36 4 89%
artifici_lda/logic/stemmer.py 60 3 95%
artifici_lda/logic/stop_words_remover.py 61 5 92%
--------------------------------------------------------------
TOTAL 259 19 93%
This project is published under the MIT License (MIT).
Copyright (c) 2018 Artifici online services inc.
Coded by Guillaume Chevalier at Neuraxio Inc.