Draw a dendrogram of similarity between text files.
The similarity is measured in terms of Damerau-Levenshtein edit distance. The distance between two given texts is the number of inserted, deleted, and substituted characters required to turn one text into the other. A smaller value means that the two texts are more similar.
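As an illustration of the distance described above, here is a minimal sketch that counts insertions, deletions, and substitutions with the classic dynamic-programming table. It is not the project's own implementation: the function name `edit_distance` is made up for this example, and any transposition handling that a Damerau-Levenshtein variant may add is left out.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    # prev[j] is the distance between the items of `a` seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]  # turning a[:i] into an empty sequence takes i deletions
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute x with y (free if equal)
            ))
        prev = cur
    return prev[-1]


print(edit_distance("abccfg", "abcccfg"))  # 1: one inserted "c"
print(edit_distance("abccfg", "abdefg"))   # 2: two substituted characters
```

Because the function only compares items for equality, the same code works on lists of words or lines as well as on strings.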
Features:
- Parallel execution: Supports execution on multiple CPU cores, plus JIT compilation by Numba (v1.6+).
- Options in tokenization: By default, texts are compared as sequences of words, extracted by splitting the input text wherever the character type changes. Optionally, you can compare texts line by line, character by character, or token by token as extracted with lexical analyzers of programming languages.
- File-centric search: A function to list files in order of similarity to a given file.
- Diff (Experimental): Diff functionality to show textual differences between files. (This function is provided to check for differences in similarity calculations depending on tokenization.)
pipx install dendro-text
To uninstall,
pipx uninstall dendro-text
To enable JIT compilation by Numba, install it according to the instructions on the Numba website.
To install dendro-text with Numba,
sudo apt install python3-testresources
pipx install dendro-text --preinstall numba
The speedup with Numba was approx. 5x in one example I tried.
If you are doing tasks like investigating files in the dendrogram one by one (as I am doing), you may find the picaf tool useful.
dendro-text <file>...
-t --tokenize Compare texts as tokens of languages indicated by file extensions, using Pygments lexer.
-c --char-by-char Compare texts in a char-by-char manner.
-l --line-by-line Compare texts in a line-by-line manner.
-U --no-uniq-files Do not remove duplicates from the input files.
--prep=PREPROCESSOR Perform preprocessing for each input file.
-m --max-depth=DEPTH Flatten the subtrees (of dendrogram) deeper than this.
-a --ascii-char-tree Draw a tree picture with ASCII characters, not box-drawing characters.
-B --box-drawing-tree-with-fullwidth-space Draw a tree picture with box-drawing characters and fullwidth space.
-s --file-separator=S File separator (default: comma).
-f --field-separator=S Separator of tree picture and file (default: tab).
Option `-a` is for environments (such as the C locale) where box-drawing characters turn into garbled characters.
Option `-B` is to prevent tree pictures from being corrupted in environments where box-drawing characters are treated as fullwidth ones.
-j NUM Parallel execution. Number of worker processes.
--progress Show progress bar with ETA.
-n --neighbors=NUM Pick up NUM (>=1) neighbors of (files similar to) the first file. Drop the other files.
-N --neighbor-list=NUM List NUM neighbors of the first file, in order of increasing distance. `0` for +inf.
-p --pyplot Plot dendrogram with `matplotlib.pyplot`
--pyplot-font-names List font names that can be used in plotting dendrogram.
--pyplot-font=FONTNAME Specify font name in plotting dendrogram.
-d --diff Diff mode (Implies option -U). **Experimental.**
-W --show-words Show list of words extracted from the input file.
- Prepare several text files, each named after its own content.
$ bash
$ for t in ab{c,cc,ccc,cd,de}fg; do echo $t > $t.txt; done
$ ls -1
abcccfg.txt
abccfg.txt
abcdfg.txt
abcfg.txt
abdefg.txt
Here, the content of each file is the same as its filename, e.g.:
$ cat abccfg.txt
abccfg
- Create dendrograms showing file similarity by character-by-character comparison.
$ dendro-text -c *.txt
─┬─┬─┬── abcfg.txt
 │ │ └── abcdfg.txt
 │ └─┬── abccfg.txt
 │   └── abcccfg.txt
 └── abdefg.txt
- List files in order of similarity to a file `abccfg.txt`, with option `-N0`.
$ dendro-text -c -N0 abccfg.txt *.txt
0 abccfg.txt
1 abcccfg.txt
1 abcdfg.txt
1 abcfg.txt
2 abdefg.txt
- Show differences between two files as text, with option `-d` (diff).
Tokens that are only in the first file are indicated by a red background color, and tokens that are only in the second file are indicated by a blue background color.
- Create a dendrogram that ignores the letter `c`, with option `--prep` (preprocessing).
Note that the three files `abcccfg.txt`, `abccfg.txt`, and `abcfg.txt` are now grouped in one node, because they no longer differ.
$ dendro-text -c *.txt --prep 'sed s/c//g'
─┬─┬── abcdfg.txt
 │ └── abcccfg.txt,abccfg.txt,abcfg.txt
 └── abdefg.txt
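To give a feel for how a tree like the ones in the examples above can be derived from pairwise distances, here is a rough sketch of greedy agglomerative clustering. It rests on assumptions: average linkage is chosen only for illustration (which linkage criterion dendro-text uses is not stated here), and the names `edit_distance` and `average_linkage` are hypothetical, not part of the tool.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def average_linkage(items, dist):
    """Repeatedly merge the two closest clusters; return the tree as nested tuples."""
    # Each cluster is (tree, members); the tree of a leaf is the item itself.
    clusters = [(item, [item]) for item in items]

    def linkage(pair):
        (_, xs), (_, ys) = pair
        # Average distance over all cross-cluster pairs (assumed criterion).
        return sum(dist(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=linkage)
        clusters = [c for c in clusters if c is not left and c is not right]
        clusters.append(((left[0], right[0]), left[1] + right[1]))
    return clusters[0][0]


texts = ["abcfg", "abcdfg", "abccfg", "abcccfg", "abdefg"]
print(average_linkage(texts, edit_distance))
# ('abdefg', (('abcfg', 'abcdfg'), ('abccfg', 'abcccfg')))
```

For these five texts the sketch yields the same grouping as the character-by-character example: `abdefg` joins last because its distance to every other text is at least 2.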
The default tokenization method (extracting words from the text) splits the text at each point where the character type changes.
For example, the text "The version of dendro-text is marked as v1.1.1." turns into the following token sequence:
["The", " ", "version", " ", "of", " ", "dendro", "-", "text", " ",
"is", " ", "marked", " ", "as", " ", "v", "1", ".", "1", ".", "1", "."]
Edit distance is measured in token-by-token edits; the edit distance between two texts is increased by one for each token replaced. When you choose the tokenization method by option `-l` or `-c`, the edit distance is measured by lines or characters, i.e., tokens generated by the specified option.
In the case of the default tokenization, i.e., splitting text into words at boundaries of varying character types, the character types are symbols, alphabetic characters, spaces, etc., for C locale characters. As for Unicode characters, the character type is identified by reference to the Unicode block.
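The splitting rule for plain ASCII text can be sketched as follows. This is a simplified illustration, not the tool's code: the four character classes and the names `char_type` and `split_by_char_type` are assumptions, and consecutive characters of the same class (for example, a run of symbols) are merged into one token here, which may differ from what dendro-text does.

```python
from itertools import groupby


def char_type(ch):
    """Coarse character class; an assumed stand-in for the tool's classification."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    return "symbol"


def split_by_char_type(text):
    """Split text wherever the character class changes between neighbors."""
    return ["".join(run) for _, run in groupby(text, key=char_type)]


tokens = split_by_char_type("The version of dendro-text is marked as v1.1.1.")
print(tokens)
# ['The', ' ', 'version', ' ', 'of', ' ', 'dendro', '-', 'text', ' ',
#  'is', ' ', 'marked', ' ', 'as', ' ', 'v', '1', '.', '1', '.', '1', '.']
```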
The enclosed file `Blocks.txt` is the definition of the Unicode Blocks, and was taken from: https://github.com/CNMan/Unicode/blob/master/UCD/Blocks.txt .
A preprocessor (the argument of option `--prep`) is a script or a command line that takes a file as its input and writes the preprocessed content of the file to the standard output.
Multiple preprocessors (preprocessing scripts) can be added by giving the `--prep` option multiple times. In such a case, each preprocessing script will get a temporary file in a temporary directory.
The base name of the temporary file is the same as the original input file, but the directory is not.
For example, in the following command line,
$ dendro-text --prep p1.sh --prep p2.sh t1.txt t2.txt t3.txt
Preprocessing scripts `p1.sh` and `p2.sh` will receive paths such as `some/temp/dir/t1.txt`, `some/temp/dir/t2.txt`, or `some/temp/dir/t3.txt` as their input files.
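A preprocessor can be written in any language. As a sketch, the hypothetical script below (call it `lowercase_prep.py`; the name and the lowercase transformation are made up for illustration) follows the contract described above. It assumes the input file path arrives as the single command-line argument, as in the `sed` example.

```python
#!/usr/bin/env python3
"""Hypothetical preprocessor: prints the input file converted to lowercase."""
import sys


def preprocess(text):
    """The transformation applied to each file; here, conversion to lowercase."""
    return text.lower()


def preprocess_file(path, out):
    """Read the file at `path` and write its preprocessed content to `out`."""
    with open(path, encoding="utf-8") as f:
        out.write(preprocess(f.read()))


if __name__ == "__main__":
    if len(sys.argv) == 2:
        # Assumption: the (possibly temporary) file path is the only argument.
        preprocess_file(sys.argv[1], sys.stdout)
    else:
        print(f"usage: {sys.argv[0]} FILE", file=sys.stderr)
```

Under these assumptions it would be passed as `--prep 'python3 lowercase_prep.py'`. Since the script may be handed a temporary copy, it should rely only on the file's content and base name, not on its directory.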
- The source code of the function `distance_int_list` and part of the function `edit_sequence_int_list` are modified from material in the Wikipedia article "Algorithm Implementation/Strings/Levenshtein distance", which is released under the Creative Commons Attribution-Share-Alike License 3.0.
- The file `Blocks.txt` is released under the Unicode Data Files and Software License.
- All of the other source code is released under the BSD 2-Clause License.
- v2.0.0: The script is renamed to `dendro-text`. Drop Windows support.