- 1 Main Window
- 2 File Area
- 3 Profiler
- 4 Concordancer
- 5 Parallel Concordancer
- 6 Dependency Parser
- 7 Wordlist Generator
- 8 N-gram Generator
- 9 Collocation Extractor
- 10 Colligation Extractor
- 11 Keyword Extractor
- 12 Appendixes
- 13 References
The main window of Wordless is divided into several sections:

1.1 Menu Bar
The Menu Bar resides at the top of the main window.

1.2 Work Area
The Work Area occupies the upper half of the main window, just below the Menu Bar. It is further divided into the Results Area on the left side and the Settings Area on the right side. You can click on the tabs to switch between different modules.

1.3 File Area
The File Area occupies the lower half of the main window, just above the Status Bar.

1.4 Status Bar
The Status Bar resides at the bottom of the main window. You can show or hide the Status Bar by checking or unchecking Menu Bar → Preferences → Show Status Bar.
You can modify the global scaling factor and font settings of the user interface via Menu Bar → Preferences → General → User Interface Settings.
In most cases, the first thing to do in Wordless is to open and select the files to be processed via Menu Bar → File → Open Files/Folder.
Files are loaded, cached, and selected automatically after being added to the File Table. Only selected files are processed by Wordless. You can drag and drop files around the File Table to change their order, which is reflected in the results.
By default, Wordless tries to detect the encoding and language settings of all files for you, but you should double-check and make sure that the settings of each and every file are correct. If you prefer changing file settings manually, you can uncheck Open Files dialog → Auto-detect encodings and/or Open Files dialog → Auto-detect languages. The default file settings can be modified via Menu Bar → Preferences → Settings → Files → Default Settings. Additionally, you need to adjust the Open Files dialog → Tokenized and Open Files dialog → Tagged options of each file according to whether the file has already been tokenized or tagged.
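Conceptually, encoding and language detection work like the following sketch. Wordless has its own built-in detectors; the charset-normalizer and langdetect libraries are used here purely as illustrative stand-ins (an assumption, not necessarily what Wordless uses internally):

```python
# A minimal sketch of what "Auto-detect encodings" and "Auto-detect languages"
# do conceptually. The libraries below are stand-ins chosen for illustration,
# not necessarily the ones Wordless uses internally.
from charset_normalizer import from_bytes
from langdetect import detect

def detect_file_settings(path):
    with open(path, 'rb') as f:
        raw = f.read()

    match = from_bytes(raw).best()                    # best-guess encoding
    encoding = match.encoding if match else 'utf-8'
    text = raw.decode(encoding, errors='replace')
    language = detect(text)                           # ISO 639-1 code, e.g. 'en'

    return encoding, language

# encoding, language = detect_file_settings('corpus.txt')  # hypothetical file
```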
2.1 Menu Bar → File

2.1.1 Open Files
Open the Open Files dialog to add file(s) to the File Table.

2.1.2 Reopen Closed Files
Add the file(s) that were last closed back to the File Table.
* The history of all closed files is erased when you exit Wordless.

2.1.3 Select All
Select all files in the File Table.

2.1.4 Deselect All
Deselect all files in the File Table.

2.1.5 Invert Selection
Select the files that are not currently selected and deselect the files that are currently selected in the File Table.

2.1.6 Close Selected
Remove the files that are currently selected from the File Table.

2.1.7 Close All
Remove all files from the File Table.

2.2 Open Files dialog

2.2.1 Add files
Add one or more files to the table.
* You can use the Ctrl key (Command key on macOS) and/or the Shift key to select multiple files.

2.2.2 Add folder
Add all files in a folder to the table. By default, all files in the chosen folder and its subfolders (and subfolders of subfolders, and so on) are added to the table. If you do not want to add files in subfolders to the table, you can uncheck Include files in subfolders.

2.2.3 Remove files
Remove the selected files from the table.

2.2.4 Clear table
Remove all files from the table.

2.2.5 Auto-detect encodings
Auto-detect the encodings of all files when they are added to the table. If the detection results are incorrect, you can manually modify the encoding settings in the table.

2.2.6 Auto-detect languages
Auto-detect the languages of all files when they are added to the table. If the detection results are incorrect, you can manually modify the language settings in the table.

2.2.7 Include files in subfolders
When adding a folder to the table, recursively add all files in the chosen folder and its subfolders (and subfolders of subfolders, and so on) to the table.
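The effect of Include files in subfolders amounts to the difference between a recursive and a non-recursive directory listing, as in this minimal sketch (the folder name is hypothetical):

```python
# "Include files in subfolders" checked vs. unchecked, conceptually:
# rglob recurses into subfolders (and subfolders of subfolders), glob does not.
from pathlib import Path

folder = Path('corpora')  # hypothetical folder

files_recursive = [p for p in folder.rglob('*') if p.is_file()]  # checked
files_top_level = [p for p in folder.glob('*') if p.is_file()]   # unchecked
```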
Note
Renamed from Overview to Profiler in Wordless 2.2.0
In Profiler, you can check and compare general linguistic features of different files.
All statistics are grouped into five tables for better readability: Readability, Counts, Lexical Density/Diversity, Lengths, and Length Breakdown.

3.1.1 Readability
Readability statistics of each file, calculated according to the different readability tests used. See section 12.4.1 Readability Formulas for more details.

3.1.2 Counts

3.1.2.1 Count of Paragraphs
The number of paragraphs in each file. Each line in the file is counted as one paragraph. Blank lines and lines containing only spaces, tabs, and other invisible characters are not counted.

3.1.2.2 Count of Paragraphs %
The percentage of the number of paragraphs in each file out of the total number of paragraphs in all files.

3.1.2.3 Count of Sentences
The number of sentences in each file. Wordless automatically applies the built-in sentence tokenizer according to the language of each file to calculate the number of sentences in each file. You can modify sentence tokenizer settings via Menu Bar → Preferences → Settings → Sentence Tokenization → Sentence Tokenizer Settings.

3.1.2.4 Count of Sentences %
The percentage of the number of sentences in each file out of the total number of sentences in all files.

3.1.2.5 Count of Sentence Segments
The number of sentence segments in each file. Each part of a sentence ending with one or more consecutive terminal punctuation marks (as per the Unicode Standard) is counted as one sentence segment. See the Unicode Standard for the full list of terminal punctuation marks.

3.1.2.6 Count of Sentence Segments %
The percentage of the number of sentence segments in each file out of the total number of sentence segments in all files.

3.1.2.7 Count of Tokens
The number of tokens in each file. Wordless automatically applies the built-in word tokenizer according to the language of each file to calculate the number of tokens in each file. You can modify word tokenizer settings via Menu Bar → Preferences → Settings → Word Tokenization → Word Tokenizer Settings. You can specify what should be counted as a "token" via Token Settings in the Settings Area.

3.1.2.8 Count of Tokens %
The percentage of the number of tokens in each file out of the total number of tokens in all files.

3.1.2.9 Count of Types
The number of token types in each file.

3.1.2.10 Count of Types %
The percentage of the number of token types in each file out of the total number of token types in all files.

3.1.2.11 Count of Syllables
The number of syllables in each file. Wordless automatically applies the built-in syllable tokenizer according to the language of each file to calculate the number of syllables in each file. You can modify syllable tokenizer settings via Menu Bar → Preferences → Settings → Syllable Tokenization → Syllable Tokenizer Settings.

3.1.2.12 Count of Syllables %
The percentage of the number of syllables in each file out of the total number of syllables in all files.

3.1.2.13 Count of Characters
The number of single characters in each file. Spaces, tabs, and all other invisible characters are not counted.

3.1.2.14 Count of Characters %
The percentage of the number of characters in each file out of the total number of characters in all files.
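To make the Counts table concrete, here is a rough sketch that approximates several of the counts above for a single English text, using NLTK's general-purpose tokenizers as stand-ins for Wordless's built-in, per-language tokenizers (the file name is hypothetical, and exact counts will differ from Wordless's):

```python
# A rough approximation of the Counts table for one English text.
import nltk

nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # newer NLTK versions use punkt_tab

text = open('corpus.txt', encoding='utf-8').read()  # hypothetical file

# Count of Paragraphs: one non-blank line = one paragraph
paragraphs = [line for line in text.splitlines() if line.strip()]

sentences = nltk.sent_tokenize(text)
tokens = nltk.word_tokenize(text)
types = set(token.lower() for token in tokens)

# Count of Characters: spaces, tabs and other invisible characters excluded
num_chars = sum(1 for char in text if not char.isspace())

print(len(paragraphs), len(sentences), len(tokens), len(types), num_chars)
```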
3.1.3 Lexical Density/Diversity
Statistics of lexical density/diversity, which reflect the extent to which the vocabulary used in each file varies. See section 12.4.2 Indicators of Lexical Density/Diversity for more details.

3.1.4 Lengths

3.1.4.1 Paragraph Length in Sentences / Sentence Segments / Tokens (Mean)
The average value of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.2 Paragraph Length in Sentences / Sentence Segments / Tokens (Standard Deviation)
The standard deviation of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.3 Paragraph Length in Sentences / Sentence Segments / Tokens (Variance)
The variance of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.4 Paragraph Length in Sentences / Sentence Segments / Tokens (Minimum)
The minimum of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.5 Paragraph Length in Sentences / Sentence Segments / Tokens (25th Percentile)
The 25th percentile of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.6 Paragraph Length in Sentences / Sentence Segments / Tokens (Median)
The median of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.7 Paragraph Length in Sentences / Sentence Segments / Tokens (75th Percentile)
The 75th percentile of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.8 Paragraph Length in Sentences / Sentence Segments / Tokens (Maximum)
The maximum of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.9 Paragraph Length in Sentences / Sentence Segments / Tokens (Range)
The range of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.10 Paragraph Length in Sentences / Sentence Segments / Tokens (Interquartile Range)
The interquartile range of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.11 Paragraph Length in Sentences / Sentence Segments / Tokens (Modes)
The mode(s) of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.12 Sentence / Sentence Segment Length in Tokens (Mean)
The average value of sentence / sentence segment lengths expressed in tokens.

3.1.4.13 Sentence / Sentence Segment Length in Tokens (Standard Deviation)
The standard deviation of sentence / sentence segment lengths expressed in tokens.

3.1.4.14 Sentence / Sentence Segment Length in Tokens (Variance)
The variance of sentence / sentence segment lengths expressed in tokens.

3.1.4.15 Sentence / Sentence Segment Length in Tokens (Minimum)
The minimum of sentence / sentence segment lengths expressed in tokens.

3.1.4.16 Sentence / Sentence Segment Length in Tokens (25th Percentile)
The 25th percentile of sentence / sentence segment lengths expressed in tokens.

3.1.4.17 Sentence / Sentence Segment Length in Tokens (Median)
The median of sentence / sentence segment lengths expressed in tokens.

3.1.4.18 Sentence / Sentence Segment Length in Tokens (75th Percentile)
The 75th percentile of sentence / sentence segment lengths expressed in tokens.

3.1.4.19 Sentence / Sentence Segment Length in Tokens (Maximum)
The maximum of sentence / sentence segment lengths expressed in tokens.

3.1.4.20 Sentence / Sentence Segment Length in Tokens (Range)
The range of sentence / sentence segment lengths expressed in tokens.

3.1.4.21 Sentence / Sentence Segment Length in Tokens (Interquartile Range)
The interquartile range of sentence / sentence segment lengths expressed in tokens.

3.1.4.22 Sentence / Sentence Segment Length in Tokens (Modes)
The mode(s) of sentence / sentence segment lengths expressed in tokens.

3.1.4.23 Token/Type Length in Syllables/Characters (Mean)
The average value of token / token type lengths expressed in syllables/characters.

3.1.4.24 Token/Type Length in Syllables/Characters (Standard Deviation)
The standard deviation of token / token type lengths expressed in syllables/characters.

3.1.4.25 Token/Type Length in Syllables/Characters (Variance)
The variance of token / token type lengths expressed in syllables/characters.

3.1.4.26 Token/Type Length in Syllables/Characters (Minimum)
The minimum of token / token type lengths expressed in syllables/characters.

3.1.4.27 Token/Type Length in Syllables/Characters (25th Percentile)
The 25th percentile of token / token type lengths expressed in syllables/characters.

3.1.4.28 Token/Type Length in Syllables/Characters (Median)
The median of token / token type lengths expressed in syllables/characters.

3.1.4.29 Token/Type Length in Syllables/Characters (75th Percentile)
The 75th percentile of token / token type lengths expressed in syllables/characters.

3.1.4.30 Token/Type Length in Syllables/Characters (Maximum)
The maximum of token / token type lengths expressed in syllables/characters.

3.1.4.31 Token/Type Length in Syllables/Characters (Range)
The range of token / token type lengths expressed in syllables/characters.

3.1.4.32 Token/Type Length in Syllables/Characters (Interquartile Range)
The interquartile range of token / token type lengths expressed in syllables/characters.

3.1.4.33 Token/Type Length in Syllables/Characters (Modes)
The mode(s) of token / token type lengths expressed in syllables/characters.

3.1.4.34 Syllable Length in Characters (Mean)
The average value of syllable lengths expressed in characters.

3.1.4.35 Syllable Length in Characters (Standard Deviation)
The standard deviation of syllable lengths expressed in characters.

3.1.4.36 Syllable Length in Characters (Variance)
The variance of syllable lengths expressed in characters.

3.1.4.37 Syllable Length in Characters (Minimum)
The minimum of syllable lengths expressed in characters.

3.1.4.38 Syllable Length in Characters (25th Percentile)
The 25th percentile of syllable lengths expressed in characters.

3.1.4.39 Syllable Length in Characters (Median)
The median of syllable lengths expressed in characters.

3.1.4.40 Syllable Length in Characters (75th Percentile)
The 75th percentile of syllable lengths expressed in characters.

3.1.4.41 Syllable Length in Characters (Maximum)
The maximum of syllable lengths expressed in characters.

3.1.4.42 Syllable Length in Characters (Range)
The range of syllable lengths expressed in characters.

3.1.4.43 Syllable Length in Characters (Interquartile Range)
The interquartile range of syllable lengths expressed in characters.

3.1.4.44 Syllable Length in Characters (Modes)
The mode(s) of syllable lengths expressed in characters.
3.1.5 Length Breakdown

3.1.5.1 Count of n-token-long Sentences / Sentence Segments
The number of n-token-long sentences / sentence segments, where n = 1, 2, 3, etc.

3.1.5.2 Count of n-token-long Sentences / Sentence Segments %
The percentage of the number of n-token-long sentences / sentence segments in each file out of the total number of n-token-long sentences / sentence segments in all files, where n = 1, 2, 3, etc.

3.1.5.3 Count of n-syllable-long Tokens
The number of n-syllable-long tokens, where n = 1, 2, 3, etc.

3.1.5.4 Count of n-syllable-long Tokens %
The percentage of the number of n-syllable-long tokens in each file out of the total number of n-syllable-long tokens in all files, where n = 1, 2, 3, etc.

3.1.5.5 Count of n-character-long Tokens
The number of n-character-long tokens, where n = 1, 2, 3, etc.

3.1.5.6 Count of n-character-long Tokens %
The percentage of the number of n-character-long tokens in each file out of the total number of n-character-long tokens in all files, where n = 1, 2, 3, etc.
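The Length Breakdown counts are essentially histograms over lengths. A minimal sketch for Count of n-character-long Tokens, using a hypothetical token list:

```python
# Histogram of token lengths in characters (Count of n-character-long Tokens).
from collections import Counter

tokens = ['the', 'quick', 'brown', 'fox', 'jumps']  # hypothetical token list

counts = Counter(len(token) for token in tokens)
for n in sorted(counts):
    print(f'{n}-character-long tokens: {counts[n]}')
```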
In Concordancer, you can search for tokens in different files and generate concordance lines. You can adjust settings for data generation via Generation Settings.
After the concordance lines are generated and displayed in the table, you can sort the results by clicking Sort Results or search the Data Table for parts that might be of interest to you by clicking Search in results. Highlight colors for sorting can be modified via Menu Bar → Preferences → Settings → Tables → Concordancer → Sorting.
You can generate concordance plots for all search terms. You can modify the settings for the generated figure via Figure Settings.
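The Left / Node / Right columns described below follow the classic key-word-in-context (KWIC) layout. A minimal sketch of how such lines can be generated from a token list (illustrative only, not Wordless's actual implementation):

```python
# A minimal KWIC sketch: for every hit of the search term, collect up to
# `width` tokens of left and right context, mirroring Left / Node / Right.
def concordance(tokens, search_term, width=10):
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == search_term.lower():
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

tokens = 'the cat sat on the mat and the cat slept'.split()
for left, node, right in concordance(tokens, 'cat', width=3):
    print(f'{left:>20} | {node} | {right}')
```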
4.1 Left
The context before each search term, which displays 10 tokens to the left of the Node by default. You can change this behavior via Generation Settings.

4.2 Node
The search term(s) specified in Search Settings → Search Term.

4.3 Right
The context after each search term, which displays 10 tokens to the right of the Node by default. You can change this behavior via Generation Settings.

4.4 Sentiment
The sentiment of the Node combined with its context (Left and Right).

4.5 Token No.
The position of the first token of the Node in each file.

4.6 Token No. %
The percentage of the position of the first token of the Node in each file.

4.7 Sentence Segment No.
The position of the sentence segment where the Node is found in each file.

4.8 Sentence Segment No. %
The percentage of the position of the sentence segment where the Node is found in each file.

4.9 Sentence No.
The position of the sentence where the Node is found in each file.

4.10 Sentence No. %
The percentage of the position of the sentence where the Node is found in each file.

4.11 Paragraph No.
The position of the paragraph where the Node is found in each file.

4.12 Paragraph No. %
The percentage of the position of the paragraph where the Node is found in each file.

4.13 File
The name of the file where the Node is found.
Note
- Added in Wordless 2.0.0
- Renamed from Concordancer (Parallel Mode) to Parallel Concordancer in Wordless 2.2.0
In Parallel Concordancer, you can search for tokens in parallel corpora and generate parallel concordance lines. You may leave Search Settings → Search Term blank so as to search for instances of additions and deletions.
You can search the Data Table for parts that might be of interest to you by clicking Search in results.

5.1 Parallel Unit No.
The position of the alignment unit (paragraph) where the search term is found.

5.2 Parallel Unit No. %
The percentage of the position of the alignment unit (paragraph) where the search term is found.

5.3 Parallel Units
The parallel unit (paragraph) where the search term is found in each file. Highlight colors for search terms can be modified via Menu Bar → Preferences → Settings → Tables → Parallel Concordancer → Highlight Color Settings.
Note
Added in Wordless 3.0.0
In Dependency Parser, you can search for all dependency relations associated with different tokens and calculate their dependency lengths (distances).
You can filter the results by clicking Filter results or search the Data Table for parts that might be of interest to you by clicking Search in results.
You can select lines in the Results Area and then click Generate Figure to show dependency graphs for all selected sentences. You can modify the settings for the generated figure via Figure Settings and decide how the figures should be displayed.
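A sketch of the dependency length defined in sections 6.3 and 6.4 below, using spaCy as an illustrative parser (Wordless's own parsing pipeline may differ; the model name assumes en_core_web_sm is installed):

```python
# Dependency length: head position minus dependent position, so the value is
# positive when the head follows the dependent and negative when it precedes.
import spacy

nlp = spacy.load('en_core_web_sm')  # assumes this model is installed
doc = nlp('The quick brown fox jumps over the lazy dog.')

for token in doc:
    if token.head is not token:  # skip the root, which is its own head
        length = token.head.i - token.i
        print(token.head.text, token.text, token.dep_, length, abs(length))
```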
6.1 Head
The token functioning as the head in the dependency structure.

6.2 Dependent
The token functioning as the dependent in the dependency structure.

6.3 Dependency Length
The dependency length (distance) between the head and the dependent in the dependency structure. The dependency length is positive if the head follows the dependent and negative if the head precedes the dependent.

6.4 Dependency Length (Absolute)
The absolute value of the dependency length (distance) between the head and the dependent in the dependency structure. The absolute dependency length is always positive.

6.5 Sentence
The sentence where the dependency structure is found. Highlight colors for the head and the dependent can be modified via Menu Bar → Preferences → Settings → Tables → Dependency Parser → Highlight Color Settings.

6.6 Sentence No.
The position of the sentence where the dependency structure is found.

6.7 Sentence No. %
The percentage of the position of the sentence where the dependency structure is found.

6.8 File
The name of the file where the dependency structure is found.
Note
Renamed from Wordlist to Wordlist Generator in Wordless 2.2.0
In Wordlist Generator, you can generate wordlists for different files and calculate the raw frequency, relative frequency, dispersion, and adjusted frequency of each token. You can disable the calculation of dispersion and/or adjusted frequency by setting Generation Settings → Measure of Dispersion / Measure of Adjusted Frequency to None.
You can filter the results by clicking Filter results or search the Data Table for parts that might be of interest to you by clicking Search in results.
You can generate line charts or word clouds for wordlists using any statistic. You can modify the settings for the generated figure via Figure Settings.
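At its core, a wordlist is a frequency table. A minimal sketch of raw and relative frequencies over hypothetical tokenized files (dispersion and adjusted frequency are covered in section 12.4.3):

```python
# Raw frequency via collections.Counter, relative frequency normalized by the
# token count of each file.
from collections import Counter

files = {
    'file_a.txt': 'the cat sat on the mat'.split(),  # hypothetical tokens
    'file_b.txt': 'the dog sat on the log'.split(),
}

for name, tokens in files.items():
    freqs = Counter(tokens)
    total = len(tokens)
    for token, freq in freqs.most_common():
        print(name, token, freq, freq / total)
```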
7.1 Rank
The rank of the token, sorted by its frequency in the first file in descending order (by default). You can sort the results again by clicking the column headers. You can use continuous numbering after tied ranks (e.g. 1/1/1/2/2/3 instead of 1/1/1/4/4/6) by checking Menu Bar → Preferences → Settings → Tables → Rank Settings → Continue numbering after ties.

7.2 Token
You can specify what should be counted as a "token" via Token Settings.

7.3 Syllabification
The syllabified form of each token. If the token happens to exist in the vocabulary of multiple languages, all syllabified forms with their applicable languages will be listed.
If there is no syllable tokenization support for the language where the token is found, "No language support" is displayed instead. To check which languages have syllable tokenization support, please refer to section 12.1 Supported Languages.

7.4 Frequency
The number of occurrences of the token in each file.

7.5 Dispersion
The dispersion of the token in each file. You can change the measure of dispersion used via Generation Settings → Measure of Dispersion. See section 12.4.3 Measures of Dispersion & Adjusted Frequency for more details.

7.6 Adjusted Frequency
The adjusted frequency of the token in each file. You can change the measure of adjusted frequency used via Generation Settings → Measure of Adjusted Frequency. See section 12.4.3 Measures of Dispersion & Adjusted Frequency for more details.

7.7 Number of Files Found
The number of files in which the token appears at least once.

7.8 Number of Files Found %
The percentage of the number of files in which the token appears at least once out of the total number of files that are currently selected.
Note
Renamed from N-gram to N-gram Generator in Wordless 2.2.0
In N-gram Generator, you can search for n-grams (consecutive tokens) or skip-grams (non-consecutive tokens) in different files, compute the raw frequency and relative frequency of each n-gram/skip-gram, and calculate the dispersion and adjusted frequency of each n-gram/skip-gram using different measures. You can adjust the settings for the generated results via Generation Settings. You can disable the calculation of dispersion and/or adjusted frequency by setting Generation Settings → Measure of Dispersion / Measure of Adjusted Frequency to None. To allow skip-grams in the results, check Generation Settings → Allow skipped tokens and modify the settings. You can also set constraints on the position of search terms in all n-grams via Search Settings → Search Term Position.
You can filter the results by clicking Filter results or search the Data Table for parts that might be of interest to you by clicking Search in results.
You can generate line charts or word clouds for n-grams using any statistic. You can modify the settings for the generated figure via Figure Settings.
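A minimal sketch of the difference between n-grams and skip-grams, using NLTK's ngrams and skipgrams utilities (illustrative only; Wordless's own extraction settings are richer):

```python
# ngrams(tokens, n) yields contiguous n-grams; skipgrams(tokens, n, k) also
# yields n-grams with up to k skipped tokens inside each one.
from nltk.util import ngrams, skipgrams

tokens = 'the quick brown fox jumps'.split()

print(list(ngrams(tokens, 2)))        # contiguous bigrams
print(list(skipgrams(tokens, 2, 1)))  # bigrams with at most 1 skipped token
```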
8.1 Rank
The rank of the n-gram, sorted by its frequency in the first file in descending order (by default). You can sort the results again by clicking the column headers. You can use continuous numbering after tied ranks (e.g. 1/1/1/2/2/3 instead of 1/1/1/4/4/6) by checking Menu Bar → Preferences → Settings → Tables → Rank Settings → Continue numbering after ties.

8.2 N-gram
You can specify what should be counted as an "n-gram" via Token Settings.

8.3 Frequency
The number of occurrences of the n-gram in each file.

8.4 Dispersion
The dispersion of the n-gram in each file. You can change the measure of dispersion used via Generation Settings → Measure of Dispersion. See section 12.4.3 Measures of Dispersion & Adjusted Frequency for more details.

8.5 Adjusted Frequency
The adjusted frequency of the n-gram in each file. You can change the measure of adjusted frequency used via Generation Settings → Measure of Adjusted Frequency. See section 12.4.3 Measures of Dispersion & Adjusted Frequency for more details.

8.6 Number of Files Found
The number of files in which the n-gram appears at least once.

8.7 Number of Files Found %
The percentage of the number of files in which the n-gram appears at least once out of the total number of files that are currently selected.
Note
Renamed from Collocation to Collocation Extractor in Wordless 2.2.0
In Collocation Extractor, you can search for patterns of collocation (tokens that co-occur more often than would be expected by chance) within a given collocational window (from 5 words to the left to 5 words to the right by default), conduct different tests of statistical significance on each pair of collocates, and calculate the Bayes factor and effect size of each pair using different measures. You can adjust the settings for the generated results via Generation Settings. You can disable the calculation of statistical significance, Bayes factor, and/or effect size by setting Generation Settings → Test of Statistical Significance / Measure of Bayes Factor / Measure of Effect Size to None.
You can filter the results by clicking Filter results or search the Data Table for parts that might be of interest to you by clicking Search in results.
You can generate line charts, word clouds, and network graphs for patterns of collocation using any statistic. You can modify the settings for the generated figure via Figure Settings.
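A minimal sketch of the windowed co-occurrence counting that underlies the Ln/Rn columns below, with a hypothetical token list and a 5-token window on each side (the same logic applies to Colligation Extractor, with parts of speech in place of tokens):

```python
# For each hit of the node, count collocates at positions L5...L1, R1...R5.
from collections import Counter, defaultdict

tokens = 'the cat sat on the mat because the cat was tired'.split()
node = 'cat'
window = 5

counts = defaultdict(Counter)  # counts[collocate][position]

for i, token in enumerate(tokens):
    if token != node:
        continue
    for offset in range(-window, window + 1):
        j = i + offset
        if offset == 0 or j < 0 or j >= len(tokens):
            continue
        position = f'L{-offset}' if offset < 0 else f'R{offset}'
        counts[tokens[j]][position] += 1

for collocate, positions in counts.items():
    # per-position counts (Ln...Rn) and their total (Frequency)
    print(collocate, dict(positions), sum(positions.values()))
```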
9.1 Rank
The rank of the collocating token, sorted by the p-value of the significance test conducted on the node and the collocating token in the first file in ascending order (by default). You can sort the results again by clicking the column headers. You can use continuous numbering after tied ranks (e.g. 1/1/1/2/2/3 instead of 1/1/1/4/4/6) by checking Menu Bar → Preferences → Settings → Tables → Rank Settings → Continue numbering after ties.

9.2 Node
The search term. You can specify what should be counted as a "token" via Token Settings.

9.3 Collocate
The collocating token. You can specify what should be counted as a "token" via Token Settings.

9.4 Ln, ..., L3, L2, L1, R1, R2, R3, ..., Rn
The number of co-occurrences of the node and the collocating token with the collocating token at the given position in each file.

9.5 Frequency
The total number of co-occurrences of the node and the collocating token with the collocating token at all possible positions in each file.

9.6 Test Statistic
The test statistic of the significance test conducted on the node and the collocating token in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.
Please note that the test statistic is not available for some tests of statistical significance.

9.7 p-value
The p-value of the significance test conducted on the node and the collocating token in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

9.8 Bayes Factor
The Bayes factor of the node and the collocating token in each file. You can change the measure of Bayes factor used via Generation Settings → Measure of Bayes Factor. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

9.9 Effect Size
The effect size of the node and the collocating token in each file. You can change the measure of effect size used via Generation Settings → Measure of Effect Size. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

9.10 Number of Files Found
The number of files in which the node and the collocating token co-occur at least once.

9.11 Number of Files Found %
The percentage of the number of files in which the node and the collocating token co-occur at least once out of the total number of files that are currently selected.
Note
Renamed from Colligation to Colligation Extractor in Wordless 2.2.0
In Colligation Extractor, you can search for patterns of colligation (parts of speech that co-occur more often than would be expected by chance) within a given collocational window (from 5 words to the left to 5 words to the right by default), conduct different tests of statistical significance on each pair of parts of speech, and calculate the Bayes factor and effect size of each pair using different measures. You can adjust the settings for the generated data via Generation Settings. You can disable the calculation of statistical significance, Bayes factor, and/or effect size by setting Generation Settings → Test of Statistical Significance / Measure of Bayes Factor / Measure of Effect Size to None.
Wordless automatically applies its built-in part-of-speech tagger, according to the language of each file, to every file that has not already been part-of-speech-tagged. If part-of-speech tagging is not supported for the given language, you should provide a file that has already been part-of-speech-tagged and make sure that the correct Text Type has been set on the file.
You can filter the results by clicking Filter results or search the Data Table for parts that might be of interest to you by clicking Search in results.
You can generate line charts or word clouds for patterns of colligation using any statistic. You can modify the settings for the generated figure via Figure Settings.
10.1 Rank
The rank of the collocating part of speech, sorted by the p-value of the significance test conducted on the node and the collocating part of speech in the first file in ascending order (by default). You can sort the results again by clicking the column headers. You can use continuous numbering after tied ranks (e.g. 1/1/1/2/2/3 instead of 1/1/1/4/4/6) by checking Menu Bar → Preferences → Settings → Tables → Rank Settings → Continue numbering after ties.

10.2 Node
The search term. You can specify what should be counted as a "token" via Token Settings.

10.3 Collocate
The collocating part of speech. You can specify what should be counted as a "token" via Token Settings.

10.4 Ln, ..., L3, L2, L1, R1, R2, R3, ..., Rn
The number of co-occurrences of the node and the collocating part of speech with the collocating part of speech at the given position in each file.

10.5 Frequency
The total number of co-occurrences of the node and the collocating part of speech with the collocating part of speech at all possible positions in each file.

10.6 Test Statistic
The test statistic of the significance test conducted on the node and the collocating part of speech in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.
Please note that the test statistic is not available for some tests of statistical significance.

10.7 p-value
The p-value of the significance test conducted on the node and the collocating part of speech in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

10.8 Bayes Factor
The Bayes factor of the node and the collocating part of speech in each file. You can change the measure of Bayes factor used via Generation Settings → Measure of Bayes Factor. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

10.9 Effect Size
The effect size of the node and the collocating part of speech in each file. You can change the measure of effect size used via Generation Settings → Measure of Effect Size. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

10.10 Number of Files Found
The number of files in which the node and the collocating part of speech co-occur at least once.

10.11 Number of Files Found %
The percentage of the number of files in which the node and the collocating part of speech co-occur at least once out of the total number of files that are currently selected.
Note
Renamed from Keyword to Keyword Extractor in Wordless 2.2.0
In Keyword Extractor, you can search for potential keywords (tokens whose frequencies in the observed files are far higher or far lower than in the reference file) in different files given a reference corpus, conduct different tests of statistical significance on each keyword, and calculate the Bayes factor and effect size of each keyword using different measures. You can adjust the settings for the generated data via Generation Settings. You can disable the calculation of statistical significance, Bayes factor, and/or effect size by setting Generation Settings → Test of Statistical Significance / Measure of Bayes Factor / Measure of Effect Size to None.
You can filter the results by clicking Filter results or search the Data Table for parts that might be of interest to you by clicking Search in results.
You can generate line charts or word clouds for keywords using any statistic. You can modify the settings for the generated figure via Figure Settings.
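A minimal sketch of the raw inputs to keyword extraction: per-token frequencies in a hypothetical observed file versus a hypothetical reference file, from which the significance tests and effect sizes described in section 12.4.4 are then computed:

```python
# Raw and relative frequencies in observed vs. reference tokens; these four
# numbers per token feed the contingency tables used by the keyness tests.
from collections import Counter

observed = 'poison potion poison cauldron the the the'.split()  # hypothetical
reference = 'the cat sat on the mat the end'.split()             # hypothetical

freq_obs, freq_ref = Counter(observed), Counter(reference)
n_obs, n_ref = len(observed), len(reference)

for token in freq_obs:
    print(token,
          freq_obs[token], freq_ref[token],          # raw frequencies
          freq_obs[token] / n_obs,                   # relative (observed)
          freq_ref[token] / n_ref)                   # relative (reference)
```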
11.1 Rank
The rank of the keyword, sorted by the p-value of the significance test conducted on the keyword in the first file in ascending order (by default). You can sort the results again by clicking the column headers. You can use continuous numbering after tied ranks (e.g. 1/1/1/2/2/3 instead of 1/1/1/4/4/6) by checking Menu Bar → Preferences → Settings → Tables → Rank Settings → Continue numbering after ties.

11.2 Keyword
The potential keyword. You can specify what should be counted as a "token" via Token Settings.

11.3 Frequency (in Reference File)
The number of occurrences of the keyword in the reference file.

11.4 Frequency (in Observed Files)
The number of occurrences of the keyword in each observed file.

11.5 Test Statistic
The test statistic of the significance test conducted on the keyword in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.
Please note that the test statistic is not available for some tests of statistical significance.

11.6 p-value
The p-value of the significance test conducted on the keyword in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

11.7 Bayes Factor
The Bayes factor of the keyword in each file. You can change the measure of Bayes factor used via Generation Settings → Measure of Bayes Factor. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

11.8 Effect Size
The effect size of the keyword in each file. You can change the measure of effect size used via Generation Settings → Measure of Effect Size. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

11.9 Number of Files Found
The number of files in which the keyword appears at least once.

11.10 Number of Files Found %
The percentage of the number of files in which the keyword appears at least once out of the total number of files that are currently selected.
Language | Sentence Tokenization | Word Tokenization | Syllable Tokenization | Part-of-speech Tagging | Lemmatization | Stop Word List | Dependency Parsing | Sentiment Analysis |
---|---|---|---|---|---|---|---|---|
Afrikaans | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Albanian | ⭕️ | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✖️ | ✔ |
Amharic | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Arabic | ✔ | ✔ | ✖️ | ✔ | ✔ | ✔ | ✔ | ✔ |
Armenian (Classical) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Armenian (Eastern) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Armenian (Western) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Assamese | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Asturian | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔ | ✖️ | ✖️ | ✖️ |
Azerbaijani | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✔ | ✖️ | ✔ |
Basque | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Belarusian | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Bengali | ⭕️ | ✔ | ✖️ | ✖️ | ✔ | ✔ | ✖️ | ✔ |
Bulgarian | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Burmese | ✔ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Buryat (Russia) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Catalan | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Chinese (Classical) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Chinese (Simplified) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✔ | ✔ | ✔ |
Chinese (Traditional) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✔ | ✔ | ✔ |
Church Slavonic (Old) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Coptic | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Croatian | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Czech | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Danish | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Dutch | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
English (Middle) | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔ | ✖️ | ✖️ | ✖️ |
English (Old) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
English (United Kingdom) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
English (United States) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Erzya | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Esperanto | ⭕️ | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Estonian | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Faroese | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✖️ | ✔ | ✖️ |
Finnish | ✔ | ✔ | ✖️ | ✔ | ✔ | ✔ | ✔ | ✔ |
French | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
French (Old) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Galician | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Georgian | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔ | ✖️ | ✖️ | ✔ |
German (Austria) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
German (Germany) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
German (Switzerland) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Gothic | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Greek (Ancient) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Greek (Modern) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Gujarati | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Hebrew (Ancient) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Hebrew (Modern) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✔ | ✔ | ✔ |
Hindi | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Hungarian | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Icelandic | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Indonesian | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Irish | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Italian | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Japanese | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Kannada | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Kazakh | ✔ | ✔ | ✖️ | ✔ | ✔ | ✔ | ✔ | ✔ |
Khmer | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✖️ | ✖️ | ✔ |
Korean | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Kurdish (Kurmanji) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Kyrgyz | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Lao | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✔ | ✖️ | ✔ |
Latin | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Latvian | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Ligurian | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Lithuanian | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Luganda | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Luxembourgish | ⭕️ | ✔ | ✖️ | ✖️ | ✔ | ✖️ | ✖️ | ✔ |
Macedonian | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Malay | ⭕️ | ✔ | ✖️ | ✖️ | ✔ | ✖️ | ✖️ | ✔ |
Malayalam | ✔ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Maltese | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✖️ | ✔ | ✔ |
Manx | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Marathi | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Meitei (Meitei script) | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Mongolian | ⭕️ | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Nepali | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✔ | ✖️ | ✔ |
Nigerian Pidgin | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Norwegian (Bokmål) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Norwegian (Nynorsk) | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Odia | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Persian | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Polish | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Pomak | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Portuguese (Brazil) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Portuguese (Portugal) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Punjabi (Gurmukhi script) | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Romanian | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Russian | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Russian (Old) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Sámi (Northern) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Sanskrit | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Scottish Gaelic | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Serbian (Cyrillic script) | ⭕️ | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✖️ | ✔ |
Serbian (Latin script) | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Sindhi | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✖️ | ✖️ | ✔ |
Sinhala | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Slovak | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Slovene | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Sorbian (Lower) | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ |
Sorbian (Upper) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Spanish | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Swahili | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔ | ✖️ | ✖️ | ✔ |
Swedish | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Tagalog | ⭕️ | ✔ | ✖️ | ✖️ | ✔ | ✖️ | ✖️ | ✔ |
Tajik | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✔ | ✖️ | ✔ |
Tamil | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Tatar | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Telugu | ✔ | ✔ | ✔ | ✔ | ✖️ | ✖️ | ✔ | ✔ |
Tetun (Dili) | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ |
Thai | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✔ |
Tibetan | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✖️ | ✖️ |
Tigrinya | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Tswana | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ |
Turkish | ✔ | ✔ | ✖️ | ✔ | ✔ | ✔ | ✔ | ✔ |
Ukrainian | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Urdu | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Uyghur | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Vietnamese | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✖️ | ✔ | ✔ |
Welsh | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Wolof | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Yoruba | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Zulu | ⭕️ | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Other languages | ⭕️ | ⭕️ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ |
Note
✔: Supported
⭕️: Supported but falls back to the default English (United States) tokenizer
✖️: Not supported
File Type | File Extensions | Remarks |
---|---|---|
CSV files¹ | *.csv | |
Excel workbooks¹² | *.xlsx | Legacy Microsoft 97-2003 Excel workbooks (*.xls) are not supported. |
HTML pages¹² | *.htm, *.html | |
Lyrics files¹ | *.lrc | Simple LRC and enhanced LRC formats are supported. |
PDF files¹² | *.pdf | Text can only be extracted from text-searchable PDF files. There is no support for automatically converting scanned PDF files into text-searchable ones. |
PowerPoint presentations¹² | *.pptx | Legacy Microsoft 97-2003 PowerPoint presentations (*.ppt) are not supported. |
Text files | *.txt | |
Translation memory files¹ | *.tmx | |
Word documents¹² | *.docx | Legacy Microsoft 97-2003 Word documents (*.doc) are not supported. |
XML files¹ | *.xml | |
Important
- Non-TXT files are automatically converted to TXT files when imported into Wordless. You can check the converted files under the folder imports at the installation location of Wordless on your computer (on macOS, right-click Wordless.app, select Show Package Contents, and navigate to Contents/MacOS/imports/). You can change this location via Menu Bar → Preferences → Settings → General → Import → Temporary Files → Default path.
- It is not recommended to import non-text files directly into Wordless; support for doing so is provided only for convenience, since the accuracy of text extraction can never be guaranteed and unintended data loss might occur. Users are therefore encouraged to convert their files using specialized tools and make their own choices as to which parts of the data should be kept or discarded.
Language | File Encoding | Auto-detection |
---|---|---|
All languages | UTF-8 without BOM | ✔ |
All languages | UTF-8 with BOM | ✔ |
All languages | UTF-16 with BOM | ✔ |
All languages | UTF-16BE without BOM | ✔ |
All languages | UTF-16LE without BOM | ✔ |
All languages | UTF-32 with BOM | ✔ |
All languages | UTF-32BE without BOM | ✔ |
All languages | UTF-32LE without BOM | ✔ |
All languages | UTF-7 | ✔ |
Arabic | CP720 | ✔ |
Arabic | CP864 | ✔ |
Arabic | ISO-8859-6 | ✔ |
Arabic | Mac OS | ✔ |
Arabic | Windows-1256 | ✔ |
Baltic languages | CP775 | ✔ |
Baltic languages | ISO-8859-13 | ✔ |
Baltic languages | Windows-1257 | ✔ |
Celtic languages | ISO-8859-14 | ✔ |
Chinese | GB18030 | ✔ |
Chinese | GBK | ✔ |
Chinese (Simplified) | GB2312 | ✔ |
Chinese (Simplified) | HZ | ✔ |
Chinese (Traditional) | Big-5 | ✔ |
Chinese (Traditional) | Big5-HKSCS | ✔ |
Chinese (Traditional) | CP950 | ✔ |
Croatian | Mac OS | ✔ |
Cyrillic | CP855 | ✔ |
Cyrillic | CP866 | ✔ |
Cyrillic | ISO-8859-5 | ✔ |
Cyrillic | Mac OS | ✔ |
Cyrillic | Windows-1251 | ✔ |
English | ASCII | ✔ |
English | EBCDIC 037 | ✔ |
English | CP437 | ✔ |
European | HP Roman-8 | ✔ |
European (Central) | CP852 | ✔ |
European (Central) | ISO-8859-2 | ✔ |
European (Central) | Mac OS Central European | ✔ |
European (Central) | Windows-1250 | ✔ |
European (Northern) | ISO-8859-4 | ✔ |
European (Southern) | ISO-8859-3 | ✔ |
European (Southeastern) | ISO-8859-16 | ✔ |
European (Western) | EBCDIC 500 | ✔ |
European (Western) | CP850 | ✔ |
European (Western) | CP858 | ✔ |
European (Western) | CP1140 | ✔ |
European (Western) | ISO-8859-1 | ✔ |
European (Western) | ISO-8859-15 | ✔ |
European (Western) | Mac OS Roman | ✔ |
European (Western) | Windows-1252 | ✔ |
French | CP863 | ✔ |
German | EBCDIC 273 | ✔ |
Greek | CP737 | ✔ |
Greek | CP869 | ✔ |
Greek | CP875 | ✔ |
Greek | ISO-8859-7 | ✔ |
Greek | Mac OS | ✔ |
Greek | Windows-1253 | ✔ |
Hebrew | CP856 | ✔ |
Hebrew | CP862 | ✔ |
Hebrew | EBCDIC 424 | ✔ |
Hebrew | ISO-8859-8 | ✔ |
Hebrew | Windows-1255 | ✔ |
Icelandic | CP861 | ✔ |
Icelandic | Mac OS | ✔ |
Japanese | CP932 | ✔ |
Japanese | EUC-JP | ✔ |
Japanese | EUC-JIS-2004 | ✔ |
Japanese | EUC-JISx0213 | ✔ |
Japanese | ISO-2022-JP | ✔ |
Japanese | ISO-2022-JP-1 | ✔ |
Japanese | ISO-2022-JP-2 | ✔ |
Japanese | ISO-2022-JP-2004 | ✔ |
Japanese | ISO-2022-JP-3 | ✔ |
Japanese | ISO-2022-JP-EXT | ✔ |
Japanese | Shift_JIS | ✔ |
Japanese | Shift_JIS-2004 | ✔ |
Japanese | Shift_JISx0213 | ✔ |
Kazakh | KZ-1048 | ✔ |
Kazakh | PTCP154 | ✔ |
Korean | EUC-KR | ✔ |
Korean | ISO-2022-KR | ✔ |
Korean | JOHAB | ✔ |
Korean | UHC | ✔ |
Nordic languages | CP865 | ✔ |
Nordic languages | ISO-8859-10 | ✔ |
Persian/Urdu | Mac OS Farsi | ✔ |
Portuguese | CP860 | ✔ |
Romanian | Mac OS | ✔ |
Russian | KOI8-R | ✔ |
Tajik | KOI8-T | ✔ |
Thai | CP874 | ✔ |
Thai | ISO-8859-11 | ✔ |
Thai | TIS-620 | ✔ |
Turkish | CP857 | ✔ |
Turkish | EBCDIC 1026 | ✔ |
Turkish | ISO-8859-9 | ✔ |
Turkish | Mac OS | ✔ |
Turkish | Windows-1254 | ✔ |
Ukrainian | CP1125 | ✔ |
Ukrainian | KOI8-U | ✔ |
Urdu | CP1006 | ✔ |
Vietnamese | CP1258 | ✔ |
The readability of a text depends on several variables, including the average sentence length, the average word length in characters, the average word length in syllables, the number of monosyllabic words, the number of polysyllabic words, the number of difficult words, etc.
It should be noted that some readability measures are language-specific, or applicable only to texts in languages for which Wordless has built-in syllable tokenization support (see section 12.1 Supported Languages for reference), while others can be applied to texts in all languages.
The following variables are used in the formulas:
NumSentences: Number of sentences
NumWords: Number of words
NumWordsSyl₁: Number of monosyllabic words
NumWordsSylsₙ₊: Number of words with n or more syllables
NumWordsLtrsₙ₊: Number of words with n or more letters
NumWordsLtrsₙ₋: Number of words with n or fewer letters
NumConjs: Number of conjunctions
NumPreps: Number of prepositions
NumProns: Number of pronouns
NumWordsDale₇₆₉: Number of words outside the Dale list of 769 easy words (Dale, 1931)
NumWordsDale₃₀₀₀: Number of words outside the Dale list of 3000 easy words (Dale & Chall, 1948b)
NumWordsSpache: Number of words outside the Spache word list (Spache, 1974)
NumWordTypes: Number of word types
NumWordTypesBambergerVanecek: Number of word types outside the Bamberger-Vanecek's list of 1000 most common words (Bamberger & Vanecek, 1984, pp. 176–179)
NumWordTypesDale₇₆₉: Number of word types outside the Dale list of 769 easy words (Dale, 1931)
NumSyls: Number of syllables
NumSylsLuongNguyenDinh₁₀₀₀: Number of syllables outside the Luong-Nguyen-Dinh list of 1000 most frequent syllables extracted from all easy documents of the corpus of Vietnamese text readability dataset on literature domain (Luong et al., 2018)
NumCharsAll: Number of characters (letters, CJK characters, etc., numerals, and punctuation marks)
NumCharsAlnum: Number of alphanumeric characters (letters, CJK characters, etc., and numerals)
NumCharsAlpha: Number of alphabetic characters (letters, CJK characters, etc.)
Readability Formula | Formula | Supported Languages |
---|---|---|
Al-Heeti's Readability Prediction Formula¹ (Al-Heeti, 1984, pp. 102, 104, 106) | | Arabic |
Automated Arabic Readability Index (Al-Tamimi et al., 2013) | | Arabic |
Automated Readability Index¹ (Smith & Senter, 1967, p. 8; Navy: Kincaid et al., 1975, p. 14) | | All languages |
Bormuth's Cloze Mean & Grade Placement (Bormuth, 1969, pp. 152, 160) | where C is the cloze criterion score, whose value can be changed via Menu Bar → Preferences → Settings → Measures → Readability → Bormuth's Grade Placement → Cloze criterion score | English |
Coleman-Liau Index (Coleman & Liau, 1975) | | All languages |
Coleman's Readability Formula¹ (Liau et al., 1976) | | All languages²³ |
Dale-Chall Readability Formula¹ (Dale & Chall, 1948a; Dale & Chall, 1948b; Powers-Sumner-Kearl: Powers et al., 1958; New: Chall & Dale, 1995) | | English |
Danielson-Bryan's Readability Formula¹ (Danielson & Bryan, 1963) | | All languages |
Dawood's Readability Formula (Dawood, 1977) | | Arabic |
Degrees of Reading Power (College Entrance Examination Board, 1981) | where M is Bormuth's cloze mean. | English |
Devereux Readability Index (Smith, 1961) | | All languages |
Dickes-Steiwer Handformel (Dickes & Steiwer, 1977) | | All languages |
Easy Listening Formula (Fang, 1966) | | All languages² |
Flesch-Kincaid Grade Level (Kincaid et al., 1975, p. 14) | | All languages² |
Flesch Reading Ease¹ (Flesch, 1948; Powers-Sumner-Kearl: Powers et al., 1958; Dutch: Douma, 1960, p. 453; Brouwer, 1963; French: Kandel & Moles, 1958; German: Amstad, 1978; Italian: Franchina & Vacca, 1986; Russian: Oborneva, 2006, p. 13; Spanish: Fernández Huerta, 1959; Szigriszt Pazos, 1993, p. 247; Ukrainian: Partiko, 2001) | | All languages² |
Flesch Reading Ease (Farr-Jenkins-Paterson)¹ (Farr et al., 1951; Powers-Sumner-Kearl: Powers et al., 1958) | | All languages² |
FORCAST Grade Level (Caylor & Sticht, 1973, p. 3) | * One sample of 150 words is taken randomly from the text, so the text should be at least 150 words long. | All languages² |
Fórmula de comprensibilidad de Gutiérrez de Polini (Gutiérrez de Polini, 1972) | | Spanish |
Fórmula de Crawford (Crawford, 1985) | | Spanish² |
Fucks's Stilcharakteristik (Fucks, 1955) | | All languages² |
Gulpease Index (Lucisano & Emanuela Piemontese, 1988) | | Italian |
Gunning Fog Index¹ (English: Gunning, 1968, p. 38; Powers-Sumner-Kearl: Powers et al., 1958; Navy: Kincaid et al., 1975, p. 14; Polish: Pisarek, 1969) | where NumHardWords is the number of words with 3 or more syllables, except proper nouns and words with 3 syllables ending in -ed or -es, for English texts, and the number of words with 4 or more syllables in their base forms, except proper nouns, for Polish texts. | English & Polish² |
Legibilidad µ (Muñoz Baquedano, 2006) | where LenWordsAvg is the average word length in letters, and LenWordsVar is the variance of word lengths in letters. | Spanish |
Lensear Write (O'Hayre, 1966, p. 8) | where NumWords1Syl is the number of monosyllabic words excluding the, is, are, was, were. * One sample of 100 words is taken randomly from the text; if the text is shorter than 100 words, NumWords1Syl and NumSentences are multiplied by 100 and then divided by NumWords. | English² |
Lix (Björnsson, 1968) | | All languages |
Lorge Readability Index¹ (Lorge, 1944; Corrected: Lorge, 1948) | | English³ |
Luong-Nguyen-Dinh's Readability Formula (Luong et al., 2018) | * The number of syllables is estimated by tokenizing the text by whitespace and counting the number of tokens excluding punctuation marks. | Vietnamese |
McAlpine EFLAW Readability Score (Nirmaldasan, 2009) | | English |
neue Wiener Literaturformeln¹ (Bamberger & Vanecek, 1984, p. 82) | | German² |
neue Wiener Sachtextformel¹ (Bamberger & Vanecek, 1984, pp. 83–84) | | German² |
OSMAN (El-Haj & Rayson, 2016) | where NumFaseehWords is the number of words which have 5 or more syllables and contain ء/ئ/ؤ/ذ/ظ or end with وا/ون. * The number of syllables in each word is estimated by adding up the number of short syllables and twice the number of long and stress syllables in each word. | Arabic |
Rix (Anderson, 1983) | | All languages |
SMOG Grade (McLaughlin, 1969; German: Bamberger & Vanecek, 1984, p. 78) | * A sample is constructed using the first 10 sentences, the last 10 sentences, and the 10 sentences in the middle of the text, so the text should be at least 30 sentences long. | All languages² |
Spache Grade Level¹ (Spache, 1953; Revised: Spache, 1974) | * Three samples of 100 words each are taken randomly from the text and the results are averaged, so the text should be at least 100 words long. | All languages |
Strain Index (Solomon, 2006) | * A sample is constructed using the first 3 sentences of the text, so the text should be at least 3 sentences long. | All languages² |
Tränkle & Bailer's Readability Formula¹ (Tränkle & Bailer, 1984) | * One sample of 100 words is taken randomly from the text, so the text should be at least 100 words long. | All languages³ |
Tuldava's Text Difficulty (Tuldava, 1975) | | All languages² |
Wheeler & Smith's Readability Formula (Wheeler & Smith, 1954) | where NumUnits is the number of sentence segments ending in periods, question marks, exclamation marks, colons, semicolons, and dashes. | All languages² |
Note
- ¹ Variants available and can be selected via Menu Bar → Preferences → Settings → Measures → Readability
- ² Requires built-in syllable tokenization support
- ³ Requires built-in part-of-speech tagging support
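As a concrete illustration, here are two of the simpler formulas above expressed in code, using the published constants for the Automated Readability Index (Smith & Senter, 1967) and Flesch Reading Ease (Flesch, 1948). This is a sketch; Wordless's implementations and selectable variants may differ in detail:

```python
# Both formulas are computed directly from the variables defined above:
# NumSentences, NumWords, NumSyls, and NumCharsAlnum.
def automated_readability_index(num_sentences, num_words, num_chars_alnum):
    return (4.71 * (num_chars_alnum / num_words)
            + 0.5 * (num_words / num_sentences)
            - 21.43)

def flesch_reading_ease(num_sentences, num_words, num_syls):
    return (206.835
            - 1.015 * (num_words / num_sentences)
            - 84.6 * (num_syls / num_words))

# Hypothetical counts: 5 sentences, 100 words, 450 characters, 130 syllables
print(automated_readability_index(5, 100, 450))
print(flesch_reading_ease(5, 100, 130))
```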
Lexical density/diversity is the measurement of the extent to which the vocabulary used in the text varies.
The following variables are used in the formulas:
fᵢ: Frequency of the i-th token type ranked descendingly by frequencies
fₘₐₓ: Maximum frequency among all token types
NumTypes: Number of token types
NumTypesf: Number of token types whose frequencies equal f
NumTokens: Number of tokens
| Indicator of Lexical Density/Diversity | Formula |
|---|---|
| Brunét's Index (Brunét, 1978) | |
| Corrected TTR (Carroll, 1964) | |
| Fisher's Index of Diversity (Fisher et al., 1943) | where W₋₁ is the −1 branch of the Lambert W function |
| Herdan's Vₘ (Herdan, 1955) | |
| HD-D (McCarthy & Jarvis, 2010) | For detailed calculation procedures, see the reference. The sample size can be modified via Menu Bar → Preferences → Settings → Measures → Lexical Density/Diversity → HD-D → Sample size. |
| Honoré's Statistic (Honoré, 1979) | |
| Lexical Density (Ure, 1971) | where NumContentWords is the number of content words. By default, all tokens whose universal part-of-speech tags assigned by the built-in part-of-speech taggers are ADJ (adjectives), ADV (adverbs), INTJ (interjections), NOUN (nouns), PROPN (proper nouns), NUM (numerals), VERB (verbs), SYM (symbols), or X (others) are categorized as content words. For some built-in part-of-speech taggers, this behavior can be changed via Menu Bar → Preferences → Settings → Part-of-speech Tagging → Tagsets → Mapping Settings → Content/Function Words. |
| LogTTR¹ (Herdan: Herdan, 1960, p. 28; Somers: Somers, 1966; Rubet: Dugast, 1979; Maas: Maas, 1972; Dugast: Dugast, 1978, 1979) | |
| Mean Segmental TTR (Johnson, 1944) | where n is the number of equal-sized segments, whose length can be modified via Menu Bar → Preferences → Settings → Measures → Lexical Density/Diversity → Mean Segmental TTR → Number of tokens in each segment, NumTypesSegᵢ is the number of token types in the i-th segment, and NumTokensSegᵢ is the number of tokens in the i-th segment. |
| Measure of Textual Lexical Diversity (McCarthy, 2005, pp. 95–96, 99–100; McCarthy & Jarvis, 2010) | For detailed calculation procedures, see the references. The factor size can be modified via Menu Bar → Preferences → Settings → Measures → Lexical Density/Diversity → Measure of Textual Lexical Diversity → Factor size. |
| Moving-average TTR (Covington & McFall, 2010) | where w is the window size, which can be modified via Menu Bar → Preferences → Settings → Measures → Lexical Density/Diversity → Moving-average TTR → Window size, NumTypesWindowₚ is the number of token types within the moving window starting at position p, and NumTokensWindowₚ is the number of tokens within the moving window starting at position p. |
| Popescu-Mačutek-Altmann's B₁/B₂/B₃/B₄/B₅ (Popescu et al., 2008) | |
| Popescu's R₁ (Popescu, 2009, pp. 18, 30, 33) | For detailed calculation procedures, see the reference. |
| Popescu's R₂ (Popescu, 2009, pp. 35–36, 38) | For detailed calculation procedures, see the reference. |
| Popescu's R₃ (Popescu, 2009, pp. 48–49, 53) | For detailed calculation procedures, see the reference. |
| Popescu's R₄ (Popescu, 2009, p. 57) | For detailed calculation procedures, see the reference. |
| Repeat Rate¹ (Popescu, 2009, p. 166) | |
| Root TTR (Guiraud, 1954) | |
| Shannon Entropy¹ (Popescu, 2009, p. 173) | |
| Simpson's l (Simpson, 1949) | |
| Type-token Ratio (Johnson, 1944) | |
| vocd-D (Malvern et al., 2004, pp. 51, 56–57) | For detailed calculation procedures, see the reference. |
| Yule's Characteristic K (Yule, 1944, pp. 52–53) | |
| Yule's Index of Diversity (Williams, 1970, p. 100) | |
Note
1. Variants are available and can be selected via Menu Bar → Preferences → Settings → Measures → Lexical Density/Diversity.
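As an illustration of the simpler indicators above, the following minimal Python sketch computes Type-token Ratio, Root TTR, Corrected TTR, and Moving-average TTR from a plain token list, using the published formulas; it is a sketch for clarity, not Wordless's implementation:

```python
import math

def ttr(tokens):
    # Type-token Ratio: NumTypes / NumTokens
    return len(set(tokens)) / len(tokens)

def root_ttr(tokens):
    # Root TTR (Guiraud, 1954): NumTypes / sqrt(NumTokens)
    return len(set(tokens)) / math.sqrt(len(tokens))

def corrected_ttr(tokens):
    # Corrected TTR (Carroll, 1964): NumTypes / sqrt(2 * NumTokens)
    return len(set(tokens)) / math.sqrt(2 * len(tokens))

def mattr(tokens, window_size=500):
    # Moving-average TTR (Covington & McFall, 2010): the mean of the TTRs
    # of all windows of a fixed size sliding through the text token by token
    if len(tokens) <= window_size:
        return ttr(tokens)
    ttrs = [
        ttr(tokens[p : p + window_size])
        for p in range(len(tokens) - window_size + 1)
    ]
    return sum(ttrs) / len(ttrs)

tokens = 'the quick brown fox jumps over the lazy dog near the barn'.split()
print(f'TTR = {ttr(tokens):.3f}, MATTR = {mattr(tokens, window_size=5):.3f}')
```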
For parts-based measures, each file is divided into n sub-sections (the value of n can be modified via Menu Bar → Preferences → Settings → Measures → Dispersion / Adjusted Frequency → General Settings → Divide each file into subsections), and the frequency of the word in each sub-section is counted and denoted by F₁, F₂, F₃, ..., Fₙ respectively. The total frequency of the word in each file is denoted by F, and the mean of the frequencies over all sub-sections is denoted by F̅.
For distance-based measures, the distance between each pair of consecutive occurrences of the word is calculated and denoted by d₁, d₂, d₃, ..., dF respectively. The total number of tokens in each file is denoted by N.
Then, the dispersion and adjusted frequency of the word are calculated as follows:
| Measure of Dispersion (Parts-based) | Measure of Adjusted Frequency (Parts-based) | Formula |
|---|---|---|
| Carroll's D₂ (Carroll, 1970) | Carroll's Uₘ (Carroll, 1970) | |
| | Engwall's FM (Engwall, 1974) | where R is the number of sub-sections in which the word appears at least once |
| Gries's DP (Gries, 2008; Lijffijt & Gries, 2012) | | * Normalization is applied by default; this behavior can be changed via Menu Bar → Preferences → Settings → Measures → Dispersion → Gries's DP → Apply normalization. |
| Juilland's D (Juilland & Chang-Rodriguez, 1964) | Juilland's U (Juilland & Chang-Rodriguez, 1964) | |
| | Kromer's UR (Kromer, 2003) | where ψ is the digamma function and C is the Euler–Mascheroni constant |
| Lyne's D₃ (Lyne, 1985) | | |
| Rosengren's S (Rosengren, 1971) | Rosengren's KF (Rosengren, 1971) | |
| Zhang's Distributional Consistency (Zhang et al., 2004) | | |
| Measure of Dispersion (Distance-based) | Measure of Adjusted Frequency (Distance-based) | Formula |
|---|---|---|
| Average Logarithmic Distance (Savický & Hlaváčová, 2002) | Average Logarithmic Distance (Savický & Hlaváčová, 2002) | |
| Average Reduced Frequency (Savický & Hlaváčová, 2002) | Average Reduced Frequency (Savický & Hlaváčová, 2002) | |
| Average Waiting Time (Savický & Hlaváčová, 2002) | Average Waiting Time (Savický & Hlaváčová, 2002) | |
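As an illustration, the following Python sketch computes three of the measures above whose published formulas are compact: Juilland's D (1 − V/√(n − 1), where V is the coefficient of variation of F₁, ..., Fₙ), Gries's DP with its optional normalization, and the distance-based Average Reduced Frequency. Equal-sized sub-sections and a population standard deviation are assumed; this is not Wordless's internal code:

```python
import math

def juilland_d(freqs):
    # Juilland's D = 1 - V / sqrt(n - 1), where V is the coefficient of
    # variation (population standard deviation / mean) of F1..Fn
    n = len(freqs)
    mean = sum(freqs) / n
    sd = math.sqrt(sum((f - mean) ** 2 for f in freqs) / n)
    return 1 - (sd / mean) / math.sqrt(n - 1)

def gries_dp(freqs, normalize=True):
    # Gries's DP: half the sum of absolute differences between each
    # sub-section's share of the corpus (1/n for equal-sized parts) and
    # its share of the word's occurrences
    n = len(freqs)
    total = sum(freqs)
    dp = sum(abs(f / total - 1 / n) for f in freqs) / 2
    if normalize:
        # normalized variant (Lijffijt & Gries, 2012)
        dp /= 1 - 1 / n
    return dp

def average_reduced_frequency(positions, num_tokens):
    # ARF (Savický & Hlaváčová, 2002): (1/v) * sum(min(d_i, v)), where
    # v = N / F and d_i are the cyclic distances between consecutive
    # occurrences of the word (positions are 0-based token indices)
    v = num_tokens / len(positions)
    dists = [q - p for p, q in zip(positions, positions[1:])]
    dists.append(num_tokens - positions[-1] + positions[0])  # wrap around
    return sum(min(d, v) for d in dists) / v

freqs = [4, 2, 0, 1, 3]  # F1..F5 for one word across 5 sub-sections
print(f"Juilland's D = {juilland_d(freqs):.3f}")
print(f"Gries's DP   = {gries_dp(freqs):.3f}")
print(f'ARF          = {average_reduced_frequency([3, 40, 41, 80], 100):.3f}')
```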
To calculate the statistical significance, Bayes factor, and effect size (except for the Mann-Whitney U Test, Student's t-test (2-sample), and Welch's t-test) for two words in the same file (collocates) or for one word in two different files (keywords), two contingency tables must first be constructed: one for observed values and one for expected values.
For collocates (in the Collocation Extractor and Colligation Extractor):
| Observed Values | Word 1 | Not Word 1 | Row Total |
|---|---|---|---|
| Word 2 | O₁₁ | O₁₂ | O₁ₓ = O₁₁ + O₁₂ |
| Not Word 2 | O₂₁ | O₂₂ | O₂ₓ = O₂₁ + O₂₂ |
| Column Total | Oₓ₁ = O₁₁ + O₂₁ | Oₓ₂ = O₁₂ + O₂₂ | Oₓₓ = O₁₁ + O₁₂ + O₂₁ + O₂₂ |

| Expected Values | Word 1 | Not Word 1 |
|---|---|---|
| Word 2 | E₁₁ = O₁ₓ × Oₓ₁ / Oₓₓ | E₁₂ = O₁ₓ × Oₓ₂ / Oₓₓ |
| Not Word 2 | E₂₁ = O₂ₓ × Oₓ₁ / Oₓₓ | E₂₂ = O₂ₓ × Oₓ₂ / Oₓₓ |
O₁₁: Number of occurrences of Word 1 followed by Word 2.
O₁₂: Number of occurrences of Word 1 followed by any word except Word 2.
O₂₁: Number of occurrences of any word except Word 1 followed by Word 2.
O₂₂: Number of occurrences of any word except Word 1 followed by any word except Word 2.
For keywords (in the Keyword Extractor):

| Observed Values | Observed File | Reference File | Row Total |
|---|---|---|---|
| Word w | O₁₁ | O₁₂ | O₁ₓ = O₁₁ + O₁₂ |
| Not Word w | O₂₁ | O₂₂ | O₂ₓ = O₂₁ + O₂₂ |
| Column Total | Oₓ₁ = O₁₁ + O₂₁ | Oₓ₂ = O₁₂ + O₂₂ | Oₓₓ = O₁₁ + O₁₂ + O₂₁ + O₂₂ |

| Expected Values | Observed File | Reference File |
|---|---|---|
| Word w | E₁₁ = O₁ₓ × Oₓ₁ / Oₓₓ | E₁₂ = O₁ₓ × Oₓ₂ / Oₓₓ |
| Not Word w | E₂₁ = O₂ₓ × Oₓ₁ / Oₓₓ | E₂₂ = O₂ₓ × Oₓ₂ / Oₓₓ |
O₁₁: Number of occurrences of Word w in the observed file.
O₁₂: Number of occurrences of Word w in the reference file.
O₂₁: Number of occurrences of all words except Word w in the observed file.
O₂₂: Number of occurrences of all words except Word w in the reference file.
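As a worked illustration of the two tables, the following sketch derives the expected values from the row and column totals and computes the log-likelihood ratio statistic G² = 2 Σ Oᵢⱼ ln(Oᵢⱼ/Eᵢⱼ) (Dunning, 1993); the counts are hypothetical and this is not Wordless's internal code:

```python
import math

def expected_values(o11, o12, o21, o22):
    # E_ij = (row total * column total) / grand total
    oxx = o11 + o12 + o21 + o22
    return (
        (o11 + o12) * (o11 + o21) / oxx,  # E11
        (o11 + o12) * (o12 + o22) / oxx,  # E12
        (o21 + o22) * (o11 + o21) / oxx,  # E21
        (o21 + o22) * (o12 + o22) / oxx,  # E22
    )

def log_likelihood_ratio(o11, o12, o21, o22):
    # G2 = 2 * sum over cells of O_ij * ln(O_ij / E_ij); 0 * ln(0) is 0
    observed = (o11, o12, o21, o22)
    expected = expected_values(*observed)
    return 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )

# hypothetical counts: Word w occurs 150 times in a 10,000-token observed
# file and 100 times in a 20,000-token reference file
print(f'G2 = {log_likelihood_ratio(150, 100, 9_850, 19_900):.2f}')
```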
To conduct the Mann-Whitney U Test, Student's t-test (2-sample), or Welch's t-test on a specific word, each column total is first divided into n (5 by default) sub-sections. More specifically, in the Collocation Extractor and Colligation Extractor, the collocates where Word 1 appears as the node and the collocates where Word 1 does not appear as the node are each divided into n parts; in the Keyword Extractor, all tokens in the observed file and all tokens in the reference file are each divided into n equal parts.
The frequencies of Word 2 (in the Collocation Extractor and Colligation Extractor) or Word w (in the Keyword Extractor) in each sub-section of the two column totals are counted and denoted by F₁₁, F₂₁, F₃₁, ..., Fₙ₁ and F₁₂, F₂₂, F₃₂, ..., Fₙ₂ respectively. The total frequencies of Word 2 or Word w in the two column totals are denoted by Fₓ₁ and Fₓ₂ respectively, and the means of the frequencies over all sub-sections in the two column totals are denoted by F̅ₓ₁ and F̅ₓ₂ respectively.
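The division into sub-sections can be pictured with the following sketch, which assumes a plain token list and a naive equal split; the two resulting frequency lists would then be passed to the chosen test, for example SciPy's scipy.stats.mannwhitneyu:

```python
def freqs_in_subsections(tokens, word, n=5):
    # Divide the token list into n roughly equal parts and count the
    # occurrences of the word in each part, yielding F11..Fn1 (or F12..Fn2)
    size, rem = divmod(len(tokens), n)
    freqs, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        freqs.append(tokens[start:end].count(word))
        start = end
    return freqs

# hypothetical usage for keywords: one frequency list per file
# f_obs = freqs_in_subsections(tokens_observed, 'research')
# f_ref = freqs_in_subsections(tokens_reference, 'research')
# scipy.stats.mannwhitneyu(f_obs, f_ref) would then compare the two
```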
Then the test statistic, Bayes factor, and effect size are calculated as follows:
| Test of Statistical Significance | Measure of Bayes Factor | Formula |
|---|---|---|
| Fisher's Exact Test (Pedersen, 1996) | | See: Fisher's exact test - Wikipedia |
| Log-likelihood Ratio Test (Dunning, 1993) | Log-likelihood Ratio Test (Wilson, 2013) | |
| Mann-Whitney U Test (Kilgarriff, 2001) | | See: Mann–Whitney U test - Wikipedia |
| Pearson's Chi-squared Test (Hofland & Johansson, 1982; Oakes, 1998) | | |
| Student's t-test (1-sample) (Church et al., 1991) | | |
| Student's t-test (2-sample) (Paquot & Bestgen, 2009) | Student's t-test (2-sample) (Wilson, 2013) | |
| z-score (Dennis, 1964) | | |
| z-score (Berry-Rogghe) (Berry-Rogghe, 1973) | | where S is the average span size on both sides of the node word |
| Measure of Effect Size | Formula |
|---|---|
| %DIFF (Gabrielatos & Marchi, 2012) | |
| Cubic Association Ratio (Daille, 1994, 1995) | |
| Dice's Coefficient (Smadja et al., 1996) | |
| Difference Coefficient (Hofland & Johansson, 1982; Gabrielatos, 2018) | |
| Jaccard Index (Dunning, 1998) | |
| Kilgarriff's Ratio (Kilgarriff, 2009) | where α is the smoothing parameter, whose value can be changed via Menu Bar → Preferences → Settings → Measures → Effect Size → Kilgarriff's Ratio → Smoothing Parameter |
| Log Ratio (Hardie, 2014) | |
| Log-Frequency Biased MD (Thanopoulos et al., 2002) | |
| logDice (Rychlý, 2008) | |
| MI.log-f (Lexical Computing, 2015; Kilgarriff & Tugwell, 2002) | |
| Minimum Sensitivity (Pedersen, 1998) | |
| Mutual Dependency (Thanopoulos et al., 2002) | |
| Mutual Expectation (Dias et al., 1999) | |
| Mutual Information (Dunning, 1998) | |
| Odds Ratio (Pojanapunya & Todd, 2016) | |
| Pointwise Mutual Information (Church & Hanks, 1990) | |
| Poisson Collocation Measure (Quasthoff & Wolff, 2002) | |
| Squared Phi Coefficient (Church & Gale, 1991) | |
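Two of the simpler measures follow directly from the contingency-table notation above: Pointwise Mutual Information, log₂(O₁₁/E₁₁), and Dice's Coefficient, 2O₁₁/(O₁ₓ + Oₓ₁). The following sketch uses hypothetical counts and is an illustration, not Wordless's internal code:

```python
import math

def pmi(o11, o1x, ox1, oxx):
    # Pointwise Mutual Information (Church & Hanks, 1990):
    # log2(O11 / E11), with E11 = O1x * Ox1 / Oxx
    return math.log2(o11 / (o1x * ox1 / oxx))

def dice(o11, o1x, ox1):
    # Dice's Coefficient: 2 * O11 / (O1x + Ox1)
    return 2 * o11 / (o1x + ox1)

# hypothetical counts in a 1,000,000-token corpus: Word 1 occurs 800 times
# (Ox1), Word 2 occurs 500 times (O1x), and they co-occur 30 times (O11)
print(f'PMI  = {pmi(30, 500, 800, 1_000_000):.2f}')  # ~6.23
print(f'Dice = {dice(30, 500, 800):.4f}')            # ~0.0462
```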
- Al-Heeti, K. N. (1984). Judgment analysis technique applied to readability prediction of Arabic reading material [Doctoral dissertation, University of Northern Colorado]. ProQuest Dissertations and Theses Global.
- Al-Tamimi, A., Jaradat, M., Aljarrah, N., & Ghanim, S. (2013). AARI: Automatic Arabic readability index. The International Arab Journal of Information Technology, 11(4), 370–378.
- Amstad, T. (1978). Wie verständlich sind unsere Zeitungen? [Unpublished doctoral dissertation]. University of Zurich.
- Anderson, J. (1983). Lix and Rix: Variations on a little-known readability index. Journal of Reading, 26(6), 490–496.
- Bamberger, R., & Vanecek, E. (1984). Lesen-verstehen-lernen-schreiben: Die schwierigkeitsstufen von texten in deutscher sprache. Jugend und Volk.
- Berry-Rogghe, G. L. M. (1973). The computation of collocations and their relevance in lexical studies. In A. J. Aiken, R. W. Bailey, & N. Hamilton-Smith (Eds.), The computer and literary studies (pp. 103–112). Edinburgh University Press.
- Björnsson, C.-H. (1968). Läsbarhet. Liber.
- Bormuth, J. R. (1969). Development of readability analyses. U.S. Department of Health, Education, and Welfare. http://files.eric.ed.gov/fulltext/ED029166.pdf
- Brouwer, R. H. M. (1963). Onderzoek naar de leesmoeilijkheid van Nederlands proza. Paedagogische studiën, 40, 454–464. https://objects.library.uu.nl/reader/index.php?obj=1874-205260&lan=en
- Brunét, E. (1978). Le vocabulaire de Jean Giraudoux: Structure et evolution. Slatkine.
- Carroll, J. B. (1964). Language and thought. Prentice-Hall.
- Carroll, J. B. (1970). An alternative to Juilland's usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour, 3(2), 61–65. https://doi.org/10.1002/j.2333-8504.1970.tb00778.x
- Caylor, J. S., & Sticht, T. G. (1973). Development of a simple readability index for job reading material. Human Resource Research Organization. https://ia902703.us.archive.org/31/items/ERIC_ED076707/ERIC_ED076707.pdf
- Chall, J. S., & Dale, E. (1995). Readability revisited: The new Dale-Chall readability formula. Brookline Books.
- Church, K. W., & Gale, W. A. (1991, September 29–October 1). Concordances for parallel text [Paper presentation]. Using Corpora: Seventh Annual Conference of the UW Centre for the New OED and Text Research, St. Catherine's College, Oxford, United Kingdom.
- Church, K., Gale, W., Hanks, P., & Hindle, D. (1991). Using statistics in lexical analysis. In U. Zernik (Ed.), Lexical acquisition: Exploiting on-line resources to build a lexicon (pp. 115–164). Psychology Press.
- Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.
- Coleman, M., & Liau, T. L. (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60(2), 283–284. https://doi.org/10.1037/h0076540
- College Entrance Examination Board. (1981). Degrees of reading power brings the students and the text together.
- Covington, M. A., & McFall, J. D. (2010). Cutting the Gordian knot: The moving-average type-token ratio (MATTR). Journal of Quantitative Linguistics, 17(2), 94–100. https://doi.org/10.1080/09296171003643098
- Crawford, A. N. (1985). Fórmula y gráfico para determinar la comprensibilidad de textos de nivel primario en castellano. Lectura y Vida, 6(4). http://www.lecturayvida.fahce.unlp.edu.ar/numeros/a6n4/06_04_Crawford.pdf
- Daille, B. (1994). Approche mixte pour l'extraction automatique de terminologie: Statistiques lexicales et filtres linguistiques [Doctoral thesis, Paris Diderot University]. Béatrice Daille. http://www.bdaille.com/index.php?option=com_docman&task=doc_download&gid=8&Itemid=
- Daille, B. (1995). Combined approach for terminology extraction: Lexical statistics and linguistic filtering. UCREL technical papers (Vol. 5). Lancaster University.
- Dale, E. (1931). A comparison of two word lists. Educational Research Bulletin, 10(18), 484–489.
- Dale, E., & Chall, J. S. (1948a). A formula for predicting readability. Educational Research Bulletin, 27(1), 11–20, 28.
- Dale, E., & Chall, J. S. (1948b). A formula for predicting readability: Instructions. Educational Research Bulletin, 27(2), 37–54.
- Danielson, W. A., & Bryan, S. D. (1963). Computer automation of two readability formulas. Journalism Quarterly, 40(2), 201–206. https://doi.org/10.1177/107769906304000207
- Dawood, B. A. K. (1977). The relationship between readability and selected language variables [Unpublished master's thesis]. University of Baghdad.
- Dennis, S. F. (1964). The construction of a thesaurus automatically from a sample of text. In M. E. Stevens, V. E. Giuliano, & L. B. Heilprin (Eds.), Proceedings of the symposium on statistical association methods for mechanized documentation (pp. 61–148). National Bureau of Standards.
- Dias, G., Guilloré, S., & Pereira Lopes, J. G. (1999). Language independent automatic acquisition of rigid multiword units from unrestricted text corpora. In A. Condamines, C. Fabre, & M. Péry-Woodley (Eds.), TALN'99: 6ème Conférence Annuelle Sur le Traitement Automatique des Langues Naturelles (pp. 333–339). TALN.
- Dickes, P., & Steiwer, L. (1977). Ausarbeitung von lesbarkeitsformeln für die deutsche sprache. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 9(1), 20–28.
- Douma, W. H. (1960). De leesbaarheid van landbouwbladen: Een onderzoek naar en een toepassing van leesbaarheidsformules [Readability of Dutch farm papers: A discussion and application of readability formulas]. Afdeling sociologie en sociografie van de Landbouwhogeschool Wageningen. https://edepot.wur.nl/276323
- Dugast, D. (1978). Sur quoi se fonde la notion d'étendue théorique du vocabulaire? Le Français Moderne, 46, 25–32.
- Dugast, D. (1979). Vocabulaire et stylistique: I théâtre et dialogue, travaux de linguistique quantitative. Slatkine.
- Dunning, T. E. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
- Dunning, T. E. (1998). Finding structure in text, genome and other symbolic sequences [Doctoral dissertation, University of Sheffield]. arXiv. https://arxiv.org/pdf/1207.1847.pdf
- El-Haj, M., & Rayson, P. (2016). OSMAN: A novel Arabic readability metric. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 250–255). European Language Resources Association. http://www.lrec-conf.org/proceedings/lrec2016/index.html
- Engwall, G. (1974). Fréquence et distribution du vocabulaire dans un choix de romans français [Unpublished doctoral dissertation]. Stockholm University.
- Fang, I. E. (1966). The easy listening formula. Journal of Broadcasting, 11(1), 63–68. https://doi.org/10.1080/08838156609363529
- Farr, J. N., Jenkins, J. J., & Paterson, D. G. (1951). Simplification of Flesch reading ease formula. Journal of Applied Psychology, 35(5), 333–337. https://doi.org/10.1037/h0062427
- Fernández Huerta, J. (1959). Medidas sencillas de lecturabilidad. Consigna, 214, 29–32.
- Fisher, R. A., Corbet, A. S., & Williams, C. B. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology, 12(1), 42–58. https://doi.org/10.2307/1411
- Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3), 221–233. https://doi.org/10.1037/h0057532
- Franchina, V., & Vacca, R. (1986). Adaptation of Flesch readability index on a bilingual text written by the same author both in Italian and English languages. Linguaggi, 3, 47–49.
- Fucks, W. (1955). Unterschied des Prosastils von Dichtern und anderen Schriftstellern: Ein Beispiel mathematischer Stilanalyse. Bouvier.
- Gabrielatos, C. (2018). Keyness analysis: Nature, metrics and techniques. In C. Taylor & A. Marchi (Eds.), Corpus approaches to discourse: A critical review (pp. 225–258). Routledge.
- Gabrielatos, C., & Marchi, A. (2012, September 13–14). Keyness: Appropriate metrics and practical issues [Conference session]. CADS International Conference 2012, University of Bologna, Italy.
- Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri
- Guiraud, P. (1954). Les caractères statistiques du vocabulaire: Essai de méthodologie. Presses universitaires de France.
- Gunning, R. (1968). The technique of clear writing (revised ed.). McGraw-Hill Book Company.
- Gutiérrez de Polini, L. E. (1972). Investigación sobre lectura en Venezuela [Paper presentation]. Primeras Jornadas de Educación Primaria, Ministerio de Educación, Caracas, Venezuela.
- Hardie, A. (2014, April 28). Log ratio: An informal introduction. ESRC Centre for Corpus Approaches to Social Science (CASS). http://cass.lancs.ac.uk/log-ratio-an-informal-introduction/
- Herdan, G. (1955). A new derivation and interpretation of Yule's 'Characteristic' K. Zeitschrift für Angewandte Mathematik und Physik (ZAMP), 6(4), 332–339. https://doi.org/10.1007/BF01587632
- Herdan, G. (1960). Type-token mathematics: A textbook of mathematical linguistics. Mouton.
- Hofland, K., & Johansson, S. (1982). Word frequencies in British and American English. Norwegian Computing Centre for the Humanities.
- Honoré, A. (1979). Some simple measures of richness of vocabulary. Association of Literary and Linguistic Computing Bulletin, 7(2), 172–177.
- Johnson, W. (1944). Studies in language behavior: I. A program of research. Psychological Monographs, 56(2), 1–15. https://doi.org/10.1037/h0093508
- Juilland, A., & Chang-Rodriguez, E. (1964). Frequency dictionary of Spanish words. Mouton.
- Kandel, L., & Moles, A. (1958). Application de l'indice de Flesch à la langue française [Applying the Flesch index to the French language]. The Journal of Educational Research, 21, 283–287.
- Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 232–263. https://doi.org/10.1075/ijcl.6.1.05kil
- Kilgarriff, A. (2009). Simple maths for keywords. In M. Mahlberg, V. González-Díaz, & C. Smith (Eds.), Proceedings of the Corpus Linguistics Conference 2009 (p. 171). University of Liverpool.
- Kilgarriff, A., & Tugwell, D. (2002). WASP-bench: An MT lexicographers' workstation supporting state-of-the-art lexical disambiguation. In Proceedings of the 8th Machine Translation Summit (pp. 187–190). European Association for Machine Translation.
- Kincaid, J. P., Fishburne, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count, and Flesch reading ease formula) for Navy enlisted personnel (Report No. RBR 8-75). Naval Air Station Memphis. https://apps.dtic.mil/sti/pdfs/ADA006655.pdf
- Kromer, V. (2003). A usage measure based on psychophysical relations. Journal of Quantitative Linguistics, 10(2), 177–186. https://doi.org/10.1076/jqul.10.2.177.16718
- Lexical Computing. (2015, July 8). Statistics used in Sketch Engine. Sketch Engine. https://www.sketchengine.eu/documentation/statistics-used-in-sketch-engine/
- Liau, T. L., Bassin, C. B., Martin, C. J., & Coleman, E. B. (1976). Modification of the Coleman readability formulas. Journal of Reading Behavior, 8(4), 381–386. https://journals.sagepub.com/doi/pdf/10.1080/10862967609547193
- Lijffijt, J., & Gries, S. T. (2012). Correction to Stefan Th. Gries' "Dispersions and adjusted frequencies in corpora". International Journal of Corpus Linguistics, 17(1), 147–149. https://doi.org/10.1075/ijcl.17.1.08lij
- Lorge, I. (1944). Predicting readability. Teachers College Record, 45, 404–419.
- Lorge, I. (1948). The Lorge and Flesch readability formulae: A correction. School and Society, 67, 141–142.
- Lucisano, P., & Piemontese, M. E. (1988). GULPEASE: A formula for the prediction of the difficulty of texts in Italian. Scuola e Città, 39(3), 110–124.
- Luong, A.-V., Nguyen, D., & Dinh, D. (2018). A new formula for Vietnamese text readability assessment. In 2018 10th International Conference on Knowledge and Systems Engineering (KSE) (pp. 198–202). IEEE. https://doi.org/10.1109/KSE.2018.8573379
- Lyne, A. A. (1985). Dispersion. In The vocabulary of French business correspondence: Word frequencies, collocations, and problems of lexicometric method (pp. 101–124). Slatkine/Champion.
- Maas, H.-D. (1972). Über den zusammenhang zwischen wortschatzumfang und länge eines textes. Zeitschrift für Literaturwissenschaft und Linguistik, 2(8), 73–96.
- Malvern, D., Richards, B., Chipere, N., & Durán, P. (2004). Lexical diversity and language development: Quantification and assessment. Palgrave Macmillan.
- McCarthy, P. M. (2005). An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD) [Doctoral dissertation, The University of Memphis]. ProQuest Dissertations and Theses Global.
- McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381–392. https://doi.org/10.3758/BRM.42.2.381
- McLaughlin, G. H. (1969). SMOG grading: A new readability formula. Journal of Reading, 12(8), 639–646.
- Muñoz Baquedano, M. (2006). Legibilidad y variabilidad de los textos. Boletín de Investigación Educacional, Pontificia Universidad Católica de Chile, 21(2), 13–26.
- Nirmaldasan. (2009, April 30). McAlpine EFLAW readability score. Readability Monitor. Retrieved November 15, 2022, from https://strainindex.wordpress.com/2009/04/30/mcalpine-eflaw-readability-score/
- Oakes, M. P. (1998). Statistics for corpus linguistics. Edinburgh University Press.
- Oborneva, I. V. (2006). Автоматизированная оценка сложности учебных текстов на основе статистических параметров [Doctoral dissertation, Institute for Strategy of Education Development of the Russian Academy of Education]. Freereferats.ru. https://static.freereferats.ru/_avtoreferats/01002881899.pdf?ver=3
- O'Hayre, J. (1966). Gobbledygook has gotta go. U.S. Government Printing Office. https://www.governmentattic.org/15docs/Gobbledygook_Has_Gotta_Go_1966.pdf
- Paquot, M., & Bestgen, Y. (2009). Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction. Language and Computers, 68, 247–269.
- Partiko, Z. V. (2001). Zagal'ne redaguvannja: Normativni osnovi. Afiša.
- Pedersen, T. (1996). Fishing for exactness. In T. Winn (Ed.), Proceedings of the Sixth Annual South-Central Regional SAS Users' Group Conference (pp. 188–200). The South-Central Regional SAS Users' Group.
- Pedersen, T. (1998). Dependent bigram identification. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (p. 1197). AAAI Press.
- Pisarek, W. (1969). Jak mierzyć zrozumiałość tekstu? Zeszyty Prasoznawcze, 4(42), 35–48.
- Pojanapunya, P., & Todd, R. W. (2016). Log-likelihood and odds ratio keyness statistics for different purposes of keyword analysis. Corpus Linguistics and Linguistic Theory, 15(1), 133–167. https://doi.org/10.1515/cllt-2015-0030
- Popescu, I.-I., Mačutek, J., & Altmann, G. (2008). Word frequency and arc length. Glottometrics, 17, 18–42.
- Popescu, I.-I. (2009). Word frequency studies. Mouton de Gruyter.
- Powers, R. D., Sumner, W. A., & Kearl, B. E. (1958). A recalculation of four adult readability formulas. Journal of Educational Psychology, 49(2), 99–105. https://doi.org/10.1037/h0043254
- Quasthoff, U., & Wolff, C. (2002). The Poisson collocation measure and its applications. In Proceedings of the Second International Workshop on Computational Approaches to Collocations. IEEE.
- Rosengren, I. (1971). The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée, 1, 103–127.
- Rychlý, P. (2008). A lexicographer-friendly association score. In P. Sojka & A. Horák (Eds.), Proceedings of the Second Workshop on Recent Advances in Slavonic Natural Language Processing. Masaryk University.
- Savický, P., & Hlaváčová, J. (2002). Measures of word commonness. Journal of Quantitative Linguistics, 9(3), 215–231. https://doi.org/10.1076/jqul.9.3.215.14124
- Simpson, E. H. (1949). Measurement of diversity. Nature, 163, 688. https://doi.org/10.1038/163688a0
- Smadja, F., McKeown, K. R., & Hatzivassiloglou, V. (1996). Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1), 1–38.
- Smith, E. A. (1961). Devereaux readability index. Journal of Educational Research, 54(8), 298–303. https://doi.org/10.1080/00220671.1961.10882728
- Smith, E. A., & Senter, R. J. (1967). Automated readability index. Aerospace Medical Research Laboratories. https://apps.dtic.mil/sti/pdfs/AD0667273.pdf
- Solomon, N. W. (2006). Qualitative analysis of media language [Unpublished doctoral dissertation]. Madurai Kamaraj University.
- Somers, H. H. (1966). Statistical methods in literary analysis. In J. Leeds (Ed.), The computer and literary style (pp. 128–140). Kent State University Press.
- Spache, G. (1953). A new readability formula for primary-grade reading materials. Elementary School Journal, 53(7), 410–413. https://doi.org/10.1086/458513
- Spache, G. (1974). Good reading for poor readers (Rev. 9th ed.). Garrard.
- Szigriszt Pazos, F. (1993). Sistemas predictivos de legibilidad del mensaje escrito: Fórmula de perspicuidad [Doctoral dissertation, Complutense University of Madrid]. Biblos-e Archivo. https://repositorio.uam.es/bitstream/handle/10486/2488/3907_barrio_cantalejo_ines_maria.pdf?sequence=1&isAllowed=y
- Thanopoulos, A., Fakotakis, N., & Kokkinakis, G. (2002). Comparative evaluation of collocation extraction metrics. In M. G. González & C. P. S. Araujo (Eds.), Proceedings of the Third International Conference on Language Resources and Evaluation (pp. 620–625). European Language Resources Association.
- Tränkle, U., & Bailer, H. (1984). Kreuzvalidierung und Neuberechnung von Lesbarkeitsformeln für die deutsche Sprache [Cross-validation and recalculation of the readability formulas for the German language]. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 16(3), 231–244.
- Tuldava, J. (1975). Ob izmerenii trudnosti tekstov [On measuring the complexity of the text]. Uchenye zapiski Tartuskogo universiteta. Trudy po metodike prepodavaniya inostrannykh yazykov, 345, 102–120.
- Ure, J. (1971). Lexical density and register differentiation. In G. E. Perren & J. L. M. Trim (Eds.), Applications of linguistics (pp. 443–452). Cambridge University Press.
- Wheeler, L. R., & Smith, E. H. (1954). A practical readability formula for the classroom teacher in the primary grades. Elementary English, 31(7), 397–399.
- Williams, C. B. (1970). Style and vocabulary: Numerical studies. Griffin.
- Wilson, A. (2013). Embracing Bayes factors for key item analysis in corpus linguistics. In M. Bieswanger & A. Koll-Stobbe (Eds.), New approaches to the study of linguistic variability (pp. 3–11). Peter Lang.
- Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.
- Zhang, H., Huang, C., & Yu, S. (2004). Distributional consistency: As a general method for defining a core lexicon. In M. T. Lino, M. F. Xavier, F. Ferreira, R. Costa, & R. Silva (Eds.), Proceedings of Fourth International Conference on Language Resources and Evaluation (pp. 1119–1122). European Language Resources Association.