- 1 Main Window
- 2 File Area
- 3 Profiler
- 4 Concordancer
- 5 Parallel Concordancer
- 6 Dependency Parser
- 7 Wordlist Generator
- 8 N-gram Generator
- 9 Collocation Extractor
- 10 Colligation Extractor
- 11 Keyword Extractor
- 12 Appendixes
- 13 References
The main window of Wordless is divided into several sections:

1.1 Menu Bar
The Menu Bar resides at the top of the main window.

1.2 Work Area
The Work Area occupies the upper half of the main window, just below the Menu Bar. It is further divided into the Results Area on the left side and the Settings Area on the right side. You can click on the tabs to switch between different modules.

1.3 File Area
The File Area occupies the lower half of the main window, just above the Status Bar.

1.4 Status Bar
The Status Bar resides at the bottom of the main window. You can show or hide the Status Bar by checking or unchecking Menu Bar → Preferences → Show Status Bar.
You can modify the global scaling factor and font settings of the user interface via Menu Bar → Preferences → General → User Interface Settings.
In most cases, the first thing to do in Wordless is to open and select the files to be processed via Menu Bar → File → Open Files/Folder.
Files are loaded, cached, and selected automatically after being added to the File Table. Only selected files are processed by Wordless. You can drag and drop files around the File Table to change their order, which is reflected in the results.
By default, Wordless tries to detect the encoding and language settings of all files for you, but you should double-check and make sure that the settings of each and every file are correct. If you prefer changing file settings manually, you can uncheck Open Files dialog → Auto-detect encodings and/or Open Files dialog → Auto-detect languages. The default file settings can be modified via Menu Bar → Preferences → Settings → Files → Default Settings. Additionally, you need to adjust the Open Files dialog → Tokenized and Open Files dialog → Tagged options of each file according to whether the file has already been tokenized or tagged.
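Conceptually, encoding and language detection work like the following sketch. Wordless has its own built-in detectors; the charset-normalizer and langdetect libraries are used here purely as illustrative stand-ins (an assumption, not necessarily what Wordless uses internally):

```python
# A minimal sketch of what "Auto-detect encodings" and "Auto-detect languages"
# do conceptually. The libraries below are stand-ins chosen for illustration,
# not necessarily the ones Wordless uses internally.
from charset_normalizer import from_bytes
from langdetect import detect

def detect_file_settings(path):
    with open(path, 'rb') as f:
        raw = f.read()

    match = from_bytes(raw).best()                    # best-guess encoding
    encoding = match.encoding if match else 'utf-8'
    text = raw.decode(encoding, errors='replace')
    language = detect(text)                           # ISO 639-1 code, e.g. 'en'

    return encoding, language

# encoding, language = detect_file_settings('corpus.txt')  # hypothetical file
```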
2.1 Menu Bar → File

2.1.1 Open Files
Open the Open Files dialog to add file(s) to the File Table.

2.1.2 Reopen Closed Files
Add the file(s) that were last closed back to the File Table.
* The history of all closed files is erased when you exit Wordless.

2.1.3 Select All
Select all files in the File Table.

2.1.4 Deselect All
Deselect all files in the File Table.

2.1.5 Invert Selection
Select the files that are not currently selected and deselect the files that are currently selected in the File Table.

2.1.6 Close Selected
Remove the files that are currently selected from the File Table.

2.1.7 Close All
Remove all files from the File Table.

2.2 Open Files dialog

2.2.1 Add files
Add one or more files to the table.
* You can use the Ctrl key (Command key on macOS) and/or the Shift key to select multiple files.

2.2.2 Add folder
Add all files in a folder to the table. By default, all files in the chosen folder and its subfolders (and subfolders of subfolders, and so on) are added to the table. If you do not want to add files in subfolders to the table, you can uncheck Include files in subfolders.

2.2.3 Remove files
Remove the selected files from the table.

2.2.4 Clear table
Remove all files from the table.

2.2.5 Auto-detect encodings
Auto-detect the encodings of all files when they are added to the table. If the detection results are incorrect, you can manually modify the encoding settings in the table.

2.2.6 Auto-detect languages
Auto-detect the languages of all files when they are added to the table. If the detection results are incorrect, you can manually modify the language settings in the table.

2.2.7 Include files in subfolders
When adding a folder to the table, recursively add all files in the chosen folder and its subfolders (and subfolders of subfolders, and so on) to the table.
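The effect of Include files in subfolders amounts to the difference between a recursive and a non-recursive directory listing, as in this minimal sketch (the folder name is hypothetical):

```python
# "Include files in subfolders" checked vs. unchecked, conceptually:
# rglob recurses into subfolders (and subfolders of subfolders), glob does not.
from pathlib import Path

folder = Path('corpora')  # hypothetical folder

files_recursive = [p for p in folder.rglob('*') if p.is_file()]  # checked
files_top_level = [p for p in folder.glob('*') if p.is_file()]   # unchecked
```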
Note
Renamed from Overview to Profiler in Wordless 2.2.0
In Profiler, you can check and compare general linguistic features of different files.
All statistics are grouped into five tables for better readability: Readability, Counts, Lexical Density/Diversity, Lengths, and Length Breakdown.

3.1.1 Readability
Readability statistics of each file, calculated according to the different readability tests used. See section 12.4.1 Readability Formulas for more details.

3.1.2 Counts

3.1.2.1 Count of Paragraphs
The number of paragraphs in each file. Each line in the file is counted as one paragraph. Blank lines and lines containing only spaces, tabs, and other invisible characters are not counted.

3.1.2.2 Count of Paragraphs %
The percentage of the number of paragraphs in each file out of the total number of paragraphs in all files.

3.1.2.3 Count of Sentences
The number of sentences in each file. Wordless automatically applies the built-in sentence tokenizer according to the language of each file to calculate the number of sentences in each file. You can modify sentence tokenizer settings via Menu Bar → Preferences → Settings → Sentence Tokenization → Sentence Tokenizer Settings.

3.1.2.4 Count of Sentences %
The percentage of the number of sentences in each file out of the total number of sentences in all files.

3.1.2.5 Count of Sentence Segments
The number of sentence segments in each file. Each part of a sentence ending with one or more consecutive terminal punctuation marks (as per the Unicode Standard) is counted as one sentence segment. See the Unicode Standard for the full list of terminal punctuation marks.

3.1.2.6 Count of Sentence Segments %
The percentage of the number of sentence segments in each file out of the total number of sentence segments in all files.

3.1.2.7 Count of Tokens
The number of tokens in each file. Wordless automatically applies the built-in word tokenizer according to the language of each file to calculate the number of tokens in each file. You can modify word tokenizer settings via Menu Bar → Preferences → Settings → Word Tokenization → Word Tokenizer Settings. You can specify what should be counted as a "token" via Token Settings in the Settings Area.

3.1.2.8 Count of Tokens %
The percentage of the number of tokens in each file out of the total number of tokens in all files.

3.1.2.9 Count of Types
The number of token types in each file.

3.1.2.10 Count of Types %
The percentage of the number of token types in each file out of the total number of token types in all files.

3.1.2.11 Count of Syllables
The number of syllables in each file. Wordless automatically applies the built-in syllable tokenizer according to the language of each file to calculate the number of syllables in each file. You can modify syllable tokenizer settings via Menu Bar → Preferences → Settings → Syllable Tokenization → Syllable Tokenizer Settings.

3.1.2.12 Count of Syllables %
The percentage of the number of syllables in each file out of the total number of syllables in all files.

3.1.2.13 Count of Characters
The number of single characters in each file. Spaces, tabs, and all other invisible characters are not counted.

3.1.2.14 Count of Characters %
The percentage of the number of characters in each file out of the total number of characters in all files.
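To make the Counts table concrete, here is a rough sketch that approximates several of the counts above for a single English text, using NLTK's general-purpose tokenizers as stand-ins for Wordless's built-in, per-language tokenizers (the file name is hypothetical, and exact counts will differ from Wordless's):

```python
# A rough approximation of the Counts table for one English text.
import nltk

nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # newer NLTK versions use punkt_tab

text = open('corpus.txt', encoding='utf-8').read()  # hypothetical file

# Count of Paragraphs: one non-blank line = one paragraph
paragraphs = [line for line in text.splitlines() if line.strip()]

sentences = nltk.sent_tokenize(text)
tokens = nltk.word_tokenize(text)
types = set(token.lower() for token in tokens)

# Count of Characters: spaces, tabs and other invisible characters excluded
num_chars = sum(1 for char in text if not char.isspace())

print(len(paragraphs), len(sentences), len(tokens), len(types), num_chars)
```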
3.1.3 Lexical Density/Diversity
Statistics of lexical density/diversity, which reflect the extent to which the vocabulary used in each file varies. See section 12.4.2 Indicators of Lexical Density/Diversity for more details.

3.1.4 Lengths

3.1.4.1 Paragraph Length in Sentences / Sentence Segments / Tokens (Mean)
The average value of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.2 Paragraph Length in Sentences / Sentence Segments / Tokens (Standard Deviation)
The standard deviation of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.3 Paragraph Length in Sentences / Sentence Segments / Tokens (Variance)
The variance of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.4 Paragraph Length in Sentences / Sentence Segments / Tokens (Minimum)
The minimum of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.5 Paragraph Length in Sentences / Sentence Segments / Tokens (25th Percentile)
The 25th percentile of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.6 Paragraph Length in Sentences / Sentence Segments / Tokens (Median)
The median of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.7 Paragraph Length in Sentences / Sentence Segments / Tokens (75th Percentile)
The 75th percentile of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.8 Paragraph Length in Sentences / Sentence Segments / Tokens (Maximum)
The maximum of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.9 Paragraph Length in Sentences / Sentence Segments / Tokens (Range)
The range of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.10 Paragraph Length in Sentences / Sentence Segments / Tokens (Interquartile Range)
The interquartile range of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.11 Paragraph Length in Sentences / Sentence Segments / Tokens (Modes)
The mode(s) of paragraph lengths expressed in sentences / sentence segments / tokens.

3.1.4.12 Sentence / Sentence Segment Length in Tokens (Mean)
The average value of sentence / sentence segment lengths expressed in tokens.

3.1.4.13 Sentence / Sentence Segment Length in Tokens (Standard Deviation)
The standard deviation of sentence / sentence segment lengths expressed in tokens.

3.1.4.14 Sentence / Sentence Segment Length in Tokens (Variance)
The variance of sentence / sentence segment lengths expressed in tokens.

3.1.4.15 Sentence / Sentence Segment Length in Tokens (Minimum)
The minimum of sentence / sentence segment lengths expressed in tokens.

3.1.4.16 Sentence / Sentence Segment Length in Tokens (25th Percentile)
The 25th percentile of sentence / sentence segment lengths expressed in tokens.

3.1.4.17 Sentence / Sentence Segment Length in Tokens (Median)
The median of sentence / sentence segment lengths expressed in tokens.

3.1.4.18 Sentence / Sentence Segment Length in Tokens (75th Percentile)
The 75th percentile of sentence / sentence segment lengths expressed in tokens.

3.1.4.19 Sentence / Sentence Segment Length in Tokens (Maximum)
The maximum of sentence / sentence segment lengths expressed in tokens.

3.1.4.20 Sentence / Sentence Segment Length in Tokens (Range)
The range of sentence / sentence segment lengths expressed in tokens.

3.1.4.21 Sentence / Sentence Segment Length in Tokens (Interquartile Range)
The interquartile range of sentence / sentence segment lengths expressed in tokens.

3.1.4.22 Sentence / Sentence Segment Length in Tokens (Modes)
The mode(s) of sentence / sentence segment lengths expressed in tokens.

3.1.4.23 Token/Type Length in Syllables/Characters (Mean)
The average value of token / token type lengths expressed in syllables/characters.

3.1.4.24 Token/Type Length in Syllables/Characters (Standard Deviation)
The standard deviation of token / token type lengths expressed in syllables/characters.

3.1.4.25 Token/Type Length in Syllables/Characters (Variance)
The variance of token / token type lengths expressed in syllables/characters.

3.1.4.26 Token/Type Length in Syllables/Characters (Minimum)
The minimum of token / token type lengths expressed in syllables/characters.

3.1.4.27 Token/Type Length in Syllables/Characters (25th Percentile)
The 25th percentile of token / token type lengths expressed in syllables/characters.

3.1.4.28 Token/Type Length in Syllables/Characters (Median)
The median of token / token type lengths expressed in syllables/characters.

3.1.4.29 Token/Type Length in Syllables/Characters (75th Percentile)
The 75th percentile of token / token type lengths expressed in syllables/characters.

3.1.4.30 Token/Type Length in Syllables/Characters (Maximum)
The maximum of token / token type lengths expressed in syllables/characters.

3.1.4.31 Token/Type Length in Syllables/Characters (Range)
The range of token / token type lengths expressed in syllables/characters.

3.1.4.32 Token/Type Length in Syllables/Characters (Interquartile Range)
The interquartile range of token / token type lengths expressed in syllables/characters.

3.1.4.33 Token/Type Length in Syllables/Characters (Modes)
The mode(s) of token / token type lengths expressed in syllables/characters.

3.1.4.34 Syllable Length in Characters (Mean)
The average value of syllable lengths expressed in characters.

3.1.4.35 Syllable Length in Characters (Standard Deviation)
The standard deviation of syllable lengths expressed in characters.

3.1.4.36 Syllable Length in Characters (Variance)
The variance of syllable lengths expressed in characters.

3.1.4.37 Syllable Length in Characters (Minimum)
The minimum of syllable lengths expressed in characters.

3.1.4.38 Syllable Length in Characters (25th Percentile)
The 25th percentile of syllable lengths expressed in characters.

3.1.4.39 Syllable Length in Characters (Median)
The median of syllable lengths expressed in characters.

3.1.4.40 Syllable Length in Characters (75th Percentile)
The 75th percentile of syllable lengths expressed in characters.

3.1.4.41 Syllable Length in Characters (Maximum)
The maximum of syllable lengths expressed in characters.

3.1.4.42 Syllable Length in Characters (Range)
The range of syllable lengths expressed in characters.

3.1.4.43 Syllable Length in Characters (Interquartile Range)
The interquartile range of syllable lengths expressed in characters.

3.1.4.44 Syllable Length in Characters (Modes)
The mode(s) of syllable lengths expressed in characters.
3.1.5 Length Breakdown

3.1.5.1 Count of n-token-long Sentences / Sentence Segments
The number of n-token-long sentences / sentence segments, where n = 1, 2, 3, etc.

3.1.5.2 Count of n-token-long Sentences / Sentence Segments %
The percentage of the number of n-token-long sentences / sentence segments in each file out of the total number of n-token-long sentences / sentence segments in all files, where n = 1, 2, 3, etc.

3.1.5.3 Count of n-syllable-long Tokens
The number of n-syllable-long tokens, where n = 1, 2, 3, etc.

3.1.5.4 Count of n-syllable-long Tokens %
The percentage of the number of n-syllable-long tokens in each file out of the total number of n-syllable-long tokens in all files, where n = 1, 2, 3, etc.

3.1.5.5 Count of n-character-long Tokens
The number of n-character-long tokens, where n = 1, 2, 3, etc.

3.1.5.6 Count of n-character-long Tokens %
The percentage of the number of n-character-long tokens in each file out of the total number of n-character-long tokens in all files, where n = 1, 2, 3, etc.
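The Length Breakdown counts are essentially histograms over lengths. A minimal sketch for Count of n-character-long Tokens, using a hypothetical token list:

```python
# Histogram of token lengths in characters (Count of n-character-long Tokens).
from collections import Counter

tokens = ['the', 'quick', 'brown', 'fox', 'jumps']  # hypothetical token list

counts = Counter(len(token) for token in tokens)
for n in sorted(counts):
    print(f'{n}-character-long tokens: {counts[n]}')
```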
In Concordancer, you can search for tokens in different files and generate concordance lines. You can adjust settings for data generation via Generation Settings.
After the concordance lines are generated and displayed in the table, you can sort the results by clicking Sort Results or search the Data Table for parts that might be of interest to you by clicking Search in results. Highlight colors for sorting can be modified via Menu Bar → Preferences → Settings → Tables → Concordancer → Sorting.
You can generate concordance plots for all search terms. You can modify the settings for the generated figure via Figure Settings.
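The Left / Node / Right columns described below follow the classic key-word-in-context (KWIC) layout. A minimal sketch of how such lines can be generated from a token list (illustrative only, not Wordless's actual implementation):

```python
# A minimal KWIC sketch: for every hit of the search term, collect up to
# `width` tokens of left and right context, mirroring Left / Node / Right.
def concordance(tokens, search_term, width=10):
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == search_term.lower():
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

tokens = 'the cat sat on the mat and the cat slept'.split()
for left, node, right in concordance(tokens, 'cat', width=3):
    print(f'{left:>20} | {node} | {right}')
```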
4.1 Left
The context before each search term, which displays 10 tokens to the left of the Node by default. You can change this behavior via Generation Settings.

4.2 Node
The search term(s) specified in Search Settings → Search Term.

4.3 Right
The context after each search term, which displays 10 tokens to the right of the Node by default. You can change this behavior via Generation Settings.

4.4 Sentiment
The sentiment of the Node combined with its context (Left and Right).

4.5 Token No.
The position of the first token of the Node in each file.

4.6 Token No. %
The percentage of the position of the first token of the Node in each file.

4.7 Sentence Segment No.
The position of the sentence segment where the Node is found in each file.

4.8 Sentence Segment No. %
The percentage of the position of the sentence segment where the Node is found in each file.

4.9 Sentence No.
The position of the sentence where the Node is found in each file.

4.10 Sentence No. %
The percentage of the position of the sentence where the Node is found in each file.

4.11 Paragraph No.
The position of the paragraph where the Node is found in each file.

4.12 Paragraph No. %
The percentage of the position of the paragraph where the Node is found in each file.

4.13 File
The name of the file where the Node is found.
Note
- Added in Wordless 2.0.0
- Renamed from Concordancer (Parallel Mode) to Parallel Concordancer in Wordless 2.2.0
In Parallel Concordancer, you can search for tokens in parallel corpora and generate parallel concordance lines. You may leave Search Settings → Search Term blank so as to search for instances of additions and deletions.
You can search the Data Table for parts that might be of interest to you by clicking Search in results.

5.1 Parallel Unit No.
The position of the alignment unit (paragraph) where the search term is found.

5.2 Parallel Unit No. %
The percentage of the position of the alignment unit (paragraph) where the search term is found.

5.3 Parallel Units
The parallel unit (paragraph) where the search term is found in each file. Highlight colors for search terms can be modified via Menu Bar → Preferences → Settings → Tables → Parallel Concordancer → Highlight Color Settings.
Note
Added in Wordless 3.0.0
In Dependency Parser, you can search for all dependency relations associated with different tokens and calculate their dependency lengths (distances).
You can filter the results by clicking Filter results or search the Data Table for parts that might be of interest to you by clicking Search in results.
You can select lines in the Results Area and then click Generate Figure to show dependency graphs for all selected sentences. You can modify the settings for the generated figure via Figure Settings and decide how the figures should be displayed.
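A sketch of the dependency length defined in sections 6.3 and 6.4 below, using spaCy as an illustrative parser (Wordless's own parsing pipeline may differ; the model name assumes en_core_web_sm is installed):

```python
# Dependency length: head position minus dependent position, so the value is
# positive when the head follows the dependent and negative when it precedes.
import spacy

nlp = spacy.load('en_core_web_sm')  # assumes this model is installed
doc = nlp('The quick brown fox jumps over the lazy dog.')

for token in doc:
    if token.head is not token:  # skip the root, which is its own head
        length = token.head.i - token.i
        print(token.head.text, token.text, token.dep_, length, abs(length))
```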
6.1 Head
The token functioning as the head in the dependency structure.

6.2 Dependent
The token functioning as the dependent in the dependency structure.

6.3 Dependency Length
The dependency length (distance) between the head and the dependent in the dependency structure. The dependency length is positive if the head follows the dependent and negative if the head precedes the dependent.

6.4 Dependency Length (Absolute)
The absolute value of the dependency length (distance) between the head and the dependent in the dependency structure. The absolute dependency length is always positive.

6.5 Sentence
The sentence where the dependency structure is found. Highlight colors for the head and the dependent can be modified via Menu Bar → Preferences → Settings → Tables → Dependency Parser → Highlight Color Settings.

6.6 Sentence No.
The position of the sentence where the dependency structure is found.

6.7 Sentence No. %
The percentage of the position of the sentence where the dependency structure is found.

6.8 File
The name of the file where the dependency structure is found.
Note
Renamed from Wordlist to Wordlist Generator in Wordless 2.2.0
In Wordlist Generator, you can generate wordlists for different files and calculate the raw frequency, relative frequency, dispersion, and adjusted frequency of each token. You can disable the calculation of dispersion and/or adjusted frequency by setting Generation Settings → Measure of Dispersion / Measure of Adjusted Frequency to None.
You can filter the results by clicking Filter results or search the Data Table for parts that might be of interest to you by clicking Search in results.
You can generate line charts or word clouds for wordlists using any statistic. You can modify the settings for the generated figure via Figure Settings.
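At its core, a wordlist is a frequency table. A minimal sketch of raw and relative frequencies over hypothetical tokenized files (dispersion and adjusted frequency are covered in section 12.4.3):

```python
# Raw frequency via collections.Counter, relative frequency normalized by the
# token count of each file.
from collections import Counter

files = {
    'file_a.txt': 'the cat sat on the mat'.split(),  # hypothetical tokens
    'file_b.txt': 'the dog sat on the log'.split(),
}

for name, tokens in files.items():
    freqs = Counter(tokens)
    total = len(tokens)
    for token, freq in freqs.most_common():
        print(name, token, freq, freq / total)
```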
7.1 Rank
The rank of the token, sorted by its frequency in the first file in descending order (by default). You can sort the results again by clicking the column headers. You can use continuous numbering after tied ranks (e.g. 1/1/1/2/2/3 instead of 1/1/1/4/4/6) by checking Menu Bar → Preferences → Settings → Tables → Rank Settings → Continue numbering after ties.

7.2 Token
You can specify what should be counted as a "token" via Token Settings.

7.3 Syllabification
The syllabified form of each token. If the token happens to exist in the vocabulary of multiple languages, all syllabified forms with their applicable languages will be listed.
If there is no syllable tokenization support for the language where the token is found, "No language support" is displayed instead. To check which languages have syllable tokenization support, please refer to section 12.1 Supported Languages.

7.4 Frequency
The number of occurrences of the token in each file.

7.5 Dispersion
The dispersion of the token in each file. You can change the measure of dispersion used via Generation Settings → Measure of Dispersion. See section 12.4.3 Measures of Dispersion & Adjusted Frequency for more details.

7.6 Adjusted Frequency
The adjusted frequency of the token in each file. You can change the measure of adjusted frequency used via Generation Settings → Measure of Adjusted Frequency. See section 12.4.3 Measures of Dispersion & Adjusted Frequency for more details.

7.7 Number of Files Found
The number of files in which the token appears at least once.

7.8 Number of Files Found %
The percentage of the number of files in which the token appears at least once out of the total number of files that are currently selected.
Note
Renamed from N-gram to N-gram Generator in Wordless 2.2.0
In N-gram Generator, you can search for n-grams (consecutive tokens) or skip-grams (non-consecutive tokens) in different files, compute the raw frequency and relative frequency of each n-gram/skip-gram, and calculate the dispersion and adjusted frequency of each n-gram/skip-gram using different measures. You can adjust the settings for the generated results via Generation Settings. You can disable the calculation of dispersion and/or adjusted frequency by setting Generation Settings → Measure of Dispersion / Measure of Adjusted Frequency to None. To allow skip-grams in the results, check Generation Settings → Allow skipped tokens and modify the settings. You can also set constraints on the position of search terms in all n-grams via Search Settings → Search Term Position.
You can filter the results by clicking Filter results or search the Data Table for parts that might be of interest to you by clicking Search in results.
You can generate line charts or word clouds for n-grams using any statistic. You can modify the settings for the generated figure via Figure Settings.
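A minimal sketch of the difference between n-grams and skip-grams, using NLTK's ngrams and skipgrams utilities (illustrative only; Wordless's own extraction settings are richer):

```python
# ngrams(tokens, n) yields contiguous n-grams; skipgrams(tokens, n, k) also
# yields n-grams with up to k skipped tokens inside each one.
from nltk.util import ngrams, skipgrams

tokens = 'the quick brown fox jumps'.split()

print(list(ngrams(tokens, 2)))        # contiguous bigrams
print(list(skipgrams(tokens, 2, 1)))  # bigrams with at most 1 skipped token
```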
8.1 Rank
The rank of the n-gram, sorted by its frequency in the first file in descending order (by default). You can sort the results again by clicking the column headers. You can use continuous numbering after tied ranks (e.g. 1/1/1/2/2/3 instead of 1/1/1/4/4/6) by checking Menu Bar → Preferences → Settings → Tables → Rank Settings → Continue numbering after ties.

8.2 N-gram
You can specify what should be counted as an "n-gram" via Token Settings.

8.3 Frequency
The number of occurrences of the n-gram in each file.

8.4 Dispersion
The dispersion of the n-gram in each file. You can change the measure of dispersion used via Generation Settings → Measure of Dispersion. See section 12.4.3 Measures of Dispersion & Adjusted Frequency for more details.

8.5 Adjusted Frequency
The adjusted frequency of the n-gram in each file. You can change the measure of adjusted frequency used via Generation Settings → Measure of Adjusted Frequency. See section 12.4.3 Measures of Dispersion & Adjusted Frequency for more details.

8.6 Number of Files Found
The number of files in which the n-gram appears at least once.

8.7 Number of Files Found %
The percentage of the number of files in which the n-gram appears at least once out of the total number of files that are currently selected.
Note
Renamed from Collocation to Collocation Extractor in Wordless 2.2.0
In Collocation Extractor, you can search for patterns of collocation (tokens that co-occur more often than would be expected by chance) within a given collocational window (from 5 words to the left to 5 words to the right by default), conduct different tests of statistical significance on each pair of collocates, and calculate the Bayes factor and effect size of each pair using different measures. You can adjust the settings for the generated results via Generation Settings. You can disable the calculation of statistical significance, Bayes factor, and/or effect size by setting Generation Settings → Test of Statistical Significance / Measure of Bayes Factor / Measure of Effect Size to None.
You can filter the results by clicking Filter results or search the Data Table for parts that might be of interest to you by clicking Search in results.
You can generate line charts, word clouds, and network graphs for patterns of collocation using any statistic. You can modify the settings for the generated figure via Figure Settings.
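A minimal sketch of the windowed co-occurrence counting that underlies the Ln/Rn columns below, with a hypothetical token list and a 5-token window on each side (the same logic applies to Colligation Extractor, with parts of speech in place of tokens):

```python
# For each hit of the node, count collocates at positions L5...L1, R1...R5.
from collections import Counter, defaultdict

tokens = 'the cat sat on the mat because the cat was tired'.split()
node = 'cat'
window = 5

counts = defaultdict(Counter)  # counts[collocate][position]

for i, token in enumerate(tokens):
    if token != node:
        continue
    for offset in range(-window, window + 1):
        j = i + offset
        if offset == 0 or j < 0 or j >= len(tokens):
            continue
        position = f'L{-offset}' if offset < 0 else f'R{offset}'
        counts[tokens[j]][position] += 1

for collocate, positions in counts.items():
    # per-position counts (Ln...Rn) and their total (Frequency)
    print(collocate, dict(positions), sum(positions.values()))
```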
9.1 Rank
The rank of the collocating token, sorted by the p-value of the significance test conducted on the node and the collocating token in the first file in ascending order (by default). You can sort the results again by clicking the column headers. You can use continuous numbering after tied ranks (e.g. 1/1/1/2/2/3 instead of 1/1/1/4/4/6) by checking Menu Bar → Preferences → Settings → Tables → Rank Settings → Continue numbering after ties.

9.2 Node
The search term. You can specify what should be counted as a "token" via Token Settings.

9.3 Collocate
The collocating token. You can specify what should be counted as a "token" via Token Settings.

9.4 Ln, ..., L3, L2, L1, R1, R2, R3, ..., Rn
The number of co-occurrences of the node and the collocating token with the collocating token at the given position in each file.

9.5 Frequency
The total number of co-occurrences of the node and the collocating token with the collocating token at all possible positions in each file.

9.6 Test Statistic
The test statistic of the significance test conducted on the node and the collocating token in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.
Please note that the test statistic is not available for some tests of statistical significance.

9.7 p-value
The p-value of the significance test conducted on the node and the collocating token in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

9.8 Bayes Factor
The Bayes factor of the node and the collocating token in each file. You can change the measure of Bayes factor used via Generation Settings → Measure of Bayes Factor. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

9.9 Effect Size
The effect size of the node and the collocating token in each file. You can change the measure of effect size used via Generation Settings → Measure of Effect Size. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

9.10 Number of Files Found
The number of files in which the node and the collocating token co-occur at least once.

9.11 Number of Files Found %
The percentage of the number of files in which the node and the collocating token co-occur at least once out of the total number of files that are currently selected.
Note
Renamed from Colligation to Colligation Extractor in Wordless 2.2.0
In Colligation Extractor, you can search for patterns of colligation (parts of speech that co-occur more often than would be expected by chance) within a given collocational window (from 5 words to the left to 5 words to the right by default), conduct different tests of statistical significance on each pair of parts of speech, and calculate the Bayes factor and effect size of each pair using different measures. You can adjust the settings for the generated data via Generation Settings. You can disable the calculation of statistical significance, Bayes factor, and/or effect size by setting Generation Settings → Test of Statistical Significance / Measure of Bayes Factor / Measure of Effect Size to None.
Wordless automatically applies its built-in part-of-speech tagger, according to the language of each file, to every file that has not already been part-of-speech-tagged. If part-of-speech tagging is not supported for the given language, you should provide a file that has already been part-of-speech-tagged and make sure that the correct Text Type has been set on the file.
You can filter the results by clicking Filter results or search the Data Table for parts that might be of interest to you by clicking Search in results.
You can generate line charts or word clouds for patterns of colligation using any statistic. You can modify the settings for the generated figure via Figure Settings.
10.1 Rank
The rank of the collocating part of speech, sorted by the p-value of the significance test conducted on the node and the collocating part of speech in the first file in ascending order (by default). You can sort the results again by clicking the column headers. You can use continuous numbering after tied ranks (e.g. 1/1/1/2/2/3 instead of 1/1/1/4/4/6) by checking Menu Bar → Preferences → Settings → Tables → Rank Settings → Continue numbering after ties.

10.2 Node
The search term. You can specify what should be counted as a "token" via Token Settings.

10.3 Collocate
The collocating part of speech. You can specify what should be counted as a "token" via Token Settings.

10.4 Ln, ..., L3, L2, L1, R1, R2, R3, ..., Rn
The number of co-occurrences of the node and the collocating part of speech with the collocating part of speech at the given position in each file.

10.5 Frequency
The total number of co-occurrences of the node and the collocating part of speech with the collocating part of speech at all possible positions in each file.

10.6 Test Statistic
The test statistic of the significance test conducted on the node and the collocating part of speech in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.
Please note that the test statistic is not available for some tests of statistical significance.

10.7 p-value
The p-value of the significance test conducted on the node and the collocating part of speech in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

10.8 Bayes Factor
The Bayes factor of the node and the collocating part of speech in each file. You can change the measure of Bayes factor used via Generation Settings → Measure of Bayes Factor. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

10.9 Effect Size
The effect size of the node and the collocating part of speech in each file. You can change the measure of effect size used via Generation Settings → Measure of Effect Size. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

10.10 Number of Files Found
The number of files in which the node and the collocating part of speech co-occur at least once.

10.11 Number of Files Found %
The percentage of the number of files in which the node and the collocating part of speech co-occur at least once out of the total number of files that are currently selected.
Note
Renamed from Keyword to Keyword Extractor in Wordless 2.2.0
In Keyword Extractor, you can search for potential keywords (tokens whose frequencies in the observed files are far higher or far lower than in the reference file) in different files given a reference corpus, conduct different tests of statistical significance on each keyword, and calculate the Bayes factor and effect size of each keyword using different measures. You can adjust the settings for the generated data via Generation Settings. You can disable the calculation of statistical significance, Bayes factor, and/or effect size by setting Generation Settings → Test of Statistical Significance / Measure of Bayes Factor / Measure of Effect Size to None.
You can filter the results by clicking Filter results or search the Data Table for parts that might be of interest to you by clicking Search in results.
You can generate line charts or word clouds for keywords using any statistic. You can modify the settings for the generated figure via Figure Settings.
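A minimal sketch of the raw inputs to keyword extraction: per-token frequencies in a hypothetical observed file versus a hypothetical reference file, from which the significance tests and effect sizes described in section 12.4.4 are then computed:

```python
# Raw and relative frequencies in observed vs. reference tokens; these four
# numbers per token feed the contingency tables used by the keyness tests.
from collections import Counter

observed = 'poison potion poison cauldron the the the'.split()  # hypothetical
reference = 'the cat sat on the mat the end'.split()             # hypothetical

freq_obs, freq_ref = Counter(observed), Counter(reference)
n_obs, n_ref = len(observed), len(reference)

for token in freq_obs:
    print(token,
          freq_obs[token], freq_ref[token],          # raw frequencies
          freq_obs[token] / n_obs,                   # relative (observed)
          freq_ref[token] / n_ref)                   # relative (reference)
```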
11.1 Rank
The rank of the keyword, sorted by the p-value of the significance test conducted on the keyword in the first file in ascending order (by default). You can sort the results again by clicking the column headers. You can use continuous numbering after tied ranks (e.g. 1/1/1/2/2/3 instead of 1/1/1/4/4/6) by checking Menu Bar → Preferences → Settings → Tables → Rank Settings → Continue numbering after ties.

11.2 Keyword
The potential keyword. You can specify what should be counted as a "token" via Token Settings.

11.3 Frequency (in Reference File)
The number of occurrences of the keyword in the reference file.

11.4 Frequency (in Observed Files)
The number of occurrences of the keyword in each observed file.

11.5 Test Statistic
The test statistic of the significance test conducted on the keyword in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.
Please note that the test statistic is not available for some tests of statistical significance.

11.6 p-value
The p-value of the significance test conducted on the keyword in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

11.7 Bayes Factor
The Bayes factor of the keyword in each file. You can change the measure of Bayes factor used via Generation Settings → Measure of Bayes Factor. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

11.8 Effect Size
The effect size of the keyword in each file. You can change the measure of effect size used via Generation Settings → Measure of Effect Size. See section 12.4.4 Tests of Statistical Significance, Measures of Bayes Factor, & Measures of Effect Size for more details.

11.9 Number of Files Found
The number of files in which the keyword appears at least once.

11.10 Number of Files Found %
The percentage of the number of files in which the keyword appears at least once out of the total number of files that are currently selected.
Language | Sentence Tokenization | Word Tokenization | Syllable Tokenization | Part-of-speech Tagging | Lemmatization | Stop Word List | Dependency Parsing | Sentiment Analysis |
---|---|---|---|---|---|---|---|---|
Afrikaans | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Albanian | ⭕️ | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✖️ | ✔ |
Amharic | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Arabic | ✔ | ✔ | ✖️ | ✔ | ✔ | ✔ | ✔ | ✔ |
Armenian (Classical) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Armenian (Eastern) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Armenian (Western) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Assamese | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Asturian | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔ | ✖️ | ✖️ | ✖️ |
Azerbaijani | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✔ | ✖️ | ✔ |
Basque | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Belarusian | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Bengali | ⭕️ | ✔ | ✖️ | ✖️ | ✔ | ✔ | ✖️ | ✔ |
Bulgarian | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Burmese | ✔ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Buryat (Russia) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Catalan | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Chinese (Classical) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Chinese (Simplified) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✔ | ✔ | ✔ |
Chinese (Traditional) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✔ | ✔ | ✔ |
Church Slavonic (Old) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Coptic | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Croatian | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Czech | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Danish | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Dutch | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
English (Middle) | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔ | ✖️ | ✖️ | ✖️ |
English (Old) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
English (United Kingdom) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
English (United States) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Erzya | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Esperanto | ⭕️ | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Estonian | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Faroese | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✖️ | ✔ | ✖️ |
Finnish | ✔ | ✔ | ✖️ | ✔ | ✔ | ✔ | ✔ | ✔ |
French | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
French (Old) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Galician | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Georgian | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔ | ✖️ | ✖️ | ✔ |
German (Austria) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
German (Germany) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
German (Switzerland) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Gothic | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Greek (Ancient) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Greek (Modern) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Gujarati | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Hebrew (Ancient) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Hebrew (Modern) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✔ | ✔ | ✔ |
Hindi | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Hungarian | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Icelandic | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Indonesian | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Irish | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Italian | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Japanese | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Kannada | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Kazakh | ✔ | ✔ | ✖️ | ✔ | ✔ | ✔ | ✔ | ✔ |
Khmer | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✖️ | ✖️ | ✔ |
Korean | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Kurdish (Kurmanji) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Kyrgyz | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Lao | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✔ | ✖️ | ✔ |
Latin | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Latvian | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Ligurian | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Lithuanian | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Luganda | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Luxembourgish | ⭕️ | ✔ | ✖️ | ✖️ | ✔ | ✖️ | ✖️ | ✔ |
Macedonian | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Malay | ⭕️ | ✔ | ✖️ | ✖️ | ✔ | ✖️ | ✖️ | ✔ |
Malayalam | ✔ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Maltese | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✖️ | ✔ | ✔ |
Manx | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Marathi | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Meitei (Meitei script) | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Mongolian | ⭕️ | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Nepali | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✔ | ✖️ | ✔ |
Nigerian Pidgin | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Norwegian (Bokmål) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Norwegian (Nynorsk) | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Odia | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Persian | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Polish | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Pomak | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Portuguese (Brazil) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Portuguese (Portugal) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Punjabi (Gurmukhi script) | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Romanian | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Russian | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Russian (Old) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Sámi (Northern) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Sanskrit | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Scottish Gaelic | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Serbian (Cyrillic script) | ⭕️ | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✖️ | ✔ |
Serbian (Latin script) | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Sindhi | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✖️ | ✖️ | ✔ |
Sinhala | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Slovak | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Slovene | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Sorbian (Lower) | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ |
Sorbian (Upper) | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Spanish | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Swahili | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔ | ✖️ | ✖️ | ✔ |
Swedish | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Tagalog | ⭕️ | ✔ | ✖️ | ✖️ | ✔ | ✖️ | ✖️ | ✔ |
Tajik | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✔ | ✖️ | ✔ |
Tamil | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Tatar | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Telugu | ✔ | ✔ | ✔ | ✔ | ✖️ | ✖️ | ✔ | ✔ |
Tetun (Dili) | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ |
Thai | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✔ |
Tibetan | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✖️ | ✖️ |
Tigrinya | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Tswana | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ |
Turkish | ✔ | ✔ | ✖️ | ✔ | ✔ | ✔ | ✔ | ✔ |
Ukrainian | ✔ | ✔ | ✔ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Urdu | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Uyghur | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Vietnamese | ✔ | ✔ | ✖️ | ✔ | ✖️ | ✖️ | ✔ | ✔ |
Welsh | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✔ |
Wolof | ✔ | ✔ | ✖️ | ✔ | ✔ | ✖️ | ✔ | ✖️ |
Yoruba | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Zulu | ⭕️ | ⭕️ | ✔ | ✖️ | ✖️ | ✖️ | ✖️ | ✔ |
Other languages | ⭕️ | ⭕️ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ |
Note
✔: Supported
⭕️: Supported but falls back to the default English (United States) tokenizer
✖️: Not supported
File Type | File Extensions | Remarks |
---|---|---|
CSV files¹ | *.csv | |
Excel workbooks¹² | *.xlsx | Legacy Microsoft 97-2003 Excel workbooks (*.xls) are not supported. |
HTML pages¹² | *.htm, *.html | |
Lyrics files¹ | *.lrc | Simple LRC and enhanced LRC formats are supported. |
PDF files¹² | *.pdf | Text can only be extracted from text-searchable PDF files. There is no support for automatically converting scanned PDF files into text-searchable ones. |
PowerPoint presentations¹² | *.pptx | Legacy Microsoft 97-2003 PowerPoint presentations (*.ppt) are not supported. |
Text files | *.txt | |
Translation memory files¹ | *.tmx | |
Word documents¹² | *.docx | Legacy Microsoft 97-2003 Word documents (*.doc) are not supported. |
XML files¹ | *.xml | |
Important
- Non-TXT files are automatically converted to TXT files when imported into Wordless. You can check the converted files under the folder imports at the installation location of Wordless on your computer (on macOS, right-click Wordless.app, select Show Package Contents, and navigate to Contents/MacOS/imports/). You can change this location via Menu Bar → Preferences → Settings → General → Import → Temporary Files → Default path.
- It is not recommended to import non-text files directly into Wordless; support for doing so is provided only for convenience, since the accuracy of text extraction can never be guaranteed and unintended data loss might occur. Users are therefore encouraged to convert their files using specialized tools and make their own choices as to which parts of the data should be kept or discarded.
Language | File Encoding | Auto-detection |
---|---|---|
All languages | UTF-8 without BOM | ✔ |
All languages | UTF-8 with BOM | ✔ |
All languages | UTF-16 with BOM | ✔ |
All languages | UTF-16BE without BOM | ✔ |
All languages | UTF-16LE without BOM | ✔ |
All languages | UTF-32 with BOM | ✔ |
All languages | UTF-32BE without BOM | ✔ |
All languages | UTF-32LE without BOM | ✔ |
All languages | UTF-7 | ✔ |
Arabic | CP720 | ✔ |
Arabic | CP864 | ✔ |
Arabic | ISO-8859-6 | ✔ |
Arabic | Mac OS | ✔ |
Arabic | Windows-1256 | ✔ |
Baltic languages | CP775 | ✔ |
Baltic languages | ISO-8859-13 | ✔ |
Baltic languages | Windows-1257 | ✔ |
Celtic languages | ISO-8859-14 | ✔ |
Chinese | GB18030 | ✔ |
Chinese | GBK | ✔ |
Chinese (Simplified) | GB2312 | ✔ |
Chinese (Simplified) | HZ | ✔ |
Chinese (Traditional) | Big-5 | ✔ |
Chinese (Traditional) | Big5-HKSCS | ✔ |
Chinese (Traditional) | CP950 | ✔ |
Croatian | Mac OS | ✔ |
Cyrillic | CP855 | ✔ |
Cyrillic | CP866 | ✔ |
Cyrillic | ISO-8859-5 | ✔ |
Cyrillic | Mac OS | ✔ |
Cyrillic | Windows-1251 | ✔ |
English | ASCII | ✔ |
English | EBCDIC 037 | ✔ |
English | CP437 | ✔ |
European | HP Roman-8 | ✔ |
European (Central) | CP852 | ✔ |
European (Central) | ISO-8859-2 | ✔ |
European (Central) | Mac OS Central European | ✔ |
European (Central) | Windows-1250 | ✔ |
European (Northern) | ISO-8859-4 | ✔ |
European (Southern) | ISO-8859-3 | ✔ |
European (Southeastern) | ISO-8859-16 | ✔ |
European (Western) | EBCDIC 500 | ✔ |
European (Western) | CP850 | ✔ |
European (Western) | CP858 | ✔ |
European (Western) | CP1140 | ✔ |
European (Western) | ISO-8859-1 | ✔ |
European (Western) | ISO-8859-15 | ✔ |
European (Western) | Mac OS Roman | ✔ |
European (Western) | Windows-1252 | ✔ |
French | CP863 | ✔ |
German | EBCDIC 273 | ✔ |
Greek | CP737 | ✔ |
Greek | CP869 | ✔ |
Greek | CP875 | ✔ |
Greek | ISO-8859-7 | ✔ |
Greek | Mac OS | ✔ |
Greek | Windows-1253 | ✔ |
Hebrew | CP856 | ✔ |
Hebrew | CP862 | ✔ |
Hebrew | EBCDIC 424 | ✔ |
Hebrew | ISO-8859-8 | ✔ |
Hebrew | Windows-1255 | ✔ |
Icelandic | CP861 | ✔ |
Icelandic | Mac OS | ✔ |
Japanese | CP932 | ✔ |
Japanese | EUC-JP | ✔ |
Japanese | EUC-JIS-2004 | ✔ |
Japanese | EUC-JISx0213 | ✔ |
Japanese | ISO-2022-JP | ✔ |
Japanese | ISO-2022-JP-1 | ✔ |
Japanese | ISO-2022-JP-2 | ✔ |
Japanese | ISO-2022-JP-2004 | ✔ |
Japanese | ISO-2022-JP-3 | ✔ |
Japanese | ISO-2022-JP-EXT | ✔ |
Japanese | Shift_JIS | ✔ |
Japanese | Shift_JIS-2004 | ✔ |
Japanese | Shift_JISx0213 | ✔ |
Kazakh | KZ-1048 | ✔ |
Kazakh | PTCP154 | ✔ |
Korean | EUC-KR | ✔ |
Korean | ISO-2022-KR | ✔ |
Korean | JOHAB | ✔ |
Korean | UHC | ✔ |
Nordic languages | CP865 | ✔ |
Nordic languages | ISO-8859-10 | ✔ |
Persian/Urdu | Mac OS Farsi | ✔ |
Portuguese | CP860 | ✔ |
Romanian | Mac OS | ✔ |
Russian | KOI8-R | ✔ |
Tajik | KOI8-T | ✔ |
Thai | CP874 | ✔ |
Thai | ISO-8859-11 | ✔ |
Thai | TIS-620 | ✔ |
Turkish | CP857 | ✔ |
Turkish | EBCDIC 1026 | ✔ |
Turkish | ISO-8859-9 | ✔ |
Turkish | Mac OS | ✔ |
Turkish | Windows-1254 | ✔ |
Ukrainian | CP1125 | ✔ |
Ukrainian | KOI8-U | ✔ |
Urdu | CP1006 | ✔ |
Vietnamese | CP1258 | ✔ |
The readability of a text depends on several variables, including the average sentence length, the average word length in characters, the average word length in syllables, the number of monosyllabic words, the number of polysyllabic words, the number of difficult words, etc.
It should be noted that some readability measures are language-specific, or applicable only to texts in languages for which Wordless has built-in syllable tokenization support (see section 12.1 Supported Languages for reference), while others can be applied to texts in all languages.
The following variables are used in the formulas:
NumSentences: Number of sentences
NumWords: Number of words
NumWordsSyl₁: Number of monosyllabic words
NumWordsSylsₙ₊: Number of words with n or more syllables
NumWordsLtrsₙ₊: Number of words with n or more letters
NumWordsLtrsₙ₋: Number of words with n or fewer letters
NumConjs: Number of conjunctions
NumPreps: Number of prepositions
NumProns: Number of pronouns
NumWordsDale₇₆₉: Number of words outside the Dale list of 769 easy words (Dale, 1931)
NumWordsDale₃₀₀₀: Number of words outside the Dale list of 3000 easy words (Dale & Chall, 1948b)
NumWordsSpache: Number of words outside the Spache word list (Spache, 1974)
NumWordTypes: Number of word types
NumWordTypesBambergerVanecek: Number of word types outside the Bamberger-Vanecek's list of 1000 most common words (Bamberger & Vanecek, 1984, pp. 176–179)
NumWordTypesDale₇₆₉: Number of word types outside the Dale list of 769 easy words (Dale, 1931)
NumSyls: Number of syllables
NumSylsLuongNguyenDinh₁₀₀₀: Number of syllables outside the Luong-Nguyen-Dinh list of 1000 most frequent syllables extracted from all easy documents of the corpus of Vietnamese text readability dataset on literature domain (Luong et al., 2018)
NumCharsAll: Number of characters (letters, CJK characters, etc., numerals, and punctuation marks)
NumCharsAlnum: Number of alphanumeric characters (letters, CJK characters, etc., and numerals)
NumCharsAlpha: Number of alphabetic characters (letters, CJK characters, etc.)
Readability Formula | Formula | Supported Languages |
---|---|---|
Al-Heeti's Readability Prediction Formula¹ (Al-Heeti, 1984, pp. 102, 104, 106) | | Arabic |
Automated Arabic Readability Index (Al-Tamimi et al., 2013) | | Arabic |
Automated Readability Index¹ (Smith & Senter, 1967, p. 8; Navy: Kincaid et al., 1975, p. 14) | | All languages |
Bormuth's Cloze Mean & Grade Placement (Bormuth, 1969, pp. 152, 160) | where C is the cloze criterion score, whose value can be changed via Menu Bar → Preferences → Settings → Measures → Readability → Bormuth's Grade Placement → Cloze criterion score | English |
Coleman-Liau Index (Coleman & Liau, 1975) | | All languages |
Coleman's Readability Formula¹ (Liau et al., 1976) | | All languages²³ |
Dale-Chall Readability Formula¹ (Dale & Chall, 1948a; Dale & Chall, 1948b; Powers-Sumner-Kearl: Powers et al., 1958; New: Chall & Dale, 1995) | | English |
Danielson-Bryan's Readability Formula¹ (Danielson & Bryan, 1963) | | All languages |
Dawood's Readability Formula (Dawood, 1977) | | Arabic |
Degrees of Reading Power (College Entrance Examination Board, 1981) | where M is Bormuth's cloze mean. | English |
Devereux Readability Index (Smith, 1961) | | All languages |
Dickes-Steiwer Handformel (Dickes & Steiwer, 1977) | | All languages |
Easy Listening Formula (Fang, 1966) | | All languages² |
Flesch-Kincaid Grade Level (Kincaid et al., 1975, p. 14) | | All languages² |
Flesch Reading Ease¹ (Flesch, 1948; Powers-Sumner-Kearl: Powers et al., 1958; Dutch: Douma, 1960, p. 453; Brouwer, 1963; French: Kandel & Moles, 1958; German: Amstad, 1978; Italian: Franchina & Vacca, 1986; Russian: Oborneva, 2006, p. 13; Spanish: Fernández Huerta, 1959; Szigriszt Pazos, 1993, p. 247; Ukrainian: Partiko, 2001) | | All languages² |
Flesch Reading Ease (Farr-Jenkins-Paterson)¹ (Farr et al., 1951; Powers-Sumner-Kearl: Powers et al., 1958) | | All languages² |
FORCAST Grade Level (Caylor & Sticht, 1973, p. 3) | * One sample of 150 words is taken randomly from the text, so the text should be at least 150 words long. | All languages² |
Fórmula de comprensibilidad de Gutiérrez de Polini (Gutiérrez de Polini, 1972) | | Spanish |
Fórmula de Crawford (Crawford, 1985) | | Spanish² |
Fucks's Stilcharakteristik (Fucks, 1955) | | All languages² |
Gulpease Index (Lucisano & Emanuela Piemontese, 1988) | | Italian |
Gunning Fog Index¹ (English: Gunning, 1968, p. 38; Powers-Sumner-Kearl: Powers et al., 1958; Navy: Kincaid et al., 1975, p. 14; Polish: Pisarek, 1969) | where NumHardWords is the number of words with 3 or more syllables, except proper nouns and words with 3 syllables ending in -ed or -es, for English texts, and the number of words with 4 or more syllables in their base forms, except proper nouns, for Polish texts. | English & Polish² |
Legibilidad µ (Muñoz Baquedano, 2006) | where LenWordsAvg is the average word length in letters, and LenWordsVar is the variance of word lengths in letters. | Spanish |
Lensear Write (O'Hayre, 1966, p. 8) | where NumWords1Syl is the number of monosyllabic words excluding the, is, are, was, were. * One sample of 100 words is taken randomly from the text; if the text is shorter than 100 words, NumWords1Syl and NumSentences are multiplied by 100 and then divided by NumWords. | English² |
Lix (Björnsson, 1968) | | All languages |
Lorge Readability Index¹ (Lorge, 1944; Corrected: Lorge, 1948) | | English³ |
Luong-Nguyen-Dinh's Readability Formula (Luong et al., 2018) | * The number of syllables is estimated by tokenizing the text by whitespace and counting the number of tokens excluding punctuation marks. | Vietnamese |
McAlpine EFLAW Readability Score (Nirmaldasan, 2009) | | English |
neue Wiener Literaturformeln¹ (Bamberger & Vanecek, 1984, p. 82) | | German² |
neue Wiener Sachtextformel¹ (Bamberger & Vanecek, 1984, pp. 83–84) | | German² |
OSMAN (El-Haj & Rayson, 2016) | where NumFaseehWords is the number of words which have 5 or more syllables and contain ء/ئ/ؤ/ذ/ظ or end with وا/ون. * The number of syllables in each word is estimated by adding up the number of short syllables and twice the number of long and stress syllables in each word. | Arabic |
Rix (Anderson, 1983) | | All languages |
SMOG Grade (McLaughlin, 1969; German: Bamberger & Vanecek, 1984, p. 78) | * A sample is constructed using the first 10 sentences, the last 10 sentences, and the 10 sentences in the middle of the text, so the text should be at least 30 sentences long. | All languages² |
Spache Grade Level¹ (Spache, 1953; Revised: Spache, 1974) | * Three samples of 100 words each are taken randomly from the text and the results are averaged, so the text should be at least 100 words long. | All languages |
Strain Index (Solomon, 2006) | * A sample is constructed using the first 3 sentences of the text, so the text should be at least 3 sentences long. | All languages² |
Tränkle & Bailer's Readability Formula¹ (Tränkle & Bailer, 1984) | * One sample of 100 words is taken randomly from the text, so the text should be at least 100 words long. | All languages³ |
Tuldava's Text Difficulty (Tuldava, 1975) | | All languages² |
Wheeler & Smith's Readability Formula (Wheeler & Smith, 1954) | where NumUnits is the number of sentence segments ending in periods, question marks, exclamation marks, colons, semicolons, and dashes. | All languages² |
Note
- ¹ Variants available and can be selected via Menu Bar → Preferences → Settings → Measures → Readability
- ² Requires built-in syllable tokenization support
- ³ Requires built-in part-of-speech tagging support
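As a concrete illustration, here are two of the simpler formulas above expressed in code, using the published constants for the Automated Readability Index (Smith & Senter, 1967) and Flesch Reading Ease (Flesch, 1948). This is a sketch; Wordless's implementations and selectable variants may differ in detail:

```python
# Both formulas are computed directly from the variables defined above:
# NumSentences, NumWords, NumSyls, and NumCharsAlnum.
def automated_readability_index(num_sentences, num_words, num_chars_alnum):
    return (4.71 * (num_chars_alnum / num_words)
            + 0.5 * (num_words / num_sentences)
            - 21.43)

def flesch_reading_ease(num_sentences, num_words, num_syls):
    return (206.835
            - 1.015 * (num_words / num_sentences)
            - 84.6 * (num_syls / num_words))

# Hypothetical counts: 5 sentences, 100 words, 450 characters, 130 syllables
print(automated_readability_index(5, 100, 450))
print(flesch_reading_ease(5, 100, 130))
```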
Lexical density/diversity is the measurement of the extent to which the vocabulary used in the text varies.
The following variables are used in the formulas:
fᵢ: Frequency of the i-th token type ranked descendingly by frequencies
fₘₐₓ: Maximum frequency among all token types
NumTypes: Number of token types
NumTypesf: Number of token types whose frequencies equal f
NumTokens: Number of tokens
| Indicator of Lexical Density/Diversity | Formula |
|---|---|
| Brunét's Index (Brunét, 1978) | |
| Corrected TTR (Carroll, 1964) | |
| Fisher's Index of Diversity (Fisher et al., 1943) | where W₋₁ is the −1 branch of the Lambert W function |
| Herdan's Vₘ (Herdan, 1955) | |
| HD-D (McCarthy & Jarvis, 2010) | For detailed calculation procedures, see the reference. The sample size can be modified via Menu Bar → Preferences → Settings → Measures → Lexical Density/Diversity → HD-D → Sample size. |
| Honoré's Statistic (Honoré, 1979) | |
| Lexical Density (Ure, 1971) | where NumContentWords is the number of content words. By default, all tokens whose universal part-of-speech tags assigned by the built-in part-of-speech taggers are ADJ (adjectives), ADV (adverbs), INTJ (interjections), NOUN (nouns), PROPN (proper nouns), NUM (numerals), VERB (verbs), SYM (symbols), or X (others) are categorized as content words. For some built-in part-of-speech taggers, this behavior can be changed via Menu Bar → Preferences → Settings → Part-of-speech Tagging → Tagsets → Mapping Settings → Content/Function Words. |
| LogTTR¹ (Herdan: Herdan, 1960, p. 28; Somers: Somers, 1966; Rubet: Dugast, 1979; Maas: Maas, 1972; Dugast: Dugast, 1978, 1979) | |
| Mean Segmental TTR (Johnson, 1944) | where n is the number of equal-sized segments, whose length can be modified via Menu Bar → Preferences → Settings → Measures → Lexical Density/Diversity → Mean Segmental TTR → Number of tokens in each segment, NumTypesSegᵢ is the number of token types in the i-th segment, and NumTokensSegᵢ is the number of tokens in the i-th segment. |
| Measure of Textual Lexical Diversity (McCarthy, 2005, pp. 95–96, 99–100; McCarthy & Jarvis, 2010) | For detailed calculation procedures, see the references. The factor size can be modified via Menu Bar → Preferences → Settings → Measures → Lexical Density/Diversity → Measure of Textual Lexical Diversity → Factor size. |
| Moving-average TTR (Covington & McFall, 2010) | where w is the window size, which can be modified via Menu Bar → Preferences → Settings → Measures → Lexical Density/Diversity → Moving-average TTR → Window size, NumTypesWindowₚ is the number of token types within the moving window starting at position p, and NumTokensWindowₚ is the number of tokens within the moving window starting at position p. |
| Popescu-Mačutek-Altmann's B₁/B₂/B₃/B₄/B₅ (Popescu et al., 2008) | |
| Popescu's R₁ (Popescu, 2009, pp. 18, 30, 33) | For detailed calculation procedures, see the reference. |
| Popescu's R₂ (Popescu, 2009, pp. 35–36, 38) | For detailed calculation procedures, see the reference. |
| Popescu's R₃ (Popescu, 2009, pp. 48–49, 53) | For detailed calculation procedures, see the reference. |
| Popescu's R₄ (Popescu, 2009, p. 57) | For detailed calculation procedures, see the reference. |
| Repeat Rate¹ (Popescu, 2009, p. 166) | |
| Root TTR (Guiraud, 1954) | |
| Shannon Entropy¹ (Popescu, 2009, p. 173) | |
| Simpson's l (Simpson, 1949) | |
| Type-token Ratio (Johnson, 1944) | |
| vocd-D (Malvern et al., 2004, pp. 51, 56–57) | For detailed calculation procedures, see the reference. |
| Yule's Characteristic K (Yule, 1944, pp. 52–53) | |
| Yule's Index of Diversity (Williams, 1970, p. 100) | |
Note
1. Variants are available and can be selected via Menu Bar → Preferences → Settings → Measures → Lexical Density/Diversity.
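As an illustration of the simpler indicators above, the following minimal Python sketch computes Type-token Ratio, Root TTR, Corrected TTR, and Moving-average TTR from a plain token list, using the published formulas; it is a sketch for clarity, not Wordless's implementation:

```python
import math

def ttr(tokens):
    # Type-token Ratio: NumTypes / NumTokens
    return len(set(tokens)) / len(tokens)

def root_ttr(tokens):
    # Root TTR (Guiraud, 1954): NumTypes / sqrt(NumTokens)
    return len(set(tokens)) / math.sqrt(len(tokens))

def corrected_ttr(tokens):
    # Corrected TTR (Carroll, 1964): NumTypes / sqrt(2 * NumTokens)
    return len(set(tokens)) / math.sqrt(2 * len(tokens))

def mattr(tokens, window_size=500):
    # Moving-average TTR (Covington & McFall, 2010): the mean of the TTRs
    # of all windows of a fixed size sliding through the text token by token
    if len(tokens) <= window_size:
        return ttr(tokens)
    ttrs = [
        ttr(tokens[p : p + window_size])
        for p in range(len(tokens) - window_size + 1)
    ]
    return sum(ttrs) / len(ttrs)

tokens = 'the quick brown fox jumps over the lazy dog near the barn'.split()
print(f'TTR = {ttr(tokens):.3f}, MATTR = {mattr(tokens, window_size=5):.3f}')
```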
For parts-based measures, each file is divided into n sub-sections (the value of n can be modified via Menu Bar → Preferences → Settings → Measures → Dispersion / Adjusted Frequency → General Settings → Divide each file into subsections), and the frequency of the word in each sub-section is counted and denoted by F₁, F₂, F₃, ..., Fₙ respectively. The total frequency of the word in each file is denoted by F, and the mean of the frequencies over all sub-sections is denoted by F̅.
For distance-based measures, the distance between each pair of consecutive occurrences of the word is calculated and denoted by d₁, d₂, d₃, ..., dF respectively. The total number of tokens in each file is denoted by N.
Then, the dispersion and adjusted frequency of the word are calculated as follows:
| Measure of Dispersion (Parts-based) | Measure of Adjusted Frequency (Parts-based) | Formula |
|---|---|---|
| Carroll's D₂ (Carroll, 1970) | Carroll's Uₘ (Carroll, 1970) | |
| | Engwall's FM (Engwall, 1974) | where R is the number of sub-sections in which the word appears at least once |
| Gries's DP (Gries, 2008; Lijffijt & Gries, 2012) | | * Normalization is applied by default; this behavior can be changed via Menu Bar → Preferences → Settings → Measures → Dispersion → Gries's DP → Apply normalization. |
| Juilland's D (Juilland & Chang-Rodriguez, 1964) | Juilland's U (Juilland & Chang-Rodriguez, 1964) | |
| | Kromer's UR (Kromer, 2003) | where ψ is the digamma function and C is the Euler–Mascheroni constant |
| Lyne's D₃ (Lyne, 1985) | | |
| Rosengren's S (Rosengren, 1971) | Rosengren's KF (Rosengren, 1971) | |
| Zhang's Distributional Consistency (Zhang et al., 2004) | | |
| Measure of Dispersion (Distance-based) | Measure of Adjusted Frequency (Distance-based) | Formula |
|---|---|---|
| Average Logarithmic Distance (Savický & Hlaváčová, 2002) | Average Logarithmic Distance (Savický & Hlaváčová, 2002) | |
| Average Reduced Frequency (Savický & Hlaváčová, 2002) | Average Reduced Frequency (Savický & Hlaváčová, 2002) | |
| Average Waiting Time (Savický & Hlaváčová, 2002) | Average Waiting Time (Savický & Hlaváčová, 2002) | |
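As an illustration, the following Python sketch computes three of the measures above whose published formulas are compact: Juilland's D (1 − V/√(n − 1), where V is the coefficient of variation of F₁, ..., Fₙ), Gries's DP with its optional normalization, and the distance-based Average Reduced Frequency. Equal-sized sub-sections and a population standard deviation are assumed; this is not Wordless's internal code:

```python
import math

def juilland_d(freqs):
    # Juilland's D = 1 - V / sqrt(n - 1), where V is the coefficient of
    # variation (population standard deviation / mean) of F1..Fn
    n = len(freqs)
    mean = sum(freqs) / n
    sd = math.sqrt(sum((f - mean) ** 2 for f in freqs) / n)
    return 1 - (sd / mean) / math.sqrt(n - 1)

def gries_dp(freqs, normalize=True):
    # Gries's DP: half the sum of absolute differences between each
    # sub-section's share of the corpus (1/n for equal-sized parts) and
    # its share of the word's occurrences
    n = len(freqs)
    total = sum(freqs)
    dp = sum(abs(f / total - 1 / n) for f in freqs) / 2
    if normalize:
        # normalized variant (Lijffijt & Gries, 2012)
        dp /= 1 - 1 / n
    return dp

def average_reduced_frequency(positions, num_tokens):
    # ARF (Savický & Hlaváčová, 2002): (1/v) * sum(min(d_i, v)), where
    # v = N / F and d_i are the cyclic distances between consecutive
    # occurrences of the word (positions are 0-based token indices)
    v = num_tokens / len(positions)
    dists = [q - p for p, q in zip(positions, positions[1:])]
    dists.append(num_tokens - positions[-1] + positions[0])  # wrap around
    return sum(min(d, v) for d in dists) / v

freqs = [4, 2, 0, 1, 3]  # F1..F5 for one word across 5 sub-sections
print(f"Juilland's D = {juilland_d(freqs):.3f}")
print(f"Gries's DP   = {gries_dp(freqs):.3f}")
print(f'ARF          = {average_reduced_frequency([3, 40, 41, 80], 100):.3f}')
```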
To calculate the statistical significance, Bayes factor, and effect size (except for the Mann-Whitney U Test, Student's t-test (2-sample), and Welch's t-test) for two words in the same file (collocates) or for one word in two different files (keywords), two contingency tables must first be constructed: one for observed values and one for expected values.
For collocates (in the Collocation Extractor and Colligation Extractor):
| Observed Values | Word 1 | Not Word 1 | Row Total |
|---|---|---|---|
| Word 2 | O₁₁ | O₁₂ | O₁ₓ = O₁₁ + O₁₂ |
| Not Word 2 | O₂₁ | O₂₂ | O₂ₓ = O₂₁ + O₂₂ |
| Column Total | Oₓ₁ = O₁₁ + O₂₁ | Oₓ₂ = O₁₂ + O₂₂ | Oₓₓ = O₁₁ + O₁₂ + O₂₁ + O₂₂ |

| Expected Values | Word 1 | Not Word 1 |
|---|---|---|
| Word 2 | E₁₁ = O₁ₓ × Oₓ₁ / Oₓₓ | E₁₂ = O₁ₓ × Oₓ₂ / Oₓₓ |
| Not Word 2 | E₂₁ = O₂ₓ × Oₓ₁ / Oₓₓ | E₂₂ = O₂ₓ × Oₓ₂ / Oₓₓ |
O₁₁: Number of occurrences of Word 1 followed by Word 2.
O₁₂: Number of occurrences of Word 1 followed by any word except Word 2.
O₂₁: Number of occurrences of any word except Word 1 followed by Word 2.
O₂₂: Number of occurrences of any word except Word 1 followed by any word except Word 2.
For keywords (in the Keyword Extractor):

| Observed Values | Observed File | Reference File | Row Total |
|---|---|---|---|
| Word w | O₁₁ | O₁₂ | O₁ₓ = O₁₁ + O₁₂ |
| Not Word w | O₂₁ | O₂₂ | O₂ₓ = O₂₁ + O₂₂ |
| Column Total | Oₓ₁ = O₁₁ + O₂₁ | Oₓ₂ = O₁₂ + O₂₂ | Oₓₓ = O₁₁ + O₁₂ + O₂₁ + O₂₂ |

| Expected Values | Observed File | Reference File |
|---|---|---|
| Word w | E₁₁ = O₁ₓ × Oₓ₁ / Oₓₓ | E₁₂ = O₁ₓ × Oₓ₂ / Oₓₓ |
| Not Word w | E₂₁ = O₂ₓ × Oₓ₁ / Oₓₓ | E₂₂ = O₂ₓ × Oₓ₂ / Oₓₓ |
O₁₁: Number of occurrences of Word w in the observed file.
O₁₂: Number of occurrences of Word w in the reference file.
O₂₁: Number of occurrences of all words except Word w in the observed file.
O₂₂: Number of occurrences of all words except Word w in the reference file.
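As a worked illustration of the two tables, the following sketch derives the expected values from the row and column totals and computes the log-likelihood ratio statistic G² = 2 Σ Oᵢⱼ ln(Oᵢⱼ/Eᵢⱼ) (Dunning, 1993); the counts are hypothetical and this is not Wordless's internal code:

```python
import math

def expected_values(o11, o12, o21, o22):
    # E_ij = (row total * column total) / grand total
    oxx = o11 + o12 + o21 + o22
    return (
        (o11 + o12) * (o11 + o21) / oxx,  # E11
        (o11 + o12) * (o12 + o22) / oxx,  # E12
        (o21 + o22) * (o11 + o21) / oxx,  # E21
        (o21 + o22) * (o12 + o22) / oxx,  # E22
    )

def log_likelihood_ratio(o11, o12, o21, o22):
    # G2 = 2 * sum over cells of O_ij * ln(O_ij / E_ij); 0 * ln(0) is 0
    observed = (o11, o12, o21, o22)
    expected = expected_values(*observed)
    return 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )

# hypothetical counts: Word w occurs 150 times in a 10,000-token observed
# file and 100 times in a 20,000-token reference file
print(f'G2 = {log_likelihood_ratio(150, 100, 9_850, 19_900):.2f}')
```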
To conduct the Mann-Whitney U Test, Student's t-test (2-sample), or Welch's t-test on a specific word, each column total is first divided into n (5 by default) sub-sections. More specifically, in the Collocation Extractor and Colligation Extractor, the collocates where Word 1 appears as the node and the collocates where Word 1 does not appear as the node are each divided into n parts; in the Keyword Extractor, all tokens in the observed file and all tokens in the reference file are each divided into n equal parts.
The frequencies of Word 2 (in the Collocation Extractor and Colligation Extractor) or Word w (in the Keyword Extractor) in each sub-section of the two column totals are counted and denoted by F₁₁, F₂₁, F₃₁, ..., Fₙ₁ and F₁₂, F₂₂, F₃₂, ..., Fₙ₂ respectively. The total frequencies of Word 2 or Word w in the two column totals are denoted by Fₓ₁ and Fₓ₂ respectively, and the means of the frequencies over all sub-sections in the two column totals are denoted by F̅ₓ₁ and F̅ₓ₂ respectively.
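The division into sub-sections can be pictured with the following sketch, which assumes a plain token list and a naive equal split; the two resulting frequency lists would then be passed to the chosen test, for example SciPy's scipy.stats.mannwhitneyu:

```python
def freqs_in_subsections(tokens, word, n=5):
    # Divide the token list into n roughly equal parts and count the
    # occurrences of the word in each part, yielding F11..Fn1 (or F12..Fn2)
    size, rem = divmod(len(tokens), n)
    freqs, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        freqs.append(tokens[start:end].count(word))
        start = end
    return freqs

# hypothetical usage for keywords: one frequency list per file
# f_obs = freqs_in_subsections(tokens_observed, 'research')
# f_ref = freqs_in_subsections(tokens_reference, 'research')
# scipy.stats.mannwhitneyu(f_obs, f_ref) would then compare the two
```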
Then the test statistic, Bayes factor, and effect size are calculated as follows:
| Test of Statistical Significance | Measure of Bayes Factor | Formula |
|---|---|---|
| Fisher's Exact Test (Pedersen, 1996) | | See: Fisher's exact test - Wikipedia |
| Log-likelihood Ratio Test (Dunning, 1993) | Log-likelihood Ratio Test (Wilson, 2013) | |
| Mann-Whitney U Test (Kilgarriff, 2001) | | See: Mann–Whitney U test - Wikipedia |
| Pearson's Chi-squared Test (Hofland & Johansson, 1982; Oakes, 1998) | | |
| Student's t-test (1-sample) (Church et al., 1991) | | |
| Student's t-test (2-sample) (Paquot & Bestgen, 2009) | Student's t-test (2-sample) (Wilson, 2013) | |
| z-score (Dennis, 1964) | | |
| z-score (Berry-Rogghe) (Berry-Rogghe, 1973) | | where S is the average span size on both sides of the node word |
| Measure of Effect Size | Formula |
|---|---|
| %DIFF (Gabrielatos & Marchi, 2012) | |
| Cubic Association Ratio (Daille, 1994, 1995) | |
| Dice's Coefficient (Smadja et al., 1996) | |
| Difference Coefficient (Hofland & Johansson, 1982; Gabrielatos, 2018) | |
| Jaccard Index (Dunning, 1998) | |
| Kilgarriff's Ratio (Kilgarriff, 2009) | where α is the smoothing parameter, whose value can be changed via Menu Bar → Preferences → Settings → Measures → Effect Size → Kilgarriff's Ratio → Smoothing Parameter |
| Log Ratio (Hardie, 2014) | |
| Log-Frequency Biased MD (Thanopoulos et al., 2002) | |
| logDice (Rychlý, 2008) | |
| MI.log-f (Lexical Computing, 2015; Kilgarriff & Tugwell, 2002) | |
| Minimum Sensitivity (Pedersen, 1998) | |
| Mutual Dependency (Thanopoulos et al., 2002) | |
| Mutual Expectation (Dias et al., 1999) | |
| Mutual Information (Dunning, 1998) | |
| Odds Ratio (Pojanapunya & Todd, 2016) | |
| Pointwise Mutual Information (Church & Hanks, 1990) | |
| Poisson Collocation Measure (Quasthoff & Wolff, 2002) | |
| Squared Phi Coefficient (Church & Gale, 1991) | |
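Two of the simpler measures follow directly from the contingency-table notation above: Pointwise Mutual Information, log₂(O₁₁/E₁₁), and Dice's Coefficient, 2O₁₁/(O₁ₓ + Oₓ₁). The following sketch uses hypothetical counts and is an illustration, not Wordless's internal code:

```python
import math

def pmi(o11, o1x, ox1, oxx):
    # Pointwise Mutual Information (Church & Hanks, 1990):
    # log2(O11 / E11), with E11 = O1x * Ox1 / Oxx
    return math.log2(o11 / (o1x * ox1 / oxx))

def dice(o11, o1x, ox1):
    # Dice's Coefficient: 2 * O11 / (O1x + Ox1)
    return 2 * o11 / (o1x + ox1)

# hypothetical counts in a 1,000,000-token corpus: Word 1 occurs 800 times
# (Ox1), Word 2 occurs 500 times (O1x), and they co-occur 30 times (O11)
print(f'PMI  = {pmi(30, 500, 800, 1_000_000):.2f}')  # ~6.23
print(f'Dice = {dice(30, 500, 800):.4f}')            # ~0.0462
```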
- Al-Heeti, K. N. (1984). Judgment analysis technique applied to readability prediction of Arabic reading material [Doctoral dissertation, University of Northern Colorado]. ProQuest Dissertations and Theses Global.
- Al-Tamimi, A., Jaradat, M., Aljarrah, N., & Ghanim, S. (2013). AARI: Automatic Arabic readability index. The International Arab Journal of Information Technology, 11(4), 370–378.
- Amstad, T. (1978). Wie verständlich sind unsere Zeitungen? [Unpublished doctoral dissertation]. University of Zurich.
- Anderson, J. (1983). Lix and Rix: Variations on a little-known readability index. Journal of Reading, 26(6), 490–496.
- Bamberger, R., & Vanecek, E. (1984). Lesen-verstehen-lernen-schreiben: Die schwierigkeitsstufen von texten in deutscher sprache. Jugend und Volk.
- Berry-Rogghe, G. L. M. (1973). The computation of collocations and their relevance in lexical studies. In A. J. Aiken, R. W. Bailey, & N. Hamilton-Smith (Eds.), The computer and literary studies (pp. 103–112). Edinburgh University Press.
- Björnsson, C.-H. (1968). Läsbarhet. Liber.
- Bormuth, J. R. (1969). Development of readability analyses. U.S. Department of Health, Education, and Welfare. http://files.eric.ed.gov/fulltext/ED029166.pdf
- Brouwer, R. H. M. (1963). Onderzoek naar de leesmoeilijkheid van Nederlands proza. Paedagogische studiën, 40, 454–464. https://objects.library.uu.nl/reader/index.php?obj=1874-205260&lan=en
- Brunét, E. (1978). Le vocabulaire de Jean Giraudoux: Structure et evolution. Slatkine.
- Carroll, J. B. (1964). Language and thought. Prentice-Hall.
- Carroll, J. B. (1970). An alternative to Juilland's usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour, 3(2), 61–65. https://doi.org/10.1002/j.2333-8504.1970.tb00778.x
- Caylor, J. S., & Sticht, T. G. (1973). Development of a simple readability index for job reading material. Human Resource Research Organization. https://ia902703.us.archive.org/31/items/ERIC_ED076707/ERIC_ED076707.pdf
- Chall, J. S., & Dale, E. (1995). Readability revisited: The new Dale-Chall readability formula. Brookline Books.
- Church, K. W., & Gale, W. A. (1991, September 29–October 1). Concordances for parallel text [Paper presentation]. Using Corpora: Seventh Annual Conference of the UW Centre for the New OED and Text Research, St. Catherine's College, Oxford, United Kingdom.
- Church, K., Gale, W., Hanks, P., & Hindle, D. (1991). Using statistics in lexical analysis. In U. Zernik (Ed.), Lexical acquisition: Exploiting on-line resources to build a lexicon (pp. 115–164). Psychology Press.
- Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.
- Coleman, M., & Liau, T. L. (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60(2), 283–284. https://doi.org/10.1037/h0076540
- College Entrance Examination Board. (1981). Degrees of reading power brings the students and the text together.
- Covington, M. A., & McFall, J. D. (2010). Cutting the Gordian knot: The moving-average type-token ratio (MATTR). Journal of Quantitative Linguistics, 17(2), 94–100. https://doi.org/10.1080/09296171003643098
- Crawford, A. N. (1985). Fórmula y gráfico para determinar la comprensibilidad de textos de nivel primario en castellano. Lectura y Vida, 6(4). http://www.lecturayvida.fahce.unlp.edu.ar/numeros/a6n4/06_04_Crawford.pdf
- Daille, B. (1994). Approche mixte pour l'extraction automatique de terminologie: Statistiques lexicales et filtres linguistiques [Doctoral thesis, Paris Diderot University]. Béatrice Daille. http://www.bdaille.com/index.php?option=com_docman&task=doc_download&gid=8&Itemid=
- Daille, B. (1995). Combined approach for terminology extraction: Lexical statistics and linguistic filtering. UCREL technical papers (Vol. 5). Lancaster University.
- Dale, E. (1931). A comparison of two word lists. Educational Research Bulletin, 10(18), 484–489.
- Dale, E., & Chall, J. S. (1948a). A formula for predicting readability. Educational Research Bulletin, 27(1), 11–20, 28.
- Dale, E., & Chall, J. S. (1948b). A formula for predicting readability: Instructions. Educational Research Bulletin, 27(2), 37–54.
- Danielson, W. A., & Bryan, S. D. (1963). Computer automation of two readability formulas. Journalism Quarterly, 40(2), 201–206. https://doi.org/10.1177/107769906304000207
- Dawood, B. A. K. (1977). The relationship between readability and selected language variables [Unpublished master's thesis]. University of Baghdad.
- Dennis, S. F. (1964). The construction of a thesaurus automatically from a sample of text. In M. E. Stevens, V. E. Giuliano, & L. B. Heilprin (Eds.), Proceedings of the symposium on statistical association methods for mechanized documentation (pp. 61–148). National Bureau of Standards.
- Dias, G., Guilloré, S., & Pereira Lopes, J. G. (1999). Language independent automatic acquisition of rigid multiword units from unrestricted text corpora. In A. Condamines, C. Fabre, & M. Péry-Woodley (Eds.), TALN'99: 6ème Conférence Annuelle Sur le Traitement Automatique des Langues Naturelles (pp. 333–339). TALN.
- Dickes, P., & Steiwer, L. (1977). Ausarbeitung von lesbarkeitsformeln für die deutsche sprache. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 9(1), 20–28.
- Douma, W. H. (1960). De leesbaarheid van landbouwbladen: Een onderzoek naar en een toepassing van leesbaarheidsformules [Readability of Dutch farm papers: A discussion and application of readability formulas]. Afdeling sociologie en sociografie van de Landbouwhogeschool Wageningen. https://edepot.wur.nl/276323
- Dugast, D. (1978). Sur quoi se fonde la notion d'étendue théorique du vocabulaire? Le Français Moderne, 46, 25–32.
- Dugast, D. (1979). Vocabulaire et stylistique: I théâtre et dialogue, travaux de linguistique quantitative. Slatkine.
- Dunning, T. E. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
- Dunning, T. E. (1998). Finding structure in text, genome and other symbolic sequences [Doctoral dissertation, University of Sheffield]. arXiv. https://arxiv.org/pdf/1207.1847.pdf
- El-Haj, M., & Rayson, P. (2016). OSMAN: A novel Arabic readability metric. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 250–255). European Language Resources Association. http://www.lrec-conf.org/proceedings/lrec2016/index.html
- Engwall, G. (1974). Fréquence et distribution du vocabulaire dans un choix de romans français [Unpublished doctoral dissertation]. Stockholm University.
- Fang, I. E. (1966). The easy listening formula. Journal of Broadcasting, 11(1), 63–68. https://doi.org/10.1080/08838156609363529
- Farr, J. N., Jenkins, J. J., & Paterson, D. G. (1951). Simplification of Flesch reading ease formula. Journal of Applied Psychology, 35(5), 333–337. https://doi.org/10.1037/h0062427
- Fernández Huerta, J. (1959). Medidas sencillas de lecturabilidad. Consigna, 214, 29–32.
- Fisher, R. A., Corbet, A. S., & Williams, C. B. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology, 12(1), 42–58. https://doi.org/10.2307/1411
- Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3), 221–233. https://doi.org/10.1037/h0057532
- Franchina, V., & Vacca, R. (1986). Adaptation of Flesch readability index on a bilingual text written by the same author both in Italian and English languages. Linguaggi, 3, 47–49.
- Fucks, W. (1955). Unterschied des Prosastils von Dichtern und anderen Schriftstellern: Ein Beispiel mathematischer Stilanalyse. Bouvier.
- Gabrielatos, C. (2018). Keyness analysis: Nature, metrics and techniques. In C. Taylor & A. Marchi (Eds.), Corpus approaches to discourse: A critical review (pp. 225–258). Routledge.
- Gabrielatos, C., & Marchi, A. (2012, September 13–14). Keyness: Appropriate metrics and practical issues [Conference session]. CADS International Conference 2012, University of Bologna, Italy.
- Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri
- Guiraud, P. (1954). Les caractères statistiques du vocabulaire: Essai de méthodologie. Presses universitaires de France.
- Gunning, R. (1968). The technique of clear writing (revised ed.). McGraw-Hill Book Company.
- Gutiérrez de Polini, L. E. (1972). Investigación sobre lectura en Venezuela [Paper presentation]. Primeras Jornadas de Educación Primaria, Ministerio de Educación, Caracas, Venezuela.
- Hardie, A. (2014, April 28). Log ratio: An informal introduction. ESRC Centre for Corpus Approaches to Social Science (CASS). http://cass.lancs.ac.uk/log-ratio-an-informal-introduction/
- Herdan, G. (1955). A new derivation and interpretation of Yule's 'Characteristic' K. Zeitschrift für Angewandte Mathematik und Physik (ZAMP), 6(4), 332–339. https://doi.org/10.1007/BF01587632
- Herdan, G. (1960). Type-token mathematics: A textbook of mathematical linguistics. Mouton.
- Hofland, K., & Johansson, S. (1982). Word frequencies in British and American English. Norwegian Computing Centre for the Humanities.
- Honoré, A. (1979). Some simple measures of richness of vocabulary. Association of Literary and Linguistic Computing Bulletin, 7(2), 172–177.
- Johnson, W. (1944). Studies in language behavior: I. A program of research. Psychological Monographs, 56(2), 1–15. https://doi.org/10.1037/h0093508
- Juilland, A., & Chang-Rodriguez, E. (1964). Frequency dictionary of Spanish words. Mouton.
- Kandel, L., & Moles, A. (1958). Application de l'indice de Flesch à la langue française [Applying the Flesch index to the French language]. The Journal of Educational Research, 21, 283–287.
- Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 232–263. https://doi.org/10.1075/ijcl.6.1.05kil
- Kilgarriff, A. (2009). Simple maths for keywords. In M. Mahlberg, V. González-Díaz, & C. Smith (Eds.), Proceedings of the Corpus Linguistics Conference 2009 (p. 171). University of Liverpool.
- Kilgarriff, A., & Tugwell, D. (2002). WASP-bench: An MT lexicographers' workstation supporting state-of-the-art lexical disambiguation. In Proceedings of the 8th Machine Translation Summit (pp. 187–190). European Association for Machine Translation.
- Kincaid, J. P., Fishburne, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count, and Flesch reading ease formula) for Navy enlisted personnel (Report No. RBR 8-75). Naval Air Station Memphis. https://apps.dtic.mil/sti/pdfs/ADA006655.pdf
- Kromer, V. (2003). A usage measure based on psychophysical relations. Journal of Quantitative Linguistics, 10(2), 177–186. https://doi.org/10.1076/jqul.10.2.177.16718
- Lexical Computing. (2015, July 8). Statistics used in Sketch Engine. Sketch Engine. https://www.sketchengine.eu/documentation/statistics-used-in-sketch-engine/
- Liau, T. L., Bassin, C. B., Martin, C. J., & Coleman, E. B. (1976). Modification of the Coleman readability formulas. Journal of Reading Behavior, 8(4), 381–386. https://journals.sagepub.com/doi/pdf/10.1080/10862967609547193
- Lijffijt, J., & Gries, S. T. (2012). Correction to Stefan Th. Gries' "Dispersions and adjusted frequencies in corpora". International Journal of Corpus Linguistics, 17(1), 147–149. https://doi.org/10.1075/ijcl.17.1.08lij
- Lorge, I. (1944). Predicting readability. Teachers College Record, 45, 404–419.
- Lorge, I. (1948). The Lorge and Flesch readability formulae: A correction. School and Society, 67, 141–142.
- Lucisano, P., & Piemontese, M. E. (1988). GULPEASE: A formula for the prediction of the difficulty of texts in Italian. Scuola e Città, 39(3), 110–124.
- Luong, A.-V., Nguyen, D., & Dinh, D. (2018). A new formula for Vietnamese text readability assessment. In 2018 10th International Conference on Knowledge and Systems Engineering (KSE) (pp. 198–202). IEEE. https://doi.org/10.1109/KSE.2018.8573379
- Lyne, A. A. (1985). Dispersion. In The vocabulary of French business correspondence: Word frequencies, collocations, and problems of lexicometric method (pp. 101–124). Slatkine/Champion.
- Maas, H.-D. (1972). Über den zusammenhang zwischen wortschatzumfang und länge eines textes. Zeitschrift für Literaturwissenschaft und Linguistik, 2(8), 73–96.
- Malvern, D., Richards, B., Chipere, N., & Durán, P. (2004). Lexical diversity and language development: Quantification and assessment. Palgrave Macmillan.
- McCarthy, P. M. (2005). An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD) [Doctoral dissertation, The University of Memphis]. ProQuest Dissertations and Theses Global.
- McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381–392. https://doi.org/10.3758/BRM.42.2.381
- McLaughlin, G. H. (1969). SMOG grading: A new readability formula. Journal of Reading, 12(8), 639–646.
- Muñoz Baquedano, M. (2006). Legibilidad y variabilidad de los textos. Boletín de Investigación Educacional, Pontificia Universidad Católica de Chile, 21(2), 13–26.
- Nirmaldasan. (2009, April 30). McAlpine EFLAW readability score. Readability Monitor. Retrieved November 15, 2022, from https://strainindex.wordpress.com/2009/04/30/mcalpine-eflaw-readability-score/
- Oakes, M. P. (1998). Statistics for corpus linguistics. Edinburgh University Press.
- Oborneva, I. V. (2006). Автоматизированная оценка сложности учебных текстов на основе статистических параметров [Doctoral dissertation, Institute for Strategy of Education Development of the Russian Academy of Education]. Freereferats.ru. https://static.freereferats.ru/_avtoreferats/01002881899.pdf?ver=3
- O'Hayre, J. (1966). Gobbledygook has gotta go. U.S. Government Printing Office. https://www.governmentattic.org/15docs/Gobbledygook_Has_Gotta_Go_1966.pdf
- Paquot, M., & Bestgen, Y. (2009). Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction. Language and Computers, 68, 247–269.
- Partiko, Z. V. (2001). Zagal'ne redaguvannja: Normativni osnovi. Afiša.
- Pedersen, T. (1996). Fishing for exactness. In T. Winn (Ed.), Proceedings of the Sixth Annual South-Central Regional SAS Users' Group Conference (pp. 188–200). The South-Central Regional SAS Users' Group.
- Pedersen, T. (1998). Dependent bigram identification. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (p. 1197). AAAI Press.
- Pisarek, W. (1969). Jak mierzyć zrozumiałość tekstu? Zeszyty Prasoznawcze, 4(42), 35–48.
- Pojanapunya, P., & Todd, R. W. (2016). Log-likelihood and odds ratio keyness statistics for different purposes of keyword analysis. Corpus Linguistics and Linguistic Theory, 15(1), 133–167. https://doi.org/10.1515/cllt-2015-0030
- Popescu, I.-I., Mačutek, J., & Altmann, G. (2008). Word frequency and arc length. Glottometrics, 17, 18–42.
- Popescu, I.-I. (2009). Word frequency studies. Mouton de Gruyter.
- Powers, R. D., Sumner, W. A., & Kearl, B. E. (1958). A recalculation of four adult readability formulas. Journal of Educational Psychology, 49(2), 99–105. https://doi.org/10.1037/h0043254
- Quasthoff, U., & Wolff, C. (2002). The Poisson collocation measure and its applications. In Proceedings of the Second International Workshop on Computational Approaches to Collocations. IEEE.
- Rosengren, I. (1971). The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée, 1, 103–127.
- Rychlý, P. (2008). A lexicographer-friendly association score. In P. Sojka & A. Horák (Eds.), Proceedings of the Second Workshop on Recent Advances in Slavonic Natural Language Processing. Masaryk University.
- Savický, P., & Hlaváčová, J. (2002). Measures of word commonness. Journal of Quantitative Linguistics, 9(3), 215–231. https://doi.org/10.1076/jqul.9.3.215.14124
- Simpson, E. H. (1949). Measurement of diversity. Nature, 163, 688. https://doi.org/10.1038/163688a0
- Smadja, F., McKeown, K. R., & Hatzivassiloglou, V. (1996). Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1), 1–38.
- Smith, E. A. (1961). Devereaux readability index. Journal of Educational Research, 54(8), 298–303. https://doi.org/10.1080/00220671.1961.10882728
- Smith, E. A., & Senter, R. J. (1967). Automated readability index. Aerospace Medical Research Laboratories. https://apps.dtic.mil/sti/pdfs/AD0667273.pdf
- Solomon, N. W. (2006). Qualitative analysis of media language [Unpublished doctoral dissertation]. Madurai Kamaraj University.
- Somers, H. H. (1966). Statistical methods in literary analysis. In J. Leeds (Ed.), The computer and literary style (pp. 128–140). Kent State University Press.
- Spache, G. (1953). A new readability formula for primary-grade reading materials. Elementary School Journal, 53(7), 410–413. https://doi.org/10.1086/458513
- Spache, G. (1974). Good reading for poor readers (Rev. 9th ed.). Garrard.
- Szigriszt Pazos, F. (1993). Sistemas predictivos de legibilidad del mensaje escrito: Fórmula de perspicuidad [Doctoral dissertation, Complutense University of Madrid]. Biblos-e Archivo. https://repositorio.uam.es/bitstream/handle/10486/2488/3907_barrio_cantalejo_ines_maria.pdf?sequence=1&isAllowed=y
- Thanopoulos, A., Fakotakis, N., & Kokkinakis, G. (2002). Comparative evaluation of collocation extraction metrics. In M. G. González & C. P. S. Araujo (Eds.), Proceedings of the Third International Conference on Language Resources and Evaluation (pp. 620–625). European Language Resources Association.
- Tränkle, U., & Bailer, H. (1984). Kreuzvalidierung und Neuberechnung von Lesbarkeitsformeln für die deutsche Sprache [Cross-validation and recalculation of the readability formulas for the German language]. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 16(3), 231–244.
- Tuldava, J. (1975). Ob izmerenii trudnosti tekstov [On measuring the complexity of the text]. Uchenye zapiski Tartuskogo universiteta. Trudy po metodike prepodavaniya inostrannykh yazykov, 345, 102–120.
- Ure, J. (1971). Lexical density and register differentiation. In G. E. Perren & J. L. M. Trim (Eds.), Applications of linguistics (pp. 443–452). Cambridge University Press.
- Wheeler, L. R., & Smith, E. H. (1954). A practical readability formula for the classroom teacher in the primary grades. Elementary English, 31(7), 397–399.
- Williams, C. B. (1970). Style and vocabulary: Numerical studies. Griffin.
- Wilson, A. (2013). Embracing Bayes factors for key item analysis in corpus linguistics. In M. Bieswanger & A. Koll-Stobbe (Eds.), New approaches to the study of linguistic variability (pp. 3–11). Peter Lang.
- Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.
- Zhang, H., Huang, C., & Yu, S. (2004). Distributional consistency: As a general method for defining a core lexicon. In M. T. Lino, M. F. Xavier, F. Ferreira, R. Costa, & R. Silva (Eds.), Proceedings of Fourth International Conference on Language Resources and Evaluation (pp. 1119–1122). European Language Resources Association.