Releases: pemistahl/lingua
Lingua 1.2.2
Lingua 1.2.1
Bug Fixes
- An exception was thrown when trying to detect the language of unigrams and bigrams in low accuracy mode which operates only with trigrams and larger strings. This has been fixed.
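The failure mode described above can be illustrated with a small, self-contained sketch. It is not Lingua's actual code: the class `NgramSketch`, the constant and both methods are hypothetical names chosen for illustration. It only shows how n-gram extraction can return an empty result for input shorter than the n-gram length instead of throwing, and how a trigram-only mode might check the input length first.

```java
import java.util.ArrayList;
import java.util.List;

class NgramSketch {
    // Hypothetical constant: the shortest n-gram length used in a trigram-only mode.
    static final int MIN_NGRAM_LENGTH_LOW_ACCURACY = 3;

    /** Returns all substrings of exactly n characters, or an empty list if the text is shorter than n. */
    static List<String> ngrams(String text, int n) {
        List<String> result = new ArrayList<>();
        // The loop body never runs when text.length() < n,
        // so short input yields an empty list rather than an exception.
        for (int i = 0; i + n <= text.length(); i++) {
            result.add(text.substring(i, i + n));
        }
        return result;
    }

    /** Hypothetical guard: in trigram-only mode, input shorter than three characters has no usable n-grams. */
    static boolean canClassify(String text, boolean lowAccuracyMode) {
        int minLength = lowAccuracyMode ? MIN_NGRAM_LENGTH_LOW_ACCURACY : 1;
        return text.length() >= minLength;
    }
}
```

With this sketch, `NgramSketch.ngrams("ab", 3)` produces an empty list, so a caller can decide how to handle unigram and bigram input without catching an exception.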
Lingua 1.2.0
Features
- The library can now be used as a Java 9 module. Thanks to @Marcono1234 for helping with the implementation. (#120, #138)
- The new method LanguageDetectorBuilder.withLowAccuracyMode() has been introduced. By activating it, detection accuracy for short text is reduced in favor of a smaller memory footprint and faster detection performance. (#136)
Improvements
- The memory footprint has been reduced significantly by applying several internal optimizations. Thanks to @Marcono1234, @fvasco and @sigpwned for their help. (#101, #127)
- Several language model files have become obsolete and could be deleted without decreasing detection accuracy. This results in a smaller memory footprint and a 36% smaller jar file.
Bug Fixes
- A bug in the rule engine has been fixed that caused incorrect language detection for certain texts. Thanks to @bdecarne who has found it.
Other changes
- Due to a refactoring of how the internal thread pool works, the method LanguageDetector.destroy() has been deprecated in favor of the newly introduced method LanguageDetector.unloadLanguageModels().
Lingua 1.1.1
Improvements
- The new method LanguageDetector.destroy() has been introduced that frees internal resources to prevent memory leaks within application server deployments. (#110, #116)
- Language model loading performance has been improved by creating a manually optimized internal thread pool. This replaces the coroutines used in the previous release. (#116)
Bug Fixes
Lingua 1.1.0
Languages
- There is now support for the Maori language which was contributed to the Rust implementation of Lingua. (#93)
Features
- Language models are now loaded asynchronously and in parallel using Kotlin coroutines, making this step more performant. (#84)
- Language models can now be loaded either lazily (default) or eagerly. (#79)
- Instead of loading multiple copies of the language models into memory for each separate instance of LanguageDetector, multiple instances now share the same language models and access them asynchronously. (#91)
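The combination of lazy or eager loading with models shared across instances can be sketched as follows. This is a minimal illustration under stated assumptions, not Lingua's implementation: the class `SharedModelCache`, its methods, and the string stand-in for a language model are all hypothetical. A single static, thread-safe map holds each model once, lazy access loads a model on first use, and eager loading simply touches every model up front.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class SharedModelCache {
    // One cache per JVM, shared by every (hypothetical) detector instance.
    private static final Map<String, String> MODELS = new ConcurrentHashMap<>();

    // Counts how often a model was actually loaded; present only to make the behavior observable.
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    // Stand-in for reading a model file; a real model would not be a String.
    private static String loadModel(String language) {
        LOAD_COUNT.incrementAndGet();
        return "model-for-" + language;
    }

    /** Lazy loading: the model is loaded on first access and reused afterwards. */
    static String modelFor(String language) {
        // computeIfAbsent runs the loader at most once per key, even under concurrent access.
        return MODELS.computeIfAbsent(language, SharedModelCache::loadModel);
    }

    /** Eager loading: all requested models are loaded before the first detection. */
    static void preload(String... languages) {
        for (String language : languages) {
            modelFor(language);
        }
    }
}
```

In this sketch, calling `preload("en", "de")` followed by `modelFor("en")` loads two models in total, because the second access is served from the shared map.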
Improvements
- Language detection for sentences with more than 120 characters is now faster because only trigrams are evaluated, which is enough to achieve high detection accuracy.
- Textual input that includes logograms from Chinese, Japanese or Korean is now split at each logogram and not only at whitespace. This makes language detection more reliable for sentences that include multi-language content. (#85)
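Splitting at each logogram rather than only at whitespace can be sketched with the JDK's `Character.UnicodeScript`. This is an illustrative sketch, not Lingua's code: the class `LogogramSplitter` is hypothetical, and treating Han, Hiragana, Katakana and Hangul characters as single-character tokens is an assumption made for the example.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

class LogogramSplitter {
    // Assumption: characters of these four scripts each become a token of their own.
    static boolean isLogogram(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HAN
            || script == UnicodeScript.HIRAGANA
            || script == UnicodeScript.KATAKANA
            || script == UnicodeScript.HANGUL;
    }

    /** Splits at whitespace and additionally emits every logogram as a separate token. */
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                flush(current, tokens);
            } else if (isLogogram(cp)) {
                // End the word collected so far, then emit the logogram on its own.
                flush(current, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else {
                current.appendCodePoint(cp);
            }
        }
        flush(current, tokens);
        return tokens;
    }

    // Adds the pending word to the token list, if there is one, and resets the buffer.
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }
}
```

For mixed input such as an English word followed directly by two Han characters, the sketch yields the English word and then each Han character as its own token, so each part can be classified separately.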
Bug Fixes
- For an odd number of words as input, the method LanguageDetector.computeLanguageConfidenceValues computed wrong values under certain circumstances. (#87)
- When Lingua was used in projects with an explicitly set Kotlin version which differed from Lingua's implicitly set version in the Gradle script, several errors occurred during runtime. By explicitly setting Lingua's Kotlin version, these errors are now hopefully gone. (#88, #89)
- Errors in the rule engine for the Latvian language have been resolved. (#92)
Lingua 1.0.3
Bug Fixes
- When two languages had exactly the same confidence values, one of them was erroneously removed from the result map. Thanks to @mmedek for reporting this bug. (#72)
- There was still a problem with the classification of texts consisting of certain alphabets. Thanks to @nicolabertoldi for reporting this bug. (#76)
- The language detection for Spanish did not take the rarely used accented characters á, é, í, ó, ú and ü into account. Thanks to @joeporter for reporting this bug. (#73)
- A bug in the rule engine led to weak detection accuracy for Macedonian and Serbian. This has been fixed.
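Losing an entry when two languages share the same confidence value is a common pitfall of sorted maps whose comparator looks only at the values. The sketch below shows that general pitfall and one way around it. It is an assumption that the reported bug had this shape, and the class `ConfidenceSortSketch` and its methods are hypothetical, not part of Lingua.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

class ConfidenceSortSketch {
    /** Orders keys by descending score only: keys with equal scores compare as equal, so one is dropped. */
    static TreeMap<String, Double> sortedLossy(Map<String, Double> scores) {
        Comparator<String> byScore =
            Comparator.comparing((String k) -> scores.get(k)).reversed();
        TreeMap<String, Double> sorted = new TreeMap<>(byScore);
        sorted.putAll(scores);
        return sorted;
    }

    /** Orders by descending score and breaks ties by name, so every key stays distinct. */
    static TreeMap<String, Double> sortedKeepingTies(Map<String, Double> scores) {
        Comparator<String> byScoreThenName =
            Comparator.comparing((String k) -> scores.get(k)).reversed()
                .thenComparing(Comparator.<String>naturalOrder());
        TreeMap<String, Double> sorted = new TreeMap<>(byScoreThenName);
        sorted.putAll(scores);
        return sorted;
    }
}
```

A `TreeMap` treats two keys as the same entry whenever its comparator returns 0, which is why the tie-breaker on the key is what keeps both languages in the result.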
Other Changes
- The Kotlin compiler and runtime have been updated to version 1.4. This includes the current stable release 1.0.0 of the kotlinx-serialization framework.
- The accuracy report files have been moved to their own Gradle source set. This allows for separate compilation of unit tests and accuracy report tests, leading to more flexible and slightly faster compilation.
Lingua 1.0.2
Bug Fixes
- The language mapping for character ë was incorrect which has been fixed. Thanks to @sandernugterenedia for reporting this bug. (#66)
- The implementation of LanguageDetector made use of functionality that was introduced in Java 8 which made the library unusable for Java 6 and 7. Thanks to @levant916 for reporting this bug. (#69)
- The Gradle shadow plugin has been added so that ./gradlew jarWithDependencies produces a jar file whose dependencies do not conflict anymore with the same dependencies of different versions in the same project. (#67)
Lingua 1.0.1
Lingua 1.0.0
Languages
- added 9 new languages, this time with a focus on Africa: Ganda, Shona, Sotho, Swahili, Tsonga, Tswana, Xhosa, Yoruba, Zulu
- removed language Norwegian in favor of Bokmal and Nynorsk (#59)
Features
- LanguageDetector can now provide confidence scores for each evaluated language. (#11)
- The public API for creating language model (LanguageModelFilesWriter) and test data files (TestDataFilesWriter) has been stabilized. (#37)
- New convenience methods have been added to LanguageDetectorBuilder in order to build LanguageDetector from languages written in a certain script. (#61)
Improvements
- The rule-based detection algorithm has been made less sensitive so that single words in a different language cannot mislead the algorithm so easily.
- The fastutil library has been added again to reduce memory consumption. (#58)
- The language model-based algorithm has been optimized so that language detection performs approximately 25% faster now. (#58)
- Support for the Kotlin linter ktlint has been added to help with a consistent coding style. (#47)
- Third-party dependencies have been updated to their latest versions. (#36)
Bug Fixes
- Incorrect regex character classes caused the library to not work properly on Android. (#32)
Test Coverage
- Test coverage has been extended from 59% to 72%.
Documentation
- The README contains a new section describing how users can add their own languages to Lingua.
Other changes
There is a breaking change in this release:
- Methods with the prefix fromAllBuiltIn... have been renamed to fromAll... to make them more succinct and clear. (#61)