Releases: mideind/Tokenizer

Version 1.3.0

21 May 11:25
  • Added TOK.DOMAIN and TOK.HASHTAG token types
  • Improved handling of capitalized month name Ágúst, which is now recognized as such when it follows an ordinal number
  • Improved recognition of telephone numbers
  • Added abbreviations

Version 1.2.3

03 May 11:41

Added abbreviations; updated GitHub URLs to point to mideind instead of vthorsteinsson

Version 1.2.2

26 Apr 13:17

Added support for composites with more than two parts, e.g. „dómsmála-, ferðamála-, iðnaðar- og nýsköpunarráðherra“; added support for the ± sign; added several abbreviations
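
To illustrate what a multi-part composite looks like in running text, here is a minimal sketch that finds such spans with a regular expression. It is not the Tokenizer's own implementation; the pattern, the name `COMPOSITE` and the set of conjunctions (`og`, `eða`) are assumptions made for this example.

```python
import re

# Assumed shape: zero or more "prefix-, " parts, one final "prefix-",
# a conjunction, and the word that carries the shared ending.
COMPOSITE = re.compile(r"(?:\w+-, )*\w+- (?:og|eða) \w+")

def find_composites(text: str) -> list[str]:
    """Return every composite-looking span found in text."""
    return COMPOSITE.findall(text)

sample = "Hún er dómsmála-, ferðamála-, iðnaðar- og nýsköpunarráðherra í dag"
print(find_composites(sample))
```

The same pattern also covers the two-part case (`fjármála- og efnahagsráðherra`), since the repeated group may match zero times.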

Version 1.2.1

18 Feb 19:19

Fixed bug where the name 'Ágúst' was recognized as a month name. Unicode nonbreaking and invisible space characters are now removed before tokenization.
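
A pre-tokenization cleanup of this kind could look roughly like the sketch below. The release note does not list the affected characters, nor whether a non-breaking space is dropped or turned into an ordinary space; the character sets and the function name `normalize_spaces` are assumptions, and the sketch maps non-breaking spaces to a plain space so that adjacent words are not fused.

```python
# Assumption: these count as non-breaking spaces and become a plain space.
NONBREAKING = "\u00a0\u202f"
# Assumption: these count as invisible characters and are deleted.
INVISIBLE = "\u200b\u200c\u200d\u2060\ufeff"

# Translation table for str.translate: code point -> replacement (None deletes).
_TABLE = {ord(c): " " for c in NONBREAKING}
_TABLE.update({ord(c): None for c in INVISIBLE})

def normalize_spaces(text: str) -> str:
    """Replace non-breaking spaces and strip invisible characters."""
    return text.translate(_TABLE)

print(normalize_spaces("10\u00a0km"))       # 10 km
print(normalize_spaces("Ág\u200búst"))      # Ágúst
```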

Version 1.2.0

07 Feb 16:34

Added support for Unicode fraction characters; enhanced handling of degrees (°, °C, °F); fixed bug in the cubic meter measurement unit; added more abbreviations
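
As an illustration of what reading Unicode fraction characters involves, the following sketch converts strings such as `½` or `1¾` to numbers using Python's standard `unicodedata` module. It is not taken from the Tokenizer source; the helper names `fraction_value` and `parse_mixed` are hypothetical.

```python
import unicodedata

def fraction_value(ch: str):
    """Return the value of a single Unicode fraction character, else None."""
    # Fraction characters carry a "<fraction>" compatibility decomposition,
    # which separates them from other numeric characters such as superscripts.
    if len(ch) == 1 and unicodedata.decomposition(ch).startswith("<fraction>"):
        return unicodedata.numeric(ch)
    return None

def parse_mixed(text: str) -> float:
    """Parse an optional whole number followed by an optional fraction character."""
    whole, last = text[:-1], text[-1:]
    frac = fraction_value(last)
    if frac is None:
        # No trailing fraction character: treat the text as an ordinary number.
        return float(text)
    return (int(whole) if whole else 0) + frac

print(parse_mixed("1¾"))  # 1.75
print(parse_mixed("½"))   # 0.5
```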

Version 1.1.2

10 Jan 11:37

Fixed bug in the liter measurement unit (l and ltr); its value was 1000 times too large

Version 1.1.1

04 Jan 18:23

Added the mark_paragraphs() function

Version 1.1.0

02 Jan 14:38

All abbreviations in Abbrev.conf are now returned with their meaning in a tuple in token.val; handling of 'mbl.is' fixed

Version 1.0.9

29 Dec 13:08

Added MAST abbreviation; harmonized copyright headers

Version 1.0.7

25 Sep 12:22

Added NUMWLETTER token type, for numbers with a single-letter suffix (12a, 80D). This will mainly be useful for parsing addresses. Note that if a conflict occurs between NUMWLETTER and MEASUREMENT (such as 16A, meaning 16 ampere), the latter takes precedence.
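
The precedence rule can be pictured with a small sketch: a number followed by one letter is first checked against a table of measurement units and only falls back to a number-with-letter result when no unit matches. This is an illustration, not the Tokenizer's code; the `UNITS` table, the `classify` function and the returned tuples are hypothetical.

```python
import re

# Hypothetical, deliberately tiny unit table; the real set of units is larger.
UNITS = {"A": "ampere", "m": "metre", "g": "gram"}

# One or more digits followed by exactly one letter.
_NUM_LETTER = re.compile(r"(\d+)([^\W\d_])")

def classify(txt: str):
    """Classify txt as MEASUREMENT, NUMWLETTER or OTHER."""
    m = _NUM_LETTER.fullmatch(txt)
    if m is None:
        return ("OTHER", txt)
    number, letter = int(m.group(1)), m.group(2)
    if letter in UNITS:
        # A known unit wins over the generic number-with-letter reading.
        return ("MEASUREMENT", (number, UNITS[letter]))
    return ("NUMWLETTER", (number, letter))

print(classify("12a"))  # ('NUMWLETTER', (12, 'a'))
print(classify("80D"))  # ('NUMWLETTER', (80, 'D'))
print(classify("16A"))  # ('MEASUREMENT', (16, 'ampere'))
```

Checking the unit table first keeps the fallback simple: anything that is not a known unit is treated as an address-style suffix.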