Releases: mideind/Tokenizer
Version 1.3.0
- Added TOK.DOMAIN and TOK.HASHTAG token types
- Improved handling of capitalized month name Ágúst, which is now recognized as such when it follows an ordinal number
- Improved recognition of telephone numbers
- Added abbreviations
Version 1.2.3
Added abbreviations; updated GitHub URLs to point to mideind instead of vthorsteinsson
Version 1.2.2
Added support for composites with more than two parts, e.g. „dómsmála-, ferðamála-, iðnaðar- og nýsköpunarráðherra“; added support for the ± sign; added several abbreviations
Version 1.2.1
Fixed bug where the name 'Ágúst' was recognized as a month name. Unicode nonbreaking and invisible space characters are now removed before tokenization.
Version 1.2.0
Added support for Unicode fraction characters; enhanced handling of degrees (°, °C, °F); fixed bug in cubic meter measurement unit; more abbreviations
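As an illustration of what reading a Unicode fraction character involves, here is a minimal sketch that uses only Python's standard unicodedata module. It is not the tokenizer's own code: the function name parse_number_with_fraction is hypothetical, and the sketch assumes the input is a run of digits optionally followed by a single vulgar-fraction character.

```python
import unicodedata


def parse_number_with_fraction(text):
    """Return the numeric value of digits optionally followed by
    one vulgar-fraction character (hypothetical helper, for illustration)."""
    if not text:
        raise ValueError("empty input")
    last = text[-1]
    # Characters such as U+00BD are named "VULGAR FRACTION ONE HALF"
    if unicodedata.name(last, "").startswith("VULGAR FRACTION"):
        whole = int(text[:-1]) if text[:-1] else 0
        # unicodedata.numeric() gives the fraction's value as a float
        return whole + unicodedata.numeric(last)
    # No fraction character: plain integer text
    return float(int(text))
```

For example, parse_number_with_fraction("2½") combines the whole part 2 with the value of ½.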
Version 1.1.2
Fixed bug in liter measurement unit (l and ltr); was 1000 times too large
Version 1.1.1
Added the mark_paragraphs() function
Version 1.1.0
All abbreviations in Abbrev.conf are now returned with their meaning in a tuple in token.val; handling of 'mbl.is' fixed
Version 1.0.9
Added MAST abbreviation; harmonized copyright headers
Version 1.0.7
Added NUMWLETTER token type, for numbers with a single-letter suffix (12a, 80D). This will mainly be useful for parsing addresses. Note that if a conflict occurs between NUMWLETTER and MEASUREMENT (such as 16A, meaning 16 ampere), the latter takes precedence.
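The precedence rule can be sketched as follows. This is a standalone illustration, not the tokenizer's implementation: the classify function and the UNIT_LETTERS table are hypothetical, and the real library recognizes many more units than the few single-letter ones assumed here.

```python
import re

# Hypothetical single-letter unit table, for illustration only
UNIT_LETTERS = {"A": "ampere", "V": "volt", "W": "watt", "g": "gram"}

# One or more digits followed by exactly one ASCII letter
NUM_LETTER_RE = re.compile(r"(\d+)([A-Za-z])")


def classify(text):
    """Classify a string as MEASUREMENT, NUMWLETTER or OTHER."""
    m = NUM_LETTER_RE.fullmatch(text)
    if m is None:
        return "OTHER"
    # A known unit letter wins over the generic number-with-letter reading
    if m.group(2) in UNIT_LETTERS:
        return "MEASUREMENT"
    return "NUMWLETTER"
```

Under these assumptions, 12a and 80D come out as NUMWLETTER, while 16A is classified as MEASUREMENT because A is in the unit table.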