This repository has been archived by the owner on Jan 22, 2021. It is now read-only.

Support expanding of compound words into separate tokens #4

Open
tituomin opened this issue Apr 12, 2019 · 2 comments

Comments

tituomin commented Apr 12, 2019

I have patched our version of the plugin (based on v0.3.0) and added a configuration parameter, expandCompounds, that optionally expands compound words (yhdyssanat) into separate tokens.

City-of-Helsinki@9a6bd81

I would like to get this feature into master and upstream, if you find it desirable. I can port it to master myself, but currently we are using 0.3.0.

We have found that extracting the parts of compound words is highly desirable in the index analysis stage, for several reasons:

  • users often misspell compound words by writing the parts as separate words
  • parts of a compound word (for example "terveys" in "terveysasema") are often meaningful and relevant on their own, apart from the compound
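
To make the idea concrete, here is a minimal sketch in Python of what expanding compounds at index time could look like. It is not the code from the patch: the toy lexicon COMPOUND_PARTS, the function expand_compounds and its expand flag are hypothetical names used only for illustration, and keeping the original compound at the same position as its parts is an assumption about the desired behaviour.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon mapping a compound to its parts. A real analyzer
# would get the split from morphological analysis, not from a lookup table.
COMPOUND_PARTS: Dict[str, List[str]] = {
    "terveysasema": ["terveys", "asema"],
}


def expand_compounds(tokens: List[str], expand: bool = True) -> List[Tuple[str, int]]:
    """Return (term, position) pairs for indexing.

    Assumption: the original token is kept, and its parts are emitted at the
    same position, so phrase positions of the other tokens do not shift.
    """
    out: List[Tuple[str, int]] = []
    for position, token in enumerate(tokens):
        out.append((token, position))
        if expand:
            # Unknown tokens have no parts and pass through unchanged.
            for part in COMPOUND_PARTS.get(token, []):
                out.append((part, position))
    return out


print(expand_compounds(["terveysasema", "helsinki"]))
# prints [('terveysasema', 0), ('terveys', 0), ('asema', 0), ('helsinki', 1)]
```

With the parts indexed next to the compound, a query for "terveys" alone, or for "terveys asema" written as two words, can match a document that only contains "terveysasema".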

komu (Member) commented Apr 12, 2019

Sounds great. If you open a PR, I look forward to merging it.

tituomin changed the title from "Support expanding of compound words into separate token" to "Support expanding of compound words into separate tokens" on Apr 18, 2019

tituomin (Author) commented

@komu here is my attempt at a PR.
