Word length #10

Open
shushumumu opened this issue Dec 10, 2020 · 1 comment

Comments

@shushumumu

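I noticed that the maximum word length in your code is set to 4. I think this is somewhat problematic. For Classical Chinese, a maximum word length of 2 or 3 would be more appropriate. Some appellations are trisyllabic, such as “秦穆公” (Duke Mu of Qin). Otherwise, words are mostly monosyllabic; disyllabic words account for a certain share, but not a large one. With the maximum word length set to 4, the four-syllable segments that come out are not actually words.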

@jiaeyan
Owner

jiaeyan commented Dec 14, 2020

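Hello! I fully understand your concern. This project uses four characters as the upper limit for word segmentation for the following reasons:

  1. In Classical Chinese, single-character words are indeed the most likely, and four-character words are far less likely, but they do exist. For example, “内圣外王” ("inner sage, outer king") in the Zhuangzi is best treated as a single four-character word.
  2. There are also many personal names. Because pre-Qin surnames had not yet been standardized, there are many two-character compound surnames, and official titles, place names, and the like were used as name prefixes. Combined with a two-character given name, these produce four-character personal names.
  3. Moreover, the training corpus for this project does not cover only the pre-Qin period; it extends all the way to the Ming and Qing dynasties. The probability of four-character words therefore increases in later periods, for example in proper nouns for places and institutions.

Hope this helps!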
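To make the trade-off above concrete, here is a minimal sketch of how a maximum word length cap limits what a segmenter can output. It uses greedy forward maximum matching against a toy lexicon, which is an assumption chosen for illustration and not this project's actual algorithm. The names `forward_max_match` and `LEXICON` are hypothetical.

```python
def forward_max_match(text, lexicon, max_len):
    """Greedy left-to-right segmentation, trying the longest candidate first."""
    words = []
    i = 0
    while i < len(text):
        # Try multi-character candidates from max_len characters down to 2.
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in lexicon:
                words.append(text[i:i + size])
                i += size
                break
        else:
            # No multi-character match: emit a single-character word.
            words.append(text[i])
            i += 1
    return words


# Hypothetical toy lexicon holding the two examples from this thread.
LEXICON = {"内圣外王", "秦穆公"}

print(forward_max_match("内圣外王之道", LEXICON, max_len=4))
print(forward_max_match("内圣外王之道", LEXICON, max_len=3))
print(forward_max_match("秦穆公伐郑", LEXICON, max_len=3))
```

With a cap of 4, “内圣外王” can be kept as one word; with a cap of 3 it can never be produced, whatever the lexicon contains, while “秦穆公” is still recoverable. A cap only defines which candidates are possible; whether a given four-character candidate is a good word depends on the model's scoring.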
