How do I extract only the longest match among words sharing a prefix? #29

Open
yoopaan opened this issue Oct 14, 2019 · 1 comment

yoopaan commented Oct 14, 2019

Case:
Dictionary: 二, 二氧化碳, 高, 高温, 高温催化
Sentence: 二氧化碳高温催化剂报告

Expected extraction result: 二氧化碳, 高温催化
Current extraction result: [[0:1]=二, [0:4]=二氧化碳, [4:6]=高温, [4:8]=高温催化]

    @Test
    public void test() {
        // Collect test data set
        TreeMap<String, String> map = new TreeMap<String, String>();
        String[] keyArray = new String[]{
                "二", "二氧化碳", "高温", "高温", "高温催化"
        };

        for (String key : keyArray) {
            map.put(key, key);
        }

        // Build an AhoCorasickDoubleArrayTrie
        AhoCorasickDoubleArrayTrie<String> acdat = new AhoCorasickDoubleArrayTrie<>();
        acdat.build(map);

        // Test it
        final String text = "二氧化碳高温催化剂报告";
        List<AhoCorasickDoubleArrayTrie.Hit<String>> wordList = acdat.parseText(text);
        System.out.println(wordList);
    }

Output:

[[0:1]=二, [0:4]=二氧化碳, [4:6]=高温, [4:8]=高温催化]

hankcs (Owner) commented Oct 14, 2019

Use the AhoCorasickDoubleArrayTrieSegment provided by HanLP.
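
For readers who want to see what "longest match per start position" means in practice, here is a minimal sketch of one way to post-filter the hits from the question. It is not HanLP's implementation and does not call the trie library: `LongestMatchFilter`, its nested `Hit` class, and `longestMatches` are hypothetical names introduced for illustration, and the hits are hard-coded from the output shown above, assuming `[begin:end]` means a half-open character range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LongestMatchFilter {
    /** Hypothetical stand-in for a trie hit: text[begin, end) matched value. */
    static final class Hit {
        final int begin;
        final int end;
        final String value;

        Hit(int begin, int end, String value) {
            this.begin = begin;
            this.end = end;
            this.value = value;
        }
    }

    /** Keeps the longest hit per start offset, then skips offsets already covered by an emitted word. */
    static List<String> longestMatches(int textLength, List<Hit> hits) {
        // Pass 1: remember only the longest hit that starts at each offset.
        Hit[] longestAt = new Hit[textLength];
        for (Hit hit : hits) {
            Hit current = longestAt[hit.begin];
            if (current == null || hit.end > current.end) {
                longestAt[hit.begin] = hit;
            }
        }
        // Pass 2: walk left to right, jumping past each emitted word.
        List<String> words = new ArrayList<>();
        int offset = 0;
        while (offset < textLength) {
            Hit hit = longestAt[offset];
            if (hit == null) {
                offset++;          // no dictionary word starts here
            } else {
                words.add(hit.value);
                offset = hit.end;  // continue after the matched word
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "二氧化碳高温催化剂报告";
        // Hits copied from the output reported in the question.
        List<Hit> hits = Arrays.asList(
                new Hit(0, 1, "二"),
                new Hit(0, 4, "二氧化碳"),
                new Hit(4, 6, "高温"),
                new Hit(4, 8, "高温催化"));
        System.out.println(longestMatches(text.length(), hits)); // prints [二氧化碳, 高温催化]
    }
}
```

The second pass is what drops shorter words that start inside an already emitted word; if overlapping results are acceptable, the first pass alone may be enough.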
