How can I extract only the longest-prefix words? #29

yoopaan · 2019-10-14T05:53:51Z

Case:
Dictionary: 二, 二氧化碳, 高, 高温, 高温催化
Sentence: 二氧化碳高温催化剂报告

Expected extraction result: 二氧化碳, 高温催化
Current extraction result: [[0:1]=二, [0:4]=二氧化碳, [4:6]=高温, [4:8]=高温催化]

    @Test
    public void test() {
        // Collect test data set
        TreeMap<String, String> map = new TreeMap<String, String>();
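        // Note: "高温" is listed twice here and "高" from the dictionary above is absent;
        // the TreeMap below collapses the duplicate key.
        String[] keyArray = new String[]{
                "二", "二氧化碳", "高温", "高温", "高温催化"
        };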

        for (String key : keyArray) {
            map.put(key, key);
        }

        // Build an AhoCorasickDoubleArrayTrie
        AhoCorasickDoubleArrayTrie<String> acdat = new AhoCorasickDoubleArrayTrie<>();
        acdat.build(map);

        // Test it
        final String text = "二氧化碳高温催化剂报告";
        List<AhoCorasickDoubleArrayTrie.Hit<String>> wordList = acdat.parseText(text);
        System.out.println(wordList);
    }

Output:

[[0:1]=二, [0:4]=二氧化碳, [4:6]=高温, [4:8]=高温催化]

hankcs · 2019-10-14T05:57:50Z

Use the AhoCorasickDoubleArrayTrieSegment provided by HanLP.
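If adding HanLP as a dependency is not an option, one possible workaround is to post-filter the hit list returned by `parseText`, keeping only the longest match at each start position and skipping anything that overlaps an already accepted match. The following is a minimal sketch of that idea, not the method HanLP uses internally. `Span` and `keepLongest` are hypothetical names introduced here for illustration; `Span` stands in for `AhoCorasickDoubleArrayTrie.Hit` and assumes the same `[begin:end)` convention seen in the output above.

```java
import java.util.ArrayList;
import java.util.List;

class LongestMatchFilter {
    /** Hypothetical stand-in for AhoCorasickDoubleArrayTrie.Hit: begin inclusive, end exclusive. */
    static final class Span {
        final int begin;
        final int end;
        final String value;

        Span(int begin, int end, String value) {
            this.begin = begin;
            this.end = end;
            this.value = value;
        }

        @Override
        public String toString() {
            return "[" + begin + ":" + end + "]=" + value;
        }
    }

    /** Keeps the longest hit at each start position and drops hits that overlap an accepted one. */
    static List<Span> keepLongest(List<Span> hits) {
        // Work on a copy so the caller's list is left untouched.
        List<Span> sorted = new ArrayList<>(hits);
        // Order by start ascending, then by end descending, so the longest hit comes first per start.
        sorted.sort((a, b) -> a.begin != b.begin
                ? Integer.compare(a.begin, b.begin)
                : Integer.compare(b.end, a.end));

        List<Span> result = new ArrayList<>();
        int covered = 0; // first text offset not yet covered by an accepted hit
        for (Span s : sorted) {
            if (s.begin >= covered) {
                result.add(s);
                covered = s.end;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The four hits reported in the question.
        List<Span> hits = new ArrayList<>();
        hits.add(new Span(0, 1, "二"));
        hits.add(new Span(0, 4, "二氧化碳"));
        hits.add(new Span(4, 6, "高温"));
        hits.add(new Span(4, 8, "高温催化"));

        System.out.println(keepLongest(hits)); // prints [[0:4]=二氧化碳, [4:8]=高温催化]
    }
}
```

With the real library, the same filter could be applied by copying each `Hit`'s `begin`, `end`, and `value` fields into the sketch's `Span`. Because this is a greedy left-to-right pass, it prefers the earliest longest match, which gives the expected result for the case in the question but may not suit every dictionary.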

hankcs added the question label Oct 14, 2019
