Skip to content

astrunc: A library that performs sentence break processing of utf-8 encoded text segments in natural language processing.(在自然语言处理中对utf-8编码的文本段进行断句处理的库。)

License

Notifications You must be signed in to change notification settings

Joke-Shi/astrunc

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

astrunc (Used C++11)

English-Description

astrunc: A library that performs sentence break processing of utf-8 encoded text segments in natural language processing.

Make-sample

git clone https://github.com/Joke-Shi/astrunc.git
cd astrunc
make

cd sample
ls

API

/**
 * @Brief: Split utf-8 encoded text segment, and then output sentence string vector.
 *
 * @Param: __vs,     Output split text segment result, which sentence string vector.
 * @Param: __seg,    Input utf-8 encoded text segment.
 * @Param: __lang,   Specify in the astrunc.h file.
 * @Param: __nchars, Max a sentence chars size.
 *
 * @Return: Ok->0, Other->-1.
 **/
int astrunc::access::split( std::vector< std::string > &__vs, const std::string &__seg, astrunc::access::lang_t __lang, int __nchars);

SAMPLE

#include "astrunc.h"


int main( int argc, const char **argv)
{         
  std::string s_en = "He said: Chang suggested organizing a canine patrol squad at the "
                     "museum and, in 1987, dogs came back. At 1 year old, each dog is "
                     "given a tailored training course. ";
                     
  std::vector< std::string> vec_s;
  int rc = astrunc::access::split( vec_s, s_en, astrunc::access::EN, 32);
  if ( 0 == rc) {
    for ( auto const &__s : vec_s) {
      std::cout << __s << std::endl;
      std::cout << "------------------------------------------------------------------\n";
    }
  } else { std::cerr << "[Error]: Split error\n"; }


  return rc;
}

中文说明

astrunc:用在自然语言处理中对utf-8编码的文本段进行断句处理的库。

Make-实例

git clone https://github.com/Joke-Shi/astrunc.git
cd astrunc
make

cd sample
ls

使用接口

/**
 * @Brief: Split utf-8 encoded text segment, and then output sentence string vector.
 *
 * @Param: __vs,     分割utf-8编码文本段过后的句子列表
 * @Param: __seg,    输入的utf-8编码的文本段;
 * @Param: __lang,   语言简称,具体在astrunc.h文件中描述;
 * @Param: __nchars, 每个句子限定的字符个数。
 *
 * @Return: Ok->0, Other->-1.
 **/
int astrunc::access::split( std::vector< std::string > &__vs, const std::string &__seg, astrunc::access::lang_t __lang, int __nchars);

使用示例

#include "astrunc.h"


int main( int argc, const char **argv)
{
  std::string s_zh = "Prometheus,它的价值在于可靠性,甚至在很恶劣的环境下,你都可以随时访问它和查"
                     "看系统服务各种指标的统计信息。 如果你对统计数据需要100%的精确,它并不适用,例"
                     "如:它不适用于实时计费系统。";
                     
  std::vector< std::string> vec_s;
  int rc = astrunc::access::split( vec_s, s_zh, astrunc::access::ZH, 32);
  if ( 0 == rc) {
    for ( auto const &__s : vec_s) {
      std::cout << __s << std::endl;
      std::cout << "------------------------------------------------------------------\n";
    }
  } else { std::cerr << "[Error]: Split error\n"; }


  return rc;
}

About

astrunc: A library that performs sentence break processing of utf-8 encoded text segments in natural language processing.(在自然语言处理中对utf-8编码的文本段进行断句处理的库。)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published