HanLP Chinese Language Processing Package

云烟 • September 24, 2023, 3:00 PM • Uncategorized

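1. Introduction

Search and many other applications need to split text into words before they can process it. For Chinese text, one option is HanLP, an open-source Chinese language processing package that supports word segmentation and other language processing tasks.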
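To make the idea of word segmentation concrete, here is a toy sketch of forward maximum matching against a tiny hand-made dictionary. It is an illustration only: the class name ForwardMaxMatchSketch, the three-word dictionary, and the maximum word length are assumptions made up for this example, and this is not how HanLP segments text internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy forward-maximum-matching segmenter (illustration only, not HanLP's algorithm). */
class ForwardMaxMatchSketch {

    /** Greedily takes the longest dictionary word at each position; falls back to one character. */
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            out.add(text.substring(i, end));
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical mini dictionary; a real lexicon holds far more entries.
        Set<String> dict = new HashSet<String>(Arrays.asList("少年", "少年强", "中国"));
        System.out.println(segment("少年强则中国强", dict, 3));
    }
}
```

With this mini dictionary the sketch prints [少年强, 则, 中国, 强]. A real segmenter has to resolve ambiguities that greedy matching cannot, which is where the dictionaries and models described below come in.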

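2. Components

HanLP consists of three parts: the lexicon data, the driver (jar package), and the HanLP configuration file.

2.1 Lexicon

The lexicon contains dictionaries and models. Dictionaries (under data/dictionary) are used for lexical analysis, and models (under data/model) are used for syntactic analysis. The data is distributed in the following packages:

data.full.zip: the complete lexicon (dictionaries and models);

data.standard.zip: the complete dictionaries, without models;

data.mini.zip: a small dictionary set, without models.

Download address: http://115.159.41.123/click.php?id=3

Details: https://github.com/hankcs/HanLP/releases/tag/v1.3.4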

2.2 Driver (jar package)

HanLP provides a lightweight "portable" jar with a basic dictionary built in. The Maven dependency is:

<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.2.8</version>
</dependency>

To use HanLP in Lucene or Solr with a separately installed lexicon, add the corresponding plugin dependency (no version is given here; specify the release that matches your environment):

<dependency>
    <groupId>com.hankcs.nlp</groupId>
    <artifactId>hanlp-solr-plugin</artifactId>
</dependency>

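2.3 Configuration file hanlp.properties

The main setting is the lexicon location, root=D:/HanLP/. The file contents are as follows:

# Root directory for the paths in this file; root + relative path = absolute path
# Windows users: always use / as the path separator
root=D:/HanLP/

# Core dictionary path
CoreDictionaryPath=data/dictionary/CoreNatureDictionary.txt

# Bigram dictionary path
BiGramDictionaryPath=data/dictionary/CoreNatureDictionary.ngram.txt

# Stop word dictionary path
CoreStopWordDictionaryPath=data/dictionary/stopwords.txt

# Synonym dictionary path
CoreSynonymDictionaryDictionaryPath=data/dictionary/synonym/CoreSynonym.txt

# Person name dictionary path
PersonDictionaryPath=data/dictionary/person/nr.txt

# Person name dictionary transition matrix path
PersonDictionaryTrPath=data/dictionary/person/nr.tr.txt

# Traditional/Simplified Chinese dictionary path
TraditionalChineseDictionaryPath=data/dictionary/tc/TraditionalChinese.txt

# Custom dictionary paths. Separate multiple dictionaries with ';'. An entry that starts with a space is in the same directory as the previous entry. The form "filename nature" makes that part-of-speech tag the default for the dictionary. Priority decreases from left to right.
# data/dictionary/custom/CustomDictionary.txt is a high-quality lexicon; please do not delete it
CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; 现代汉语补充词库.txt; 全国地名大全.txt ns; 人名词典.txt; 机构名词典.txt; 上海地名.txt ns;data/dictionary/person/nrf.txt nrf

# CRF segmentation model path
CRFSegmentModelPath=data/model/segment/CRFSegmentModel.txt

# HMM segmentation model path
HMMSegmentModelPath=data/model/segment/HMMSegmentModel.bin

# Whether segmentation results show part-of-speech tags
ShowTermNature=true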
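The first comment in the configuration states that root plus a relative path gives the absolute path. The following is a minimal sketch of that rule using only java.util.Properties; the class name HanlpPathSketch and its resolve helper are hypothetical names for this illustration and are not part of HanLP, which does its own configuration loading.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Sketch: joins the "root" entry with a relative entry, as the config comment describes. */
class HanlpPathSketch {

    /** Returns root + relative path for the given key, or null if the key is missing. */
    static String resolve(String configText, String key) {
        Properties props = new Properties();
        try {
            // Parse from a string so the sketch needs no file on disk.
            props.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String relative = props.getProperty(key);
        if (relative == null) {
            return null;
        }
        String root = props.getProperty("root", "");
        // Make sure a '/' separates root and the relative part.
        if (!root.isEmpty() && !root.endsWith("/")) {
            root = root + "/";
        }
        return root + relative.trim();
    }

    public static void main(String[] args) {
        String config = "root=D:/HanLP/\n"
                + "CoreDictionaryPath=data/dictionary/CoreNatureDictionary.txt\n";
        // Prints D:/HanLP/data/dictionary/CoreNatureDictionary.txt
        System.out.println(resolve(config, "CoreDictionaryPath"));
    }
}
```

This also shows why the comment asks Windows users to use / as the separator: a backslash in a properties file is an escape character and would not survive parsing unchanged.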

3. Code example: using HanLP directly

3.1 Add the Maven dependency

<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.2.8</version>
</dependency>

3.2 Code

import com.hankcs.hanlp.HanLP;

public class HanlpMain {

    public static void main(String[] args) {
        String text = "比你聪明的人，请不要让他还比你努力";
        String traditionText = "比妳聰明的人，請不要讓他還比妳努力";

        System.out.println(HanLP.segment(text)); // word segmentation
        System.out.println(HanLP.extractKeyword(text, 2)); // extract keywords; 2 = number to extract
        System.out.println(HanLP.extractPhrase(text, 2)); // extract phrases; 2 = number to extract
        System.out.println(HanLP.extractSummary(text, 2)); // extract key sentences; 2 = number to extract
        System.out.println(HanLP.getSummary(text, 10)); // build a summary; 10 = maximum length
        System.out.println(HanLP.convertToTraditionalChinese(text)); // Simplified to Traditional Chinese
        System.out.println(HanLP.convertToSimplifiedChinese(traditionText)); // Traditional to Simplified Chinese
        System.out.println(HanLP.convertToPinyinString(text, " ", false)); // convert to pinyin
    }
}

Output:

[比/p, 你/r, 聪明/a, 的/uj, 人/n, ，/w, 请/v, 不/d, 要/v, 让/v, 他/r, 还/d, 比/p, 你/r, 努力/ad]

[聪明, 努力]

[]

[请不要让他还比你努力]

请不要让他还比你努力。

比妳聰明的人，請不要讓他還比妳努力

比你聪明的人，请不要让他还比你努力

bi ni cong ming de ren qing bu yao rang ta hai bi ni nu li
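In the first output line each item has the form word/tag, where the tag is the part of speech (for example n for noun and v for verb); it appears because ShowTermNature=true. As a small sketch of how such printed items could be pulled apart, the hypothetical helper below splits one item at its last slash. It works on the printed text only and assumes nothing about HanLP's own classes.

```java
/** Sketch: splits printed "word/tag" items (hypothetical helper, not a HanLP class). */
class TaggedTermSketch {

    /** Returns {word, tag}; the tag is empty when the item contains no '/'. */
    static String[] split(String item) {
        int cut = item.lastIndexOf('/');
        if (cut < 0) {
            return new String[] { item, "" };
        }
        return new String[] { item.substring(0, cut), item.substring(cut + 1) };
    }

    public static void main(String[] args) {
        // A shortened copy of the printed segmentation result.
        String printed = "[比/p, 你/r, 聪明/a, 的/uj, 人/n]";
        String body = printed.substring(1, printed.length() - 1);
        for (String item : body.split(", ")) {
            String[] pair = split(item);
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Splitting at the last slash rather than the first keeps a word that itself contains a slash intact.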

4. Example: using HanLP in Lucene

4.1 Add the Maven dependencies

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>${lucene.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>${lucene.version}</version>
</dependency>

<!-- Analyzers -->

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-smartcn</artifactId>
    <version>${lucene.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>${lucene.version}</version>
</dependency>

<dependency>
    <groupId>com.hankcs.nlp</groupId>
    <artifactId>hanlp-lucene-plugin</artifactId>
</dependency>

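4.2 Configuration file hanlp.properties

Place hanlp.properties on the classpath (the resources directory is sufficient). The file contents are as follows:

# Root directory for the paths in this file; root + relative path = absolute path
# Windows users: always use / as the path separator
root=D:/HanLP/

# Core dictionary path
CoreDictionaryPath=data/dictionary/CoreNatureDictionary.txt

# Bigram dictionary path
BiGramDictionaryPath=data/dictionary/CoreNatureDictionary.ngram.txt

# Stop word dictionary path
CoreStopWordDictionaryPath=data/dictionary/stopwords.txt

# Synonym dictionary path
CoreSynonymDictionaryDictionaryPath=data/dictionary/synonym/CoreSynonym.txt

# Person name dictionary path
PersonDictionaryPath=data/dictionary/person/nr.txt

# Person name dictionary transition matrix path
PersonDictionaryTrPath=data/dictionary/person/nr.tr.txt

# Traditional/Simplified Chinese dictionary path
TraditionalChineseDictionaryPath=data/dictionary/tc/TraditionalChinese.txt

# Custom dictionary paths. Separate multiple dictionaries with ';'. An entry that starts with a space is in the same directory as the previous entry. The form "filename nature" makes that part-of-speech tag the default for the dictionary. Priority decreases from left to right.
# data/dictionary/custom/CustomDictionary.txt is a high-quality lexicon; please do not delete it
CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; 现代汉语补充词库.txt; 全国地名大全.txt ns; 人名词典.txt; 机构名词典.txt; 上海地名.txt ns;data/dictionary/person/nrf.txt nrf

# CRF segmentation model path
CRFSegmentModelPath=data/model/segment/CRFSegmentModel.txt

# HMM segmentation model path
HMMSegmentModelPath=data/model/segment/HMMSegmentModel.bin

# Whether segmentation results show part-of-speech tags
ShowTermNature=true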
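The comment on custom dictionary paths packs three rules into one line: entries are separated by ';', an entry that starts with a space lives in the same directory as the previous entry, and an optional word after the file name is the default part-of-speech tag. The sketch below applies those rules as stated in the comment. The class name CustomDictPathSketch and the file names in main are hypothetical, and HanLP's own parsing may differ in details.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the documented custom-dictionary path rules (not HanLP's own code). */
class CustomDictPathSketch {

    /** Expands a ';'-separated value into "absolutePath[ nature]" strings. */
    static List<String> expand(String root, String value) {
        List<String> result = new ArrayList<String>();
        String previousDir = root;
        for (String entry : value.split(";")) {
            if (entry.trim().isEmpty()) {
                continue;
            }
            // A leading space means "same directory as the previous entry".
            boolean sameDir = entry.startsWith(" ");
            String trimmed = entry.trim();
            // Anything after the first space is the default part-of-speech tag.
            int cut = trimmed.indexOf(' ');
            String file = cut < 0 ? trimmed : trimmed.substring(0, cut);
            String nature = cut < 0 ? "" : trimmed.substring(cut + 1).trim();
            String path;
            if (sameDir) {
                path = previousDir + file;
            } else {
                path = root + file;
                int lastSlash = path.lastIndexOf('/');
                if (lastSlash >= 0) {
                    previousDir = path.substring(0, lastSlash + 1);
                }
            }
            result.add(nature.isEmpty() ? path : path + " " + nature);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical file names, shaped like the CustomDictionaryPath entries.
        String value = "data/dictionary/custom/CustomDictionary.txt; places.txt ns;data/dictionary/person/nrf.txt nrf";
        for (String line : expand("D:/HanLP/", value)) {
            System.out.println(line);
        }
    }
}
```

In this example places.txt resolves to D:/HanLP/data/dictionary/custom/places.txt with default tag ns, because it inherits the directory of the entry before it.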

4.3 Example

import com.hankcs.lucene.HanLPAnalyzer;
import com.hankcs.lucene.HanLPIndexAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class LuceneHanlpMain {

    public static void main(String[] args) throws Exception {
        String text = "少年强则中国强";

        // Standard analyzer: long words are kept whole
        Analyzer analyzer = new HanLPAnalyzer();
        TokenStream ts = analyzer.tokenStream("field", text);
        ts.reset();
        while (ts.incrementToken()) {
            CharTermAttribute attribute = ts.getAttribute(CharTermAttribute.class); // term text of the token
            OffsetAttribute offsetAttribute = ts.getAttribute(OffsetAttribute.class); // start/end offsets
            PositionIncrementAttribute positionIncrementAttribute = ts.getAttribute(PositionIncrementAttribute.class); // position increment
            System.out.println(attribute + " "
                    + offsetAttribute.startOffset() + " " + offsetAttribute.endOffset() + " "
                    + positionIncrementAttribute.getPositionIncrement());
        }
        ts.end();
        ts.close();
        System.out.println();

        // Index analyzer: long words are also split into their sub-words
        Analyzer indexAnalyzer = new HanLPIndexAnalyzer();
        TokenStream indexTs = indexAnalyzer.tokenStream("field", text);
        indexTs.reset();
        while (indexTs.incrementToken()) {
            CharTermAttribute attribute = indexTs.getAttribute(CharTermAttribute.class); // term text of the token
            OffsetAttribute offsetAttribute = indexTs.getAttribute(OffsetAttribute.class); // start/end offsets
            PositionIncrementAttribute positionIncrementAttribute = indexTs.getAttribute(PositionIncrementAttribute.class); // position increment
            System.out.println(attribute + " "
                    + offsetAttribute.startOffset() + " " + offsetAttribute.endOffset() + " "
                    + positionIncrementAttribute.getPositionIncrement());
        }
        indexTs.end();
        indexTs.close();
        System.out.println();

        // Inspect the segmentation result through a parsed query
        QueryParser queryParser = new QueryParser("txt", analyzer);
        Query query = queryParser.parse(text);
        System.out.println(query.toString("txt"));

        queryParser = new QueryParser("txt", indexAnalyzer);
        query = queryParser.parse(text);
        System.out.println(query.toString("txt"));
    }
}

Output:

少年强 0 3 1

则 3 4 1

中国 4 6 1

强 6 7 1

少年强 0 3 1

少年 0 2 1

则 3 4 1

中国 4 6 1

强 6 7 1

少年强则中国强

少年强少年则中国强
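Each token line above lists the term, its start offset, its end offset, and its position increment. The offsets are character positions in the original text, with the end offset exclusive. As a quick sanity check of that reading, the sketch below compares each printed token with the corresponding substring; the class name OffsetCheckSketch is hypothetical and the token values are copied from the index analyzer output.

```java
/** Sketch: checks that token offsets map back to the source text (illustration only). */
class OffsetCheckSketch {

    /** True if text[start, end) is exactly the term. */
    static boolean matches(String text, String term, int start, int end) {
        return start >= 0 && start <= end && end <= text.length()
                && text.substring(start, end).equals(term);
    }

    public static void main(String[] args) {
        String text = "少年强则中国强";
        // Tokens copied from the index analyzer output: term, start, end.
        String[] terms = { "少年强", "少年", "则", "中国", "强" };
        int[][] offsets = { { 0, 3 }, { 0, 2 }, { 3, 4 }, { 4, 6 }, { 6, 7 } };
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " -> "
                    + matches(text, terms[i], offsets[i][0], offsets[i][1]));
        }
    }
}
```

Every token checks out, including 少年 at 0 to 2, which overlaps 少年强 at 0 to 3. That overlap is what distinguishes the index analyzer from the standard one: the long word and its sub-word are both emitted.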
