[Full-Text Search 03] Lucene Basic Usage - 成就云开发者社区

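1.1 Analyzers

1.1.1 The default analyzer

In the previous article, [Full-Text Search 02] Lucene Getting Started, we used Lucene's default analyzer to tokenize the Chinese edition of A Tale of Two Cities. That approach actually has a problem. Wait, what? Tokenization clearly succeeded and the search found our text, so how can there be a problem? The earlier search only worked because we searched for a single character rather than a whole word. Let's first look at what the default analyzer actually produces.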

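import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.junit.jupiter.api.Test; // assumes JUnit 5; use org.junit.Test with JUnit 4
import org.springframework.boot.test.context.SpringBootTest;

/**
 * @author Demo_Null
 * @version 1.0
 * @date 2021/1/22
 * @desc Output of the default analyzer
 */
@SpringBootTest
public class Lucene {

    @Test
    public void analyzerTest() throws IOException {
        // 1. Create the standard analyzer
        Analyzer analyzer = new StandardAnalyzer();

        // 2. Get a TokenStream. First argument: field name (any name will do here);
        //    second argument: the text to analyze
        TokenStream tokenStream = analyzer.tokenStream("text", "中国码农, Chinese programmer");

        // 3.1 Attribute that exposes the text of the current token
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // 3.2 Attribute that records the start and end offsets of the current token
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        // 3.3 Reset the stream to its start; required before the first incrementToken()
        tokenStream.reset();

        // 4. Walk through the tokens; incrementToken() returns false once the stream is exhausted
        while (tokenStream.incrementToken()) {
            System.out.println("Token: " + charTermAttribute);
        }

        // 5. Signal end of stream, then release resources
        tokenStream.end();
        tokenStream.close();
    }
}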

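Clearly, the default analyzer splits the Chinese text into individual characters and treats each one as a term. For English that may be just about acceptable, but for Chinese it does not work at all: searching for a single Chinese character is usually meaningless. So we need an analyzer that supports Chinese word segmentation.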

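1.1.2 Chinese analyzers

☞ Commonly used Chinese analyzers

No.	Analyzer	Notes
1	word	⚔ Project page
2	Ansj	⚔ Project page
3	MMSeg4j	⚔ Project page
4	IKAnalyzer	⚔ Project page
5	Jcseg	⚔ Project page
6	FudanNLP	⚔ Project page
7	Paoding	⚔ Project page
8	smartcn	⚔ Project page
9	HanLP	⚔ Project page
10	Stanford	⚔ Project page
11	Jieba	⚔ Project page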

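☞ The IK analyzer

IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Version 1.0 was released in December 2006. At first it was a Chinese segmentation component built around the open-source project Lucene, combining dictionary-based segmentation with grammar analysis. From version 3.0 onwards, IK became a general-purpose segmentation component for Java, independent of the Lucene project, while still shipping a default implementation optimized for Lucene. The 2012 version added a simple algorithm for resolving segmentation ambiguity, marking IK's move from pure dictionary-based segmentation towards simulated semantic segmentation.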

<dependency>
    <groupId>com.janeluo</groupId>
    <artifactId>ikanalyzer</artifactId>
    <version>2012_u6</version>
</dependency>

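import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.junit.jupiter.api.Test; // assumes JUnit 5; use org.junit.Test with JUnit 4
import org.springframework.boot.test.context.SpringBootTest;
import org.wltea.analyzer.lucene.IKAnalyzer;

/**
 * @author Demo_Null
 * @version 1.0
 * @date 2021/1/22
 * @desc Output of the IK analyzer
 */
@SpringBootTest
public class Lucene {

    @Test
    public void analyzerTest() throws IOException {
        // 1. Create the IK analyzer; true turns on smart (coarser-grained) segmentation
        IKAnalyzer analyzer = new IKAnalyzer(true);

        // 2. Get a TokenStream. First argument: field name (any name will do here);
        //    second argument: the text to analyze
        StringReader stringReader = new StringReader("中国码农, Chinese programmer");
        TokenStream tokenStream = analyzer.tokenStream("text", stringReader);

        // 3.1 Attribute that exposes the text of the current token
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // 3.2 Attribute that records the start and end offsets of the current token
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        // 3.3 Reset the stream to its start; required before the first incrementToken()
        tokenStream.reset();

        // 4. Walk through the tokens; incrementToken() returns false once the stream is exhausted
        while (tokenStream.incrementToken()) {
            System.out.println("Token: " + charTermAttribute);
        }

        // 5. Signal end of stream, then release resources
        tokenStream.end();
        tokenStream.close();
    }
}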

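As you can see, 中国 is segmented correctly, but 码农 is split apart. That is because IK does not know that 码农 is a word, so we have to tell it. IK provides an extension dictionary: add a new word there and IK will recognize it. Likewise, where there is a way to extend there is a way to exclude: IK also provides a stop-word dictionary, and words listed in it are left out of the tokenized output.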

<?xml version="1.0" encoding="UTF-8"?>
<!-- IKAnalyzer.cfg.xml -->
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Several dictionary files can be listed, separated by ";". Paths are relative to the root of the classpath -->
    <!-- Custom extension dictionary, one word per line -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- Custom stop-word dictionary, one word per line -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>

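We add 码农 to the extension dictionary ext.dic and programmer to the stop-word dictionary stopword.dic, then run the tokenizer again. This time 码农 should come out as a single token, and programmer should no longer appear in the output.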
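
To make the effect of the two dictionaries more concrete, here is a minimal sketch of dictionary-based segmentation using forward maximum matching. It is a toy written with the JDK only and is not IK's actual algorithm. The class ToySegmenter, its methods, and the decision to pre-load 中国 as a known word are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy forward-maximum-matching segmenter. Illustration only; this is not IK's algorithm. */
final class ToySegmenter {
    private final Set<String> dictionary = new HashSet<>();
    private final Set<String> stopWords = new HashSet<>();
    private int maxWordLength = 1;

    /** Plays the role of one line in ext.dic: teaches the segmenter a new word. */
    void addWord(String word) {
        dictionary.add(word);
        maxWordLength = Math.max(maxWordLength, word.length());
    }

    /** Plays the role of one line in stopword.dic: this token is dropped from the output. */
    void addStopWord(String word) {
        stopWords.add(word.toLowerCase(Locale.ROOT));
    }

    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int end = i + 1;
            if (isAsciiAlnum(c)) {
                // Latin letters and digits: take the whole run and lowercase it.
                while (end < text.length() && isAsciiAlnum(text.charAt(end))) {
                    end++;
                }
                emit(tokens, text.substring(i, end).toLowerCase(Locale.ROOT));
            } else if (Character.isLetter(c)) {
                // Other letters (e.g. Chinese): take the longest dictionary word starting
                // here, falling back to a single character when nothing matches.
                int longest = Math.min(maxWordLength, text.length() - i);
                for (int len = longest; len > 1; len--) {
                    if (dictionary.contains(text.substring(i, i + len))) {
                        end = i + len;
                        break;
                    }
                }
                emit(tokens, text.substring(i, end));
            }
            // Anything else (whitespace, punctuation) is skipped.
            i = end;
        }
        return tokens;
    }

    private void emit(List<String> tokens, String token) {
        if (!stopWords.contains(token)) {
            tokens.add(token);
        }
    }

    private static boolean isAsciiAlnum(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        String text = "中国码农, Chinese programmer";
        ToySegmenter segmenter = new ToySegmenter();
        segmenter.addWord("中国"); // assumed to be in the built-in dictionary
        System.out.println(segmenter.segment(text)); // [中国, 码, 农, chinese, programmer]

        segmenter.addWord("码农");            // like adding a line to ext.dic
        segmenter.addStopWord("programmer"); // like adding a line to stopword.dic
        System.out.println(segmenter.segment(text)); // [中国, 码农, chinese]
    }
}
```

The toy only shows why an unknown word falls apart into single characters and how one extra dictionary entry fixes that. A real analyzer such as IK also handles ambiguity, numbers, and mixed scripts, which this sketch ignores.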

1.1.3 Using a custom analyzer in Lucene

1.2 Maintaining the index

1.2.1 Adding documents

☞ Field properties

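Property	Description
Tokenized	Whether the field's content is run through the analyzer. This only makes sense when we want to search within the field's content
Indexed	Index the terms produced by analysis, or the whole field value. Only indexed fields can be searched
Stored	Store the field value in the document. Only stored fields can be read back from a Document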

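☞ Field types

Field class	Type	Tokenized	Indexed	Stored	Description
StringField(FieldName, FieldValue, Store.YES)	String	N	Y	Y / N	Builds a string field that is not analyzed: the whole string goes into the index as a single term (e.g. an order number or a name). Store.YES or Store.NO decides whether the value is stored in the document
LongPoint(String name, long... point)	Long	Y	Y	N	LongPoint, IntPoint and similar types make numeric values indexable, but they do not store the value. To store it as well, add a StoredField with the same name
StoredField(FieldName, FieldValue)	Overloaded, supports several types	N	N	Y	Builds fields of various types that are neither analyzed nor indexed, only stored in the document
TextField(FieldName, FieldValue, Store.NO) or TextField(FieldName, reader)	String or stream	Y	Y	Y / N	When a Reader is passed, Lucene assumes the content is large and does not store it

☞ Example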
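
The three flags are easier to tell apart with a small model in hand. The sketch below is a toy written with the JDK only. It is not how Lucene lays out its index, and the names ToyIndex, ToyField, addDocument, search and get are made up for this illustration. It shows that "indexed" decides whether a value can be found, "tokenized" decides whether it is found by word or only as a whole, and "stored" decides whether it can be read back.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the three field flags. Illustration only; not how Lucene stores data. */
final class ToyIndex {
    static final class ToyField {
        final String name;
        final String value;
        final boolean tokenized;
        final boolean indexed;
        final boolean stored;

        ToyField(String name, String value, boolean tokenized, boolean indexed, boolean stored) {
            this.name = name;
            this.value = value;
            this.tokenized = tokenized;
            this.indexed = indexed;
            this.stored = stored;
        }
    }

    // "field:term" -> ids of documents containing that term (what "indexed" provides)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> original values kept for display (what "stored" provides)
    private final List<Map<String, String>> storedValues = new ArrayList<>();

    int addDocument(List<ToyField> fields) {
        int docId = storedValues.size();
        Map<String, String> stored = new HashMap<>();
        for (ToyField field : fields) {
            if (field.indexed) {
                // Tokenized: index each lowercased word. Otherwise: index the whole value as one term.
                String[] terms = field.tokenized
                        ? field.value.toLowerCase(Locale.ROOT).split("\\s+")
                        : new String[] {field.value};
                for (String term : terms) {
                    postings.computeIfAbsent(field.name + ":" + term, key -> new TreeSet<>()).add(docId);
                }
            }
            if (field.stored) {
                stored.put(field.name, field.value);
            }
        }
        storedValues.add(stored);
        return docId;
    }

    Set<Integer> search(String fieldName, String term) {
        return postings.getOrDefault(fieldName + ":" + term, Collections.emptySet());
    }

    String get(int docId, String fieldName) {
        return storedValues.get(docId).get(fieldName);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int id = index.addDocument(Arrays.asList(
                new ToyField("filename", "A Tale of Two Cities", true, true, true), // tokenized, indexed, stored
                new ToyField("orderNo", "ORD 001", false, true, true),              // indexed as one term, stored
                new ToyField("size", "1024", true, true, false),                    // searchable but not stored
                new ToyField("path", "/tmp/tale.txt", false, false, true)));        // stored only

        System.out.println(index.search("filename", "tale"));      // [0]
        System.out.println(index.search("orderNo", "ORD"));        // []  (only the whole value is a term)
        System.out.println(index.search("orderNo", "ORD 001"));    // [0]
        System.out.println(index.search("size", "1024"));          // [0]
        System.out.println(index.get(id, "size"));                 // null (never stored)
        System.out.println(index.search("path", "/tmp/tale.txt")); // []  (never indexed)
        System.out.println(index.get(id, "path"));                 // /tmp/tale.txt
    }
}
```

The size row is the one to watch in practice: a value that is indexed but not stored can be matched by a query, yet comes back as null when read from the document.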

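import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test; // assumes JUnit 5; use org.junit.Test with JUnit 4
import org.springframework.boot.test.context.SpringBootTest;

/**
 * @author Demo_Null
 * @version 1.0
 * @date 2021/1/21
 * @desc Lucene getting-started example: creating an index
 */
@SpringBootTest
public class LuceneDemo {

    @Test
    public void create() throws IOException {
        // 1. Point to the index directory (backslashes must be escaped in Java string literals)
        Directory directory = FSDirectory.open(new File("C:\\Users\\softw\\Desktop\\temp").toPath());

        // 2. Create the IndexWriterConfig; the no-arg constructor uses StandardAnalyzer
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig();

        // 3. Create the IndexWriter
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

        // 4. Read the source document
        File file = new File("C:\\Users\\softw\\Desktop\\file\\双城记.txt");
        String name = file.getName();
        String path = file.getPath();
        // Read the content with org.apache.commons.io.FileUtils
        String content = FileUtils.readFileToString(file, "GBK");
        long size = FileUtils.sizeOf(file);

        // 5. Create the fields. Arguments: field name, field value, whether to store the value
        TextField fileNameField = new TextField("filename", name, Field.Store.YES);
        TextField filePathField = new TextField("path", path, Field.Store.YES);
        TextField fileContentField = new TextField("content", content, Field.Store.YES);
        // LongPoint only indexes the value; the StoredField makes doc.get("size") work later
        LongPoint fileSizeField = new LongPoint("size", size);
        StoredField fileSizeStored = new StoredField("size", size);

        // 6. Create the Document and add the fields
        Document document = new Document();
        document.add(fileNameField);
        document.add(filePathField);
        document.add(fileContentField);
        document.add(fileSizeField);
        document.add(fileSizeStored);

        // 7. Index the document and write it to the index directory
        indexWriter.addDocument(document);

        // 8. Release resources (close() also commits)
        indexWriter.close();
    }
}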

1.2.2 Deleting documents

☞ Delete everything [use with caution]

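// Goes inside the LuceneDemo test class from 1.2.1 and uses the same imports
@Test
public void deleteAll() throws IOException {
    // 1. Point to the index directory on local disk
    Directory directory = FSDirectory.open(new File("C:\\Users\\softw\\Desktop\\temp").toPath());

    // 2. Create the IndexWriterConfig
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig();

    // 3. Create the IndexWriter
    IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

    // 4. Delete everything: all documents in the index are removed and,
    //    once committed, cannot be recovered
    indexWriter.deleteAll();

    // 5. Release resources (close() also commits)
    indexWriter.close();
}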

☞ Delete by query

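// Goes inside the LuceneDemo test class from 1.2.1. Additional imports:
// org.apache.lucene.index.Term, org.apache.lucene.search.Query, org.apache.lucene.search.TermQuery
@Test
public void deleteQuery() throws IOException {
    // 1. Point to the index directory on local disk
    Directory directory = FSDirectory.open(new File("C:\\Users\\softw\\Desktop\\temp").toPath());

    // 2. Create the IndexWriterConfig
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig();

    // 3. Create the IndexWriter
    IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

    // 4. Build the query that selects the documents to delete
    Query query = new TermQuery(new Term("filename", "apache"));

    // 5. Delete every document matching the query
    indexWriter.deleteDocuments(query);

    // 6. Release resources (close() also commits)
    indexWriter.close();
}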

1.2.3 Updating documents

// Goes inside the LuceneDemo test class from 1.2.1. Additional import: org.apache.lucene.index.Term
@Test
public void update() throws IOException {
    // 1. Point to the index directory on local disk
    Directory directory = FSDirectory.open(new File("C:\\Users\\softw\\Desktop\\temp").toPath());

    // 2. Create the IndexWriterConfig
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig();

    // 3. Create the IndexWriter
    IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

    // 4. Create a Document holding the new data
    Document document = new Document();
    document.add(new TextField("filename", "apache lucene", Field.Store.YES));
    document.add(new TextField("content", "Lucene 是 Apache 的子项目", Field.Store.YES));

    // 5. Update: this first deletes every document containing the term, then adds the new one
    indexWriter.updateDocument(new Term("filename", "apache"), document);

    // 6. Release resources (close() also commits)
    indexWriter.close();
}
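
Because an update is really "delete by term, then add", it can remove more than one document and still add only one. The sketch below models that behaviour with a plain list of strings. It is a toy written with the JDK only; ToyUpdate and its update method are hypothetical names, and a whitespace split stands in for the analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Toy model of update-as-delete-then-add. Illustration only; not Lucene code. */
final class ToyUpdate {
    /** Removes every document whose words include the term, then appends the replacement. */
    static void update(List<String> docs, String term, String replacement) {
        docs.removeIf(doc -> Arrays.asList(doc.toLowerCase(Locale.ROOT).split("\\s+")).contains(term));
        docs.add(replacement);
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>(Arrays.asList(
                "apache lucene intro",
                "apache solr notes",
                "a tale of two cities"));

        update(docs, "apache", "apache lucene");

        // Two documents matched the term and were removed; one new document was added.
        System.out.println(docs); // [a tale of two cities, apache lucene]
    }
}
```

If exactly one document should be replaced, the term needs to identify it uniquely, for example an id kept in a field that is not tokenized.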

1.2.4 Searching

☞ TermQuery

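import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test; // assumes JUnit 5; use org.junit.Test with JUnit 4
import org.springframework.boot.test.context.SpringBootTest;

/**
 * @author Demo_Null
 * @version 1.0
 * @date 2021/1/26
 * @desc Searching the index with TermQuery
 */
@SpringBootTest
public class CreateIndex {

    @Test
    public void search() throws IOException {
        // 1. Point to the index directory
        Directory directory = FSDirectory.open(new File("C:\\Users\\softw\\Desktop\\temp").toPath());

        // 2. Create the IndexReader
        IndexReader indexReader = DirectoryReader.open(directory);

        // 3. Create the IndexSearcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        // 4. Build the query. TermQuery does not run the analyzer, so it is best
        //    suited to fields that are not tokenized
        Query query = new TermQuery(new Term("content", "apache"));

        // 5. Run the query. Arguments: the query, and the maximum number of hits to return
        TopDocs search = indexSearcher.search(query, 10);

        System.out.println("Total hits: " + search.totalHits);

        // 6. Iterate over the hits
        for (ScoreDoc scoreDoc : search.scoreDocs) {
            // 6.1 Load the Document by id; scoreDoc.doc is the document's id
            Document doc = indexSearcher.doc(scoreDoc.doc);
            System.out.println("File name: " + doc.get("filename"));
            System.out.println("File path: " + doc.get("path"));
            // null unless "size" was also stored (e.g. as a StoredField) at index time
            System.out.println("File size: " + doc.get("size"));
        }

        // 7. Release resources
        indexReader.close();
    }
}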

☞ RangeQuery

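import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test; // assumes JUnit 5; use org.junit.Test with JUnit 4
import org.springframework.boot.test.context.SpringBootTest;

/**
 * @author Demo_Null
 * @version 1.0
 * @date 2021/1/26
 * @desc Searching the index with a numeric range query
 */
@SpringBootTest
public class CreateIndex {

    @Test
    public void search() throws IOException {
        // 1. Point to the index directory
        Directory directory = FSDirectory.open(new File("C:\\Users\\softw\\Desktop\\temp").toPath());

        // 2. Create the IndexReader
        IndexReader indexReader = DirectoryReader.open(directory);

        // 3. Create the IndexSearcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        // 4. Build the query: size between 1 and 1024000, both bounds inclusive
        Query query = LongPoint.newRangeQuery("size", 1, 1024000);

        // 5. Run the query. Arguments: the query, and the maximum number of hits to return
        TopDocs search = indexSearcher.search(query, 10);

        System.out.println("Total hits: " + search.totalHits);

        // 6. Iterate over the hits
        for (ScoreDoc scoreDoc : search.scoreDocs) {
            // 6.1 Load the Document by id; scoreDoc.doc is the document's id
            Document doc = indexSearcher.doc(scoreDoc.doc);
            System.out.println("File name: " + doc.get("filename"));
            System.out.println("File path: " + doc.get("path"));
            // null unless "size" was also stored (e.g. as a StoredField) at index time
            System.out.println("File size: " + doc.get("size"));
        }

        // 7. Release resources
        indexReader.close();
    }
}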

☞ QueryParser

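import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test; // assumes JUnit 5; use org.junit.Test with JUnit 4
import org.springframework.boot.test.context.SpringBootTest;

/**
 * @author Demo_Null
 * @version 1.0
 * @date 2021/1/26
 * @desc Searching the index with QueryParser
 */
@SpringBootTest
public class CreateIndex {

    @Test
    public void search() throws Exception {
        // 1. Point to the index directory
        Directory directory = FSDirectory.open(new File("C:\\Users\\softw\\Desktop\\temp").toPath());

        // 2. Create the IndexReader
        IndexReader indexReader = DirectoryReader.open(directory);

        // 3. Create the IndexSearcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        // 4.1 Create the parser. First argument: default field to search; second argument:
        //     the analyzer (ideally the same one that was used to build the index)
        QueryParser queryParser = new QueryParser("content", new StandardAnalyzer());
        // 4.2 parse() builds a Query from a string written in the query syntax
        Query query = queryParser.parse("双城记");

        // 5. Run the query. Arguments: the query, and the maximum number of hits to return
        TopDocs search = indexSearcher.search(query, 10);

        System.out.println("Total hits: " + search.totalHits);

        // 6. Iterate over the hits
        for (ScoreDoc scoreDoc : search.scoreDocs) {
            // 6.1 Load the Document by id
            Document doc = indexSearcher.doc(scoreDoc.doc);
            System.out.println("File name: " + doc.get("filename"));
            System.out.println("File path: " + doc.get("path"));
            // null unless "size" was also stored (e.g. as a StoredField) at index time
            System.out.println("File size: " + doc.get("size"));
        }

        // 7. Release resources
        indexReader.close();
    }
}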
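
The practical difference between TermQuery and QueryParser is whether the query text goes through an analyzer before it is looked up. The sketch below models that with a toy inverted index written with the JDK only. The class TermVsParsedQuery and its methods are hypothetical. The analyze method is a stand-in that lowercases Latin words and splits Chinese text into single characters, roughly what the default analyzer did earlier in this article. Combining the analyzed tokens with OR is an assumption made to keep the toy short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy comparison of exact-term lookup and analyzed lookup. Illustration only; not Lucene code. */
final class TermVsParsedQuery {
    /** Stand-in analyzer: lowercased ASCII words, and one token per non-ASCII letter. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 128 && Character.isLetterOrDigit(c)) {
                word.append(Character.toLowerCase(c));
                continue;
            }
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (Character.isLetter(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    /** Builds term -> document ids, analyzing every document at index time. */
    static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : analyze(docs.get(id))) {
                postings.computeIfAbsent(token, key -> new TreeSet<>()).add(id);
            }
        }
        return postings;
    }

    /** Like a term query: the term is looked up exactly as given, with no analysis. */
    static Set<Integer> termQuery(Map<String, Set<Integer>> postings, String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    /** Like a parsed query: the text is analyzed first, then each token is looked up (OR). */
    static Set<Integer> parsedQuery(Map<String, Set<Integer>> postings, String queryText) {
        Set<Integer> hits = new TreeSet<>();
        for (String token : analyze(queryText)) {
            hits.addAll(termQuery(postings, token));
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("双城记 A Tale of Two Cities");
        docs.add("Apache Lucene");
        Map<String, Set<Integer>> postings = buildIndex(docs);

        System.out.println(termQuery(postings, "Apache"));   // []  (the index holds "apache")
        System.out.println(termQuery(postings, "apache"));   // [1]
        System.out.println(termQuery(postings, "双城记"));    // []  (the index holds 双, 城, 记)
        System.out.println(parsedQuery(postings, "双城记"));  // [0]
        System.out.println(parsedQuery(postings, "Apache")); // [1]
    }
}
```

This is also why the analyzer passed to QueryParser should match the one used at index time: the query only finds terms that were produced the same way when the documents were indexed.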