Lucene.Net Full-Text Search (1): Concepts and Examples

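Related Concepts

Site Search

Informally, site search is the "front door" of a website or online store. On the surface it has two parts: a search box and a results page. The back end is considerably more complex. Its core pieces include Chinese word segmentation, page crawling, index building, result ranking, and the collection, analysis, association, and recommendation of search keywords.

The most familiar example is the search box on the home page of an e-commerce site. It finds products by keyword (after tokenization), category, product summary, or product details, and it can sort the results by relevance, price, or sales volume.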

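Full-Text Search

Full-text search tokenizes the content of a site's pages and documents and builds an index from the resulting tokens. A keyword query is then matched against that index, and the pages behind the matching index entries are shown to the user.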
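The tokenize, index, and match steps described above can be sketched with a toy inverted index. This is only an illustration of the idea, not how Lucene.Net stores its index: the class name `TinyInvertedIndex`, the whitespace/punctuation tokenizer, and the sample sentences are all assumptions made for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical toy index: maps each term to the ids of the documents containing it.
public class TinyInvertedIndex
{
    private readonly Dictionary<string, SortedSet<int>> postings =
        new Dictionary<string, SortedSet<int>>();

    // Assumed tokenizer: lower-case, then split on spaces and basic punctuation.
    public static string[] Tokenize(string text)
    {
        return text.ToLowerInvariant()
                   .Split(new[] { ' ', ',', '.', ';' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Indexing: record the document id under every term of the text.
    public void Add(int docId, string text)
    {
        foreach (string term in Tokenize(text))
        {
            SortedSet<int> ids;
            if (!postings.TryGetValue(term, out ids))
            {
                ids = new SortedSet<int>();
                postings[term] = ids;
            }
            ids.Add(docId);
        }
    }

    // Querying: tokenize the query the same way and keep the documents
    // that contain every query term.
    public List<int> Search(string query)
    {
        IEnumerable<int> result = null;
        foreach (string term in Tokenize(query))
        {
            SortedSet<int> ids;
            if (!postings.TryGetValue(term, out ids)) return new List<int>();
            result = result == null ? ids : result.Intersect(ids);
        }
        return result == null ? new List<int>() : result.OrderBy(id => id).ToList();
    }

    public static void Main()
    {
        var index = new TinyInvertedIndex();
        index.Add(1, "Lucene is a powerful search tool");
        index.Add(2, "PanGu is a Chinese word segmenter");
        index.Add(3, "Lucene.Net is a search library");
        Console.WriteLine(string.Join(",", index.Search("lucene search"))); // prints 1,3
    }
}
```

The important point is that the query goes through the same tokenizer as the indexed text. If the two differ, terms that look identical to a reader may never match.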

Lucene.Net

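Lucene.Net is the .NET port of Lucene, written in C#. It implements full-text search: data is split into atoms (characters or words) in advance and saved to disk. At query time the keywords are split into atoms in the same way, the atoms are matched against the stored ones, and the matching results are returned.

Install the NuGet packages "Lucene.Net" and "Lucene.Net.Analysis.PanGu" (PanGu is a third-party Chinese word segmenter).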

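The seven core objects of Lucene.Net

1. Analysis: the analyzers (tokenizers) that split a string into atoms. Built-in options include the standard analyzer and plain whitespace splitting. This project uses the PanGu Chinese analyzer.
2. Document: the data structure that defines the format in which data is stored.
3. Index: the classes that read and write the index.
4. QueryParser: the query parser, which parses query strings.
5. Search: the query classes. Parsing a query string produces one of these.
6. Store: the index storage classes, which handle folders and similar details.
7. Util: common utility classes.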
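To make the Analysis item more concrete, here is a minimal sketch of dictionary-based backward maximum matching, the general idea that a comment in the write example attributes to PanGu ("match from back to front and keep what equals a dictionary word"). It is not PanGu's actual implementation. The class `BackwardMaxMatch`, its parameters, and the tiny dictionary are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.Net or PanGu.
public static class BackwardMaxMatch
{
    // Scans from the end of the text. At each position it takes the longest
    // dictionary word that ends there; if none matches, it falls back to a
    // single-character token.
    public static List<string> Segment(string text, ISet<string> dictionary, int maxWordLength)
    {
        var tokens = new List<string>();
        int end = text.Length;
        while (end > 0)
        {
            int length = Math.Min(maxWordLength, end);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (length > 1 && !dictionary.Contains(text.Substring(end - length, length)))
            {
                length--;
            }
            tokens.Add(text.Substring(end - length, length));
            end -= length;
        }
        tokens.Reverse(); // tokens were collected back to front
        return tokens;
    }

    public static void Main()
    {
        var dictionary = new HashSet<string> { "中华人民共和国", "人民", "共和国" };
        // Without the slang word in the dictionary it is split into single characters.
        Console.WriteLine(string.Join("/", Segment("中华人民共和国城会玩", dictionary, 7)));

        // After adding the word by hand it is kept as one token.
        dictionary.Add("城会玩");
        Console.WriteLine(string.Join("/", Segment("中华人民共和国城会玩", dictionary, 7)));
    }
}
```

This shows why a maintainable dictionary matters: a new word is only kept as a single term once it has been added to the dictionary.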

Source code: https://github.com/apache/lucenenet/releases/tag/Lucene.Net_3_0_3_RC2_final

Index store: write example

        List<Commodity> commodityList = GetList(); // load the source data
        FSDirectory directory = FSDirectory.Open(StaticConstant.TestIndexPath); // index folder on disk
        // The analyzer tokenizes the content before it is written to disk.
        // PanGuAnalyzer (PanGu segmenter): for text such as 中华人民共和国 it matches from back to front
        // and keeps every span that equals a dictionary word. See the PanGu project site for details.
        // The dictionary can be maintained by hand: internet slang such as 城会玩 is not in the
        // default dictionary, but it can be added.
        using (IndexWriter writer = new IndexWriter(directory, new PanGuAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED)) // index writer; true = create a new index
        {
            foreach (Commodity commodity in commodityList)
            {
                for (int k = 0; k < 10; k++)
                {
                    Document doc = new Document(); // one record
                    // Field arguments: name, value, whether the value is stored, whether it is analyzed (tokenized).
                    // id is stored so that the read example can print it with doc.Get("id").
                    doc.Add(new Field("id", commodity.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("title", commodity.Title, Field.Store.YES, Field.Index.ANALYZED));
                    doc.Add(new Field("url", commodity.Url, Field.Store.NO, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("imageurl", commodity.ImageUrl, Field.Store.NO, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("content", "this is lucene working,powerful tool " + k, Field.Store.YES, Field.Index.ANALYZED));
                    doc.Add(new NumericField("price", Field.Store.YES, true).SetDoubleValue((double)(commodity.Price + k)));
                    //doc.Add(new NumericField("time", Field.Store.YES, true).SetLongValue(DateTime.Now.ToFileTimeUtc()));
                    doc.Add(new NumericField("time", Field.Store.YES, true).SetIntValue(int.Parse(DateTime.Now.ToString("yyyyMMdd")) + k));
                    writer.AddDocument(doc); // write the record to the index
                }
            }
            writer.Optimize(); // optimize, i.e. merge the index segments
        }</code></pre></div></div><figure class=""><div class="rno-markdown-img-url" style="text-align:center"><div class="rno-markdown-img-url-inner" style="width:100%"><div style="width:100%"><img src="https://cdn.static.attains.cn/app/developer-bbs/upload/1723218123258895356.png" /></div></div></div></figure><figure class=""><div class="rno-markdown-img-url" style="text-align:center"><div class="rno-markdown-img-url-inner" style="width:70.16%"><div style="width:100%"><img src="https://cdn.static.attains.cn/app/developer-bbs/upload/1723218123617454158.png" /></div></div></div></figure><figure class=""><div class="rno-markdown-img-url" style="text-align:center"><div class="rno-markdown-img-url-inner" style="width:100%"><div style="width:100%"><img src="https://cdn.static.attains.cn/app/developer-bbs/upload/1723218123824697157.png" /></div></div></div></figure><figure class=""><div class="rno-markdown-img-url" style="text-align:center"><div class="rno-markdown-img-url-inner" style="width:65.69%"><div style="width:100%"><img src="https://cdn.static.attains.cn/app/developer-bbs/upload/1723218124087689020.png" /></div></div></div></figure><h3 id="dp884" name="%E7%B4%A2%E5%BC%95%E5%BA%93%E2%80%94%E2%80%94%E8%AF%BB%E7%A4%BA%E4%BE%8B">Index store: read example</h3><div class="rno-markdown-code"><div class="developer-code-block"><pre class="prism-token token line-numbers language-javascript"><code class="language-javascript" style="margin-left:0">        FSDirectory dir = FSDirectory.Open(StaticConstant.TestIndexPath); // index folder on disk
        IndexSearcher searcher = new IndexSearcher(dir); // searcher over the index
        {
            FuzzyQuery query = new FuzzyQuery(new Term("title", "高中政治"));
            //TermQuery query = new TermQuery(new Term("title", "周年")); // exact term match
            TopDocs docs = searcher.Search(query, null, 10000); // matching documents (at most 10000)
            foreach (ScoreDoc sd in docs.ScoreDocs)
            {
                Document doc = searcher.Doc(sd.Doc);
                Console.WriteLine("***************************************");
                Console.WriteLine(string.Format("id={0}", doc.Get("id")));
                Console.WriteLine(string.Format("title={0}", doc.Get("title")));
                Console.WriteLine(string.Format("time={0}", doc.Get("time")));
                Console.WriteLine(string.Format("price={0}", doc.Get("price")));
                Console.WriteLine(string.Format("content={0}", doc.Get("content")));
            }
            Console.WriteLine("1: {0} hits in total", docs.TotalHits);
        }

        QueryParser parser = new QueryParser(Version.LUCENE_30, "title", new PanGuAnalyzer()); // query parser on the title field
        {
            // string keyword = "高中政治人教新课标选修生活中的法律常识";
            string keyword = "高中政治 人 教 新课 标 选修 生活 中的 法律常识";
            {
                Query query = parser.Parse(keyword);
                TopDocs docs = searcher.Search(query, null, 10000); // matching documents (at most 10000)

                int i = 0;
                foreach (ScoreDoc sd in docs.ScoreDocs)
                {
                    if (i++ &lt; 1000)
                    {
                        Document doc = searcher.Doc(sd.Doc);
                        Console.WriteLine("***************************************");
                        Console.WriteLine(string.Format("id={0}", doc.Get("id")));
                        Console.WriteLine(string.Format("title={0}", doc.Get("title")));
                        Console.WriteLine(string.Format("time={0}", doc.Get("time")));
                        Console.WriteLine(string.Format("price={0}", doc.Get("price")));
                    }
                }
                Console.WriteLine($"Total hits: {docs.TotalHits}");
            }
            {
                Query query = parser.Parse(keyword);
                NumericRangeFilter&lt;int&gt; timeFilter = NumericRangeFilter.NewIntRange("time", 20090101, 20201231, true, true); // filter: both bounds inclusive
                SortField sortPrice = new SortField("price", SortField.DOUBLE, false); // reverse = false: ascending
                SortField sortTime = new SortField("time", SortField.INT, true); // reverse = true: descending
                Sort sort = new Sort(sortTime, sortPrice); // sort by time first, then by price

                TopDocs docs = searcher.Search(query, timeFilter, 10000, sort); // filtered and sorted matches

                // With filtering and sorting in place, the same result set can drive paged queries.
                int i = 0;
                foreach (ScoreDoc sd in docs.ScoreDocs)
                {
                    if (i++ &lt; 1000)
                    {
                        Document doc = searcher.Doc(sd.Doc);
                        Console.WriteLine("***************************************");
                        Console.WriteLine(string.Format("id={0}", doc.Get("id")));
                        Console.WriteLine(string.Format("title={0}", doc.Get("title")));
                        Console.WriteLine(string.Format("time={0}", doc.Get("time")));
                        Console.WriteLine(string.Format("price={0}", doc.Get("price")));
                    }
                }
                Console.WriteLine("3: {0} hits in total", docs.TotalHits);
            }
        }</code></pre></div></div><figure class=""><div class="rno-markdown-img-url" style="text-align:center"><div class="rno-markdown-img-url-inner" style="width:91.08%"><div style="width:100%"><img src="https://cdn.static.attains.cn/app/developer-bbs/upload/1723218124357562390.png" /></div></div></div></figure><figure class=""><div class="rno-markdown-img-url" style="text-align:center"><div class="rno-markdown-img-url-inner" style="width:100%"><div style="width:100%"><img src="https://cdn.static.attains.cn/app/developer-bbs/upload/1723218124642135245.png" /></div></div></div></figure><figure class=""><div class="rno-markdown-img-url" style="text-align:center"><div class="rno-markdown-img-url-inner" style="width:100%"><div style="width:100%"><img src="https://cdn.static.attains.cn/app/developer-bbs/upload/1723218125041460222.png" /></div></div></div></figure><figure class=""><div class="rno-markdown-img-url" style="text-align:center"><div class="rno-markdown-img-url-inner" style="width:44.65%"><div style="width:100%"><img src="https://cdn.static.attains.cn/app/developer-bbs/upload/1723218125334952875.png" /></div></div></div></figure><h3 id="5qi32" name="%E5%A4%9A%E7%BA%BF%E7%A8%8B%E5%86%99%E5%85%A5%E7%B4%A2%E5%BC%95%E5%BA%93%E7%A4%BA%E4%BE%8B">Multithreaded index-writing example</h3><div class="rno-markdown-code"><div class="developer-code-block"><pre class="prism-token token line-numbers language-javascript"><code class="language-javascript" style="margin-left:0">        try
        {
            logger.Debug(string.Format("{0} BuildIndex started", DateTime.Now));

            List&lt;Task&gt; taskList = new List&lt;Task&gt;();
            TaskFactory taskFactory = new TaskFactory();
            CTS = new CancellationTokenSource();
            // 30 tables and 30 threads is the easy case: one table per thread, evenly loaded.
            // 30 tables and 18 threads: giving threads 1-12 two tables each and threads 13-18 one table each
            // is wrong, because the first 12 threads get twice the work of the rest.
            // Exercise: how can the thread count be freely configurable while the work stays evenly distributed?
            for (int i = 1; i &lt; 31; i++)
            {
                IndexBuilderPerThread thread = new IndexBuilderPerThread(i, i.ToString("000"), CTS);
                PathSuffixList.Add(i.ToString("000"));
                taskList.Add(taskFactory.StartNew(thread.Process)); // start one task that builds the index for table i
            }
            taskList.Add(taskFactory.ContinueWhenAll(taskList.ToArray(), MergeIndex)); // merge once every builder has finished
            Task.WaitAll(taskList.ToArray());
            logger.Debug(string.Format("BuildIndex {0}", CTS.IsCancellationRequested ? "failed" : "succeeded"));
        }
        catch (Exception ex)
        {
            logger.Error("BuildIndex threw an exception", ex);
        }
        finally
        {
            logger.Debug(string.Format("{0} BuildIndex finished", DateTime.Now));
        }</code></pre></div></div><div class="rno-markdown-code"><div class="developer-code-block"><pre class="prism-token token line-numbers language-javascript"><code class="language-javascript" style="margin-left:0">    private static void MergeIndex(Task[] tasks)
    {
        try
        {
            if (CTS.IsCancellationRequested) return; // cancellation was requested, so skip the merge
            ILuceneBulid builder = new LuceneBulid();
            builder.MergeIndex(PathSuffixList.ToArray());
        }
        catch (Exception ex)
        {
            CTS.Cancel();
            logger.Error("MergeIndex threw an exception", ex);
        }
    }</code></pre></div></div><div class="rno-markdown-code"><div class="developer-code-block"><pre class="prism-token token line-numbers language-javascript"><code class="language-javascript" style="margin-left:0">    /// &lt;summary&gt;
    /// Merges the child indexes into the parent directory.
    /// &lt;/summary&gt;
    /// &lt;param name="childDirs"&gt;Names of the child folders to merge&lt;/param&gt;
    public void MergeIndex(string[] childDirs)
    {
        Console.WriteLine("MergeIndex Start");
        IndexWriter writer = null;
        try
        {
            if (childDirs == null || childDirs.Length == 0) return;
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
            string rootPath = StaticConstant.IndexPath;
            DirectoryInfo dirInfo = Directory.CreateDirectory(rootPath);
            LuceneIO.Directory directory = LuceneIO.FSDirectory.Open(dirInfo);
            writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.LIMITED); // create = true: replaces any existing index
            LuceneIO.Directory[] dirNo = childDirs.Select(dir =&gt; LuceneIO.FSDirectory.Open(Directory.CreateDirectory(string.Format("{0}\\{1}", rootPath, dir)))).ToArray();
            writer.MergeFactor = 100; // controls how often segments are merged; the default is 10
            writer.UseCompoundFile = true; // write compound files to reduce the number of index files
            writer.AddIndexesNoOptimize(dirNo);
            writer.Optimize(); // optimize only after the child indexes were added successfully
        }
        finally
        {
            if (writer != null)
            {
                writer.Close();
            }
            Console.WriteLine("MergeIndex End");
        }
    }</code></pre></div></div>
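One possible answer to the question raised in the comments of the multithreaded example (how to allow any thread count while keeping the work balanced) is to stop assigning tables to threads up front and let each worker pull the next table from a shared queue. The sketch below is an assumption about how that could look, not the author's solution. `SharedQueueBuilder`, `Run`, and the `processTable` callback are hypothetical names, and the callback stands in for building one table's sub-index.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original project.
public static class SharedQueueBuilder
{
    // Runs processTable once for every table number 1..tableCount using
    // workerCount workers. Each worker takes the next table from a shared
    // queue as soon as it is free, so no worker is stuck with a fixed share.
    // Returns how many tables each worker ended up processing.
    public static int[] Run(int tableCount, int workerCount, Action<int> processTable)
    {
        if (workerCount <= 0) throw new ArgumentOutOfRangeException("workerCount");

        var queue = new ConcurrentQueue<int>(Enumerable.Range(1, tableCount));
        var processedPerWorker = new int[workerCount];

        Task[] workers = Enumerable.Range(0, workerCount).Select(w => Task.Run(() =>
        {
            int table;
            while (queue.TryDequeue(out table))
            {
                processTable(table);
                processedPerWorker[w]++; // each worker writes only to its own slot
            }
        })).ToArray();

        Task.WaitAll(workers);
        return processedPerWorker;
    }

    public static void Main()
    {
        var done = new ConcurrentBag<int>();
        // Stand-in for "build the sub-index of this table".
        int[] counts = Run(30, 18, table => done.Add(table));
        Console.WriteLine("tables processed: " + done.Count); // prints tables processed: 30
        Console.WriteLine("per worker: " + string.Join(",", counts));
    }
}
```

Because every table still produces its own sub-index folder, the merge step can stay as it is. Only the mapping from tables to threads changes, and the number of workers becomes a plain configuration value.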