且构网

How to match a subset of the search string in Solr/Lucene

Updated: 2022-11-14 09:06:16

It sounds like you want to use ShingleFilter in your analysis chain so that word bigrams are indexed. To do that, add ShingleFilterFactory to both the index-time and the query-time analyzer.

At index time, your documents are then indexed like this:

  • "quick brown" -> quick_brown
  • "fox over" -> fox_over
  • "lazy dog" -> lazy_dog
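
The bigram output above can be illustrated with a small Python sketch. This is not Lucene's ShingleFilter, only an approximation of what it emits; `shingle` is a hypothetical helper, and the sketch assumes an underscore separator and no unigram output. ShingleFilterFactory itself defaults to a space separator and also emits unigrams, so reproducing these terms in Solr would presumably require `tokenSeparator="_"` and `outputUnigrams="false"`.

```python
def shingle(text, size=2, sep="_"):
    """Return word n-grams (shingles) of `size` lowercased tokens joined by `sep`.

    Hypothetical helper: whitespace tokenization stands in for a real analyzer.
    """
    tokens = text.lower().split()
    return [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# The three example documents from the answer, one bigram each.
for doc in ["quick brown", "fox over", "lazy dog"]:
    print(doc, "->", shingle(doc))
```

A text shorter than `size` tokens produces no shingles at all in this sketch, which is worth keeping in mind for single-word documents.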

At query time, your query becomes:

  • "the quick brown fox jumps over the lazy dog" -> "the_quick quick_brown brown_fox fox_jumps jumps_over over_the the_lazy lazy_dog"

This is still not good enough: by default, the query parser will build a phrase query from these tokens. So, in your query analyzer only, add PositionFilterFactory after the ShingleFilterFactory. This "flattens" the positions in the query so that the query parser treats the output tokens as synonyms, which yields a BooleanQuery with the following subqueries (all SHOULD clauses, so it is basically an OR query):

BooleanQuery:

  • the_quick OR
  • quick_brown OR
  • brown_fox OR
  • ...

This should be the most performant approach, because the result is really just a BooleanQuery of TermQuery clauses.
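
To make the OR-of-terms behaviour concrete, here is a minimal Python simulation of the whole approach. It is not Solr or Lucene code: `shingle`, `build_index` and `or_query` are hypothetical helpers, the index is a plain dictionary of posting sets, and scoring is ignored. It assumes underscore-joined bigrams with no unigrams, as in the examples above.

```python
from collections import defaultdict


def shingle(text, size=2, sep="_"):
    """Word bigrams of the lowercased, whitespace-split text (hypothetical helper)."""
    tokens = text.lower().split()
    return [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def build_index(docs):
    """Map each shingle to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in shingle(text):
            index[term].add(doc_id)
    return index


def or_query(index, query):
    """Union the postings of every query shingle, like a query of SHOULD clauses."""
    hits = set()
    for term in shingle(query):
        hits |= index.get(term, set())
    return hits


docs = {1: "quick brown", 2: "fox over", 3: "lazy dog"}
index = build_index(docs)
matches = or_query(index, "the quick brown fox jumps over the lazy dog")
print(sorted(matches))  # [1, 3]
```

Documents 1 and 3 match because their bigrams occur contiguously in the query. Document 2 ("fox over") does not, since the query only produces fox_jumps and jumps_over, never fox_over.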