Elasticsearch query_string not searching by word part

Updated: 2023-02-05 13:07:20



I'm sending this request:

curl -XGET 'host/process_test_3/14/_search' -d '{
  "query" : {
    "query_string" : {
      "query" : "\"*cor interface*\"",
      "fields" : ["title", "obj_id"]
    }
  }
}'

And I'm getting the correct result:

{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 5.421598,
    "hits": [
      {
        "_index": "process_test_3",
        "_type": "14",
        "_id": "141_dashboard_14",
        "_score": 5.421598,
        "_source": {
          "obj_type": "dashboard",
          "obj_id": "141",
          "title": "Cor Interface Monitoring"
        }
      }
    ]
  }
}

But when I want to search by a word part, for example:

curl -XGET 'host/process_test_3/14/_search' -d '
{
  "query" : {
    "query_string" : {
      "query" : "\"*cor inter*\"",
      "fields" : ["title", "obj_id"]
    }
  }
}'

I'm getting no results back:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : []
  }
}

What am I doing wrong?

This is because your title field has probably been analyzed by the standard analyzer (default setting) and the title Cor Interface Monitoring has been tokenized as the three tokens cor, interface and monitoring.

In order to search any substring of words, you need to create a custom analyzer which leverages the ngram token filter in order to also index all substrings of each of your tokens.
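As a rough sketch of what that analysis chain does (plain Python, not the actual Elasticsearch/Lucene implementation), each token is lowercased and then expanded into all of its substrings between `min_gram` and `max_gram` characters long:

```python
# Hypothetical sketch of the analyzer, NOT real Elasticsearch/Lucene code:
# standard tokenizer -> lowercase token filter -> ngram token filter

def analyze(text, min_gram=2, max_gram=15):
    tokens = []
    for word in text.split():              # stand-in for the standard tokenizer
        word = word.lower()                # lowercase token filter
        for start in range(len(word)):     # ngram token filter: emit every
            for n in range(min_gram, max_gram + 1):  # substring of length 2..15
                if start + n <= len(word):
                    tokens.append(word[start:start + n])
    return tokens

print(analyze("Cor"))  # ['co', 'cor', 'or']
```

The point is that a short query term like `inter` ends up as a real indexed term, rather than only the full word `interface`.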

You can create your index like this:

curl -XPUT localhost:9200/process_test_3 -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "substring_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "substring"]
        }
      },
      "filter": {
        "substring": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  },
  "mappings": {
    "14": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "substring_analyzer"
        }
      }
    }
  }
}'

Then you can reindex your data. As a result, the title Cor Interface Monitoring will now be tokenized as:

  • co, cor, or
  • in, int, inte, inter, interf, etc.
  • mo, mon, moni, etc.

so that your second search query will now return the document you expect because the tokens cor and inter will now match.
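To convince yourself, here is a hedged sketch in plain Python (not real Elasticsearch code), under the assumption that the ngram filter indexes every 2- to 15-character substring of each lowercased token: both terms of the previously failing query now exist in the index.

```python
# Sketch under assumptions, not real Elasticsearch code: model the indexed
# terms for the title as the set of 2..15-character substrings of each token.
def substrings(word, min_gram=2, max_gram=15):
    word = word.lower()
    return {word[s:s + n] for s in range(len(word))
            for n in range(min_gram, max_gram + 1) if s + n <= len(word)}

indexed = set()
for token in "Cor Interface Monitoring".split():
    indexed |= substrings(token)

# With the standard analyzer only the full words were indexed, so the query
# term "inter" matched nothing. After ngram indexing both terms are present:
print("cor" in indexed, "inter" in indexed)  # True True
```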