Django Haystack autocomplete returns overly broad results

It's hard to tell for sure since I haven't seen your full mapping, but I suspect the problem is that the analyzer (one of them) is being used for both indexing and searching. So when you index a document, lots of ngram terms get created and indexed. If you search and your search text is also analyzed the same way, lots of search terms get generated. Since your smallest ngram is a single letter, pretty much any query is going to match a lot of documents.

We wrote a blog post about using ngrams for autocomplete that you might find helpful, here: http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams. But I'll give you a simpler example to illustrate what I mean. I'm not super familiar with Haystack, so I probably can't help you there, but I can explain the issue with ngrams in Elasticsearch.

First I'll set up an index that uses an ngram analyzer for both indexing and searching:

PUT /test_index
{
   "settings": {
       "number_of_shards": 1,
      "analysis": {
         "filter": {
            "nGram_filter": {
               "type": "nGram",
               "min_gram": 1,
               "max_gram": 15,
               "token_chars": [
                  "letter",
                  "digit",
                  "punctuation",
                  "symbol"
               ]
            }
         },
         "analyzer": {
            "nGram_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "nGram_filter"
               ]
            }
         }
      }
   },
   "mappings": {
        "doc": {
            "properties": {
                "title": {
                    "type": "string", 
                    "analyzer": "nGram_analyzer"
                }
            }
        }
   }
}
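
As a sanity check, you can see exactly which terms this analyzer produces with the _analyze API (this is the Elasticsearch 1.x query-string form; the request syntax differs in newer versions):

GET /test_index/_analyze?analyzer=nGram_analyzer&text=monopoly

For "monopoly" this returns every ngram from the single letters up to the full word, and all of those terms end up in the index.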

and add some documents:

PUT /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"title":"monopoly"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"title":"oligopoly"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"title":"plutocracy"}
{"index":{"_index":"test_index","_type":"doc","_id":4}}
{"title":"theocracy"}
{"index":{"_index":"test_index","_type":"doc","_id":5}}
{"title":"democracy"}

and run a simple match search for "poly":

POST /test_index/_search
{
    "query": {
        "match": {
           "title": "poly"
        }
    }
}

it returns all five documents:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 5,
      "max_score": 4.729521,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_score": 4.729521,
            "_source": {
               "title": "oligopoly"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 4.3608603,
            "_source": {
               "title": "monopoly"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "3",
            "_score": 1.0197333,
            "_source": {
               "title": "plutocracy"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "4",
            "_score": 0.31496215,
            "_source": {
               "title": "theocracy"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "5",
            "_score": 0.31496215,
            "_source": {
               "title": "democracy"
            }
         }
      ]
   }
}

This is because the search term "poly" also gets analyzed into ngrams, including the single-letter terms "p", "o", "l", and "y"; since the "title" field of every document was likewise tokenized into single-letter terms, every document matches.
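
You can confirm this by running the search text through the same analyzer (same 1.x _analyze syntax as above):

GET /test_index/_analyze?analyzer=nGram_analyzer&text=poly

The token list includes "p", "o", "l", "y", "po", "ol", "ly", "pol", "oly", and "poly", and the single-letter terms alone are enough to match every document above.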

If we rebuild the index with this mapping instead (same analyzer and docs):

"mappings": {
  "doc": {
     "properties": {
        "title": {
           "type": "string",
           "index_analyzer": "nGram_analyzer",
           "search_analyzer": "standard"
        }
     }
  }
}
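
With this mapping the standard analyzer runs at search time, so the query text stays whole; you can verify that with the same _analyze API:

GET /test_index/_analyze?analyzer=standard&text=poly

This returns the single token "poly", which only matches documents whose indexed ngrams include "poly" itself.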

the query will return what we expect:

POST /test_index/_search
{
    "query": {
        "match": {
           "title": "poly"
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 1.5108256,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 1.5108256,
            "_source": {
               "title": "monopoly"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_score": 1.5108256,
            "_source": {
               "title": "oligopoly"
            }
         }
      ]
   }
}

Edge ngrams work similarly, except that only terms that start at the beginning of the words will be used.
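
To illustrate (this is my own variation, not part of the example above), the only change needed would be swapping the token filter type to edge_ngram; everything else in the index settings can stay the same:

"filter": {
   "edge_nGram_filter": {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 15,
      "token_chars": [
         "letter",
         "digit",
         "punctuation",
         "symbol"
      ]
   }
}

With that filter, "monopoly" is tokenized into only "m", "mo", "mon", and so on up to "monopoly", so combined with the standard search analyzer above, a search for "poly" would no longer match it, but a search for "mono" would.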

Here is the code I used for this example:

http://sense.qbox.io/gist/b24cbc531b483650c085a42963a49d6a23fa5579