How do I modify the standard analyzer to include #?

Updated: 2023-02-26 12:12:59

1) The simplest way is to use the whitespace tokenizer with a lowercase filter.

curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase&pretty' -d 'new year #celebration vegas'

This gives you:

{
  "tokens" : [ {
    "token" : "new",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "year",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "#celebration",
    "start_offset" : 9,
    "end_offset" : 21,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "vegas",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "word",
    "position" : 4
  } ]
}
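
To make the behavior concrete, here is a minimal Python sketch that imitates a whitespace tokenizer followed by a lowercase filter. It is only an illustration, not Elasticsearch code: the function name `whitespace_lowercase_analyze` is hypothetical, and positions start at 1 simply to mirror the output shown above.

```python
import re


def whitespace_lowercase_analyze(text):
    """Rough emulation of a whitespace tokenizer plus a lowercase filter."""
    tokens = []
    # \S+ matches each run of non-whitespace characters, so '#' stays
    # attached to the word that follows it.
    for position, match in enumerate(re.finditer(r"\S+", text), start=1):
        tokens.append({
            "token": match.group().lower(),  # lowercase filter
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


tokens = whitespace_lowercase_analyze("New Year #Celebration Vegas")
print([t["token"] for t in tokens])
```

Because the split happens only on whitespace, every other punctuation character (commas, periods, and so on) also stays attached to its token, which is the main trade-off of this approach.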

2) If you only want to preserve some special characters, you can map them with a char filter, so that your text is transformed into something else before tokenization takes place. This is closer to the standard analyzer. For example, you can create your index like this:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "special_analyzer": {
          "char_filter": [
            "special_mapping"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "char_filter": {
        "special_mapping": {
          "type": "mapping",
          "mappings": [
            "#=>hashtag\\u0020"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "tweet": {
          "type": "string",
          "analyzer": "special_analyzer"
        }
      }
    }
  }
}

Now, for

curl -XPOST 'localhost:9200/my_index/_analyze?analyzer=special_analyzer&pretty' -d 'new year #celebration vegas'

the custom analyzer will generate the following tokens:

{
  "tokens" : [ {
    "token" : "new",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "year",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "hashtag",
    "start_offset" : 9,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "celebration",
    "start_offset" : 10,
    "end_offset" : 21,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "vegas",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "<ALPHANUM>",
    "position" : 5
  } ]
}

So you can search like this:

GET my_index/_search
{
  "query": {
    "match": {
      "tweet": "#celebration"
    }
  }
}

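You will also be able to search for just celebration, because the mapping ends with the Unicode escape for a space (\\u0020), which turns #celebration into the two tokens hashtag and celebration. Without that space the text would become the single token hashtagcelebration, and we would always have to search with #.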
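
The effect of the trailing space can be sketched with a toy model in Python. This is an assumption-laden simplification, not the real analyzer: `analyze` and `matches` are hypothetical helpers, a `\w+` regex stands in for the standard tokenizer, the asciifolding filter is omitted, and a match query is modeled as "any query term appears in the document".

```python
import re


def analyze(text, hash_replacement):
    """Toy model of special_analyzer: mapping char filter, word split, lowercase."""
    # Char filter step: replace '#' before tokenization takes place.
    filtered = text.replace("#", hash_replacement)
    # Crude stand-in for the standard tokenizer: runs of word characters.
    return [tok.lower() for tok in re.findall(r"\w+", filtered)]


def matches(query, document, hash_replacement):
    """Model a match query: true if any analyzed query term is in the document."""
    doc_terms = set(analyze(document, hash_replacement))
    return any(term in doc_terms for term in analyze(query, hash_replacement))


doc = "new year #celebration vegas"
with_space = "hashtag "     # corresponds to "#=>hashtag\\u0020"
without_space = "hashtag"   # the same mapping without the trailing space

print(analyze(doc, with_space))
print(analyze(doc, without_space))
print(matches("celebration", doc, with_space))
print(matches("celebration", doc, without_space))
print(matches("#celebration", doc, without_space))
```

In this model, the plain query celebration finds the document only when the mapping includes the space; without it, only the query with # produces the fused term that was indexed.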

Hope this helps!