How do I modify the standard analyzer to include #?

Updated: 2023-02-26 12:12:59

1) The simplest way is to use the whitespace tokenizer with the lowercase filter.

curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase&pretty' -d 'new year #celebration vegas'

This will give you

{
  "tokens" : [ {
    "token" : "new",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "year",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "#celebration",
    "start_offset" : 9,
    "end_offset" : 21,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "vegas",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "word",
    "position" : 4
  } ]
}
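
To make the behaviour above concrete, here is a minimal Python sketch that imitates what the whitespace tokenizer plus lowercase filter do to the sample text. It is only an approximation for illustration, not Elasticsearch code; the function name analyze_whitespace_lowercase is hypothetical, and positions start at 1 only to mirror the response shown above.

```python
import re


def analyze_whitespace_lowercase(text):
    """Imitate the whitespace tokenizer followed by the lowercase filter."""
    tokens = []
    # Every run of non-whitespace characters becomes one token, so "#" survives.
    for position, match in enumerate(re.finditer(r"\S+", text), start=1):
        tokens.append({
            "token": match.group().lower(),   # lowercase filter
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,             # 1-based, as in the response above
        })
    return tokens


tokens = analyze_whitespace_lowercase("new year #celebration vegas")
for token in tokens:
    print(token)
```

Because the text is split only on whitespace, other punctuation such as a trailing comma or period would also stay attached to its token, which is the trade-off of this approach.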

2) If you only want to preserve some special characters, you can map them with a char filter, so that your text is transformed into something else before tokenization takes place. This is closer to the standard analyzer. For example, you can create your index like this

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "special_analyzer": {
          "char_filter": [
            "special_mapping"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "char_filter": {
        "special_mapping": {
          "type": "mapping",
          "mappings": [
            "#=>hashtag\\u0020"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "tweet": {
          "type": "string",
          "analyzer": "special_analyzer"
        }
      }
    }
  }
}
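
The request above uses the older mapping syntax (a my_type mapping type and the string field type). As a hedged sketch, assuming Elasticsearch 7 or later, where mapping types were removed and string was replaced by text, the same body could be built in Python as below. The analysis settings are copied unchanged; the variable name index_body is hypothetical, and the block only prints the JSON rather than sending it anywhere.

```python
import json

# Assumed target: Elasticsearch 7+ (typeless mappings, "text" instead of "string").
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "special_analyzer": {
                    "char_filter": ["special_mapping"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            },
            "char_filter": {
                "special_mapping": {
                    "type": "mapping",
                    # Backslash + u0020 is kept literal here; json.dumps writes it
                    # as \\u0020, the same text as in the request above.
                    "mappings": ["#=>hashtag\\u0020"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "tweet": {"type": "text", "analyzer": "special_analyzer"}
        }
    },
}

print(json.dumps(index_body, indent=2))
```

The printed JSON could then be used as the body of PUT my_index on such a version.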

Now run the custom analyzer on the same text:

curl -XPOST 'localhost:9200/my_index/_analyze?analyzer=special_analyzer&pretty' -d 'new year #celebration vegas'

The custom analyzer will generate the following tokens

{
  "tokens" : [ {
    "token" : "new",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "year",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "hashtag",
    "start_offset" : 9,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "celebration",
    "start_offset" : 10,
    "end_offset" : 21,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "vegas",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "<ALPHANUM>",
    "position" : 5
  } ]
}

So you can search like this

GET my_index/_search
{
  "query": {
    "match": {
      "tweet": "#celebration"
    }
  }
}

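You will also be able to search for just celebration, because the mapping ends with the Unicode escape for a space (\\u0020): #celebration is rewritten into two separate tokens, hashtag and celebration. Without the space it would become the single token hashtagcelebration, and we would always have to search with #.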
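
The effect of the trailing space in the mapping can be illustrated with a rough Python sketch. It is a crude stand-in for special_analyzer, not the real thing: the standard tokenizer is approximated by a \w+ regular expression, the asciifolding filter is left out, and a match query with the default OR operator is approximated as "at least one shared token". The helper names analyze and matches are hypothetical.

```python
import re


def analyze(text, replacement):
    """Rough stand-in for special_analyzer: map '#', tokenize, lowercase."""
    mapped = text.replace("#", replacement)                   # mapping char filter
    return [t.lower() for t in re.findall(r"\w+", mapped)]   # tokenizer + lowercase


def matches(query, document, replacement):
    """Approximate a match query (default OR): any shared token counts as a hit."""
    return bool(set(analyze(query, replacement)) & set(analyze(document, replacement)))


doc = "new year #celebration vegas"

with_space = analyze(doc, "hashtag ")    # mapping "#=>hashtag\\u0020"
without_space = analyze(doc, "hashtag")  # mapping "#=>hashtag"

print(with_space)
print(without_space)
print(matches("celebration", doc, "hashtag "))   # plain word, mapping with space
print(matches("celebration", doc, "hashtag"))    # plain word, mapping without space
print(matches("#celebration", doc, "hashtag"))   # hashtag form, mapping without space
```

With the space, the plain word and the hashtag form both hit the document; without it, only the query that includes # does.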

Hope this helps!
