Updated: 2023-02-26 12:12:59
1) The simplest way is to use the whitespace tokenizer with a lowercase filter.
curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase&pretty' -d 'new year #celebration vegas'
This will give you
{
"tokens" : [ {
"token" : "new",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
}, {
"token" : "year",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 2
}, {
"token" : "#celebration",
"start_offset" : 9,
"end_offset" : 21,
"type" : "word",
"position" : 3
}, {
"token" : "vegas",
"start_offset" : 22,
"end_offset" : 27,
"type" : "word",
"position" : 4
} ]
}
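To see why the hashtag survives here, the following is a rough Python sketch of what a whitespace tokenizer followed by a lowercase filter does. It is only an approximation written for illustration, not Elasticsearch's implementation, and the function name `whitespace_lowercase` is made up for this example.

```python
import re


def whitespace_lowercase(text):
    """Approximate a whitespace tokenizer followed by a lowercase filter."""
    tokens = []
    # Every run of non-whitespace characters becomes one token,
    # so punctuation such as '#' stays attached to the word.
    for position, match in enumerate(re.finditer(r"\S+", text), start=1):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


for token in whitespace_lowercase("new year #celebration vegas"):
    print(token)
```

Because the split happens only on whitespace, every other special character is kept too, which may or may not be what you want.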
2) If you only want to preserve some special characters, you can map them with a char filter, so that your text is transformed into something else before tokenization takes place. This is closer to the standard analyzer. For example, you can create your index like this:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "special_analyzer": {
          "char_filter": [
            "special_mapping"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "char_filter": {
        "special_mapping": {
          "type": "mapping",
          "mappings": [
            "#=>hashtag\\u0020"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "tweet": {
          "type": "string",
          "analyzer": "special_analyzer"
        }
      }
    }
  }
}
Now for curl -XPOST 'localhost:9200/my_index/_analyze?analyzer=special_analyzer&pretty' -d 'new year #celebration vegas'
the custom analyzer will generate the following tokens
{
"tokens" : [ {
"token" : "new",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "year",
"start_offset" : 4,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "hashtag",
"start_offset" : 9,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "celebration",
"start_offset" : 10,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 4
}, {
"token" : "vegas",
"start_offset" : 22,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 5
} ]
}
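The pipeline behind these tokens can be sketched in plain Python: the char filter rewrites the raw text first, and only then does tokenization happen. This is a simplified model and not Elasticsearch code. The regex `\w+` is an assumed stand-in for the standard tokenizer, the accent stripping is an assumed stand-in for asciifolding, and all names (`SPECIAL_MAPPINGS`, `special_analyze`, and so on) are hypothetical.

```python
import re
import unicodedata

# Hypothetical stand-in for the "special_mapping" char filter:
# '#' is replaced by "hashtag" followed by a space.
SPECIAL_MAPPINGS = {"#": "hashtag "}


def apply_char_filter(text, mappings=SPECIAL_MAPPINGS):
    """Rewrite the raw text before any tokenization takes place."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def fold_ascii(token):
    """Rough approximation of asciifolding: strip combining accents."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def special_analyze(text, mappings=SPECIAL_MAPPINGS):
    """Char filter, then word tokenization, then lowercase and folding."""
    filtered = apply_char_filter(text, mappings)
    words = re.findall(r"\w+", filtered)  # crude stand-in for the standard tokenizer
    return [fold_ascii(word.lower()) for word in words]


print(special_analyze("new year #celebration vegas"))
# ['new', 'year', 'hashtag', 'celebration', 'vegas']
```

The tokenizer never sees the '#' character, only the word that replaced it, which is why a standard-style tokenizer can be kept.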
So you can search like this
GET my_index/_search
{
  "query": {
    "match": {
      "tweet": "#celebration"
    }
  }
}
You will also be able to search for just celebration, because the mapping ends with the Unicode escape for a space (\\u0020), so hashtag and celebration become separate tokens. Otherwise we would always have to search with #.
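The effect of that trailing space can be checked with a small Python sketch that compares both variants of the mapping. It assumes a match query succeeds when at least one analyzed query token appears in the analyzed document, and it uses a simplified analyzer (character replacement, `\w+` tokenization, lowercasing) rather than Elasticsearch itself. The helper names `analyze` and `matches` are hypothetical.

```python
import re


def analyze(text, hash_replacement):
    """Simplified analyzer: replace '#', split into words, lowercase."""
    filtered = text.replace("#", hash_replacement)
    return [word.lower() for word in re.findall(r"\w+", filtered)]


def matches(query, document, hash_replacement):
    """Assumed match semantics: any query token found in the document."""
    doc_tokens = set(analyze(document, hash_replacement))
    return any(token in doc_tokens for token in analyze(query, hash_replacement))


tweet = "new year #celebration vegas"

# "hashtag " mimics the mapping with \\u0020, "hashtag" the mapping without it.
for replacement in ("hashtag ", "hashtag"):
    for query in ("#celebration", "celebration"):
        print(repr(replacement), query, matches(query, tweet, replacement))
```

Without the space, the document only contains the single token hashtagcelebration, so the plain query celebration finds nothing.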
Hope this helps!