
Ruby accent-insensitive regex to match school names with accents and other diacritics

The question has been asked for other programming languages, but how would you perform an accent-insensitive regex in Ruby?

My current code is something like

scope :by_registered_name, ->(regex){
  where(:name => /#{Regexp.escape(regex)}/i)
}

I thought maybe I could replace non-alphanumeric+whitespace characters with dots and remove the escape, but isn't there a better way? I'm afraid I could catch weird things if I do that...
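
Roughly what I have in mind (untested, and I suspect the wildcard dots are exactly where the weird matches would come from):

scope :by_registered_name, ->(str) {
  # Anything that isn't ASCII alphanumeric or whitespace becomes ".", so an
  # accented character in the query can match either the accented or the
  # plain form in the stored name.
  pattern = str.gsub(/[^A-Za-z0-9\s]/, '.')
  where(:name => /#{pattern}/i)
}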

I am targeting French right now, but if I could also fix it for other languages that would be cool.

I am using Ruby 2.3 if that can help.


I realize my requirements are actually a bit stronger; I also need to catch things like dashes, etc. I am basically importing a school database (URL here, the tag is <nom>), and I want people to be able to find their school by typing its name. Both the search query and the stored names may contain accents, so I believe the easiest way would be to make "both" insensitive.

  • "Télécom" should be matched by "Telecom"
  • "établissement" should be matched by "etablissement"
  • "Institut supérieur national de l'artisanat - Chambre de métiers et de l'Artisanat en Moselle" should be matched by "artisanat chambre de métiers
  • "Ecole hôtelière d'Avignon (CCI du Vaucluse)" Should be matched by Ecole hoteliere d'avignon" (for the parenthesis it's okay to skip it)
  • "Ecole française d'hôtesses" should be matched by "ecole francaise d'hot"

Also, some crazy stuff I found in that DB; I will consider sanitizing this input, I think (see the snippet after this list):

  • "Académie internationale de management - Hotel & Tourism Management Academy" Should be matched by "Hotel Tourism" (note the & is actually written &amp; in the XML)

It looks like the solution for MongoDB is to use a text index, which is diacritic insensitive. French is supported.

It's been a long time since I last used MongoDB, but if you're using Mongoid I think you would create a text index in your model like this:

index(name: "text")

...and then search like this:

scope :by_registered_name, ->(str) {
  where(:$text => { :$search => str })
}

Consult the documentation for the $text query operator for more information.
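
For reference, here's a rough, untested sketch of how the pieces might fit together in a Mongoid model (the model name and the default_language option are assumptions on my part; diacritic insensitivity needs a version 3 text index, i.e. MongoDB 3.4+):

class School
  include Mongoid::Document

  field :name, type: String

  # Single text index on :name. Version 3 text indexes are case- and
  # diacritic-insensitive; default_language is assumed here for
  # French-language names.
  index({ name: "text" }, { default_language: "french" })

  scope :by_registered_name, ->(str) {
    where(:$text => { :$search => str })
  }
end

# Build the index once (School.create_indexes or the Mongoid rake task), then:
School.by_registered_name("telecom")  # should also find "Télécom ..."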

Original (wrong) answer

As it turns out I was thinking about the question backwards when I initially wrote this answer. I'm preserving it since it might still come in handy. If you were using a database that didn't offer this kind of functionality (as MongoDB, it seems, does), a possible workaround would be to use the following technique to store a sanitized name alongside the original name in the database, and then sanitize queries the same way.
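
If you did go that route, a minimal sketch might look like the following (the sanitized_name field and the callback are placeholders of my own, not an established pattern):

class School
  include Mongoid::Document

  field :name, type: String
  field :sanitized_name, type: String  # hypothetical extra field

  # Keep an accent-stripped, lowercased copy of the name alongside the original.
  before_save do
    self.sanitized_name =
      ActiveSupport::Inflector.transliterate(name.to_s).downcase
  end

  # Sanitize the query the same way before matching against the sanitized copy.
  scope :by_registered_name, ->(str) {
    sanitized = ActiveSupport::Inflector.transliterate(str).downcase
    where(:sanitized_name => /#{Regexp.escape(sanitized)}/)
  }
end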

Since you're using Rails you can use the handy ActiveSupport::Inflector.transliterate:

regex = /aäoöuü/
transliterated = ActiveSupport::Inflector.transliterate(regex.source, '\?')
# => "aaoouu"
new_regex = Regexp.new(transliterated)
# => /aaoouu/

Or simply:

Regexp.new(ActiveSupport::Inflector.transliterate(regex.source, '\?'))

You'll note that I supplied '\?' as the second argument, which is the replacement string used for any characters that can't be transliterated. This is because the default replacement string is "?", which as you know has special meaning in a regular expression.
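
For example (assuming the default locale, which has no transliteration rule for Greek letters, so β falls back to the replacement string):

ActiveSupport::Inflector.transliterate("aβo")
# => "a?o"    as a regex, /a?o/ makes the "a" optional

ActiveSupport::Inflector.transliterate("aβo", '\?')
# => "a\\?o"  as a regex, /a\?o/ matches a literal "?"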

Also note that ActiveSupport::Inflector.transliterate does a little bit more than the similar I18n.transliterate. Here's its source:

def transliterate(string, replacement = "?")
  I18n.transliterate(ActiveSupport::Multibyte::Unicode.normalize(
    ActiveSupport::Multibyte::Unicode.tidy_bytes(string), :c),
      :replacement => replacement)
end

The innermost method call, ActiveSupport::Multibyte::Unicode.tidy_bytes, cleans up any invalid UTF-8 characters.
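
For example (untested, from memory; the stray byte is assumed to be CP1252 data mixed into UTF-8):

bad = "\xE9cole".force_encoding("UTF-8")
bad.valid_encoding?                                # => false
ActiveSupport::Multibyte::Unicode.tidy_bytes(bad)  # => "école"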

More importantly, ActiveSupport::Multibyte::Unicode.normalize "normalizes" the characters. For example, ê can look like one character but actually be two: LATIN SMALL LETTER E followed by COMBINING CIRCUMFLEX ACCENT. Calling I18n.transliterate on that two-character form would yield e?, which probably isn't what you want, so normalize is called first to turn it into ê, which is just one character: LATIN SMALL LETTER E WITH CIRCUMFLEX. That normalize step before transliterate is what makes the result come out as a plain e. (If you're interested in how that works, read about Unicode equivalence and normalization.)
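
You can see the difference from the console; String#unicode_normalize (Ruby 2.2+) performs the same kind of NFC composition as the :c normalization above:

decomposed = "e\u0302"                           # "ê" as two codepoints: e + combining circumflex
composed   = decomposed.unicode_normalize(:nfc)  # "ê" as one codepoint

decomposed.length                                   # => 2
composed.length                                     # => 1

I18n.transliterate(decomposed)                      # => "e?"  (no rule for the combining mark)
I18n.transliterate(composed)                        # => "e"
ActiveSupport::Inflector.transliterate(decomposed)  # => "e"   (normalizes first)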