且构网

Profanity filter using regular expressions (list of 100 words)

Updated: 2023-02-26 15:58:29

This is quite a difficult problem to solve. You need to determine whether regular expressions will work for you and how you will handle embedding (when a dictionary word is attached to a profanity, like frackface except with the real F-word).

Regular expressions generally have a limit to how long they can be, and this usually prevents you from using a single regex for all your words. Executing multiple regular expressions against a string can be really slow, depending on what performance you need and how big your blacklist gets. We initially implemented CleanSpeak as a regular expression system, but it didn't scale, so we rewrote it using a different mechanism.
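
To illustrate why a non-regex approach can scale better, here is a minimal sketch of one common alternative: a trie (prefix tree) scanned once per starting position. This is only an assumption about what such a mechanism could look like; the answer does not say what CleanSpeak actually uses, and the names `build_trie` and `find_matches` are made up for this example.

```python
# Hypothetical sketch: blacklist lookup with a trie instead of many regexes.
# _END is a sentinel key, so it can never collide with a character in the text.
_END = object()


def build_trie(words):
    """Build a nested-dict trie from an iterable of blacklisted words."""
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[_END] = word.lower()  # mark the end of a complete word
    return root


def find_matches(text, trie):
    """Return (start, end, word) for every blacklisted word found in text."""
    text = text.lower()
    matches = []
    for start in range(len(text)):
        node = trie
        for pos in range(start, len(text)):
            node = node.get(text[pos])
            if node is None:
                break  # no blacklisted word continues with this character
            if _END in node:
                matches.append((start, pos + 1, node[_END]))
    return matches


trie = build_trie(["hello", "hell"])
print(find_matches("Say HELLO now", trie))
```

The work per starting position is bounded by the longest blacklisted word, so the cost of a scan does not grow with the number of words the way running one regex per word does.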

You also need to consider phrases, punctuation, spaces, leet-speak and other languages. All of these make regular expressions less appealing as a solution. Here are some examples using the word hello (assume it is profanity for this exercise):

  • h e l l o
  • h.e.l.o
  • h_e_l_l_o
  • |-|ello
  • h3llo
  • "hello there" (this phrase might not contain any profane words, but it is profane when combined)
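
One way to cope with the single-word variants listed above is to normalize the text before looking anything up. The sketch below is an assumption, not the author's method: the substitution tables are tiny and hypothetical, and the function names are invented for illustration.

```python
from itertools import groupby

# Hypothetical leet-speak tables; a real filter would need far larger ones.
LEET_SEQUENCES = {"|-|": "h"}  # multi-character substitutions, applied first
LEET_CHARS = str.maketrans({"3": "e", "0": "o", "1": "l", "@": "a", "$": "s"})


def normalize(text):
    """Lower-case, undo leet substitutions, and drop separators."""
    text = text.lower()
    for sequence, letter in LEET_SEQUENCES.items():
        text = text.replace(sequence, letter)
    text = text.translate(LEET_CHARS)
    # Keep only letters and digits, so spaces, dots and underscores vanish.
    return "".join(ch for ch in text if ch.isalnum())


def collapse_repeats(text):
    """Collapse runs of the same character: 'hello' becomes 'helo'."""
    return "".join(ch for ch, _ in groupby(text))


for variant in ["h e l l o", "h.e.l.o", "h_e_l_l_o", "|-|ello", "h3llo"]:
    key = collapse_repeats(normalize(variant))
    print(variant, "->", key, key == collapse_repeats("hello"))
```

Comparing collapsed forms is what lets "h.e.l.o" (one "l" missing) match "hello", at the cost of more false positives. Phrases such as "hello there" are not covered by this kind of normalization and would need a separate phrase list.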

You also need to handle edge cases where two or more dictionary (whitelist) words contain a profanity when next to each other. Some examples that contain the s-word:

  • push it out
  • hush it's quiet time

These are obviously not profanity, but most homegrown and many commercial solutions have problems with these cases.
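
The conflict between ignoring spaces and avoiding these false positives can be sketched as follows. This is a simplified heuristic under stated assumptions, not a description of how any real product works: it reuses "hello" as the stand-in profanity, the whitelist is a hypothetical set of dictionary words, and `find_profanity` is an invented name.

```python
def find_profanity(text, blacklist, whitelist):
    """Flag blacklisted words found after removing spaces, unless every
    original token the match touches is a whitelisted dictionary word."""
    tokens = text.lower().split()
    joined = "".join(tokens)
    # owner[k] is the index of the token that character k of joined came from.
    owner = [i for i, token in enumerate(tokens) for _ in token]
    hits = []
    for bad in blacklist:
        start = joined.find(bad)
        while start != -1:
            touched = {owner[k] for k in range(start, start + len(bad))}
            if not all(tokens[i] in whitelist for i in touched):
                hits.append(bad)
            start = joined.find(bad, start + 1)
    return hits


WHITELIST = {"shell", "out", "well", "there"}  # hypothetical dictionary words

print(find_profanity("shell out", ["hello"], WHITELIST))         # spans two clean words
print(find_profanity("h e l l o", ["hello"], WHITELIST))         # spaced-out evasion
print(find_profanity("well hello there", ["hello"], WHITELIST))  # plain use
```

The trade-off is visible in the heuristic itself: someone who deliberately builds a profanity out of adjacent dictionary words slips through, which is one reason this problem is harder than it first looks.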

We have spent the last 3 years perfecting the filter used by CleanSpeak to ensure it handles all of these cases, and we continue to tweak it and make it better. We also spent 8 months tuning our system for performance, and it can handle about 5,000 messages per second. That's not to say you can't build something usable, but be prepared to handle a lot of issues that might come up, and to create a system that doesn't use regular expressions.