Using Java Regex, how do I check whether a string contains any of the words in a set?

Updated: 2023-11-25 21:55:40


TL;DR: For simple substring searches contains() is the best choice, but for matching whole words only, regular expressions are probably better.

The best way to see which method is more efficient is to test it.

You can use String.contains() instead of String.indexOf() to simplify your non-regexp code.
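
As a small illustration of that simplification, the sketch below puts the two calls side by side. The class name ContainsVsIndexOf, the helper method names, and the sample sentence are made up for this example.

```java
public class ContainsVsIndexOf {

    // Older style: indexOf() returns -1 when the substring is absent.
    public static boolean hasWordIndexOf(String sentence, String word) {
        return sentence.indexOf(word) != -1;
    }

    // Equivalent check with contains(), which states the intent directly.
    public static boolean hasWordContains(String sentence, String word) {
        return sentence.contains(word);
    }

    public static void main(String[] args) {
        String sentence = "An apple is nice";
        System.out.println(hasWordIndexOf(sentence, "apple"));  // true
        System.out.println(hasWordContains(sentence, "apple")); // true
        System.out.println(hasWordContains(sentence, "kiwi"));  // false
    }
}
```

Both helpers give the same answer for any input; contains() just saves the comparison against -1.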

To search for several different words, the regular expression looks like this:

apple|orange|pear|banana|kiwi

The | works as an OR in regular expressions.
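
Since the question starts from a set of words, the alternation does not have to be typed by hand. The following is a minimal sketch, not part of the original answer, that builds it from a Set. The class name AlternationFromSet and the method buildPattern are hypothetical, and quoting each word with Pattern.quote() is an extra precaution that assumes the words may contain regex metacharacters.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class AlternationFromSet {

    // Joins the words with | after quoting each one, so that any regex
    // metacharacters inside a word are matched literally.
    public static Pattern buildPattern(Set<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    public static void main(String[] args) {
        Set<String> words = new LinkedHashSet<>();
        words.add("apple");
        words.add("orange");
        words.add("pear");
        words.add("banana");
        words.add("kiwi");

        Pattern p = buildPattern(words);
        System.out.println(p.matcher("An apple is nice").find());    // true
        System.out.println(p.matcher("The quick brown fox").find()); // false
    }
}
```

Note that an empty set produces an empty pattern, which matches any input, so callers may want to check for that case first.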

My very simple test code looks like this:

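import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough timing comparison of String.contains() against a compiled regex.
public class TestContains {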

   private static String containsWord(Set<String> words,String sentence) {
     for (String word : words) {
       if (sentence.contains(word)) {
         return word;
       }
     }

     return null;
   }

   private static String matchesPattern(Pattern p,String sentence) {
     Matcher m = p.matcher(sentence);

     if (m.find()) {
       return m.group();
     }

     return null;
   }

   public static void main(String[] args) {
     Set<String> words = new HashSet<String>();
     words.add("apple");
     words.add("orange");
     words.add("pear");
     words.add("banana");
     words.add("kiwi");

     Pattern p = Pattern.compile("apple|orange|pear|banana|kiwi");

     String noMatch = "The quick brown fox jumps over the lazy dog.";
     String startMatch = "An apple is nice";
     String endMatch = "This is a longer sentence with the match for our fruit at the end: kiwi";

     long start = System.currentTimeMillis();
     int iterations = 10000000;

     for (int i = 0; i < iterations; i++) {
       containsWord(words, noMatch);
       containsWord(words, startMatch);
       containsWord(words, endMatch);
     }

     System.out.println("Contains took " + (System.currentTimeMillis() - start) + "ms");
     start = System.currentTimeMillis();

     for (int i = 0; i < iterations; i++) {
       matchesPattern(p,noMatch);
       matchesPattern(p,startMatch);
       matchesPattern(p,endMatch);
     }

     System.out.println("Regular Expression took " + (System.currentTimeMillis() - start) + "ms");
   }
}

The results I got were as follows:

Contains took 5962ms
Regular Expression took 63475ms

Obviously, timings will vary depending on the number of words being searched for and the strings being searched, but contains() does seem to be about 10 times faster than regular expressions for a simple search like this.

By using regular expressions to search for strings inside another string, you're using a sledgehammer to crack a nut, so we shouldn't be surprised that it's slower. Save regular expressions for when the patterns you want to find are more complex.

One case where you may want to use regular expressions is when indexOf() and contains() won't do the job because you only want to match whole words and not just substrings, e.g. you want to match pear but not spears. Regular expressions handle this case well because they have the concept of word boundaries.

In this case we'd change our pattern to:

\b(apple|orange|pear|banana|kiwi)\b

The \b matches only at the beginning or end of a word, and the parentheses group the OR alternatives together so that both boundaries apply to every word.

Note that when defining this pattern in your code, you need to escape each backslash with another backslash:

 Pattern p = Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b");
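
To make the difference concrete, here is a short sketch that runs both patterns from this answer against the pear and spears case. The class name WholeWordDemo, the helper method names, and the sample sentences are assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class WholeWordDemo {

    // Plain alternation: matches the words anywhere, even inside longer words.
    private static final Pattern SUBSTRING =
            Pattern.compile("apple|orange|pear|banana|kiwi");

    // Same alternation wrapped in word boundaries: matches whole words only.
    private static final Pattern WHOLE_WORD =
            Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b");

    public static boolean containsSubstring(String sentence) {
        return SUBSTRING.matcher(sentence).find();
    }

    public static boolean containsWholeWord(String sentence) {
        return WHOLE_WORD.matcher(sentence).find();
    }

    public static void main(String[] args) {
        System.out.println(containsSubstring("He threw two spears")); // true
        System.out.println(containsWholeWord("He threw two spears")); // false
        System.out.println(containsWholeWord("I ate a pear"));        // true
    }
}
```

The first call shows the false positive that contains() would also produce, since pear appears inside spears; the boundary version rejects it.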