Extracting Arabic phrases from a given text in Java

Updated: 2022-11-15 07:42:26

[...] is a character class, and a character class matches exactly one character out of the set it specifies. For instance, a character class like [abc] can match only a OR b OR c. So if you want to find only the word abc, don't surround it with [...].
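To make the difference concrete, here is a minimal sketch. The class name CharClassDemo and the helper found are made up for this illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Hypothetical helper: true if the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // [abc] matches a single character: 'a', 'b' or 'c'.
        System.out.println(found("[abc]", "xbx"));  // true
        // abc without brackets needs the three characters in sequence.
        System.out.println(found("abc", "xbx"));    // false
        System.out.println(found("abc", "xabcx"));  // true
    }
}
```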

Another problem is that you are using \\s as a word separator, so in the following string

String data = "foo foo foo foo";

the regex \\sfoo\\s will not be able to match the first foo, because there is no whitespace before it.
So the first match it finds will be

String data = "foo foo foo foo";
//      this one--^^^^^

Now, since the regex consumed the space after the second foo, it can't reuse that space in the next match, so the third foo will also be skipped: there is no space left to match before it.
The fourth foo will not match either, because this time there is no space after it.
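The walk-through above can be checked with a short sketch. The class SeparatorDemo and the helper matchStarts are names invented for this example; they are not part of the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparatorDemo {
    // Hypothetical helper: start index of every match of regex in input.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String data = "foo foo foo foo";
        // Only " foo " around the second foo matches; it starts at index 3.
        System.out.println(matchStarts("\\sfoo\\s", data)); // [3]
    }
}
```

The single match covers indexes 3 to 7, so the space at index 7 is already consumed when the matcher looks for the next occurrence.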

To solve this problem you can use \\b, the word boundary, which checks whether the position it represents lies between an alphanumeric and a non-alphanumeric character (or at the start or end of the string). Because \\b matches a position rather than a character, it consumes nothing, so neighbouring matches can share the same space.

So instead of

Pattern p = Pattern.compile("[\\s" + qp + "\\s]");

use

Pattern p = Pattern.compile("\\b" + qp + "\\b");

or, even better, as Tim mentioned

Pattern p = Pattern.compile("\\b" + qp + "\\b",Pattern.UNICODE_CHARACTER_CLASS);

to make sure that \\b will include Arabic characters in the predefined alphanumeric class.
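As a rough sketch of the \\b version with Pattern.UNICODE_CHARACTER_CLASS: the class WordBoundaryDemo, the helper wholeWordStarts and the sample words are assumptions made for illustration, and the Arabic text is written with Unicode escapes only so that the source file stays ASCII.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Hypothetical helper: start indexes of whole-word occurrences of word.
    // Assumes word contains no regex metacharacters.
    static List<Integer> wholeWordStarts(String word, String text) {
        Pattern p = Pattern.compile("\\b" + word + "\\b",
                                    Pattern.UNICODE_CHARACTER_CLASS);
        List<Integer> starts = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // All four words match now, since \b consumes no characters.
        System.out.println(wholeWordStarts("foo", "foo foo foo foo")); // [0, 4, 8, 12]

        String walad = "\u0648\u0644\u062F";      // Arabic "walad" (boy)
        String alWalad = "\u0627\u0644" + walad;  // Arabic "al-walad" (the boy)
        String text = alWalad + " " + walad;
        // Only the standalone word matches, not the tail of the longer word.
        System.out.println(wholeWordStarts(walad, text)); // [6]
    }
}
```

Passing the flag explicitly is the safer choice here, since how \\b treats non-ASCII letters without it has differed between Java releases.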

Update:

I am not sure whether your words can contain regex metacharacters like { [ + * and so on, so just in case you can also add an escaping mechanism that turns such characters into literals.
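One such escaping mechanism in the JDK is Pattern.quote. The sketch below shows what it changes; the class QuoteDemo, the helper countMatches and the sample word a.b are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteDemo {
    // Hypothetical helper: number of matches of regex in text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS)
                           .matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String qp = "a.b";        // the dot is a regex metacharacter
        String text = "axb a.b";
        // Unquoted, the dot matches any character, so "axb" matches too.
        System.out.println(countMatches("\\b" + qp + "\\b", text));                // 2
        // Quoted, the dot is a literal and only "a.b" matches.
        System.out.println(countMatches("\\b" + Pattern.quote(qp) + "\\b", text)); // 1
    }
}
```

Note that \\b still needs an alphanumeric character on one side, so a word that begins or ends with punctuation may need a different delimiter even after quoting.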

So

"\\b" + qp + "\\b"

can become

"\\b" + Pattern.quote(qp) + "\\b"