且构网

Sharing the things programmers build...

Splitting text into paragraphs with a regular expression in Java

Updated: 2022-11-12 09:41:56

By default, ^ matches the start of the string, not the start of each line. If you want it to match at the start of every line, add the multiline flag (?m) to your regex.
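
To make the difference concrete, here is a minimal sketch that counts how often ^ matches with and without (?m). The class name MultilineAnchorDemo, the helper countMatches, and the sample text are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineAnchorDemo {
    // Hypothetical helper: counts the positions at which the regex matches.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";
        // Without (?m), ^ matches only at the very start of the input.
        System.out.println(countMatches("^", text));     // 1
        // With (?m), ^ also matches right after each line terminator.
        System.out.println(countMatches("(?m)^", text)); // 3
    }
}
```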

Also consider using a look-ahead. Because a look-ahead is a zero-width match, split in Java 8 and later will not produce an empty first element when the match is at the start of the string.
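
The following sketch contrasts a zero-width split with a split on a real delimiter, assuming it runs on Java 8 or later. The class name LeadingEmptyDemo, the helper names, and the sample inputs are illustrative only.

```java
public class LeadingEmptyDemo {
    // Same look-ahead regex as suggested in this answer.
    static final String PARAGRAPH_SPLIT_REGEX = "(?m)(?=^\\s{4})";

    // Zero-width split: the match at index 0 consumes no characters.
    static String[] splitParagraphs(String text) {
        return text.split(PARAGRAPH_SPLIT_REGEX);
    }

    // Positive-width split: the delimiter itself is consumed.
    static String[] splitOnComma(String text) {
        return text.split(",");
    }

    public static void main(String[] args) {
        // On Java 8+, the zero-width match at the start adds no empty element.
        String[] paragraphs = splitParagraphs("    first\n    second");
        System.out.println(paragraphs.length); // 2

        // A real delimiter at the start still yields a leading empty string.
        String[] fields = splitOnComma(",first,second");
        System.out.println(fields.length);       // 3
        System.out.println(fields[0].isEmpty()); // true
    }
}
```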

So try this regex:

private static final String PARAGRAPH_SPLIT_REGEX = "(?m)(?=^\\s{4})";

To get rid of unwanted separators such as spaces or line breaks at the start or end of each resulting string, you can simply call trim, as in:

public static void parseText(String text) {
    String[] paragraphs = text.split(PARAGRAPH_SPLIT_REGEX);
    for (String paragraph : paragraphs) {
        System.out.println("Paragraph: " + paragraph.trim());
    }
}

Example:

 String s = 
        "    Hello, World!\r\n" + 
        "    Hello, World!\r\n" + 
        "    Hello, World!";
 parseText(s);

Output:

Paragraph: Hello, World!
Paragraph: Hello, World!
Paragraph: Hello, World!


Pre-Java 8 version:

If you need to run this code on older versions of Java, you have to prevent a split at the start of the string (to avoid an empty first element). To do this, put (?!^) before the multiline flag. That way the ^ in front of (?m) still matches only the start of the string, not the start of a line. To be more explicit, you can use \A, which matches the start of the input regardless of the multiline flag.

So the pre-Java 8 regex can look like either of the following (use one of them):

private static final String PARAGRAPH_SPLIT_REGEX = "(?!^)(?m)(?=^\\s{4})";

private static final String PARAGRAPH_SPLIT_REGEX = "(?m)(?!\\A)(?=^\\s{4})";
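
As a quick sanity check, this sketch runs the Java 8 regex and both pre-Java 8 variants on the same input and compares the results. It assumes Java 8 or later, where all three should agree. The class name, constant names, and sample text are hypothetical.

```java
import java.util.Arrays;

public class PreJava8SplitDemo {
    // Regex from the Java 8 section of this answer.
    static final String JAVA8_REGEX = "(?m)(?=^\\s{4})";
    // Variant 1: (?!^) sits before (?m), so that ^ means start of string only.
    static final String NOT_AT_START_CARET = "(?!^)(?m)(?=^\\s{4})";
    // Variant 2: \A matches the start of input regardless of (?m).
    static final String NOT_AT_START_A = "(?m)(?!\\A)(?=^\\s{4})";

    static String[] split(String regex, String text) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        String text = "    first\n    second\n    third";
        String[] expected = split(JAVA8_REGEX, text);
        // Both variants refuse to split at index 0, so on Java 8+ they
        // produce the same array as the plain look-ahead regex.
        System.out.println(Arrays.equals(expected, split(NOT_AT_START_CARET, text))); // true
        System.out.println(Arrays.equals(expected, split(NOT_AT_START_A, text)));     // true
        for (String paragraph : expected) {
            System.out.println("Paragraph: " + paragraph.trim());
        }
    }
}
```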