且构网


Split a string in Java into substrings of equal length while keeping word boundaries

Updated: 2022-11-14 15:55:37

If I understand your problem correctly, this code should do what you need (it assumes that maxLenght is equal to or greater than the length of the longest word):

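// Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
String data = "Hello there, my name is not importnant right now."
        + " I am just simple sentecne used to test few things.";
int maxLenght = 10; // maximum chunk length; must be >= the longest word
// \G continues from the previous match, \s* skips whitespace, group 1 holds the chunk
Pattern p = Pattern.compile("\\G\\s*(.{1," + maxLenght + "})(?=\\s|$)", Pattern.DOTALL);
Matcher m = p.matcher(data);
while (m.find()) {
    System.out.println(m.group(1)); // print the chunk without the leading whitespace
}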

Output:

Hello
there, my
name is
not
importnant
right now.
I am just
simple
sentecne
used to
test few
things.
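
For reuse, the loop above can be wrapped in a method that returns the chunks instead of printing them. This is only a minimal sketch: the class name WordWrap and the method name wrap are hypothetical names chosen for illustration, and it keeps the same assumption that no word is longer than the limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordWrap {

    // Hypothetical helper (not part of the original answer): splits text into
    // chunks of at most maxLength characters without cutting words.
    // Assumes that no single word is longer than maxLength.
    static List<String> wrap(String text, int maxLength) {
        Pattern p = Pattern.compile(
                "\\G\\s*(.{1," + maxLength + "})(?=\\s|$)", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> chunks = new ArrayList<>();
        while (m.find()) {
            chunks.add(m.group(1)); // group 1 excludes the leading whitespace
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Prints "Hello", "there, my" and "name is" on separate lines.
        for (String chunk : wrap("Hello there, my name is", 10)) {
            System.out.println(chunk);
        }
    }
}
```

Returning a List keeps the splitting logic separate from the printing, so the caller can decide what to do with each chunk.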

Short (or not) explanation of the "\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)" regex:

(Remember that in Java \ is special not only in regex but also in String literals, so to use a predefined character class like \d we need to write it as "\\d", because that \ also has to be escaped in the string literal.)
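
The double escaping can be checked with a few lines of code. This is a small sketch; the class name EscapeDemo is made up for illustration.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    public static void main(String[] args) {
        // The string literal "\\d" holds two characters: a backslash and 'd'.
        String regex = "\\d";
        System.out.println(regex.length());               // 2
        System.out.println(regex);                        // \d
        // The regex engine receives \d, which matches a single digit.
        System.out.println(Pattern.matches(regex, "7"));  // true
        System.out.println(Pattern.matches(regex, "x"));  // false
    }
}
```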


    • \G - an anchor representing the end of the previous match or, if there is no match yet (when we have just started searching), the beginning of the string (same as ^)
    • \s* - zero or more whitespace characters (\s represents whitespace, * is the "zero-or-more" quantifier)
    • (.{1,"+maxLenght+"}) - let's split it into parts (at runtime maxLenght will hold some numeric value such as 10, so the regex engine will see it as .{1,10})
      • . represents any character (by default it matches any character except line separators such as \n or \r, but thanks to the Pattern.DOTALL flag it can now match any character; you can drop this flag if you want text on separate lines to be split separately, since each line will start on a new line of output anyway)
      • {1,10} - a quantifier that lets the previously described element appear 1 to 10 times (it is greedy by default, so it tries to match as many repetitions as possible)
      • .{1,10} - so, based on the above, it simply represents "1 to 10 of any characters"
      • ( ) - parentheses create groups, structures that let us hold specific parts of the match (here the parentheses come after \s* because we want to use only the part after the whitespace)

    • (?=\\s|$) - a look-ahead that makes sure the text matched by .{1,10} is followed by:


      • whitespace (\\s)
      • OR (written as |)
      • the end of the string ($).

      So thanks to .{1,10} we can match up to 10 characters, but with (?=\\s|$) after it we require that the last character matched by .{1,10} is not part of an unfinished word (there must be whitespace or the end of the string after it).
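
To see what the look-ahead contributes, the same pattern can be run with and without it. The sketch below is illustrative only: the class name LookaheadDemo, the helper chunks and the sample strings are assumptions, not part of the original answer. The last call also shows what seems to happen when the stated assumption is broken: if a word is longer than the limit, \G can no longer match and find() stops, so the rest of the text is silently dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {

    // Hypothetical helper: collects group 1 of every successive match.
    static List<String> chunks(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hello there my name";

        // Without the look-ahead the greedy .{1,10} simply takes up to
        // 10 characters and cuts "there" in the middle.
        System.out.println(chunks("\\G\\s*(.{1,10})", text));
        // prints [Hello ther, e my name]

        // With the look-ahead the engine backtracks until the chunk is
        // followed by whitespace or the end of the string.
        System.out.println(chunks("\\G\\s*(.{1,10})(?=\\s|$)", text));
        // prints [Hello, there my, name]

        // A word longer than the limit stops the matching early.
        System.out.println(chunks("\\G\\s*(.{1,10})(?=\\s|$)", "short extraordinarily long"));
        // prints [short]
    }
}
```

If longer words can occur in the input, the caller would need to handle them separately, for example by checking whether the last match ended before the end of the text.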