且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

查找给定两个字符串的所有常见子字符串

更新时间:2022-11-14 10:28:04

使用适当的算法算法而不是蛮力方法会更好。***描述了最常见的子字符串问题的两种常见解决方案:后缀树动态编程

You would be better off with a proper algorithm for the task rather than a brute-force approach. Wikipedia describes two common solutions to the longest common substring problem: suffix-tree and dynamic-programming.

动态编程解决方案需要O( nm )时间和O( nm )空间。对于最长的公共子字符串,这几乎是对Wikipedia伪代码的直接Java翻译:

The dynamic programming solution takes O(n m) time and O(n m) space. This is pretty much a straightforward Java translation of the Wikipedia pseudocode for the longest common substring:

public static Set<String> longestCommonSubstrings(String s, String t) {
    int[][] table = new int[s.length()][t.length()];
    int longest = 0;
    Set<String> result = new HashSet<>();

    for (int i = 0; i < s.length(); i++) {
        for (int j = 0; j < t.length(); j++) {
            if (s.charAt(i) != t.charAt(j)) {
                continue;
            }

            table[i][j] = (i == 0 || j == 0) ? 1
                                             : 1 + table[i - 1][j - 1];
            if (table[i][j] > longest) {
                longest = table[i][j];
                result.clear();
            }
            if (table[i][j] == longest) {
                result.add(s.substring(i - longest + 1, i + 1));
            }
        }
    }
    return result;
}

现在,您需要所有常见的子串,而不仅仅是最长的子串。您可以增强此算法以包含更短的结果。让我们检查表格中的示例输入 eatsleepnightxyz eatsleepabcxyz

Now, you want all of the common substrings, not just the longest. You can enhance this algorithm to include shorter results. Let's examine the table for the example inputs eatsleepnightxyz and eatsleepabcxyz:

  e a t s l e e p a b c x y z
e 1 0 0 0 0 1 1 0 0 0 0 0 0 0
a 0 2 0 0 0 0 0 0 1 0 0 0 0 0
t 0 0 3 0 0 0 0 0 0 0 0 0 0 0
s 0 0 0 4 0 0 0 0 0 0 0 0 0 0
l 0 0 0 0 5 0 0 0 0 0 0 0 0 0
e 1 0 0 0 0 6 1 0 0 0 0 0 0 0
e 1 0 0 0 0 1 7 0 0 0 0 0 0 0
p 0 0 0 0 0 0 0 8 0 0 0 0 0 0
n 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i 0 0 0 0 0 0 0 0 0 0 0 0 0 0
g 0 0 0 0 0 0 0 0 0 0 0 0 0 0
h 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t 0 0 1 0 0 0 0 0 0 0 0 0 0 0
x 0 0 0 0 0 0 0 0 0 0 0 1 0 0
y 0 0 0 0 0 0 0 0 0 0 0 0 2 0
z 0 0 0 0 0 0 0 0 0 0 0 0 0 3




  • eatsleep 结果很明显:这是左上角的 12345678 对角条纹。

  • xyz 结果是右下角的 123 对角线。

  • a 结果由顶部附近的 1 表示(第二行,第九列)。

  • t 结果由左下角附近的 1 表示。

    • The eatsleep result is obvious: that's the 12345678 diagonal streak at the top-left.
    • The xyz result is the 123 diagonal at the bottom-right.
    • The a result is indicated by the 1 near the top (second row, ninth column).
    • The t result is indicated by the 1 near the bottom left.
    • 左边,顶部和 1 怎么样? $ c> 6 和 7 ?那些不计算,因为它们出现在由 12345678 对角线形成的矩形内 - 换句话说,它们已被 eatsleep $ c覆盖$ c>。

      What about the other 1s at the left, the top, and next to the 6 and 7? Those don't count because they appear within the rectangle formed by the 12345678 diagonal — in other words, they are already covered by eatsleep.

      我建议除了构建表之外什么都不做。然后,进行第二次传递,从右下角向后迭代,以收集结果集。

      I recommend doing one pass doing nothing but building the table. Then, make a second pass, iterating backwards from the bottom-right, to gather the result set.