且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

找出字符串中最长的重复序列

更新时间:2022-11-11 10:25:55

这个问题是的变体最长重复子串问题,并且有一个 O(n) 时间算法来解决它,它使用 后缀树.这个想法(如***所建议的)是构建一个后缀树(时间 O(n)),用后代的数量(时间 O(n) 使用 DFS)注释树中的所有节点,然后找到树中至少有三个后代的最深节点(使用 DFS 的时间为 O(n)).这个整体算法需要时间 O(n).

This problem is a variant of the longest repeated substring problem and there is an O(n)-time algorithm for solving it that uses suffix trees. The idea (as suggested by Wikipedia) is to construct a suffix tree (time O(n)), annotate all the nodes in the tree with the number of descendants (time O(n) using a DFS), and then to find the deepest node in the tree with at least three descendants (time O(n) using a DFS). This overall algorithm takes time O(n).

也就是说,众所周知,后缀树很难构建,因此在尝试此实现之前,您可能希望找到一个为您实现后缀树的 Python 库.一个快速的谷歌搜索出现了这个库,虽然我不确定这是否是一个很好的实现.

That said, suffix trees are notoriously hard to construct, so you would probably want to find a Python library that implements suffix trees for you before attempting this implementation. A quick Google search turns up this library, though I'm not sure whether this is a good implementation.

另一种选择是将后缀数组LCP 数组.您可以迭代 LCP 数组中的相邻元素对,取每对中的最小值,并存储您以这种方式找到的最大数字.这将对应于重复至少 3 次的最长字符串的长度,然后您可以从那里读出字符串本身.

Another option would be to use suffix arrays in conjunction with LCP arrays. You can iterate over pairs of adjacent elements in the LCP array, taking the minimum of each pair, and store the largest number you find this way. That will correspond to the length of the longest string that repeats at least three times, and from there you can then read off the string itself.

有几种简单的算法可用于构建后缀数组(Manber-Myers 算法在 O(n log n) 时间内运行并且不太难编码),而 Kasai 的算法在 O(n) 时间内构建 LCP 数组并且编码起来相当简单.

There are several simple algorithms for building suffix arrays (the Manber-Myers algorithm runs in time O(n log n) and isn't too hard to code up), and Kasai's algorithm builds LCP arrays in time O(n) and is fairly straightforward to code up.

希望这有帮助!