Find and keep all duplicate lines in a text file (rather than the unique lines)

Updated: 2023-08-28 16:29:04

Here is a solution based on regular expressions and bookmarks. It works for a sorted file (i.e. each duplicated line is immediately followed by its duplicates):

Open the Mark dialog (Search -> Mark...)
Click Clear all marks on the right
Check Bookmark line
Check Wrap around
Find what: ((.*)\R(\2\R?)+)*\K.*
Check Regular expression and uncheck . matches newline
Click Mark All
Click Close
Search -> Bookmark -> Remove Bookmarked Lines

Explanation

The regular expression is made up of three parts:

((.*)\R(\2\R?)+)* : this is an optional run of duplicates, consisting of one or more blocks of duplicated lines

the outer ( ... )* matches zero or more such blocks of duplicated lines (if, in your example, the three 4s were followed by two 5s, we would need this notion of a sequence of duplicate blocks)
(.*)\R(\2\R?)+ : \2 references the content of (.*), so these are all duplicates of one line
the second \R is an optional (due to the ?) line break. This makes it possible to match a duplicate on the last line of the file even if that line does not end with a line break
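
To see the backreference part in isolation, here is a minimal Python sketch. Python's re module supports neither \R nor \K, so this is an adaptation and not the Notepad++ expression itself: \n stands in for \R (the sketch assumes Unix line endings), the capture group is renumbered to \1, and the (?:\n|\Z) check is an addition that forces a whole-line comparison. The names DUP_BLOCK and find_duplicate_blocks are hypothetical.

```python
import re

# Adaptation of (.*)\R(\2\R?)+ for Python's re module.
# Assumptions: "\n" line endings; (?:\n|\Z) ensures the repeated text
# is a complete line, or the unterminated last line of the input.
DUP_BLOCK = re.compile(r"^(.*)\n(?:\1(?:\n|\Z))+", re.MULTILINE)


def find_duplicate_blocks(text):
    """Return each run of identical adjacent lines as one string."""
    return [m.group(0) for m in DUP_BLOCK.finditer(text)]


sample = "1\n2\n2\n3\n4\n4\n4\n5"
blocks = find_duplicate_blocks(sample)
```

Each element of blocks is one block of duplicated lines; the unique lines 1, 3 and 5 are not part of any match.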

If there is a block of duplicated lines directly after the cursor position from which the search starts, this part will match it.

\K : this discards what has been matched so far (the duplicates) and "puts the cursor" before the first unique line

.* : this matches that unique line, so only the unique line is part of the reported match

Using Mark All we bookmark all such unique lines, so that we can remove them with the Remove Bookmarked Lines entry in the Search -> Bookmark menu.
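
Outside Notepad++, the same end result (drop the unique lines, keep every duplicated one) can be sketched in a few lines of Python. This is not the regex approach described above but a procedural equivalent under the same assumption that the input is sorted, so duplicates are adjacent. The function name keep_duplicate_lines is hypothetical.

```python
from itertools import groupby


def keep_duplicate_lines(lines):
    """Keep only lines that occur more than once in a row.

    Assumes the input is sorted, so all copies of a line are adjacent.
    """
    kept = []
    for _, group in groupby(lines):
        run = list(group)
        if len(run) > 1:  # a run of length 1 is a unique line
            kept.extend(run)
    return kept


result = keep_duplicate_lines(["1", "2", "2", "3", "4", "4", "4", "5"])
```

Here result holds the two 2s and the three 4s, which is what remains in the editor after Remove Bookmarked Lines.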
