Goal
I have downloaded a CSV file from Hotmail, but it has a lot of duplicates in it. These duplicates are complete copies, and I don't know why my phone created them.
I want to get rid of the duplicates.
Approach
Write a Python script to remove the duplicates.
Technical specification
- Windows XP SP 3
- Python 2.7
- CSV file with 400 contacts
UPDATE: 2016
If you are happy to use the helpful more_itertools external library:
from more_itertools import unique_everseen

with open('1.csv', 'r') as f, open('2.csv', 'w') as out_file:
    out_file.writelines(unique_everseen(f))
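As a sketch of a small variation (it assumes some of your duplicates differ only by stray surrounding whitespace, which is a guess about your data, not something stated in the question): unique_everseen also accepts a key function, so lines can be compared in a normalized form while the first occurrence is written out unchanged.

from more_itertools import unique_everseen

with open('1.csv', 'r') as f, open('2.csv', 'w') as out_file:
    # key=str.strip: lines that differ only in surrounding whitespace count
    # as duplicates; the first version seen is the one that gets written
    out_file.writelines(unique_everseen(f, key=str.strip))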
A more efficient version of @IcyFlame's solution:
with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen:
            continue  # skip duplicate
        seen.add(line)
        out_file.write(line)
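If the duplicates should instead be judged by a single column (say, the e-mail address) rather than the whole line, a variant with the csv module would look roughly like this. This is only a sketch: the column index 1 is a guess about how the exported contacts are laid out, and the binary file modes match the Python 2.7 mentioned in the question.

import csv

with open('1.csv', 'rb') as in_file, open('2.csv', 'wb') as out_file:
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    seen = set()
    for row in reader:
        # hypothetical: column 1 holds the e-mail address; fall back to the
        # whole row for short or empty rows
        key = row[1] if len(row) > 1 else tuple(row)
        if key in seen:
            continue  # skip rows whose key has already been written
        seen.add(key)
        writer.writerow(row)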
To edit the same file in-place, you could use this:
import fileinput

seen = set()  # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line in seen:
        continue  # skip duplicate
    seen.add(line)
    print line,  # standard output is now redirected to the file
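The snippet above relies on the Python 2 print statement, which fits the Python 2.7 in the question. Should you ever run this on Python 3, the same in-place idea would look roughly like the following sketch (fileinput redirects standard output into the file being rewritten):

import fileinput

seen = set()  # set for fast O(1) amortized lookup
for line in fileinput.input('1.csv', inplace=True):
    if line in seen:
        continue  # skip duplicate
    seen.add(line)
    print(line, end='')  # standard output is now redirected to the file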