Updated: 2023-12-04 22:16:46
The standard way to remove duplicates is to convert to a set.
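As a minimal illustration of that idea (the sample names below are made up for the example), converting a list to a set drops the repeated entries. Sets are unordered, so the sketch sorts the result before printing to get a stable order:

```python
# Hypothetical list of sample names containing repeats
samples = ["Sample1", "Sample2", "Sample2", "Sample7", "Sample1"]

# Converting to a set keeps one copy of each distinct value
unique_samples = set(samples)

# Sets have no defined order, so sort for a predictable display
print(sorted(unique_samples))  # ['Sample1', 'Sample2', 'Sample7']
```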
However, I think there are some problems with the way you're reading the file. First, it isn't a CSV file (you have a colon between the first two fields). Second, what are
gene = lines[0]
sample = lines[11].split(",")
repeat = lines[8]
supposed to do?
If I were writing this, I would replace the ":" with another ",". With that modification, and using a dictionary of sets, your code would look something like:
# Read in csv file and convert to list of list of entries. Use with so that
# the file is automatically closed when we are done with it
csvlines = []
with open("CSV-sorted.csv") as f:
    for line in f:
        # Use strip() to clean up trailing whitespace, use split() to split
        # on commas.
        a = [entry.strip() for entry in line.split(',')]
        csvlines.append(a)

# I'll print it here so you can see what it looks like:
print(csvlines)

# Next up: converting our list of lists to a dict of sets.
# Create empty dict
sample_dict = {}
# Fill in the dict
for line in csvlines:
    gene = line[0]  # gene is first entry
    samples = set(line[1:])  # rest of the entries are samples
    # If this gene is in the dict already then join the two sets of samples
    if gene in sample_dict:
        sample_dict[gene] = sample_dict[gene].union(samples)
    # otherwise just put it in
    else:
        sample_dict[gene] = samples

# Now you can print the dictionary:
print(sample_dict)
The output is:
[['AHCTF1', 'Sample1', 'Sample2', 'Sample4'], ['AHCTF1', 'Sample2', 'Sample7', 'Sample12'], ['AHCTF1', 'Sample5', 'Sample6', 'Sample7']]
{'AHCTF1': {'Sample12', 'Sample1', 'Sample2', 'Sample5', 'Sample4', 'Sample7', 'Sample6'}}
where the second line is your dictionary. Sets are unordered, so the samples may print in a different order on your machine.
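If you would rather not edit the file by hand, the colon can be handled in code instead. The following is only a sketch under assumptions: it assumes each line looks like "gene:sample,sample,...", and the helper name build_sample_dict and the example lines are hypothetical, not taken from your data. It uses collections.defaultdict so the "is this gene already in the dict" check is no longer needed:

```python
from collections import defaultdict


def build_sample_dict(lines):
    """Map each gene to the set of all samples seen for it.

    Assumes each line has the layout "gene:sample,sample,..." (an assumption
    about the input format, not something confirmed by the original file).
    """
    sample_dict = defaultdict(set)
    for line in lines:
        # Treat the first ":" as just another comma, then split on commas
        entries = [e.strip() for e in line.replace(":", ",", 1).split(",")]
        # Drop empty fields; a blank line ends up with no entries at all
        entries = [e for e in entries if e]
        if not entries:
            continue
        gene, samples = entries[0], entries[1:]
        # update() adds the new samples; duplicates are ignored by the set
        sample_dict[gene].update(samples)
    return dict(sample_dict)


# Hypothetical input lines in the assumed layout
example_lines = [
    "AHCTF1:Sample1,Sample2,Sample4",
    "AHCTF1:Sample2,Sample7,Sample12",
    "AHCTF1:Sample5,Sample6,Sample7",
]
result = build_sample_dict(example_lines)
print(result)
```

Because the function accepts any iterable of strings, an open file object can be passed in directly, for example inside a with open("CSV-sorted.csv") as f: block.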