且构网

Sharing all things programming and development...
Remove duplicate list elements in a column of lists

Updated: 2022-10-17 23:18:55

This is my dataframe:

import pandas as pd

df = pd.DataFrame({'A':[1, 3, 3, 4, 5, 3, 3],
                   'B':[0, 2, 3, 4, 5, 6, 7],
                   'C':[[1,4,4,4], [1,4,4,4], [3,4,4,5], [3,4,4,5], [4,4,2,1], [1,2,3,4], [7,8,9,1]]})

I want to drop the duplicate values inside each list in column C (that is, reduce each row's list to its unique values), without dropping duplicate rows.

This is what I hope to get:

pd.DataFrame({'A':[1, 3, 3, 4, 5, 3, 3],
              'B':[0, 2, 3, 4, 5, 6, 7],
              'C':[[1,4], [1,4], [3,4,5], [3,4,5], [4,2,1], [1,2,3,4,], [7,8,9,1]]})

If you're using Python 3.7 or later, you could map with dict.fromkeys and build a list from the dictionary keys (the version matters because dictionaries are guaranteed to preserve insertion order from 3.7 onward):

df['C'] = df.C.map(lambda x: list(dict.fromkeys(x).keys()))
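
To see why this works, here is a minimal sketch of the same idea on plain lists, without pandas. The helper name dedupe_keep_order and the sample rows are only illustrative; iterating a dict yields its keys, so the explicit .keys() call can be dropped:

```python
def dedupe_keep_order(values):
    """Return the unique items of `values` in first-seen order."""
    # dict.fromkeys keeps one key per distinct value; on Python 3.7+
    # the keys come back in insertion order.
    return list(dict.fromkeys(values))


# Illustrative sample taken from column C of the question.
rows = [[1, 4, 4, 4], [3, 4, 4, 5], [4, 4, 2, 1]]
deduped = [dedupe_keep_order(row) for row in rows]
print(deduped)  # [[1, 4], [3, 4, 5], [4, 2, 1]]
```

Note that this relies on the list elements being hashable, which holds for the integers used here.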

For older Python versions you can use collections.OrderedDict:

from collections import OrderedDict
df['C'] = df.C.map(lambda x: list(OrderedDict.fromkeys(x).keys()))

print(df)

   A  B             C
0  1  0        [1, 4]
1  3  2        [1, 4]
2  3  3     [3, 4, 5]
3  4  4     [3, 4, 5]
4  5  5     [4, 2, 1]
5  3  6  [1, 2, 3, 4]
6  3  7  [7, 8, 9, 1]

As mentioned by cs95 in the comments, if we don't need to preserve order we could go with a set for a more concise approach:

df['C'] = df.C.map(lambda x: [*{*x}])
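
As a small sketch of what the set-based version does: [*{*x}] is just a compact spelling of list(set(x)), and the order of the result is not guaranteed. The sample list below is illustrative; if a deterministic order is wanted without preserving the original one, sorting the set is one option:

```python
values = [4, 4, 2, 1]  # illustrative sample from column C

# Unpack the list into a set literal, then unpack the set into a list.
unique_unordered = [*{*values}]

# The same elements survive, but their order should not be relied on,
# so compare as sets rather than as lists.
print(set(unique_unordered) == {1, 2, 4})  # True

# Sorting gives a reproducible order (not the original one).
unique_sorted = sorted(set(values))
print(unique_sorted)  # [1, 2, 4]
```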


Several approaches have been proposed, and it is hard to tell how they will perform on large dataframes, so it is probably worth benchmarking them:

import numpy as np
import pandas as pd
import perfplot

df = pd.concat([df]*50000, axis=0).reset_index(drop=True)

perfplot.show(
    setup=lambda n: df.iloc[:int(n)], 

    kernels=[
        lambda df: df.C.map(lambda x: list(dict.fromkeys(x).keys())),
        lambda df: df['C'].map(lambda x: pd.factorize(x)[1]),
        lambda df: [np.unique(item) for item in df['C'].values],
        lambda df: df['C'].explode().groupby(level=0).unique(),
        lambda df: df.C.map(lambda x: [*{*x}]),
    ],

    labels=['dict.fromkeys', 'factorize', 'np.unique', 'explode', 'set'],
    n_range=[2**k for k in range(0, 18)],
    xlabel='N',
    equality_check=None
)
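
If perfplot is not available, a rough comparison can be sketched with the standard library's timeit module. This is only a simplified stand-in for the benchmark above: it works on plain lists rather than a dataframe, covers just two of the approaches, and the data size and repeat counts are arbitrary assumptions, so the numbers will not match a pandas-based run:

```python
import timeit

# Illustrative workload: the question's lists repeated many times.
rows = [[1, 4, 4, 4], [3, 4, 4, 5], [4, 4, 2, 1], [1, 2, 3, 4], [7, 8, 9, 1]] * 2000

candidates = {
    "dict.fromkeys": lambda: [list(dict.fromkeys(r)) for r in rows],
    "set": lambda: [list(set(r)) for r in rows],
}

# Time 5 passes of each candidate, 3 times over, and keep the best run.
timings = {
    name: min(timeit.repeat(func, number=5, repeat=3))
    for name, func in candidates.items()
}

# Print the candidates from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:15s} {seconds:.4f} s for 5 passes")
```

Taking the minimum of several repeats reduces noise from other processes; results will vary by machine and Python version.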