How to flatten a list of dicts in a Pandas DataFrame column into several columns?

Updated: 2021-10-22 22:41:46

Assuming you have already imported your data, as defined in your MCVE:

data = [
    {'user': 1, 'query': 'abc', 'Filters': [{'Op': 'and', 'Type': 'date', 'Val': '1992'}, {'Op': 'and', 'Type': 'sex', 'Val': 'F'}]},
    {'user': 1, 'query': 'efg', 'Filters': [{'Op': 'and', 'Type': 'date', 'Val': '2000'}, {'Op': 'and', 'Type': 'col', 'Val': 'Blue'}]},
    {'user': 1, 'query': 'fgs', 'Filters': [{'Op': 'and', 'Type': 'date', 'Val': '2001'}, {'Op': 'and', 'Type': 'col', 'Val': 'Red'}]},
    {'user': 2, 'query': 'hij', 'Filters': [{'Op': 'and', 'Type': 'date', 'Val': '2002'}]},
    {'user': 2, 'query': 'dcv', 'Filters': [{'Op': 'and', 'Type': 'date', 'Val': '2001'}, {'Op': 'and', 'Type': 'sex', 'Val': 'F'}]},
    {'user': 2, 'query': 'tyu', 'Filters': [{'Op': 'and', 'Type': 'date', 'Val': '1999'}, {'Op': 'and', 'Type': 'col', 'Val': 'Yellow'}]},
    {'user': 3, 'query': 'jhg', 'Filters': [{'Op': 'and', 'Type': 'date', 'Val': '2001'}, {'Op': 'and', 'Type': 'sex', 'Val': 'M'}]},
    {'user': 4, 'query': 'mlh', 'Filters': [{'Op': 'and', 'Type': 'date', 'Val': '2001'}]},
]

Then, you are looking for the Pandas json_normalize function, which flattens nested records into a table:

import pandas as pd
# pandas >= 1.0; older versions: from pandas.io.json import json_normalize
df = pd.json_normalize(data, 'Filters', ['query', 'user'])

It returns a normalized DataFrame in which your JSON column is expanded into columns named after its keys:

     Op  Type     Val  user query
0   and  date    1992     1   abc
1   and   sex       F     1   abc
2   and  date    2000     1   efg
3   and   col    Blue     1   efg
4   and  date    2001     1   fgs
5   and   col     Red     1   fgs
6   and  date    2002     2   hij
7   and  date    2001     2   dcv
8   and   sex       F     2   dcv
9   and  date    1999     2   tyu
10  and   col  Yellow     2   tyu
11  and  date    2001     3   jhg
12  and   sex       M     3   jhg
13  and  date    2001     4   mlh
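
If you want to try the normalization step on its own, here is a minimal sketch. The two-record `sample` list is a hypothetical cut-down version of your data, and the keyword names `record_path` and `meta` are just the explicit spelling of the positional arguments used above. The order of the meta columns may differ between pandas versions.

```python
import pandas as pd

# Hypothetical two-record sample, shaped like the data in the question.
sample = [
    {"user": 1, "query": "abc",
     "Filters": [{"Op": "and", "Type": "date", "Val": "1992"},
                 {"Op": "and", "Type": "sex", "Val": "F"}]},
    {"user": 2, "query": "hij",
     "Filters": [{"Op": "and", "Type": "date", "Val": "2002"}]},
]

# record_path: one output row per dict found in each "Filters" list.
# meta: parent-level keys repeated on every row coming from that record.
flat = pd.json_normalize(sample, record_path="Filters", meta=["query", "user"])
print(flat)
```

Each filter becomes one row, so a record with two filters contributes two rows that share the same query and user.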

Now, pivot your DataFrame to turn the Type values into columns:

df = df.pivot_table(index=['user', 'query', 'Op'], columns='Type', aggfunc='first')

This gives:

                   Val            
Type               col  date   sex
user query Op                     
1    abc   and    None  1992     F
     efg   and    Blue  2000  None
     fgs   and     Red  2001  None
2    dcv   and    None  2001     F
     hij   and    None  2002  None
     tyu   and  Yellow  1999  None
3    jhg   and    None  2001     M
4    mlh   and    None  2001  None
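
To see what the pivot does in isolation, here is a small sketch on a hand-built long-format frame. `long_df` and `wide` are illustrative names, not part of your data. Because no `values` argument is passed, the remaining column (Val) becomes the top level of a two-level column index, and combinations with no filter come out as missing values.

```python
import pandas as pd

# Hypothetical long-format frame: one row per (query, filter type).
long_df = pd.DataFrame({
    "user": [1, 1, 2],
    "query": ["abc", "abc", "hij"],
    "Op": ["and", "and", "and"],
    "Type": ["date", "sex", "date"],
    "Val": ["1992", "F", "2002"],
})

# aggfunc="first" keeps the first value per cell, which also works for strings
# (the default aggregation, a mean, would not).
wide = long_df.pivot_table(index=["user", "query", "Op"],
                           columns="Type", aggfunc="first")
print(wide)
```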

Finally, you can drop the extra column level and reset the index, if they bother you:

df.columns = df.columns.droplevel(0)
df.reset_index(inplace=True)

This returns the output requested in your MCVE:

Type  user query   Op     col  date   sex
0        1   abc  and    None  1992     F
1        1   efg  and    Blue  2000  None
2        1   fgs  and     Red  2001  None
3        2   dcv  and    None  2001     F
4        2   hij  and    None  2002  None
5        2   tyu  and  Yellow  1999  None
6        3   jhg  and    None  2001     M
7        4   mlh  and    None  2001  None
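
The clean-up can also be written without inplace=True. The sketch below builds a small pivoted frame by hand (the name `wide` and its two rows are made up for illustration) and applies the same two operations.

```python
import pandas as pd

# Hypothetical pivoted frame with a two-level column index, as pivot_table returns.
wide = pd.DataFrame(
    [["1992", "F"], ["2002", None]],
    index=pd.MultiIndex.from_tuples(
        [(1, "abc", "and"), (2, "hij", "and")], names=["user", "query", "Op"]),
    columns=pd.MultiIndex.from_tuples(
        [("Val", "date"), ("Val", "sex")], names=[None, "Type"]),
)

# Drop the outer "Val" level so only the Type labels remain as column names.
wide.columns = wide.columns.droplevel(0)

# Move user/query/Op from the row index back into ordinary columns.
flat = wide.reset_index()
print(flat)
```

Note that the columns index keeps its name (Type) after reset_index, which is what produces the misleading header in the printed output.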

Not a column

In this final DataFrame, the first column seems to be called Type, but it is not. It is instead an unnamed integer index:

df.index
RangeIndex(start=0, stop=8, step=1)

And the columns index is named Type, but it holds no label called Type (so there is no column with this name):

df.columns
Index(['user', 'query', 'Op', 'col', 'date', 'sex'], dtype='object', name='Type')

This is why you cannot remove the column Type (the column used in pivot_table): it does not exist.
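
You can check this yourself. The sketch below uses a tiny made-up frame whose columns index is named Type; the membership test fails and trying to drop the column raises a KeyError.

```python
import pandas as pd

# Hypothetical frame whose columns index carries the name "Type".
df = pd.DataFrame({"user": [1, 2], "date": ["1992", "2002"]})
df.columns.name = "Type"

# "Type" is the name of the columns index, not one of its labels.
has_type_column = "Type" in df.columns

# Dropping a column that does not exist raises KeyError.
try:
    df.drop(columns="Type")
    drop_failed = False
except KeyError:
    drop_failed = True

print(has_type_column, drop_failed)
```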

If you want to get rid of this fake column, you need to create a new index for the rows:

df.set_index(['user', 'query'], inplace=True)

If the name of the columns index bothers you, you can reset it:

df.columns.name = None

This gives:

             Op     col  date   sex
user query                         
1    abc    and    None  1992     F
     efg    and    Blue  2000  None
     fgs    and     Red  2001  None
2    dcv    and    None  2001     F
     hij    and    None  2002  None
     tyu    and  Yellow  1999  None
3    jhg    and    None  2001     M
4    mlh    and    None  2001  None
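
Here is the same pair of steps as a self-contained sketch on a small made-up frame, written without inplace=True (`tidy` is an illustrative name):

```python
import pandas as pd

# Hypothetical flat frame whose columns index is still named "Type".
df = pd.DataFrame({
    "user": [1, 2],
    "query": ["abc", "hij"],
    "Op": ["and", "and"],
    "date": ["1992", "2002"],
})
df.columns.name = "Type"

# Use (user, query) as the row index instead of the default integers.
tidy = df.set_index(["user", "query"])

# Clear the name of the columns index so "Type" no longer shows in the header.
tidy.columns.name = None
print(tidy)
```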

It is good practice to always check that a new index is unique:

df.index.is_unique
True
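
If you would rather have pandas refuse a duplicated index than check afterwards, set_index accepts verify_integrity=True. The sketch below uses a made-up frame in which the pair (1, 'abc') appears twice, so the check fails.

```python
import pandas as pd

# Hypothetical frame where (user, query) = (1, "abc") occurs twice.
df = pd.DataFrame({
    "user": [1, 1, 2],
    "query": ["abc", "abc", "hij"],
    "date": ["1992", "1993", "2002"],
})

# Without the check, the duplicated index is created silently.
indexed = df.set_index(["user", "query"])
unique = indexed.index.is_unique

# With verify_integrity=True, pandas raises ValueError on duplicate keys.
try:
    df.set_index(["user", "query"], verify_integrity=True)
    raised = False
except ValueError:
    raised = True

print(unique, raised)
```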

Data in a file

If your data are in a file, first load them into a variable using the json module from the Python Standard Library:

import json
with open(path) as file:
    data = json.load(file)

This will do the trick; then go back to the beginning of my answer.
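
Put together, the file round trip might look like the sketch below. The file name queries.json and the single record are made up for illustration, and the file is written to a temporary directory only so that the example runs on its own.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical record, shaped like the data in the question.
records = [
    {"user": 1, "query": "abc",
     "Filters": [{"Op": "and", "Type": "date", "Val": "1992"},
                 {"Op": "and", "Type": "sex", "Val": "F"}]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "queries.json")  # hypothetical file name

    # Write the sample so there is a file to read back.
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh)

    # Load the file into plain Python lists and dicts.
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)

# From here on, the steps are the same as at the beginning of the answer.
flat = pd.json_normalize(data, "Filters", ["query", "user"])
print(flat)
```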