Updated: 2021-10-22 22:41:46
Assuming you have already imported your data, as defined in your MCVE:
data = [
    {'user': 1, 'query': 'abc', 'Filters': [{'Op': 'and', 'Type': 'date', 'Val': '1992'}, {'Op': 'and', 'Type': 'sex', 'Val': 'F'}]},
    {'user': 1, 'query': 'efg', 'Filters': [{'Op': 'and', 'Type': 'date', 'Val': '2000'}, {'Op': 'and', 'Type': 'col', 'Val': 'Blue'}]},
    {'user': 1, 'query': 'fgs', 'Filters': [{'Op': 'and', 'Type': 'date', 'Val': '2001'}, {'Op': 'and', 'Type': 'col', 'Val': 'Red'}]},
    {'user': 2, 'query': 'hij', 'Filters': [{'Op': 'and', 'Type': 'date', 'Val': '2002'}]},
    {'user': 2, 'query': 'dcv', 'Filters': [{'Op': 'and', 'Type': 'date', 'Val': '2001'}, {'Op': 'and', 'Type': 'sex', 'Val': 'F'}]},
    {'user': 2, 'query': 'tyu', 'Filters': [{'Op': 'and', 'Type': 'date', 'Val': '1999'}, {'Op': 'and', 'Type': 'col', 'Val': 'Yellow'}]},
    {'user': 3, 'query': 'jhg', 'Filters': [{'Op': 'and', 'Type': 'date', 'Val': '2001'}, {'Op': 'and', 'Type': 'sex', 'Val': 'M'}]},
    {'user': 4, 'query': 'mlh', 'Filters': [{'Op': 'and', 'Type': 'date', 'Val': '2001'}]},
]
Then, you are looking for the pandas json_normalize function to normalize the data:
import pandas as pd  # json_normalize moved to the top-level namespace in pandas 1.0

df = pd.json_normalize(data, 'Filters', ['query', 'user'])
It returns a normalized DataFrame where your column of JSON records is expanded into columns named after their keys:
Op Type Val user query
0 and date 1992 1 abc
1 and sex F 1 abc
2 and date 2000 1 efg
3 and col Blue 1 efg
4 and date 2001 1 fgs
5 and col Red 1 fgs
6 and date 2002 2 hij
7 and date 2001 2 dcv
8 and sex F 2 dcv
9 and date 1999 2 tyu
10 and col Yellow 2 tyu
11 and date 2001 3 jhg
12 and sex M 3 jhg
13 and date 2001 4 mlh
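Before pivoting, it is worth checking that no (user, query, Type) combination appears twice, since aggfunc='first' keeps only the first Val of each group and would silently drop the rest. A minimal sketch of that check, using a subset of the data above:

```python
import pandas as pd

# Subset of the MCVE data above, enough to illustrate the check.
data = [
    {'user': 1, 'query': 'abc', 'Filters': [
        {'Op': 'and', 'Type': 'date', 'Val': '1992'},
        {'Op': 'and', 'Type': 'sex', 'Val': 'F'},
    ]},
    {'user': 2, 'query': 'hij', 'Filters': [
        {'Op': 'and', 'Type': 'date', 'Val': '2002'},
    ]},
]

df = pd.json_normalize(data, 'Filters', ['query', 'user'])

# True when every (user, query, Type) triple is unique, i.e. the
# pivot with aggfunc='first' will not discard any Val.
no_duplicates = not df.duplicated(subset=['user', 'query', 'Type']).any()
```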
Now, you can pivot your DataFrame to convert the Type modalities into columns:
df = df.pivot_table(index=['user', 'query', 'Op'], columns='Type', aggfunc='first')
Which results in:
Val
Type col date sex
user query Op
1 abc and None 1992 F
efg and Blue 2000 None
fgs and Red 2001 None
2 dcv and None 2001 F
hij and None 2002 None
tyu and Yellow 1999 None
3 jhg and None 2001 M
4 mlh and None 2001 None
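The two header rows above (Val sitting over the Type modalities) appear because pivot_table returns a MultiIndex on the columns: the top level holds the value column name, the bottom level the Type modalities. A small sketch, reusing one record from the data above:

```python
import pandas as pd

# One record from the MCVE data, enough to show the column structure.
data = [
    {'user': 1, 'query': 'abc', 'Filters': [
        {'Op': 'and', 'Type': 'date', 'Val': '1992'},
        {'Op': 'and', 'Type': 'sex', 'Val': 'F'},
    ]},
]

df = pd.json_normalize(data, 'Filters', ['query', 'user'])
pivoted = df.pivot_table(index=['user', 'query', 'Op'],
                         columns='Type', aggfunc='first')

# Columns form a two-level MultiIndex: ('Val', 'date'), ('Val', 'sex').
levels = pivoted.columns.nlevels
```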
Finally, you can clean up and reset the indices, if they bother you:
df.columns = df.columns.droplevel(0)
df.reset_index(inplace=True)
Which returns your requested MCVE output:
Type user query Op col date sex
0 1 abc and None 1992 F
1 1 efg and Blue 2000 None
2 1 fgs and Red 2001 None
3 2 dcv and None 2001 F
4 2 hij and None 2002 None
5 2 tyu and Yellow 1999 None
6 3 jhg and None 2001 M
7 4 mlh and None 2001 None
Not a column
In this final DataFrame, the first column seems to be called Type, but it is not: it is a RangeIndex (an integer index) without a name:
df.index
RangeIndex(start=0, stop=8, step=1)
And the columns index is called Type, but it does not hold any modality called Type (therefore there is no column with this name):
df.columns
Index(['user', 'query', 'Op', 'col', 'date', 'sex'], dtype='object', name='Type')
This is why you cannot remove the column Type (the column used in pivot_table): it does not exist.
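You can confirm this directly: selecting 'Type' on a frame built the same way raises a KeyError, because 'Type' is only the name of the columns index, not a column. A minimal sketch with one record from the data above:

```python
import pandas as pd

data = [
    {'user': 1, 'query': 'abc', 'Filters': [
        {'Op': 'and', 'Type': 'date', 'Val': '1992'},
    ]},
]

# Same pipeline as in the answer, reduced to one record.
df = pd.json_normalize(data, 'Filters', ['query', 'user'])
df = df.pivot_table(index=['user', 'query', 'Op'],
                    columns='Type', aggfunc='first')
df.columns = df.columns.droplevel(0)
df.reset_index(inplace=True)

# The columns *index* is named 'Type', but no column carries that name.
try:
    df['Type']
    has_type_column = True
except KeyError:
    has_type_column = False
```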
If you want to remove this fake column, you need to create a new index for the rows:
df.set_index(['user', 'query'], inplace=True)
If the columns index name bothers you, you can reset it:
df.columns.name = None
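Equivalently, instead of mutating df.columns.name in place, rename_axis can do the same and return a new frame (passing None to remove an axis name is supported since pandas 0.24). A sketch on a small hypothetical frame standing in for the pivoted one:

```python
import pandas as pd

# Hypothetical frame mimicking the situation after the pivot:
# the columns index carries the leftover name 'Type'.
df = pd.DataFrame({'col': ['Blue'], 'date': ['2000']})
df.columns.name = 'Type'

# rename_axis(columns=None) removes the columns index name and
# returns a new DataFrame rather than mutating in place.
cleaned = df.rename_axis(columns=None)
```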
Which results in:
Op col date sex
user query
1 abc and None 1992 F
efg and Blue 2000 None
fgs and Red 2001 None
2 dcv and None 2001 F
hij and None 2002 None
tyu and Yellow 1999 None
3 jhg and None 2001 M
4 mlh and None 2001 None
It is good practice, when you create a new index, to always check that it is unique:
df.index.is_unique
True
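Alternatively, set_index can perform this check for you at creation time: with verify_integrity=True it raises a ValueError on duplicate keys instead of silently accepting them. A minimal sketch on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical flattened frame with (user, query) as candidate keys.
df = pd.DataFrame({
    'user': [1, 1, 2],
    'query': ['abc', 'efg', 'hij'],
    'date': ['1992', '2000', '2002'],
})

# verify_integrity=True makes set_index raise ValueError
# if the new (user, query) index would contain duplicates.
indexed = df.set_index(['user', 'query'], verify_integrity=True)
```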
Data in a file
If your data is in a file, you should first import it into a variable using the PSL (Python Standard Library) json module:
import json

with open(path) as file:
    data = json.load(file)
This will do the trick; then go back to the beginning of my answer.
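A minimal end-to-end sketch, writing one record to a temporary file and reading it back with json.load before normalizing (the tempfile path is only for this demonstration; in practice, path points at your own file):

```python
import json
import os
import tempfile

import pandas as pd

records = [{'user': 4, 'query': 'mlh', 'Filters': [
    {'Op': 'and', 'Type': 'date', 'Val': '2001'},
]}]

# Write the records to a temporary file to simulate your data file.
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as fh:
    json.dump(records, fh)
    path = fh.name

# Load it back, exactly as in the answer.
with open(path) as file:
    data = json.load(file)

df = pd.json_normalize(data, 'Filters', ['query', 'user'])
os.unlink(path)  # clean up the temporary file
```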