且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

将多嵌套的dict/json加载到 pandas 中

更新时间:2023-01-21 20:19:55

我相信您可以使用:

need = ['member.id', 'member.name', 'member.level', 'member.attacks','id']
df1 = json_normalize(war['teams'],
                     'members',
                     ['id', 'name', 'level', 'attacks'], 
                     record_prefix='member.')[need]
#print (df1)

df2 = json_normalize(war['opponent'],
                     'members',
                     ['id', 'name', 'level', 'attacks'], 
                     record_prefix='member.')[need]
#print (df2)


df1.columns = np.where(df1.columns.str.startswith('member.'), 
                       df1.columns.str.split('.', n=1).str[1],
                       'member.' + df1.columns)
df2.columns = np.where(df2.columns.str.startswith('member.'), 
                       df2.columns.str.split('.', n=1).str[1],
                       'member.' + df2.columns)


df = pd.concat([df1, df2], sort=False, ignore_index=True)
print (df)
  id   name  level                                            attacks  \
0  1   John     12                                                NaN   
1  2    Tom     11  [{'attackerTag': '2', 'defenderTag': '4', 'dam...   
2  3  Betty     17  [{'attacker_id': '3', 'defender_id': '1', 'dam...   
3  4   Fred      9  [{'attacker_id': '4', 'defender_id': '9', 'dam...   

   member.id  
0   12345679  
1   12345679  
2  987654321  
3  987654321  

I'm trying to load a pretty confusing multi-nested JSON into pandas. I'm already using json_normalize but trying to figure out how I'd join 2 similair nested dicts as well as unpack their sub dicts and lists has been stumping me. I have limited understanding of pandas but I'm assuming I can leverage it's performance advantages if I can just get this down.

I have 2 dict's containing war data, one loaded from a JSON API response and one in a database. I'm trying to compare the 2 for new attacks and defenses.

example war

{
  "state": "active",
  "team_size": 20,
  "teams": {
    "id": "12345679",
    "name": "Good Guys",
    "level": 10,
    "attacks": 4,
    "destruction_percentage": 22.6,
    "members": [
      {
        "id": "1",
        "name": "John",
        "level": 12
      },
      {
        "id": "2",
        "name": "Tom",
        "level": 11,
        "attacks": [
          {
            "attackerTag": "2",
            "defenderTag": "4",
            "damage": 64,
            "order": 7
          }
        ]
      }
    ]
  },
  "opponent": {
    "id": "987654321",
    "name": "Bad Guys",
    "level": 17,
    "attacks": 5,
    "damage": 20.95,
    "members": [
      {
        "id": "3",
        "name": "Betty",
        "level": 17,
        "attacks": [
          {
            "attacker_id": "3",
            "defender_id": "1",
            "damage": 70,
            "order": 1
          },
          {
            "attacker_id": "3",
            "defender_id": "7",
            "damage": 100,
            "order": 11
          }
        ],
        "opponentAttacks": 0,
        "some_useless_data": "Want to ignore, this doesn't show in every record"
      },
      {
        "id": "4",
        "name": "Fred",
        "level": 9,
        "attacks": [
          {
            "attacker_id": "4",
            "defender_id": "9",
            "damage": 70,
            "order": 4
          }
        ],
        "opponentAttacks": 0
      }
    ]
  }
}

Now I'm assuming pandas would be my best option as far as performance goes rather than zipping them together and looping through each member and comparing them.

So my attempt to get a dataframe that flattens nicely and easy to traverse have been difficult to say the least. Optimally I'd assume the following layout. I'm only trying to get both teams into a single df of just all the members. we can omit the state and team_size keys and focus on just getting each member and their respective attacks and team_id's

example df (expected, result):

id   name   level  attacks         member.team_id  ...
1    John   12     NaN             "123456789"
2    Tom    11     [{...}]         "123456789"
3    Betty  17     [{...}, {...}]  "987654321"
4    Fred   9      [{...}]         "987654321"

That's the basic gist of what I'd like as a df. So then I can take both dataframes and compare for new attacks.

Note I just pop()'d state and team_size from the dict before my attempts as all I want is all members, with the team embedded into it pretty much

I have tried the following with no luck and I know it's not the correct way because it's working backwards up the dict tree.

old_df = json_normalize(war,
                        'members',
                        ['id', 'name', 'level', 'attacks'],
                        record_prefix='member')

#Traceback (most recent call last):
#  File "test.py", line 83, in <module>
#    new_parse(old_war, new_war)
#  File "test.py", line 79, in new_parse
#    record_prefix='member')
#  File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 262, in json_normalize
#    _recursive_extract(data, record_path, {}, level=0)
#  File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 238, in _recursive_extract
#    recs = _pull_field(obj, path[0])
#  File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 185, in _pull_field
#    result = result[spec]
#KeyError: 'members'

I assumed I could use something like the following but that isn't working either.

df = pd.DataFrame.from_dict(old, orient='index')
df.droplevel('members')

#Traceback (most recent call last):
#  File "test.py", line 106, in <module>
#    new_parse(old_war, new_war)
#  File "test.py", line 87, in new_parse
#    df.droplevel('members')
#  File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 4376, in __getattr__
#    return object.__getattribute__(self, name)
#AttributeError: 'DataFrame' object has no attribute 'droplevel'

I appreciate any guidance! Hopefully I put enough in to help understand my intended result, if not please let me know!

Edit To be fair, I do know how to do this just looping the dict and creating a new list of members with the appropriate date, but this I feel is much more inefficient than using pandas as I'm doing this for millions of wars in a threaded application and every bit of performance I can get out of it is a bonus for me and the app. - Thanks again!

I believe you can use:

need = ['member.id', 'member.name', 'member.level', 'member.attacks','id']
df1 = json_normalize(war['teams'],
                     'members',
                     ['id', 'name', 'level', 'attacks'], 
                     record_prefix='member.')[need]
#print (df1)

df2 = json_normalize(war['opponent'],
                     'members',
                     ['id', 'name', 'level', 'attacks'], 
                     record_prefix='member.')[need]
#print (df2)


df1.columns = np.where(df1.columns.str.startswith('member.'), 
                       df1.columns.str.split('.', n=1).str[1],
                       'member.' + df1.columns)
df2.columns = np.where(df2.columns.str.startswith('member.'), 
                       df2.columns.str.split('.', n=1).str[1],
                       'member.' + df2.columns)


df = pd.concat([df1, df2], sort=False, ignore_index=True)
print (df)
  id   name  level                                            attacks  \
0  1   John     12                                                NaN   
1  2    Tom     11  [{'attackerTag': '2', 'defenderTag': '4', 'dam...   
2  3  Betty     17  [{'attacker_id': '3', 'defender_id': '1', 'dam...   
3  4   Fred      9  [{'attacker_id': '4', 'defender_id': '9', 'dam...   

   member.id  
0   12345679  
1   12345679  
2  987654321  
3  987654321