且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

Pandas:将操作应用于 MultiIndex 中的重复列

更新时间:2022-10-24 18:04:15

设置

将pandas导入为pddf = pd.DataFrame([[5777, 5385, 5419, 4887],[4849, 3759, 4539, 3381],[4971, 3824, 4645, 3424],[4827, 3459, 4552, 3153],[5207, 3670, 4876, 3358],],index=pd.to_datetime(['2001-01-01','2002-01-01','2003-01-01','2004-01-01','2005-01-01']),列=pd.MultiIndex.from_tuples([('非农总人数', '雇佣人数'), ('非农总人数', '职位空缺'),('私人总数', '招聘'), ('私人总数', '职位空缺')]))打印文件非农合计 私营合计招聘职位空缺 招聘职位空缺2001-01-01 5777 5385 5419 48872002-01-01 4849 3759 4539 33812003-01-01 4971 3824 4645 34242004-01-01 4827 3459 4552 31532005-01-01 5207 3670 4876 3358

试试:

df.T.groupby(level=0).diff(-1).dropna().T非农合计 私营合计雇用雇用2001-01-01 392.0 532.02002-01-01 1090.0 1158.02003-01-01 1147.0 1221.02004-01-01 1368.0 1399.02005-01-01 1537.0 1518.0

要应用其他变换,比如比率,你可以这样做:

print df.T.groupby(level=0).apply(lambda x: np.exp(np.log(x).diff(-1))).dropna().T非农合计 私营合计雇用雇用2001-01-01 1.072795 1.1088602002-01-01 1.289971 1.3425022003-01-01 1.299948 1.3566002004-01-01 1.395490 1.4437042005-01-01 1.418801 1.452055

或者:

print df.T.groupby(level=0).apply(lambda x: x.div(x.shift(-1))).dropna().T非农合计 私营合计雇用雇用2001-01-01 1.072795 1.1088602002-01-01 1.289971 1.3425022003-01-01 1.299948 1.3566002004-01-01 1.395490 1.4437042005-01-01 1.418801 1.452055

要重命名列并与原始数据框结合,您可以:

df2 = df.T.groupby(level=0).diff(-1).dropna().Tdf2.columns = pd.MultiIndex.from_tuples([('非农总量', '差异'),('总私人', '差异')])pd.concat([df, df2],axis=1).sort_index(axis=1)

看起来像:

 非农总量 私人总量\招聘职位空缺 招聘职位空缺2001-01-01 5777 5385 392.0 5419 48872002-01-01 4849 3759 1090.0 4539 33812003-01-01 4971 3824 1147.0 4645 34242004-01-01 4827 3459 1368.0 4552 31532005-01-01 5207 3670 1537.0 4876 3358区别2001-01-01 532.02002-01-01 1158.02003-01-01 1221.02004-01-01 1399.02005-01-01 1518.0

I have MultiColumns: the second level repetitively contains Job Openings and Hires. I would like to subtract one from another for each of the top-level columns - but all I try gets me into index-errors or slice errors. How can I compute it?

Sample data:

>>> df.head()
Out[25]: 
           Total nonfarm              Total private               
                   Hires Job openings         Hires Job openings   
date                                                               
2001-01-01          5777         5385          5419         4887   
2002-01-01          4849         3759          4539         3381   
2003-01-01          4971         3824          4645         3424   
2004-01-01          4827         3459          4552         3153   
2005-01-01          5207         3670          4876         3358  

expected output:

Out[25]: 
           Total nonfarm   Total private              
              difference      difference   
date                                                               
2001-01-01          1234            5678          
2002-01-01          1234            5678          
2003-01-01          1234            5678         
2004-01-01          1234            5678      
2005-01-01          1234            5678    

where the numbers obviously are not correct.

Specifically within an apply()

In order to have a generally applicable way, I was trying to set up

def apply(group):
    result = group.loc[:, pd.IndexSlice[:, 'Job openings']].div(group.loc[:, pd.IndexSlice[:, 'Hires']].values)
    result.columns = pd.MultiIndex.from_product([[group.columns.get_level_values(0)[0]], ['Ratio']])
    return result.values
foo = df.groupby(axis=1, level=0).apply(apply)

Which suffers from two problems:

  • I need to cheat around with .values in order to get the divide properly
  • foo is not a proper dataframe:

    Accommodation and food services [[0.76], [0.480349344978], [0.501388888889], [... Arts, entertainment, and recreation [[0.558139534884], [0.46017699115], [0.2483221... Construction [[0.35], [0.274881516588], [0.267260579065], [...

I first tried to return result, instead of result.values, but that just lead to a data frame full of NaN

Specifically with using the column names

What I don't like about the highest-voted answer is that it requires on .diff() or .div() - hacks, which make the code hard to read and are hard to implement when there's more than two columns at the sub-level.

Setup

import pandas as pd

df = pd.DataFrame(
    [
        [5777, 5385, 5419, 4887],
        [4849, 3759, 4539, 3381],
        [4971, 3824, 4645, 3424],
        [4827, 3459, 4552, 3153],
        [5207, 3670, 4876, 3358],
    ],
    index=pd.to_datetime(['2001-01-01',
                          '2002-01-01',
                          '2003-01-01',
                          '2004-01-01',
                          '2005-01-01']),
    columns=pd.MultiIndex.from_tuples(
        [('Total nonfarm', 'Hires'), ('Total nonfarm', 'Job Openings'),
         ('Total private', 'Hires'), ('Total private', 'Job Openings')]
    )
)

print df

           Total nonfarm              Total private             
                   Hires Job Openings         Hires Job Openings
2001-01-01          5777         5385          5419         4887
2002-01-01          4849         3759          4539         3381
2003-01-01          4971         3824          4645         3424
2004-01-01          4827         3459          4552         3153
2005-01-01          5207         3670          4876         3358

Try:

df.T.groupby(level=0).diff(-1).dropna().T

           Total nonfarm Total private
                   Hires         Hires
2001-01-01         392.0         532.0
2002-01-01        1090.0        1158.0
2003-01-01        1147.0        1221.0
2004-01-01        1368.0        1399.0
2005-01-01        1537.0        1518.0

To apply other transforms, say a ratio, you could do:

print df.T.groupby(level=0).apply(lambda x: np.exp(np.log(x).diff(-1))).dropna().T

           Total nonfarm Total private
                   Hires         Hires
2001-01-01      1.072795      1.108860
2002-01-01      1.289971      1.342502
2003-01-01      1.299948      1.356600
2004-01-01      1.395490      1.443704
2005-01-01      1.418801      1.452055

Or:

print df.T.groupby(level=0).apply(lambda x: x.div(x.shift(-1))).dropna().T

           Total nonfarm Total private
                   Hires         Hires
2001-01-01      1.072795      1.108860
2002-01-01      1.289971      1.342502
2003-01-01      1.299948      1.356600
2004-01-01      1.395490      1.443704
2005-01-01      1.418801      1.452055

To rename columns and combine with the original dataframe you can:

df2 = df.T.groupby(level=0).diff(-1).dropna().T
df2.columns = pd.MultiIndex.from_tuples(
    [('Total nonfarm', 'difference'),
     ('Total private', 'difference')])
pd.concat([df, df2], axis=1).sort_index(axis=1)

Looks like:

           Total nonfarm                         Total private               \
                   Hires Job Openings difference         Hires Job Openings   
2001-01-01          5777         5385      392.0          5419         4887   
2002-01-01          4849         3759     1090.0          4539         3381   
2003-01-01          4971         3824     1147.0          4645         3424   
2004-01-01          4827         3459     1368.0          4552         3153   
2005-01-01          5207         3670     1537.0          4876         3358   

           difference  
2001-01-01      532.0  
2002-01-01     1158.0  
2003-01-01     1221.0  
2004-01-01     1399.0  
2005-01-01     1518.0