更新时间:2022-10-24 18:04:15
将pandas导入为pddf = pd.DataFrame([[5777, 5385, 5419, 4887],[4849, 3759, 4539, 3381],[4971, 3824, 4645, 3424],[4827, 3459, 4552, 3153],[5207, 3670, 4876, 3358],],index=pd.to_datetime(['2001-01-01','2002-01-01','2003-01-01','2004-01-01','2005-01-01']),列=pd.MultiIndex.from_tuples([('非农总人数', '雇佣人数'), ('非农总人数', '职位空缺'),('私人总数', '招聘'), ('私人总数', '职位空缺')]))打印文件非农合计 私营合计招聘职位空缺 招聘职位空缺2001-01-01 5777 5385 5419 48872002-01-01 4849 3759 4539 33812003-01-01 4971 3824 4645 34242004-01-01 4827 3459 4552 31532005-01-01 5207 3670 4876 3358
试试:
df.T.groupby(level=0).diff(-1).dropna().T非农合计 私营合计雇用雇用2001-01-01 392.0 532.02002-01-01 1090.0 1158.02003-01-01 1147.0 1221.02004-01-01 1368.0 1399.02005-01-01 1537.0 1518.0
要应用其他变换,比如比率,你可以这样做:
print df.T.groupby(level=0).apply(lambda x: np.exp(np.log(x).diff(-1))).dropna().T非农合计 私营合计雇用雇用2001-01-01 1.072795 1.1088602002-01-01 1.289971 1.3425022003-01-01 1.299948 1.3566002004-01-01 1.395490 1.4437042005-01-01 1.418801 1.452055
或者:
print df.T.groupby(level=0).apply(lambda x: x.div(x.shift(-1))).dropna().T非农合计 私营合计雇用雇用2001-01-01 1.072795 1.1088602002-01-01 1.289971 1.3425022003-01-01 1.299948 1.3566002004-01-01 1.395490 1.4437042005-01-01 1.418801 1.452055
要重命名列并与原始数据框结合,您可以:
df2 = df.T.groupby(level=0).diff(-1).dropna().Tdf2.columns = pd.MultiIndex.from_tuples([('非农总量', '差异'),('总私人', '差异')])pd.concat([df, df2],axis=1).sort_index(axis=1)
看起来像:
非农总量 私人总量\招聘职位空缺 招聘职位空缺2001-01-01 5777 5385 392.0 5419 48872002-01-01 4849 3759 1090.0 4539 33812003-01-01 4971 3824 1147.0 4645 34242004-01-01 4827 3459 1368.0 4552 31532005-01-01 5207 3670 1537.0 4876 3358区别2001-01-01 532.02002-01-01 1158.02003-01-01 1221.02004-01-01 1399.02005-01-01 1518.0
I have MultiColumns: the second level repetitively contains Job Openings
and Hires
. I would like to subtract one from another for each of the top-level columns - but all I try gets me into index-errors or slice errors. How can I compute it?
Sample data:
>>> df.head()
Out[25]:
Total nonfarm Total private
Hires Job openings Hires Job openings
date
2001-01-01 5777 5385 5419 4887
2002-01-01 4849 3759 4539 3381
2003-01-01 4971 3824 4645 3424
2004-01-01 4827 3459 4552 3153
2005-01-01 5207 3670 4876 3358
expected output:
Out[25]:
Total nonfarm Total private
difference difference
date
2001-01-01 1234 5678
2002-01-01 1234 5678
2003-01-01 1234 5678
2004-01-01 1234 5678
2005-01-01 1234 5678
where the numbers obviously are not correct.
In order to have a generally applicable way, I was trying to set up
def apply(group):
result = group.loc[:, pd.IndexSlice[:, 'Job openings']].div(group.loc[:, pd.IndexSlice[:, 'Hires']].values)
result.columns = pd.MultiIndex.from_product([[group.columns.get_level_values(0)[0]], ['Ratio']])
return result.values
foo = df.groupby(axis=1, level=0).apply(apply)
Which suffers from two problems:
.values
in order to get the divide properlyfoo
is not a proper dataframe:
Accommodation and food services [[0.76], [0.480349344978], [0.501388888889], [... Arts, entertainment, and recreation [[0.558139534884], [0.46017699115], [0.2483221... Construction [[0.35], [0.274881516588], [0.267260579065], [...
I first tried to return result
, instead of result.values
, but that just lead to a data frame full of NaN
What I don't like about the highest-voted answer is that it requires on .diff()
or .div()
- hacks, which make the code hard to read and are hard to implement when there's more than two columns at the sub-level.
import pandas as pd
df = pd.DataFrame(
[
[5777, 5385, 5419, 4887],
[4849, 3759, 4539, 3381],
[4971, 3824, 4645, 3424],
[4827, 3459, 4552, 3153],
[5207, 3670, 4876, 3358],
],
index=pd.to_datetime(['2001-01-01',
'2002-01-01',
'2003-01-01',
'2004-01-01',
'2005-01-01']),
columns=pd.MultiIndex.from_tuples(
[('Total nonfarm', 'Hires'), ('Total nonfarm', 'Job Openings'),
('Total private', 'Hires'), ('Total private', 'Job Openings')]
)
)
print df
Total nonfarm Total private
Hires Job Openings Hires Job Openings
2001-01-01 5777 5385 5419 4887
2002-01-01 4849 3759 4539 3381
2003-01-01 4971 3824 4645 3424
2004-01-01 4827 3459 4552 3153
2005-01-01 5207 3670 4876 3358
Try:
df.T.groupby(level=0).diff(-1).dropna().T
Total nonfarm Total private
Hires Hires
2001-01-01 392.0 532.0
2002-01-01 1090.0 1158.0
2003-01-01 1147.0 1221.0
2004-01-01 1368.0 1399.0
2005-01-01 1537.0 1518.0
To apply other transforms, say a ratio, you could do:
print df.T.groupby(level=0).apply(lambda x: np.exp(np.log(x).diff(-1))).dropna().T
Total nonfarm Total private
Hires Hires
2001-01-01 1.072795 1.108860
2002-01-01 1.289971 1.342502
2003-01-01 1.299948 1.356600
2004-01-01 1.395490 1.443704
2005-01-01 1.418801 1.452055
Or:
print df.T.groupby(level=0).apply(lambda x: x.div(x.shift(-1))).dropna().T
Total nonfarm Total private
Hires Hires
2001-01-01 1.072795 1.108860
2002-01-01 1.289971 1.342502
2003-01-01 1.299948 1.356600
2004-01-01 1.395490 1.443704
2005-01-01 1.418801 1.452055
To rename columns and combine with the original dataframe you can:
df2 = df.T.groupby(level=0).diff(-1).dropna().T
df2.columns = pd.MultiIndex.from_tuples(
[('Total nonfarm', 'difference'),
('Total private', 'difference')])
pd.concat([df, df2], axis=1).sort_index(axis=1)
Looks like:
Total nonfarm Total private \
Hires Job Openings difference Hires Job Openings
2001-01-01 5777 5385 392.0 5419 4887
2002-01-01 4849 3759 1090.0 4539 3381
2003-01-01 4971 3824 1147.0 4645 3424
2004-01-01 4827 3459 1368.0 4552 3153
2005-01-01 5207 3670 1537.0 4876 3358
difference
2001-01-01 532.0
2002-01-01 1158.0
2003-01-01 1221.0
2004-01-01 1399.0
2005-01-01 1518.0