且构网

numpy.sum behaves differently on numpy.array vs pandas.DataFrame

Updated: 2023-12-01 08:08:04

In short, numpy.sum(a, axis=None) sums all cells of an array, but sums over the rows of a DataFrame (giving one total per column). I thought pandas.DataFrame was built on top of numpy.array and should not behave differently. What conversion happens under the hood?

import numpy
import pandas

a1 = numpy.random.random((3, 2))
a2 = pandas.DataFrame(a1)
numpy.sum(a1) # Sums all cells
numpy.sum(a2) # Sums over rows

OK, the following is a dump of my pdb debugging session, which shows how this call ends up in pandas land:

In [*]:

a1 = np.random.random((3,2))
import pdb
a2 = pd.DataFrame(a1)
print(np.sum(a1)) # Sums all cells
pdb.set_trace()
np.sum(a2) # Sums over rows
3.02993889742
--Return--
> <ipython-input-50-92405dd4ed52>(5)<module>()->None
-> pdb.set_trace()
(Pdb) b 6
Breakpoint 2 at <ipython-input-50-92405dd4ed52>:6
(Pdb) c
> <ipython-input-50-92405dd4ed52>(6)<module>()->None
-> np.sum(a2) # Sums over rows
(Pdb) s
--Call--
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1623)sum()
-> def sum(a, axis=None, dtype=None, out=None, keepdims=False):
(Pdb) print(axis)
None
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1700)sum()
-> if isinstance(a, _gentype):
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1706)sum()
-> elif type(a) is not mu.ndarray:
(Pdb) sssssss
*** NameError: name 'sssssss' is not defined
(Pdb) ss
*** NameError: name 'ss' is not defined
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1707)sum()
-> try:
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1708)sum()
-> sum = a.sum
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1713)sum()
-> return sum(axis=axis, dtype=dtype, out=out)
(Pdb) print(axis)
None
(Pdb) s
--Call--
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3973)stat_func()
-> @Substitution(outname=name, desc=desc)
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3977)stat_func()
-> if skipna is None:
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3978)stat_func()
-> skipna = True
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3979)stat_func()
-> if axis is None:
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3980)stat_func()
-> axis = self._stat_axis_number
(Pdb) print(self._stat_axis_number)
0
(Pdb) 

So basically, because a2 is not an ndarray, numpy.sum hands the call over to a2.sum(axis=None, ...). Once it ends up in pandas land there are some sanity checks, one of which is that if axis is None it is assigned the value of self._stat_axis_number, which is 0 here; hence the difference in behaviour. I'm not a pandas dev, so they may be able to shed more light on this, but it explains the difference in output.
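The hand-off seen in the trace can be imitated without numpy or pandas. The following is only a toy sketch: sketch_sum and ToyFrame are hypothetical names made up for illustration, and the real libraries do considerably more. It shows how a default of axis=None turns into axis=0 once the call is delegated to the object's own sum method.

```python
def sketch_sum(a, axis=None):
    """Toy stand-in for numpy.sum: delegate to a.sum() for non-list inputs."""
    if not isinstance(a, list):
        # Mirrors the `sum = a.sum; return sum(axis=axis, ...)` step in the trace.
        return a.sum(axis=axis)
    # Plain nested list: axis=None means "add up every cell".
    return sum(x for row in a for x in row)


class ToyFrame:
    """Hypothetical 2-D container that, like the traced stat_func,
    replaces axis=None with a default axis of 0."""

    _stat_axis_number = 0

    def __init__(self, rows):
        self.rows = rows

    def sum(self, axis=None):
        if axis is None:
            axis = self._stat_axis_number
        if axis == 0:
            # Collapse the row axis: one total per column.
            return [sum(col) for col in zip(*self.rows)]
        # axis == 1: collapse the column axis, one total per row.
        return [sum(row) for row in self.rows]


cells = [[1, 2], [3, 4], [5, 6]]
total = sketch_sum(cells)                 # every cell
per_column = sketch_sum(ToyFrame(cells))  # axis=None became axis=0
```

The same call gives a scalar for the plain nested list and a list of per-column totals for the ToyFrame, which is the shape of the difference reported in the question.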

In order to achieve the same output you have to call sum twice:

In [6]:

a2.sum(axis=0).sum()
Out[6]:
3.9180334059883006

Or

In [7]:

np.sum(np.sum(a2))
Out[7]:
3.9180334059883006
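
The handling of axis=None shown in the trace belongs to the pandas version used there, so it may be safer not to rely on it. Below is a minimal sketch that passes the axis explicitly or drops down to the underlying ndarray. The fixed input values and the use of to_numpy() (available in pandas 0.24 and later) are my assumptions, not part of the original session.

```python
import numpy as np
import pandas as pd

a1 = np.arange(6, dtype=float).reshape(3, 2)  # fixed values instead of random ones
a2 = pd.DataFrame(a1)

per_column = a2.sum(axis=0)          # Series with one total per column
total_twice = a2.sum(axis=0).sum()   # sum the per-column totals
total_numpy = a2.to_numpy().sum()    # convert to an ndarray, then sum every cell
```

Both totals agree with a1.sum(), and neither depends on what axis=None means for a DataFrame.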