numpy.sum() gives strange results on large arrays

Updated: 2023-01-28 16:49:38

I seem to have found a pitfall with using .sum() on numpy arrays but I'm unable to find an explanation. Essentially, if I try to sum a large array then I start getting nonsensical answers but this happens silently and I can't make sense of the output well enough to Google the cause.

For example, this works exactly as expected:

import numpy as np

a = sum(xrange(2000))
print('a is {}'.format(a))

b = np.arange(2000).sum()
print('b is {}'.format(b))

Giving the same output for both:

a is 1999000
b is 1999000

However, this does not work:

c = sum(xrange(200000)) 
print('c is {}'.format(c))

d = np.arange(200000).sum()
print('d is {}'.format(d))

Giving the following output:

c is 19999900000
d is -1474936480

And on an even larger array, it's possible to get back a positive result. This is more insidious because I might not identify that something unusual was happening at all. For example this:

e = sum(xrange(100000000))
print('e is {}'.format(e))

f = np.arange(100000000).sum()
print('f is {}'.format(f))

Gives this:

e is 4999999950000000
f is 887459712

I guessed that this was to do with data types and indeed even using the python float seems to fix the problem:

e = sum(xrange(100000000))
print('e is {}'.format(e))

f = np.arange(100000000, dtype=float).sum()
print('f is {}'.format(f))

Giving:

e is 4999999950000000
f is 4.99999995e+15

I have no background in Comp. Sci. and found myself stuck (perhaps this is a dupe). Things I've tried:

  1. numpy arrays have a fixed size. Nope; this seems to show I should hit a MemoryError first.
  2. I might somehow have a 32-bit installation (probably not relevant); nope, I followed this and confirmed I have 64-bit.
  3. Other examples of weird sum behaviour; nope (?) I found this but I can't see how it applies.

Can someone please explain briefly what I'm missing and tell me what I need to read up on? Also, other than remembering to define a dtype each time, is there a way to stop this happening or give a warning?

Possibly relevant:

Windows 7

numpy 1.11.3

Running out of Enthought Canopy on Python 2.7.9

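This is clearly numpy's integer type overflowing 32 bits. On Windows, numpy 1.x uses a 32-bit default integer (a C long) even under 64-bit Python, which is why confirming a 64-bit installation did not help. Normally you can configure numpy to fail in such situations using np.seterr: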

>>> import numpy as np
>>> np.seterr(over='raise')
{'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under': 'ignore'}
>>> np.int8(127) + np.int8(2)
FloatingPointError: overflow encountered in byte_scalars

However, sum is explicitly documented with the behaviour "No error is raised on overflow", so np.seterr will not help here and you might be out of luck. Numpy often gives up safety checks like this one in exchange for speed!

You can however manually specify the dtype for the accumulator, like this:

>>> a = np.ones(129)
>>> a.sum(dtype=np.int8)  # will overflow
-127
>>> a.sum(dtype=np.int64)  # no overflow
129
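
Applied to the array from the question, the same idea looks roughly like the sketch below. It assumes Python 3 (so range instead of xrange) and forces a 32-bit accumulator explicitly, so the overflow reproduces on any platform rather than only where the default integer is 32 bits. The helper wrap_int32 is a hypothetical name introduced here for illustration; it only shows that the "nonsensical" value is the exact sum reduced modulo 2**32 and reinterpreted as a signed number.

```python
import numpy as np

def wrap_int32(n):
    """Hypothetical helper: the value a signed 32-bit integer would hold for n."""
    return (n + 2**31) % 2**32 - 2**31

n = 200000
exact = sum(range(n))  # Python ints have arbitrary precision, so this is exact

values = np.arange(n, dtype=np.int32)
wrapped = values.sum(dtype=np.int32)  # 32-bit accumulator: wraps around silently
widened = values.sum(dtype=np.int64)  # 64-bit accumulator: enough room here

print(exact)    # 19999900000
print(wrapped)  # -1474936480
print(widened)  # 19999900000
print(wrap_int32(exact) == int(wrapped))  # True
```

The same arithmetic explains the positive result for the larger array: wrap_int32(4999999950000000) is 887459712, which happens to land in the positive half of the 32-bit range.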

Watch ticket #593, because this is an open issue and it might be fixed by numpy devs sometime.
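
In the meantime, if you would rather get an exception than a silently wrong number, one option is a small wrapper around sum. The following is only a sketch: checked_sum is a hypothetical helper (not part of numpy), it assumes Python 3 and integer input, and its check is deliberately conservative, so it may refuse an array whose true sum would in fact fit.

```python
import numpy as np

def checked_sum(arr):
    """Hypothetical helper: sum an integer array with a 64-bit accumulator,
    raising OverflowError if the result might not fit in int64."""
    arr = np.asarray(arr)
    if not np.issubdtype(arr.dtype, np.integer):
        raise TypeError("checked_sum expects an integer array")
    if arr.size == 0:
        return 0
    # Worst-case magnitude of the sum, computed with Python's unbounded ints.
    largest = max(abs(int(arr.min())), abs(int(arr.max())))
    if largest * arr.size > np.iinfo(np.int64).max:
        raise OverflowError("sum may not fit in a 64-bit integer")
    return int(arr.sum(dtype=np.int64))

print(checked_sum(np.arange(200000, dtype=np.int32)))  # 19999900000
```

The bound uses the largest absolute value times the number of elements, which costs two extra passes over the data (min and max) but never lets an overflow through unnoticed.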