
numpy ufuncs speed vs. loop speed


This is normal and expected behaviour. The "avoid for loops with numpy" statement is simply too simplistic to apply everywhere. If you're dealing with inner loops it's (almost) always true, but in the case of outer loops (like in your case) there are far more exceptions. Especially if the alternative is broadcasting, because broadcasting speeds up the operation by using a lot more memory.
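To make that trade-off concrete, here is a minimal sketch (not from the original answer) of how an outer loop can be replaced by broadcasting, at the cost of building an n-by-m intermediate array:

import numpy as np

a = np.arange(3, dtype=float)   # shape (3,)
b = np.arange(4, dtype=float)   # shape (4,)

# Loop version: only len(b) temporary elements per iteration.
loop_result = np.array([np.sum(np.sin(ai * b)) for ai in a])

# Broadcast version: builds a (3, 4) intermediate array in one go.
broadcast_result = np.sin(a[:, np.newaxis] * b).sum(axis=1)

assert np.allclose(loop_result, broadcast_result)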

Just to add a bit of background to that "avoid for loops with numpy" statement:

NumPy arrays are stored as contiguous arrays of C types. A Python int is not the same as a C int! So whenever you iterate over each item in an array, you need to pull the item out of the array, convert it to a Python int, do whatever you want with it, and finally you may need to convert it back to a C integer (this is called boxing and unboxing the value). For example, suppose you want to sum the items in an array using a Python loop:

import numpy as np
arr = np.arange(1000)
%%timeit
acc = 0
for item in arr:
    acc += item
# 1000 loops, best of 3: 478 µs per loop

You'd better use numpy:

%timeit np.sum(arr)
# 10000 loops, best of 3: 24.2 µs per loop

Even if you push the loop into C code by using Python's built-in sum, you're still far from numpy's performance:

%timeit sum(arr)
# 1000 loops, best of 3: 387 µs per loop

There might be exceptions to this rule, but they are really rare, at least as long as there is some equivalent numpy functionality. So whenever you would otherwise iterate over single elements, use numpy instead.

Sometimes a plain Python loop is enough. It's not widely advertised, but numpy functions have a huge overhead compared to Python functions. For example, consider a 3-element array:

arr = np.arange(3)
%timeit np.sum(arr)
%timeit sum(arr)

Which one will be faster?

Solution: The Python function performs better than the numpy solution:

# 10000 loops, best of 3: 21.9 µs per loop  <- numpy
# 100000 loops, best of 3: 6.27 µs per loop <- python
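The overhead is also visible when comparing a numpy ufunc with its math-module counterpart on a single scalar (a quick sketch, not from the original answer; exact timings depend on the machine):

import math
import numpy as np

x = 0.5
%timeit np.sin(x)    # ufunc dispatch and argument wrapping add per-call overhead
%timeit math.sin(x)  # thin wrapper around the C library function, much less overhead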


But what does this have to do with your example? Not all that much, in fact, because you always use numpy functions on arrays (not on single elements and not even on a few elements), so your inner loop already uses the optimized functions. That's why both perform roughly the same (+/- a factor of 10 with very few elements, down to a factor of 2 at roughly 500 elements). But that's not really the loop overhead, it's the function call overhead!
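The profiled snippets below assume that tim, prec and values already exist. A minimal setup reproducing the shapes (the names come from the question; the exact value ranges are my assumption) could look like this:

import numpy as np

resolution = 100                           # 100, 1000, 5000, ... in the runs below
tim = np.linspace(0.0, 1.0, resolution)    # assumed: the time samples
prec = np.linspace(0.0, 1.0, resolution)   # assumed: the values multiplied with each ti
values = np.zeros(resolution)              # output buffer filled by fun_func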

Using line_profiler with resolution = 100:

def fun_func(tim, prec, values):
    for i, ti in enumerate(tim):
        values[i] = np.sum(np.sin(prec * ti))
%lprun -f fun_func fun_func(tim, prec, values)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2       101          752      7.4      5.7      for i, ti in enumerate(tim):
     3       100        12449    124.5     94.3          values[i] = np.sum(np.sin(prec * ti))

95% of the time is spent inside the loop; I've even split the loop body into several parts to verify this:

def fun_func(tim, prec, values):
    for i, ti in enumerate(tim):
        x = prec * ti
        x = np.sin(x)
        x = np.sum(x)
        values[i] = x
%lprun -f fun_func fun_func(tim, prec, values)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2       101          609      6.0      3.5      for i, ti in enumerate(tim):
     3       100         4521     45.2     26.3          x = prec * ti
     4       100         4646     46.5     27.0          x = np.sin(x)
     5       100         6731     67.3     39.1          x = np.sum(x)
     6       100          714      7.1      4.1          values[i] = x

The time consumers are np.multiply, np.sin, np.sum here, as you can easily check by comparing their time per call with their overhead:

arr = np.ones(1, float)
%timeit np.sum(arr)
# 10000 loops, best of 3: 22.6 µs per loop
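The per-call overhead of the other two functions can be measured the same way (a sketch; I omit the numbers since they depend on the machine):

arr = np.ones(1, float)
%timeit np.sin(arr)            # overhead-dominated: only a single element to compute
%timeit np.multiply(arr, 2.0)  # same story for the multiplication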

So as soon as the cumulative function call overhead is small compared to the calculation runtime, you'll have similar runtimes. Even with 100 items you're quite close to the overhead time. The trick is to know at which point they break even. With 1000 items the call overhead is still significant:

%lprun -f fun_func fun_func(tim, prec, values)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2      1001         5864      5.9      2.4      for i, ti in enumerate(tim):
     3      1000        42817     42.8     17.2          x = prec * ti
     4      1000       119327    119.3     48.0          x = np.sin(x)
     5      1000        73313     73.3     29.5          x = np.sum(x)
     6      1000         7287      7.3      2.9          values[i] = x

But with resolution = 5000 the overhead is quite low compared to the runtime:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2      5001        29412      5.9      0.9      for i, ti in enumerate(tim):
     3      5000       388827     77.8     11.6          x = prec * ti
     4      5000      2442460    488.5     73.2          x = np.sin(x)
     5      5000       441337     88.3     13.2          x = np.sum(x)
     6      5000        36187      7.2      1.1          values[i] = x

When you spend 500µs in each np.sin call, you don't care about the 20µs overhead anymore.

A word of caution is probably in order: line_profiler probably includes some additional overhead per line, and maybe also per function call, so the point at which the function call overhead becomes negligible might be lower!
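A simple cross-check that is free of the line_profiler overhead is to time the whole function with %timeit for a few resolutions (a sketch):

%timeit fun_func(tim, prec, values)  # repeat after rebuilding tim, prec and values with resolution = 100, 1000, 5000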

I started by profiling the first solution; let's do the same with the second solution:

def fun_func(tim, prec, values):
    x = tim[:, np.newaxis]
    x = x * prec
    x = np.sin(x)
    x = np.sum(x, axis=1)
    return x

Again using the line_profiler with resolution=100:

%lprun -f fun_func fun_func(tim, prec, values)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2         1           27     27.0      0.5      x = tim[:, np.newaxis]
     3         1          638    638.0     12.9      x = x * prec
     4         1         3963   3963.0     79.9      x = np.sin(x)
     5         1          326    326.0      6.6      x = np.sum(x, axis=1)
     6         1            4      4.0      0.1      return x

This already exceeds the overhead time significantly and thus we end up a factor 10 faster compared to the loop.

I also profiled resolution=1000:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2         1           28     28.0      0.0      x = tim[:, np.newaxis]
     3         1        17716  17716.0     14.6      x = x * prec
     4         1        91174  91174.0     75.3      x = np.sin(x)
     5         1        12140  12140.0     10.0      x = np.sum(x, axis=1)
     6         1           10     10.0      0.0      return x

And with resolution=5000:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2         1           34     34.0      0.0      x = tim[:, np.newaxis]
     3         1       333685 333685.0     11.1      x = x * prec
     4         1      2391812 2391812.0    79.6      x = np.sin(x)
     5         1       280832 280832.0      9.3      x = np.sum(x, axis=1)
     6         1           14     14.0      0.0      return x

The 1000 size is still faster, but as we've seen there, the call overhead was still not negligible in the loop solution. For resolution = 5000, however, the time spent in each step is almost identical (some steps are a bit slower, others faster, but overall quite similar).

Another effect is that the actual broadcasting when you do the multiplication becomes significant. Even with the very smart numpy solutions this still includes some additional calculations. For resolution=10000 you see that the broadcasting multiplication starts to take up more "% time" in relation to the loop solution:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def broadcast_solution(tim, prec, values):
     2         1           37     37.0      0.0      x = tim[:, np.newaxis]
     3         1      1783345 1783345.0    13.9      x = x * prec
     4         1      9879333 9879333.0    77.1      x = np.sin(x)
     5         1      1153789 1153789.0     9.0      x = np.sum(x, axis=1)
     6         1           11     11.0      0.0      return x


Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           def loop_solution(tim, prec, values):
     9     10001        62502      6.2      0.5      for i, ti in enumerate(tim):
    10     10000      1287698    128.8     10.5          x = prec * ti
    11     10000      9758633    975.9     79.7          x = np.sin(x)
    12     10000      1058995    105.9      8.6          x = np.sum(x)
    13     10000        75760      7.6      0.6          values[i] = x


But there is another thing besides the actual time spent: the memory consumption. Your loop solution requires O(n) memory because you only ever process n elements at a time. The broadcasting solution, however, requires O(n*n) memory. You probably have to wait some time if you use resolution=20000 with your loop, but it would still only require 8 bytes/element * 20000 elements ≈ 160 kB, while the broadcasting solution needs ~3 GB. And this neglects the constant factor (like temporary arrays, a.k.a. intermediate arrays)! If you go even further, you'll run out of memory very fast!
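The ~3 GB figure follows directly from the size of the (n, n) float64 intermediate; a quick back-of-the-envelope check using the same numbers:

resolution = 20000
bytes_per_float = 8                                    # float64
loop_memory = bytes_per_float * resolution             # ~160 kB: one row of n elements at a time
broadcast_memory = bytes_per_float * resolution ** 2   # ~3.2 GB: the full (n, n) intermediate
print(loop_memory / 1e3, "kB,", broadcast_memory / 1e9, "GB")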

Time to summarize the points again:

  • If you operate on single items of a numpy array in a Python loop, you're doing it wrong.
  • If you loop over sub-arrays of a numpy array, make sure the function call overhead in each iteration is negligible compared to the time spent inside the functions.
  • If you broadcast numpy arrays, make sure you don't run out of memory.

But the most important point about optimization is still:

  • Only optimize the code if it's too slow! If it's too slow then optimize only after profiling your code.

Don't blindly trust simplified statements and again never optimize without profiling.

One final thought:

Such functions that require either a loop or broadcasting can be easily implemented with cython, numba or numexpr if there is no existing solution in numpy or scipy.

For example, a numba function that combines the memory efficiency of the loop solution with the speed of the broadcast solution at low resolutions would look like this:

from numba import njit

import math

@njit
def numba_solution(tim, prec, values):
    size = tim.size
    for i in range(size):
        ti = tim[i]
        x = 0
        for j in range(size):
            x += math.sin(prec[j] * ti)
        values[i] = x
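A possible way to call it, reusing the arrays from the profiling runs (my sketch, not part of the original answer; the first call includes the JIT compilation time):

values = np.zeros(tim.size)
numba_solution(tim, prec, values)    # first call compiles the function
%timeit numba_solution(tim, prec, values)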

As pointed out in the comments, numexpr can also evaluate the broadcasted calculation very fast and without requiring O(n*n) memory:

>>> import numexpr
>>> tim_2d = tim[:, np.newaxis]
>>> numexpr.evaluate('sum(sin(tim_2d * prec), axis=1)')