Pandas: How to select a column in a rolling window

Updated: 2022-01-17 22:42:41

Your error stems from assuming that the function inside apply receives a dataframe. It actually receives an ndarray, not a dataframe.

Pandas dataframe apply works on each column/series of the dataframe, so any function passed to apply is applied along each column/series, like an internal lambda. With a windowed dataframe, apply takes each column/series inside each window and passes it to the function as an ndarray (this is the raw=True behaviour; with raw=False the function gets a Series instead), and the function has to return a single value per series per window. Knowing this saves a lot of pain.
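To make the point above concrete, here is a minimal sketch that records what the function actually receives. The dataframe and the helper names (make_recorder, seen_raw, seen_series) are made up for illustration and are not part of the original question.

```python
import pandas as pd

# Small made-up dataframe, only for illustration.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

def make_recorder(log):
    """Return a function that logs the type and length of each window it gets."""
    def record(window):
        log.append((type(window).__name__, len(window)))
        return window.sum()  # apply expects one scalar per column per window
    return record

seen_raw, seen_series = [], []
sums = df.rolling(2).apply(make_recorder(seen_raw), raw=True)    # windows arrive as ndarray
df.rolling(2).apply(make_recorder(seen_series), raw=False)       # windows arrive as Series

print(seen_raw[0], seen_series[0])
print(sums)
```

In both cases the function sees one column of one window at a time, never the windowed dataframe as a whole.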

So in your case you cannot use apply, unless you write a complex function that remembers the first value of series a for each window.

For the OP's case, suppose a column of the window, say a, has to meet a condition, say a > 10:

  1. When a in the first row of a window has to meet the condition, it is the same as searching the dataframe with df[df['a']>10].

For other conditions, such as a > 10 in the second row of a window, checking the entire dataframe still works, except for the first window of the dataframe, whose first row can never be the second row of any window.
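The idea above can be sketched without apply: filter the whole column once, then drop the hits that cannot sit at the wanted position of a full window. The sample values and the window size of 3 are made up for brevity, and the boundary handling is my reading of "except for the first window", not something spelled out in the original answer.

```python
import pandas as pd

# Made-up sample column; assumes the default 0..n-1 integer index.
a = pd.Series([13, 17, 0, 0, 19, 16, 2, 16], name="a")
roll_window = 3    # rows per window (illustrative)
search_index = 1   # 0-based position inside the window, so 1 is the second row

hits = a[a > 10]                          # every row meeting the condition
starts = hits.index - search_index        # where each hit's window would begin
last_start = len(a) - roll_window         # last position a full window can start at
valid = hits[(starts >= 0) & (starts <= last_start)]

print(valid.index.tolist())  # [1, 4, 5]
```

Rows 0 and 7 also satisfy a > 10 but are dropped, because no full window has them as its second row.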

The following example demonstrates another way to solve it.

import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,20,size=(20, 4)), columns=list('abcd'))

df looks like

    a   b   c   d
0   13  2   2   6
1   17  19  10  1
2   0   17  15  9
3   0   14  0   15
4   19  14  4   0
5   16  4   17  3
6   2   7   2   15
7   16  7   9   3
8   6   1   2   1
9   12  8   3   10
10  5   0   11  2
11  10  13  18  4
12  15  11  12  6
13  13  19  16  6
14  14  7   11  7
15  1   11  5   18
16  17  12  18  17
17  1   19  12  9
18  16  17  3   3
19  11  7   9   2

Now select a window if the second row inside the rolling window of a meets the condition a > 10, as in the OP's question.

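roll_window = 5    # rows per window
search_index = 1   # 0-based position inside the window (1 is the second row)

df_roll = df['a'].rolling(roll_window)
# raw=True passes each window as an ndarray, so x[search_index] is positional
df_y = df_roll.apply(lambda x: x[search_index] if x[search_index] > 10 else np.nan, raw=True).dropna()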

The line above returns every value of a that sits in the second row of a window and is greater than 10. Note that the values are correct for the example dataframe above, but the index labels depend on how the rolling window is aligned: by default each result is labelled with the last row of its window, not with the row that holds the value.

4     17.0
7     19.0
8     16.0
10    16.0
12    12.0
15    15.0
16    13.0
17    14.0
19    17.0
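The index labels in this output can be surprising, so here is a small sketch of how rolling labels its results and how to map a label back to the source row. The series and the names pos, picked and source_labels are made up for illustration, and the arithmetic assumes the default 0..n-1 integer index and the default right-aligned window (center=False).

```python
import pandas as pd

s = pd.Series([10, 11, 12, 13, 14, 15])  # made-up values
window = 3
pos = 0  # report the first element of every window

picked = s.rolling(window).apply(lambda x: x[pos], raw=True)
# picked is NaN, NaN, 10.0, 11.0, 12.0, 13.0:
# each value appears at the label of the LAST row of its window.

found = picked.dropna()
# Step back to the window start (label - window + 1), then forward by pos.
source_labels = found.index + pos - window + 1

print(list(source_labels))                # [0, 1, 2, 3]
print(s.loc[source_labels].tolist())      # [10, 11, 12, 13]
```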

To get the right index location and the entire row from the original dataframe:

df.loc[df_y.index + search_index - roll_window + 1]

returns

    a   b   c   d
1   17  19  10  1
4   19  14  4   0
5   16  4   17  3
7   16  7   9   3
9   12  8   3   10
12  15  11  12  6
13  13  19  16  6
14  14  7   11  7
16  17  12  18  17

One could also use np.array(df), build rolling slices that correspond to the rolling windows, and filter the array with those slices.
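One possible way to do this, sketched here with numpy's sliding_window_view (available in NumPy 1.20 and later), is shown below. The original answer does not say which slicing technique it had in mind, so treat this as one option; names such as windows, start_rows and matching_windows are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

np.random.seed(123)
df = pd.DataFrame(np.random.randint(0, 20, size=(20, 4)), columns=list("abcd"))

roll_window = 5
search_index = 1                  # 0-based position inside the window
arr = df.to_numpy()
col_a = df.columns.get_loc("a")

# Shape is (n_windows, n_columns, roll_window): the window axis is appended last.
windows = sliding_window_view(arr, roll_window, axis=0)

# Test column a at the chosen position of every window in one vectorised step.
mask = windows[:, col_a, search_index] > 10
start_rows = np.flatnonzero(mask)                    # first row of each matching window
matching_rows = df.iloc[start_rows + search_index]   # the rows that met the condition
matching_windows = windows[mask].transpose(0, 2, 1)  # (n_matches, roll_window, n_columns)

print(matching_rows)
```

Because the windows are views on the original array, no per-window Python function is called, and each matching window is available with all four columns at once.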