且构网

Replace df with repeated sequences

Updated: 2022-04-14 19:45:33

You say some teams "reappear", and at that point I thought the little intergroup helper function from this answer might be just the right tool. It is useful when, as in your case, a team such as "w" reappears in the same year (e.g. 2013) after another team such as "c" has been there for some time. If you want to treat each consecutive run of a team as a separate group, so that you can get the first or last date of that run, this function helps. Note that if you only group by "team" and "year", as you normally would, each team (e.g. "w") can have only one first/last date per year (for example when using "summarise" in dplyr).

Define the function:

intergroup <- function(var, start = 1) {
  cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}
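
To see what the helper returns, here is a small sketch on made-up vectors (the names `teams`, `run_id`, `rle_id` and `three` are illustrative and not from the question). It also cross-checks the result against a base R `rle()` construction, which is an alternative added here for comparison and not part of the original answer.

```r
# Helper from the answer: the id increases whenever the value changes.
intergroup <- function(var, start = 1) {
  cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}

# Made-up example: "w" appears, then "c", then "w" again.
teams <- c("w", "w", "c", "c", "w")
run_id <- intergroup(teams)
print(run_id)  # 1 1 2 2 3

# Cross-check with run-length encoding (base R): number each run in order.
runs <- rle(teams)
rle_id <- rep(seq_along(runs$lengths), runs$lengths)
print(rle_id)  # 1 1 2 2 3

# With three or more distinct values the ids can skip numbers, because the
# factor codes may differ by more than 1. The runs are still kept apart.
three <- c("a", "c", "b")
print(intergroup(three))  # 1 3 4
```

The ids only need to differ between neighbouring runs for the grouping to work, so the gaps in the last example are harmless.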

Now group your data by year first, and then add a second grouping level by applying the intergroup function to the teams column:

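library(dplyr)
df %>%
  group_by(year) %>%
  # add teamindex as a second grouping level; dplyr >= 1.0.0 spells the
  # argument .add (older versions used add = TRUE)
  group_by(teamindex = intergroup(teams), .add = TRUE) %>%
  # keep the row(s) with the earliest date in each group
  filter(dense_rank(dates) == 1)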

Finally, you can filter according to your needs. Here, for example, I keep the earliest date in each group. The result is:

#Source: local data frame [3 x 4]
#Groups: year, teamindex
#
#       dates teams year teamindex
#1 2013-01-01     w 2013         1
#2 2013-01-04     c 2013         2
#3 2013-01-12     w 2013         3

Note that team "w" appears twice because we grouped by "teamindex", which we created with the intergroup function.
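
The difference from ordinary grouping can be sketched as follows. The data frame here is made up for illustration (it is not akrun's data), the names `plain` and `runs` are hypothetical, and the code assumes dplyr 1.0.0 or later for `.add` and `.groups`.

```r
library(dplyr)

# Helper from the answer.
intergroup <- function(var, start = 1) {
  cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}

# Hypothetical data: 14 consecutive days, team "w", then "c", then "w" again.
df <- data.frame(
  dates = as.Date("2013-01-01") + 0:13,
  teams = rep(c("w", "c", "w"), times = c(3, 8, 3)),
  stringsAsFactors = FALSE
)
df$year <- as.integer(format(df$dates, "%Y"))

# Ordinary grouping: one row per team and year, so "w" has a single first date.
plain <- df %>%
  group_by(year, teams) %>%
  summarise(first_date = min(dates), .groups = "drop")

# Grouping by run: each consecutive stretch of a team gets its own first date.
runs <- df %>%
  group_by(year) %>%
  group_by(teamindex = intergroup(teams), .add = TRUE) %>%
  summarise(team = first(teams), first_date = min(dates), .groups = "drop")

print(nrow(plain))  # 2
print(nrow(runs))   # 3
```

With this made-up data, `plain` has two rows (one for "c", one for "w"), while `runs` has three, because the second stretch of "w" is kept as its own group.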

Another way to do the filtering is to sort with arrange and then take the first row of each group with slice:

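df %>%
  group_by(year) %>%
  # .add = TRUE in dplyr >= 1.0.0 (older versions used add = TRUE)
  group_by(teamindex = intergroup(teams), .add = TRUE) %>%
  arrange(dates) %>%
  # take the first row of each group, i.e. the earliest date
  slice(1)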
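
If you need the last date of each run instead of the first, the same pattern should work with `slice(n())`. This is a sketch on made-up data (not akrun's data), with the hypothetical name `last_rows`, assuming dplyr 1.0.0 or later.

```r
library(dplyr)

# Helper from the answer.
intergroup <- function(var, start = 1) {
  cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}

# Hypothetical data: 14 consecutive days, team "w", then "c", then "w" again.
df <- data.frame(
  dates = as.Date("2013-01-01") + 0:13,
  teams = rep(c("w", "c", "w"), times = c(3, 8, 3)),
  stringsAsFactors = FALSE
)
df$year <- as.integer(format(df$dates, "%Y"))

last_rows <- df %>%
  group_by(year) %>%
  group_by(teamindex = intergroup(teams), .add = TRUE) %>%
  arrange(dates) %>%
  # n() is the group size, so slice(n()) keeps the last row of each group
  slice(n()) %>%
  ungroup()

# Last dates per run for this data: 2013-01-03, 2013-01-11, 2013-01-14
print(last_rows)
```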

The data I used is from akrun's answer.