How to select random non-consecutive dates for each grouped element?

Updated: 2023-02-17 17:58:30

Please check whether this serves the purpose. Actually, selecting the maximum possible number of dates under the provided criteria is difficult (at least for me). We can identify dates in consecutive and non-consecutive groups with the following strategy. But consider two scenarios arising from a group of, say, 3 consecutive dates: if the random sample contains 2 units, those units can themselves be consecutive or non-consecutive. If we then further selected only the odd rows (2 of them) or the even row (1 of them), the sample would, in my opinion, be judgmental rather than random. This is the strategy adopted (the bullet list follows the short illustration below) -
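To make the two scenarios concrete, here is a minimal sketch (the three dates are invented purely for illustration) of how a random sample of 2 out of 3 consecutive dates can itself turn out consecutive or non-consecutive:

# three consecutive dates (toy data, for illustration only)
dates <- as.Date("2020-03-01") + 0:2

# each draw of 2 can land on adjacent days or on the two end days
sort(sample(dates, 2))  # possible result: "2020-03-01" "2020-03-02" (consecutive)
sort(sample(dates, 2))  # possible result: "2020-03-01" "2020-03-03" (non-consecutive)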

  • split the data into groups
  • carried out the operations on each group separately through purrr::map_df, which finally row-binds the data
  • divided each group's dates into consecutive and non-consecutive runs (each run of consecutive dates ends up in its own sub-group) and selected one random row from each run; a small sketch of this run-splitting step follows the list
  • finally selected three rows (or fewer, depending on what the group yields) from these
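Before the full pipeline, here is a minimal sketch of the run-splitting step on toy data for a single site (the site name and dates are invented for illustration): complete() fills the calendar gaps with NA rows, and cumsum(is.na(n)) then gives every run of consecutive dates its own group id.

library(tidyverse)

# toy data: one site with a run of 3 consecutive dates plus an isolated date
x <- tibble(Site = "A",
            Date = as.Date(c("2020-03-01", "2020-03-02", "2020-03-03", "2020-03-07")))

x %>%
  mutate(n = 1) %>%                                                    # mark the observed dates
  complete(Date = seq.Date(first(Date), last(Date), by = "days")) %>%  # fill calendar gaps with NA rows
  group_by(n = cumsum(is.na(n))) %>%                                   # one group id per consecutive run
  filter(!is.na(Site))                                                 # drop the filler rows again
# 2020-03-01/02/03 share one group id; 2020-03-07 sits in its own group

The full pipeline below applies the same idea to every Site and then draws the final sample.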
library(tidyverse)

df %>%
  ungroup() %>%                  # the supplied dput was grouped; drop that grouping first
  group_split(Site) %>%          # one tibble per Site
  map_df(., ~ .x %>% ungroup() %>%
        arrange(Date) %>%
        mutate(n = 1) %>%        # mark the observed dates
        # fill the calendar gaps between first and last date with NA rows
        complete(Date = seq.Date(first(Date), last(Date), by = 'days')) %>%
        # every run of consecutive dates gets its own group id
        group_by(n = cumsum(is.na(n))) %>%
        filter(!is.na(Site)) %>% # drop the filler rows again
        sample_n(1) %>%          # one random date per consecutive run
        ungroup() %>%
        sample_n(min(n(), 3))) %>%  # at most three dates per Site
  select(-n)

# A tibble: 86 x 2
   Date       Site   
   <date>     <chr>  
 1 2020-03-04 HP36P1B
 2 2020-03-04 HP36P3B
 3 2020-03-04 HP36P4B
 4 2020-03-07 HP37P1B
 5 2020-03-12 HP37P1B
 6 2020-03-07 HP37P2B
 7 2020-03-12 HP37P2B
 8 2020-03-07 HP37P4B
 9 2020-03-12 HP37P4B
10 2020-03-04 HP4008R
# ... with 76 more rows

Note: your dput was grouped, so I had to add ungroup() on the second line of the code; you may remove it.
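As an optional sanity check (assuming the pipeline's result has been assigned to a variable, here called res; that name is not part of the original answer), you can confirm that no Site ends up with two adjacent dates:

# res is assumed to hold the result of the pipeline above
res %>%
  group_by(Site) %>%
  summarise(non_consecutive = all(diff(sort(Date)) > 1))
# every value should be TRUE: no Site received two adjacent dates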