且构网


R combining duplicate rows in a time series with different column types in a datatable

Updated: 2023-12-01 13:38:16

This question builds on another question, "R combining duplicate rows by ID with different column types in a dataframe". I have a data.table with a column time and several other columns of different types (factors and numerics). Here is an example:

library(data.table)

dt <- data.table(time  = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
                 abst  = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
                 farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "blau")),
                 gier  = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2),
                 goff  = as.factor(c("haus", "maus", "toll", NA, "haus", NA, "maus", NA, NA, NA, NA, NA, "maus")),
                 huft  = as.factor(c(NA, NA, NA, NA, NA, "wolle", NA, NA, "wolle", NA, NA, "holz", NA)),
                 mode  = c(4, 2, NA, NA, 6, 5, 0, NA, NA, NA, NA, NA, 3))

Now I want to combine the rows that share the same value in column time. Each numeric column should become the mean over all rows with the same time (ignoring the NAs!). The factor columns should be combined into one value; the NAs can be omitted. The expected result is:

dtRes <- data.table(time  = c(1, 1, 1, 2, 3, 4, 4),
                    abst  = c(1, 1, 1, 0, 0, 3, 3),
                    farbe = as.factor(c("keine", "keine", "keine", "keine", "keine", "rot", "blau")),
                    gier  = c(2.5, 2.5, 2.5, 0, 0, 3, 3),
                    goff  = as.factor(c("haus", "maus", "toll", "maus", NA, "maus", "maus")),
                    huft  = as.factor(c(NA, NA, NA, "wolle", "wolle", "holz", "holz")),
                    mode  = c(4, 4, 4, 2.5, NA, 3, 3))

I need this to be fast, because I have about a million observations.

Some extra thoughts on this problem: farbe may not be unique within a time. In that case, I think the best option for my data is to duplicate the row, so that the two rows share the same time and all other values and differ only in farbe. This should be a very rare case, but handling it would be a great addition.

Also, my real data has many more numeric and factor columns, so I don't want to specify every single column separately. Some data tables have no factor columns at all. So the solution has to work even if there are no factor columns, or no numeric columns other than time (time is always present and numeric).

Thx in advance!

The expected result (for the given sample dataset) can also be achieved without a subsequent call to separate_rows() or cSplit():

library(data.table) # version 1.12.9
dt[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) 
            else unlist(na.omit(unique(x)))), by = time]

   time abst farbe gier goff  huft mode
1:    1    1 keine  2.5 haus  <NA>  4.0
2:    1    1 keine  2.5 maus  <NA>  4.0
3:    1    1 keine  2.5 toll  <NA>  4.0
4:    2    0 keine  0.0 maus wolle  2.5
5:    3    0 keine  0.0 <NA> wolle  NaN
6:    4    3   rot  3.0 maus  holz  3.0
7:    4    3  blau  3.0 maus  holz  3.0

Please note that this approach works for an arbitrary mix of numeric and factor columns; no column names need to be stated explicitly.
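As a quick check of that claim, here is a minimal sketch on a table with numeric columns only. The table dtNum and the helper name aggCol are made up for illustration and are not part of the question. The helper also shows one possible way to report NA instead of NaN for groups where a numeric column is entirely NA, which is closer to the expected result in the question; whether that is wanted is an assumption.

```r
library(data.table)

# Hypothetical numeric-only table (not from the original question)
dtNum <- data.table(time = c(1, 1, 2, 2),
                    a    = c(1, NA, 3, 5),
                    b    = c(NA, 2, NA, NA))

# Same aggregation rule as above, wrapped in a helper; `aggCol` is an illustrative name
aggCol <- function(x) {
  if (!is.numeric(x)) return(unlist(na.omit(unique(x))))
  m <- mean(x, na.rm = TRUE)
  if (is.nan(m)) NA_real_ else m  # all-NA group: report NA instead of NaN
}

# No column names are spelled out; .SD covers every non-grouping column
resNum <- dtNum[, lapply(.SD, aggCol), by = time]
resNum
```

The non-numeric branch is simply never taken here, so nothing has to change when a table has no factor columns.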


However, I do believe the correct answer to the underlying problem is to return one row per time instead of a kind of partial aggregate (your mileage may vary, of course):

dt[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) 
                   else list(na.omit(unique(x)))), by = time]

   time abst    farbe gier           goff  huft mode
1:    1    1    keine  2.5 haus,maus,toll        4.0
2:    2    0    keine  0.0           maus wolle  2.5
3:    3    0    keine  0.0                wolle  NaN
4:    4    3 rot,blau  3.0           maus  holz  3.0

Please note that list() rather than toString() has been used to aggregate the factor columns. This avoids problems if one of the factor levels happens to contain a comma. It also makes it easier to identify cases with non-unique factors per time in a large production dataset:

# compute aggregate as before
dtRes <- dt[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) 
                   else list(na.omit(unique(x)))), by = time]
# find cases with non-unique factors per group
# note .SDcols = is.list is available with data.table version 1.12.9
tmp <- dtRes[, which(Reduce(sum, lapply(.SD, function(x) lengths(x) > 1L)) > 0), .SDcols = is.list, by = time]
tmp

   time V1
1:    1  1
2:    4  1
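If the installed data.table version does not accept a function in .SDcols, the list columns can probably be located with base R instead. The following is only a sketch on a small made-up table; dtSmall, resSmall, listCols and the flag column multi are illustrative names, not part of the original answer.

```r
library(data.table)

# Hypothetical sample: time 1 has two distinct colors, time 2 only one
dtSmall <- data.table(time  = c(1, 1, 2, 2),
                      value = c(1, 3, NA, 4),
                      color = factor(c("red", "blue", "green", NA)))

# Aggregate as before: mean for numerics, list of distinct values otherwise
resSmall <- dtSmall[, lapply(.SD, function(x) {
  if (is.numeric(x)) mean(x, na.rm = TRUE) else list(na.omit(unique(x)))
}), by = time]

# Names of the list columns, found with base R only
listCols <- names(resSmall)[vapply(resSmall, is.list, logical(1))]

# TRUE where any list column holds more than one distinct value
resSmall[, multi := Reduce(`|`, lapply(.SD, function(x) lengths(x) > 1L)),
         .SDcols = listCols]
resSmall
```

This sketch assumes at least one list column exists; with no factor columns there is nothing to flag and the last step should be skipped.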

# show affected rows
dtRes[tmp, on = "time"]

   time abst    farbe gier           goff huft mode V1
1:    1    1    keine  2.5 haus,maus,toll         4  1
2:    4    3 rot,blau  3.0           maus holz    3  1

# show unaffected rows
dtRes[!tmp, on = "time"]

   time abst farbe gier goff  huft mode
1:    2    0 keine    0 maus wolle  2.5
2:    3    0 keine    0      wolle  NaN
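The earlier remark about commas in factor levels can be illustrated with a small sketch. The table dtComma and its column place are invented for this example; the point is only that a comma-separated string can no longer be split back reliably, while a list column keeps the true number of distinct values per group.

```r
library(data.table)

# Hypothetical data: one factor level itself contains a comma
dtComma <- data.table(time  = c(1, 1, 2),
                      place = factor(c("Paris, France", "Rome", "Oslo")))

# Aggregate once as a string and once as a list column
asString <- dtComma[, .(place = toString(na.omit(unique(place)))), by = time]
asList   <- dtComma[, .(place = list(na.omit(unique(place)))), by = time]

# Splitting the string on ", " reports three values for time 1 instead of two
nFromString <- lengths(strsplit(asString$place, ", ", fixed = TRUE))
# The list column keeps the true count of distinct values per group
nFromList <- lengths(asList$place)

nFromString
nFromList
```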