且构网

Summarizing mixed data by a factor column

Updated: 2023-02-06 17:34:38

You could write your own functions and then apply them with lapply. First, write a function that finds the most frequent level of a factor variable:

# Return the most frequent level of a factor (the first one if several tie)
getmode <- function(v) {
  levels(v)[which.max(table(v))]
}
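A quick check of how this behaves may help. Because getmode relies on levels(), it only works on factors: levels() of a plain character vector is NULL, so the function returns NULL for that input. The variant below, getmode_any, is a hypothetical alternative (not part of the original answer) that reads the label from the table instead and so handles both types:

```r
# getmode as defined in the answer: works on factors only
getmode <- function(v) {
  levels(v)[which.max(table(v))]
}

# Hypothetical variant: takes the label from the table itself,
# so it also works on character vectors
getmode_any <- function(v) {
  names(which.max(table(v)))
}

f <- factor(c("x", "y", "y"))
mode_factor    <- getmode(f)                      # most frequent level of the factor
mode_character <- getmode(c("x", "y", "y"))       # NULL: character vectors have no levels
mode_any       <- getmode_any(c("x", "y", "y"))   # works for character input too
```

If your categorical columns are stored as characters, either convert them with factor() first or use something along the lines of getmode_any.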

Then write a function that returns either the mean or the mode, depending on the type of variable passed to it:

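my_summary <- function(x, id, ...) {
  # Numeric columns: mean within each id group
  if (is.numeric(x)) {
    return(tapply(x, id, mean))
  }
  # Factor columns: most frequent level within each id group
  if (is.factor(x)) {
    return(tapply(x, id, getmode))
  }
}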

Finally, use lapply to calculate the summaries:

data.frame(lapply(df, my_summary, id = df$IDnumber))
  IDnumber Gender   Weight LikesSoda
1        1   Male 81.33333        No
2        2 Female 68.00000       Yes
3        3 Female 52.00000       Yes
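For anyone who wants to run this end to end, here is a minimal sketch. The original df is not shown here, so the data below is hypothetical, chosen only so that the group means and modes agree with the table above.

```r
# Hypothetical sample data (an assumption; the original df is not shown)
df <- data.frame(
  IDnumber  = c(1, 1, 1, 2, 2, 3),
  Gender    = factor(c("Male", "Male", "Male", "Female", "Female", "Female")),
  Weight    = c(80, 81, 83, 67, 69, 52),
  LikesSoda = factor(c("No", "No", "Yes", "Yes", "Yes", "Yes"))
)

# Most frequent level of a factor
getmode <- function(v) {
  levels(v)[which.max(table(v))]
}

# Mean for numeric columns, mode for factor columns, both per id group
my_summary <- function(x, id, ...) {
  if (is.numeric(x)) {
    return(tapply(x, id, mean))
  }
  if (is.factor(x)) {
    return(tapply(x, id, getmode))
  }
}

# One row per IDnumber
result <- data.frame(lapply(df, my_summary, id = df$IDnumber))
print(result)
```

Note that IDnumber is numeric here, so it goes through the mean branch as well; the mean of an id within its own group is just that id.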

If two or more levels of a factor share the same maximum frequency, which.max returns only the first of them. I understand from your comment that you just want to know how often this happens, so one option is to amend the getmode function slightly so that it appends an asterisk to the level whenever there is a tie:

getmode <- function(v) {
  tab <- table(v)
  # More than one level reaches the maximum count: flag the tie with an asterisk
  if (sum(tab %in% max(tab)) > 1) return(paste(levels(v)[which.max(tab)], '*'))
  levels(v)[which.max(tab)]
}

(After changing your sample data so that IDnumber == "2" has one Female and one Male:)

data.frame(lapply(df, my_summary, id = df$IDnumber))

  IDnumber   Gender   Weight LikesSoda
1        1     Male 81.33333        No
2        2 Female * 68.00000       Yes
3        3   Female 52.00000       Yes
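One thing worth keeping in mind about the flagged value: which level gets reported in a tie depends only on the order of the factor levels, which is alphabetical by default. The small sketch below applies the tie-aware getmode to a single tied group (the vectors are made up for illustration):

```r
# Tie-aware getmode, as in the answer
getmode <- function(v) {
  tab <- table(v)
  if (sum(tab %in% max(tab)) > 1) return(paste(levels(v)[which.max(tab)], '*'))
  levels(v)[which.max(tab)]
}

# A tied group with the default (alphabetical) level order
tied_default <- factor(c("Female", "Male"))
res_default  <- getmode(tied_default)

# The same values, but with "Male" placed first among the levels
tied_reordered <- factor(c("Female", "Male"), levels = c("Male", "Female"))
res_reordered  <- getmode(tied_reordered)

# A group without a tie is returned unmarked
res_clear <- getmode(factor(c("Male", "Male", "Female")))
```

So "Female *" in the table means "there is a tie, and Female comes first in the level order", not that Female is in any sense the better answer.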

I'm afraid that's a bit of a messy 'solution', but if you just want to get an idea of how common that issue is, perhaps it will be sufficient for your needs.
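If the goal is only to see how common ties are, a somewhat cleaner route might be to count them directly instead of marking them in the output. The helpers has_tie and count_ties below are hypothetical names, and the data is made up (with one Female and one Male for IDnumber 2), so treat this as a sketch rather than a drop-in solution:

```r
# Hypothetical sample data with a tie in Gender for IDnumber 2
df <- data.frame(
  IDnumber  = c(1, 1, 1, 2, 2, 3),
  Gender    = factor(c("Male", "Male", "Male", "Male", "Female", "Female")),
  Weight    = c(80, 81, 83, 67, 69, 52),
  LikesSoda = factor(c("No", "No", "Yes", "Yes", "Yes", "Yes"))
)

# TRUE when more than one level reaches the maximum count
has_tie <- function(v) {
  tab <- table(v)
  sum(tab == max(tab)) > 1
}

# For every factor column, count the id groups in which the mode is tied
count_ties <- function(data, id) {
  factor_cols <- Filter(is.factor, data)
  sapply(factor_cols, function(x) sum(tapply(x, id, has_tie)))
}

ties <- count_ties(df, df$IDnumber)
print(ties)
```

This returns one count per factor column, so you can see at a glance which variables are affected. A group consisting entirely of NA values would be reported as a tie, since every level then has a count of zero.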