且构网

R: find duplicates in one column and collapse the matching values in a second column

Updated: 2022-11-14 13:14:28


I have a data frame with two columns containing character strings. In one column (named probes) I have duplicated cases (that is, several rows with the same character string). For each value in probes I want to find all the rows containing the same string, and then merge the values of the corresponding rows in the second column (named genes) into a single row. For example, if I have this structure:

    probes  genes
1   cg00050873  TSPY4
2   cg00061679  DAZ1
3   cg00061679  DAZ4
4   cg00061679  DAZ4

I want to change it to this structure:

    probes  genes
1   cg00050873  TSPY4
2   cg00061679  DAZ1 DAZ4 DAZ4

Obviously there is no problem doing this for a single probe using which, and then paste with collapse:

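# Rows whose probe ID matches the one of interest
ind <- which(olap$probes == "cg00061679")
# Gene names on those rows (second column)
genename <- olap[ind, 2]
# Join them into one space-separated string
genecomb <- paste(genename, collapse = " ")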

But I'm not sure how to extract the indices of the duplicates in the probes column across the whole data frame. Any ideas?

Thanks in advance

You can use tapply in base R:

data.frame(probes=unique(olap$probes), 
           genes=tapply(olap$genes, olap$probes, paste, collapse=" "))

or use plyr:

library(plyr)
ddply(olap, "probes", summarize, genes = paste(genes, collapse=" "))
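If plyr is not installed, base R's aggregate can produce the same two-column result. This is a sketch of my own rather than part of the approaches above; the sample data frame is reconstructed from the table in the question, and the name result is just for illustration.

```r
# Sample data reconstructed from the question's example table
olap <- data.frame(
  probes = c("cg00050873", "cg00061679", "cg00061679", "cg00061679"),
  genes  = c("TSPY4", "DAZ1", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

# Group genes by probes; collapse = " " is passed through to paste,
# so each group becomes one space-separated string
result <- aggregate(genes ~ probes, data = olap, FUN = paste, collapse = " ")
print(result)
```

The grouping column and the collapsed values come back from the same call, so they cannot get out of step with each other.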

UPDATE

It's probably safer in the first version to do this:

tmp <- tapply(olap$genes, olap$probes, paste, collapse=" ")
data.frame(probes=names(tmp), genes=tmp)

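This guards against unique returning the probes in a different order from tapply: unique keeps the order in which values first appear, whereas tapply orders its result by the sorted levels of the grouping column. Personally I would always use ddply.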
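A small sketch of how the two orders can disagree. The rows below are a hypothetical reordering of the question's data (my assumption, chosen so that the first probe seen is not the first alphabetically), and the names first_seen, by_level, mismatched and safe are only for illustration.

```r
# Hypothetical reordering of the question's rows: the probe that appears
# first is not the one that sorts first
olap <- data.frame(
  probes = c("cg00061679", "cg00050873", "cg00061679", "cg00061679"),
  genes  = c("DAZ1", "TSPY4", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

tmp <- tapply(olap$genes, olap$probes, paste, collapse = " ")

first_seen <- unique(olap$probes)  # order of first appearance
by_level   <- names(tmp)           # order of the sorted grouping levels

# Labelling the tapply result with unique() attaches genes to the wrong probe
mismatched <- data.frame(probes = first_seen, genes = as.vector(tmp),
                         stringsAsFactors = FALSE)

# Taking the labels from the tapply result keeps each probe with its genes
safe <- data.frame(probes = by_level, genes = as.vector(tmp),
                   stringsAsFactors = FALSE)

print(mismatched)
print(safe)
```

With the rows in the order shown in the question the two orders happen to coincide, which is why the first version looks correct on that example.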