且构网

Extracting values from a column with tidyr

Updated: 2023-01-30 19:27:56

You could try strsplit:

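# Split each GOs string on ";" and paste the terms back together by prefix (C, F, P)
res <- do.call(rbind.data.frame,lapply(strsplit(annot$GOs, ";"), 
      function(x) tapply(x, sub(':.*', '', x), FUN=paste, collapse=";")))

# Put the Name column back and give the three new columns readable names
res1 <-  data.frame(Name=annot[,1], setNames(res, c('Component',
     'Function', 'P')), stringsAsFactors=FALSE)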

res1
#   Name                             Component
#1 dd_1     C:extracellular space;C:cell body
#2 dd_2       C:Signal transduction;C:nucleus
#3 dd_3 C:cardiomyceltes;C:intracellular pace
#                                                 Function
#1                                      F:transport carrier
#2                                    F:positive regulation
#3 F:putative;F:magnesium ion binding;F:calcium ion binding
#                                       P
#1 P:cell migration process;P:NF/ß pathway
#2 P:single organism;P:positive regulation
#3 P:visual perception;P:blood coagulation
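
To see what the tapply step does, it can help to run it on a single string. The following is only a sketch: the value of go is rebuilt from the dd_2 row of the printed result (the original annot is not shown here, so this input is an assumption), and the names go, terms, prefix and grouped are made up for illustration.

```r
# Hypothetical single value, rebuilt from the dd_2 row of the printed result.
go <- "C:Signal transduction;C:nucleus;F:positive regulation;P:single organism;P:positive regulation"

terms  <- strsplit(go, ";")[[1]]   # one element per GO term
prefix <- sub(":.*", "", terms)    # keep only the text before the first ":"

# Group the terms by prefix and paste each group back together with ";".
grouped <- tapply(terms, prefix, FUN = paste, collapse = ";")

print(prefix)
print(grouped)
```

The result is a named one-dimensional array with one element per prefix, ordered alphabetically (C, F, P), which is why the rows can be stacked with rbind afterwards.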

Alternatively, you can use extract from tidyr:

library(tidyr)
extract(annot, GOs, c('C', 'F', 'P'), '(C:[^F]+);(F:[^P]+);(P:.*)')
# Name                                      C
#1 dd_1     C:extracellular space;C:cell body
#2 dd_2       C:Signal transduction;C:nucleus
#3 dd_3 C:cardiomyceltes;C:intracellular pace
#                                                        F
#1                                      F:transport carrier
#2                                    F:positive regulation
#3 F:putative;F:magnesium ion binding;F:calcium ion binding
#                                       P
#1 P:cell migration process;P:NF/ß pathway
#2 P:single organism;P:positive regulation
#3 P:visual perception;P:blood coagulation
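
To inspect what the three capture groups in that pattern pick up, the same regular expression can be tried on single strings in base R. This is a rough sketch that uses regexec and regmatches as a stand-in, not tidyr itself; both sample strings are assumptions (the first is rebuilt from the dd_2 row of the output, the second is made up and has no "F:" term).

```r
# Same pattern as in the extract() call: three capture groups, one per prefix.
pattern <- "(C:[^F]+);(F:[^P]+);(P:.*)"

# Hypothetical inputs (see the note above).
full    <- "C:Signal transduction;C:nucleus;F:positive regulation;P:single organism;P:positive regulation"
partial <- "C:Signal transduction;C:nucleus;P:single organism"

# regexec() + regmatches() return the whole match followed by each group;
# dropping the first element leaves just the three groups.
groups_full    <- regmatches(full, regexec(pattern, full))[[1]][-1]
# A string that does not match yields an empty character vector.
groups_partial <- regmatches(partial, regexec(pattern, partial))[[1]]

print(groups_full)
print(length(groups_partial))
```

Because the pattern requires a C part, then an F part, then a P part, a string that lacks one of the prefixes does not match at all. The character classes [^F] and [^P] also assume that no capital F appears inside the C terms and no capital P inside the F terms.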

Update

In the new dataset, some elements (e.g. "C", "F") are missing from individual rows. You could modify the first solution so that an absent category is returned as NA:

res <- do.call(rbind.data.frame,lapply(strsplit(annot$GOs, "; "),function(x){
      x1 <- tapply(x, sub(':.*', '', x), FUN=paste, collapse=";")
      # match() gives NA for a prefix that is absent, so that entry becomes NA
      x1[match(c('C', 'F', 'P'),  names(x1))]}))
res1 <-  data.frame(Name=annot[,1], setNames(res, c('Component',
         'Function', 'P')), stringsAsFactors=FALSE)
head(res1,2)
#  Name                         Component              Function
#1 dd_1 C:extracellular space;C:cell body                  <NA>
#2 dd_2   C:Signal transduction;C:nucleus F:positive regulation
#                                          P
#1    P:cell migration process;P:NF/ß pathway
#2 P:single organism;P:positive(+) regulation
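
For experimenting with the missing-category case, here is a self-contained variant of the same idea. It is a sketch under assumptions rather than the code above: the annot data frame is a simplified, made-up input loosely based on the printed rows, collapse_by_prefix is a hypothetical helper, and the rows are stacked into a character matrix with rbind instead of rbind.data.frame.

```r
# Hypothetical input: dd_1 has no "F:" term, terms are separated by "; ".
annot <- data.frame(
  Name = c("dd_1", "dd_2"),
  GOs  = c("C:extracellular space; C:cell body; P:cell migration process",
           "C:Signal transduction; C:nucleus; F:positive regulation; P:single organism"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse the terms of one string by prefix and
# return them in a fixed order, with NA for any prefix that is absent.
collapse_by_prefix <- function(go, sep = "; ", keys = c("C", "F", "P")) {
  terms   <- strsplit(go, sep, fixed = TRUE)[[1]]
  grouped <- tapply(terms, sub(":.*", "", terms), FUN = paste, collapse = ";")
  # match() returns NA for a missing key, which indexes to an NA value.
  out <- as.vector(grouped[match(keys, names(grouped))])
  setNames(out, keys)
}

# One row per input string, one column per key.
mat <- do.call(rbind, lapply(annot$GOs, collapse_by_prefix))
colnames(mat) <- c("Component", "Function", "P")

res1 <- data.frame(Name = annot$Name, mat, stringsAsFactors = FALSE)
print(res1)
```

Fixing the order of keys up front means every row yields the same three columns, so the result stays rectangular even when a category is absent.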