Why is plyr so slow?



I think I am using plyr incorrectly. Could someone please tell me if this is 'efficient' plyr code?

require(plyr)
plyr <- function(dd) ddply(dd, .(price), summarise, ss=sum(volume)) 

A little context: I have a few large aggregation problems and I have noted that they were each taking some time. In trying to solve the issues, I became interested in the performance of various aggregation procedures in R.

I tested a few aggregation methods - and found myself waiting around all day.

When I finally got results back, I discovered a huge gap between the plyr method and the others - which makes me think that I've done something dead wrong.

I ran the following code (I thought I'd check out the new dataframe package while I was at it):

require(plyr)
require(data.table)
require(dataframe)
require(rbenchmark)
require(xts)

plyr <- function(dd) ddply(dd, .(price), summarise, ss=sum(volume)) 
t.apply <- function(dd) unlist(tapply(dd$volume, dd$price, sum))
t.apply.x <- function(dd) unlist(tapply(dd[,2], dd[,1], sum))
l.apply <- function(dd) unlist(lapply(split(dd$volume, dd$price), sum))
l.apply.x <- function(dd) unlist(lapply(split(dd[,2], dd[,1]), sum))
b.y <- function(dd) unlist(by(dd$volume, dd$price, sum))
b.y.x <- function(dd) unlist(by(dd[,2], dd[,1], sum))
agg <- function(dd) aggregate(dd$volume, list(dd$price), sum)
agg.x <- function(dd) aggregate(dd[,2], list(dd[,1]), sum)
dtd <- function(dd) dd[, sum(volume), by=(price)]

obs <- c(5e1, 5e2, 5e3, 5e4, 5e5, 5e6, 5e6, 5e7, 5e8)
timS <- timeBasedSeq('20110101 083000/20120101 083000')

bmkRL <- list(NULL)

for (i in 1:5){
  tt <- timS[1:obs[i]]

  for (j in 1:8){
    pxl <- seq(0.9, 1.1, by= (1.1 - 0.9)/floor(obs[i]/(11-j)))
    px <- sample(pxl, length(tt), replace=TRUE)
    vol <- rnorm(length(tt), 1000, 100)

    d.df <- base::data.frame(time=tt, price=px, volume=vol)
    d.dfp <- dataframe::data.frame(time=tt, price=px, volume=vol)
    d.matrix <- as.matrix(d.df[,-1])
    d.dt <- data.table(d.df)

    listLabel <- paste('i=',i, 'j=',j)

    bmkRL[[listLabel]] <- benchmark(plyr(d.df), plyr(d.dfp), t.apply(d.df),     
                         t.apply(d.dfp), t.apply.x(d.matrix), 
                         l.apply(d.df), l.apply(d.dfp), l.apply.x(d.matrix),
                         b.y(d.df), b.y(d.dfp), b.y.x(d.matrix), agg(d.df),
                         agg(d.dfp), agg.x(d.matrix), dtd(d.dt),
          columns =c('test', 'elapsed', 'relative'),
          replications = 10,
          order = 'elapsed')
  }
}

The test was supposed to check up to 5e8, but it took too long - mostly due to plyr. The final table, for the 5e5 run, shows the problem:

$`i= 5 j= 8`
                  test  elapsed    relative
15           dtd(d.dt)    4.156    1.000000
6        l.apply(d.df)   15.687    3.774543
7       l.apply(d.dfp)   16.066    3.865736
8  l.apply.x(d.matrix)   16.659    4.008422
4       t.apply(d.dfp)   21.387    5.146054
3        t.apply(d.df)   21.488    5.170356
5  t.apply.x(d.matrix)   22.014    5.296920
13          agg(d.dfp)   32.254    7.760828
14     agg.x(d.matrix)   32.435    7.804379
12           agg(d.df)   32.593    7.842397
10          b.y(d.dfp)   98.006   23.581809
11     b.y.x(d.matrix)   98.134   23.612608
9            b.y(d.df)   98.337   23.661453
1           plyr(d.df) 9384.135 2257.972810
2          plyr(d.dfp) 9384.448 2258.048123

Is this right? Why is plyr 2250x slower than data.table? And why didn't using the new data frame package make a difference?

The session info is:

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xts_0.8-6        zoo_1.7-7        rbenchmark_0.3   dataframe_2.5    data.table_1.8.1     plyr_1.7.1      

loaded via a namespace (and not attached):
[1] grid_2.15.1    lattice_0.20-6 tools_2.15.1 

Why is it so slow? A little research located a mailing-list posting from Aug. 2011 where @hadley, the package author, states:

This is a drawback of the way that ddply always works with data frames. It will be a bit faster if you use summarise instead of data.frame (because data.frame is very slow), but I'm still thinking about how to overcome this fundamental limitation of the ddply approach.
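To make the quote concrete, here is a minimal sketch of the two per-group result-building styles it contrasts. The toy data is my own assumption; any price/volume data frame would do.

library(plyr)

set.seed(42)
dd <- data.frame(price  = sample(seq(0.9, 1.1, by = 0.01), 1e4, replace = TRUE),
                 volume = rnorm(1e4, 1000, 100))

## The slow path the quote describes: building each group's result
## with data.frame() inside the per-group function.
head(ddply(dd, .(price), function(x) data.frame(ss = sum(x$volume))))

## The somewhat faster path: letting summarise assemble the result.
head(ddply(dd, .(price), summarise, ss = sum(volume)))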


As for what counts as efficient plyr code, I didn't know either. After a bunch of parameter testing and benchmarking, it looks like we can do better.

The summarise() in your command is just a helper function, pure and simple. We can replace it with our own sum function, since it isn't helping with anything that isn't already simple, and the .data and .(price) arguments can be made more explicit. The result is

ddply( dd[, 2:3], ~price, function(x) sum( x$volume ) )

summarise may seem nice, but it just isn't quicker than a simple function call. That makes sense; just look at our little function versus the code for summarise. Running your benchmarks with the revised formula yields a noticeable gain. Don't take that to mean you've used plyr incorrectly; you haven't. It just isn't efficient, and nothing you can do with it will make it as fast as other options.
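A sketch of how one might measure that gain, reusing the question's rbenchmark approach on assumed toy data in the same time/price/volume layout (absolute numbers will vary by machine):

library(plyr)
library(rbenchmark)

set.seed(1)
n  <- 1e5
dd <- data.frame(time   = seq_len(n),
                 price  = sample(seq(0.9, 1.1, by = 0.001), n, replace = TRUE),
                 volume = rnorm(n, 1000, 100))

benchmark(original  = ddply(dd, .(price), summarise, ss = sum(volume)),
          optimized = ddply(dd[, 2:3], ~price, function(x) sum(x$volume)),
          columns = c('test', 'elapsed', 'relative'),
          replications = 5,
          order = 'elapsed')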

In my opinion the optimized function still stinks: it isn't clear, it has to be mentally parsed, and it is still ridiculously slow compared with data.table (even with the 60% gain).


In the same thread mentioned above (https://groups.google.com/forum/?fromgroups#!msg/manipulatr/Xo3-2FBI35k/9pClNUuxoPIJ), regarding the slowness of plyr, a plyr2 project is mentioned. Since the time of the original answer to the question, the plyr author has released dplyr as the successor of plyr. While both plyr and dplyr are billed as data manipulation tools, and your primary stated interest is aggregation, you may still be interested in the new package's benchmark results for comparison, since it has a reworked backend to improve performance.

plyr_Original   <- function(dd) ddply( dd, .(price), summarise, ss=sum(volume))
plyr_Optimized  <- function(dd) ddply( dd[, 2:3], ~price, function(x) sum( x$volume ) )

dplyr <- function(dd) dd %.% group_by(price) %.% summarize( sum(volume) )    

data_table <- function(dd) dd[, sum(volume), keyby=price]
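For completeness, here is an assumed setup for calling the four competitors above, with toy data in the question's time/price/volume layout so that plyr_Optimized's dd[, 2:3] selects price and volume. Note the %.% operator matches the dplyr 0.1.2 pinned in the session info below; current dplyr uses %>% instead, and needs dtplyr for data.tables.

library(plyr)        # load plyr first; dplyr masks some of its verbs
library(dplyr)
library(data.table)

set.seed(1)
n    <- 1e5
d.df <- data.frame(time   = seq_len(n),
                   price  = sample(seq(0.9, 1.1, by = 0.001), n, replace = TRUE),
                   volume = rnorm(n, 1000, 100))
d.dt <- data.table(d.df)

head(plyr_Original(d.df))   # slowest
head(plyr_Optimized(d.df))
head(dplyr(d.df))           # dplyr() runs on either structure
head(dplyr(d.dt))
head(data_table(d.dt))      # needs the data.table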

The dataframe package has been removed from CRAN and subsequently from the tests, along with the matrix function versions.

Here are the i=5, j=8 benchmark results:

$`obs= 500,000 unique prices= 158,286 reps= 5`
                  test elapsed relative
9     data_table(d.dt)   0.074    1.000
4          dplyr(d.dt)   0.133    1.797
3          dplyr(d.df)   1.832   24.757
6        l.apply(d.df)   5.049   68.230
5        t.apply(d.df)   8.078  109.162
8            agg(d.df)  11.822  159.757
7            b.y(d.df)  48.569  656.338
2 plyr_Optimized(d.df) 148.030 2000.405
1  plyr_Original(d.df) 401.890 5430.946

No doubt the optimizing helped a bit. Take a look at the d.df functions; they just can't compete.

For a little perspective on the slowness of the data.frame structure, here are micro-benchmarks of the aggregation times of data_table and dplyr on a larger test dataset (i=8, j=8).

$`obs= 50,000,000 unique prices= 15,836,476 reps= 5`
Unit: seconds
             expr    min     lq median     uq    max neval
 data_table(d.dt)  1.190  1.193  1.198  1.460  1.574    10
      dplyr(d.dt)  2.346  2.434  2.542  2.942  9.856    10
      dplyr(d.df) 66.238 66.688 67.436 69.226 86.641    10
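The call producing a table in that shape would presumably look like this (an assumed invocation reusing the i=8 objects and the functions defined above; times = 10 matches the neval column and unit = 's' the displayed seconds):

library(microbenchmark)

res <- microbenchmark(data_table(d.dt),
                      dplyr(d.dt),
                      dplyr(d.df),
                      times = 10)
print(res, unit = 's')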

The data.frame is still left in the dust. Not only that, but here's the elapsed system.time to populate the data structures with the test data:

`d.df` (data.frame)  3.181 seconds.
`d.dt` (data.table)  0.418 seconds.
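A guess at how those numbers were captured: wrap each constructor in system.time(), with tt, px and vol taken from the question's data-generating loop at i=8. Machine-dependent, of course.

library(data.table)

system.time(d.df <- data.frame(time = tt, price = px, volume = vol))
system.time(d.dt <- data.table(time = tt, price = px, volume = vol))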

Both creation and aggregation of the data.frame are slower than those of the data.table.

Working with the data.frame in R is slower than some alternatives, but as the benchmarks show, the built-in R functions blow plyr out of the water. Even managing the data.frame as dplyr does, which improves on the built-ins, doesn't give optimal speed; whereas data.table is faster both in creation and in aggregation, and it does what it does while working with/upon data.frames.
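That last point is easy to verify: a data.table inherits from data.frame, so frame-oriented code continues to work on it.

library(data.table)

dt <- data.table(price = c(1, 1, 2), volume = c(10, 20, 30))
class(dt)                      # "data.table" "data.frame"
dt[, sum(volume), by = price]  # data.table aggregation syntax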

In the end...

plyr is slow because of the way it works with and manages data.frames.

[punt:: see the comments to the original question].


## R version 3.0.2 (2013-09-25)
## Platform: x86_64-pc-linux-gnu (64-bit)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] microbenchmark_1.3-0 rbenchmark_1.0.0     xts_0.9-7           
## [4] zoo_1.7-11           data.table_1.9.2     dplyr_0.1.2         
## [7] plyr_1.8.1           knitr_1.5.22        
## 
## loaded via a namespace (and not attached):
## [1] assertthat_0.1  evaluate_0.5.2  formatR_0.10.4  grid_3.0.2     
## [5] lattice_0.20-27 Rcpp_0.11.0     reshape2_1.2.2  stringr_0.6.2  
## [9] tools_3.0.2

Data-Generating gist .rmd