Updated: 2023-01-15 19:12:04
This approach takes a vector v (corresponding to your read_vec), identifies sequences of rows to read, feeds those to sequential calls to fread(...), and rbinds the results together.
If the rows you want are randomly distributed throughout the file, this may not be faster. However, if the rows are in blocks (e.g., c(1:50, 55, 70, 100:500, 700:1500)), then there will be few calls to fread(...) and you may see a significant improvement.
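To make the block idea concrete, here is a minimal sketch of how rle() turns a logical vector into (start, length) pairs. The 12-row toy vector and the names wanted, runs, and blocks are made up for illustration; only base R is used.

```r
# Toy example: rows to read out of 12 (hypothetical values)
wanted <- c(1:3, 5, 9:12)
v <- 1:12 %in% wanted            # TRUE where the row should be read

# Run-length encode v: consecutive TRUEs form one block
runs <- rle(v)

# Start of each TRUE run = (total length of all earlier runs) + 1
starts <- c(0, cumsum(runs$lengths))[which(runs$values)] + 1
lens   <- runs$lengths[runs$values]

blocks <- data.frame(start = starts, length = lens)
print(blocks)
# start = 1, 5, 9 and length = 3, 1, 4: three blocks, so three fread() calls
```

The number of rows in blocks is the number of fread(...) calls needed, which is why contiguous selections benefit most.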
# create sample dataset
set.seed(1)
m <- matrix(rnorm(1e5),ncol=10)
csv <- data.frame(x=1:1e4,m)
write.csv(csv,"test.csv")
# s: rows we want to read
s <- c(1:50,53, 65,77,90,100:200,350:500, 5000:6000)
# v: logical, TRUE means read this row (equivalent to your read_vec)
v <- (1:1e4 %in% s)
# runs: run-length encoding of v (named "runs" to avoid masking base::seq)
runs <- rle(v)
# start of each TRUE run = total length of all earlier runs + 1
idx <- c(0, cumsum(runs$lengths))[which(runs$values)] + 1
# indx: start = starting row of sequence, length = length of sequence (compare to s)
indx <- data.frame(start=idx, length=runs$lengths[which(runs$values)])
library(data.table)
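# one fread() call per block; skip = start skips the header line plus the
# start-1 data rows before the block. header=FALSE keeps the default column
# names (V1, V2, ...) consistent across chunks so they can be bound together.
result <- rbindlist(lapply(seq_len(nrow(indx)), function(i)
  fread("test.csv", nrows=indx$length[i], skip=indx$start[i], header=FALSE)))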
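As a sanity check, the block-wise read can be compared against reading the whole file and subsetting. The following is a self-contained sketch under stated assumptions: the 100-row temporary file, its column names id and value, and the selection in wanted are hypothetical and are not part of the original example.

```r
library(data.table)

# Hypothetical small file written to a temporary path
set.seed(1)
df <- data.frame(id = 1:100, value = rnorm(100))
path <- tempfile(fileext = ".csv")
fwrite(df, path)

# Rows to read (hypothetical), converted to (start, length) blocks
wanted <- c(1:10, 15, 40:60)
runs   <- rle(1:100 %in% wanted)
starts <- c(0, cumsum(runs$lengths))[which(runs$values)] + 1
lens   <- runs$lengths[runs$values]

# One fread() call per block; skip = start skips the header plus earlier rows
chunks <- Map(function(start, len)
  fread(path, skip = start, nrows = len, header = FALSE,
        col.names = names(df)),
  starts, lens)
blockwise <- rbindlist(chunks)

# Reference result: read everything, then subset by row number
full <- fread(path)[wanted]

stopifnot(nrow(blockwise) == length(wanted))
stopifnot(identical(blockwise$id, full$id))
stopifnot(isTRUE(all.equal(blockwise$value, full$value)))
```

If the checks pass silently, both approaches returned the same rows. Whether the block-wise version is actually faster depends on how many blocks there are and on the file size, so it is worth timing both on your own data.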