且构网

Sharing the things programmers run into during development...

How to read specific rows of a CSV file with the fread function

Updated: 2023-01-15 19:12:04

This approach takes a vector v (corresponding to your read_vec), identifies sequences of rows to read, feeds those to sequential calls to fread(...), and rbinds the result together.

If the rows you want are randomly distributed throughout the file, this may not be faster. However, if the rows are in blocks (e.g., c(1:50, 55, 70, 100:500, 700:1500)) then there will be few calls to fread(...) and you may see a significant improvement.
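
To make the "few calls" point concrete, here is a minimal base R sketch that counts how many contiguous runs the example row vector contains, since each run would need one fread(...) call. It assumes the row numbers are sorted and unique; the names rows, run_id, runs, n_calls and n_rows are illustrative only.

```r
# Rows from the example above: three blocks and two single rows
rows <- c(1:50, 55, 70, 100:500, 700:1500)

# A new run starts wherever the gap to the previous row is not exactly 1
run_id <- cumsum(c(1, diff(rows) != 1))

# Summarise each run as (start, length); one fread() call is needed per run
runs <- data.frame(
  start  = as.vector(tapply(rows, run_id, min)),
  length = as.vector(tapply(rows, run_id, length))
)

n_calls <- nrow(runs)    # number of sequential reads required
n_rows  <- length(rows)  # number of rows actually requested
```

Here 1254 requested rows collapse into 5 runs, so 5 reads are enough. A vector of 1254 scattered rows would instead need up to 1254 reads, which is the case where this approach may not help.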

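library(data.table)

# create sample dataset: 10,000 rows, an id column x plus 10 random columns
set.seed(1)
m   <- matrix(rnorm(1e5), ncol = 10)
csv <- data.frame(x = 1:1e4, m)
write.csv(csv, "test.csv")  # also writes a header line and a row-name column

# s: rows we want to read
s <- c(1:50, 53, 65, 77, 90, 100:200, 350:500, 5000:6000)
# v: logical, TRUE means read this row (equivalent to your read_vec)
v <- (1:1e4 %in% s)

# collapse v into runs of consecutive TRUE/FALSE values
# (named runs rather than seq, to avoid masking base::seq)
runs <- rle(v)
# first row of each TRUE run
idx  <- c(0, cumsum(runs$lengths))[which(runs$values)] + 1
# indx: start = starting row of sequence, length = length of sequence (compare to s)
indx <- data.frame(start = idx, length = runs$lengths[which(runs$values)])

# one fread() call per run; skip = start skips the header line plus the
# (start - 1) data rows before the run. header = FALSE keeps fread from
# guessing, so columns come back as V1, V2, ...
result <- rbindlist(Map(function(start, len) fread("test.csv", skip = start, nrows = len, header = FALSE),
                        indx$start, indx$length))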
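
As a follow-up, the same idea can be wrapped in a function that also restores the column names and lets you check the result against the original data. This is only a sketch: read_rows() is a hypothetical helper, not part of data.table, and it assumes the file has exactly one header line, no row-name column, and no blank or comment lines before the data.

```r
library(data.table)

# Small sample file with one header line and no row-name column
set.seed(1)
csv  <- data.frame(x = 1:100, matrix(rnorm(300), ncol = 3))
path <- tempfile(fileext = ".csv")
write.csv(csv, path, row.names = FALSE)

# read_rows() is a hypothetical helper, not a data.table function
read_rows <- function(file, rows) {
  # logical mask over rows 1..max(rows), then runs of TRUE/FALSE
  keep   <- seq_len(max(rows)) %in% rows
  r      <- rle(keep)
  starts <- (cumsum(r$lengths) - r$lengths + 1)[r$values]
  lens   <- r$lengths[r$values]

  # read only the header line to recover the column names
  col_names <- names(fread(file, nrows = 0, header = TRUE))

  # skip = start skips the header plus the (start - 1) rows before the run
  chunks <- Map(function(start, len) {
    fread(file, skip = start, nrows = len, header = FALSE)
  }, starts, lens)

  out <- rbindlist(chunks)
  setnames(out, col_names)
  out
}

s      <- c(1:5, 8, 20:30, 100)
result <- read_rows(path, s)
```

Comparing result$x with s is a quick sanity check that the skip offsets line up with the requested rows; if the file were written with row names or had extra leading lines, the offsets would need adjusting.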