且构网


Fast XML parsing in R

Updated: 2023-11-25 13:49:22


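library(XML)

# doc is assumed to be a document already parsed with xmlParse()
d = xmlRoot(doc)
size = xmlSize(d)

# First pass: collect the union of child element names across all records
names = NULL
for(i in seq_len(size)){
    v = getChildrenStrings(d[[i]])
    names = unique(c(names, names(v)))
}

# Second pass: write a header row, then one comma-separated line per record
# (fields missing from a record are written as NA)
cat(paste(names, collapse=","), "\n", sep="", file="a.csv")
for(i in seq_len(size)){
    v = getChildrenStrings(d[[i]])
    cat(paste(v[names], collapse=","), "\n", sep="", file="a.csv", append=TRUE)
}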

This finishes in about 0.4 seconds for a 1000x100 XML document. If you already know the variable names, you can omit the first for loop entirely.
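As a minimal sketch of the single-loop variant, assuming the column names are known in advance: the sample document, its tag names (`rows`, `row`, `id`, `name`) and the temporary output file are all made up for illustration and are not part of the original answer.

```r
library(XML)

# Hypothetical sample document; the tag names are illustrative only.
xml_text = paste0(
    "<rows>",
    "<row><id>1</id><name>foo</name></row>",
    "<row><id>2</id></row>",
    "</rows>"
)
d = xmlRoot(xmlParse(xml_text, asText=TRUE))
size = xmlSize(d)

# Assumption: the variable names are known, so no discovery pass is needed.
names = c("id", "name")

# Write a header row, then one line per record.
out = tempfile(fileext=".csv")
cat(paste(names, collapse=","), "\n", sep="", file=out)
for (i in seq_len(size)) {
    v = getChildrenStrings(d[[i]])
    # A field missing from this record is written as NA.
    cat(paste(v[names], collapse=","), "\n", sep="", file=out, append=TRUE)
}

result = read.csv(out, stringsAsFactors=FALSE)
```

Here the second record has no `name` element, so that cell should be read back as a missing value.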

Note: if your XML content contains commas or quotation marks, you will have to take special care of them, since they break the plain comma-joined output. In that case, I recommend the next method.
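If you would rather stay with the CSV approach, one possible way to handle such values is to quote every field in the usual CSV style. The sketch below uses only base R; `csv_quote` is a hypothetical helper written for this illustration, and the sample values are invented.

```r
# Hypothetical helper: wrap each field in double quotes and
# double any quote characters embedded in the value.
csv_quote = function(x) {
    paste0('"', gsub('"', '""', x, fixed=TRUE), '"')
}

# Invented record whose second field contains both quotes and a comma.
v = c(id="1", note='He said "hi", then left')

header = paste(csv_quote(names(v)), collapse=",")
line = paste(csv_quote(v), collapse=",")

out = tempfile(fileext=".csv")
writeLines(c(header, line), out)

# read.csv understands doubled quotes inside quoted fields.
back = read.csv(out, stringsAsFactors=FALSE)
```

In the loop from the first method, the same helper would be applied to `v[names]` before `paste()`. Missing values would need their own handling, which this sketch leaves out.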

If you want to construct the data.frame dynamically, you can do it with data.table. It is a little slower than the CSV method above, but faster than filling a plain data.frame.

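library(data.table)

# Preallocate an all-character table, one row per record (reuses d, size and names from above)
m = data.table(matrix(NA_character_, nrow=size, ncol=length(names)))
setnames(m, names)
for(i in seq_len(size)){
    v = getChildrenStrings(d[[i]])
    # Assign this record's fields by reference; fields it lacks stay NA
    m[i, (names(v)) := as.list(v)]
}
# Convert each column from character to its natural type
for (n in names) m[, (n) := type.convert(m[[n]], as.is=TRUE)]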

It finishes in about 1.1 seconds for the same document.