且构网

R: parsing XML always returns an XML declaration error

Updated: 2021-10-06 07:24:05

Note: This post is edited from the original version.

The object lesson here is that just because a file has an xml extension does not mean it is well-formed XML.

If @MartinMorgan is correct about the file, Google seems to have taken all the patents approved during the week of 2014-07-22 (last week), converted them to XML, strung them together into a single text file, and given that file an xml extension. Clearly this is not well-formed XML. So the challenge is to deconstruct that file. Here is a way to do it in R.

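library(XML)

# The file is many XML documents concatenated into one text file.
lines   <- readLines("ipg140722.xml")
# Each document starts with its own XML declaration line.
start   <- grep('<?xml version="1.0" encoding="UTF-8"?>',lines,fixed=TRUE)
# Each document ends on the line before the next declaration (or at the last line).
end     <- c(start[-1]-1,length(lines))

# xmlParse() returns an internal document, which the XPath subsetting used later requires.
get.xml <- function(i) {
  txt <- paste(lines[start[i]:end[i]],collapse="\n")
  # print(i)
  xmlParse(txt,asText=TRUE)
  # return(i)
}
docs <- lapply(1:10,get.xml)
class(docs[[1]])
# [1] "XMLInternalDocument" "XMLAbstractDocument"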

So now docs is a list of parsed XML documents. These can be accessed individually as, e.g., docs[[1]], or collectively using something like the code below, which extracts the invention title from each document.

sapply(docs,function(doc) xmlValue(doc["//invention-title"][[1]]))
#  [1] "Phallus retention harness"                          "Dress/coat"                                        
#  [3] "Shirt"                                              "Shirt"                                             
#  [5] "Sandal"                                             "Shoe"                                              
#  [7] "Footwear"                                           "Flexible athletic shoe sole"                       
#  [9] "Shoe outsole with a surface ornamentation contrast" "Shoe sole"                                         

And no, I did not make up the name of the first patent.
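
The one-liner above assumes every document contains an invention-title element; indexing with [[1]] fails if one does not. A minimal sketch of a more forgiving variant follows. The two toy documents and the helper name first.value are invented for illustration and are not part of the original answer.

```r
library(XML)

# Two invented documents; the second has no <invention-title>.
docs.toy <- list(
  xmlParse("<patent><invention-title>Shoe</invention-title></patent>", asText = TRUE),
  xmlParse("<patent><abstract>none</abstract></patent>", asText = TRUE)
)

# first.value() is a hypothetical helper: the text of the first node
# matching an XPath expression, or NA when nothing matches.
first.value <- function(doc, path) {
  hits <- xpathSApply(doc, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

titles <- vapply(docs.toy, first.value, character(1), path = "//invention-title")
titles
```

Documents without a title then show up as NA in the result instead of stopping the whole run.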

Response to OP's comment

My original post, which detected the start of a new document using:

start   <- grep("xml version",lines,fixed=T)

was too naive: it turns out the phrase "xml version" appears in the text of some of the patents. So this was splitting (some of) the documents prematurely, resulting in malformed XML. The code above fixes that problem. If you uncomment the two lines in the function get.xml(...), print(i) and return(i), and run the code above with

docs <- lapply(1:length(start),get.xml)

you will see that all 6961 documents parse correctly.
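
The false match is easy to reproduce on invented data. In the sketch below, toy is a made-up two-document vector in which one document's text happens to mention the phrase; it stands in for the real file and is not taken from it.

```r
decl <- '<?xml version="1.0" encoding="UTF-8"?>'

# Invented stand-in for readLines("ipg140722.xml"): two documents,
# the first of which mentions "xml version" in its body text.
toy <- c(decl,
         "<doc>",
         "<p>The header carries an xml version attribute.</p>",
         "</doc>",
         decl,
         "<doc/>")

naive <- grep("xml version", toy, fixed = TRUE)  # also matches the body text
exact <- grep(decl, toy, fixed = TRUE)           # matches declaration lines only

naive  # 1 3 5
exact  # 1 5
```

With the naive pattern the first document would be cut at line 3, leaving an unclosed doc element, which is the kind of malformed fragment described above.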

But there is another problem: the parsed XML is very large, so if you leave these lines as comments (so that every parsed document is kept in docs) and try to parse the full set, you run out of memory about halfway through (or I did, on an 8GB system). There are two ways to work around this. The first is to do the parsing in blocks (say 2000 documents at a time). The second is to extract whatever information you need for your CSV file in get.xml(...) and discard the parsed document at each step.
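
Both workarounds can be sketched together on toy data. Everything below is illustrative: lines is an invented two-document stand-in for the real file, and get.title and block.size are hypothetical names, not part of the original answer. The sketch also assumes every document has an invention-title.

```r
library(XML)

# Invented stand-in for the real file: two tiny documents back to back.
decl  <- '<?xml version="1.0" encoding="UTF-8"?>'
lines <- c(decl, "<patent><invention-title>Shoe</invention-title></patent>",
           decl, "<patent><invention-title>Shirt</invention-title></patent>")
start <- grep(decl, lines, fixed = TRUE)
end   <- c(start[-1] - 1, length(lines))

# Second workaround: return only the extracted string, so the parsed
# tree is not kept in the result list.
get.title <- function(i) {
  txt <- paste(lines[start[i]:end[i]], collapse = "\n")
  doc <- xmlParse(txt, asText = TRUE)
  xpathSApply(doc, "//invention-title", xmlValue)[[1]]
}

# First workaround: walk through the document indices in blocks.
# block.size is a hypothetical setting; the text suggests about 2000.
block.size <- 2000
blocks <- split(seq_along(start), ceiling(seq_along(start) / block.size))
titles <- unlist(lapply(blocks, function(idx) sapply(idx, get.title)),
                 use.names = FALSE)
titles
```

The resulting character vector is small enough to hold in memory and can be written out with write.csv once all blocks are done.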