
Scraping an HTML table that spans multiple pages with R

Updated: 2023-12-01 23:20:10

You can build the url dynamically with paste0, since the urls differ only slightly: for a given year, you change just the page number. You get a url structure like:

url <- paste0(url1,year,url2,page,url3) ## you change page or year or both
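As a minimal sketch (using the url fragments defined further down), paste0 concatenates its arguments with no separator, so swapping in a different page or year yields that page's url:

```r
## paste0() glues the fragments together; numbers are coerced to strings.
url1 <- "http://www.nfl.com/stats/categorystats?tabSeq=1&season="
url2 <- "&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p="
url3 <- "&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"
url  <- paste0(url1, 2013, url2, 1, url3)
## url now contains "...season=2013..." and "...d-447263-p=1..."
```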

You can write a function that loops over the different pages and returns one table per page, then bind them together with the classic do.call(rbind, ...):

library(XML)
url1 <- "http://www.nfl.com/stats/categorystats?tabSeq=1&season="
year <- 2013
url2 <- "&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p="
page <- 1
url3 <- "&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"

getTable <-
  function(page=1,year=2013){
    url <- paste0(url1,year,url2,page,url3)
    tab <- readHTMLTable(url,header=FALSE) ## returns a list of tables
    tab$result                             ## the stats table is named 'result'
}
## this will merge all tables in a single big table
do.call(rbind,lapply(seq_len(8),getTable,year=2013))

The general method

The general method is to scrape the next-page url using an xpath expression and loop until there is no new next page. This can be harder to do, but it is the cleanest solution.

getNext <-
function(url=url_base){
  doc <- htmlParse(url)
  ## xpath for the 'next' link in the page navigation bar
  XPATH_NEXT <- "//*[@class='linkNavigation floatRight']/*[contains(., 'next')]"
  next_page <- unique(xpathSApply(doc,XPATH_NEXT,xmlGetAttr,'href'))
  if(length(next_page)>0)
    paste0("http://www.nfl.com",next_page) ## hrefs are site-relative
  else ''
}
## url_base is your first url
res <- NULL
while(TRUE){
  tab <- readHTMLTable(url_base,header=FALSE)
  res <- rbind(res,tab$result)    ## append this page's table
  url_base <- getNext(url_base)   ## '' when there is no next page
  if (nchar(url_base)==0)
    break
}