
Scraping an HTML table that spans multiple pages with R

Updated: 2023-12-01 23:20:10

You can build the url dynamically with paste0, since the urls differ only slightly: for a given year, you change just the page number. You get a url structure like:

url <- paste0(url1,year,url2,page,url3) ## you change page or year or both
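As a minimal sketch (using the url fragments defined further down), paste0 concatenates its arguments with no separator, so swapping in a different page or year yields that page's url:

```r
## paste0() glues the fragments together; numbers are coerced to strings.
url1 <- "http://www.nfl.com/stats/categorystats?tabSeq=1&season="
url2 <- "&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p="
url3 <- "&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"
url  <- paste0(url1, 2013, url2, 1, url3)
## url now contains "...season=2013..." and "...d-447263-p=1..."
```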

You can write a function that loops over the different pages and returns one table per page, then bind them together with the classic do.call(rbind, ...):

library(XML)
url1 <- "http://www.nfl.com/stats/categorystats?tabSeq=1&season="
year <- 2013
url2 <- "&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p="
page <- 1
url3 <- "&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"

getTable <-
  function(page=1,year=2013){
    url <- paste0(url1,year,url2,page,url3)
    tab <- readHTMLTable(url,header=FALSE) ## returns a list of tables
    tab$result                             ## the stats table is named 'result'
}
## this will merge all tables in a single big table
do.call(rbind,lapply(seq_len(8),getTable,year=2013))

The general method

The general method is to scrape the next-page url using an xpath expression and loop until there is no new next page. This can be harder to do, but it is the cleanest solution.

getNext <-
function(url=url_base){
  doc <- htmlParse(url)
  ## xpath for the 'next' link in the page navigation bar
  XPATH_NEXT <- "//*[@class='linkNavigation floatRight']/*[contains(., 'next')]"
  next_page <- unique(xpathSApply(doc,XPATH_NEXT,xmlGetAttr,'href'))
  if(length(next_page)>0)
    paste0("http://www.nfl.com",next_page) ## hrefs are site-relative
  else ''
}
## url_base is your first url
res <- NULL
while(TRUE){
  tab <- readHTMLTable(url_base,header=FALSE)
  res <- rbind(res,tab$result)    ## append this page's table
  url_base <- getNext(url_base)   ## '' when there is no next page
  if (nchar(url_base)==0)
    break
}