且构网

Extracting text from search-result URLs with R

Updated: 2023-02-18 09:49:48

Here is the basic idea for scraping these pages. It may be slow in R if there are many pages to scrape. Your question is a bit ambiguous: you want the end results to be .txt files, but what about the result links that are PDFs? You can still use this code and change the file extension to .pdf for those.

 library(xml2)
 library(rvest)  # also provides the %>% pipe

 urll <- "https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

 # Collect the unique result links, read each page, and write its <body> to a temp file
 urll %>% read_html() %>% html_nodes("div#results a") %>% html_attr("href") %>%
   .[!duplicated(.)] %>% lapply(function(x) read_html(x) %>% html_nodes("body")) %>%
   Map(function(x, y) write_html(x, tempfile(y, fileext = ".txt"), options = "format"), .,
       paste0("tmp", seq_along(.)))
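For result links that point to PDFs, one possible way to handle the extension is sketched below. The helper names ext_for and save_one are hypothetical and not part of the original answer, and the sketch assumes a PDF link can be recognised by a .pdf suffix in its URL, which will not hold for every site.

```r
library(xml2)

# Hypothetical helper: choose the file extension from the URL.
# Assumption: PDF links end in ".pdf" (optionally followed by a query or fragment).
ext_for <- function(url) {
  ifelse(grepl("\\.pdf($|[?#])", url, ignore.case = TRUE), ".pdf", ".txt")
}

# Hypothetical helper: save one URL to a temp file and return the file path.
save_one <- function(url, name) {
  dest <- tempfile(name, fileext = ext_for(url))
  if (ext_for(url) == ".pdf") {
    # PDFs are binary, so download them as-is instead of parsing them as HTML
    download.file(url, dest, mode = "wb", quiet = TRUE)
  } else {
    body <- xml_find_first(read_html(url), "//body")
    write_html(body, dest, options = "format")
  }
  dest
}

# Only the extension picker is run here; save_one() needs network access.
exts <- ext_for(c("https://example.org/a.htm", "https://example.org/b.PDF"))
print(exts)
```

With the allurls vector built in the breakdown, something like Map(save_one, allurls, paste0("tmp", seq_along(allurls))) would apply it to every link.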

Here is a breakdown of the code above. The URL you want to scrape from:

 urll="https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

Get all the URLs you need:

  allurls <- urll%>%read_html()%>%html_nodes("div#results a")%>%html_attr("href")%>%.[!duplicated(.)]
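The link-extraction step can be tried offline on a small piece of HTML. The markup and base_url below are made up for illustration and are not the real search page. The sketch also resolves relative links with xml2's url_absolute(), on the assumption that a results page may contain them; the original answer does not do this.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for a search results page (not the real site markup)
page <- read_html('<html><body><div id="results">
  <a href="/docs/a.htm">A</a>
  <a href="/docs/a.htm">A again</a>
  <a href="https://example.org/b.pdf">B</a>
</div></body></html>')

base_url <- "https://example.org/search"  # assumed base URL for illustration

# Same selector as in the answer: every <a> inside <div id="results">
hrefs <- page %>% html_nodes("div#results a") %>% html_attr("href")

# Turn relative links into absolute ones, then drop duplicates
links <- unique(url_absolute(hrefs, base_url))
print(links)
```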

Where do you want to save the text? Create the temp files:

 tmps <- tempfile(paste0("tmp", seq_along(allurls)), fileext = ".txt")

At this point, allurls is a character vector. Each URL has to be read into an xml document with read_html() before it can be scraped. Finally, write the results into the temp files created above:

  allurls%>%lapply(function(x) read_html(x)%>%html_nodes("body"))%>%  
         Map(function(x,y) write_html(x,y,options="format"),.,tmps)
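Note that write_html() saves the markup of the body, so the .txt files still contain HTML tags. If plain text is what you are after, a variant along these lines might work. It is only a sketch run on a made-up document, and dropping the script and style nodes is my own assumption about what you would not want in the output.

```r
library(xml2)
library(rvest)

# Hypothetical page content standing in for one downloaded result
doc <- read_html("<html><body><h1>Inflation</h1><p>Prices rose.</p><script>var x = 1;</script></body></html>")

# Drop <script> and <style> nodes so their contents do not end up in the text
xml_remove(html_nodes(doc, "script, style"))

# Text content of the body, with the tags stripped
txt <- doc %>% html_node("body") %>% html_text()

# Write the text to a temp file, as in the answer
out <- tempfile("tmp1", fileext = ".txt")
writeLines(txt, out)
```

In the pipeline above, that would mean replacing html_nodes("body") with html_node("body") %>% html_text() and write_html() with writeLines().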

Please do not leave anything out. For example, after ..."format"), there is a period; it is the pipe's placeholder for the value coming from the left, so it has to stay. Your files have now been written to the session's temporary directory. To find them, type tempdir() in the console and it will give you their location. You can also change where the files are written through the tmpdir argument of tempfile().
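As a small illustration of changing the location, tempfile() takes a tmpdir argument. The folder name "scraped" below is just a placeholder chosen for this sketch.

```r
# Hypothetical output folder; replace with any directory you prefer
out_dir <- file.path(tempdir(), "scraped")
dir.create(out_dir, showWarnings = FALSE)

# tempfile() is vectorised over pattern, so this yields three paths inside out_dir
tmps <- tempfile(paste0("tmp", 1:3), tmpdir = out_dir, fileext = ".txt")

# tempfile() only generates names; a file exists once something is written to it
writeLines("example text", tmps[1])
print(list.files(out_dir))
```

Keep in mind that anything under tempdir() is removed when the R session ends, so point out_dir at a regular folder if the files should persist.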

Hope this helps.