
Recrawling URLs with Nutch only for updated websites

Updated: 2023-09-04 21:45:28

Simply put, you can't. You need to recrawl a page to determine whether it has been updated. So, according to your needs, prioritize the pages/domains and recrawl them on a schedule. For that you need a job scheduler such as Quartz.
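
As a rough sketch of that scheduling idea with Quartz 2.x: the job class, group names, and the way the crawl is actually started below are placeholders (you would shell out to Nutch's crawl script or call its API there); only the Quartz scheduling calls themselves are real.

```java
import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

// Quartz job that recrawls one domain. How the Nutch crawl is actually
// launched (bin/crawl, Nutch's Java API, a REST call, ...) is up to you.
public class RecrawlJob implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        String domain = context.getJobDetail().getJobDataMap().getString("domain");
        System.out.println("Recrawling " + domain);
        // placeholder: trigger your Nutch crawl for this domain here
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.start();

        // High-priority domain: recrawl every 6 hours.
        JobDetail job = JobBuilder.newJob(RecrawlJob.class)
                .withIdentity("recrawl-example-com", "recrawl")
                .usingJobData("domain", "example.com")
                .build();

        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("every-6-hours", "recrawl")
                .startNow()
                .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                        .withIntervalInHours(6)
                        .repeatForever())
                .build();

        scheduler.scheduleJob(job, trigger);
    }
}
```

Lower-priority pages would simply get their own job/trigger pair with a longer interval (e.g. weekly).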

You need to write a function that compares the pages. However, Nutch natively saves fetched pages as index files: it writes the HTML into new binary files, and I don't think it's practical to compare those, since Nutch combines all crawl results within a single file. If you want to save pages as raw HTML so they can be compared, see my answer to this question.
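
If you do keep the raw HTML around, one simple comparison is to store a content digest per URL and check it on the next crawl. A minimal sketch, assuming you manage that per-URL storage yourself (the class and method names below are illustrative, not part of Nutch):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public final class PageChangeDetector {

    // Hex-encoded SHA-256 of the raw HTML; store one digest per URL between crawls.
    public static String digest(String rawHtml) throws Exception {
        byte[] hash = MessageDigest.getInstance("SHA-256")
                .digest(rawHtml.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(hash.length * 2);
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // A page counts as updated if it was never seen before or its digest changed.
    public static boolean hasChanged(String previousDigest, String currentHtml) throws Exception {
        return previousDigest == null || !previousDigest.equals(digest(currentHtml));
    }
}
```

Note that a raw digest flags any byte-level change, including ads or timestamps, so you may want to strip boilerplate from the HTML before hashing.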