
Recrawling URLs with Nutch only for updated websites

Updated: 2023-09-04 21:45:28

Simply put, you can't. You need to recrawl a page to determine whether it has been updated. So, according to your needs, prioritize the pages/domains and recrawl them on a schedule. For that you need a job scheduler such as Quartz.
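
As a rough sketch of that scheduling idea with Quartz 2.x: the job class, group names, and the way the crawl is actually started below are placeholders (you would shell out to Nutch's crawl script or call its API there); only the Quartz scheduling calls themselves are real.

```java
import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

// Quartz job that recrawls one domain. How the Nutch crawl is actually
// launched (bin/crawl, Nutch's Java API, a REST call, ...) is up to you.
public class RecrawlJob implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        String domain = context.getJobDetail().getJobDataMap().getString("domain");
        System.out.println("Recrawling " + domain);
        // placeholder: trigger your Nutch crawl for this domain here
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.start();

        // High-priority domain: recrawl every 6 hours.
        JobDetail job = JobBuilder.newJob(RecrawlJob.class)
                .withIdentity("recrawl-example-com", "recrawl")
                .usingJobData("domain", "example.com")
                .build();

        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("every-6-hours", "recrawl")
                .startNow()
                .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                        .withIntervalInHours(6)
                        .repeatForever())
                .build();

        scheduler.scheduleJob(job, trigger);
    }
}
```

Lower-priority pages would simply get their own job/trigger pair with a longer interval (e.g. weekly).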

You need to write a function that compares the pages. However, Nutch natively saves fetched pages as index files: it writes the HTML into new binary files, and I don't think it's practical to compare those, since Nutch combines all crawl results within a single file. If you want to save pages as raw HTML so they can be compared, see my answer to this question.
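
If you do keep the raw HTML around, one simple comparison is to store a content digest per URL and check it on the next crawl. A minimal sketch, assuming you manage that per-URL storage yourself (the class and method names below are illustrative, not part of Nutch):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public final class PageChangeDetector {

    // Hex-encoded SHA-256 of the raw HTML; store one digest per URL between crawls.
    public static String digest(String rawHtml) throws Exception {
        byte[] hash = MessageDigest.getInstance("SHA-256")
                .digest(rawHtml.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(hash.length * 2);
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // A page counts as updated if it was never seen before or its digest changed.
    public static boolean hasChanged(String previousDigest, String currentHtml) throws Exception {
        return previousDigest == null || !previousDigest.equals(digest(currentHtml));
    }
}
```

Note that a raw digest flags any byte-level change, including ads or timestamps, so you may want to strip boilerplate from the HTML before hashing.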