且构网

Apache Nutch does not add internal links in a web page to the fetchlist

Updated: 2023-01-10 19:35:26

Your seed URL is being rejected by the default URL filters, which skip any URL containing `?` or `=`, so your page is not being crawled.

Edit the following files:

conf/automaton-urlfilter.txt

conf/regex-urlfilter.txt

Replace

# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*

with

# skip URLs containing certain characters as probable queries, etc.
-.*[*!@].*
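
To see why this change matters, here is a minimal sketch in Python that imitates how such a filter file is evaluated. It is not Nutch's own code. It assumes that the first matching rule decides, that a URL matching no rule is dropped, and that the file ends with a catch-all `+.*` accept rule. The seed URL and the `accepts` helper are hypothetical, chosen only for illustration.

```python
import re

# Rule lines in the style of the Nutch URL filter files:
# a leading '-' rejects, a leading '+' accepts.
# The trailing "+.*" catch-all is an assumption about the rest of the file.
OLD_RULES = ["-.*[?*!@=].*", "+.*"]
NEW_RULES = ["-.*[*!@].*", "+.*"]


def accepts(url, rules):
    """Return True if the first rule that matches the URL is an accept rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.fullmatch(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the URL is dropped


# Hypothetical seed URL with a query string.
seed = "http://example.com/index.php?page=1"

print(accepts(seed, OLD_RULES))  # False: '?' and '=' trigger the skip rule
print(accepts(seed, NEW_RULES))  # True: only '*', '!' and '@' are skipped now
```

With the edited rule, URLs containing `*`, `!` or `@` are still skipped, while URLs with query strings pass through. This probably also lets more dynamic pages into the crawl, so the fetchlist may grow noticeably.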