且构网

Detecting "stealth" web crawlers

Last updated: 2023-02-26 13:44:37

A while back, I worked with a smallish hosting company to help them implement a solution to this problem. The system I developed examined web server logs for excessive activity from any given IP address and issued firewall rules to block offenders. It included whitelists of IP addresses and ranges based on http://www.iplists.com/, which were updated automatically as needed by checking claimed user-agent strings: if a client claimed to be a legitimate spider but was not on the whitelist, the system performed DNS and reverse-DNS lookups to verify that the source IP address corresponded to the claimed owner of the bot. As a failsafe, these actions were reported to the admin by email, along with links to blacklist or whitelist the address in case of an incorrect assessment.
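For anyone who wants to try the DNS/reverse-DNS check described above, here is a minimal sketch of one way it could look. It is not the code from that system. The `BOT_DOMAINS` table, the function names, and the injectable `reverse`/`forward` resolvers are all assumptions made for illustration, and the default resolvers only handle IPv4.

```python
import socket

# Illustrative (assumed) mapping from a token in the user-agent string to
# the hostname suffixes that crawler's operator is expected to use.
BOT_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}


def claimed_bot(user_agent):
    """Return the key of the bot the user agent claims to be, or None."""
    ua = user_agent.lower()
    for name in BOT_DOMAINS:
        if name in ua:
            return name
    return None


def verify_bot(ip, user_agent, reverse=None, forward=None):
    """Check that `ip` really belongs to the bot named in `user_agent`.

    Step 1: reverse lookup of the IP must give a hostname under one of the
    expected domains. Step 2: forward lookup of that hostname must include
    the original IP, so a forged PTR record alone is not enough.
    """
    # Resolvers are injectable so the logic can be exercised without network.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])

    name = claimed_bot(user_agent)
    if name is None:
        return False  # makes no claim we know how to verify
    try:
        host = reverse(ip).lower().rstrip(".")
        if not host.endswith(BOT_DOMAINS[name]):
            return False
        return ip in forward(host)
    except OSError:  # socket.herror and socket.gaierror are subclasses
        return False
```

An address that passes could be added to the whitelist automatically; one that fails while claiming to be a spider is a good candidate for the email report to the admin.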

I haven't talked to that client in 6 months or so, but, last I heard, the system was performing quite effectively.

Side point: If you're thinking about building a similar detection system based on hit-rate limiting, be sure to use at least one-minute (and preferably at least five-minute) totals. I see a lot of people talking about these kinds of schemes who want to block anyone who tops 5-10 hits in a second. That may generate false positives on image-heavy pages (unless images are excluded from the tally), and it will generate false positives when someone like me finds an interesting site he wants to read all of and opens every link in a background tab while he reads the first one.
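To make the five-minute-totals idea concrete, here is a rough sketch of a sliding-window tally per IP. The threshold of 600 hits, the list of static-file suffixes, and the `(timestamp, ip, path)` input shape are assumptions chosen for illustration, not values from the system described above; parsing your actual log format is left out.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300  # five-minute totals rather than per-second bursts
MAX_HITS = 600        # hypothetical threshold; tune for your own traffic
# Assumed list of static assets to leave out of the tally.
STATIC_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".css", ".js", ".ico")


def find_offenders(hits, window=WINDOW_SECONDS, max_hits=MAX_HITS):
    """Return the set of IPs exceeding `max_hits` in any `window` seconds.

    `hits` is an iterable of (timestamp_seconds, ip, path) tuples sorted by
    timestamp, e.g. as parsed from a web server access log.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    offenders = set()
    for ts, ip, path in hits:
        # Skip images and other static files so image-heavy pages
        # do not inflate the count.
        if path.lower().split("?", 1)[0].endswith(STATIC_SUFFIXES):
            continue
        q = recent[ip]
        q.append(ts)
        # Drop hits that have aged out of the window.
        while q and q[0] <= ts - window:
            q.popleft()
        if len(q) > max_hits:
            offenders.add(ip)
    return offenders
```

With these numbers, someone opening ten tabs in one second is nowhere near the limit, while a client sustaining several requests per second for minutes gets flagged.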