Updated: 2023-02-26 13:22:53

How can one detect a crawler / spider using PHP?

I'm currently working on a project where I need to keep track of each crawler's visit.
I know that you should use HTTP_USER_AGENT, but I'm not really sure how to format the code for this purpose. I also know that the user agent can be changed very easily, so I would like to know whether it is possible to check some additional parameters to avoid spoofing.

Sample code of what I'm trying to do:

<?php
// The header may be absent, so fall back to an empty string.
$user_agent = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (strpos($user_agent, 'Google') !== false)
{
    echo "Googlebot is here";
}
?>

Thank you

According to Verifying Googlebot:

You can verify that a bot accessing your server really is Googlebot (or another Google user-agent) by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, and then doing a forward DNS lookup using that googlebot name. This is useful if you're concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.

For example:

host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard coded them. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot).
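
The name queried in the first `host` call above (`1.66.249.66.in-addr.arpa`) is just the IPv4 address with its octets reversed and `.in-addr.arpa` appended. A minimal sketch of that construction, assuming IPv4 only; the function name `reverseLookupName` is made up for illustration and is not part of PHP or of Google's documentation:

```php
<?php
// Hypothetical helper: builds the PTR query name for an IPv4 address.
// Returns null for anything that is not a valid IPv4 address.
function reverseLookupName(string $ipv4): ?string
{
    if (filter_var($ipv4, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return null;
    }

    // 66.249.66.1 -> ['1', '66', '249', '66'] -> "1.66.249.66.in-addr.arpa"
    $octets = array_reverse(explode('.', $ipv4));

    return implode('.', $octets) . '.in-addr.arpa';
}

echo reverseLookupName('66.249.66.1'), PHP_EOL; // prints 1.66.249.66.in-addr.arpa
```

In practice you rarely need to build this name yourself, since `gethostbyaddr()` performs the PTR lookup directly from the IP address.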

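You can do a reverse DNS lookup and then confirm it with a forward lookup:

function validateGoogleBotIP($ip) {
    // Reverse lookup. gethostbyaddr() returns the unmodified IP when no PTR
    // record exists and false on malformed input.
    $hostname = gethostbyaddr($ip); // e.g. "crawl-66-249-66-1.googlebot.com"
    if ($hostname === false || $hostname === $ip) {
        return false;
    }
    if (!preg_match('/\.google(bot)?\.com$/i', $hostname)) {
        return false;
    }
    // Forward lookup: the name must resolve back to the same address.
    // gethostbynamel() returns IPv4 addresses only.
    $addresses = gethostbynamel($hostname);
    return $addresses !== false && in_array($ip, $addresses, true);
}

// The header may be missing, so default to an empty string.
$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (strpos($userAgent, 'Google') !== false) {
    if (validateGoogleBotIP($_SERVER['REMOTE_ADDR'])) {
        echo 'It is ACTUALLY google';
    } else {
        echo 'Someone\'s faking it!';
    }
} else {
    echo 'Nothing to do with Google';
}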
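
Since the question asks about tracking each crawler's visit rather than Google alone, the user-agent check can be generalized to a small lookup table. The following is only a sketch: the constant `CRAWLER_TOKENS` and the function `identifyCrawler` are hypothetical names, and the token list is an illustrative, incomplete selection. A match here only tells you what the client claims to be; it still needs a DNS check like the one above before you trust it.

```php
<?php
// Illustrative, incomplete list: display name => lowercase user-agent token.
const CRAWLER_TOKENS = [
    'Googlebot'   => 'googlebot',
    'Bingbot'     => 'bingbot',
    'DuckDuckBot' => 'duckduckbot',
    'YandexBot'   => 'yandexbot',
    'Baiduspider' => 'baiduspider',
];

// Returns the name of the crawler the user agent claims to be, or null.
function identifyCrawler(?string $userAgent): ?string
{
    if ($userAgent === null || $userAgent === '') {
        return null;
    }

    foreach (CRAWLER_TOKENS as $name => $token) {
        // stripos() is case-insensitive, so "bingbot" and "Bingbot" both match.
        if (stripos($userAgent, $token) !== false) {
            return $name;
        }
    }

    return null;
}
```

A typical call would be `identifyCrawler($_SERVER['HTTP_USER_AGENT'] ?? null)`, with the returned name written to whatever visit log you keep.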