且构网

Sharing all things programming and development...

PHP crawler detection

Updated: 2022-10-18 19:59:03

I'm trying to write a sitemap.php that behaves differently depending on who is requesting it.

I want to redirect crawlers to my sitemap.xml, as that will be the most up-to-date page and will contain all the info they need, but I want my regular readers to be shown an HTML sitemap on the PHP page.

This will all be controlled with a PHP header redirect. I found this code on the web, and by the looks of it it should work, but it doesn't. Can anyone help me crack this?

function getIsCrawler($userAgent) {
    $crawlers = 'firefox|Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|' .
    'AcioRobot|ASPSeek|CocoCrawler|Dumbot|FAST-WebCrawler|' .
    'GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
    $isCrawler = (preg_match("/$crawlers/i", $userAgent) > 0);
    return $isCrawler;
}

$iscrawler = getIsCrawler($_SERVER['HTTP_USER_AGENT']);

if ($isCrawler) {
    header('Location: http://www.website.com/sitemap.xml');
    exit;
} else {
    echo "not crawler!";
}

It looks pretty simple, but as you can see I've added firefox to the agent list for testing, and I'm still not being redirected.

Thanks for any help :)

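You have a mistake in your code. The result is assigned to one variable:

$iscrawler = getIsCrawler($_SERVER['HTTP_USER_AGENT']);

but the if statement tests $isCrawler. PHP variable names are case-sensitive, so these are two different variables and the test always sees an undefined one. The assignment should be

$isCrawler = getIsCrawler($_SERVER['HTTP_USER_AGENT']);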

If you develop with notices enabled, you'll catch these errors much more easily, because PHP reports every read of an undefined variable.
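As a rough illustration of that advice, the sketch below turns on full error reporting and reproduces the mismatched variable names. It assumes a development environment; the custom error handler and the `$messages` array are hypothetical additions used only to capture the diagnostic so it can be inspected. Depending on the PHP version, the message is raised as a notice or as a warning.

```php
<?php
// Development-only settings: report every diagnostic and display it.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Illustrative only: collect diagnostics in an array instead of printing them.
$messages = [];
set_error_handler(function (int $errno, string $errstr) use (&$messages): bool {
    $messages[] = $errstr;
    return true; // tell PHP the error was handled, skipping its default handler
});

$iscrawler = true; // assigned with a lowercase "c"

// Read with an uppercase "C": this variable was never defined, so PHP
// raises a diagnostic and the expression evaluates as false.
$result = $isCrawler ? 'crawler' : 'not crawler';

restore_error_handler();
```

With reporting enabled, the undefined `$isCrawler` shows up immediately instead of silently evaluating to false.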

Also, you probably want to exit after the header() call, so that nothing else is sent after the redirect.
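Putting both points together, a corrected version might look like the sketch below. The `redirectTarget` helper is hypothetical, introduced here only so the decision can be checked without sending headers. The crawler list is the one from the question without the `firefox` test entry, and the URL is the placeholder from the question.

```php
<?php
// Returns true when the user agent contains one of the listed crawler names.
function getIsCrawler(string $userAgent): bool {
    $crawlers = 'Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|' .
        'AcioRobot|ASPSeek|CocoCrawler|Dumbot|FAST-WebCrawler|' .
        'GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
    return preg_match("/$crawlers/i", $userAgent) === 1;
}

// Hypothetical helper: the redirect URL for crawlers, or null for regular visitors.
function redirectTarget(string $userAgent): ?string {
    $isCrawler = getIsCrawler($userAgent); // same spelling as in the check below
    return $isCrawler ? 'http://www.website.com/sitemap.xml' : null;
}

// The User-Agent header can be missing, so fall back to an empty string.
$target = redirectTarget($_SERVER['HTTP_USER_AGENT'] ?? '');
if ($target !== null) {
    header('Location: ' . $target);
    exit; // stop here so no page body follows the redirect
}
// Regular visitors continue past this point; render the HTML sitemap here.
```

Keeping the decision in a function that returns a value makes it possible to check the matching logic separately from the redirect itself.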

Warning: Cloaking can get you in trouble with search providers. This article explains why.