且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

无法加载资源:服务器通过Selenium用ChromeDriver Chrome响应状态为429(请求过多)和404(未找到)

更新时间:2022-10-19 16:46:23

429请求太多

HTTP 429个请求过多响应状态码表示用户在给定的时间内发送了太多请求(速率限制").响应表示应包含说明条件的详细信息,并且可以包含Retry-After标头,该标头指示在发出新请求之前要等待多长时间.

当服务器受到攻击或仅从单方接收到大量请求时,以 429 状态码响应每个请求都会消耗资源.因此,不需要服务器使用429状态代码.在限制资源使用时,***只是断开连接或采取其他步骤.


404未找到

HTTP 找不到404 客户端错误响应代码表示服务器找不到请求的资源.在浏览器中,这意味着无法识别URL.在API中,这也可能意味着端点有效,但是资源本身不存在.服务器也可以发送此响应而不是403,以隐藏来自未授权客户端的资源.由于该响应代码经常在网络上出现,因此可能是最著名的响应代码.

404状态代码不指示资源是暂时丢失还是永久丢失.但是,如果资源被永久删除,则应使用410 (Gone)而不是404状态.另外,当找不到请求的资源时(无论该资源不存在还是出于安全原因该401403出于安全原因该服务要屏蔽),都使用404状态代码.


分析

当我尝试使用您的代码块时,我遇到了类似的后果.如果您检查 DOM树 .au/a/a-nsw-metro-rouse-hill/everything/browse/baby/nappies-changing?pageNumber = 1"rel =" nofollow noreferrer>网页,您会发现很多标签是具有关键字 dist .例如:

  • <link rel="shortcut icon" type="image/x-icon" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/img/favicon.ico">
  • <link rel="stylesheet" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/css/google/fonts-Source-Sans-Pro.css" type="text/css" media="screen">
  • 'appDir': '/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/app'

出现 dist 一词清楚地表明该网站受 Bot Management 服务提供商 Distil Networks ,并检测到 ChromeDriver 进行的导航,随后将其阻止.>


Distil

根据文章,Distil首席执行官拉米·埃塞伊(Rami Essai)在上周的一次采访中表示. "Even though they can create new bots, we figured out a way to identify Selenium the a tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".


参考

您可以在以下位置找到一些详细的讨论:

I am trying to build a scraper using selenium in python. Selenium webdriver opening window and trying to load the page but suddenly stop loading. I can access the same link in my local chrome browser.

Here are the error logs I'm getting from the webdriver:

{'level': 'SEVERE', 'message': 'https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/nappies-changing?pageNumber=1 - Failed to load resource: the server responded with a status of 429 (Too Many Requests)', 'source': 'network', 'timestamp': 1556997743637}

{'level': 'SEVERE', 'message': 'about:blank - Failed to load resource: net::ERR_UNKNOWN_URL_SCHEME', 'source': 'network', 'timestamp': 1556997745338}

{'level': 'SEVERE', 'message': 'https://shop.coles.com.au/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint - Failed to load resource: the server responded with a status of 404 (Not Found)', 'source': 'network', 'timestamp': 1556997748339}

My script:

from selenium import webdriver
import os

path = os.path.join(os.getcwd(), 'chromedriver')
driver = webdriver.Chrome(executable_path=path)

links = [
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/nappies-changing?pageNumber=1",
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/baby-accessories?pageNumber=1",
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/food?pageNumber=1",
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/formula?pageNumber=1",
]


for link in links:
    driver.get(link)

429 Too Many Requests

The HTTP 429 Too Many Requests response status code indicates that the user has sent too many requests in a given amount of time ("rate limiting"). The response representations SHOULD include details explaining the condition, and MAY include a Retry-After header indicating how long to wait before making a new request.

When a server is under attack or just receiving a very large number of requests from a single party, responding to each with a 429 status code will consume resources. Therefore, servers are not required to use the 429 status code; when limiting resource usage, it may be more appropriate to just drop connections, or take other steps.


404 Not Found

The HTTP 404 Not Found client error response code indicates that the server can not find requested resource. In the browser, this means the URL is not recognized. In an API, this can also mean that the endpoint is valid but the resource itself does not exist. Servers may also send this response instead of 403 to hide the existence of a resource from an unauthorized client. This response code is probably the most famous one due to its frequent occurence on the web.

A 404 status code does not indicate whether the resource is temporarily or permanently missing. But if a resource is permanently removed, a 410 (Gone) should be used instead of a 404 status. Additionally, 404 status code is used when the requested resource is not found, whether it doesn't exist or if there was a 401 or 403 that, for security reasons, the service wants to mask.


Analysis

When I tried your code block, I faced similar consequences. If you inspect the DOM Tree of the webpage you will find that quite a few tags are having the keyword dist. As an example:

  • <link rel="shortcut icon" type="image/x-icon" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/img/favicon.ico">
  • <link rel="stylesheet" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/css/google/fonts-Source-Sans-Pro.css" type="text/css" media="screen">
  • 'appDir': '/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/app'

The presence of the term dist is a clear indication that the website is protected by Bot Management service provider Distil Networks and the navigation by ChromeDriver gets detected and subsequently blocked.


Distil

As per the article There Really Is Something About Distil.it...:

Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.

Further,

"One pattern with **Selenium** was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium the a tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".


Reference

You can find a couple of detailed discussion in: