且构网


How to force lazy-loaded dynamic content to load without scrolling down

Updated: 2023-09-16 15:15:40

The problem is not so simple, and I don't think it can be solved without looking at the site you are trying to scrape. I have no idea how exactly the lazy loading technique you described is implemented, but I'm sure it can be implemented in several different ways, and those differences would need different scraping approaches. Only one aspect of the difference is important: in all cases, scrolling causes some additional HTTP requests, and the data related to the scrolling event (say, scrolling position, page number, or something like that) can be passed in the HTTP request in different ways: HTTP parameters, URL parameters, etc.

So, you need to study this and act accordingly. How? Here is the approach I would use:

Use some existing HTTP spy software and then try to reach the full content manually, by loading the page and scrolling. Such HTTP spying tools are often available as plug-ins for Web browsers. I, for example, use HttpFox, a plug-in for Mozilla browsers. If tracking is turned on, it lists all the HTTP requests and HTTP responses passed through the browser, with all the detail needed to understand how to do the scraping.
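Once the spy tool has shown which request the page sends on each scroll, that request can usually be replayed directly, with no scrolling at all. Below is a minimal Python sketch of the idea. Everything site-specific in it is an assumption made for illustration: the endpoint `https://example.com/api/items`, the `page` and `size` URL parameters, the request headers, and a response that is a JSON list which comes back empty after the last page. Replace each of these with whatever the captured traffic actually shows.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint: use the URL observed in the HTTP spy tool instead.
BASE_URL = "https://example.com/api/items"


def build_page_url(base_url, page, page_size=20):
    """Build the URL that the page would request for one scroll step.

    The parameter names "page" and "size" are assumptions; the real site
    may use an offset, a cursor, or a POST body instead.
    """
    return f"{base_url}?{urlencode({'page': page, 'size': page_size})}"


def fetch_json(url):
    """Fetch one URL and decode the body as JSON (assumed response format)."""
    request = Request(
        url,
        headers={
            # Headers are assumptions; copy the ones seen in the captured request.
            "User-Agent": "Mozilla/5.0",
            "X-Requested-With": "XMLHttpRequest",
        },
    )
    with urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_all_items(base_url, fetch=fetch_json, page_size=20, max_pages=100):
    """Replay the scroll-triggered requests page by page.

    Stops at the first empty batch, or after max_pages as a safety limit.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch(build_page_url(base_url, page, page_size))
        if not batch:
            break
        items.extend(batch)
    return items
```

Calling `fetch_all_items(BASE_URL)` would then collect everything the page loads lazily. The `fetch` argument is there so the paging logic can be tried against a stand-in function before it is pointed at the real site.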

—SA