且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

使用QWebPage抓取多个网址

更新时间:2023-09-18 22:19:10

程序的问题是,您尝试使用获取的每个URL创建一个新的QApplication.

The problem with your program is that you are attempting to create a new QApplication with every url you fetch.

相反,您应该创建一个QApplication,并处理WebPage类本身内的所有网页加载和处理.关键概念是使用loadFinished信号通过在加载和处理了当前URL后获取一个新URL来创建循环.

Instead, you should create one QApplication, and handle all the loading and processing of web pages within the WebPage class itself. The key concept is to use the loadFinished signal to create a loop by fetching a new url after the current one has been loaded and processed.

下面的两个演示脚本(用于PyQt4和PyQt5)是简化的示例,显示了如何构造程序.希望,如何使它们适应自己的使用应该很明显:

The two demo scripts below (for PyQt4 and PyQt5) are simplified examples that show how to structure the program. Hopefully, it should be fairly obvious how to adapt them for your own use:

import sys
from PyQt4 import QtCore, QtGui, QtWebKit

class WebPage(QtWebKit.QWebPage):
    def __init__(self):
        super(WebPage, self).__init__()
        self.mainFrame().loadFinished.connect(self.handleLoadFinished)

    def start(self, urls):
        self._urls = iter(urls)
        self.fetchNext()

    def fetchNext(self):
        try:
            url = next(self._urls)
        except StopIteration:
            return False
        else:
            self.mainFrame().load(QtCore.QUrl(url))
        return True

    def processCurrentPage(self):
        url = self.mainFrame().url().toString()
        html = self.mainFrame().toHtml()
        # do stuff with html...
        print('loaded: [%d bytes] %s' % (self.bytesReceived(), url))

    def handleLoadFinished(self):
        self.processCurrentPage()
        if not self.fetchNext():
            QtGui.qApp.quit()

if __name__ == '__main__':

    # generate some test urls
    urls = []
    url = 'http://pyqt.sourceforge.net/Docs/PyQt4/%s.html'
    for name in dir(QtWebKit):
        if name.startswith('Q'):
            urls.append(url % name.lower())

    app = QtGui.QApplication(sys.argv)
    webpage = WebPage()
    webpage.start(urls)
    sys.exit(app.exec_())


以下是上述脚本的PyQt5/QWebEngine版本:


Here is a PyQt5/QWebEngine version of the above script:

import sys
from PyQt5 import QtCore, QtWidgets, QtWebEngineWidgets

class WebPage(QtWebEngineWidgets.QWebEnginePage):
    def __init__(self):
        super(WebPage, self).__init__()
        self.loadFinished.connect(self.handleLoadFinished)

    def start(self, urls):
        self._urls = iter(urls)
        self.fetchNext()

    def fetchNext(self):
        try:
            url = next(self._urls)
        except StopIteration:
            return False
        else:
            self.load(QtCore.QUrl(url))
        return True

    def processCurrentPage(self, html):
        url = self.url().toString()
        # do stuff with html...
        print('loaded: [%d chars] %s' % (len(html), url))
        if not self.fetchNext():
            QtWidgets.qApp.quit()

    def handleLoadFinished(self):
        self.toHtml(self.processCurrentPage)

if __name__ == '__main__':

    # generate some test urls
    urls = []
    url = 'http://pyqt.sourceforge.net/Docs/PyQt5/%s.html'
    for name in dir(QtWebEngineWidgets):
        if name.startswith('Q'):
            urls.append(url % name.lower())

    app = QtWidgets.QApplication(sys.argv)
    webpage = WebPage()
    webpage.start(urls)
    sys.exit(app.exec_())