且构网

Sharing the things programmers deal with in development...

How do I parse a web page that contains JavaScript-generated [dynamic] HTML using Python?

Updated: 2023-02-23 16:50:07

Your best bet for accurately parsing JavaScript-enhanced content from web pages is to load the page in a browser engine. Luckily, there are ways to automate this in Python.
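
To see why a browser engine is needed, here is a minimal sketch using only the standard-library `html.parser` (Python 3, unlike the Python 2 pywebkitgtk code in this answer). The sample markup and the `TagCounter` helper are made up for illustration. A static parser treats the script body as opaque text, so the submit button that the script would create never shows up as an element:

```python
from html.parser import HTMLParser

# Hypothetical page: the submit button only exists after the script runs.
PAGE = """
<html><body>
<div id="out"></div>
<script>
document.getElementById('out').innerHTML = '<input type="submit">';
</script>
</body></html>
"""


class TagCounter(HTMLParser):
    """Counts start tags by name; script bodies are never parsed as markup."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(PAGE)

# Number of <input> elements visible without executing any JavaScript.
static_inputs = counter.counts.get("input", 0)
print(static_inputs)  # prints 0
```

A browser engine, by contrast, runs the script first, so the same query against the rendered DOM would find the button.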

The method I've had the most success with is the pywebkitgtk project, which lets you programmatically create and control instances of the WebKit browser engine from within a Python application. I also use the jswebkit module to simplify executing JavaScript in the page context.

Another option is PyQt4's QtWebKit module, which I've only used for experimentation.

Here is a working example that uses pywebkitgtk and jswebkit together to extract data from a WebKit-rendered page. In a production environment you'll want to run several of these processors in parallel, each rendering to its own instance of the X virtual framebuffer (Xvfb).
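
The parallel Xvfb setup could be sketched roughly as follows. This is only one possible arrangement, not part of the original answer: the worker script name `render_worker.py`, the `python2` interpreter name, the starting display number 99, and the fixed startup delay are all assumptions. Each worker is expected to read its URLs from the command line and pick up its display from the `DISPLAY` environment variable.

```python
import os
import subprocess
import time


def xvfb_command(display, width=1024, height=768, depth=24):
    """Return the argv for one Xvfb server on display :<display>."""
    return ["Xvfb", ":%d" % display,
            "-screen", "0", "%dx%dx%d" % (width, height, depth)]


def worker_env(display, base_env=None):
    """Return a copy of the environment with DISPLAY set to that server."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = ":%d" % display
    return env


def launch_workers(url_batches, script="render_worker.py",
                   python="python2", first_display=99, startup_delay=1.0):
    """Start one Xvfb server plus one rendering process per batch of URLs."""
    pairs = []
    for offset, urls in enumerate(url_batches):
        display = first_display + offset
        server = subprocess.Popen(xvfb_command(display))
        # Crude: give the X server a moment to come up before connecting.
        time.sleep(startup_delay)
        worker = subprocess.Popen([python, script] + list(urls),
                                  env=worker_env(display))
        pairs.append((server, worker))
    return pairs


def shutdown(pairs):
    """Wait for each worker to finish, then stop its Xvfb server."""
    for server, worker in pairs:
        worker.wait()
        server.terminate()
        server.wait()
```

Something like `shutdown(launch_workers([batch_a, batch_b]))` would then render two batches side by side, each on its own virtual display. A sturdier version would poll until each server accepts connections instead of sleeping.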

import os

import gtk
import jswebkit
import lxml.html
import pygtk
import webkit

def load_finished(view, frame):
    # called when the document finishes loading
    if frame != view.get_main_frame():
        return
    ctx = jswebkit.JSContext(frame.get_global_context())
    res = ctx.EvaluateScript('window.location.href')
    print res
    res = ctx.EvaluateScript('document.body.innerHTML')
    tree = lxml.html.fromstring(res)
    print tree.xpath('//input[@type="submit"]')

# initialization
pygtk.require20()
gtk.gdk.threads_init()

# create the webview and hook up callbacks to signals
view = webkit.WebView()
view.set_size_request(1024, 768)
view.connect('load-finished', load_finished)

# configure the webview
props = view.get_settings()
props.set_property('enable-java-applet', False)
props.set_property('enable-plugins', False)
props.set_property('enable-page-cache', False)

# create a window to host the webview
win = gtk.Window()
win.add(view)
win.show_all()

# open google front page
view.open('http://www.google.com')

# spin, processing gtk events
while True:
    try:
        while gtk.events_pending():
            gtk.main_iteration(False)
    except KeyboardInterrupt:
        break

Example output:

http://www.google.com/
[<InputElement 2a64a78 name='btnG' type='submit'>, <InputElement 2a64bb0 name='btnG' type='submit'>, <InputElement 2a64ae0 name='btnI' type='submit'>]