Python - Seeking in an HTTP response stream

Updated: 2022-02-02 05:54:09

I'm not sure how the C# implementation works, but, as internet streams are generally not seekable, my guess would be it downloads all the data to a local file or in-memory object and seeks within it from there. The Python equivalent of this would be to do as Abafei suggested and write the data to a file or StringIO and seek from there.
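
A minimal sketch of the "download everything, then seek" approach described above. It is written for Python 3, where `io.BytesIO` plays the role of `StringIO` for binary data; `make_seekable` is a hypothetical helper name, and the `BytesIO` passed to it here merely stands in for a real response object.

```python
import io


def make_seekable(stream, chunk_size=8192):
    """Copy a file-like stream into memory and return a seekable buffer."""
    buffer = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer.write(chunk)
    buffer.seek(0)  # rewind so the caller starts at the beginning
    return buffer


# Stand-in for a response body; with a real request this would be the
# object returned by urllib.request.urlopen(url).
body = make_seekable(io.BytesIO(b"0123456789abcdef"))

body.seek(10)
print(body.read(3))  # b'abc'
body.seek(2)
print(body.read(4))  # b'2345'
```

For large downloads, a temporary file (for example from the `tempfile` module) could replace the in-memory buffer, at the cost of disk I/O.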

However, if, as your comment on Abafei's answer suggests, you want to retrieve only a particular part of the file (rather than seeking backwards and forwards through the returned data), there is another possibility. urllib2 can be used to retrieve a certain section (or 'range' in HTTP parlance) of a webpage, provided that the server supports this behaviour.

When you send a request to a server, the parameters of the request are given in various headers. One of these is the Range header, defined in section 14.35 of RFC2616 (the specification defining HTTP/1.1). This header allows you to do things such as retrieve all data starting from the 10,000th byte, or the data between bytes 1,000 and 1,500.
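
To make the header format concrete, here is a small sketch that builds `Range` values for the cases mentioned above. The helper names `range_header` and `suffix_range_header` are made up for illustration. Byte offsets are zero-based and the end position is inclusive, so `bytes=1000-1500` covers 501 bytes.

```python
def range_header(start, length=None):
    """Build a Range value starting at byte offset `start` (zero-based)."""
    if start < 0:
        raise ValueError("start must be non-negative")
    if length is None:
        # Open-ended: everything from `start` to the end of the resource.
        return "bytes=%d-" % start
    if length <= 0:
        raise ValueError("length must be positive")
    # The end position in a byte range is inclusive.
    return "bytes=%d-%d" % (start, start + length - 1)


def suffix_range_header(count):
    """Build a Range value asking for the final `count` bytes."""
    if count <= 0:
        raise ValueError("count must be positive")
    return "bytes=-%d" % count


print(range_header(10000))        # bytes=10000-
print(range_header(1000, 501))    # bytes=1000-1500
print(suffix_range_header(2000))  # bytes=-2000
```

The suffix form is handy when the total size is not known in advance, since it asks for the tail of the resource without needing a start offset.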

There is no requirement for a server to support range retrieval. Some servers will return the Accept-Ranges header (section 14.5 of RFC2616) along with a response to report if they support ranges or not. This could be checked using a HEAD request. However, there is no particular need to do this; if a server does not support ranges, it will return the entire page and we can then extract the desired portion of data in Python as before.
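
If you do want to check first, a HEAD request along these lines is one way to do it. This is a sketch for Python 3 (`urllib.request` rather than `urllib2`); `interpret_accept_ranges` and `probe_range_support` are hypothetical names. Because the header is optional, a missing `Accept-Ranges` is treated as "unknown" rather than "unsupported".

```python
import urllib.request


def interpret_accept_ranges(value):
    """Return True, False, or None (unknown) for an Accept-Ranges value."""
    if value is None:
        return None  # header absent: the server has not said either way
    units = [unit.strip().lower() for unit in value.split(",")]
    if "bytes" in units:
        return True
    if units == ["none"]:
        return False
    return None


def probe_range_support(url, timeout=10):
    """Send a HEAD request and report what Accept-Ranges says, if anything."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return interpret_accept_ranges(response.headers.get("Accept-Ranges"))


print(interpret_accept_ranges("bytes"))  # True
print(interpret_accept_ranges("none"))   # False
print(interpret_accept_ranges(None))     # None
```

Calling `probe_range_support("http://www.python.org/")` would perform the actual network request; it is left out of the example so the snippet runs offline.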

If a server returns a range, it must send the Content-Range header (section 14.16 of RFC2616) along with the response. If this is present in the headers of the response, we know a range was returned; if it is not present, the entire page was returned.
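
As an illustration, a `Content-Range` value of the form `bytes first-last/total` could be parsed roughly like this. `parse_content_range` is a hypothetical helper. It handles only the byte-range form returned with a partial response and rejects anything else, including the `bytes */total` form used when a range cannot be satisfied.

```python
import re

# Matches values such as "bytes 6000-6399/19387" or "bytes 6000-6399/*".
_CONTENT_RANGE = re.compile(r"^\s*bytes\s+(\d+)-(\d+)/(\d+|\*)\s*$", re.IGNORECASE)


def parse_content_range(value):
    """Return (first, last, total); total is None when the server sent '*'."""
    match = _CONTENT_RANGE.match(value)
    if match is None:
        raise ValueError("unrecognised Content-Range: %r" % value)
    first, last, total = match.groups()
    return int(first), int(last), (None if total == "*" else int(total))


print(parse_content_range("bytes 6000-6399/19387"))  # (6000, 6399, 19387)
print(parse_content_range("bytes 0-499/*"))          # (0, 499, None)
```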

urllib2 allows us to add headers to a request, thus allowing us to ask the server for a range rather than the entire page. The following script takes a URL, a start position, and (optionally) a length on the command line, and tries to retrieve the given section of the page.

import sys
import urllib2

# Check command line arguments.
if len(sys.argv) < 3:
    sys.stderr.write("Usage: %s url start [length]\n" % sys.argv[0])
    sys.exit(1)

# Create a request for the given URL.
request = urllib2.Request(sys.argv[1])

# Add the header to specify the range to download.
if len(sys.argv) > 3:
    start, length = map(int, sys.argv[2:])
    request.add_header("range", "bytes=%d-%d" % (start, start + length - 1))
else:
    request.add_header("range", "bytes=%s-" % sys.argv[2])

# Try to get the response. This will raise a urllib2.URLError if there is a
# problem (e.g., invalid URL).
response = urllib2.urlopen(request)

# If a content-range header is present, partial retrieval worked.
if "content-range" in response.headers:
    print "Partial retrieval successful."

    # The header contains the string 'bytes', followed by a space, then the
    # range in the format 'start-end', followed by a slash and then the total
    # size of the page (or an asterisk if the total size is unknown). Let's get
    # the range and total size from this.
    byte_range, total = response.headers['content-range'].split(' ')[-1].split('/')

    # Print a message giving the range information.
    if total == '*':
        print "Bytes %s of an unknown total were retrieved." % byte_range
    else:
        print "Bytes %s of a total of %s were retrieved." % (byte_range, total)

# No header, so partial retrieval was unsuccessful.
else:
    print "Unable to use partial retrieval."

# And for good measure, let's check how much data we downloaded.
data = response.read()
print "Retrieved data size: %d bytes" % len(data)

Using this, I can retrieve the final 2,000 bytes of the Python homepage:

blair@blair-eeepc:~$ python retrieverange.py http://www.python.org/ 17387
Partial retrieval successful.
Bytes 17387-19386 of a total of 19387 were retrieved.
Retrieved data size: 2000 bytes

Or 400 bytes from the middle of the homepage:

blair@blair-eeepc:~$ python retrieverange.py http://www.python.org/ 6000 400
Partial retrieval successful.
Bytes 6000-6399 of a total of 19387 were retrieved.
Retrieved data size: 400 bytes

However, the Google homepage does not support ranges:

blair@blair-eeepc:~$ python retrieverange.py http://www.google.com/ 1000 500
Unable to use partial retrieval.
Retrieved data size: 9621 bytes

In this case, it would be necessary to extract the data of interest in Python prior to any further processing.
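
A rough sketch of that fallback, again for Python 3 and with hypothetical helper names (`slice_body`, `fetch_range`): ask for the range, and if the response carries no `Content-Range` header, assume the whole page came back and slice it locally.

```python
import urllib.request


def slice_body(body, start, length=None):
    """Cut the wanted bytes out of a complete response body."""
    end = None if length is None else start + length
    return body[start:end]


def fetch_range(url, start, length=None, timeout=10):
    """Request a byte range, slicing locally if the server ignores it."""
    last = "" if length is None else str(start + length - 1)
    request = urllib.request.Request(
        url, headers={"Range": "bytes=%d-%s" % (start, last)}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
        if "Content-Range" in response.headers:
            # The server honoured the range; the body is already the part we want.
            return body
    # No Content-Range: the whole page was returned, so extract the part here.
    return slice_body(body, start, length)


print(slice_body(b"abcdefghij", 2, 3))  # b'cde'
print(slice_body(b"abcdefghij", 7))     # b'hij'
```

With this, something like `fetch_range("http://www.google.com/", 1000, 500)` would yield the requested bytes either way, although the full page is still downloaded when the server ignores the header.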