且构网

How do I know whether urllib.urlretrieve succeeded?

Updated: 2022-10-20 15:44:23

urllib.urlretrieve returns silently even if the file doesn't exist on the remote HTTP server; it just saves an HTML page to the named file. For example:

urllib.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')

This just returns silently even if abc.jpg doesn't exist on the google.com server; the generated abc.jpg is not a valid JPEG file but an HTML page. I guess the returned headers (an httplib.HTTPMessage instance) can be used to tell whether the retrieval succeeded, but I can't find any documentation for httplib.HTTPMessage.

Can anybody provide some information about this problem?

Consider using urllib2 if that is possible in your case. It is more advanced and easier to use than urllib.

You can detect any HTTP errors easily:

>>> import urllib2
>>> resp = urllib2.urlopen("http://google.com/abc.jpg")
Traceback (most recent call last):
<<MANY LINES SKIPPED>>
urllib2.HTTPError: HTTP Error 404: Not Found
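
In Python 3, urllib2 was split into urllib.request and urllib.error, so the same check can be wrapped in a small helper. What follows is only a sketch of a Python 3 translation of the session above, not code from the original answer. fetch_status and its opener parameter are hypothetical names; opener exists only so the function can be tried without a network connection.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_status(url, opener=urlopen):
    """Return (status_code, body_bytes); body is None when the request fails.

    `opener` is a hypothetical hook (defaults to urlopen) so the function
    can be exercised with a stand-in instead of a real network call.
    """
    try:
        with opener(url) as resp:
            # Reached only when urlopen did not raise, i.e. the final
            # response (after any redirects) was not an HTTP error.
            return resp.status, resp.read()
    except HTTPError as err:
        # Raised for error responses such as 404 Not Found.
        # Must be caught before URLError, which is its base class.
        return err.code, None
    except URLError:
        # Non-HTTP failures: DNS lookup failed, connection refused, etc.
        return None, None
```

A caller can then treat anything other than a 200 status as a failed download and skip writing the file.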

resp is actually an HTTPResponse object that you can do a lot of useful things with:

>>> resp = urllib2.urlopen("http://google.com/")
>>> resp.code
200
>>> resp.headers["content-type"]
'text/html; charset=windows-1251'
>>> resp.read()
"<<ACTUAL HTML>>"
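
The response headers can also answer the original question: before saving the body, compare the reported Content-Type with the type you expect. This is a minimal sketch, assuming Python 3, where resp.headers is an email.message.Message subclass. has_expected_type, save_if_type, and expected_prefix are hypothetical names, and the hand-built headers object stands in for a real response.

```python
from email.message import Message


def has_expected_type(headers, expected_prefix="image/"):
    """Return True when the Content-Type header starts with expected_prefix."""
    # get_content_type() lowercases the value and drops parameters
    # such as "; charset=...".
    return headers.get_content_type().startswith(expected_prefix)


def save_if_type(resp, dest, expected_prefix="image/"):
    """Write resp's body to dest only when its Content-Type matches."""
    if not has_expected_type(resp.headers, expected_prefix):
        return False
    with open(dest, "wb") as out:
        out.write(resp.read())
    return True


# Stand-in for resp.headers, built by hand so no network call is needed.
headers = Message()
headers["Content-Type"] = "text/html; charset=windows-1251"
print(has_expected_type(headers))           # False
print(has_expected_type(headers, "text/"))  # True
```

A server can still send a misleading Content-Type, so this check complements the HTTP status check and does not replace it.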