且构网

Sharing the things programmers run into during development...

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)

Updated: 2022-10-21 17:22:06

I am attempting to work with a very large dataset that has some non-standard characters in it. I need to use unicode, as per the job specs, but I am baffled. (And quite possibly doing it all wrong.)

I open the CSV using:

    ncesReader = csv.reader(open('geocoded_output.csv', 'rb'), delimiter='\t', quotechar='"')

Then, I attempt to encode it with:

name=school_name.encode('utf-8'), street=row[9].encode('utf-8'), city=row[10].encode('utf-8'), state=row[11].encode('utf-8'), zip5=row[12], zip4=row[13], county=row[25].encode('utf-8'), lat=row[22], lng=row[23])

I'm encoding everything except the lat and lng because those need to be sent out to an API. When I run the program to parse the dataset into what I can use, I get the following Traceback.

Traceback (most recent call last):
  File "push_into_db.py", line 80, in <module>
    main()
  File "push_into_db.py", line 74, in main
    district_map = buildDistrictSchoolMap()
  File "push_into_db.py", line 32, in buildDistrictSchoolMap
    county=row[25].encode('utf-8'), lat=row[22], lng=row[23])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)

I think I should tell you that I'm using Python 2.7.2, and this is part of an app built on Django 1.4. I've read several posts on this topic, but none of them seem to directly apply. Any help will be greatly appreciated.

You might also want to know that some of the non-standard characters causing the issue are Ñ and possibly É.

Unicode is not equal to UTF-8. The latter is just an encoding for the former.
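To make the distinction concrete, here is a small sketch showing one Unicode code point turned into bytes by two different encodings. The variable names are made up for illustration; the snippet should run on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# One abstract character: the Unicode code point U+00D1 (capital N with tilde).
n_tilde = u'\xd1'

# The same character as bytes under two different encodings.
as_utf8 = n_tilde.encode('utf-8')      # two bytes: 0xc3 0x91
as_latin1 = n_tilde.encode('latin-1')  # one byte: 0xd1

print(repr(as_utf8))
print(repr(as_latin1))

# Decoding each byte sequence with its own encoding gives back the same character.
same_character = as_utf8.decode('utf-8') == as_latin1.decode('latin-1')
print(same_character)
```

The bytes differ, the character does not. That is the sense in which UTF-8 is only one encoding of Unicode.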

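You are doing it the wrong way around. You are reading UTF-8-encoded data, and the csv module in Python 2 hands it to you as byte strings, so you have to decode each UTF-8-encoded string into a unicode string. Calling .encode('utf-8') on a byte string makes Python 2 first decode it implicitly with the ASCII codec, which is why an encode call ends in a UnicodeDecodeError.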
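The error from the traceback can be reproduced by doing explicitly what Python 2 does behind the scenes when .encode is called on a byte string, namely decoding with ASCII first. The sample value is invented for illustration (it is not taken from the asker's file); the snippet should run on Python 2.7 and Python 3.

```python
# Hypothetical cell content: four bytes with the non-ASCII byte 0xd1 at index 2.
raw = b'PE\xd1A'

try:
    # The ASCII codec only accepts bytes in the range 0..127.
    raw.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    # Keep the error text so it can be compared with the traceback.
    message = str(exc)

print(message)
```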

So just replace .encode with .decode, and it should work (if your .csv is UTF-8-encoded).
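A minimal sketch of that change, assuming the file really is UTF-8. The helper decode_rows and the sample row are hypothetical; the list of byte strings stands in for what csv.reader yields on Python 2.7 when the file is opened with 'rb'.

```python
def decode_rows(rows, encoding='utf-8'):
    """Yield each row with every byte-string cell decoded to a unicode string."""
    for row in rows:
        yield [cell.decode(encoding) for cell in row]


# Stand-in for rows coming out of csv.reader on Python 2.7 (cells are byte strings).
sample_rows = [[b'ESCUELA PE\xc3\x91A', b'San Juan']]

decoded = list(decode_rows(sample_rows))
print(repr(decoded))
```

With the cells decoded up front, the rest of the program can work with unicode strings and encode only at the point where bytes are actually required, for example when writing to a file or a socket.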

Nothing to be ashamed of, though. I bet 3 in 5 programmers had trouble at first understanding this, if not more ;)

Update: If your input data is not UTF-8 encoded, then you have to .decode() with the appropriate encoding, of course. If no encoding is given, Python 2 assumes ASCII, which fails on any non-ASCII character.
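One detail in the traceback may be worth a look: 0xd1 is the byte for Ñ in Latin-1 and Windows-1252, whereas UTF-8 stores Ñ as the two bytes 0xc3 0x91. That hints, without proving, that the file might not be UTF-8 at all. Below is a rough sketch of a decoder that tries UTF-8 first and falls back to Latin-1; decode_cell and the order of encodings are assumptions made for illustration, not something stated in the question.

```python
def decode_cell(raw, encodings=('utf-8', 'latin-1')):
    """Decode a byte string by trying each encoding in order (illustrative helper)."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            # This encoding does not fit the bytes; try the next one.
            continue
    raise ValueError('none of the encodings could decode %r' % (raw,))


# A lone 0xd1 followed by an ASCII letter is invalid UTF-8, so Latin-1 is used.
from_latin1 = decode_cell(b'PE\xd1A')
# Valid UTF-8 is decoded on the first attempt.
from_utf8 = decode_cell(b'PE\xc3\x91A')

print(from_latin1 == from_utf8)
```

Latin-1 maps every possible byte to a character, so it never raises and should come last; if the real encoding is something else, the fallback will silently produce the wrong characters. Knowing the actual encoding of the file is always better than guessing.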