Decoding only the non-ASCII characters in a URL

Updated: 2023-02-23 13:07:56

The easiest way is to replace every URL-encoded sequence below %80 (that is, %00-%7F) with a placeholder, URL-decode the result, and then put the original sequences back in place of the placeholders.
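The placeholder approach could be sketched roughly as follows. This assumes Python 3, where urllib.unquote lives at urllib.parse.unquote, and a URL whose non-ASCII escapes are UTF-8. The function name and the NUL-delimited placeholder format are illustrative choices, not part of the original answer.

```python
import re
from urllib.parse import unquote

# Percent-escapes for bytes 0x00-0x7F: the first hex digit is 0-7.
ASCII_ESCAPE = re.compile(r"%[0-7][0-9A-Fa-f]")


def unquote_non_ascii(url):
    """Decode only the non-ASCII percent-escapes in url (assumed UTF-8)."""
    saved = []

    def stash(match):
        # Remember the original escape and leave a numbered marker behind.
        # A raw NUL never appears literally in a URL, so it cannot clash.
        saved.append(match.group(0))
        return f"\x00{len(saved) - 1}\x00"

    protected = ASCII_ESCAPE.sub(stash, url)
    decoded = unquote(protected)  # only escapes %80 and above remain
    # Put the original ASCII escapes back in place of the markers.
    return re.sub(r"\x00(\d+)\x00", lambda m: saved[int(m.group(1))], decoded)


print(unquote_non_ascii("/wiki/%E0%B8%89%E0%B8%B1%E0%B8%99?q=a%20b%2Fc"))
```

Here the Thai word is decoded while %20 and %2F stay escaped, so the path and query keep their meaning.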

Another way is to look for UTF-8 sequences. Your URL appears to be encoded in UTF-8, and Wikipedia uses UTF-8. See the Wikipedia entry for UTF-8 for how UTF-8 characters are encoded.

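So, when encoded in a URL, each non-ASCII UTF-8 character follows one of these byte patterns:

  • (%C0-%DF)(%80-%BF)
  • (%E0-%EF)(%80-%BF)(%80-%BF)
  • (%F0-%F7)(%80-%BF)(%80-%BF)(%80-%BF)
  • (%F8-%FB)(%80-%BF)(%80-%BF)(%80-%BF)(%80-%BF)
  • (%FC-%FD)(%80-%BF)(%80-%BF)(%80-%BF)(%80-%BF)(%80-%BF)

Note that only the first three forms occur in UTF-8 as currently defined (RFC 3629), which also rules out the lead bytes %C0, %C1 and %F5-%F7. The five- and six-byte forms come from the original UTF-8 design and are no longer valid.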

So you can match these patterns in the URL and unquote each character separately.
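A minimal sketch of the pattern-matching approach, assuming Python 3 (urllib.parse.unquote). It matches only the two-, three- and four-byte forms, since RFC 3629 limits UTF-8 to four bytes; the names UTF8_ESCAPED and unquote_utf8_sequences are made up for this example.

```python
import re
from urllib.parse import unquote

CONT = r"%[89ABab][0-9A-Fa-f]"  # one continuation byte, %80-%BF

UTF8_ESCAPED = re.compile(
    r"%(?:[Cc][2-9A-Fa-f]|[Dd][0-9A-Fa-f])" + CONT   # 2 bytes, lead %C2-%DF
    + r"|%[Ee][0-9A-Fa-f]" + CONT * 2                # 3 bytes, lead %E0-%EF
    + r"|%[Ff][0-4]" + CONT * 3                      # 4 bytes, lead %F0-%F4
)


def unquote_utf8_sequences(url):
    """Decode each escaped UTF-8 character; leave everything else as is."""

    def decode(match):
        try:
            # Strict decoding rejects overlong forms and surrogates that
            # the byte-range pattern alone cannot exclude.
            return unquote(match.group(0), encoding="utf-8", errors="strict")
        except UnicodeDecodeError:
            return match.group(0)

    return UTF8_ESCAPED.sub(decode, url)


print(unquote_utf8_sequences("/wiki/%E0%B8%89%E0%B8%B1%E0%B8%99%20x"))
```

Escapes that do not form a complete sequence, including every ASCII escape such as %20, are never matched and so pass through unchanged.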

However, remember that not all URLs are encoded in UTF-8.

Some old websites still use other character sets, such as Windows-874 for Thai.

In such cases, "ฉัน" on that particular website is encoded as "%A9%D1%B9" instead of "%E0%B8%89%E0%B8%B1%E0%B8%99". If you decode it using urllib.unquote, you will get garbled text like "?ѹ" instead of "ฉัน", and that could break the link.
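The mismatch can be reproduced in a few lines. This sketch assumes Python 3, where urllib.parse.unquote decodes as UTF-8 by default and substitutes U+FFFD (rather than "?") for bytes it cannot decode; the variable names are illustrative.

```python
from urllib.parse import unquote

escaped = "%A9%D1%B9"  # "ฉัน" as a Windows-874 site would encode it

# Default behaviour: treat the bytes as UTF-8, replacing invalid ones.
garbled = unquote(escaped)

# Telling unquote the real character set recovers the Thai text.
thai = unquote(escaped, encoding="cp874")

print(repr(garbled))
print(thai)
```

The second call works only because the character set is known in advance; the URL itself carries no indication of it.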

So you have to be careful and check whether URL decoding breaks the link. Make sure that the URL you're decoding is in UTF-8.
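One way to make that check, sketched under the assumption of Python 3, is to decode the escaped bytes strictly and see whether they form valid UTF-8. The helper name is_utf8_encoded is hypothetical.

```python
from urllib.parse import unquote_to_bytes


def is_utf8_encoded(url):
    """Return True if the percent-escaped bytes in url are valid UTF-8."""
    try:
        # bytes.decode is strict by default and raises on invalid input.
        unquote_to_bytes(url).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_utf8_encoded("%E0%B8%89%E0%B8%B1%E0%B8%99"))  # UTF-8 form
print(is_utf8_encoded("%A9%D1%B9"))                    # Windows-874 form
```

Passing this check is necessary but not sufficient: some byte sequences from legacy character sets happen to be valid UTF-8 too, so a URL from a site known to use another encoding should still be left alone.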