Decoding only the non-ASCII characters in a URL

Updated: 2023-02-23 13:07:56

The easiest way is to replace every URL-encoded sequence below %80 (%00-%7F) with a placeholder, URL-decode the result, and then put the original sequences back in place of the placeholders.
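The placeholder approach could be sketched in Python 3 roughly as follows. The function name `decode_non_ascii` and the choice of a NUL character as placeholder are assumptions made for illustration; any character that cannot occur in the URL would do.

```python
import re
from urllib.parse import unquote

# Escapes for bytes 0x00-0x7F: "%", a hex digit 0-7, then any hex digit.
ASCII_ESCAPE = re.compile(r"%([0-7][0-9A-Fa-f])")

# Hypothetical placeholder: NUL is assumed never to appear in the input URL.
PLACEHOLDER = "\x00"


def decode_non_ascii(url):
    """Decode only the %80-%FF escapes, leaving %00-%7F escapes untouched."""
    if PLACEHOLDER in url:
        raise ValueError("URL already contains the placeholder character")
    # 1. Hide the ASCII escapes by swapping their "%" for the placeholder.
    protected = ASCII_ESCAPE.sub(lambda m: PLACEHOLDER + m.group(1), url)
    # 2. Decode whatever escapes remain (unquote assumes UTF-8 by default).
    decoded = unquote(protected)
    # 3. Turn the placeholders back into "%" to restore the ASCII escapes.
    return decoded.replace(PLACEHOLDER, "%")
```

Because escapes such as %20, %2F and %25 are restored verbatim, the result still parses the same way as the original URL.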

Another way is to look for UTF-8 sequences. Your URL appears to be encoded in UTF-8, and Wikipedia uses UTF-8. See the Wikipedia entry for UTF-8 for how UTF-8 characters are encoded.

So, when encoded in a URL, each valid non-ASCII UTF-8 character follows one of these patterns:

(%C0-%DF)(%80-%BF)
(%E0-%EF)(%80-%BF)(%80-%BF)
(%F0-%F7)(%80-%BF)(%80-%BF)(%80-%BF)
(%F8-%FB)(%80-%BF)(%80-%BF)(%80-%BF)(%80-%BF)
(%FC-%FD)(%80-%BF)(%80-%BF)(%80-%BF)(%80-%BF)(%80-%BF)

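So you can match these patterns in the URL and unquote each character separately. Note that the 5- and 6-byte forms (lead bytes %F8-%FD) come from the original UTF-8 definition; RFC 3629 limits UTF-8 to at most four bytes, with lead bytes %C2-%DF, %E0-%EF and %F0-%F4, so in practice only the first three patterns matter.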
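A minimal Python 3 sketch of this pattern matching is shown below. The names `UTF8_SEQUENCE` and `decode_utf8_sequences` are hypothetical, and the sketch only covers the 2-, 3- and 4-byte patterns. Each match is decoded strictly, so a sequence that fits a pattern but is not valid UTF-8 is left escaped.

```python
import re
from urllib.parse import unquote

CONT = r"%[89ABab][0-9A-Fa-f]"  # one continuation byte, %80-%BF

UTF8_SEQUENCE = re.compile(
    r"%[CDcd][0-9A-Fa-f]" + CONT          # (%C0-%DF)(%80-%BF)
    + r"|%[Ee][0-9A-Fa-f]" + CONT * 2     # (%E0-%EF)(%80-%BF) x 2
    + r"|%[Ff][0-7]" + CONT * 3           # (%F0-%F7)(%80-%BF) x 3
)


def _unquote_match(match):
    """Decode one matched sequence, or keep it if it is not valid UTF-8."""
    try:
        return unquote(match.group(0), errors="strict")
    except UnicodeDecodeError:
        return match.group(0)


def decode_utf8_sequences(url):
    """Unquote each escaped UTF-8 character; leave every other escape alone."""
    return UTF8_SEQUENCE.sub(_unquote_match, url)
```

Decoding strictly per match is a design choice: it lets Python's own UTF-8 decoder reject overlong or out-of-range sequences that the byte-range patterns alone would accept.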

However, remember that not all URLs are encoded in UTF-8.

Some old websites still use other character sets, such as Windows-874 for Thai.

In such cases, "ฉัน" on that particular website is encoded as "%A9%D1%B9" instead of "%E0%B8%89%E0%B8%B1%E0%B8%99". If you decode it using urllib.unquote (urllib.parse.unquote in Python 3), you will get garbled text like "?ѹ" instead of "ฉัน", and that could break the link.
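The difference can be reproduced with a short Python 3 snippet. It assumes `urllib.parse.unquote` and Python's `cp874` codec, which is the codec name for Windows-874; the variable names are only illustrative.

```python
from urllib.parse import unquote

legacy = "%A9%D1%B9"                    # "ฉัน" escaped as Windows-874 bytes
utf8 = "%E0%B8%89%E0%B8%B1%E0%B8%99"    # "ฉัน" escaped as UTF-8 bytes

# unquote assumes UTF-8 and replaces undecodable bytes with U+FFFD,
# so the legacy escapes come out garbled.
as_utf8 = unquote(legacy)

# Passing the real character set recovers the intended text.
as_cp874 = unquote(legacy, encoding="cp874")

# The UTF-8 escapes decode correctly with the default settings.
from_utf8 = unquote(utf8)
```

Here `as_utf8` is a replacement character followed by "ѹ", while `as_cp874` and `from_utf8` are both "ฉัน".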

So you have to be careful and check whether URL decoding breaks the link. Make sure that the URL you are decoding is in UTF-8.
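One rough way to run that check in Python 3 is to decode the escaped bytes strictly and see whether they form valid UTF-8. The function name `escapes_are_utf8` is hypothetical, and this is only a heuristic: bytes from a legacy character set can occasionally happen to be valid UTF-8 as well.

```python
from urllib.parse import unquote_to_bytes


def escapes_are_utf8(url):
    """Return True if the percent-decoded bytes of url are valid UTF-8."""
    try:
        # Strict decoding raises on any byte sequence that is not UTF-8.
        unquote_to_bytes(url).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A URL that fails this check is better left escaped as it is, since decoding it as UTF-8 would change the bytes the link points to.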
