且构网


URL decoding for non-ASCII characters in Java

Updated: 2023-02-23 12:10:13

Anv%E4ndare

As PopoFibo says, this is not a valid UTF-8 encoded sequence.

You can do some tolerant best-guess decoding:

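import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches one percent-escape such as %E4; compiled once and reused.
private static final Pattern ESCAPE = Pattern.compile("%([A-Fa-f0-9][A-Fa-f0-9])");

// Tries each charset in order and returns the first strict decode that succeeds.
// Falls back to the undecoded segment if no charset accepts the bytes.
public static String parse(String segment, Charset... encodings) {
  byte[] data = parse(segment);
  for (Charset encoding : encodings) {
    try {
      return encoding.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(data))
          .toString();
    } catch (CharacterCodingException notThisCharset_ignore) {
      // Bytes are not valid in this charset; try the next one.
    }
  }
  return segment;
}

// Turns percent-escapes into raw bytes; everything else is copied as ASCII.
private static byte[] parse(String segment) {
  ByteArrayOutputStream buf = new ByteArrayOutputStream();
  Matcher matcher = ESCAPE.matcher(segment);
  int last = 0;
  while (matcher.find()) {
    appendAscii(buf, segment.substring(last, matcher.start()));
    byte hex = (byte) Integer.parseInt(matcher.group(1), 16);
    buf.write(hex);
    last = matcher.end();
  }
  appendAscii(buf, segment.substring(last));
  return buf.toByteArray();
}

private static void appendAscii(ByteArrayOutputStream buf, String data) {
  byte[] b = data.getBytes(StandardCharsets.US_ASCII);
  buf.write(b, 0, b.length);
}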

This code will successfully decode the given strings:

for (String test : Arrays.asList("Fondation_Alliance_fran%C3%A7aise",
    "Anv%E4ndare")) {
  String result = parse(test, StandardCharsets.UTF_8,
      StandardCharsets.ISO_8859_1);
  System.out.println(result);
}

Note that this is not a foolproof system that lets you ignore correct URL encoding. It works here because v%E4n (the byte sequence 76 E4 6E) is not a valid sequence under the UTF-8 scheme, and the decoder can detect this: in UTF-8, E4 opens a three-byte sequence and must be followed by two continuation bytes in the range 80 to BF, which 6E is not.
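The detection step can be checked in isolation. The following is a minimal sketch, not part of the answer's code: the class name StrictUtf8Check and the helper isValidUtf8 are hypothetical names chosen for illustration. It only shows that a decoder configured to report errors rejects 76 E4 6E but accepts C3 A7.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {
  // Hypothetical helper: true if the bytes form a well-formed UTF-8 sequence.
  static boolean isValidUtf8(byte[] data) {
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(data));
      return true;
    } catch (CharacterCodingException e) {
      return false; // the decoder reported a malformed sequence
    }
  }

  public static void main(String[] args) {
    byte[] latin1Bytes = {0x76, (byte) 0xE4, 0x6E}; // v%E4n
    byte[] utf8Bytes = {(byte) 0xC3, (byte) 0xA7};  // %C3%A7
    System.out.println(isValidUtf8(latin1Bytes));
    System.out.println(isValidUtf8(utf8Bytes));
  }
}
```

Reporting errors is what makes the fallback possible. A decoder that silently substitutes a replacement character would never signal that the next charset should be tried.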

If you reverse the order of the encodings, the first string can happily (but incorrectly) be decoded as ISO-8859-1.
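To see why the order matters, here is a small sketch with hypothetical names (CharsetOrderDemo, decodeLatin1, decodeUtf8). ISO-8859-1 assigns a character to every possible byte value, so it never rejects anything, and a charset that never fails only makes sense as the last candidate.

```java
import java.nio.charset.StandardCharsets;

class CharsetOrderDemo {
  // Hypothetical helper: ISO-8859-1 maps every byte to a character, so this never fails.
  static String decodeLatin1(byte[] data) {
    return new String(data, StandardCharsets.ISO_8859_1);
  }

  // Hypothetical helper: decodes the same bytes as UTF-8.
  static String decodeUtf8(byte[] data) {
    return new String(data, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Bytes of the percent-decoded "fran%C3%A7aise".
    byte[] data = {'f', 'r', 'a', 'n', (byte) 0xC3, (byte) 0xA7, 'a', 'i', 's', 'e'};
    System.out.println(decodeUtf8(data));   // the intended text
    System.out.println(decodeLatin1(data)); // two characters where one was meant
  }
}
```

Read as ISO-8859-1, the bytes C3 A7 become the two characters U+00C3 and U+00A7 instead of the single character U+00E7.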

Note: HTTP doesn't care about percent-encoding, and you can write a web server that accepts http://foo/%%%%% as a valid form. The URI spec mandates UTF-8, but this was done retroactively. It is really up to the server to describe what form its URIs should take, and if you have to handle arbitrary URIs you need to be aware of this legacy.
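For comparison, this sketch shows how the standard java.net.URLDecoder treats such legacy input. The class and method names (UrlDecoderComparison, decodeUtf8, rejects) are hypothetical. The URLDecoder documentation leaves the handling of illegal strings to the implementation, so treat the rejection of %%%%% as observed OpenJDK behaviour rather than a guarantee.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class UrlDecoderComparison {
  // Hypothetical helper: decodes with the standard library, assuming UTF-8.
  static String decodeUtf8(String s) {
    try {
      return URLDecoder.decode(s, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e); // UTF-8 is required on every JVM
    }
  }

  // Hypothetical helper: true if the decoder refuses the input outright.
  static boolean rejects(String s) {
    try {
      decodeUtf8(s);
      return false;
    } catch (IllegalArgumentException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    // The lone byte E4 is malformed UTF-8, so it is replaced with U+FFFD.
    System.out.println(decodeUtf8("Anv%E4ndare").indexOf('\uFFFD') >= 0);
    // Implementation-dependent: OpenJDK throws on escapes that are not hex.
    System.out.println(rejects("%%%%%"));
  }
}
```

The replacement happens silently, which is why the best-guess approach above uses a reporting decoder instead.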

I wrote more about URLs and Java here.