且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

VBA - 将字符串转换为 UNICODE

更新时间:2022-11-09 10:02:52

VBA 对 Unicode 的支持并不是那么好.

可以处理 Unicode 字符串,但您将无法使用 Debug.PrintMsgBox 看到实际字符 - 它们将显示为 ? 在那里.

您可以设置控制面板>地区和语言 >管理选项卡 >非 Unicode 程序的当前语言" 改为俄语"切换到不同的代码页,这将允许您在 VBA 消息框中看到西里尔字母而不是问号.但这只是表面上的改变.


你真正的问题是这里的其他问题.

服务器 (nfs.mobile.bg) 将文档作为 Content-Type: text/html 发送.没有关于字符编码的信息.这意味着接收方必须自己弄清楚字符编码.

浏览器通过查看响应字节流并进行猜测来做到这一点.在您的情况下,一个有用的 <meta http-equiv="Content-Type"内容=文本/html;charset=windows-1251"> 标签存在于 HTML 源代码中.因此,字节流应该被解释为Windows-1251,这恰好是Windows中的Cyrillic ANSI代码页.

所以,我们这里甚至没有 Unicode!

在没有任何附加信息的情况下,XMLHTTP 对象的 responseText 属性默认为 us-ascii.来自西里尔字母的扩展字符不存在于 ASCII 中,因此它们将被转换为实际问号并丢失.这就是为什么您不能将 responseText 用于任何事情的原因.

然而,响应的原始字节仍然可用,在responseBody属性中,它是一个Byte数组.>

在 VBA 中,您必须做浏览器会做的事情.您必须将字节流解释为某个字符集.ADODB.Stream 对象可以为您做到这一点,而且非常简单:

' 参考:Microsoft XML,v6.0";(或任何其他版本)' 参考:Microsoft ActiveX 数据对象 6.1 库"(或任何其他版本)选项显式子 HTMLsearch()Dim url As String, html As Stringurl = http://nfs.mobile.bg/pcgi/mobile.cgi?act=3&slink=6jkjov&f1=1"html = GetHTML(url, Windows-1251")' Office 支持西里尔字符,因此它们会正确显示ActiveDocument.Range.InsertAfter html结束子函数 GetHTML(Url As String, Optional Charset As String = "UTF-8") As StringDim 请求作为新 MSXML2.XMLHTTPDim 转换器作为新的 ADODB.stream' 获取页面request.OpenGET",网址,假请求发送' 将原始字节写入流转换器.OpenConverter.Type = adTypeBinary转换器.写请求.响应体' 将流切换到文本模式并设置字符集转换器.位置 = 0转换器.Type = adTypeText转换器.Charset = 字符集' 从流中读取文本字符,关闭流GetHTML = converter.ReadText转换器.关闭结束函数

我一直在这里使用 MS Word 并调用 HTMLsearch() 正确地将西里尔字符写入页面.不过,对我来说,它们仍然在 MsgBox 中显示为 ?,但现在这纯粹是一个显示问题,因为 VBA 创建的 UI 无法处理 Unicode.

I need to convert the string HTML from a mix of Cyrillic and Latin symbols to UNICODE.

I tried the following:

Public HTML As String
    Sub HTMLsearch()

    GetHTML ("http://nfs.mobile.bg/pcgi/mobile.cgi?act=3&slink=6jkjov&f1=1")
    MsgBox HTML
    HTML = StrConv(HTML, vbUnicode)
    MsgBox HTML
End Sub

Function GetHTML(URL As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .Send
        HTML = .ResponseText
    End With
End Function

You can see what is before and after the StrConv. If you like to get the html in a file, you can use the following code:

Public HTML As String
    Sub HTMLsearch()

    GetHTML ("http://nfs.mobile.bg/pcgi/mobile.cgi?act=3&slink=6jkjov&f1=1")

    Dim path As String

    path = ThisWorkbook.path & "html.txt"
    Open path For Output As #1
    Print #1, HTML
    Close #1

    HTML = StrConv(HTML, vbUnicode)

    path = ThisWorkbook.path & "htmlUNICODE.txt"
    Open path For Output As #1
    Print #1, HTML
    Close #1
End Sub

Function GetHTML(URL As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .Send
        HTML = .ResponseText
    End With
End Function

IDEAS?

VBA's support for Unicode is not all that great.

It is possible to handle Unicode strings, but you will not be able to see the actual characters with Debug.Print or MsgBox - they will appear as ? there.

You can set Control Panel > Region and Language > Administrative tab > "Current language for non-Unicode programs" to "Russian" switch to a different code page, which would allow you to see Cyrillic letters in VBA message boxes instead of question marks. But that's only a cosmetic change.


Your real problem is something else here.

The server (nfs.mobile.bg) sends the document as Content-Type: text/html. There is no information about character encoding. That means the receiver must figure out character encoding on its own.

A browser does that by looking at the response byte stream and making guesses. In your case, a helpful <meta http-equiv="Content-Type" content="text/html; charset=windows-1251"> tag is present in the HTML source. Therefore, the byte stream should be interpreted as Windows-1251, which happens to be the Cyrillic ANSI code page in Windows.

So, we do not even have Unicode here!

In the absence of any additional info, the responseText property of the XMLHTTP object defaults to us-ascii. The extended characters from the Cyrillic alphabet are not present in ASCII, so they will be converted to actual question marks and are lost. That's why you can't use responseText for anything.

However, the original bytes of the response are still available, in the responseBody property, which is an array of Byte.

In VBA you must do the same thing a browser would do. You must interpret the byte-stream as a certain character set. The ADODB.Stream object can do that for you, and it's pretty straight-forward, too:

' reference: "Microsoft XML, v6.0" (or any other version)
' reference: "Microsoft ActiveX Data Objects 6.1 library" (or any other version)
Option Explicit

Sub HTMLsearch()
    Dim url As String, html As String
    
    url = "http://nfs.mobile.bg/pcgi/mobile.cgi?act=3&slink=6jkjov&f1=1"
    html = GetHTML(url, "Windows-1251")
    
    ' Cyrillic characters are supported in Office, so they will appear correctly
    ActiveDocument.Range.InsertAfter html
End Sub

Function GetHTML(Url As String, Optional Charset As String = "UTF-8") As String
    Dim request As New MSXML2.XMLHTTP
    Dim converter As New ADODB.stream
    
    ' fetch page
    request.Open "GET", Url, False
    request.send
    
    ' write raw bytes to the stream
    converter.Open
    converter.Type = adTypeBinary
    converter.Write request.responseBody
    
    ' switch the stream to text mode and set charset
    converter.Position = 0
    converter.Type = adTypeText
    converter.Charset = Charset
    
    ' read text characters from the stream, close the stream
    GetHTML = converter.ReadText
    converter.Close
End Function

I've been using MS Word here and calling HTMLsearch() properly writes Cyrillic characters to the page. They still do appear as ? in a MsgBox for me, though, but now that's purely a display problem, caused by the fact that VBA-created UI cannot deal with Unicode.