Windows String Representations and Simple Conversions via _bstr_t

The Windows NT kernel (from Windows NT through Windows 11) uses UTF-16 strings internally, and so do Windows drivers, native applications, and COM clients and servers. Strings in other common encodings such as UTF-8, GBK, GB18030, or Big5 must be converted to UTF-16 before they reach the native functions exported by ntdll.dll.

There is a compatibility layer on top of the kernel, called the Win32 API, a large legacy that dates back to the Win9x era. The Win32 API is a family of functions that lets programmers communicate with the operating system and hardware conveniently. Microsoft maintains this compatibility on the Windows NT kernel, so most Win32 functions remain unchanged and ABI-compatible. For example, the CreateFile function, shown here in its narrow-string form CreateFileA, has existed from Windows 98 to Windows 11. That is remarkable, considering that a Linux distribution may break an API in a minor update!

HANDLE CreateFileA(
  [in]           LPCSTR                lpFileName,
  [in]           DWORD                 dwDesiredAccess,
  [in]           DWORD                 dwShareMode,
  [in, optional] LPSECURITY_ATTRIBUTES lpSecurityAttributes,
  [in]           DWORD                 dwCreationDisposition,
  [in]           DWORD                 dwFlagsAndAttributes,
  [in, optional] HANDLE                hTemplateFile
);

Some challenges then arose. The Unicode Standard gained general acceptance among OS vendors after Win9x had been released, while Win9x itself still used ANSI code pages or region-specific multi-byte encodings such as GB2312 and Big5. The Windows NT kernel chose UTF-16 as its string representation, which complicated compatibility with existing Win32 code.

To resolve this issue, Microsoft’s engineers decided to provide each string-taking Win32 API in two versions. The only difference between the two versions is the string type: LPCSTR vs LPCWSTR, which are aliases of const char* and const wchar_t* in C++. The former holds a string in an ANSI code page, and the latter holds a UTF-16 encoded string. C has no function overloading, so to keep the exported names distinct at the C ABI level the developers simply added a single-letter suffix to the function name: A for the ANSI version and W for the wide (Unicode) version. Users are free to call either of them in their projects.

BOOL DeleteFileA(
  [in] LPCSTR lpFileName
);

BOOL DeleteFileW(
  [in] LPCWSTR lpFileName
);

Generally speaking, the encoding APIs MultiByteToWideChar and WideCharToMultiByte are the usual way to achieve interoperability between user-mode programs that use different internal string representations on Windows.

int WideCharToMultiByte(
  [in]            UINT                               CodePage,
  [in]            DWORD                              dwFlags,
  [in]            _In_NLS_string_(cchWideChar)LPCWCH lpWideCharStr,
  [in]            int                                cchWideChar,
  [out, optional] LPSTR                              lpMultiByteStr,
  [in]            int                                cbMultiByte,
  [in, optional]  LPCCH                              lpDefaultChar,
  [out, optional] LPBOOL                             lpUsedDefaultChar
);

int MultiByteToWideChar(
  [in]            UINT                              CodePage,
  [in]            DWORD                             dwFlags,
  [in]            _In_NLS_string_(cbMultiByte)LPCCH lpMultiByteStr,
  [in]            int                               cbMultiByte,
  [out, optional] LPWSTR                            lpWideCharStr,
  [in]            int                               cchWideChar
);

A single conversion typically takes two calls to the same function: the first call computes the required buffer size and the second performs the actual conversion. There is a simpler way to get the same result: the _bstr_t class that ships with the compiler’s COM support.

COM always uses UTF-16 strings, as mentioned above, and BSTR (a wchar_t* preceded by a length prefix) is the standard string type of COM. MSVC ships a wrapper class for BSTR called _bstr_t, which encapsulates the BSTR data type and offers a simpler alternative to the general approach.

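A more convenient way to convert between encodings is to construct a _bstr_t object using the constructor that takes a const char*. This overload takes an ANSI string and converts it to UTF-16 immediately, and _bstr_t provides an operator const wchar_t*() conversion operator, so the result can be used wherever a wide string is expected. The reverse direction works the same way: construct the _bstr_t from a const wchar_t* and read it back through operator const char*(). Note that both directions use the system's active ANSI code page, not UTF-8.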

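#include <string>
#include <string_view>

#include <comutil.h>  // _bstr_t (MSVC COM support)

// ANSI -> UTF-16. The constructor interprets the bytes using the active
// ANSI code page, so the literal must be encoded in that code page.
const std::string str{ "Hello some 汉字 characters" };
const _bstr_t wide_str{ str.c_str() };
// The view borrows wide_str's buffer, so wide_str must outlive it.
const std::wstring_view wide_view{ static_cast<const wchar_t*>(wide_str) };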

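#include <string>
#include <string_view>

#include <comutil.h>  // _bstr_t (MSVC COM support)

// UTF-16 -> ANSI, the reverse conversion.
const std::wstring str{ L"一些宽字符,逆向转换" };
const _bstr_t bstr{ str.c_str() };
// The narrow copy is created on first use and owned by bstr,
// so bstr must outlive the view.
const std::string_view narrow_view{ static_cast<const char*>(bstr) };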
