Character Set Conversion Issues

In general, every character set encoding assigns slightly different semantics to its code points. Thus, even well-defined mappings between encodings can lose information. For example, a control character that is meaningful in ISO 8859-8-E (Bidirectional Hebrew) loses all meaning in UTF-16, and a private use character in code page 950 (Traditional Chinese Big5) might map to a completely different character in UTF-16.

Your code must recognize that these losses can occur. In particular, if your code converts between encodings, do not assume that the original string was safe just because the converted string is safe.
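The following minimal Python sketch illustrates the point. It is an illustration only: the is_safe check and the sample file name are hypothetical, and the lossy conversion uses Python's cp1252 codec with errors="ignore" rather than any Windows API. The original string contains a right-to-left override control character that the target encoding cannot represent, so the conversion silently drops it.

```python
# Hypothetical safety check: reject strings that contain Unicode
# bidirectional formatting controls (U+202A through U+202E).
BIDI_CONTROLS = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e"}


def is_safe(text: str) -> bool:
    """Return True if the text contains none of the bidi controls above."""
    return not any(ch in BIDI_CONTROLS for ch in text)


# The original string hides a RIGHT-TO-LEFT OVERRIDE (U+202E).
original = "abc\u202etxt.exe"

# A lossy conversion: cp1252 has no mapping for U+202E, and
# errors="ignore" drops unmappable characters without reporting them.
converted = original.encode("cp1252", errors="ignore").decode("cp1252")

print(is_safe(converted))  # the converted string passes the check
print(is_safe(original))   # the original string does not
```

In this sketch, validating only the converted string would approve input whose original form still carries the control character. Checking the string in the form that is actually stored or displayed avoids that gap.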

Use MultiByteToWideChar and WideCharToMultiByte for UTF-8 conversions on Windows XP and later. Conversion between UTF-8 and UTF-16 can be lossless and secure, but only if you are careful. If you must convert between the two forms, be sure to use a converter that is up to date with the latest security advisories. Several products and Windows components have cloned an early, insecure version of the converter; do not use these. Microsoft has tuned the MultiByteToWideChar and WideCharToMultiByte tables over the years for security and application compatibility. Do not roll your own converter, even if doing so appears to yield a better mapping.
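As a rough illustration of what "careful" can mean, the sketch below round-trips text between UTF-8 and UTF-16 with strict error handling, so ill-formed input is rejected instead of being silently repaired. It uses Python's built-in codecs, not the Windows functions named above, and the helper names are hypothetical. On Windows, passing the MB_ERR_INVALID_CHARS flag to MultiByteToWideChar serves a similar purpose.

```python
def utf8_to_utf16_strict(data: bytes) -> bytes:
    """Convert UTF-8 bytes to UTF-16LE bytes, raising on ill-formed input."""
    text = data.decode("utf-8", errors="strict")
    return text.encode("utf-16-le")


def utf16_to_utf8_strict(data: bytes) -> bytes:
    """Convert UTF-16LE bytes to UTF-8 bytes, raising on ill-formed input."""
    text = data.decode("utf-16-le", errors="strict")
    return text.encode("utf-8")


# Well-formed text, including a character outside the Basic
# Multilingual Plane, survives the round trip unchanged.
sample = "h\u00e9llo \U0001F600".encode("utf-8")
assert utf16_to_utf8_strict(utf8_to_utf16_strict(sample)) == sample

# 0xC0 0xAF is an overlong (ill-formed) encoding of "/".
# A strict decoder refuses it rather than producing "/".
try:
    utf8_to_utf16_strict(b"\xc0\xaf")
except UnicodeDecodeError as exc:
    print("rejected ill-formed UTF-8:", exc.reason)
```

The design choice here is to fail loudly: a converter that quietly substitutes or normalizes bad sequences can let input slip past checks that were applied to the bytes before conversion.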