Normalization | Writing Secure Code, Second Edition

Normalization

Many character set encodings, but especially Unicode, have multiple binary representations for the same string. For example, dozens of distinct code-point sequences can render as the same visible text. This multiplicity complicates operations such as indexing and validation, and the added complexity increases the risk of coding errors that compromise security. To reduce complexity in your code, normalize strings to a single form.
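
As a minimal sketch of the problem, the following Python fragment builds the same visible character two different ways and shows that the strings compare equal only after normalization. The helper name to_nfc is an assumption made for this illustration; unicodedata.normalize is part of the Python standard library.

```python
import unicodedata

# Two encodings of the same visible text "é":
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


# The raw strings differ in both length and content...
raw_equal = precomposed == decomposed
# ...but they compare equal once both are normalized to the same form.
normalized_equal = to_nfc(precomposed) == to_nfc(decomposed)

print(raw_equal, normalized_equal)          # False True
print(len(precomposed), len(decomposed))    # 1 2
```

A validation routine that compares the raw strings would treat these as two different inputs, which is exactly the kind of mismatch that normalizing first is meant to remove.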

Many normalization forms exist already:

The Unicode Consortium has defined four standard normalization forms: C, D, KC, and KD. Consider adopting Normalization Form C for new designs. It is the most frequently adopted and the easiest to optimize, and most of the Internet normalization forms are modifications of it. You can find more information at http://www.unicode.org/unicode/reports/tr15/.
Normalization of URIs is a hot topic within the Internet Engineering Task Force (IETF) and W3C. Details are available at http://www.i-d-n.net/draft/draft-duerst-i18n-norm-04.txt and at http://www.w3.org/TR/charmod.
Each file system has a unique form. NTFS, FAT32, NFS, High Sierra, and MacOS are all quite distinct.
Several normalization standards are specific to Internet protocols. Consult the RFC for your application domain.
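
To make the differences among the four Unicode forms concrete, here is a small Python sketch that applies each one to the same input. The sample string is an assumption chosen for illustration: it contains a ligature, which only the compatibility forms (KC and KD) rewrite, and an accented letter, which the decomposed forms (D and KD) split in two.

```python
import unicodedata

# "ﬁ" ligature (U+FB01) + "anc" + precomposed "é" (U+00E9)
sample = "\ufb01anc\u00e9"

# Apply each of the four standard forms to the same input.
forms = {
    name: unicodedata.normalize(name, sample)
    for name in ("NFC", "NFD", "NFKC", "NFKD")
}

# Print the code points so the differences are visible even when
# the rendered text looks identical.
for name, text in forms.items():
    print(name, [f"U+{ord(ch):04X}" for ch in text])
```

Form C leaves this sample unchanged, while Form KC replaces the ligature with the two letters "f" and "i". Because a normalization form can change which characters are present, it is safer to validate a string after normalizing it rather than before.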

The Win32 FoldString function provides several useful options for normalizing strings. Unfortunately, it doesn't cover the full range of Unicode characters, and the mappings do not always match any of the Unicode normalization forms. If you do use FoldString, be sure to test your code with the full Unicode repertoire. For example, if you use FoldString with the MAP_FOLDDIGITS option, it will normalize many but not all of the characters with the numeric Unicode property.
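
The advice to test against the full Unicode repertoire can be sketched in Python. This example does not call FoldString and makes no claim about its actual mappings. Instead, fold_digits is a hypothetical stand-in that folds only characters with a decimal-digit value, and unfolded_numerics sweeps every code point to find characters that have the numeric property but are left untouched by the fold. Both names are assumptions made for this illustration.

```python
import sys
import unicodedata


def fold_digits(text: str) -> str:
    """Hypothetical fold: map Unicode decimal digits to ASCII 0-9."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None if not a decimal digit
        out.append(str(value) if value is not None else ch)
    return "".join(out)


def unfolded_numerics() -> list:
    """Return every numeric character that fold_digits leaves non-ASCII."""
    leftovers = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.numeric(ch, None) is None:
            continue  # character has no numeric property
        if not fold_digits(ch).isascii():
            leftovers.append(ch)
    return leftovers


# Arabic-Indic digits three and eight fold to ASCII.
print(fold_digits("\u0663\u0668"))   # 38

# Fractions, Roman numerals, and similar characters are numeric
# but have no decimal-digit value, so the fold passes them through.
leftovers = unfolded_numerics()
print("\u00bd" in leftovers, "\u2167" in leftovers)   # True True
```

The size of the leftover list depends on the Unicode version that the Python build ships with, so the sketch reports membership rather than a count. A sweep of the same shape, run against the real folding routine, is one way to discover which numeric characters it does not cover.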