C0 Control Characters
The first 32 Unicode characters, with code points from 0 to 31, are known as the C0 controls. They were originally defined in ASCII to control teletypes and other monospace dumb terminals. Aside from the tab, carriage return, and line feed, they have no obvious meaning in text. Since XML is text, it does not include binary characters such as NULL (#x00), BEL (#x07), DC1 (#x11) through DC4 (#x14), and so forth. These control characters are historical relics. XML 1.0 does not allow them. This is a good thing. Although dumb terminals and binary-hostile gateways are far less common today than they were twenty years ago, they are still used, and passing these characters through equipment that expects to see plain text can have nasty consequences, including disabling the screen. (One common problem that still occurs is accidentally paging a binary file on a console. This is generally quite ugly and often disables the console.)
A few of these characters occasionally do appear in non-XML text data. For example, the form feed (#x0C) is sometimes used to indicate a page break. Thus moving data from a non-XML system, such as a BLOB or CLOB field in a database, into an XML document can unexpectedly cause well-formedness errors. Text may need to be cleaned before it can be added to an XML document. However, the far more common problem is that a document's encoding is misidentified, for example, assumed to be UTF-8 when it's really UTF-16 or ISO-8859-1. In this case, the parser will notice unexpected nulls or invalid byte sequences and throw a well-formedness error.
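The cleaning step mentioned above might look something like the following Python sketch. The function names strip_c0_controls and is_well_formed are invented for this illustration, and the choice to delete the offending characters rather than replace them with a space or reject the input is an assumption; the right policy depends on the data. It handles only the C0 controls, not the other characters XML 1.0 forbids.

```python
import re
import xml.etree.ElementTree as ET

# C0 controls that XML 1.0 forbids: everything below #x20
# except tab (#x09), line feed (#x0A), and carriage return (#x0D).
_C0_ILLEGAL = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")


def strip_c0_controls(text: str, replacement: str = "") -> str:
    """Return text with the forbidden C0 controls replaced (deleted by default)."""
    return _C0_ILLEGAL.sub(replacement, text)


def is_well_formed(xml_text: str) -> bool:
    """Report whether the standard library's XML 1.0 parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical field value pulled from a database, with a form feed and a BEL.
raw = "Page one\x0cPage two\x07"

verse = ET.Element("verse")
verse.text = strip_c0_controls(raw)
serialized = ET.tostring(verse, encoding="unicode")
```

Pasting the raw value directly between the tags would produce a document the parser rejects, while the cleaned version parses normally. Passing replacement=" " instead keeps words on either side of a removed form feed from running together.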
XML 1.1 fortunately still does not allow raw binary data in an XML document. However, it does allow you to use character references to escape the C0 controls such as form feed and BEL. The parser will resolve them into the actual characters before reporting the data to the client application. You simply can't include them directly. For example, the following document uses form feeds to separate pages.
<?xml version="1.1"?>
<book>
  <title>Nursery Rhymes</title>
  <rhyme>
    <verse>Mary, Mary quite contrary</verse>
    <verse>How does your garden grow?</verse>
  </rhyme>
  &#x0C;
  <rhyme>
    <verse>Little Miss Muffet sat on a tuffet</verse>
    <verse>Eating her curds and whey</verse>
  </rhyme>
  &#x0C;
  <rhyme>
    <verse>Old King Cole was a merry old soul</verse>
    <verse>And a merry old soul was he</verse>
  </rhyme>
</book>
However, this style of page break died out with the line printer. Modern systems use stylesheets or explicit markup to indicate page boundaries. For example, you might place each separate page inside a page element or add a pagebreak element where you want the break to occur, as shown below.
<?xml version="1.1"?>
<book>
  <title>Nursery Rhymes</title>
  <rhyme>
    <verse>Mary, Mary quite contrary</verse>
    <verse>How does your garden grow?</verse>
  </rhyme>
  <pagebreak/>
  <rhyme>
    <verse>Little Miss Muffet sat on a tuffet</verse>
    <verse>Eating her curds and whey</verse>
  </rhyme>
  <pagebreak/>
  <rhyme>
    <verse>Old King Cole was a merry old soul</verse>
    <verse>And a merry old soul was he</verse>
  </rhyme>
</book>
Better yet, you might not change the markup at all and instead write a stylesheet that assigns each rhyme to a separate page. Any of these options would be superior to using form feeds. Most uses of the other C0 controls are equally obsolete.
There is one exception. You still cannot embed a null in an XML document, not even with a character reference. Allowing this would have caused massive problems for C, C++, and other languages that use null-terminated strings. This means it's still not possible to directly embed binary data in XML. You have to encode it using Base64 or some similar format first. (See Item 19.)
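As a rough illustration of both rules, the Python sketch below escapes C0 controls as character references for an XML 1.1 document, refuses the null outright, and Base64-encodes arbitrary bytes. The names escape_c0_for_xml11 and encode_binary are hypothetical helpers invented here, not part of any XML library, and raising an exception on a null is just one possible policy.

```python
import base64
import re

# C0 controls that XML 1.1 accepts only as character references:
# #x01 through #x1F, minus tab, line feed, and carriage return.
_C0_ESCAPABLE = re.compile(r"[\x01-\x08\x0B\x0C\x0E-\x1F]")


def escape_c0_for_xml11(text: str) -> str:
    """Replace C0 controls with hexadecimal character references.

    Assumes &, <, and > have already been escaped; this function
    deals only with the control characters.
    """
    if "\x00" in text:
        # The null is forbidden in XML 1.1 even as a character reference.
        raise ValueError("null (#x00) cannot appear in an XML document")
    return _C0_ESCAPABLE.sub(lambda m: f"&#x{ord(m.group()):02X};", text)


def encode_binary(data: bytes) -> str:
    """Base64-encode arbitrary bytes so they can be stored as element text."""
    return base64.b64encode(data).decode("ascii")


escaped = escape_c0_for_xml11("page 1\x0cpage 2")  # 'page 1&#x0C;page 2'
payload = encode_binary(b"\x00\x01\x02\xff")       # 'AAEC/w=='
```

The escaped text is only legal in a document whose XML declaration says version="1.1". The Base64 form is legal in either version, which is one reason it remains the safer choice for anything that is really binary data.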