Section 15.1. Syntactic details | XML in Office 2003: Information Sharing with Desktop XML



15.1. Syntactic details

XML documents are composed of characters from the Unicode character set. Any such sequence of characters is called a string. The characters in this book can be thought of as one long (but interesting) string of text. Each chapter is also a string. So is each word. XML documents are similarly made up of strings within strings.

Natural languages such as English have a particular syntax. The syntax allows you to combine words into grammatical sentences. XML also has syntax. It describes how you combine strings into well-formed XML documents. We will describe the basics of XML's syntax in this section.

15.1.1 Case-sensitivity

XML is case-sensitive. That means that if the XML specification says to insert the word "ELEMENT", it means that you should insert "ELEMENT" and not "element" or "Element" or "ElEmEnT".

So mind your "p's" and "q's" and "P's" and "Q's". Our authoritative laboratory testing by people in white coats indicates that exactly 74.5% of all XML errors are related to case-sensitivity mistakes. Of course XML is also spelling-sensitive and typo-sensitive, so watch out for these and other products of human fallibility.

Note that although XML is case-sensitive it is not case-prejudiced. Anywhere that you have the freedom to create your own names or text, you can choose to use upper- or lower-case text, as you prefer.

For instance, when you create your own document types you will be able to choose element-type names. A particular name could be all upper-case (SECTION), all lower-case (section) or mixed-case (SeCtION). But because XML is case-sensitive, all occurrences of a particular element-type name would have to use the same case. It is good practice to create a simple convention such as all lower-case or all upper-case so that you do not have to depend on your memory.
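A quick way to see this rule in action is to hand a parser a document whose start-tag and end-tag differ only in case. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Start-tag and end-tag use the same case, so they match.
consistent = "<section>Some text</section>"
# The end-tag differs only in case, so the parser sees two different names.
mismatched = "<section>Some text</SECTION>"

print(is_well_formed(consistent))
print(is_well_formed(mismatched))

# Names that differ only in case identify different element types.
doc = ET.fromstring("<doc><SECTION/><section/></doc>")
tags = [child.tag for child in doc]
print(tags)
```

The parser accepts the first document and rejects the second, and it reports SECTION and section as two distinct element-type names.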

15.1.2 Markup and data

Constructs such as tags, entity references, and declarations are called markup. These are the parts of your document that are supposed to be understood by the XML parser. The parts that are between the markup constitute the character data. While the XML parser rips apart and analyzes markup, it merely passes the character data to the application.

Recall that the parser is the part of the program dedicated to separating the document into its constituent parts. The application is the "rest" of the program. In a word processor, the application is the part that lets you edit the document; in a spreadsheet it is the part that lets you crunch the numbers.

We haven't explained all of the parts of markup yet, but they are easy to recognize. All of them start with less-than (<) or ampersand (&) characters. Everything else is character data.
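The division of labor between parser and application can be sketched with Python's standard xml.parsers.expat module. Here the "application" is nothing more than three small callbacks that record what the parser hands over; the function name split_markup_and_data and the sample document are assumptions made for this illustration.

```python
import xml.parsers.expat

def split_markup_and_data(text):
    """Parse text and record what the parser reports to the application."""
    events = []
    chunks = []  # the parser may deliver character data in several pieces

    def flush():
        # Join any pending pieces of character data into one event.
        if chunks:
            events.append(("data", "".join(chunks)))
            chunks.clear()

    def on_start(name, attrs):
        flush()
        events.append(("start-tag", name))

    def on_end(name):
        flush()
        events.append(("end-tag", name))

    def on_data(piece):
        chunks.append(piece)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_data
    parser.Parse(text, True)
    return events

sample = "<note>Salt &amp; pepper<b>now</b></note>"
events = split_markup_and_data(sample)
for kind, value in events:
    print(kind, repr(value))
```

Everything that began with a less-than sign arrives as a tag event, the entity reference that began with an ampersand is resolved by the parser, and the rest is passed through as character data.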

15.1.3 White space

There is a set of characters called white space characters that XML parsers treat differently in XML markup. They are the "invisible" characters: space (Unicode/ASCII 32), tab (Unicode/ASCII 9), carriage return (Unicode/ASCII 13) and line feed (Unicode/ASCII 10). These correspond roughly to the spacebar, tab, and Enter keys on your keyboard.

When the XML specification says that white space is allowed at a particular point, you may put as many of these characters as you want in any combination. Just as you might put two lines between paragraphs in a word processor to make a printed document readable, you may put two carriage returns in certain places in an XML document to make your source file more readable and maintainable. When the document is processed, those characters will be ignored.

In other places, white space will be significant. For instance you would not want the parser to strip out the spaces between the words in your document! Thatwouldmakeithardtoread. So white space outside of markup is always preserved.
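The difference between the two kinds of white space can be checked with a small experiment. The sketch below, again using Python's standard xml.etree.ElementTree module, parses one element written compactly and once more with extra white space inside its tags; the sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET

# The same element, written compactly and with extra white space
# inside the start-tag and end-tag.
compact = '<p id="x">two  spaces</p>'
spread_out = '<p\n    id = "x"\n>two  spaces</p >'

a = ET.fromstring(compact)
b = ET.fromstring(spread_out)

# White space inside the tags makes no difference to the result.
print(a.tag == b.tag, a.attrib == b.attrib)

# White space in the character data is passed through untouched,
# including the double space between the words.
print(repr(b.text))
```

Both versions yield the same element name and attribute, while the two spaces between the words survive exactly as typed.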

15.1.4 Names

When you use XML you will often have to give things names. You will name logical structures with element-type names, particular elements with IDs, and so forth. XML names have certain common features. They are not nearly as flexible as character data.

Letters or underscores can be used anywhere in a name. There are thousands of characters that XML version 1.0 considers a "letter" because it includes characters from every language including ideographic ones like Japanese Kanji. XML version 1.1 is even more liberal: it treats a character as a "letter" unless it is from a small list designated as punctuation.^[2] Characters that can be used anywhere in a name are known in XML terms as name start characters. They are called this because they may be used at the start of names as well as in later positions.

^[2] The two versions differ only in some character set details, which is why XML 1.1 hasn't been mentioned before.

This implies that there must be characters that can go in a name but cannot be the first character. You may include digits, hyphens and full-stop (.) characters in a name, but you may not start the name with one of them. These are known as name characters. Other characters, like various white space and Western punctuation characters, cannot be part of a name at all. Examples of these non-name characters include the tilde (~), caret (^) and space ( ).

You cannot make names that begin with the string "xml" or any variant of it in a different case, such as "XML" or "XmL". Names beginning this way are reserved.

Like almost everything else in XML, names are matched case-sensitively. Names may not contain white space, punctuation or other "funny" characters other than those listed above. The remaining "ordinary" characters (including letters from non-Latin alphabets) are the name start characters described earlier; because they may occur anywhere in a name, they count as name characters too.
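The naming rules of this section can be collected into a rough checker. The following is only a sketch under stated assumptions: Python's str.isalpha() stands in for XML's much more precisely defined notion of a "letter", only the ASCII digits are treated as digits, and the function names are hypothetical. It applies the rules as this section states them rather than the exact character tables of the XML specification.

```python
def is_name_start_char(ch):
    # Approximation: Python's idea of a letter stands in for XML's.
    return ch.isalpha() or ch == "_"

def is_name_char(ch):
    # Name start characters may also appear later in a name,
    # along with digits, hyphens and full stops.
    return is_name_start_char(ch) or ch in "0123456789" or ch in "-."

def is_plausible_name(name):
    """Apply this section's naming rules to a candidate name."""
    if not name:
        return False
    if name[:3].lower() == "xml":
        return False  # names beginning with "xml" in any case are off limits
    if not is_name_start_char(name[0]):
        return False
    return all(is_name_char(ch) for ch in name[1:])

for candidate in ["section", "_note.1", "chapter-2", "2nd-chapter",
                  "my name", "a~b", "XmLdata"]:
    print(candidate, is_plausible_name(candidate))
```

The first three candidates pass. The others fail because they start with a digit, contain a space, contain a tilde, or begin with a variant of "xml".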

