2.7 XML and Unicode


An XML document can contain any Unicode text. Unicode is an international standard for representing multilingual text. It defines a very large character set that includes characters from most of the world's languages as well as many mathematical and technical symbols.

A character set defines a mapping between a set of characters and a set of numbers, which are called code points. For example, in Unicode, the Greek letter α is represented by the code point 945 in decimal notation, or 3B1 in hexadecimal notation (conventionally written U+03B1).
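As a minimal sketch of this mapping, the following Python snippet uses the built-in functions ord and chr, which convert between a character and its Unicode code point. The variable names are illustrative only and do not come from the text.

```python
# Map a character to its Unicode code point and back again.
alpha = "\u03b1"                      # the Greek letter alpha

code_point = ord(alpha)               # character -> code point (an integer)
hex_form = format(code_point, "X")    # the same number in hexadecimal
round_trip = chr(0x3B1)               # code point -> character

print(code_point)                     # 945
print(hex_form)                       # 3B1
print(round_trip == alpha)            # True
```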

Unicode is a superset of the American Standard Code for Information Interchange (ASCII), a widely used character set that includes all the letters and common punctuation marks used in English. ASCII consists of 128 characters with code points from 0 to 127, and the first 128 characters of Unicode are identical to ASCII. For example, the letter A has the code point 65 in both ASCII and Unicode. However, Unicode goes well beyond ASCII by including many more characters. Unicode 3.2, the current version of the standard at the time of writing, defines code points for approximately 95,000 characters.
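The relationship between ASCII and Unicode can be checked directly. The sketch below, which assumes only Python's built-in "ascii" codec, decodes all 128 ASCII byte values and compares the result with the characters Unicode assigns to code points 0 through 127. The variable names are hypothetical.

```python
# ASCII covers code points 0 through 127; Unicode assigns the same
# characters to the same numbers.
ascii_text = bytes(range(128)).decode("ascii")
unicode_text = "".join(chr(n) for n in range(128))

same_mapping = ascii_text == unicode_text
print(same_mapping)        # True
print(ord("A"))            # 65

# A character outside the ASCII range cannot be encoded as ASCII.
try:
    "\u03b1".encode("ascii")
    alpha_is_ascii = True
except UnicodeEncodeError:
    alpha_is_ascii = False
print(alpha_is_ascii)      # False
```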

In an XML document, the names of elements and attributes, as well as the character data contained in an element, can all be written in Unicode. The advantage of using Unicode is that it allows you to use a single character set for text containing multiple languages and many different types of symbols. This avoids the problems caused by conflicting character sets, in which a single code point might be assigned to more than one character or a single character might have more than one code point, depending on the type of computer being used. Many software applications and operating systems now support Unicode. Unicode thus provides a standard way of encoding multilingual text so it can be exchanged and interpreted reliably across a wide variety of computer systems.
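To illustrate the kind of conflict described above, the following sketch compares three legacy character sets available as Python codecs: ISO 8859-1 ("latin-1"), ISO 8859-7 ("iso8859_7"), and the original IBM PC code page ("cp437"). These three are chosen here purely as examples and are not named in the text.

```python
# One byte value, two different characters, depending on the character set.
raw = bytes([0xE1])
as_latin1 = raw.decode("latin-1")       # Western European (ISO 8859-1)
as_greek = raw.decode("iso8859_7")      # Greek (ISO 8859-7)

print(ord(as_latin1))    # 225, U+00E1, Latin small a with acute accent
print(ord(as_greek))     # 945, U+03B1, Greek small alpha

# One character, two different byte values, depending on the character set.
alpha = "\u03b1"
in_greek = alpha.encode("iso8859_7")    # b'\xe1'
in_cp437 = alpha.encode("cp437")        # b'\xe0'
print(in_greek != in_cp437)             # True
```

In Unicode, by contrast, each of these characters has exactly one code point, so the ambiguity does not arise.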

You can include a Unicode character in an XML document in the form of a numeric character reference. For example, to include the Greek character α, you would type &#x3B1; (hexadecimal) or &#945; (decimal). If the document includes a document type declaration whose DTD defines entity names for specific characters, you can also insert the character using a named entity reference. For example, suppose you include a reference to the MathML DTD, as shown here:

    <!DOCTYPE math SYSTEM "http://www.w3.org/TR/MathML2/dtd/mathml2.dtd">

You can then insert the α character in this document by using the named entity reference &alpha;, because the MathML DTD includes an entity declaration that associates the entity name alpha with the corresponding Unicode code point. Named characters in MathML are covered in more detail in Section 3.4.
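Both kinds of reference can be tried out with Python's standard xml.etree.ElementTree module. The sketch below makes one simplifying assumption: instead of loading the external MathML DTD, it declares the alpha entity in a small internal DTD subset, so the example runs without network access. The document strings and variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Numeric character references need no DTD: the hexadecimal and decimal
# forms both resolve to the same character.
hex_ref = ET.fromstring("<mi>&#x3B1;</mi>").text
dec_ref = ET.fromstring("<mi>&#945;</mi>").text

# A named reference works only if the entity is declared. Here the
# declaration sits in an internal DTD subset (an assumption made for
# this sketch; the text references the external MathML DTD instead).
doc_with_entity = (
    '<!DOCTYPE math [<!ENTITY alpha "&#x3B1;">]>'
    "<math><mi>&alpha;</mi></math>"
)
named_ref = ET.fromstring(doc_with_entity).find("mi").text

# Without a declaration, the parser rejects the same named reference.
try:
    ET.fromstring("<math><mi>&alpha;</mi></math>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(hex_ref == dec_ref == named_ref == "\u03b1")   # True
print(undeclared_ok)                                 # False
```

The last case shows why the document type declaration matters: a named reference such as &alpha; has no meaning to an XML parser unless some DTD declares it, whereas a numeric reference is always available.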






The MathML Handbook (Charles River Media Internet & Web Design)
ISBN: 1584502495
Year: 2003
Pages: 127
Authors: Pavi Sandhu
