9.3. Media Types for Text

When data is stored in a file or transferred between systems and applications, it is essential to keep track of the format of the data. This is especially important on the Internet, where the recipient of data may be prepared to handle different formats of data but needs to know the format. For example, if some data is included in an email message as an attachment, the message should internally carry information about the format of the data, such as plain text (which can be rendered directly very simply) or rich text (which needs to be processed in a rather complicated way in order to display it properly).

Internet media types (MIME types), described in Chapter 10, are used to specify the general nature of a data set (file), such as image versus text, as well as its more specific format. Here we will consider the major type text and its subtypes.

9.3.1. The Type text

The MIME specification (RFC 2046) defines the type text as follows:
In most cases, data of type text is completely textual, not just principally textual. However, rich text formats may contain facilities for embedding images directly into the file format.

This definition might be read so that "rich text" is a catchall name for anything that is text but not plain text. However, "rich text" normally refers to formats that contain text and formatting instructions (for italics, bolding, font selection, spacing, etc.). HTML or XML documents are hardly "rich text," since they mostly do not contain direct formatting instructions.

Data formats such as TSV (Tab Separated Values) aren't really "rich text" either. Rather, they specify a very simple structure for tabular data: each line of text (separated by line breaks) corresponds to one row of a table, and some designated character (typically tab, semicolon, or comma) is treated as a separator between cells. Naturally, that designated character must not appear in the data itself.

9.3.2. The Character Encoding

The text type has an optional charset parameter that can be used to specify the character encoding of the text. For example, text/plain;charset=utf-8 means plain text that is to be interpreted as UTF-8 encoded.

What happens if the encoding is not specified that way? Since the content is text (i.e., characters), there is really no meaningful way to process it without knowing, guessing, or assuming some encoding. In Chapter 10, we will take a detailed look at this practically important problem for HTML documents on the Web. The problem is of a more general nature, though. For example, if you open a plain text file locally in a system, there is usually no encoding information for the file. Most filesystems contain no direct data about media types in the MIME sense or about the encoding.

At the general level, there are different ways to deal with a situation where a subtype of text does not specify the encoding with charset:
As an implication, if you have Unicode data in UTF-8 encoding, it is very probable that characters in the ASCII range get interpreted correctly. All the rest is more or less unsafe. This is one reason why the basic structural elements of computer languages, such as markup tags, are usually still limited to ASCII.

According to MIME specifications, if a program does not recognize a subtype of text, it should treat it as text/plain, provided that it knows how to handle the character encoding (charset). If the character encoding is unrecognized too, the data should be treated as application/octet-stream, which effectively means "lump of binary data." Upon receiving such data from a network, well-behaved software normally prompts the user for an action, asking her to specify whether the data should be stored on the local disk or processed in some other user-specified way. In reality, programs might just assume the ASCII encoding (or some other) instead.

9.3.3. The text Type Versus the application Type

In the type classification, many formats that can intuitively be understood as text formats are defined as being of major type application. For example, the data formats that word processors normally use are classified as application types. As a rough rule of thumb, if a format is designed for processing with a specific program or family of programs, it is classified as an application type. Formats of text type are meant to be processed with many different programs, and they have been defined by specifying their structural properties and semantics, rather than technical implementation.

For example, the format used by WordPerfect is application/vnd.wordperfect. Names of subtypes defined for vendor-specific software start with vnd. in most cases, but there are some exceptions for historical reasons, such as application/msword.

The PDF format, defined by Adobe, is registered as application/pdf. 
It is comparable to word processor formats, in the sense that the content is typically mostly text, but the overall structure is not textual. PDF is widely used for the interchange and distribution of documents, especially when it is desirable to deliver them in an easily printable format. Officially, "PDF" is short for "Portable Document Format." PDF is often used for documents that contain special characters, since you can, upon creating a PDF file, specify that font information be embedded into the data. This means that recipients can usually view and print the document, even if the fonts on their computers do not contain all the characters used.

In some cases, the same data format can be classified using different media types. For example, an XML document may be classified as text/xml or application/xml, and possibly using other media types as well, depending on the specific markup used.

9.3.4. Subtypes of text

Just as for the application type, the subtype name usually begins with vnd. for vendor-specific subtypes of text. This does not mean that the subtype is for private use only. On the contrary, it has been registered so that it can be used generally, e.g., on the Internet.

Table 9-6 presents all subtypes of text except those with names beginning with vnd. (see the full registry at http://www.iana.org/assignments/media-types/text/). The last column identifies the registration documents, which usually do not describe the format itself; instead, they list some basic properties and refer to some documents or organizations for the actual specifications. "I-Draft" means an Internet-Draft, available from the repository https://datatracker.ietf.org/public/idindex.cgi. "Registry" means that the definition is in a file in the registry, not published as an RFC or as an Internet-Draft.
Usually, subtypes of application are used for XML documents. However, RFC 3023 recommends that text/xml (or, in some cases, text/xml-external-parsed-entity) be used, if "an XML document -- that is, the unprocessed, source XML document -- is readable by casual users."

As a practical consideration, software that does not support XML in any particular way will probably treat text/xml as comparable to text/plain and display it as such. Thus, the question is whether a person who does not know the specific markup used will be able to understand (some of) the data intuitively. This may well be the case if element and attribute names are mnemonic and descriptive, like product and price. Note, however, that displaying an XML document as unprocessed means that character references such as &#x1234; are displayed literally, probably confusing casual users.
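
The difference between showing XML source as plain text and processing it as XML can be illustrated with a minimal Python sketch. The sample document, with its product, name, and price elements, is invented for this illustration; only the standard library module xml.etree.ElementTree is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are made up for illustration.
source = "<product><name>Sample &#x1234;</name><price>9.90</price></product>"

# Software that treats text/xml like text/plain shows the source unchanged,
# so the character reference appears literally.
as_plain_text = source

# An XML parser replaces the reference with the character it denotes (U+1234).
name = ET.fromstring(source).find("name").text

print(as_plain_text)
print(name == "Sample \u1234")
```

In the plain text view, the reader sees the eight characters of the reference itself; only the parsed value contains the single character U+1234.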