9.3. Media Types for Text

When data is stored in a file or transferred between systems and applications, it is essential to keep track of the format of data. This is especially important on the Internet, where the recipient of data may be prepared to handle different formats of data but needs to know the format. For example, if some data is included in an email message as an attachment, the message should internally carry information about the format of the data, such as plain text (which can be rendered directly very simply) or rich text (which needs to be processed in a rather complicated way in order to display it properly).

Internet media types (MIME types), described in Chapter 10, are used to specify the general nature of a data set (file), such as image versus text, as well as its more specific format. Here we will consider the major type text and its subtypes.

9.3.1. The Type text

The MIME specification (RFC 2046) defines the type text as follows:

The "text" media type is intended for sending material which is principally textual in form. A "charset" parameter may be used to indicate the character set of the body text for "text" subtypes, notably including the subtype "text/plain," which is a generic subtype for plain text. Plain text does not provide for or allow formatting commands, font attribute specifications, processing instructions, interpretation directives, or content markup. Plain text is seen simply as a linear sequence of characters, possibly interrupted by line breaks or page breaks. Plain text may allow the stacking of several characters in the same position in the text. Plain text in scripts like Arabic and Hebrew may also include facilities that allow the arbitrary mixing of text segments with opposite writing directions.

Beyond plain text, there are many formats for representing what might be known as "rich text." An interesting characteristic of many such representations is that they are to some extent readable even without the software that interprets them. It is useful, then, to distinguish them, at the highest level, from such unreadable data as images, audio, or text represented in an unreadable form. In the absence of appropriate interpretation software, it is reasonable to show subtypes of "text" to the user, while it is not reasonable to do so with most nontextual data. Such formatted textual data should be represented using subtypes of "text."

In most cases, data of type text is completely textual, not just principally textual. However, rich text formats may contain facilities for embedding images directly into the file format.

This definition might be read so that "rich text" is a catchall name for anything that is text but not plain text. However, "rich text" normally refers to formats that contain text and formatting instructions (for italics, bolding, font selection, spacing, etc.). HTML or XML documents are hardly "rich text," since they mostly do not contain direct formatting instructions.

Data formats such as TSV (Tab Separated Values) aren't really "rich text" either. Rather, they specify a very simple structure for tabular data: each line of text (separated by line breaks) corresponds to one row of a table, and some designated character (typically, tab, semicolon, or comma) is treated as a separator between cells. Naturally, that designated character must not appear in the data itself.
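As a minimal sketch of this row-and-cell structure, the following Python snippet splits tab-separated text with the standard csv module. The function name parse_tsv and the sample data are invented for illustration, and the sketch assumes the simple case described above, where the separator never occurs inside a cell.

```python
import csv
import io

def parse_tsv(text):
    """Split tab-separated text into a list of rows, each a list of cells."""
    # QUOTE_NONE: every character is taken literally, as in simple TSV,
    # where the separator is assumed never to appear inside a cell.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader]

# Hypothetical sample data: one header line and two data lines.
sample = "product\tprice\napple\t0.50\npear\t0.75\n"
rows = parse_tsv(sample)
print(rows)  # [['product', 'price'], ['apple', '0.50'], ['pear', '0.75']]
```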

9.3.2. The Character Encoding

The text type has an optional charset parameter that can be used to specify the character encoding of the text. For example, text/plain;charset=utf-8 means plain text that is to be interpreted as UTF-8 encoded.
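One way to take such a value apart is sketched below, using the email package in the Python standard library. The helper name split_content_type is an assumption made for this example; the library calls themselves (get_content_type and get_content_charset) are part of email.message.Message.

```python
from email.message import Message

def split_content_type(header_value):
    """Return (media type, charset or None) for a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # Both results are lowercased by the library; the charset is None
    # when the header carries no charset parameter.
    return msg.get_content_type(), msg.get_content_charset()

media_type, charset = split_content_type("text/plain;charset=utf-8")
print(media_type, charset)  # text/plain utf-8
```

When the parameter is absent, the second value is None, which is exactly the situation discussed next.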

What happens if the encoding is not specified that way? Since the content is text (i.e., characters), there is really no meaningful way to process it without knowing, guessing, or implying some encoding. In Chapter 10, we will take a detailed look at this practically important problem for HTML documents on the Web. The problem is of a more general nature, though. For example, if you open a plain text file locally in a system, there is usually no encoding information for the file. Most filesystems contain no direct data about media types in the MIME sense or about the encoding.

At the general level, there are different ways to deal with a situation where a subtype of text does not specify the encoding with charset:


Imply an encoding

This has been very common in the past, usually implying ASCII, or (especially on the Web) ISO-8859-1. According to MIME specifications, the default must be ASCII for all subtypes of text, but other Internet protocols (e.g., HTTP) impose other rules. Thus, it is unsafe to assume any specific default. If you open a plain text file on the local disk, the program you use might imply a system-dependent default.


Specify a default encoding for a subtype

It might be natural to specify a default encoding for a subtype on practical grounds. In particular, the effective default encoding for text/html is usually windows-1252 or ISO-8859-1. In principle, however, no such default is defined, and the MIME specification apparently disallows subtype-specific defaults.


Deduce the encoding from the data itself

Various techniques can be used to try to guess the encoding from the data content. In particular, some data formats contain mechanisms for specifying the encoding inside the data (e.g., a meta element in HTML and the encoding declaration in the XML prolog). Although logically odd, these mechanisms often work reasonably well.


Let the user decide

Rather naturally, a program could prompt for a user action to choose between encodings, when adequate information about encoding is not available. If the dialog contains a method for previewing the content in different encodings, this may work well, when the user is experienced.
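The approaches listed above are often combined into a fallback chain. The following Python sketch shows one possible ordering, not a prescribed one: an explicit charset first, then an in-band declaration, then the user, and finally an implied default. The function name choose_encoding, its parameters, the regular expression, and the ISO-8859-1 default are all assumptions made for this illustration.

```python
import re

# Hypothetical pattern for in-band declarations: the encoding in an XML
# declaration, or a charset inside an HTML meta element.
_DECLARED = re.compile(
    rb"(?:<\?xml[^>]*encoding|<meta[^>]*charset)\s*=\s*[\"']?([A-Za-z0-9._-]+)",
    re.IGNORECASE,
)

def choose_encoding(data, declared=None, default="iso-8859-1", ask_user=None):
    """Pick an encoding name for the bytes in data."""
    if declared:                          # charset parameter was given
        return declared.lower()
    match = _DECLARED.search(data[:1024])  # deduce from the data itself
    if match:
        return match.group(1).decode("ascii").lower()
    if ask_user is not None:              # let the user decide
        return ask_user(data)
    return default                        # imply an encoding

xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?><a/>'
print(choose_encoding(xml_bytes))  # utf-8
```

Note that scanning the bytes with an ASCII-only pattern works only because the declarations themselves are normally written in ASCII-compatible form.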

As an implication, if you have Unicode data in UTF-8 encoding, it is very probable that characters in the ASCII range get interpreted correctly. All the rest is more or less unsafe. This is one reason why the basic structural elements of computer languages, such as markup tags, are usually still limited to ASCII.
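The effect is easy to reproduce. In this small Python sketch (the sample string is invented), UTF-8 data is deliberately decoded as ISO-8859-1: the ASCII markup survives, while the one non-ASCII character turns into two wrong ones.

```python
original = "<p>café</p>"
data = original.encode("utf-8")

# Wrong guess: the two UTF-8 bytes of "é" (0xC3 0xA9) are read as two
# separate ISO-8859-1 characters.
misread = data.decode("iso-8859-1")
print(misread)  # <p>cafÃ©</p>
```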

According to MIME specifications, if a program does not recognize a subtype of text, it should treat it as text/plain, provided that it knows how to handle the character encoding (charset). If the character encoding is unrecognized, too, the data should be treated as application/octet-stream, which effectively means "lump of binary data." Upon receiving such data from a network, well-behaved software normally prompts the user for an action, asking her to specify whether the data should be stored on the local disk or processed in some other user-specified way. In reality, programs might just imply the ASCII encoding (or some other) instead.
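The fallback rule can be sketched as follows in Python. The set KNOWN_TEXT_SUBTYPES and the function names are hypothetical, and "knowing" a charset is approximated here by whether Python's codec registry can look it up.

```python
import codecs

# Hypothetical: the text subtypes this imaginary program can handle itself.
KNOWN_TEXT_SUBTYPES = {"plain", "html"}

def charset_is_known(charset):
    """Approximate 'the program can handle this charset' via the codec registry."""
    try:
        codecs.lookup(charset)
        return True
    except LookupError:
        return False

def effective_media_type(subtype, charset="us-ascii"):
    """Decide how data labeled text/<subtype> with the given charset is treated."""
    subtype = subtype.lower()
    if subtype in KNOWN_TEXT_SUBTYPES:
        return "text/" + subtype
    if charset_is_known(charset):
        return "text/plain"              # unknown subtype, usable charset
    return "application/octet-stream"    # neither is recognized

print(effective_media_type("prs.lines.tag", "utf-8"))              # text/plain
print(effective_media_type("prs.lines.tag", "x-no-such-charset"))  # application/octet-stream
```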

9.3.3. The text Type Versus the application Type

In the type classification, many formats that can intuitively be understood as text formats are defined as being of major type application. For example, the data formats that word processors normally use are classified as application types. As a rough rule of thumb, if a format is designed for processing with a specific program or family of programs, it is classified as an application type. Formats of text type are meant to be processed with many different programs, and they have been defined by specifying their structural properties and semantics, rather than technical implementation.

For example, the format used by WordPerfect is application/vnd.wordperfect. Names of subtypes defined for vendor-specific software start with vnd. in most cases, but there are some exceptions for historical reasons, such as application/msword.

The PDF format, defined by Adobe, is registered as application/pdf. It is comparable to word processor formats, in the sense that the content is typically mostly text, but the overall structure is not textual. PDF is widely used for the interchange and distribution of documents, especially when it is desirable to deliver them in easily printable format. Officially, "PDF" is short for "Portable Document Format." PDF is often used for documents that contain special characters, since you can, upon creating a PDF file, specify that font information be embedded into the data. This means that recipients can usually view and print the document, even if the fonts on their computers do not contain all the characters used.

In some cases, the same data format can be classified using different media types. For example, an XML document may be classified as text/xml or application/xml, and possibly using other media types as well, depending on the specific markup used.

9.3.4. Subtypes of text

Just as for the application type, the subtype name usually begins with vnd. for vendor-specific subtypes of text. This does not mean that the subtype is private use only. On the contrary, it has been registered so that it can be used generally, e.g., on the Internet.

Table 9-6 presents all subtypes of text except those with names beginning with vnd. (see the full registry at http://www.iana.org/assignments/media-types/text/). The last column identifies the registration documents, which usually do not describe the format itself; instead, they list some basic properties and refer to other documents or organizations for the actual specifications. "I-Draft" means an Internet-Draft, available from the repository https://datatracker.ietf.org/public/idindex.cgi. "Registry" means that the definition is in a file in the registry, not published as an RFC or as an Internet-Draft.

Table 9-6. Registered subtypes of text

| Subtype | Meaning | Definition |
| --- | --- | --- |
| calendar | iCalendar format, for calendaring and scheduling | RFC 2445 |
| css | Stylesheet, in Cascading Style Sheets (CSS) | RFC 2318 |
| csv | Comma Separated Values, for tabular data | I-Draft |
| directory | Directory information (e.g., telephone directory) | RFC 2425 |
| dns | Domain Name System data | RFC 4027 |
| ecmascript | Obsolete subtype for ECMAScript code | I-Draft |
| enriched | A simple rich text type, i.e., text with formatting info | RFC 1896 |
| html | HTML (Hypertext Markup Language) document | RFC 2854 |
| javascript | Obsolete subtype for JavaScript code | I-Draft |
| parityfec | For Real-time Transport Protocol (RTP) | RFC 3009 |
| plain | Plain text: text as such, with no special agreements | RFC 2046 |
| prs.fallenstein.rst | For reStructuredText, a simple markup system | Registry |
| prs.lines.tag | Consists of lines with simple name: value syntax | Registry |
| red | For transport of redundant text data via RTP | RFC 4102 |
| rfc822-headers | Internet message headers, when sent as data | RFC 1892 |
| richtext | An obsolete rich text type, see text/enriched | RFC 1341 |
| rtf | Rich Text Format (RTF), a common rich text type | Registry |
| sgml | Standard Generalized Markup Language (SGML) | RFC 1874 |
| t140 | For transmission of data via RTP using ITU T.140 | RFC 4103 |
| tab-separated-values | TSV format, for tabular data, similar to text/csv | Registry |
| troff | Marked-up text, for the troff typesetting programs | I-Draft |
| uri-list | A list of URIs (URLs) for URI resolution services | RFC 2483 |
| xml | XML (Extensible Markup Language) document | RFC 3023 |
| xml-external-parsed-entity | External parsed entity, as defined in the XML specification; typically, a file of common definitions | RFC 3023 |


Usually, subtypes of application are used for XML documents. However, RFC 3023 recommends that text/xml (or, in some cases, text/xml-external-parsed-entity) be used, if "an XML document -- that is, the unprocessed, source XML document -- is readable by casual users." As a practical consideration, software that does not support XML in any particular way will probably treat text/xml as comparable to text/plain and display it as such. Thus, the question is whether a person who does not know the specific markup used will be able to understand (some of) the data intuitively. This may well be the case, if element and attribute names are mnemonic and descriptive, like product and price. Note, however, that displaying an XML document as unprocessed means that character references such as &#x1234; are displayed literally, probably confusing casual users.
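The difference between the unprocessed source and what an XML processor delivers can be sketched in Python with the standard xml.etree.ElementTree module. The tiny document below, including its element names, is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document containing a numeric character reference.
source = "<product><name>&#x1234;</name></product>"

# A program that treats text/xml like text/plain shows the source as is,
# with the reference spelled out literally.
print(source)

# An XML processor resolves the reference to the character U+1234.
parsed = ET.fromstring(source).find("name").text
print(parsed)
```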



Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139