8.9. Line Structure Control

For practical reasons, text usually needs to be divided into lines when presented visually. This is caused by the properties of media such as a papyrus scroll, a sheet of paper, or a computer screen. If we used continuous tapes for writing, things would be different.

8.9.1. Different Approaches to Line Structuring

When text is presented in digitally coded form, it seems natural to leave out the line division. It can be handled by the rendering software, which selects the line length according to the rendering situation and styling instructions. This is typically the approach in modern text processing: a paragraph does not contain any line structure information. The same applies to data formats such as HTML and TeX: although the source format may contain line breaks, they are normally treated as equivalent to spaces. You would use explicit markup, such as br in HTML, to force a line break.

However, in the early days of computing things were different, and this is still reflected in important ways. Text data files were line-oriented, since the files were treated more or less as images of a deck of punched cards (with 80 characters on each card), line printer output (typically consisting of lines 132 characters wide), or computer screens (usually 80 characters wide). This means that the digital files were internally divided into lines as well, using some of the coding methods we will discuss shortly.

Line structure became semantically important, too. In the absence of more advanced methods, text was formatted using blank lines between paragraphs and other blocks of text. Indentation was created by using spaces at the start of a line. Spaces were also used to create table-like displays of data or pictures formed from characters ("ASCII graphics"), and naturally this implied that line structure is essential.

Line structure is also used for presenting tabular data in formats such as Tab Separated Values (TSV) or Comma Separated Values (CSV).
They are commonly used for transferring data as text between spreadsheet programs and other software. A row of a table is presented as one line of text, with a horizontal tab, a comma, or some other character as the separator between cells.

Many computer languages have been designed to be line-structured. Although in most programming languages (excluding original FORTRAN, Python, and a few others) line structuring is just visual formatting for the human eye, most command languages use a line as a fundamental concept. Typically, a command consists of one line.

In particular, Internet protocols typically use command (or control) languages that are line-structured. For example, an email message header is a logical line, beginning with a keyword and a colon (e.g., From: or Subject:) and extending to the end of the line. In such headers, the continuation line convention is that a line beginning with at least one space is treated as a continuation of the preceding physical line.

8.9.2. Lines and Records

Lines are often called "records," or "physical records" to distinguish them from the concept of a logical record. A logical record may correspond to one physical record or to a sequence of physical records (e.g., a postal address record consists of several lines, or physical records), or the correspondence can be more complicated. In any case, logical and physical records are at different conceptual levels, and logical record structure is either not explicit at all or is expressed using tools above the character level.

The situation is somewhat more complex, though. Although a physical record (in text data) normally corresponds to a line, it may actually span several lines. To express this somewhat confusing situation, we can distinguish between a physical line and a logical line. In line-structured languages and data, it may happen that a line needs to be longer than conveniently fits into one physical line.
In such cases, some continuation line convention is applied so that one logical line can consist of several physical lines. Even in programming languages that are not line-structured, continuation line conventions are useful for constructs that do not permit a line break inside them, most importantly string literals.

The conventions vary. A common one is that a reverse solidus \ (backslash) at the end of a line indicates that the logical line continues at the start (character position 1) of the next physical line, and the \ itself is not treated as data. In such a convention, \ before a line break effectively nullifies the line break (and the \ character itself).

Continuation lines are not a Unicode issue, since the continuation line conventions operate at a higher level. In Unicode, the distinction between physical line and logical line as just described does not exist.

8.9.3. Methods of Coding Line Structure

Several methods have been deployed for expressing a line structure at the character level:
The line break characters are summarized in Table 8-12. Note that CR and LF, the most common control characters for line breaks, are seriously ambiguous.
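The ambiguity of CR and LF can be illustrated with a small Python sketch. The function name split_lines and the sample string are hypothetical, chosen only for this illustration; the point is that the same character data yields a different division into lines depending on which single convention the reader assumes.

```python
# Illustrative sketch: the same data read under three different
# line break conventions. Names and sample data are hypothetical.
SEPARATORS = {"LF": "\n", "CR": "\r", "CRLF": "\r\n"}


def split_lines(text, convention):
    """Split text, treating only the given convention as a line break."""
    return text.split(SEPARATORS[convention])


# A sample that mixes all three conventions, as can happen when
# data has passed through several systems.
sample = "first\r\nsecond\rthird\nfourth"

for name in ("LF", "CR", "CRLF"):
    # Stray CR or LF characters end up inside the "lines" whenever
    # the assumed convention does not match the data.
    print(name, split_lines(sample, name))
```

Under each assumption, some control characters are left over as part of the line content, which is typically what shows up as odd symbols or collapsed lines when a file is opened in the wrong environment.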
Commonly used conventions on line breaks include the following:
On Windows systems, CR LF is normally used as a line break. However, in text-processing software such as MS Word, CR LF separates paragraphs. In such usage, there is normally no line structure inside a paragraph, so a paragraph is like a long line, as far as line break controls are concerned.

8.9.4. Editors, Word Processors, and Data Transfer

The differences described in the previous section are a common source of problems in data transfer between programs, even inside a single computer. The programs commonly used for processing text can be roughly divided into two categories.

An editor processes plain text and is often line-oriented, and lines are typically separated by LF (or CR or CR LF). At its simplest, an editor uses one font only, and it stores no font information in a file it creates. Widely used editors include Notepad and Emacs.

A word processor such as MS Word can handle different fonts, underlining, tabular formatting, and many other kinds of visual enhancements. This means that it saves data in a particular internal format that contains formatting data in addition to the text itself.

Normally a word processor can read or write plain text files, too. Thus, data can be transferred between a word processor and an editor in plain text at least. There are pitfalls, however. Differences in line break conventions often cause trouble.

If you use MS Word and tell the program to save a document as plain text, there is a considerable difference between "plain text" and "plain text with line breaks" in the format menu of the "Save As" function. In "plain text," a paragraph is saved as one long line, and this may cause trouble if you try to open the file in an editor. "Plain text with line breaks" splits a paragraph into lines, separated with CR LF, according to the current visual rendering (which depends on the window width). This is usually much more digestible to an editor. It may imply that information about paragraph breaks is lost, though.
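
When plain text is moved between programs that expect different line break conventions, one common remedy is to convert all line breaks to a single convention first. The following Python sketch shows one possible way to do this; the function name normalize_line_breaks is hypothetical, and the sketch assumes that only CR, LF, and CR LF occur as line breaks in the data (other line break characters are left untouched).

```python
import re

# CR LF must be tried first so that it counts as ONE line break,
# not as a CR followed by a separate LF.
_ANY_BREAK = re.compile(r"\r\n|\r|\n")


def normalize_line_breaks(text, target="\n"):
    """Replace every CR LF, lone CR, or lone LF with the target break."""
    return _ANY_BREAK.sub(target, text)


# Mixed conventions converted to LF only (e.g., for a Unix-style editor).
unix_text = normalize_line_breaks("one\r\ntwo\rthree\nfour")

# LF-only text converted to CR LF (e.g., for a Windows program).
windows_text = normalize_line_breaks("one\ntwo", target="\r\n")
```

Such a conversion only unifies the line break characters. It cannot restore information that was never stored, such as which line breaks in a file saved as "plain text with line breaks" were originally paragraph breaks.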