8.9. Line Structure Control

For practical reasons, text usually needs to be divided into lines when presented visually. This is caused by the properties of media such as a papyrus scroll, a sheet of paper, or a computer screen. If we used continuous tapes for writing, things would be different.

8.9.1. Different Approaches to Line Structuring

When text is presented in digitally coded form, it seems natural to leave out the line division. It can be handled by the rendering software, which selects the line length according to the rendering situation and styling instructions. This is typically the approach in modern text processing: a paragraph does not contain any line structure information. The same applies to data formats such as HTML and TeX: although the source format may contain line breaks, they are normally not rendered as line breaks but treated as equivalent to spaces. You would use explicit markup, such as <br> in HTML, to force a line break.

However, in the early days of computing things were different, and this is still reflected in important ways. Text data files were line-oriented, since the files were treated more or less as images of a deck of punched cards (with 80 characters on each card), line printer output (typically consisting of lines 132 characters wide), or computer screens (usually 80 characters wide). This means that the digital files were internally divided into lines as well, using some of the coding methods we will discuss shortly.

Line structure became semantically important, too. In the absence of more advanced methods, text was formatted using blank lines between paragraphs and other blocks of text. Indentation was created by using spaces at the start of a line. Spaces were also used to create table-like display of data or pictures formed from characters ("ASCII graphics"), and naturally this implied that line structure is essential.

Line structure is also used for presenting tabular data in formats such as Tab Separated Values (TSV) or Comma Separated Values (CSV). They are commonly used for transferring data as text between spreadsheet programs and other software. A row of a table is presented as one line of text, with a horizontal tab, a comma, or some other character as the separator between cells.
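As a minimal sketch of this row-per-line model, the following Python snippet reads tab-separated text with the standard csv module. The sample data and the function name parse_tsv are made up for illustration and do not come from any particular application.

```python
import csv
import io

def parse_tsv(text):
    """Return a list of rows, each row being a list of cell strings."""
    # newline="" passes the line breaks to the csv module unmodified.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t")
    return [row for row in reader]

# Hypothetical sample: one table row per line, cells separated by tabs.
sample = "name\tcity\nAnna\tHelsinki\nBob\tOslo\n"
rows = parse_tsv(sample)
```

Each line of the input becomes one row, so any line break inside a cell would need a quoting convention on top of this simple model.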

Many computer languages have been designed to be line-structured. Although in most programming languages (excluding original FORTRAN, Python, and a few others), line structuring is just visual formatting for the human eye, most command languages use a line as a fundamental concept. Typically, a command consists of one line.

In particular, Internet protocols typically use command (or control) languages that are line structured. For example, an email message header is a logical line, beginning with a keyword and a colon (e.g., From: or Subject:) and extending to the end of the line. In such headers, the continuation line convention is that a line beginning with at least one space or horizontal tab is treated as a continuation of the preceding physical line.
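The continuation convention for headers can be illustrated with a short Python sketch. It assumes the physical lines have already been split and stripped of their line break characters. It also accepts a leading horizontal tab, which is how header folding is usually handled in mail software. The function name unfold_headers and the addresses are hypothetical.

```python
def unfold_headers(physical_lines):
    """Join continuation lines (starting with space or tab) to the preceding line."""
    logical = []
    for line in physical_lines:
        if line[:1] in (" ", "\t") and logical:
            # Continuation: append to the previous logical line.
            logical[-1] += line
        else:
            logical.append(line)
    return logical

raw = [
    "From: alice@example.com",
    "Subject: A rather long subject",
    " that continues on the next line",
    "To: bob@example.com",
]
headers = unfold_headers(raw)
```

Here the four physical lines yield three logical header lines. The leading space of the continuation line is kept, so the words on either side of the fold stay separated.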

8.9.2. Lines and Records

Lines are often called "records," or "physical records" to distinguish them from a logical record concept. A logical record may correspond to one physical record or a sequence of physical records (e.g., a postal address record consists of several lines, or physical records), or the correspondence can be more complicated. In any case, logical and physical records are at different conceptual levels, and logical record structure is either not explicit at all or it is expressed using tools above the character level.

The situation is somewhat more complex, though. Although a physical record (in text data) normally corresponds to a line, it may actually span several lines. To express this somewhat confusing situation, we can distinguish between physical line and logical line.

In line-structured languages and data, it may happen that a line needs to be longer than conveniently fits into one physical line. In such cases, some continuation line convention is applied so that one logical line can consist of several physical lines. Even in programming languages that are not line structured, continuation line conventions are useful for constructs that do not permit a line break inside them, most importantly string literals. The conventions vary. A common one is that a reverse solidus \ (backslash) at the end of a line indicates that the logical line continues at the start (character position 1) of the next physical line, and the \ itself is not treated as data. In such a convention, \ before a line break effectively nullifies the line break (and the \ character itself).
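The backslash convention can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the input is a list of physical lines without their line break characters, and a backslash counts as a continuation marker whenever it is the last character of a line. The name join_continuations is invented for the example.

```python
def join_continuations(physical_lines):
    """Merge each physical line ending in a backslash with the line after it."""
    logical = []
    pending = ""
    for line in physical_lines:
        if line.endswith("\\"):
            # Drop the backslash; the line break after it is nullified.
            pending += line[:-1]
        else:
            logical.append(pending + line)
            pending = ""
    if pending:
        # Input ended with a dangling continuation; keep what was collected.
        logical.append(pending)
    return logical

lines = ['message = "Hello, \\', 'world"', "print(message)"]
result = join_continuations(lines)
```

The first two physical lines form one logical line, with the second continuing exactly where the backslash stood.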

Continuation lines are not a Unicode issue, since the continuation line conventions operate at a higher level. In Unicode, the distinction between physical line and logical line as just described does not exist.

8.9.3. Methods of Coding Line Structure

Several methods have been deployed for expressing a line structure at the character level:

Precede each line by data that expresses the length of the line in octets. Writing characters must be line-buffered: they are written to an internal buffer that is flushed out when the line is complete and its length can be written out before the line itself.
Make all lines of the same, fixed and known length, such as 80 characters, using spaces or other neutral characters for padding. Essentially, a text file is then structurally equivalent to a deck of punched cards with no separator between the cards. This is wasteful but simple, and it was widely used in the early days of computing. You can still find legacy data and even legacy systems that use such an approach. Care must be taken when dealing with trailing spaces, since some of them might be significant and not just padding.
Use control characters for start of line and end of line. Although this may seem unnecessarily explicit, as compared with indicating just line breaks, it is the line structure model used in SGML, for example. By default, SGML uses line feed as start of line (record start, RS) and carriage return as end of line (record end, RE). In implementations, it is common to use line break control as described next, and programs are expected to infer the missing start of line (and end of line) characters.
Use control characters between lines. The expression "line break" is often used to refer to one or more control characters used for the purpose. This is the most common approach nowadays, but the problem is that there are several line break conventions. Even the last line is usually terminated by a line break, although it is then ambiguous whether the data ends with an empty line or not. The control characters used in different environments are listed in the next table.
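The first two methods can be made concrete with a rough Python sketch. The details are assumptions chosen for illustration only: a 2-octet big-endian length prefix, UTF-8 encoding, and space padding to 80 characters. Real systems that use these methods differ in such details.

```python
import struct

def encode_length_prefixed(lines):
    """Precede each line by its length in octets (2-octet big-endian here)."""
    out = bytearray()
    for line in lines:
        data = line.encode("utf-8")
        # The length is known only once the whole line has been collected.
        out += struct.pack(">H", len(data)) + data
    return bytes(out)

def decode_length_prefixed(blob):
    lines, pos = [], 0
    while pos < len(blob):
        (length,) = struct.unpack_from(">H", blob, pos)
        pos += 2
        lines.append(blob[pos:pos + length].decode("utf-8"))
        pos += length
    return lines

def encode_fixed(lines, width=80):
    """Pad every line with spaces to the same width, with no separators."""
    for line in lines:
        if len(line) > width:
            raise ValueError("line longer than record width")
    return "".join(line.ljust(width) for line in lines)

def decode_fixed(text, width=80):
    # Stripping cannot tell padding from significant trailing spaces.
    return [text[i:i + width].rstrip(" ") for i in range(0, len(text), width)]

sample = ["first line", "", "third"]
```

Both encodings round-trip the sample, but decoding a fixed-length record whose line ended in spaces returns the line without them. This is the trailing-space pitfall mentioned above.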

The line break characters are summarized in Table 8-12. Note that CR and LF, the most common control characters for line breaks, are seriously ambiguous.

Table 8-12. Line break characters in Unicode
Abbr.	Code	Unicode name	Comments
LF	U+000A	Line feed	Line break or paragraph break; "control-J"
VT	U+000B	Vertical tabulation	Line break in MS Word; "control-K"
FF	U+000C	Form feed	Page break, implying line break; "control-L"
CR	U+000D	Carriage return	Line break or paragraph break; "control-M"
NEL	U+0085	Next line	Line break in some systems
LS	U+2028	Line separator	Unambiguous, but used very little
PS	U+2029	Paragraph separator	Unambiguous, but used very little
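As an illustration of how a program might treat all the characters in the table as line breaks, here is a small Python sketch based on a regular expression. The pattern and the name split_lines are assumptions made for this example. Whether VT and FF should count as plain line breaks depends on the application.

```python
import re

# The seven characters of the table; CR LF is tried first so that
# the pair counts as a single line break.
LINE_BREAK = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

def split_lines(text):
    """Split text at any of the line break characters listed in the table."""
    return LINE_BREAK.split(text)

sample = "a\nb\r\nc\rd\x85e\u2028f\u2029g"
parts = split_lines(sample)
```

Python's built-in str.splitlines() recognizes the same characters, plus a few other control characters, so it gives the same result for this sample.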

Commonly used conventions on line breaks include the following:

Some systems (e.g., classic Mac OS) use CR between lines.
Some systems (e.g., Unix) use LF between lines; XML follows this practice in the sense that XML processors canonicalize line breaks to LF.
Many systems use a CR LF pair (carriage return immediately followed by line feed) to indicate a single line break, and this is a basic convention in most Internet contexts, for example.
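When data moves between environments that follow different conventions, a program often converts all line breaks to a single form first. The following Python sketch shows one way to do this. The function name normalize_line_breaks and the default target of LF are choices made for the example, not a standard API.

```python
def normalize_line_breaks(text, target="\n"):
    """Convert CR LF, lone CR, and lone LF to a single convention."""
    # Replace CR LF first so that the pair is not counted as two breaks.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text if target == "\n" else text.replace("\n", target)

mixed = "mac\rclassic\nunix\r\nwindows"
as_lf = normalize_line_breaks(mixed)
as_crlf = normalize_line_breaks(mixed, target="\r\n")
```

The order of the replacements matters: handling a lone CR before the CR LF pair would turn each pair into two line breaks.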

On Windows systems, CR LF is normally used as a line break. However, in word-processing software such as MS Word, CR LF separates paragraphs. In such usage, there is normally no line structure inside a paragraph, so a paragraph is like a long line, as far as line break controls are concerned.

8.9.4. Editors, Word Processors, and Data Transfer

The differences described in the previous section are a common source of problems in data transfer between programs, even inside a single computer. The programs commonly used for processing text can be roughly divided into two categories. An editor processes plain text and is often line oriented, and lines are typically separated by LF (or CR or CR LF). At the simplest, an editor uses one font only, and it stores no font information in a file it creates. Widely used editors include Notepad and Emacs. A word processor such as MS Word can handle different fonts, underlining, tabular formatting, and many other kinds of visual enhancements. This means that it saves data in a particular internal format that contains formatting data in addition to the text itself.

Normally a word processor can read or write plain text files, too. Thus, data can be transferred between a word processor and an editor in plain text at least. There are pitfalls, however. Differences in line break conventions often cause trouble. If you use MS Word and tell the program to save a document as plain text, there is a considerable difference between "plain text" and "plain text with line breaks" in the format menu of the "Save As" function. In "plain text," a paragraph is saved as one long line, and this may cause trouble if you try to open the file in an editor. "Plain text with line breaks" splits a paragraph into lines, separated with CR LF, according to the current visual rendering (which depends on the window width). This is usually much more digestible to an editor. It may imply that information about paragraph breaks is lost, though.
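One possible workaround, sketched below in Python, is to save as "plain text" (one long line per paragraph) and then wrap each paragraph separately, leaving an empty line between paragraphs so that the paragraph breaks survive. The sketch assumes that paragraphs are separated by CR LF in the input. The width of 72 and the name wrap_paragraphs are arbitrary choices for the example.

```python
import textwrap

def wrap_paragraphs(text, width=72):
    """Wrap each one-line paragraph; separate paragraphs with an empty line."""
    paragraphs = text.split("\r\n")
    # Skip empty paragraphs, then wrap each remaining one on its own.
    wrapped = [textwrap.fill(p, width=width) for p in paragraphs if p]
    return "\n\n".join(wrapped)

doc = ("First paragraph, stored as one long line. " * 3
       + "\r\n" + "Second paragraph.")
out = wrap_paragraphs(doc)
```

The result uses LF as the line break and an empty line as the paragraph separator, which most editors handle well.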