11.4. Character Input and Output

In this section, we will discuss some special topics of character I/O. The most general question here is whether we should read and write characters one at a time or by lines. This reflects an old division: in the early days of computing, I/O was line oriented, typically with punched cards (corresponding to lines of exactly 80 characters) as input devices and a line printer (typically with line length of 80 or 132) as an output device. After describing these modes, we consider Java file I/O and some character input problems in web forms.

11.4.1. Character-Oriented and Line-Oriented Processing

In character-oriented input, a program reads a character at a time, typically using a subprogram like getchar() in C. This means that line breaks will appear as characters returned by the subprogram. Normally they are canonicalized, by the programming language's basic I/O routines, to some unified representation. For example, in C, the getchar() function returns the character denoted by the character literal '\n'. The identity of line break varies between C implementations; in practice, it is either CR or LF. In any case, you can test for end of line by using code like if (ch == '\n') ....

In line-oriented input, a complete line is read at a time. The data is typically stored to a memory area specified by a parameter of the invocation of the input routine. It is usually the caller's responsibility to allocate sufficient storage for the data. FORTRAN uses primarily line-oriented I/O: one read operation reads at least one line, or (physical) record, to use the FORTRAN terminology.

It is easy to build line-oriented input upon character-oriented input; the opposite is not possible in any direct way. The C language, for example, has line-oriented I/O functions as well, such as gets() for getting an entire line and fgets(), which may read just part of a line when the line is longer than the buffer supplied. However, many people think that such functions are unsafe, since it is difficult to control the input process and too easy to fail to allocate sufficient storage. In particular, gets() has no way to limit the amount of data it stores, and it was removed from the C standard in C11. Thus, the argument goes, you might just as well write code of your own for reading a line using a function for reading one character.
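As a rough illustration of that approach, the following sketch builds a line reader on top of a routine that reads one character at a time. It is written in Java rather than C, with the read() function of a Reader playing the role of getchar(). The class name CharLineReader and the function name readLine are hypothetical, chosen for this example only. The sketch assumes that CR, LF, and the pair CR LF should all count as one line break, and it lets a StringBuilder grow as needed, so the caller does not have to guess the line length in advance:

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

public class CharLineReader {
    /**
     * Reads one line, one character at a time.
     * Returns the line without its line break, or null at end of input.
     * CR, LF, and CR LF are all treated as a single line break.
     */
    static String readLine(PushbackReader in) throws IOException {
        StringBuilder line = new StringBuilder();   // grows as needed
        int ch = in.read();
        if (ch == -1) {
            return null;                            // no more input
        }
        while (ch != -1 && ch != '\n' && ch != '\r') {
            line.append((char) ch);
            ch = in.read();
        }
        if (ch == '\r') {                           // CR may be followed by LF
            int next = in.read();
            if (next != '\n' && next != -1) {
                in.unread(next);                    // belongs to the next line
            }
        }
        return line.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample input mixing the three line break conventions.
        Reader source = new StringReader("one\ntwo\r\nthree\rfour");
        PushbackReader in = new PushbackReader(source);
        String line;
        while ((line = readLine(in)) != null) {
            System.out.println("[" + line + "]");
        }
    }
}
```

With the sample input in main, the program prints [one], [two], [three], and [four] on separate lines. The one-character pushback is needed only because a CR has to be checked for a following LF.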

In Perl, input is essentially line-oriented. It is also implicit in the sense that you do not write a subprogram call but enclose a file handle in the <> operator. The evaluation of a file handle implies the input of a line. Thus, if you write $foo = <STDIN> in Perl, you ask the Perl interpreter to read a line of input from the STDIN file and assign the data (including the trailing end of line character) to the variable $foo. Things can be even more implicit in Perl. If you write while(<STDIN>) { zap(); }, then you have written a loop that reads the entire STDIN file (standard input) one line at a time and executes the subprogram call zap(). Within the subprogram, the current input line can be accessed as the value of the built-in variable $_.

In order to process input character by character in Perl, you would read a line and then use string processing operators to extract individual characters. Moreover, to refer to a single character in a string, you would use substr, the substring operator, and specify a substring of length 1. This may sound clumsy, but Perl programmers are used to it. On the other hand, they try to avoid dealing with characters on such an individual basis and use matching and replacement operators instead.

Perl has the interesting feature that although you read a line at a time, you can make an entire file a single line as far as Perl I/O is concerned. The tool to use is the special variable $/, which specifies the character to be recognized as line break. By explicitly setting it to an undefined value, you tell Perl to treat no character as line break. This means that CR or LF will be read and treated as normal data characters. Thus, assuming you have opened a file and assigned the handle DATA to it, the following Perl code would read the entire content of the file into the variable $stuff as one string:

    $/ = undef;
    my $stuff = <DATA>;

This is very handy in many situations, where the program can be simplified by treating the input file as one long string stored into a scalar variable. A typical example is a simple replacement operation that should be performed throughout the data. The following program copies a file to another, replacing each occurrence of the euro sign € (U+20AC) with the word "euros":

    open(IN, "<:utf8", "orig.txt") or die "can't do input";
    open(OUT, ">:utf8", "new.txt") or die "can't do output";
    $/ = undef;
    $all = <IN>;
    print $all;
    $all =~ s/\x{20AC}/euros/g;
    print OUT $all;

11.4.2. Perl I/O

Although Perl uses UTF-8 internally, it does not interpret input data as UTF-8 encoded by default. Instead, it uses the encoding that is normal in its environment or that has been specified in the locale settings. One reason for this is compatibility: it keeps old programs working. To make programs use UTF-8 on input, you need to specify the encoding.

In Perl, a scalar value is internally accompanied by a utf8 flag, which indicates whether the value is to be interpreted as UTF-8 encoded. String constants, for example, have this flag set. When reading from a file, you normally get data that does not have the flag set. To specify that an input file be read as UTF-8, you can do as follows in order to open a file and to read its first line into a variable:

    open(IN, "<:utf8", "data.utx") or die "Missing data file";
    $dataline = <IN>;

In the extra argument "<:utf8", the less-than sign specifies that the file is opened for input only, and the rest specifies the encoding to be used. The filename is given in another argument. As you might guess, you can open an output file for writing in UTF-8 encoding in a similar manner, e.g., open(OUT, ">:utf8", "results.txt").

Alternatively, you can open an input file without the extra argument and convert the data after reading it. For this, you would use the Encode package. The following example shows just the basic approach. It does not contain error processing for the encoding operation, which may fail, since the data might not be valid UTF-8 data:

    require Encode;
    open(IN, "<data.utx") or die "Missing data file";
    $rawline = <IN>;
    $dataline = Encode::decode_utf8($rawline);

In the output of Unicode characters in Perl, a common problem is the warning "Wide character in print." Technically, the reason is that you write UTF-8 characters to a stream that has not been opened for such writing. This can be prevented by opening an output stream in UTF-8 mode, as described above. For the standard output stream STDOUT, you use a statement that changes its mode, as in the following example:

    binmode STDOUT, ":utf8";
    print "Hello world \x{263A}!\n";

The following demonstration program combines some of the techniques discussed here. It copies a UTF-8 encoded file but replaces Greek letters with inverted question marks, ¿:

    use charnames ':full';
    open(IN, "<:utf8", "data.utx") or die "Missing data file";
    open(OUT, ">:utf8", "data2.utx") or die "Cannot open output file";
    $line = 0;
    while(<IN>) {
        $line++;
        if($count = s/\p{Greek}/\N{INVERTED QUESTION MARK}/g) {
            print "$count replacement(s) on line $line.\n";
        }
        print OUT $_;
    }

You can specify other encodings, too, when you open a file. Instead of utf8, you would use a construct of the form encoding( name ) in the second argument of open. The following program performs a code conversion from windows-1252 to UTF-8:

    open(IN, "<:encoding(windows-1252)", "dat.txt") or die "No data file";
    open(OUT, ">:utf8", "dat2.txt") or die "Cannot write output";
    while(<IN>) {
        print OUT;
    }
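For comparison, a similar code conversion could be sketched in Java, the language used in the next subsection. The idea is the same: decode octets to characters with one encoding and encode them back with another. The class name Transcode and the function name transcode are hypothetical, and the sketch assumes that the Java implementation supports the windows-1252 encoding. To keep the example self-contained, main converts a small array of octets in memory; for files, you would pass a FileInputStream and a FileOutputStream instead:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcode {
    /** Copies all data from in to out, decoding with 'from' and encoding with 'to'. */
    static void transcode(InputStream in, Charset from,
                          OutputStream out, Charset to) throws IOException {
        Reader reader = new BufferedReader(new InputStreamReader(in, from));
        Writer writer = new BufferedWriter(new OutputStreamWriter(out, to));
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        writer.flush();   // push buffered characters through the encoder
    }

    public static void main(String[] args) throws IOException {
        // "€ 10" in windows-1252: the euro sign is the single octet 0x80.
        byte[] source = { (byte) 0x80, 0x20, 0x31, 0x30 };
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(source),
                  Charset.forName("windows-1252"),
                  result, StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : result.toByteArray()) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());
    }
}
```

The sample run prints E2 82 AC 20 31 30: the single octet 0x80 has become the three-octet UTF-8 representation of U+20AC, and the ASCII characters are unchanged.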

11.4.3. Java File I/O

In Java, you can perform file output in several ways, such as the following:

Functions like print and println in the PrintWriter class, for textual output. The format is in the system's native encoding, which may well be a non-Unicode encoding. These functions are polymorphic (generic), i.e., they accept arguments of different types.
The write function in the OutputStreamWriter class, which acts as a bridge between streams of characters and streams of octets, encoding character data as needed. The function is polymorphic: the argument can be a character, an array of characters, or a string. The default encoding is the system's native encoding, but the encoding (such as UTF-8) can be specified as a second argument when creating an OutputStreamWriter object.
The write... functions in the DataOutputStream class. They mean "binary" output, and for character and string data, this means UTF-16 format. You need to select the function name according to the argument type, e.g., writeChars for a string.
The writeUTF function in the DataOutputStream class. It takes a string argument, so to write anything else, you need to convert it to a string first. The function writes data in the Modified UTF-8 encoding (see Chapter 6). This means that the NUL character and all non-BMP characters are represented differently from UTF-8. Moreover, the function first writes two octets that indicate (when interpreted as a 16-bit integer) the number of octets that constitute the data. Of course, such data is meant to be read by the corresponding input routine, readUTF, or other code that recognizes or at least skips the octets that express the count.
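The difference between Modified UTF-8 and UTF-8 can be made visible with a small experiment. The following sketch writes a few one-character strings with writeUTF into a memory buffer and prints the octets next to the standard UTF-8 octets. The class name ModifiedUtf8Demo and the helper functions hex, modifiedUtf8, and utf8 are hypothetical names for this example. The sample characters are A, NUL (U+0000), and the non-BMP character U+1F600, which is written in Java source as a surrogate pair:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    /** Formats octets as space-separated uppercase hex, e.g. "41 C3 A9". */
    static String hex(byte[] octets) {
        StringBuilder sb = new StringBuilder();
        for (byte b : octets) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Octets written by writeUTF, including the two-octet count. */
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeUTF(s);
        out.flush();
        return buffer.toByteArray();
    }

    /** Octets of the string in standard UTF-8. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String[] names = { "A", "NUL", "U+1F600" };
        String[] samples = { "A", "\0", "\uD83D\uDE00" };
        for (int i = 0; i < samples.length; i++) {
            System.out.println(names[i]
                + ": writeUTF = " + hex(modifiedUtf8(samples[i]))
                + ", UTF-8 = " + hex(utf8(samples[i])));
        }
    }
}
```

For A, the two formats differ only by the count: 00 01 41 versus 41. For NUL, writeUTF produces 00 02 C0 80, whereas UTF-8 uses the single octet 00. For U+1F600, writeUTF produces 00 06 ED A0 BD ED B8 80, encoding each surrogate separately as three octets, whereas UTF-8 uses the four octets F0 9F 98 80.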

The following program illustrates writing a string into a file in each of the ways described above. The test string is the three-character string written in Java source as Aé\u263a. The first character is an ASCII character, the second one is a Latin 1 character that occupies two octets in UTF-8, and the third one is U+263A, the smiling face:

    import java.io.*;

    public class output {
        public static void main(String[] args) {
            String msg = "Aé\u263a";
            String filename = "test.txt";
            try {
                OutputStream testf = new FileOutputStream(filename);
                PrintWriter testfile = new PrintWriter(testf);
                testfile.print(msg);
                testfile.close();
                System.out.println("Wrote " + filename);
            } catch(Exception error) {
                System.out.println("Failed to write " + filename);
            }
            filename = "testu.txt";
            try {
                OutputStream testf = new FileOutputStream(filename);
                OutputStreamWriter testfile =
                    new OutputStreamWriter(testf,"UTF-8");
                testfile.write(msg);
                testfile.close();
                System.out.println("Wrote " + filename);
            } catch(Exception error) {
                System.out.println("Failed to write " + filename);
            }
            filename = "test16.txt";
            try {
                OutputStream testf = new FileOutputStream(filename);
                DataOutputStream testfile = new DataOutputStream(testf);
                testfile.writeChars(msg);
                testfile.close();
                System.out.println("Wrote " + filename);
            } catch(Exception error) {
                System.out.println("Failed to write " + filename);
            }
            filename = "test8.txt";
            try {
                OutputStream testf = new FileOutputStream(filename);
                DataOutputStream testfile = new DataOutputStream(testf);
                testfile.writeUTF(msg);
                testfile.close();
                System.out.println("Wrote " + filename);
            } catch(Exception error) {
                System.out.println("Failed to write " + filename);
            }
            System.exit(0);
        }
    }

If the sample program is executed on a system that uses ISO-8859-1 as its native encoding, the first write effectively fails, though no exception is raised and no error message is issued. The character U+263A cannot be represented in ISO-8859-1, so the output routine might write a question mark, ?, instead. (This is questionable, but such things happen.) The other ways work well, though you cannot directly view the file contents in programs that support ISO-8859-1 only. The results are summarized in Table 11-7, which shows the contents of the files by octets. (The functions of the DataOutputStream class always write the high-order octet first, regardless of the byte order of the computer.)

Table 11-7. File output in Java: encoding of sample text "Aé☺"
Method	Filename	Content (as octets in hex)	Comment
`print`	test.txt	41 E9 3F	ISO-8859-1, ☺ as ?
`write`	testu.txt	41 C3 A9 E2 98 BA	UTF-8
`writeChars`	test16.txt	00 41 00 E9 26 3A	UTF-16, no BOM
`writeUTF`	test8.txt	00 06 41 C3 A9 E2 98 BA	UTF-8 with octet count (00 06)
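The octets in the table can be checked without creating any files by writing into a memory buffer. The following is a minimal sketch under some assumptions: the class name EncodingTable and its helper functions are hypothetical, and an OutputStreamWriter with an explicitly specified ISO-8859-1 encoding stands in for a PrintWriter on a system with ISO-8859-1 as its native encoding (on recent Java versions the default encoding is typically UTF-8, so the first row would otherwise come out differently). The letter é is written as \u00e9 so that the result does not depend on the encoding of the source file:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingTable {
    /** Formats octets as space-separated uppercase hex. */
    static String hex(byte[] octets) {
        StringBuilder sb = new StringBuilder();
        for (byte b : octets) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes the string with an OutputStreamWriter in the given encoding. */
    static byte[] viaWriter(String s, Charset encoding) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(buffer, encoding);
        writer.write(s);
        writer.close();
        return buffer.toByteArray();
    }

    /** Octets written by DataOutputStream.writeChars. */
    static byte[] viaWriteChars(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeChars(s);
        out.flush();
        return buffer.toByteArray();
    }

    /** Octets written by DataOutputStream.writeUTF. */
    static byte[] viaWriteUTF(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeUTF(s);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String msg = "A\u00e9\u263a";
        System.out.println("ISO-8859-1: "
            + hex(viaWriter(msg, StandardCharsets.ISO_8859_1)));
        System.out.println("UTF-8:      "
            + hex(viaWriter(msg, StandardCharsets.UTF_8)));
        System.out.println("writeChars: " + hex(viaWriteChars(msg)));
        System.out.println("writeUTF:   " + hex(viaWriteUTF(msg)));
    }
}
```

The four lines printed correspond to the four rows of the table. In the first line, the octet 3F shows the question mark that silently replaced U+263A.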

The methods for file input are analogous to output methods. We will here give just a rather trivial example: a program that reads a UTF-8 encoded file and prints the (decimal) Unicode code numbers of the characters. The program uses the read function in the InputStreamReader class, which is analogous to the OutputStreamWriter class. Using these classes, you can create a portable program and handle any character encoding supported by the Java implementation. The read function returns the code number of the input character or -1, which indicates the end of file:

    import java.io.*;

    public class IO {
        public static void main(String[] args) {
            try {
                FileInputStream datafile =
                    new FileInputStream(new File("test.txt"));
                InputStreamReader input =
                    new InputStreamReader(datafile,"UTF-8");
                int ch;
                while((ch = input.read()) != -1) {
                    System.out.println(ch);
                }
            } catch(Exception error) {
                System.out.println("I/O error");
            }
            System.exit(0);
        }
    }

11.4.4. Buttons for Character Input

In "Virtual Keyboards" in Chapter 2, we discussed the idea of buttons for entering characters in a data entry form. To implement it in an HTML form, you would use an input element of type button and associate an onclick event with it. The event handler would append a character to the content of an input box and focus on that box, so that the user can continue typing with the normal keyboard. The interface is illustrated in Figure 11-1.

Figure 11-1. A form with extra buttons for character input

The idea can be implemented in JavaScript as follows. For simplicity, the example has just two buttons, for entering ä and ö:

    <form action="http://www.tracetech.net:8081/">
    <div><label for="word">Finnish or English word:</label></div>
    <div>
      <input type="text" id="word" name="word" size="25" maxlength="80">
      <input type="submit" value="Search">
    </div>
    <div>
      <input type="button" value="ä" onclick="append('ä')">
      <input type="button" value="ö" onclick="append('ö')">
    </div>
    </form>
    <script type="text/javascript">
    // The text field needs an id so that the label and getElementById can refer to it.
    var word = document.getElementById('word');
    function append(ch) {
      word.value += ch;
      word.focus();
    }
    </script>