6.4 CHARACTER TYPES

C++ gives you three different character types:

      char      unsigned char      signed char

Almost universally, a C++ char is allocated one byte so that it can hold one of 256 values. The decimal value stored in such a byte can be interpreted as ranging either from -128 to 127 or from 0 to 255, depending on the implementation. In either case, the bit patterns for values between 0 and 127 are almost always reserved for letters, digits, punctuation marks, and so on, according to the ASCII format. All printable ASCII characters belong to this set of 128 values. From the standpoint of writing portable code, note that some of the characters one usually sees in C++ source code may not be available in a particular character set used for C++ programming. For example, some European character sets do not provide the characters {, }, [, ], and so on.

The decimal values of the bit patterns stored in a signed char are always interpreted as varying from -128 to 127, and those for an unsigned char^[3] from 0 to 255. So, is a plain char in C++ an unsigned char or a signed char? That, as mentioned before, depends on the implementation.

A char variable can be initialized by either an integer whose value falls within a certain range or by what's known as a character literal:

      char ch = 98;      // ch is assigned the character 'b'
      char x = 'b';

The quantity 'b' is referred to as a character literal or a character constant. A character literal is in reality a symbolic constant for the integer value of the character. In the code fragment

      char y = '2';
      int z = y + 8;      // works for both C++ and Java

the value of z would be 58 because, under ASCII coding, the integer that corresponds to the bit pattern for the character '2' is 50.^[4]

Like C, C++ also allows an individual character to be represented by an escape sequence, which is a backslash^[5] followed by a sequence of characters that must be within a certain range. There are two kinds of escape sequences: character escapes and numeric escapes. Character escapes, such as \n, \t, and so on, represent ASCII's more commonly used control characters which, when sent to an output device, move the cursor to a new line, or to the next horizontal tab stop, and so on. The character '\n' is frequently called the newline character and '\t' the tab character.

Since character escapes are few in number, a more general form of an escape sequence for representing an individual character is the numeric escape. A numeric escape comes in two forms: hexadecimal and octal. In the following declarations, all initializing x to the same value, the declaration in line (C) uses the hexadecimal form for the escape sequence shown, and the one in line (D) the octal form:

      char x = 'b';         // decimal value of 'b' is 98                    //(A)
      char x = 98;                                                           //(B)
      char x = '\x62';      // 62 is hex for 98                              //(C)
      char x = '\142';      // 142 is octal for 98                           //(D)

In general, the hexadecimal (referred to frequently as just hex) form of a numeric escape in C++ must always be of the form

      \xdddd....d

where every character after the letter 'x' is a hexadecimal digit (0-9 and a-f or A-F to represent the decimal values 0-15). C++ allows an arbitrary number of characters after the letter 'x' as long as each is a valid hexadecimal digit, with the additional stipulation that the decimal value of the hex number does not exceed 255 for 8-bit characters. The hexadecimal number 62 in line (C) represents the decimal 98, which is the ASCII code for the letter b. Similarly, the octal number 142 in line (D) also represents the decimal 98 and, therefore, corresponds again to the same letter, 'b'. Unlike octal numbers in general, an octal escape sequence does not have to begin with a 0. Also, a maximum of three digits is allowed in an octal escape sequence.

These properties of escape sequences require care when they are used as characters in string literals. This point is illustrated with the examples in the following program:

 
      //CharEscapes.cc

      #include <iostream>
      #include <string>
      using namespace std;

      int main()
      {
          string y1( "a\x62" );
          cout << y1 << endl;      // y1 is the string "ab".
                                   // Printed output: ab

          string y2( "a\x0a" );
          cout << y2 << endl;      // y2 is the string formed by
                                   // the character 'a' followed
                                   // by the newline character
                                   // Printed output: a

          string y3( "a\nbcdef" );
          cout << y3 << endl;      // y3 is the string formed by
                                   // the character 'a' followed
                                   // by the newline character
                                   // represented by the character
                                   // escape '\n' followed by the
                                   // characters 'b', 'c', 'd', 'e',
                                   // and 'f'.
                                   // Printed output: a
                                   //                 bcdef

          string y4( "a\x0awxyz" );
          cout << y4 << endl;      // y4 is the string formed by
                                   // character 'a' followed by the
                                   // newline character represented by
                                   // the numerical escape in hex, '\x0a',
                                   // followed by the characters 'w',
                                   // 'x', 'y', and 'z'.
                                   // Printed output: a
                                   //                 wxyz

        //string y5( "a\x0abcdef" );     // ERROR
        //cout << y5 << endl;            // because the number whose hex
                                         // representation is '0abcdef' is
                                         // out of range for a char

          string y6( "a\xef" );
          cout << y6 << endl;      // Correct but the character after
                                   // 'a' may not be printable

          string w1( "a\142" );
          cout << w1 << endl;      // w1 is the string formed by
                                   // the character 'a' followed by
                                   // the character 'b'.
                                   // Printed output: ab

          string w2( "a\142c" );
          cout << w2 << endl;      // w2 is the string formed by the
                                   // character 'a' followed by the
                                   // character 'b' followed by the
                                   // character 'c'.
                                   // Printed output: abc

          string w3( "a\142142" );
          cout << w3 << endl;      // w3 is the string formed by the
                                   // character 'a' followed by the
                                   // character 'b' followed by the
                                   // characters '1', '4', and '2'.
                                   // Printed output: ab142

          string w4( "a\79" );
          cout << w4 << endl;      // w4 is the string formed by the
                                   // character 'a' followed by the
                                   // bell character, followed by
                                   // the character '9'.
                                   // Printed output: a9

          string w5( "\x00007p\x0007q\x0007r\x007s\x07t\x7u" );
          cout << w5 << endl;      // Each letter is preceded by the
                                   // bell character.
                                   // Printed output: pqrstu

          return 0;
      }

A Java char has 2 bytes. Any two contiguous bytes in the memory represent a legal Java char. Which 16-bit bit pattern corresponds to what character is determined by the Unicode representation. As was mentioned earlier, the integer values 0 through 255 in the Unicode representation correspond to Latin-1 characters and the first 128 of these are the same as the encodings for the 7-bit ASCII character set (except for an additional byte of zeros on the high side). A Java char is unsigned, meaning that its integer values go from 0 through 65,535.

In Java, all of the following four declarations are equivalent:

      char x = 'b';             // value of 'b' is 98               //(E)
      char x = 98;                                                  //(F)
      char x = '\u0062';        // 0062 is hex for 98               //(G)
      char x = '\142';          // 142 is octal for 98              //(H)

As shown in line (G), the hex form of a numeric escape in Java begins with the letter 'u', as opposed to the letter 'x' for C++. In general, the hex form of a numeric escape in Java must always be of the form

      \udddd

where each d is a hexadecimal digit. The declaration in line (H) above uses an escape sequence in its octal representation. In all four cases, the value of x will be the same, the letter 'b'. Comparing the numeric escapes in lines (E) through (H) for Java and in lines (A) through (D) for C++, we note that only the hex versions are different. The hex version for Java must consist of exactly four hex digits, whereas C++ allows an arbitrary number of hex digits.

Suppose you are translating a C++ program into a Java program. Is it always possible to substitute Java's \udddd escape for C++'s \xd…d escape of an identical decimal value (that is, one under 256)? Not so. For example, the declaration

      char ch = '\x000a';             // ok in C++                    //(I)

gives us a valid char in C++ consisting of the newline character. An equivalent Java declaration

      char ch = '\u000a';            // ERROR in Java                 //(J)

is illegal. By the same token, the second string literal we used in the C++ program CharEscapes.cc

      string y2 = "a\x0a";          // ok in C++                      //(K)

is legal. However, a comparable declaration in Java

      String s = "a\u000a";        // ERROR in Java                   //(L)

is illegal for constructing a string literal. The reason why the \udddd escapes shown in lines (J) and (L) cause errors in Java is that the very first thing a Java compiler does with a source file is to scan it and resolve all \udddd escapes on the fly. As each \udddd escape is encountered, it is replaced immediately by the corresponding 2-byte Unicode character. If a \udddd escape represents the newline character, as is the case with the escape sequence \u000a, a newline is inserted into the source file at that point immediately,^[6] which leaves the character or string literal unterminated on its line. The same thing happens with the Unicode escape \u000d, which represents the carriage return.

The above discussion should not be construed to imply that you cannot embed control characters such as the newline or the carriage-return characters in a character or a string literal. When, for example, a newline character is desired, one can always use the character escape '\n' to achieve the same effect.

Shown below is Java's version of the C++ program CharEscapes.cc presented earlier in this section. This program retains as many of the string literals of the C++ program as make sense in Java. We have also avoided the use of \u000a as a newline character in the string literals.

In the program shown below, note in particular that whereas the string y5 resulted in an error in C++, Java has no problems with it. Java forms a Unicode character out of the escape \u0abc, leaving the rest of the characters for the string literal. But since Java cannot find a print representation for the Unicode character, it outputs a question mark in its place when the print function is invoked on the string. The same may be true for the print representation of the Unicode character formed from the escape sequence in y6. The string literals w1 through w4 use octal escapes in the same manner as we showed earlier for the C++ program.

 
      //CharEscapes.java

      class Test {
          public static void main( String[] args ) {
              String y1 = "a\u0062";
              print( "y1:\t" + y1 );           // Printed output:  ab

              String y2 = "a\n";
              print( "y2:\t" + y2 );           // Printed output:  a

              String y3 = "a\nbcdef";
              print( "y3:\t" + y3 );           // Printed output:  a
                                               //                  bcdef

              String y4 = "a\nwxyz";
              print( "y4:\t" + y4 );           // Printed output:  a
                                               //                  wxyz

              String y5 = "a\u0abcdef";
              print( "y5:\t" + y5 );           // Printed output:  a?def

              String y6 = "a\u00ef";
              print( "y6:\t" + y6 );           // Correct, but the character
                                               // following 'a' may not have
                                               // a print representation

              String w1 = "a\142";
              print( "w1:\t" + w1 );           // Printed output:  ab

              String w2 = "a\142c";
              print( "w2:\t" + w2 );           // Printed output:  abc

              String w3 = "a\142142";
              print( "w3:\t" + w3 );           // Printed output:  ab142

              String w4 = "a\79";
              print( "w4:\t" + w4 );           // Printed output:  a9
          }

          static void print( String str ) { System.out.println( str ); }
      }

^[3]Unsigned chars of C and C++ are useful for image processing work. Most color cameras produce 8-bit values in each of the color channels, R, G, and B. You'd want to read these values into an unsigned char. If you read them into a signed char, unless care is taken the high values could get interpreted as negative numbers during downstream processing.

^[4]The automatic type conversion involved here from char to int for y is known as binary numeric promotion in Java; in C++ the corresponding conversion is usually called integral promotion and is applied as part of the usual arithmetic conversions. See the last paragraphs of Sections 6.7.1 and 6.7.2 for when such conversions can be carried out automatically in C++ and in Java, respectively.

^[5]A second common use of backslash inside either a double-quoted string or between a pair of single quotes is that it tells the system to alter the usual meaning of the next character. For example, if you wanted to set the value of a character variable to a single quote that is ordinarily used as a character delimiter, you would not be able to say

      char x = ''';      // ERROR

Instead, you could use a backslash in the following manner

      char x = '\'';

to suppress the character-delimiter meaning of the single quote that follows the backslash. Another illustration of this would be if you wanted to initialize a character variable to the backslash itself:

      char x = '\\';

where the first backslash alters the usual meaning of the backslash that follows.

^[6]Java's lexical grammar has the notion of a LineTerminator, which is not considered to be one of the InputCharacters from which Tokens are formed. When the Unicode escape \u000a is encountered during the initial scan of a source file, it is replaced by a LineTerminator [23].