11.2. Character and String Data

Processing of character data in computers operates on characters represented by code numbers. This is often expressed by saying that characters are treated as small integers, though especially when using Unicode, they need not be that small. A string is usually represented as a sequence of characters in consecutive storage locations. Beyond that, the representation and handling of characters vary greatly by programming language and by software module.

11.2.1. Constructs and Principles of Processing Characters

For the processing of character data, programming language design needs to solve several problems, and the solutions greatly affect the suitability of the language to string-oriented tasks. You are probably not designing a new programming language, but you may need to select between some existing languages for a project, or to learn or teach a language. In the latter area, the phenomenon that psychologists call negative transfer is often problematic: when you have learned one way of doing things in a language (say, the difference between single and double quotes around a literal), you will implicitly assume that another language uses the same way. Even after you have learned the difference, you keep forgetting it. Therefore, it is useful to make some explicit comparisons.

The key features in the processing of character data in a programming language are:

  • Repertoire: which characters can appear in data as processed inside a program?

  • Typing: is there a particular data type for a character, or a string, and what are its basic properties?

  • Characters versus strings: do you treat a character as a special case of a string (a string of length 1), or do you treat a string as a data structure (e.g., array) with characters as its components, or are they two distinct concepts?

  • Internal implementation of a character: is it typically (and perhaps by language specification) one octet, two octets, or something different?

  • Internal implementation of a string, especially information on its length (e.g., separate character count, or a terminator character, or fixed length).

  • Storage allocation for strings: do you need to specify the length, or the maximum length, when declaring a string variable, or do strings automatically expand?

  • Literals: how do you write a constant that denotes a single character, or a given string, perhaps an empty one?

  • Operators and standard functions, such as extracting the nth character of a string, concatenating two strings, or performing a replacement operation on a string.

Modern programming languages normally have a character data type or a string data type, or both, but their relationship to each other may vary. In some languages, the character type is one of the basic types, and strings are represented as arrays of characters, often with some special features that other arrays do not have. In other languages, which might be called string-oriented, the string type is one of the basic data types (or, at the extreme, the only data type). Variables whose values are individual characters might be treated just as special cases: strings of length 1.

11.2.2. The FORTRAN Model: Hollerith Data

FORTRAN was developed in the 1950s for engineering and scientific tasks. Originally, it had just two data types: integer (whole numbers) and real (floating-point numbers). Character data meant just string constants added to output as headings to make it more legible. The ways of handling character data in old FORTRAN are mostly a historical curiosity, but they are briefly described here for comparison.

Originally, FORTRAN allowed you to add explanatory text to output by using Hollerith constants in FORMAT statements, which specified the way in which numeric data was formatted when executing PRINT statements. A Hollerith constant like 5Hhello was taken as indicating a string of five characters following the letter H. That way, it was easy for a compiler to know where the string ends. To a programmer, it was not that convenient: he needed to count the characters, and count them right.

Later, a convenience was added: a quoted string like 'Hello', leaving it to the compiler to recognize the end of the string from the closing quote (actually, the ASCII apostrophe).

However, the programmer still needed to count characters if he wanted to store character (string) data in a variable. The reason is that such data was stored in a numeric variable, since there were no other kinds of variables. An integer or real variable was able to contain a fixed number of characters, but the number depended on the machine architecture. For example, if an integer value consisted of 36 bits and a 7-bit character code was used, an integer variable was able to contain five characters. Therefore, you could write an assignment like MSG = 5Hhello or MSG = 'Hello'. However, the program was not portable to, say, a computer where an integer value is 32 bits and an 8-bit character code is used, allowing an integer variable to hold just four characters. There you would, for example, declare MSG as an array and write MSG(1) = 'Hell' and MSG(2) = 'o'.

There wasn't much you could do with character data at that time. It was possible to read, store, copy, and print it, as well as compare for equality. Text processing would have been awkward, since extracting a single character from a string required extra tricks, like shifts and masks. When porting a program from one computer to another, it was often necessary to recode all processing of character data.

Later, a data type called CHARACTER was added to FORTRAN. Despite the name, it is really a string type. When declaring a string variable, you specify the length of the strings it can contain. For example, CHARACTER*20 NAME declares NAME as a string of length 20. Its values will effectively be padded with spaces on the right if you assign a shorter value to it. A substring construct was also added; for example, NAME(2:6) is the substring of NAME from the 2nd character to the 6th.

11.2.3. The C Model

C was designed in an environment where an 8-bit byte was the basic unit of storage and any character was assumed to fit into such a byte. More or less implicitly, the character code was assumed to be ASCII, or very similar to it. Later, C has also been used to process text in genuinely 8-bit encodings. The standard C library locale may be used to find out or to set the specific encoding used, as described in the section "Using Locales" later in this chapter.

Although C++ is very different from C in many ways, it is based on C, and in character processing, C++ has copied its constructs from C. However, the I/O system is different.

11.2.3.1. The character data type

The C language has a data type called char, but in typical implementations, it really corresponds to the concept of an 8-bit byte. It has been used to store a character, among other things, but that is just one of its uses. C functions often operate on sequences of bytes, with no regard to their content.

In effect, the char type in C is the shortest of the integer types. Among other things, it can be used to store integers that represent characters by their code numbers. There is no type checking involved here: you can declare and assign char ch = 0; and this initializes the variable to NUL, i.e., to the character with code number 0.

However, the definition of C does not fix the range of values of char. It might be 0 to 127, corresponding to ASCII, or 0 to 255. It might even be 0..65,535, corresponding to UTF-16 code units, so that a value of type char occupies two octets. Thus, you might be able to use Unicode simply with the help of the character data type in C, but such software is not portable from one computer system to another.

11.2.3.2. Strings as arrays

In C, a string is a sequence of values of type char in consecutive storage locations. You can declare a variable that is an array of characters (e.g., char message[20]) and store characters to it using indexed variables (e.g., message[0]), as with other arrays. The index of the first component of an array is zero in C, as in many other languages.

Using the basic operators of C, text processing is awkward, since you basically need to work with individual characters by their indexes. However, there is a standard C library, string, that contains a collection of useful functions for working with arrays of characters. Operations on strings are still somewhat primitive, since you need to keep track of the lengths of strings. An array has a fixed size in C, though you can create the equivalent of an array dynamically, so that its length is computed during execution rather than fixed when writing the program. The assignment p = malloc(n) creates a memory block sufficient for n characters and assigns its address to p, which must be a pointer variable. Then you can use an indexed variable like p[i] just as if you had declared p as an array.
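To make the bookkeeping concrete, here is a minimal sketch (an illustration, not part of the original example set) that concatenates two strings into dynamically allocated storage, computing the required size by hand:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    const char *first = "Hello, ";
    const char *second = "world";
    /* The programmer must compute the size: both lengths plus one
       octet for the terminating NUL (discussed below). */
    char *p = malloc(strlen(first) + strlen(second) + 1);
    if (p == NULL)
        return 1;
    strcpy(p, first);    /* copy, including the terminating NUL */
    strcat(p, second);   /* append after the copied string */
    printf("%s (%zu characters)\n", p, strlen(p));
    free(p);
    return 0;
}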

Many descriptions say that in C, strings are "NUL terminated," where NUL means a character with code number zero (U+0000 in Unicode). This is true in the following sense: the functions in the string library, as well as functions for string processing in C programs in general, expect the input strings to be NUL terminated, i.e., to end with NUL, which is not regarded as part of the string. Moreover, when the functions generate strings, they make them NUL terminated. String constants are implemented as NUL terminated strings; thus, the constant "foo" denotes a string of length 3, but its internal representation occupies four octets. As a C programmer, you normally follow suit by treating NUL as a string terminator and by making sure that every string you generate is NUL terminated. However, when you read characters from a file, it is better to accept possible NUL characters and perhaps just skip them on reading. Text files created by programs other than C programs may contain NUL, and it is possible to output NUL in C, too.
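The difference between the length of a string and the size of its internal representation can be observed directly; a small sketch:

#include <stdio.h>
#include <string.h>

int main(void) {
    printf("%zu\n", strlen("foo"));  /* 3: the NUL is not counted */
    printf("%zu\n", sizeof "foo");   /* 4: the representation includes NUL */
    return 0;
}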

11.2.3.3. 8-bit characters and sign extension

Since ASCII was usually implied, it did not matter whether values of the type were treated as unsigned or signed, because the character values always had zero as the first bit (sign bit). This created problems when C was used with a genuinely 8-bit character code, such as ISO-8859-1. Suppose that you declare and assign char ch = 'ä' and then use the character variable in an assignment with an integer on the left side: int i = ch. Since the value has the first bit set (the code of ä is E4 in hexadecimal, 11100100 in binary), the value is sign-extended in the assignment.

Technically, an integer normally occupies two or four octets, and the value is copied to the lowest-order octet, whereas the sign bit is copied to all bit positions in the other octets. In practice, this makes the value a negative number, corresponding to the interpretation of the octet E4 as a signed integer. In the commonly used two's complement method for implementing negative integers, this results in the value -28 (decimal).

Later, a distinction was made between unsigned char and the old char type, which may or may not be signed. By declaring a variable unsigned char, you guarantee that no sign extension is performed when the variable's value is treated as an integer due to type conversions. Compilers have compile-time switches that can be used to make the char type unsigned, but for portability, it is safer to use the explicit type name unsigned char.
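The following small test program illustrates the difference on a typical system where plain char is signed; on a system where char is unsigned, both values printed would be 228:

#include <stdio.h>

int main(void) {
    char ch = '\xE4';           /* ä in ISO-8859-1 */
    unsigned char uch = '\xE4';
    int i = ch;                 /* sign-extended if char is signed */
    int u = uch;                /* never sign-extended */
    printf("%d %d\n", i, u);    /* typically prints -28 228 */
    return 0;
}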

Even if you declare your variables and functions as unsigned char, non-ASCII character constants may cause problems. In C, a character constant like 'ä' is in fact of type int (the default integer type), and, for example, a comparison like ch == 'ä' may fail to work properly. The right side could be a negative value when interpreted as an integer and a very large number when interpreted as an unsigned integer. Compile-time switches (like -funsigned-char in the gcc compiler) might be available for forcing character constants to correct positive values. A more portable alternative is to avoid literal character constants in statements, using macro definitions instead. Example:

#define AE ((unsigned char) 'ä')
int ch = getchar();
if(ch == AE) { /* 'ä' was received */
}

The example is somewhat confusing, since the variable is declared as int, which means a signed integer type. The reason for this is that the function getchar may return an end of file indicator, which is a negative number. The comparison works, however, since now the right side, being unsigned, is not sign-extended.

11.2.3.4. The EOF indicator

Since character data was expected to fit into 7 bits, values with the first bit set were used for various purposes, such as error indicators. In particular, the standard C definitions define the end of file indicator, EOF, as a macro (named constant) that expands to (-1), i.e., minus one. For example, a function for reading a character normally returns the character read, but returns EOF when there is no data left.

Therefore, functions like getchar for reading a single character are declared as being of type int and not char. Normally, the return value of such a function should first be tested against the end of file indicator (ch == EOF), typically exiting from a loop when there is no more data. After that, the value can be assumed to be the code number of a character in the character code being used. If we work with 8-bit characters only, we can next assign the value to a variable of type unsigned char, for clarity and to protect against undesired type conversions.
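A minimal sketch of this recommended pattern, copying standard input to standard output:

#include <stdio.h>

int main(void) {
    int ch;   /* int, not char: the variable must be able to hold EOF */
    while ((ch = getchar()) != EOF) {
        unsigned char c = (unsigned char) ch;  /* now safely a character */
        putchar(c);
    }
    return 0;
}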

11.2.3.5. The zero byte (NUL byte) convention

One of the special features of C is that a zero byte, i.e., NUL when interpreted as an ASCII character, is used as a string terminator. Standard C functions that operate on strings effectively operate on arrays (sequences) of characters terminated by NUL. Thus, if you construct a string in C code, you need to write a NUL (conventionally written as '\0' in C, though it really means the same as plain 0) at its end. Using NUL was technically efficient on old byte-oriented computers, since at the machine instruction level, testing a byte against zero was faster than a general test for equality with a given value.

The special role of NUL in C causes problems, for example, when you have UTF-16 encoded data. If you take, say, ASCII or ISO-8859-1 text and encode it in UTF-16, every second octet will be zero. Thus, although C string functions might otherwise be used to process strings with no regard to their internal structure and encoding, this fails for UTF-16, and for many other encodings.

A zero octet is risky in any data that might be processed with a C program.
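To see the effect concretely, the following sketch applies strlen to a buffer containing the UTF-16LE encoding of "AB"; the function stops at the first zero octet:

#include <stdio.h>
#include <string.h>

int main(void) {
    /* "AB" in UTF-16LE: 41 00 42 00, followed here by two zero octets. */
    char data[] = { 0x41, 0x00, 0x42, 0x00, 0x00, 0x00 };
    printf("%zu\n", strlen(data));   /* prints 1, not 2 */
    return 0;
}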


11.2.3.6. The null pointer

Thus, C has no genuine character data type but uses char as a mixed type for characters as well as for small integers and other octets. Moreover, C uses the integer 0, either as such or as explicitly cast to a pointer type, as a null pointer. The null pointer is a special pointer value indicating "not a pointer to anything." Pointer values correspond to addresses of storage units, and they are at least two octets long, often longer. Their implementation depends on the addressing architecture of the computer. In a simple implementation, pointers could be simply numbers of storage locations, with the address 0 unused so that it can be used for the null pointer. However, implementations vary, and the null pointer need not be internally represented the same way as the integer 0.

There is also a predefined name (macro) for denoting the null pointer: NULL, which expands to 0. It is often recommended for use instead of the literal 0, to indicate that a pointer is involved and not an integer. The C compiler is supposed to treat the integer 0 in a pointer context as a null pointer, no matter how the value 0 has been written in the source code and no matter what the internal representation of pointers is.

An implementation of C may also define NULL as (void *) 0. This means the value zero converted to the generic pointer type void *, which is compatible with any pointer type.


11.2.3.7. Confusion around NUL, NULL, and relatives

The main reason for discussing the null pointer in this book is its name and its predefined symbol, which are often confused with NUL, the character with code number 0 (U+0000 in Unicode). The expression NUL is not part of the C language but just a name for a control character. If desired, you could define NUL as a name in C (using, for example, the directive #define NUL '\0'), but it might easily be misread as NULL.

Such things create many possibilities for confusion, as illustrated in Table 11-4. Similar problems exist in other languages as well, though usually to a lesser degree. In the table, the assumed character code is ASCII or some extension of ASCII. The integer zero is implemented as zero octets, typically as two or four of them. The internal format of a floating-point zero is system-dependent in principle, but in practice, it is usually four zero octets. The internal representation of the null pointer is not shown, since it varies by machine architecture.

Table 11-4. Ways of being "nul" in C

Octet(s) in binary   Notation in C source   Meaning
00000000 ...         0                      The integer zero
00000000 ...         0.                     The floating-point number zero
(Null pointer)       (void *) 0             The null pointer
(Null pointer)       NULL                   Macro for the null pointer
11111111             EOF                    End of file indicator, same as (-1)
00000000             '\0'                   The NUL character, U+0000
00100000             ' '                    The space character, U+0020
00110000             '0'                    The digit zero, U+0030
00000000             ""                     An empty string
00100000 00000000    " "                    A string consisting of a space


Further confusion is caused by the fact that both (void *) 0 and '\0' can be written simply as 0. In a pointer context, as in a comparison p==0 or an assignment p = 0, with p declared as a pointer, the integer zero is automatically converted to the null pointer. In a character context, as in comparing a variable against NUL, say ch=='\0', there is a different type conversion. The char type is internally treated as an integer type, and a character constant like '\0' is technically an integer constant written in a special way. The habit of writing ch=='\0' instead of ch==0 is meant to emphasize that we are dealing with character data and with the NUL character, not, for example, the digit zero.
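The following sketch (an illustration, not from the original text) puts the pointer context and the character context side by side:

#include <stdio.h>

int main(void) {
    char *p = NULL;    /* null pointer: no string at all */
    char s[] = "";     /* empty string: one octet, the NUL terminator */
    if (p == 0)        /* here 0 is converted to the null pointer */
        printf("p points to nothing\n");
    if (s[0] == '\0')  /* same test as s[0] == 0, comparing characters */
        printf("s is an empty string\n");
    return 0;
}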

11.2.3.8. C and Unicode

You might consider using C to process Unicode data in UTF-8 format, where each character consists of one to four octets, or in UTF-16 format, where each BMP character is represented as exactly two octets. We will discuss both approaches below.

It is, however, important not to reinvent the wheel if you decide to use either of these approaches. There is a lot of existing reusable code, as C function libraries or as C++ class libraries, for operating on UTF-8 or UTF-16 in C. Thus, unless you have a very simple task or a course programming assignment, start by looking at existing software, such as the libutf-8 code available from several sites and the ICU code for UTF-16 at http://icu.sourceforge.net/.

11.2.4. Unicode with 8-bit Quantities?

Can you process text in Unicode if the data type for a character is an 8-bit byte, as in classical C? The answer is yes, but it requires that you distinguish between "string" as a sequence of Unicode characters and "string" as a programming language concept, such as a NUL-terminated array of char. You would not store a Unicode character in a single variable of type char but in an array or other collection of such variables, e.g., one to four char values when using UTF-8.

This means, using the terminology defined in "Unicode and UTF-8" in Chapter 3, that all processing of characters actually takes place at the level of the Character Encoding Scheme. There, the representation of a character is serialized into a sequence of octets. In order to perform even a simple operation, say, scanning through a string to check whether it contains a particular character, you need to interpret the sequence of octets according to the encoding scheme (unserialize it to code numbers).

If you only read Unicode data and copy it as such, preserving the encoding, you can treat the data as if it were binary data, uninterpreted octets. Such situations are rare. Consider, however, the example of analyzing a logfile that is known to be in some particular Unicode encoding: we might be interested in summing things up, without any internal processing of character data.

This approach is used in the following rather naïve program, which expects its input to be UTF-16 encoded, more specifically in little-endian form (UTF-16LE). The program simply reads the code units and checks whether the more significant octet is zero. If not, it prints the code unit in hexadecimal and in decimal. Such processing might be useful if some data is expected to be UTF-16 encoded but to contain mostly just Basic Latin and Latin-1 Supplement characters (i.e., characters U+0000..U+00FF), and you wish to list any other code points that appear:

#include <stdio.h>
int main() {
  unsigned int first, second;
  unsigned long code;
  while( (first = getchar()) != EOF) {
    if( (second = getchar()) == EOF) {
      fprintf(stderr, "\nError at end of data, first octet: %2X\n",
              first);
      return 1;
    }
    if(second != 0) {
      /* In UTF-16LE, the second octet of a code unit is the more
         significant one. */
      code = second * 0x100 + first;
      printf("%4lX %6lu\n", code, code);
    }
  }
  return 0;
}

Similarly, you could process UTF-8 encoded data using the char type, pointers to char, and arrays of char, as long as you keep track of the situation. Although the string functions of C treat a zero octet (NUL) as a string terminator, this isn't a problem with UTF-8, since UTF-8 uses a zero octet only to encode U+0000. Processing UTF-16 encoded data in a similar way would generally fail, of course.
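For example, counting the characters in a UTF-8 string only requires recognizing which octets start the representation of a character. The following sketch, with the hypothetical helper name utf8_length, assumes its input is valid, NUL-terminated UTF-8:

#include <stdio.h>

/* Count code points by skipping continuation octets (10xxxxxx). */
size_t utf8_length(const char *s) {
    size_t count = 0;
    const unsigned char *p = (const unsigned char *) s;
    for ( ; *p != 0; p++) {
        if ((*p & 0xC0) != 0x80)  /* not a continuation octet */
            count++;
    }
    return count;
}

int main(void) {
    printf("%zu\n", utf8_length("a\xC3\xA4"));  /* 2: a and ä (C3 A4) */
    return 0;
}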

A value that represents a Unicode code number should be defined as unsigned long (or, more verbosely, unsigned long int) to avoid any surprises. This type is guaranteed to be at least 32 bits. Then you can perform conversions between different encoding forms at input and output only, performing all operations on the characters (code numbers) themselves directly.

You might encounter existing code that uses other integer types, such as the basic integer data type int, for processing Unicode numbers. Implementations of C in most modern computers have int implemented as a 32-bit integer or larger. However, the C standard allows the implementation of int as a 16-bit integer.

Using a built-in integer type name such as unsigned long directly is, in principle, a clumsy and unnecessarily system-dependent approach. It also makes source code somewhat harder to understand. A more systematic method can be used: you can define a macro like the following:

#define UINT32 unsigned long

You would then systematically use UINT32 when declaring variables and functions with Unicode character values (e.g., UINT32 ch;).
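For example, a function combining a UTF-16 surrogate pair into a code number could be declared with this type. The function name and the omission of range checks are illustrative simplifications:

#define UINT32 unsigned long

/* Combine a UTF-16 surrogate pair into a Unicode code number,
   assuming hi is in the range D800..DBFF and lo in DC00..DFFF. */
UINT32 from_surrogates(unsigned int hi, unsigned int lo) {
    return 0x10000UL + ((UINT32)(hi - 0xD800) << 10) + (lo - 0xDC00);
}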

11.2.5. Wide Characters

In modern versions of C, as well as in C++, you can use "wide characters," which correspond to a character type specified by the current locale. Wideness refers to the storage needed for such a character, not the visual appearance. The storage need not be larger than for normal characters, and it often isn't.

Wide characters need not correspond to Unicode characters. However, they may correspond to Unicode characters. Their meaning depends on the underlying system and possibly locale settings. On modern Windows systems, the internal representation is UTF-16, and wide characters are usually implemented as 16-bit quantities. On Unix and Linux systems, the default locale often uses some 8-bit character code, but this can usually be changed to a Unicode encoding. The repertoire of available locales depends on the implementation. Thus, if your program needs to be portable to different computers, you cannot rely on wide characters.

It is often possible to process Unicode data using wide characters, but not portably.


The type for wide characters is wchar_t, which corresponds to some machine-level "type" (storage unit size) in an implementation-dependent manner. To work on wide character strings, you use standard functions with names beginning with wcs instead of the str prefix in traditional C string functions. For example, you get the length of a traditional string by calling the strlen standard function, and similarly, you use the wcslen function for wide character strings. To create a wide character string constant, you use the normal C string constant syntax but prefix it with the letter "L"; for example, L"Hello". As you may guess, "L" stands for "long," again referring to the storage requirements. It means that the string consists of wide characters.
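A minimal sketch of the parallelism between the str and wcs functions:

#include <stdio.h>
#include <wchar.h>

int main(void) {
    wchar_t greeting[] = L"Hello";   /* a wide character string */
    wchar_t copy[6];
    wcscpy(copy, greeting);          /* counterpart of strcpy */
    printf("%zu\n", wcslen(copy));   /* prints 5, like strlen("Hello") */
    return 0;
}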

The standard functions mentioned above, and other features related to wide characters, are included in the wchar and wctype libraries that were added to the C language standard in 1995. Consult suitable textbooks and references for the definitions. The following example illustrates the use of wide characters for a simple problem: reading a UTF-8 encoded file to check for characters beyond the range U+0000..U+00FF. This is similar to the previous example, except that here, UTF-8 encoding is assumed and the techniques are different. In this approach, different encodings can be used simply by changing the argument of setlocale, provided the encoding is supported by the environment:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main() {
  wint_t ch;   /* wint_t rather than wchar_t: must be able to hold WEOF */
  if(!setlocale(LC_CTYPE, "en_US.UTF-8")) {
    fprintf(stderr, "Cannot work in UTF-8 mode!\n");
    return 1;
  }
  while( (ch = fgetwc(stdin)) != WEOF) {
    if(ch > 0xFF)
      fwprintf(stdout, L"%4x %lc\n", ch, ch);
  }
  return 0;
}

In order to work with Unicode characters in a reasonably portable way, you could use a type name like UNICHAR and define it with a macro or with a type definition such as the following on a system where wide characters are Unicode characters:

typedef wchar_t UNICHAR;

You would then consistently use the type name so defined for all character variables and functions. When porting the program to a different system, you would replace wchar_t in this definition with, for example, unsigned long, selecting a type that can contain a Unicode code number. Although this approach, suggested in the Unicode standard, makes software more portable, it has substantial limitations. Many constructs in a program, including the standard functions you use, depend on the specific data type. For portability, you would need to modularize the program so that (ideally) only one module depends on the specific definition of the type used for Unicode characters.

11.2.6. Win32 APIs

An Application Programming Interface (API) is a coordinated set of definitions of how computer programs or parts of programs communicate with each other. Usually this involves software at two different levels, such as application programs and system programs. More concretely, an API is a collection of functions and other building blocks that a programmer can use according to their external descriptions, without knowing their internal implementation. Here we discuss the APIs of relatively modern Windows systems, such as Windows NT, Windows 2000, and Windows XP, collectively called Win32 APIs. Such APIs are usually described in terms of their manifestation in C or C++.

Win32 APIs support a 16-bit character type, called WCHAR, which ultimately corresponds to a UTF-16 code unit. Internally, Win32 works with this representation of characters and performs code conversions between it and the codes used in application programs. As we have seen, UTF-16 code units directly correspond to Unicode characters only in the BMP, but that is sufficient for most practical purposes. Technically, WCHAR is defined as a macro that expands to unsigned short, i.e., the 16-bit unsigned integer type, corresponding to the wchar_t type of standard C.

Using the Win32 API, you can write programs so that they can be compiled to work with some 8-bit encoding (the system's "code page") or with wide characters. In C or C++ programming, you can define the constant (macro) _UNICODE to be 1 (true) or 0 (false), depending on whether you want wide characters or not. You would then declare your character variables as being of type TCHAR, which expands to wchar_t or (8-bit) char, depending on the setting of _UNICODE. Similarly, you would declare a pointer to a character (or to a string) as being of type LPTSTR. It expands to wchar_t * or char *, again depending on _UNICODE. You can also use the name LPWSTR, which unambiguously means a pointer to a string of wide characters, i.e., wchar_t *. Win32 APIs that operate on text (strings) exist in two versions:


"A" versions (code page versions)

These versions operate on 8-bit characters, according to the code page currently in use, such as windows-1252 (Windows Latin 1). The letter "A" reminds us of the misnomer "ANSI."


"W" versions (Unicode versions)

These versions operate on wide (i.e., 16-bit) characters, or UTF-16 code units, to be exact.

For ease of programming, you can use generic names that will be resolved to "A" or "W" versions during compilation, depending on the setting of _UNICODE. For example, if you call a function with the name SetWindowText, it will be resolved to the name SetWindowTextW when _UNICODE is set and to SetWindowTextA otherwise.
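The following sketch shows the mechanism; it assumes a Win32 build environment and a window handle obtained elsewhere, and it defines the related macro UNICODE together with _UNICODE, since the declarations in windows.h are keyed to the former:

/* With these definitions, TCHAR resolves to wchar_t, _T("...") to a
   wide string literal, and SetWindowText to SetWindowTextW; without
   them, to char, a narrow literal, and SetWindowTextA. */
#define UNICODE
#define _UNICODE
#include <windows.h>
#include <tchar.h>

void setTitle(HWND window)   /* window: a handle obtained elsewhere */
{
    TCHAR title[] = _T("Hello");
    SetWindowText(window, title);   /* resolves to SetWindowTextW here */
}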

11.2.7. Multibyte Character Sets (MBCS) Versus Unicode

For comparison, we will briefly describe the use of sequences of octets to represent characters in a manner that differs from Unicode, namely multibyte character sets (MBCS), which are in practice usually double-byte character sets (DBCS). You may encounter such techniques in existing software, especially on Windows, where they have been a serious competitor of Unicode techniques. They have been used especially for Chinese and Japanese text.

In DBCS, each character is represented as 1 or 2 bytes (octets). Some bytes, called lead bytes, have been reserved for use as the first byte of a double-byte representation and are to be interpreted together with the next byte. Other bytes represent characters as such. The technique implies some underlying character code, often called a "code page" in this context. The set of lead bytes depends on the code page, but it could be, for example, the range 81 to 9F hexadecimal (which corresponds to a subset of C1 Controls).

In C programming using a library that supports multibyte characters, function names starting with _mbs are used to handle multibyte character strings, corresponding to standard C string functions with names that begin with str. Thus, for example, _mbslen returns the length of a multibyte character string, as a counterpart to strlen for normal C strings (char strings) and wcslen for wide character strings.

Thus, multibyte characters are not the same as wide characters. Conversions between them are possible, of course, and libraries that support multibyte characters typically contain routines for conversions, such as mbtowc (multibyte to wide character).

It may be desirable in program development to create software that can be set, at compile time, to use 8-bit characters, multibyte characters, or wide character implementation of Unicode. For this purpose, macros that begin with _tcs can be used. They will be resolved at compile time, according to the values of the macros _UNICODE and _MBCS. For example, the name _tcslen resolves to wcslen when _UNICODE is set, to _mbslen when _MBCS is set, and to strlen when neither is set. (Setting both of them makes no sense and causes unpredictable results.)
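A minimal sketch of such compile-time adaptable code, using the _TCHAR type name from the same header:

#include <tchar.h>
#include <stddef.h>

/* _tcslen resolves at compile time: to wcslen if _UNICODE is set,
   to _mbslen if _MBCS is set, and to strlen otherwise. */
size_t text_length(const _TCHAR *s)
{
    return _tcslen(s);
}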

11.2.8. The Perl Model

The Perl language was primarily designed for processing large amounts of text, though typically text in some fixed format, as one of the expansions of the name suggests: Practical Extraction and Report Language. Yet, Unicode support has been added to Perl only gradually and rather slowly. This book assumes that you are using Perl 5.8 or newer.

We will discuss some basic practical points in using Unicode in Perl. For more information, please refer to the relevant manpages in your Perl environment, in particular, perluniintro and perlunicode. These manpages are also available on the Web, at http://perldoc.perl.org/perluniintro.html and http://perldoc.perl.org/perlunicode.html.

Perl uses UTF-8 encoding (or, in some implementations, UTF-EBCDIC) internally. However, if your Perl source is UTF-8 encoded, you should use the pragma use utf8 for compatibility reasons. Handling the encoding of input data is a completely different matter and will be discussed in the section "Character Input and Output" later in this chapter.

11.2.8.1. Strings and characters in Perl

In Perl, a scalar variable may have either a string or a number as its value, and Perl usually converts automatically between the types as needed. There is no separate character type: to handle an individual character, you use a string of length one.

Perl has powerful tools for working with strings. Dealing with individual characters in a string is somewhat clumsier. To extract the fourth character from the value of $foo, you would use the expression substr($foo,3,1). This means using a substring extraction function, where the second argument is the starting position, counting positions from 0, and the third argument is the length of the substring.

To get the Unicode code number of a character (i.e., of a single-character string), use the ord function, e.g., ord('é'). The inverse function is chr. For example, chr(9786) or, equivalently, chr(0x263A), using the Perl notation for integers in hexadecimal notation, means the character U+263A, i.e., ☺.

There is a pitfall for values smaller than 256 decimal. For them, the chr function returns an 8-bit character, in an encoding that might differ from ISO-8859-1. To avoid the potential problems, use the pack function instead of chr for such values: pack("U", n) gives the character with Unicode code number n. For example, chr(0xE4) usually means ä (U+00E4), but it could mean something different; pack("U",0xE4) certainly yields ä.

11.2.8.2. The catenation operator "."

Many programming languages use the plus sign, +, both for addition of numbers and catenation of strings. There is a risk of confusion here, since adding up the numbers 2 and 5 to get 7 is completely different from catenating the strings "2" and "5" to get "25". Languages often deal with this issue by using the types of variables and expressions to determine how + should be interpreted. Perl is not a typed language (in a manner that would be useful here), so it uses + for the arithmetic operation and another symbol, the period ".", for string catenation. It is best to leave spaces around the period for readability.

In output statements, you can often use the comma to separate elements, since a function accepts a list of arguments, as in print $foo, $bar;. You could alternatively use the catenation operator: print $foo . $bar;. In that case, there is only one argument, consisting of an expression. Such an approach is necessary when calling a normal function, e.g., somefun($foo . $bar).

11.2.8.3. In Perl, double quotes mean evaluation

The use of double quotation marks versus single quotation marks makes a difference, but a completely different one than in C. In Perl, 'foo' and "foo" mean the same thing, namely a particular constant string of length 3, so in such simple cases, the difference between the quotes is a matter of style, or coding guidelines. When constructs that could be Perl variables are involved, there is an essential difference:

  • '$foo' denotes literally a four-character string that begins with a dollar sign.

  • "$foo" denotes a string that consists of the value of the scalar variable $foo at the moment of evaluating this quoted construct.

Thus, single quotes are suitable for normal string constants that need not and should not be processed in any way as expressions. Using double quotes, you can create an expression that will be evaluated by inserting values of variables into it; e.g.:

 print "The product of $a and $b is $c.\n";

The principle that no evaluation takes place between single quotes extends even to "character escapes" like \n (as listed in Table 11-1, earlier in this chapter). They are interpreted inside double quotes, but not inside single quotes. Thus, in the example, \n is interpreted as a line break, but if single quotes were used, even \n would be printed literally.

11.2.8.4. Notations for Unicode characters

In strings enclosed in double quotation marks, you can use the notation \x{number} to refer to a character by its Unicode number in hexadecimal, e.g., \x{2300}. The braces are needed; without them, the reference has a different meaning. For arguments smaller than 256 decimal (100 hexadecimal), the results are based on an 8-bit encoding (as in the case of the chr function); thus, use pack instead.

You can also refer to characters by their Unicode names using a notation of the form \N{name}, if you first use the pragma use charnames ':full';. Then you can write, for example, \N{DIAMETER SIGN} inside a string constant.

11.2.8.5. Using properties of characters

Collections of characters can be referred to by Unicode properties. For example, in a regular expression used for matching, \p{Lu} denotes any character with a General Category value of Lu (Letter, uppercase). You can also use script names in a similar manner. Block names can be used when prefixed by In, e.g., \p{InNumberForms}. The following simple example shows how to replace all Cyrillic characters with question marks: s/\p{Cyrillic}/?/g.

11.2.9. ECMAScript (JavaScript)

The JavaScript language, developed by Netscape for use in client-side scripting on web pages, has been rather widely implemented in web browsers, though with version differences. Different names, such as JScript, are used for trademark reasons.

11.2.9.1. String oriented

JavaScript is string-oriented, to the extent that it lacks a character type among its basic scalar types. Even numbers are commonly handled as strings. This often causes trouble for beginners, especially since the + operator is overloaded: it means both numeric addition and string catenation, depending on context. If the variable foo contains data obtained, for example, from a form field, and the user has typed 42 as the data, then the assignment foo = foo + 1 yields not 43 but 421. One way to deal with this is to subtract zero from the value to force it into numeric type: foo = (foo - 0) + 1.

JavaScript has an object concept, and it lets you declare string objects, which have many useful properties. For any advanced string processing, you will find string objects appropriate. The following simple code illustrates some basics. It is a form in an HTML document with one text input field and one button, which invokes a JavaScript function when clicked on. The function takes the input field content, converts it to uppercase, and makes it the new content of the field. Here the field is prefilled with the string "éω" ("e" with acute, small omega), and clicking on the button turns it to "ÉΩ." The function (method) toUpperCase is part of the JavaScript language and defined to work by Unicode rules. It should even perform full case mapping, but in practice, it may perform just simple case mapping.

<script type="text/javascript">
function upper(field) {
  var s = new String(field.value);
  field.value = s.toUpperCase();
}
</script>
<form action="...">
<input id="fld" name="foo" value="é&omega;">
<input type="button" value="Upper"
       onclick="upper(document.getElementById('fld'))">
</form>

11.2.9.2. The ECMAScript standard

The standardized form of JavaScript is called ECMAScript, and it is defined by Ecma International (as the standard ECMA-262). The standard is available via http://www.ecma-international.org/. Note, however, that the standard mainly specifies the general features of ECMAScript as a programming language, as opposed to specific constructs defined for use on the Web. Those constructs relate to the Document Object Model (DOM), which specifies the mapping between HTML or XML elements and attributes and expressions in scripting languages.

11.2.9.3. UTF-16 implied

Since Version 1.3, JavaScript uses Unicode for string data. This has been standardized in ECMAScript. More exactly, string data means a "Unicode string," i.e., a sequence of code units in UTF-16 format. The routines for string processing assume that their input is in Normalization Form C.

Although JavaScript uses UTF-16 (or, in practice, UCS-2), we can safely use UTF-8 on a web page that contains JavaScript code. The web browser is supposed to perform the transcoding internally.

Originally, the basic constructs in JavaScript, including variable names, used ASCII characters only. Other characters were permitted only in strings and comments. Later, the syntax was extended to allow Unicode identifiers, with some added features that allow even more than the default Unicode rules. However, programming practice has largely used ASCII only in identifiers.

11.2.9.4. The \u escape notation

Like many other languages, JavaScript lets you write characters in string constants (in a source program) using a notation that consists of \u immediately followed by a character's Unicode number in hexadecimal. The following trivial program illustrates this. It has been written inside a script element so that it can be immediately embedded into an HTML document.

<script type="text/javascript">
var message = "I \u2665 Unicode! \u263A";
alert(message);
document.write(message);
</script>

If you view an HTML document containing such a script element, you should see the text "I ♥ Unicode! ☺" appear in your HTML document, provided of course that you use a JavaScript-enabled browser. Whether you see the characters properly depends on the font in use. You should also see the same text appear in a small pop-up window, since that's what the alert function does. However, the font that a browser uses in such windows is often different from the default font it uses for web pages. This may mean that the special characters are not visible, but small boxes might appear instead. The font used in pop-up windows is under the control of the browser and the operating system and cannot be affected by the document author in any normal way. Thus, avoid special characters in pop-ups.

11.2.10. PHP: Mostly Just 8 Bits

The PHP language, commonly used in web authoring, operates on 8-bit characters only. This applies to PHP 5, too. To get some Unicode support, you need to use the string functions utf8_encode and utf8_decode, which convert from ISO-8859-1 to UTF-8 and vice versa. See http://www.php.net/utf8_encode for their usage. Character and string constants in PHP closely follow the Perl model.

An HTML document created by PHP can, however, contain any Unicode characters, since you can express them as character references like &#x2665;.

11.2.11. Java: Rich Support for Unicode

Java has extensive support for Unicode. In addition to the basic constructs needed for processing Unicode characters and strings, Java libraries intrinsically work according to Unicode models. This means, among other things, that case conversion routines use the definitions in the Unicode character database. Java also allows non-ASCII characters in Java identifiers, though practical considerations still often make programmers avoid them.

Standard Java libraries contain a large number of classes for Unicode support, such as input methods for Unicode characters, sorting according to the principles of the Unicode Collation Algorithm, and detection of text boundaries. In modern Java implementations, the output routines support the Unicode bidirectional algorithm as well as contextual shaping of characters as needed for correct rendering of many languages, e.g., Arabic. There are also classes for more technical tasks, such as conversions between character encodings, so that you can make a Java program accept data in different encodings. In addition to the standard libraries included in Java implementations, there are open source libraries available for Unicode-related operations, such as transliteration.

11.2.11.1. Characters, strings, objects, and methods

In Java, 'a' is a character constant, whereas "a" denotes an object of type String. The difference is even more fundamental than in C, since in Java, objects can be used in many ways that cannot be applied to simple scalar values. In an object-oriented language like Java, functions are properties of objects and often called methods of objects.

A character constant is of type char, which is a simple scalar type, not an object. You can however use the Character class, which wraps a simple character value in an object. You can declare, for example, Character ch = new Character('a') to create a new Character object with a specific initial value.

A function invocation in Java generally consists of the name of a class or object, a dot (period, full stop), the name of the method, and a parenthesized list of arguments. (The class or object may be implied in some situations, in which case the dot is omitted, too.) For example, "Hello world".length() is a function invocation, using a method of a String object. No arguments are passed to the function, since the function operates on the object. As you guessed, this is a standard function that returns the length of a string.

11.2.11.2. Encodings and escape notations

A Java implementation may read Java source code in different encodings, but internally, it converts it to Unicode. A programmer may create a source file in some Unicode encoding and use characters directly. However, your system might use some other encoding by default. For example, if you work with Java on Unix or Linux, the odds are that the native encoding is ISO-8859-1 and the Java compiler assumes that, too. You can probably specify the encoding of your Java source in a command option when you invoke the Java compiler (note the spelling UTF8 and note that you might not get any error message if you spell it incorrectly!):

javac -encoding UTF8 mytest.java

Alternatively, you can use some other encoding, such as ASCII, and use the \u notation (\u followed by four hexadecimal digits; e.g., \u00df) to write characters that cannot be typed directly. Unlike ECMAScript, which allows such notations in character and string constants only, Java allows them anywhere in the source. Thus, you could use rôle as a variable name in Java, and you could also write it as r\u00F4le. However, it is still common to use only ASCII characters in names, to avoid any potential problems with defective implementations and old software that might be needed in conjunction with a Java program.

The following Java program is a little more than a "Hello world" program. First, it includes a special character in its output. Second, it writes the output both in the console and in a message window. The reason is that if you test this program, you may well see the console output without the special character, because the default console font is rather limited:

import javax.swing.*;
public class HelloWorld {
    public static void show(String text) {
        JOptionPane.showMessageDialog(null, text);
    }
    public static void main(String[] args) {
        String msg = "Hello world! \u263A";
        System.out.println(msg);
        show(msg);
        System.exit(0);
    }
}

11.2.11.3. 16-bit characters

In Java, the values of type char are 16 bits long. For example, a character constant such as 'x' is automatically implemented that way. Technically, the values are UTF-16 code units, not characters, though these concepts coincide for characters in the Basic Multilingual Plane (BMP). This means that you can directly use any characters in the BMP, but anything outside it needs to be handled differently. Thus, Java is Unicode-oriented, but in an old-fashioned way.

The character concept in Java corresponds to a BMP character, or a code unit in UTF-16. Other Unicode characters are represented as integers and called "code points" in Java.


Internally, a value of type char is represented by its code number. Logically, characters are distinct from numbers (integers), though. To obtain the Unicode code number of a character variable ch, you can assign it to a numeric variable: int code = ch.

The Java String class (for immutable strings), as well as the StringBuffer class (for strings that may vary in length and content), is based on the char type. Thus, a Java string is really a "Unicode string," i.e., a sequence of UTF-16 code units, not characters.

If you need characters outside the BMP, you can use the integer type int for characters. Java 5.0 added methods to the Character, String, and related classes for working with text in such a representation. As an alternative, you could use surrogates, i.e., represent a non-BMP character as a pair of char values that form a surrogate pair. You cannot use the \u notation for characters outside the BMP, but you can represent them as integers using a notation like 0x2F81A (for U+2F81A), i.e., digit zero and letter "x" followed by the number in hexadecimal.

11.2.11.4. Java identifiers

Java allows a rich repertoire of characters in identifiers according to the Unicode identifier concept, with the extension that the dollar sign $ and the underscore (low line) _ are allowed anywhere in an identifier. It is however still common to stick to using ASCII in Java identifiers, since programmers do not know about the possibilities or do not dare to use them.

Java identifiers are case-sensitive; isDigit and isdigit are distinct identifiers. It is recommended and common practice to use lowercase and uppercase in identifiers in a particular style. Names of variables and functions (methods) normally start with a lowercase letter, and uppercase letters correspond to starting a new word in the corresponding natural-language expression (e.g., "is digit" makes isDigit).

11.2.11.5. Library routines

A modern installation of Java contains a collection of very useful functions and defined symbols for working with characters, in the java.lang.Character class. You need to use the Character. prefix for identifiers defined in the class when you use them in your program. For example, Character.getType(ch) returns the General Category value of the character stored in the variable ch.

For details, consult the documentation at http://java.sun.com, such as the description of the class at http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html.

The functions (methods) have some naming rules:


is... methods

These are Boolean (yes/no) functions for testing whether a property has the value true for a character, passed as the argument. For example, isDigit. They correspond to yes/no properties formally defined in the Unicode standard, such as the Digit property. (See Chapter 5.)


get... methods

These functions return the value of a property for a character, for a property with something other than a yes/no value. For example, getType gives the value of the General Category property for its argument. The values of this function are defined as symbolic names, formed from the long names of Unicode properties but in all uppercase, with the comma omitted and parts around the comma swapped, and with underscores instead of spaces. Thus, Letter, uppercase is UPPERCASE_LETTER. (There are some deviations from this mapping of names, as indicated in a sample program later.) The getNumericValue function returns the value of the Numeric Value property as an integer, with the convention that it returns -1 if no numeric value exists and -2 if the value is not expressible as an integer.


to... methods

These are various conversion functions. The functions toUpperCase, toLowerCase, and toTitleCase return the uppercase, lowercase, and titlecase form of the argument, respectively. They apply full case mappings, and so do the string functions with the same names. The function toCodePoint takes two Java char values representing a high surrogate and a low surrogate and returns the corresponding Unicode character (code point). The toChars function can be used to perform the reverse operation, i.e., to convert a non-BMP character to surrogate form.

Not all Unicode properties have direct Java counterparts, but the available methods cover much of the common needs. The web site http://www.fileformat.info/info/unicode/char/ contains searchable information about Unicode characters in a format that contains a table of Java method values for a character, but you could easily write a Java program that prints similar information.

To illustrate the use of the functions, here is Java code that traverses through a string and prints all characters that are neither letters nor whitespace characters. Each character is printed on a new line and followed by an indication of the Unicode block it belongs to. The symbol ! denotes negation in Java, && means "and," and the operator + means string catenation when applied to strings:

import javax.swing.*;
public class Hello {
    public static void main(String[] args) {
        String msg = "Hello world! \u263a";
        for(int i = 0; i < msg.length(); i++) {
            char ch = msg.charAt(i);
            if(!Character.isLetter(ch) &&
               !Character.isWhitespace(ch)) {
                System.out.println(ch + " in " +
                    Character.UnicodeBlock.of(ch));
            }
        }
        System.exit(0);
    }
}

The program prints the following (except that the smiling face might appear as something different, e.g., ?, due to limitations of a font):

! in BASIC_LATIN
☺ in MISCELLANEOUS_SYMBOLS

Unicode property names are case-insensitive, but their Java counterparts are case-sensitive, as are all identifiers in Java. The same applies to values like BASIC_LATIN.


The Java functions corresponding to Unicode properties are listed in Table 11-5 (without the Character. prefix in the function names). The order is by the short name of the Unicode property, as in the description of the properties in Chapter 5. Only a subset of the Unicode properties is directly covered by Java functions.

Table 11-5. Mapping of Unicode properties to Java constructs

Short    Long name of property   Java function
Alpha    Alphabetic              (See note after the table)
bc       Bidi Class              getDirectionality
Bidi M   Bidi Mirrored           isMirrored
blk      Block                   UnicodeBlock.of
gc       General Category        getType
IDC      ID Continue             isUnicodeIdentifierPart
IDS      ID Start                isUnicodeIdentifierStart
lc       Lowercase Mapping       toLowerCase
Lower    Lowercase               isLowerCase
nv       Numeric Value           getNumericValue
tc       Titlecase Mapping       toTitleCase
uc       Uppercase Mapping       toUpperCase
Upper    Uppercase               isUpperCase
WSpace   White Space             isWhitespace


The Java function isLetter doesn't quite correspond to the Alphabetic property, since the latter is true also for characters with General Category value of Nl (Number, letter) and for characters with the OAlpha (Other, Alphabetic) property. For most practical purposes, isLetter is adequate for testing whether a character is alphabetic. In some cases, isUnicodeIdentifierStart is better, since it includes Nl.

In addition to functions like isUnicodeIdentifierStart, there are functions like isJavaIdentifierStart, which are quite similar but allow $ and _, too.

In Java 5.0 and later, most of the functions that correspond to Unicode properties are defined both for character (char) and integer (int) arguments. In the latter case, the argument is treated as a code point, which may refer outside the BMP. Thus, you can relatively conveniently work with non-BMP characters, too.

The return values of functions that correspond to Unicode properties with enumerated values are technically of type byte or int. The values, encoded as integers, have symbolic names, though. For example, the value L (Left-to-Right) of the Bidi Class property corresponds to DIRECTIONALITY_LEFT_TO_RIGHT.

There are some predefined functions in Java that are not directly related to Unicode properties. They are summarized in Table 11-6. The type is indicated in a simple manner, without a static qualifier. In the "Invocation" column, the arguments of functions are specified by the names of their types. The codePointAt function and its relatives (e.g., codePointBefore) are not listed in the table; they can be used to pick up a code point from a character array or sequence.

Table 11-6. Additional methods in java.lang.Character

Type      Invocation              Meaning
int       charCount(int)          Number of char values (1 or 2) needed to represent the code point
char      charValue()             Value of the Character object as a char
int       compareTo(Character)    Comparison using code numbers
int       digit(char,int)         Numeric value of the character, using the radix specified by the second argument
boolean   equals(Object)          Tests for equality by char value
boolean   isDefined(char)         Tests whether the code point is assigned
boolean   isDigit(char)           Tests for being a decimal digit, i.e., gc = Nd
boolean   isHighSurrogate(char)   Tests for being a high surrogate code unit
boolean   isISOControl(char)      Tests for being a C0 or C1 Control character
boolean   isLetter(char)          Tests whether gc is Lu, Ll, Lt, Lm, or Lo
boolean   isLetterOrDigit(char)   Either isLetter or isDigit returns true
boolean   isLowSurrogate(char)    Tests for being a low surrogate code unit
boolean   isSpaceChar(char)       Tests for space character: gc is Zs, Zl, or Zp
boolean   isTitleCase(char)       Tests whether gc = Lt



