Locale Support | PostgreSQL (2nd Edition)

A locale is a named group of properties that defines culture-specific conventions. Each locale is made up of one or more categories. Each category controls the behavior of a set of features. For example, the LC_MONETARY category contains information about how monetary values are formatted in some specific territory. The ISO and IEEE (POSIX) standards bodies have stated that a locale should include information such as the ordering of date components , the formatting of numbers , and the language preferred for message text.

PostgreSQL makes use of the locale-processing facilities provided by the host operating system. When you log into your operating system, you are automatically assigned a locale. On a Linux host (and most Unix hosts ), you can find your current locale using the locale command:

 $ locale LANG=en_US LC_CTYPE="en_US" LC_NUMERIC="en_US" LC_TIME="en_US" LC_COLLATE="en_US" LC_MONETARY="en_US" LC_MESSAGES="en_US" LC_PAPER="en_US" LC_NAME="en_US" LC_ADDRESS="en_US" LC_TELEPHONE="en_US" LC_MEASUREMENT="en_US" LC_IDENTIFICATION="en_US" LC_ALL=

You can see that I am using a locale named en_US . Locale names are composed of multiple parts . The first component identifies a language. In my case, the language is en , meaning English. The second (optional) component identifies a country, region, or territory where the language is used. I am in the U.S., so my country code is set to US . You can think of en_US as meaning "English as spoken in the U.S.", as opposed to en_AU , which means "English as spoken in Australia." The third component of a locale name is an optional codeset. I'll talk more about codesets later in this chapter. Finally, a locale name may include modifiers, such as " @euro " to indicate that the locale uses the Euro for currency values.

Language IDs are usually two characters long, written in lowercase, and chosen from the ISO 639 list of country codes. Territories are usually two characters long, written in uppercase, and chosen from the ISO 3166 standard.

The POSIX (and ISO) standards define two special locales named C and POSIX . The C and POSIX locales are defined so that they can be used in many different locations.

Table 20.1 shows a few locale names taken from my Linux host.

Table 20.1. Sample Locale Names

Locale Name	Language	Region	Codeset	Modifier
`sv_FI`	Swedish	Finland
`sv_FI@euro`	Swedish	Finland		Euro is used in this locale
`sv_FI.utf8`	Swedish	Finland	UTF-8
`sv_FI.utf8@euro`	Swedish	Finland	UTF-8	Euro is used in this locale
`sv_SE`	Swedish	Sweden
`sv_SE.utf8`	Swedish	Sweden	UTF-8
`en_AU`	English	Australia
`en_AU.utf8`	English	Australia	UTF-8
`en_IE`	English	Ireland
`en_IE@euro`	English	Ireland		Euro is used in this locale
`en_IE.utf8`	English	Ireland	UTF-8
`en_IE.utf8@euro`	English	Ireland	UTF-8	Euro is used in this locale

My Red Hat Linux system defines 277 locales. Each locale is broken down into a set of categories. Most locale implementations define (at least) the categories shown in Table 20.2. Some operating systems define additional categories.

Table 20.2. Locale Information Categories

Category	Influences	Used By
`LC_MESSAGES`	Message formatting and message language	Client/Server
`LC_MONETARY`	Monetary value formatting	Server
`LC_NUMERIC`	Numeric value formatting	Server
`LC_TIME`	Date and time formatting	Not used
`LC_CTYPE`	Character classifications (uppercase, punctuation, and so on)	Server
`LC_COLLATE`	Collating order for string values	Cluster
`LC_ALL`	All of the above	See all of the above

Enabling Locale Support

When you build PostgreSQL from scratch, locale support is not included unless you include --enable-locale when you configure the source code. If you enable locale support, you should also enable NLS (National Language Support) ”without NLS, you will always see PostgreSQL messages in English. Here is an example showing how to enable both:

 $ ./configure --enable-locale --enable-nls

You choose a locale by setting one or more environment variables. There are three levels of environment variables that you can use. At the bottom level, you can set the LANG environment variable to the locale that you want to use. For example, if you want all features to run in a French context unless overridden, set LANG=fr_FR . You can mix locales by defining LC_MESSAGES , LC_MONETARY , LC_NUMERIC , LC_CTYPE , and/or LC_COLLATE . The LC_ xxx environment variables override LANG . If you are working with a data base that stores French names, for example, you may still want to see PostgreSQL messages in English. In this case, you would set LANG=fr_FR and LC_MESSAGES=en_US . At the top level, LC_ALL overrides any other locale- related environment variables: If you want everything to run in German (as spoken in Germany), set LC_ALL=de_DE .

Effects of Locale Support

Let's see what happens when you change locales.

The first category in Table 20.2, LC_MESSAGES , determines the language that PostgreSQL uses when displaying message text. I've been running with LC_MESSAGES set to en_US when I run psql, so messages are displayed in English:

 $ psql -d movies Welcome to psql, the PostgreSQL interactive terminal. Type:  \copyright for distribution terms        \h for help with SQL commands        \? for help on internal slash commands        \g or terminate with semicolon to execute query        \q to quit movies=#

Let's try setting LC_MESSAGES to fr_CA (French as spoken in Canada):

 $ LC_MESSAGES=fr_CA psql -d movies Bienvenu  psql, l'interface interactif de PostgreSQL. Tapez:  \copyright pour l'information de copyright         \h pour l'aide-mmoire sur les commandes SQL         \? pour l'aide-mmoire sur les commandes internes         \g ou point-virgule pour excuter une requte         \q pour quitter movies=#

Voil ! The client messages are now in French.

When you are running a PostgreSQL client connected to a PostgreSQL server, there are three locales in use: the client locale, the server locale, and the cluster locale.

Some locale properties affect the server, some affect the client, and a few are stored with the database cluster itself (see Table 20.2). The LC_MESSAGES category affects both the client and server because each can produce message text.

Now, let's try a few of the other categories.

The server uses the LC_MONETARY category to control the way in which monetary values are formatted. I've modified the customers table in my database to include a balance column (using the MONEY data type). Here is the new column, shown in the en_US locale:

 movies=# SELECT * FROM customers;  id     customer_name       phone    birth_date   balance ----+----------------------+----------+------------+------------   4  Wonderland, Alice N.  555-1122  1969-03-05    1  Jones, Henry          555-1212  1970-10-10      .00   2  Rubin, William        555-2211  1972-07-10   ,000.00   3  Panky, Henry          555-1221  1968-01-21  ,000.00 (4 rows)

Now, I'll stop my server, set the LC_MONETARY environment variable to fr_FR (French as spoken in France), and restart the server. Note that you must restart the server before a change to LC_MONETARY can take effect (you can't change monetary formatting on a per-connection basis):

 $ pg_ctl stop waiting for postmaster to shut down......done postmaster successfully shut down $ export LC_MONETARY=fr_FR $ pg_ctl -l /tmp/pg.log start postmaster successfully started

Now, when I query the customers table, the monetary values are formatted using the fr_FR locale:

 movies=# SELECT * FROM customers;  id     customer_name       phone    birth_date   balance ----+----------------------+----------+------------+------------   4  Wonderland, Alice N.  555-1122  1969-03-05    1  Jones, Henry          555-1212  1970-10-10      F10,00   2  Rubin, William        555-2211  1972-07-10   F1 000,00   3  Panky, Henry          555-1221  1968-01-21  F10 000,00 (4 rows)

Notice that MONEY values are now formatted using French preferences.

The LC_NUMERIC category determines which characters will be used for grouping, the currency symbol, positive and negative signs, and the decimal point. Currently, LC_NUMERIC is used only by the TO_CHAR() function. The LC_NUMERIC category affects the server.

PostgreSQL currently does not use the LC_TIME category (each date/time value can include an explicit time zone).

LC_CTYPE is consulted whenever PostgreSQL needs to categorize a character. The server locale determines which characters are considered uppercase, lowercase, numeric, punctuation, and so on. The most obvious uses of LC_COLLATE are the LOWER() , UPPER() , and INITCAP() string functions. LC_COLLATE is also used when evaluating regular expressions and the LIKE operator.

LC_COLLATE affects the result of an ORDER BY clause that sorts by a string value. LC_COLLATE also affects how an index that covers a string value is built. Setting LC_COLLATE ensures that strings are ordered properly for your locale.

Let's look at an example. Create two new database clusters and insert the same values into each one. The first database uses the French locale for collating:

 $ PGDATA=/usr/local/locale_FR LC_COLLATE=fr_FR initdb ... Success. You can now start the database server using:     postmaster -D /usr/local/locale_FR or     pg_ctl -D /usr/local/locale_FR -l logfile start $ PGDATA=/usr/local/locale_FR pg_ctl start postmaster successfully started $ PGDATA=/usr/local/locale_FR createdb french_locale CREATE DATABASE $ PGDATA=/usr/local/locale_FR psql -q -d french_locale french_locale=# CREATE TABLE sort_test (pkey char); CREATE TABLE french_locale=# INSERT INTO sort_test VALUES ('a'); INSERT french_locale=# INSERT INTO sort_test VALUES (''); INSERT french_locale=# INSERT INTO sort_test VALUES ('b'); INSERT french_locale=# SELECT * FROM sort_test;  pkey ------  a    b  (3 rows) french_locale=# \q

Now, repeat this procedure but set LC_COLLATE=en_US before creating the database cluster:

 $ PGDATA=/usr/local/locale_EN LC_COLLATE=en_US initdb ... Success. You can now start the database server using:     postmaster -D /usr/local/locale_EN or     pg_ctl -D /usr/local/locale_EN -l logfile start $ PGDATA=/usr/local/locale_EN pg_ctl start postmaster successfully started $ PGDATA=/usr/local/locale_EN createdb english_locale CREATE DATABASE $ PGDATA=/usr/local/locale_EN psql -q -d locale_test english_locale=# CREATE TABLE sort_test (pkey char); CREATE TABLE english_locale=# INSERT INTO sort_test VALUES ('a'); INSERT english_locale=# INSERT INTO sort_test VALUES (''); INSERT english_locale=# INSERT INTO sort_test VALUES ('b'); INSERT english_locale=# SELECT * FROM sort_test;  pkey ------  a  b   (3 rows) locale_test=# \q

Notice that the collation sequence has, in fact, changed. With LC_COLLATE set to fr_FR , you see a, ,b . With LC_COLLATE set to en_US , the ORDER BY clause returns a,b, .

The LC_COLLATE category is honored only when you run the initdb command. Imagine what would happen if you were trying to alphabetize a long list of customer names, but the collation rules changed every few minutes. You'd end up with quite a mess ”each portion of the final list would be built with a different ordering. If you could change the collating sequence each time you started a client application, indexes would not be built reliably.