Character Sets

Appendix Character Sets

The first 128 Unicode characters (that is, characters 0 through 127) are identical to the ASCII character set. 32 is the ASCII space; therefore, 32 is the Unicode space. 33 is the ASCII exclamation point; therefore, 33 is the Unicode exclamation point, and so on. Table A-1 lists this character set.

Table A-1. The first 128 Unicode characters and the ASCII character set
Code	Character	Code	Character	Code	Character	Code	Character
0	NUL (null)	32	space	64	@	96	`
1	SOH (start of header)	33	!	65	A	97	a
2	STX (start of text)	34	"	66	B	98	b
3	ETX (end of text)	35	#	67	C	99	c
4	EOT (end of transmission)	36	$	68	D	100	d
5	ENQ (enquiry)	37	%	69	E	101	e
6	ACK (acknowledge)	38	&	70	F	102	f
7	BEL (bell)	39	'	71	G	103	g
8	BS (backspace)	40	(	72	H	104	h
9	TAB (tab)	41	)	73	I	105	i
10	LF (linefeed)	42	*	74	J	106	j
11	VTB (vertical tab)	43	+	75	K	107	k
12	FF (formfeed)	44	,	76	L	108	l
13	CR (carriage return)	45	-	77	M	109	m
14	SO (shift out)	46	.	78	N	110	n
15	SI (shift in)	47	/	79	O	111	o
16	DLE (data link escape)	48	0	80	P	112	p
17	DC1 (device control 1, XON)	49	1	81	Q	113	q
18	DC2 (device control 2)	50	2	82	R	114	r
19	DC3 (device control 3, XOFF)	51	3	83	S	115	s
20	DC4 (device control 4)	52	4	84	T	116	t
21	NAK (negative acknowledge)	53	5	85	U	117	u
22	SYN (synchronous idle)	54	6	86	V	118	v
23	ETB (end of transmission block)	55	7	87	W	119	w
24	CAN (cancel)	56	8	88	X	120	x
25	EM (end of medium)	57	9	89	Y	121	y
26	SUB (substitute)	58	:	90	Z	122	z
27	ESC (escape)	59	;	91	[	123	{
28	IS4 (file separator)	60	<	92	\	124	|
29	IS3 (group separator)	61	=	93	]	125	}
30	IS2 (record separator)	62	>	94	^	126	~
31	IS1 (unit separator)	63	?	95	_	127	DEL (delete)
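
The correspondence in Table A-1 can be checked directly in Java, since a char widens to its Unicode value. The following is only a minimal sketch; the class name AsciiCodes and its helper methods are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class AsciiCodes {
    // A Java char widens to an int holding its Unicode value.
    static int codeOf(char c) {
        return c;
    }

    // Encodes text as US-ASCII. For characters 0 through 127,
    // each resulting byte equals the character's Unicode value.
    static byte[] toAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(codeOf(' '));      // space
        System.out.println(codeOf('!'));      // exclamation point
        System.out.println(toAscii("A")[0]);  // capital A as an ASCII byte
    }
}
```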

In the first column, characters 0 through 31 are referred to as control characters because they're traditionally entered by holding down the control key and a letter key (on at least some dumb terminals). For instance, Ctrl-H is often ASCII 8, backspace. Ctrl-S is often mapped to ASCII 19 (DC3, or XOFF). Ctrl-Q is often mapped to ASCII 17 (DC1, or XON). Generally, each control character is entered by pressing the Control key and the printable character whose ASCII value is the ASCII value of the character you want plus 64 (or plus 96, if you count from the lowercase letters). Character 127, delete, is also a control character.
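
The plus-64 rule can be sketched as a small Java method. This is an illustration of the arithmetic only, not a description of how any particular terminal or keyboard driver works; the class name ControlKeys and the method controlCode are hypothetical.

```java
// Hypothetical demo class; the names are illustrative only.
class ControlKeys {
    // Returns the ASCII code traditionally produced by holding Control
    // and pressing the given key. Lowercase letters are folded to
    // uppercase first, so 'h' and 'H' give the same result.
    static int controlCode(char key) {
        char upper = Character.toUpperCase(key);
        // Only '@' (64) through '_' (95) map onto codes 0 through 31.
        if (upper < '@' || upper > '_') {
            throw new IllegalArgumentException("No control code for: " + key);
        }
        return upper - 64;
    }

    public static void main(String[] args) {
        System.out.println(controlCode('H')); // backspace
        System.out.println(controlCode('S')); // DC3, XOFF
        System.out.println(controlCode('Q')); // DC1, XON
    }
}
```

Folding to uppercase before subtracting 64 is equivalent to subtracting 96 from a lowercase letter, which is why both counts in the text lead to the same control code.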

The common abbreviation for the character is given first, followed by its common meaning. Some of these codes are pretty much obsolete. For instance, I'm not aware of any modern system that actually uses characters 28 through 31 as file, group, record, and unit separators. Those control codes that are still used often have different meanings on different platforms. For example, character 10, the linefeed, originally meant "move the platen on the printer up one line," while character 13, the carriage return, meant "return the print-head to the beginning of the line." On paper-based teletype terminals, this could be used to position the print-head anywhere on a page and perhaps overtype characters that had already been typed. This no longer makes sense in an era of glass terminals and GUIs, so linefeed has come to mean a generic end-of-line character.

The next 128 Unicode characters (that is, 128 through 255) have the same values as the equivalent characters in the Latin-1 character set defined in ISO standard 8859-1. Latin-1, a slight variation of which is used by Windows, adds the various accented characters, umlauts, cedillas, upside-down question marks, and other characters needed to write text in most Western European languages. Table A-2 shows these characters. The first 128 characters in Latin-1 are the ASCII characters shown in Table A-1.

Table A-2. Unicode characters between 128 and 255, also the second half of the ISO 8859-1 Latin-1 character set
Code	Character	Code	Character	Code	Character	Code	Character
128	PAD (padding character)	160	non-breaking space	192	À	224	à
129	HOP (high octet preset)	161	¡	193	Á	225	á
130	BPH (break permitted here)	162	¢	194	Â	226	â
131	NBH (no break here)	163	£	195	Ã	227	ã
132	IND (index)	164	¤	196	Ä	228	ä
133	NEL (next line)	165	¥	197	Å	229	å
134	SSA (start of selected area)	166	¦	198	Æ	230	æ
135	ESA (end of selected area)	167	§	199	Ç	231	ç
136	HTS (character tabulation set)	168	¨	200	È	232	è
137	HTJ (character tabulation with justification)	169	©	201	É	233	é
138	VTS (line tabulation set)	170	ª	202	Ê	234	ê
139	PLD (partial line forward)	171	«	203	Ë	235	ë
140	PLU (partial line backward)	172	¬	204	Ì	236	ì
141	RI (reverse line feed)	173	soft (optional) hyphen	205	Í	237	í
142	SS2 (single-shift two)	174	®	206	Î	238	î
143	SS3 (single-shift three)	175	¯	207	Ï	239	ï
144	DCS (device control string)	176	° (degree)	208	Ð	240	ð
145	PU1 (private use one)	177	±	209	Ñ	241	ñ
146	PU2 (private use two)	178	²	210	Ò	242	ò
147	STS (set transmit state)	179	³	211	Ó	243	ó
148	CCH (cancel character)	180	´	212	Ô	244	ô
149	MW (message waiting)	181	µ	213	Õ	245	õ
150	SPA (start of guarded area)	182	¶	214	Ö	246	ö
151	EPA (end of guarded area)	183	·	215	×	247	÷
152	SOS (start of string)	184	¸ (cedilla)	216	Ø	248	ø
153	SGI (single graphic character introducer)	185	¹	217	Ù	249	ù
154	SCI (single character introducer)	186	º	218	Ú	250	ú
155	CSI (control sequence introducer)	187	»	219	Û	251	û
156	ST (string terminator)	188	¼	220	Ü	252	ü
157	OSC (operating system command)	189	½	221	Ý	253	ý
158	PM (privacy message)	190	¾	222	Þ	254	þ
159	APC (application program command)	191	¿	223	ß	255	ÿ
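
Because Latin-1 occupies Unicode values 0 through 255 unchanged, decoding an ISO 8859-1 byte yields a char whose value equals the unsigned value of the byte. The sketch below illustrates this with two entries from Table A-2; the class name Latin1Demo and the method decodeLatin1 are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class Latin1Demo {
    // Decodes bytes as ISO 8859-1 (Latin-1). Each byte becomes one char
    // whose Unicode value equals the byte's unsigned value.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 233 is the small e with acute accent; 191 is the inverted question mark.
        byte[] bytes = {(byte) 233, (byte) 191};
        String text = decodeLatin1(bytes);
        // Print numeric values rather than glyphs, so the output does not
        // depend on the console's encoding.
        System.out.println((int) text.charAt(0));
        System.out.println((int) text.charAt(1));
    }
}
```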

Characters 128 through 159 are nonprinting control characters, much like characters 0 through 31 of the ASCII set. Unicode does not specify any meanings for these 32 characters, but their common interpretations are listed in the table. On Windows, most of these positions are used for noncontrol characters not included in Latin-1. These alternate interpretations are given in Table A-3.

Table A-3. Windows characters between 128 and 159
Code	Character	Code	Character	Code	Character	Code	Character
128	€	136	ˆ	144	undefined	152	˜
129	undefined	137	‰	145	‘	153	™
130	‚	138	Š	146	’	154	š
131	ƒ	139	‹	147	“	155	›
132	„	140	Œ	148	”	156	œ
133	…	141	undefined	149	•	157	undefined
134	†	142	Ž	150	– (en dash)	158	ž
135	‡	143	undefined	151	— (em dash)	159	Ÿ
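
The difference between the two interpretations of bytes 128 through 159 shows up when the same byte is decoded with each character set. The sketch below assumes the Java runtime provides a charset named windows-1252, which the Java specification does not guarantee; the class name Cp1252Demo and the method decodeByte are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class Cp1252Demo {
    // Decodes a single byte value (0 through 255) with the given charset
    // and returns the Unicode value of the resulting character.
    static int decodeByte(int value, Charset charset) {
        return new String(new byte[] {(byte) value}, charset).codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: this charset is available in the running JVM.
        Charset windows = Charset.forName("windows-1252");

        // In Latin-1, byte 153 is a control character with Unicode value 153.
        System.out.println(decodeByte(153, StandardCharsets.ISO_8859_1));
        // On Windows, byte 153 is the trademark sign, Unicode 8482.
        System.out.println(decodeByte(153, windows));
        // Outside the 128 through 159 range, the two sets agree.
        System.out.println(decodeByte(233, windows));
    }
}
```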

Values beyond 255 encode characters from various other character sets. Where possible, character blocks describing a particular group of characters map onto established encodings for that set of characters by simple transposition. For instance, Unicode characters 884 through 1011 encode the Greek alphabet and associated characters like the Greek question mark (;). This is a direct transposition by 720 of characters 128 through 255 of the ISO 8859-7 character set, which is in turn based on the Greek national standard ELOT 928. For example, the small letter delta, δ, Unicode character 948, is ISO 8859-7 character 228. A small epsilon, ε, Unicode character 949, is ISO 8859-7 character 229. In general, the Unicode value for a Greek character equals the ISO 8859-7 value for the character plus 720. Other character sets are included in Unicode in a similar fashion whenever possible.
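
The transposition amounts to a single addition. The sketch below applies it to the two letters mentioned above; the class name GreekOffset and the method toUnicode are hypothetical, and the method makes no attempt to check whether a given ISO 8859-7 value actually follows the plus-720 rule.

```java
// Hypothetical demo class; the names are illustrative only.
class GreekOffset {
    // Offset between ISO 8859-7 values and Unicode values for Greek letters.
    static final int OFFSET = 720;

    // Applies the transposition described in the text. It does not verify
    // that the input is a character to which the rule applies.
    static int toUnicode(int iso88597Value) {
        return iso88597Value + OFFSET;
    }

    public static void main(String[] args) {
        System.out.println(toUnicode(228)); // small delta
        System.out.println(toUnicode(229)); // small epsilon
    }
}
```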

As much as I'd like to include complete tables for all Unicode characters, if I did so, this book would be little more than that table. For complete lists of all the Unicode characters and associated glyphs, the canonical reference is The Unicode Standard Version 4.0 by the Unicode Consortium, ISBN 0-321-18578-1. Updates to that book can be found at http://www.unicode.org/. Online charts can be found at http://unicode.org/charts.

About the Author

Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he currently resides in the Prospect Heights neighborhood of Brooklyn with his wife, Beth, and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where he teaches Java, XML, and object-oriented programming. His Cafe au Lait web site (http://www.cafeaulait.org) is one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche (http://www.cafeconleche.org), has become one of the most popular XML sites. He's currently working on the XOM library for XML, the Jaxen XPath engine, and the Amateur media player. His previous books include Java Network Programming (O'Reilly) and Processing XML with Java (Addison-Wesley).

