# HG changeset patch
# User martin
# Date 1205125002 25200
# Node ID 18b856569127f7f86ae2ce3cebc333c280ce0b1d
# Parent 560da37936dbb7fae5fa47272fad0395b4949597
4499288: (cs spec) Charset terminology problems
Reviewed-by: mr, iris

diff -r 560da37936db -r 18b856569127 jdk/src/share/classes/java/nio/charset/Charset.java
--- a/jdk/src/share/classes/java/nio/charset/Charset.java Thu Mar 06 07:51:28 2008 -0800
+++ b/jdk/src/share/classes/java/nio/charset/Charset.java Sun Mar 09 21:56:42 2008 -0700
@@ -212,36 +212,47 @@
  *
  * Terminology
  *
- * The name of this class is taken from the terms used in
- * RFC 2278. In that
- * document a charset is defined as the combination of a coded character
- * set and a character-encoding scheme.
+ * The name of this class is taken from the terms used in
+ * RFC 2278.
+ * In that document a charset is defined as the combination of
+ * one or more coded character sets and a character-encoding scheme.
+ * (This definition is confusing; some other software systems define
+ * charset as a synonym for coded character set.)
  *
  * A coded character set is a mapping between a set of abstract
  * characters and a set of integers. US-ASCII, ISO 8859-1,
- * JIS X 0201, and full Unicode, which is the same as
- * ISO 10646-1, are examples of coded character sets.
+ * JIS X 0201, and Unicode are examples of coded character sets.
+ *
+ * Some standards have defined a character set to be simply a
+ * set of abstract characters without an associated assigned numbering.
+ * An alphabet is an example of such a character set. However, the subtle
+ * distinction between character set and coded character set
+ * is rarely used in practice; the former has become a short form for the
+ * latter, including in the Java API specification.
  *
- * A character-encoding scheme is a mapping between a coded
- * character set and a set of octet (eight-bit byte) sequences. UTF-8, UCS-2,
- * UTF-16, ISO 2022, and EUC are examples of character-encoding schemes.
- * Encoding schemes are often associated with a particular coded character set;
- * UTF-8, for example, is used only to encode Unicode. Some schemes, however,
- * are associated with multiple character sets; EUC, for example, can be used
- * to encode characters in a variety of Asian character sets.
+ * A character-encoding scheme is a mapping between one or more
+ * coded character sets and a set of octet (eight-bit byte) sequences.
+ * UTF-8, UTF-16, ISO 2022, and EUC are examples of
+ * character-encoding schemes. Encoding schemes are often associated with
+ * a particular coded character set; UTF-8, for example, is used only to
+ * encode Unicode. Some schemes, however, are associated with multiple
+ * coded character sets; EUC, for example, can be used to encode
+ * characters in a variety of Asian coded character sets.
  *
  * When a coded character set is used exclusively with a single
- * character-encoding scheme then the corresponding charset is usually named
- * for the character set; otherwise a charset is usually named for the encoding
- * scheme and, possibly, the locale of the character sets that it supports.
- * Hence US-ASCII is the name of the charset for US-ASCII while
+ * character-encoding scheme then the corresponding charset is usually
+ * named for the coded character set; otherwise a charset is usually named
+ * for the encoding scheme and, possibly, the locale of the coded
+ * character sets that it supports. Hence US-ASCII is both the
+ * name of a coded character set and of the charset that encodes it, while
  * EUC-JP is the name of the charset that encodes the
  * JIS X 0201, JIS X 0208, and JIS X 0212
- * character sets.
+ * coded character sets for the Japanese language.
  *
  * The native character encoding of the Java programming language is
- * UTF-16. A charset in the Java platform therefore defines a mapping between
- * sequences of sixteen-bit UTF-16 code units and sequences of bytes.
+ * UTF-16. A charset in the Java platform therefore defines a mapping
+ * between sequences of sixteen-bit UTF-16 code units (that is, sequences
+ * of chars) and sequences of bytes.
  *
  *
  * @author Mark Reinhold
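
The final paragraph of the revised Javadoc describes a charset as a mapping between sequences of chars and sequences of bytes. The following is a minimal sketch of that idea and is not part of the changeset; the class name `CharsetSketch` and its helper methods are hypothetical, and it assumes the standard `java.nio.charset.Charset` methods `forName`, `encode`, and `decode`.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// Hypothetical illustration class; not part of the JDK or of this patch.
class CharsetSketch {

    // Maps a sequence of chars (UTF-16 code units) to bytes using the named charset.
    static byte[] encode(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        ByteBuffer buffer = cs.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    // Maps bytes back to a sequence of chars using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return cs.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE, a single char

        // The same char sequence yields different byte sequences per charset.
        System.out.println(encode(text, "ISO-8859-1").length); // prints 1
        System.out.println(encode(text, "UTF-8").length);      // prints 2
        System.out.println(encode(text, "UTF-16BE").length);   // prints 2

        // Decoding with the same charset recovers the original chars.
        System.out.println(decode(encode(text, "UTF-8"), "UTF-8").equals(text)); // prints true
    }
}
```

The sketch uses only ISO-8859-1, UTF-8, and UTF-16BE because every Java platform implementation is required to support them; a charset such as EUC-JP may not be available in every runtime.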