Character Encoding Model

A summary of Unicode Technical Report #17: Character Encoding Model, kept as personal notes.

Abstract Character Repertoire (ACR)


these objects are defined by convention

character vs. glyph

Characters are different from glyphs.
A character and a glyph are not the same thing.

Glyphs do not correspond one-for-one with characters.
Characters and glyphs are not necessarily in one-to-one correspondence.
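
The lack of a one-to-one correspondence can be seen from the character side with a small Python sketch. It uses only the standard `unicodedata` module; the variable and function names are illustrative. The two strings contain different numbers of characters, yet a typical font would be expected to draw each of them as a single "é" glyph (the rendering itself is an assumption, since it depends on the font and layout engine).

```python
import unicodedata

# Two character sequences that are normally displayed as the same glyph.
composed = "\u00e9"      # one character: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two characters: "e" + COMBINING ACUTE ACCENT


def character_names(text):
    """Return the Unicode character name of every character in text."""
    return [unicodedata.name(ch) for ch in text]


print(len(composed), character_names(composed))
print(len(decomposed), character_names(decomposed))

# NFC normalization maps the two-character sequence onto the single character.
print(unicodedata.normalize("NFC", decomposed) == composed)
```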

Scope of abstract characters

For historical reasons, abstract character repertoires may include many entities that normally would not be considered appropriate members of an abstract character repertoire.
For historical reasons, an abstract character repertoire also contains things that would not normally be thought of as characters.



Coded Character Set (CCS)

a mapping from a set of abstract characters to the set of nonnegative integers.
A CCS is a mapping from abstract characters to a set of nonnegative integers.
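
As a minimal sketch, taking Unicode as the CCS, Python's built-in `ord()` and `chr()` expose this mapping directly. The helper names `to_code_point` and `from_code_point` are made up for this note.

```python
def to_code_point(ch):
    """Map one abstract character (a 1-character str) to its integer."""
    return ord(ch)


def from_code_point(cp):
    """Inverse direction: integer back to the abstract character."""
    return chr(cp)


for ch in ["A", "あ", "😀"]:
    print(ch, f"U+{to_code_point(ch):04X}")
```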


a character encoding, a coded character repertoire, a character set definition, or a code page
Coded character set (符号化文字集合 in Japanese); a CP ("code page") in the IBM CDRA architecture.

JIS X 0208, ISO/IEC 8859-1, The Unicode Standard, Version 4.0

a nonnegative integer

a code point, an encoded character, the Unicode scalar value


Character Encoding Form (CEF)

A mapping from the set of integers used in a CCS to the set of sequences of code units.
A CEF is a mapping from the set of code points to the set of code unit sequences.
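
The following sketch shows three Unicode encoding forms turning one code point into code units of 8, 16, and 32 bits. The function names are hypothetical. The UTF-16 case is written out by hand to make the surrogate-pair arithmetic visible, while UTF-8 leans on Python's built-in codec.

```python
from typing import List


def utf8_code_units(cp: int) -> List[int]:
    """8-bit code units; delegated to Python's built-in UTF-8 codec."""
    return list(chr(cp).encode("utf-8"))


def utf16_code_units(cp: int) -> List[int]:
    """16-bit code units, using a surrogate pair above U+FFFF."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


def utf32_code_units(cp: int) -> List[int]:
    """32-bit code units; always exactly one unit per code point."""
    return [cp]


for cp in (0x41, 0x3042, 0x1F600):
    print(f"U+{cp:04X}",
          [hex(u) for u in utf8_code_units(cp)],
          [hex(u) for u in utf16_code_units(cp)],
          [hex(u) for u in utf32_code_units(cp)])
```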

Correspondence between CEF and CCS

A CEF for a CCS is defined to be a CEF that maps all of the encoded characters for that CCS.
A CEF for a given CCS is defined for every character contained in that CCS.

Kuten code (区点コード, JIS X 0208), 7-bit encoding form (US ASCII), UCS-2, UTF-16 (Unicode)


Note that Shift-JIS is not an encoding form
Shift_JIS is not a CEF. (Shift_JIS maps to byte sequences, so it is a CES.)

code unit

an integer occupying a specified binary width in a computer architecture, such as an 8-bit byte.

Character Encoding Scheme (CES)

a reversible transformation of sequences of code units to sequences of bytes
A CES is a reversible transformation between code unit sequences and byte sequences.


The CES must take into account the byte-order serialization of all code units wider than a byte that are used in the CEF.
When the CEF's code units are wider than a byte, the CES has to take byte order into account.
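
A minimal sketch of that byte-order choice, using `int.to_bytes` from the standard library. `serialize_code_units` is a name invented for this note, not part of any specification.

```python
def serialize_code_units(units, width, byteorder):
    """Serialize code units of `width` bytes each in the given byte order."""
    return b"".join(u.to_bytes(width, byteorder) for u in units)


units = [0x3042]  # the single UTF-16 code unit for "あ"

big = serialize_code_units(units, 2, "big")
little = serialize_code_units(units, 2, "little")
print(big.hex(), little.hex())

# The results agree with Python's own UTF-16BE / UTF-16LE codecs.
print(big == "あ".encode("utf-16-be"), little == "あ".encode("utf-16-le"))
```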

ISO-2022-JP, EUC-JP, Shift_JIS, CP932 (a Microsoft code page)
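
To see how these differ at the byte level, the sketch below encodes the same character with Python's bundled codecs. The codec names `iso2022_jp`, `euc_jp`, `shift_jis`, and `cp932` are Python's spellings, and the helper `encode_all` is made up for this note.

```python
CODECS = ("iso2022_jp", "euc_jp", "shift_jis", "cp932")


def encode_all(text):
    """Return {codec name: encoded bytes} for each codec listed above."""
    return {codec: text.encode(codec) for codec in CODECS}


# "あ" is HIRAGANA LETTER A (row 4, cell 2 in JIS X 0208).
for codec, data in encode_all("あ").items():
    print(f"{codec:12} {data.hex(' ')}")
```

The character is the same in every case; only the byte sequence changes.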

The Unicode Standard has seven character encoding schemes:
UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
Unicode 1.1 had three character encoding schemes:
UTF-8, UCS-2BE, and UCS-2LE, although the latter two were not named that way at the time.
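
As a rough illustration, Python ships a codec for each of the seven schemes. The codec names below are Python's, not the Unicode Standard's. The unmarked `utf-16` and `utf-32` encoders prepend a byte order mark in the platform's native order, so only their lengths are stable across machines.

```python
SCHEMES = ("utf-8", "utf-16", "utf-16-be", "utf-16-le",
           "utf-32", "utf-32-be", "utf-32-le")


def encode_with_all_schemes(text):
    """Return {scheme name: encoded bytes} for the seven Unicode CESs."""
    return {name: text.encode(name) for name in SCHEMES}


for name, data in encode_with_all_schemes("A").items():
    print(f"{name:10} {len(data)} bytes  {data.hex(' ')}")
```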



An abstract character
a nonnegative integer
a sequence of code units
a sequence of bytes

Character Maps

The mapping from a sequence of members of an abstract character repertoire to a serialized sequence of bytes is called a Character Map (CM).


a charset, a character set, a code page (broadly construed), or a CHARMAP

A simple character map

A simple character map thus implicitly includes a CCS, a CEF, and a CES, mapping from abstract characters to code units to bytes.
A simple character map is CCS + CEF + CES.
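
The composition can be sketched by chaining three small functions, here for UTF-16BE. The function names `ccs`, `cef_utf16`, `ces_big_endian`, and `character_map` are labels invented for this note. The result is checked against Python's own `utf-16-be` codec.

```python
def ccs(ch):
    """CCS: abstract character -> code point (Unicode)."""
    return ord(ch)


def cef_utf16(cp):
    """CEF: code point -> sequence of 16-bit code units (UTF-16)."""
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


def ces_big_endian(units):
    """CES: 16-bit code units -> bytes, most significant byte first."""
    return b"".join(u.to_bytes(2, "big") for u in units)


def character_map(text):
    """Simple character map: CCS, then CEF, then CES, per character."""
    return b"".join(ces_big_endian(cef_utf16(ccs(ch))) for ch in text)


sample = "Aあ😀"
print(character_map(sample).hex(" "))
print(character_map(sample) == sample.encode("utf-16-be"))
```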

A compound character map

A compound character map includes a compound CES, and thus includes more than one CCS and CEF.
A compound character map is (CCS + CEF){2,} + a compound CES.
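
ISO-2022-JP is a convenient illustration, assuming Python's `iso2022_jp` codec: escape sequences switch between ASCII and JIS X 0208 inside a single byte stream. The constant names below are made up for this note.

```python
ESC = b"\x1b"
TO_JIS_X_0208 = ESC + b"$B"   # designate JIS X 0208
TO_ASCII = ESC + b"(B"        # designate ASCII

encoded = "Aあ".encode("iso2022_jp")

# "A" is in ASCII; "あ" is the two 7-bit bytes 0x24 0x22 in JIS X 0208,
# bracketed by the escape sequences that switch coded character sets.
expected = b"A" + TO_JIS_X_0208 + b"\x24\x22" + TO_ASCII

print(encoded)
print(encoded == expected)
```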

Relationship to IANA charset

Character Maps are the entities that get IANA charset [charset] identifiers in the IAB architecture.
Character maps are what IANA charset identifiers are assigned to.


In the IBM CDRA architecture, Character Maps are the entities that get CCSID (coded character set identifier) values.
Character maps are also what CCSID values are assigned to.

Relationship to CES

In many cases, the same name is used for both a character map and for a character encoding scheme, such as UTF-16BE.
In many cases the same name, such as UTF-16BE, is used for both a character map and a CES.