Character Encoding Model - なるせにっき

Notes summarizing Unicode Technical Report #17: Character Encoding Model, kept as a memo.
The report is quite useful for sorting out the tangled concepts around character encodings.

Abstract Character Repertoire (ACR)

Character identity

these objects are defined by convention
Character identity is based on convention.

character and glyph

Characters are different from glyphs.
A character and a glyph are different things.

Glyphs do not correspond one-for-one with characters.
Characters and glyphs do not necessarily correspond one-to-one.

Scope of abstract characters

For historical reasons, abstract character repertoires may include many entities that normally would not be considered appropriate members of an abstract character repertoire.
For historical reasons, an abstract character repertoire also contains things that would not normally be regarded as characters.

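Notes on character identity

Deciding how to define character identity is extremely difficult.
See Han unification, compatibility ideographs, and similar cases.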

Coded Character Set (CCS)

a mapping from a set of abstract characters to the set of nonnegative integers.
A CCS is a mapping between abstract characters and a set of nonnegative integers.

Synonyms

a character encoding, a coded character repertoire, a character set definition, or a code page
符号化文字集合 (the Japanese term), and CP ("code page") in the IBM CDRA architecture

Examples

JIS X 0208, ISO/IEC 8859-1, The Unicode Standard, Version 4.0

a nonnegative integer

a code point, an encoded character, the Unicode scalar value
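As a small illustration of the CCS idea, Python's built-in ord and chr expose the Unicode code point assigned to a character. This is only a sketch: a one-character Python str stands in for an abstract character, unicodedata.name is used to print its conventional name, and the helper name code_points is made up for this memo.

```python
import unicodedata


def code_points(text: str) -> list[int]:
    """Map each character of text to its Unicode code point (the CCS step)."""
    return [ord(ch) for ch in text]


for ch in "Aあ😀":
    cp = ord(ch)          # abstract character -> nonnegative integer
    assert chr(cp) == ch  # chr() is the inverse mapping
    print(f"{unicodedata.name(ch)} -> {cp} (U+{cp:04X})")
```

A different CCS assigns different integers to the same abstract character; JIS X 0208, for instance, identifies あ by a row and cell number rather than by 0x3042.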

Character Encoding Form (CEF)

A mapping from the set of integers used in a CCS to the set of sequences of code units.
A mapping between the set of code points and the set of code unit sequences.

Correspondence between CEF and CCS

A CEF for a CCS is defined to be a CEF that maps all of the encoded characters for that CCS.
A CEF for a given CCS is defined over every character contained in that CCS.

Examples

kuten (row-cell) code (JIS X 0208), 7-bit encoding form (US ASCII), UCS-2, UTF-16 (Unicode)

Note

Note that Shift-JIS is not an encoding form
Shift_JIS is not a CEF. (Shift_JIS maps to byte sequences, so it is a CES.)
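To make the Shift_JIS remark concrete, the sketch below serializes a JIS X 0208 kuten pair directly into bytes, once as EUC-JP and once as Shift_JIS. The arithmetic is the commonly documented conversion and is not taken from UTR #17; the function names are hypothetical, and only the two-byte JIS X 0208 part is handled (no ASCII, no half-width katakana).

```python
def _check(ku: int, ten: int) -> None:
    """Reject row/cell numbers outside the 94 x 94 JIS X 0208 grid."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError(f"kuten out of range: {ku}-{ten}")


def kuten_to_euc_jp(ku: int, ten: int) -> bytes:
    """Serialize a kuten pair as EUC-JP: each number is offset by 0xA0."""
    _check(ku, ten)
    return bytes([ku + 0xA0, ten + 0xA0])


def kuten_to_shift_jis(ku: int, ten: int) -> bytes:
    """Serialize a kuten pair as Shift_JIS."""
    _check(ku, ten)
    # Two rows share one lead byte; 0xA0..0xDF is skipped because
    # Shift_JIS uses that range for single-byte half-width katakana.
    lead = (ku - 1) // 2 + 0x81
    if lead > 0x9F:
        lead += 0x40
    if ku % 2 == 1:
        trail = ten + 0x3F
        if trail >= 0x7F:  # 0x7F is never used as a trail byte
            trail += 1
    else:
        trail = ten + 0x9E
    return bytes([lead, trail])


# あ is kuten 04-02 in JIS X 0208.
print(kuten_to_euc_jp(4, 2).hex(" "), "|", kuten_to_shift_jis(4, 2).hex(" "))
```

Both functions go from the row-cell number straight to bytes, which is the sense in which the memo treats Shift_JIS as a scheme over bytes rather than a form over code units.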

code unit

an integer occupying a specified binary width in a computer architecture, such as an 8-bit byte.
An integer of a fixed binary width in the computer's internal representation.
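A CEF can be sketched as a function from one code point to a list of fixed-width integers. The example below does this for the UTF-16 encoding form, whose code units are 16 bits wide. The function name utf16_code_units is made up for this memo; the surrogate arithmetic is the standard UTF-16 definition.

```python
def utf16_code_units(cp: int) -> list[int]:
    """Map one Unicode scalar value to its UTF-16 code units (16-bit integers)."""
    if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x10000:
        return [cp]  # BMP: one code unit, numerically equal to the code point
    v = cp - 0x10000  # 20 bits, split into two 10-bit halves
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


print([f"{u:04X}" for u in utf16_code_units(0x3042)])   # あ
print([f"{u:04X}" for u in utf16_code_units(0x1F600)])  # 😀
```

Note that the result is still a list of integers, not bytes; turning those integers into bytes is left to the CES.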

Character Encoding Scheme (CES)

a reversible transformation of sequences of code units to sequences of bytes
A reversible transformation between code unit sequences and byte sequences.

Byte-order

The CES must take into account the byte-order serialization of all code units wider than a byte that are used in the CEF.
When the CEF's code units are wider than a byte, the CES has to take byte order into account.
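The byte-order point can be shown with a few lines that serialize 16-bit code units both ways. This is a minimal sketch; the function name serialize_16bit is hypothetical, and the code units are written out by hand rather than computed.

```python
def serialize_16bit(code_units: list[int], byteorder: str) -> bytes:
    """Serialize 16-bit code units to bytes; byteorder is "big" or "little"."""
    return b"".join(u.to_bytes(2, byteorder) for u in code_units)


units = [0x3042, 0xD83D, 0xDE00]  # UTF-16 code units for "あ😀"
print(serialize_16bit(units, "big").hex(" "))
print(serialize_16bit(units, "little").hex(" "))
```

The same code unit sequence yields two different byte sequences, so a byte stream of 16-bit code units cannot be read back without knowing which order was used.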

Examples

ISO-2022-JP, EUC-JP, Shift_JIS, CP932 (a Microsoft code page)

The Unicode Standard has seven character encoding schemes:
UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
Unicode 1.1 had three character encoding schemes:
UTF-8, UCS-2BE, and UCS-2LE, although the latter two were not named that way at the time.

UCS-2BE and UCS-2LE are no longer CESs.
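The difference between UTF-16BE/UTF-16LE and plain UTF-16 can be tried out with Python's codecs. The codec names "utf-16-be", "utf-16-le" and "utf-16" are Python's own spellings, not names from UTR #17, so treat this as a sketch of one implementation's behavior.

```python
import codecs

text = "あ"
be = text.encode("utf-16-be")  # byte order fixed by the scheme name, no BOM
le = text.encode("utf-16-le")
print(be.hex(" "), "|", le.hex(" "))

# The plain "utf-16" scheme relies on a leading BOM to tell the two apart.
assert (codecs.BOM_UTF16_BE + be).decode("utf-16") == text
assert (codecs.BOM_UTF16_LE + le).decode("utf-16") == text

# Python's "utf-16" encoder writes a BOM followed by native-order code units.
marked = text.encode("utf-16")
assert marked[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
```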

Diagram

an abstract character
↓ CCS
a nonnegative integer
↓ CEF
a sequence of code units
↓ CES
a sequence of bytes
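The three arrows in the diagram can be written as three functions and composed. The sketch below uses Unicode as the CCS and UTF-8 as the CEF, so the code units are 8 bits wide and the CES step is trivial. All function names are made up for this memo.

```python
def ccs(ch: str) -> int:
    """Abstract character -> code point (Unicode as the CCS)."""
    return ord(ch)


def cef_utf8(cp: int) -> list[int]:
    """Code point -> sequence of 8-bit code units (UTF-8 encoding form)."""
    if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        return [cp]
    if cp < 0x800:
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:
        return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
    return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]


def ces_utf8(code_units: list[int]) -> bytes:
    """Code units -> bytes; each code unit is already exactly one byte."""
    return bytes(code_units)


def encode(text: str) -> bytes:
    """Compose CCS, CEF and CES over every character of text."""
    return b"".join(ces_utf8(cef_utf8(ccs(ch))) for ch in text)


print(encode("Aあ😀").hex(" "))
```

With a 16-bit or 32-bit CEF the last step would no longer be trivial, because the CES would then have to pick a byte order.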

Character Maps

The mapping from a sequence of members of an abstract character repertoire to a serialized sequence of bytes is called a Character Map (CM).
A mapping from characters to a specific byte sequence.

Synonyms

a charset, a character set, a code page (broadly construed), or a CHARMAP

A simple character map

A simple character map thus implicitly includes a CCS, a CEF, and a CES, mapping from abstract characters to code units to bytes.
A simple character map is CCS + CEF + CES.

A compound character map

A compound character map includes a compound CES, and thus includes more than one CCS and CEF.
A compound character map is (CCS + CEF){2,} + a compound CES, that is, more than one CCS/CEF pair.
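ISO-2022-JP, listed among the CES examples earlier, is a handy way to watch a compound scheme at work: escape sequences switch between ASCII and JIS X 0208 inside one byte stream. The sketch below relies on Python's "iso2022_jp" codec; the constant names are made up for this memo.

```python
ESC_JISX0208 = b"\x1b$B"  # escape sequence designating JIS X 0208-1983
ESC_ASCII = b"\x1b(B"     # escape sequence designating ASCII

encoded = "aあb".encode("iso2022_jp")
print(encoded)

# The stream switches to JIS X 0208 for あ and back to ASCII for b.
assert ESC_JISX0208 in encoded and ESC_ASCII in encoded
assert encoded.index(ESC_JISX0208) < encoded.index(ESC_ASCII)
assert encoded.decode("iso2022_jp") == "aあb"
```

The bytes between the two escape sequences are interpreted through a different CCS and CEF than the bytes outside them, which is what makes the character map compound.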

Relation to IANA charset

Character Maps are the entities that get IANA charset [charset] identifiers in the IAB architecture.
Character maps are what IANA charset identifiers are assigned to.

Relation to IBM CDRA

In the IBM CDRA architecture, Character Maps are the entities that get CCSID (coded character set identifier) values.
Character maps are what CCSID values are assigned to.

Relation to CES

In many cases, the same name is used for both a character map and for a character encoding scheme, such as UTF-16BE.
In many cases the same name is used for both a character map and a CES, as with UTF-16BE.