
UTF-1


Source: http://en.wikipedia.org/wiki/UTF-1
Updated: 2017-05-25T15:36Z

UTF-1 is one way of transforming ISO 10646/Unicode into a stream of bytes. Due to the design, it is not possible to resynchronise if decoding starts in the middle of a character (this makes truncation hard, among other things) and simple byte-oriented search routines cannot be reliably used with it. UTF-1 is also fairly slow due to its use of division by a number which is not a power of 2. Due to these issues, UTF-1 never gained wide acceptance and has been replaced by UTF-8.

Design

UTF-1 is a multi-byte encoding like UTF-8; a single Unicode code point can be encoded in one, two, three, or five octets. While the ASCII range is encoded as one octet, as in UTF-8, the ASCII octets 0x21 - 0x7E (decimal 33 - 126) are also used in UTF-1 multi-byte encodings; therefore UTF-1 is unsuited for many Internet protocols, including MIME.
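The reuse of ASCII octets inside multi-byte sequences can be illustrated with a short Python sketch. It takes the UTF-1 bytes that the comparison table in this article lists for U+0100 (A1 21) and runs a naive byte search for "!" (0x21). The variable names are illustrative only.

```python
# UTF-1 bytes for U+0100, copied from the comparison table (A1 21).
utf1_data = bytes([0xA1, 0x21])

# UTF-8 bytes for the same code point (C4 80), for contrast.
utf8_data = "\u0100".encode("utf-8")

# A naive byte search for ASCII '!' (0x21) matches the trailing octet in UTF-1 ...
false_hit_in_utf1 = b"!" in utf1_data

# ... but cannot match inside a UTF-8 multi-byte sequence, whose octets are all >= 0x80.
false_hit_in_utf8 = b"!" in utf8_data

print(false_hit_in_utf1, false_hit_in_utf8)  # True False
```

The false match in the UTF-1 data is the kind of error that makes simple byte-oriented search routines unreliable with this encoding.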

UTF-1 does not use the octets of the C0 and C1 control codes, the space character, or DEL in multi-byte encodings: any 0x00–0x20 or 0x7F–0x9F octet always stands for the corresponding code point in ISO-8859-1 (U+0000–0020 and U+007F–009F, respectively). This design, with 66 protected octets, was an attempt to be compatible with ISO 2022.

The UTF-1 encoding scheme uses "modulo 190" arithmetic (256 - 66 = 190); it was designed to encode the complete 31 bits of the original Universal Character Set (UCS-4). For comparison, UTF-8 protects all 128 ASCII octets, and needs two bits in trailing bytes of multi-byte encodings for this purpose, resulting in "modulo 64" arithmetic (8 - 2 = 6; 2^6 = 64). BOCU-1 protects only the minimal set required for MIME-compatibility (0x00, 0x07–0x0F, 0x1A–0x1B, and 0x20), resulting in "modulo 243" arithmetic (256 - 13 = 243).
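The three radix figures can be checked by counting the protected octets directly. The following Python sketch does only that; the set and variable names are illustrative, and the octet ranges are the ones stated in the text.

```python
# Octets UTF-1 keeps out of multi-byte sequences: 0x00-0x20 and 0x7F-0x9F.
utf1_protected = set(range(0x00, 0x21)) | set(range(0x7F, 0xA0))
utf1_radix = 256 - len(utf1_protected)

# BOCU-1 protected set as listed in the text: 0x00, 0x07-0x0F, 0x1A-0x1B, 0x20.
bocu1_protected = {0x00, 0x20} | set(range(0x07, 0x10)) | set(range(0x1A, 0x1C))
bocu1_radix = 256 - len(bocu1_protected)

# UTF-8 trailing bytes spend 2 of their 8 bits on marking, leaving 6 payload bits.
utf8_radix = 2 ** (8 - 2)

print(len(utf1_protected), utf1_radix)    # 66 190
print(len(bocu1_protected), bocu1_radix)  # 13 243
print(utf8_radix)                         # 64
```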

Code point   UTF-16BE      UTF-16LE      UTF-8               UTF-1
U+007F       00 7F         7F 00         7F                  7F
U+0080       00 80         80 00         C2 80               80
U+009F       00 9F         9F 00         C2 9F               9F
U+00A0       00 A0         A0 00         C2 A0               A0 A0
U+00BF       00 BF         BF 00         C2 BF               A0 BF
U+00C0       00 C0         C0 00         C3 80               A0 C0
U+00FF       00 FF         FF 00         C3 BF               A0 FF
U+0100       01 00         00 01         C4 80               A1 21
U+015D       01 5D         5D 01         C5 9D               A1 7E
U+015E       01 5E         5E 01         C5 9E               A1 A0
U+01BD       01 BD         BD 01         C6 BD               A1 FF
U+01BE       01 BE         BE 01         C6 BE               A2 21
U+07FF       07 FF         FF 07         DF BF               AA 72
U+0800       08 00         00 08         E0 A0 80            AA 73
U+0FFF       0F FF         FF 0F         E0 BF BF            B5 48
U+1000       10 00         00 10         E1 80 80            B5 49
U+4015       40 15         15 40         E4 80 95            F5 FF
U+4016       40 16         16 40         E4 80 96            F6 21 21
U+D7FF       D7 FF         FF D7         ED 9F BF            F7 2F C3
U+E000       E0 00         00 E0         EE 80 80            F7 3A 79
U+F8FF       F8 FF         FF F8         EF A3 BF            F7 5C 3C
U+FDD0       FD D0         D0 FD         EF B7 90            F7 62 BA
U+FDEF       FD EF         EF FD         EF B7 AF            F7 62 D9
U+FEFF       FE FF         FF FE         EF BB BF            F7 64 4C
U+FFFD       FF FD         FD FF         EF BF BD            F7 65 AD
U+FFFE       FF FE         FE FF         EF BF BE            F7 65 AE
U+FFFF       FF FF         FF FF         EF BF BF            F7 65 AF
U+10000      D8 00 DC 00   00 D8 00 DC   F0 90 80 80         F7 65 B0
U+38E2D      D8 A3 DE 2D   A3 D8 2D DE   F0 B8 B8 AD         FB FF FF
U+38E2E      D8 A3 DE 2E   A3 D8 2E DE   F0 B8 B8 AE         FC 21 21 21 21
U+FFFFF      DB BF DF FF   BF DB FF DF   F3 BF BF BF         FC 21 37 B2 7A
U+100000     DB C0 DC 00   C0 DB 00 DC   F4 80 80 80         FC 21 37 B2 7B
U+10FFFF     DB FF DF FF   FF DB FF DF   F4 8F BF BF         FC 21 39 6E 6C
U+7FFFFFFF   Error         Error         FD BF BF BF BF BF   FD BD 2B B9 40
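
The rows of the table can be reproduced with a short encoder. The Python below is a minimal sketch, not the normative algorithm from the standard: the length thresholds (0xA0, 0x100, 0x4016, 0x38E2E) and lead octets are read off the table, while the digit-to-octet mapping and the names utf1_encode and _t are assumptions chosen so that the output matches rows such as U+015E, U+4016, and U+10FFFF.

```python
def _t(z: int) -> int:
    """Map a base-190 digit (0..189) to an octet outside the 66 protected ones.

    Assumed mapping: digits 0..93 go to 0x21..0x7E, digits 94..189 to 0xA0..0xFF.
    """
    if z < 0x5E:
        return z + 0x21
    return z + 0x42


def utf1_encode(cp: int) -> bytes:
    """Encode one code point (0..0x7FFFFFFF) as UTF-1 octets (sketch)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit UCS-4 range")
    if cp < 0xA0:
        # One octet: ASCII, C1 controls, and the other protected values.
        return bytes([cp])
    if cp < 0x100:
        # Two octets with the fixed lead 0xA0.
        return bytes([0xA0, cp])
    if cp < 0x4016:
        # Two octets: lead 0xA1..0xF5 plus one base-190 digit.
        y = cp - 0x100
        return bytes([0xA1 + y // 190, _t(y % 190)])
    if cp < 0x38E2E:
        # Three octets: lead 0xF6..0xFB plus two base-190 digits.
        y = cp - 0x4016
        return bytes([0xF6 + y // 190**2, _t(y // 190 % 190), _t(y % 190)])
    # Five octets: lead 0xFC or 0xFD plus four base-190 digits.
    y = cp - 0x38E2E
    return bytes([
        0xFC + y // 190**4,
        _t(y // 190**3 % 190),
        _t(y // 190**2 % 190),
        _t(y // 190 % 190),
        _t(y % 190),
    ])


print(utf1_encode(0x015E).hex(" "))    # a1 a0
print(utf1_encode(0x4016).hex(" "))    # f6 21 21
print(utf1_encode(0x10FFFF).hex(" "))  # fc 21 39 6e 6c
```

The repeated division and remainder by 190 in this sketch is the non-power-of-2 arithmetic that makes UTF-1 comparatively slow; a UTF-8 encoder needs only shifts and masks.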


