% language=us

\startcomponent mk-goingutf

\environment mk-environment

\chapter{Going \UTF}

\LUATEX\ only understands input codes in the Universal Character
Set Transformation Format, aka \UCS\ Transformation Format, better
known as: \UTF. There is a good reason for this universal view
on characters: whatever support gets hard coded into the programs,
it's never enough, as 25 years of \TEX\ history have clearly
demonstrated. Macro packages often support more or less standard
input encodings, as well as local standards, user adapted ones,
etc.

There is enough information on the Internet and in books about what
exactly \UTF\ is. If you don't know the details yet: \UTF\ is a
multi||byte encoding. The characters with a bytecode up to 127 map
onto their normal \ASCII\ representation. A larger number indicates
that the following bytes are part of the character code. Up to 4~bytes
make an \UTF-8 code, while \UTF-16 uses one or two pairs of bytes.

\starttabulate[|c|c|c|c|c|]
\NC \bf byte 1 \NC \bf byte 2 \NC \bf byte 3 \NC \bf byte 4 \NC \bf unicode               \NC \NR
\NC 192--223   \NC 128--191   \NC            \NC            \NC 0x80--0x7f{}f             \NC \NR
\NC 224--239   \NC 128--191   \NC 128--191   \NC            \NC 0x800--0xf{}f{}f{}f       \NC \NR
\NC 240--247   \NC 128--191   \NC 128--191   \NC 128--191   \NC 0x10000--0x1f{}f{}f{}f{}f \NC \NR
\stoptabulate

In \UTF-8 the bytes in the range $128$--$191$ are illegal as the
first byte of a character. The bytes 254 and 255 are completely
illegal and should not appear at all, since they are related to
the \UTF-16 byte order mark.

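To make the table concrete, here is a minimal \LUA\ sketch that
decodes one \UTF-8 sequence into a \UNICODE\ code point. The
function name \type {utf8_decode} is ours, and no validation of
continuation bytes is done:

\starttyping
-- decode the utf-8 sequence starting at position i, following the
-- byte ranges in the table above; returns the code point and the
-- position of the next character
local function utf8_decode(str, i)
    local b = string.byte(str, i)
    if b < 0x80 then                         -- plain ascii
        return b, i + 1
    elseif b < 0xE0 then                     -- 192..223: two bytes
        local b2 = string.byte(str, i + 1)
        return (b - 0xC0) * 0x40 + (b2 - 0x80), i + 2
    elseif b < 0xF0 then                     -- 224..239: three bytes
        local b2, b3 = string.byte(str, i + 1, i + 2)
        return ((b - 0xE0) * 0x40 + (b2 - 0x80)) * 0x40
            + (b3 - 0x80), i + 3
    else                                     -- 240..247: four bytes
        local b2, b3, b4 = string.byte(str, i + 1, i + 3)
        return (((b - 0xF0) * 0x40 + (b2 - 0x80)) * 0x40
            + (b3 - 0x80)) * 0x40 + (b4 - 0x80), i + 4
    end
end
\stoptyping
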
Instead of providing a never|-|complete truckload of other input
formats, \LUATEX\ sticks to one input encoding but at the same
time provides hooks that permit users to write filters that
preprocess their input into \UTF.

While writing the \LUATEX\ code as well as the \CONTEXT\ input
handling, we experimented a lot. Right from the beginning we had
a pretty clear picture of what we wanted to achieve and how it
could be done, but in the end we arrived at solutions that permitted
fast and efficient \LUA\ scripting as well as a simple interface.

What is involved in handling any input encoding, and especially
\UTF? First of all, we wanted to support \UTF-8 as well as
\UTF-16. \LUATEX\ implements \UTF-8 in a rather straightforward
way: it just assumes that the input is usable \UTF. This means
that it does not combine characters. There is a good reason for
this: any automation needs to be configurable (on|/|off) and the
more is done in the core, the slower it gets.

In \UNICODE, when a character is followed by an \quote
{accent}, the standard may prescribe that these two characters are
replaced by one. Of course, when characters turn into glyphs, and
when no matching glyph is present, we may need to decompose any
character into components and paste them together from glyphs in
fonts. Therefore, as a first step, a collapser was written. In the
(pre|)|loaded \LUA\ tables we have stored information about
which combinations of characters need to be combined into another
character.

So, an \type {a} followed by an \type {`} becomes \type {à} and
an \type {e} followed by \type {"} becomes \type {ë}. This
process is repeated until no more sequences combine. After a few
alternatives we arrived at a solution that is acceptably fast:
mere milliseconds per average page. Experiments demonstrated that
we cannot gain much by implementing this in pure~C, but we did
gain some speed by using a dedicated loop||over||utf||string
function.

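The following lines illustrate the idea behind such a collapser.
The \type {combined} table is a stand-in for the (pre)loaded
tables mentioned above, and the helper name is invented for this
example:

\starttyping
-- combine a base character with a following combining mark; keep
-- going so that already combined characters can combine further
local combined = {
    ["a\204\128"] = "à",   -- a + combining grave (U+0300)
    ["e\204\136"] = "ë",   -- e + combining diaeresis (U+0308)
    -- ... and many more
}

local function collapse(str)
    local result, previous = { }, nil
    for current in string.utfcharacters(str) do
        local pair = previous and combined[previous .. current]
        if pair then
            previous = pair
        else
            result[#result+1] = previous
            previous = current
        end
    end
    result[#result+1] = previous
    return table.concat(result)
end
\stoptyping
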
A second \UTF\ related issue is \UTF-16. This coding scheme comes
in two endian variants. We wanted to do the conversion in \LUA,
but decided to play a bit with a multi||byte file read function.
After some experiments we quickly learned that hard coding such
methods in \TEX\ was doomed to be complex, and the whole idea
behind \LUATEX\ is to make things less complex. The complexity has
to do with the fact that we need some control over the different
linebreak triggers, that is, (combinations of) characters 10 and/or
13. In the end, the multi||byte readers were removed from the code
and we ended up with a pure \LUA\ solution, which could be sped up
by using a multi||byte loop||over||string function.

Instead of hard coding solutions in \LUATEX, a couple of fast
loop||over||string functions were added to the \LUA\ string
function repertoire and the solutions were coded in \LUA. We did
extensive timing with huge \UTF-16 encoded files, and are
confident that fast solutions can be found. Keep in mind that
reading files is never the bottleneck anyway. The only drawback
of an efficient \UTF-16 reader is that the file is loaded into
memory, but this is hardly a problem.

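For the record, here is a much simplified version of such a
conversion for the little endian case. It leans on the linked-in
unicode library (discussed later on) for encoding code points and
does no error checking:

\starttyping
-- convert little endian utf-16 (without byte order mark) to utf-8;
-- a surrogate pair encodes a code point beyond 0xFFFF
local function utf16le_to_utf8(data)
    local result, i, n = { }, 1, #data
    while i + 1 <= n do
        local low, high = string.byte(data, i, i + 1)
        local codepoint = high * 0x100 + low
        i = i + 2
        if codepoint >= 0xD800 and codepoint <= 0xDBFF and i + 1 <= n then
            local l2, h2 = string.byte(data, i, i + 1)
            codepoint = 0x10000 + (codepoint - 0xD800) * 0x400
                + (h2 * 0x100 + l2 - 0xDC00)
            i = i + 2
        end
        result[#result+1] = unicode.utf8.char(codepoint)
    end
    return table.concat(result)
end
\stoptyping
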
Concerning arbitrary input encodings, we can be brief. It's rather
easy to loop over a string and replace characters in the $0$--$255$
range by their \UTF\ counterparts. All one needs is to maintain
conversion tables, and \TEX\ macro packages have always done that.

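Such a remapper can be as simple as the following sketch, where
the regime table is a made-up Latin~1 one:

\starttyping
-- map each input byte onto its utf-8 counterpart; for latin 1 the
-- code point simply equals the byte value
local regime = { }
for b = 0, 255 do
    regime[string.char(b)] = unicode.utf8.char(b)
end

local function recode(line)
    return (string.gsub(line, ".", regime))
end
\stoptyping
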
Yet another (more obscure) kind of remapping concerns those special
\TEX\ characters. If we use a traditional \TEX\ auxiliary file, then
we must make sure that for instance percent signs, hashes, dollars
and other characters are handled right. If we set the catcode of
the percent sign to \quote {letter}, then we get into trouble when
such a percent sign ends up in the table of contents and is read in
under a different catcode regime (and becomes for instance a comment
symbol). One way to deal with such situations is to temporarily move
the problematic characters into a private \UNICODE\ area and deal
with them accordingly. In that case they can no longer interfere.

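The next lines sketch that idea. The offset and helper names are
invented for this example; real code also has to deal with catcodes
at the \TEX\ end:

\starttyping
-- park catcode sensitive characters in a private use plane when
-- writing, and bring them back when reading
local private = 0xF0000

local function obscure(chr)
    return unicode.utf8.char(private + string.byte(chr))
end

local function reveal(str)
    local result = { }
    for c in string.utfcharacters(str) do
        local u = unicode.utf8.byte(c)
        if u >= private and u < private + 0x100 then
            c = string.char(u - private)
        end
        result[#result+1] = c
    end
    return table.concat(result)
end
\stoptyping
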
Where do we handle such conversions? There are two places where
we can hook converters into the input:

\startitemize[n,packed]
\item each time we read a line from a file, i.e.\ we can hook
      conversion code into the read callbacks
\item using the special \type {process_input_buffer} callback which
      is called whenever \TEX\ needs a new line of input (see the
      sketch after this list)
\stopitemize

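The second hook, for instance, boils down to a few lines. The
callback name is the real \LUATEX\ interface; \type {recode}
stands for any converter like the ones sketched above:

\starttyping
-- every line that tex fetches from the input is first passed
-- through this function
callback.register("process_input_buffer", function(line)
    return recode(line)
end)
\stoptyping
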
Because we can overload the standard file open and read functions,
we can easily hook the \UTF\ collapse function into the readers.
The same is true for the \UTF-16\ handler. In \CONTEXT, for
performance reasons we load such files into memory, which means
that we also need to provide a special reader to \TEX. When
handling \UTF-16, we don't need to combine characters, so that
stage is then skipped.

So, to summarize this, here is what we do in \CONTEXT. Keep in
mind that we overload the standard input methods and therefore
have complete control over how \LUATEX\ locates and opens files.

\startitemize[n]

\item When we have a \UTF\ file, we will read from that file line
      by line, and combine characters when collapsing is enabled.

\item When \LUATEX\ wants to open a file, we look into the first
      bytes to see if it is a \UTF-16\ file, in either big or
      little endian format. When this is the case, we load the
      file into memory, convert the data to \UTF-8, identify
      lines, and provide a reader that will give back the file
      linewise (see the sketch after this list).

\item When we have been told to recode the input (i.e.\ when we have
      enabled an input regime) we use the normal line||by||line
      reader and convert those lines on the fly into valid \UTF.
      No collapsing is needed.

\stopitemize

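The sketch below shows what such an overloaded opener can look
like. The \type {open_read_file} callback is the real \LUATEX\
interface, but the body is simplified: real code needs more care
with byte order marks, line endings and error handling, and \type
{utf16le_to_utf8} stands for a converter like the one sketched
before:

\starttyping
-- load the file, convert utf-16 when we see a byte order mark, and
-- hand the lines back to tex one by one
callback.register("open_read_file", function(filename)
    local f = io.open(filename, "rb")
    if not f then
        return nil
    end
    local data = f:read("*a")
    f:close()
    if string.sub(data, 1, 2) == "\255\254" then -- little endian
        data = utf16le_to_utf8(string.sub(data, 3))
    end -- a big endian ("\254\255") branch would go here
    local lines = string.explode(data, "\n")     -- luatex helper
    local n = 0
    return {
        reader = function()
            n = n + 1
            return lines[n]
        end
    }
end)
\stoptyping
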
Because we conduct our experiments in \CONTEXT\ \MKIV, the code that
we provide may look a bit messy and more complex than the previous
description may suggest. But keep in mind that a mature macro
package needs to adapt to what users are accustomed to. The fact
that \LUATEX\ moved on to \UTF\ input does not mean that all the
tools that users use and the files that they have produced over
decades automagically convert as well.

Because we are now living in a \UTF\ world, we need to keep that
in mind when we do tricky things with sequences of characters, for
instance in processing verbatim. When we implement verbatim in
pure \TEX\ we can do as before, but when we let \LUA\ kick in, we
need to use string methods that are \UTF-aware. In addition to
the linked-in \UNICODE\ library, there are dedicated iterator
functions added to the \type {string} namespace; think of:

\starttyping
for c in string.utfcharacters(str) do
    something_with(c)
end
\stoptyping

Occasionally we need to output raw 8-bit code, for instance
to \DVI\ or \PDF\ backends (specials and literals). Of course
we could have cooked up a truckload of conversion functions
for this, but during one of our travels to a \TEX\ conference,
we came up with the following trick.

We reserve the top 256 values of the \UNICODE\ range, starting at
hexadecimal value 0x110000, for byte output. When writing to an
output stream, that offset will be subtracted. So, 0x1100A9 is written
out as hexadecimal byte value A9, which is the decimal value 169, which
in the Latin~1 encoding is the slot for the copyright sign.

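In \LUA\ terms the trick boils down to a few lines (the helper
name is ours):

\starttyping
-- code points at or above the offset leave the backend as raw bytes
local byte_offset = 0x110000

local function tobytes(u)
    if u >= byte_offset then
        return string.char(u - byte_offset) -- 0x1100A9 -> "\169"
    else
        return unicode.utf8.char(u)
    end
end
\stoptyping
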
\stopcomponent