lowlevel-characters.tex /size: 12 Kb    last modification: 2021-10-28 13:50
% language=us runpath=texruns:manuals/lowlevel

\environment lowlevel-style

\startdocument
  [title=characters,
   color=middlered]

\startsectionlevel[title=Introduction]

This explanation is part of the low level manuals because in practice users will
not have to deal with these matters in \MKIV\ and even less in \LMTX. You can
skip to the last section for commands.

\stopsectionlevel

\startsectionlevel[title=History]

If we travel back in time to when \TEX\ was written we end up in an eight bit
character universe. In fact, the first versions assumed seven bits, but for
comfortable use with languages other than English that was not sufficient.
Support for eight bits permits the usage of so called code pages as supported by
operating systems. Although \ASCII\ input became kind of the standard soon
afterwards, the engine can be set up for different encodings. This is not only
true for \TEX, but for many of its companions, like \METAFONT\ and therefore
\METAPOST. \footnote {This remapping to an internal representation (e.g.\ EBCDIC)
is not present in \LUATEX\ where we assume \UTF8 to be the input encoding. The
\METAPOST\ library that comes with \LUATEX\ still has that code but in
\LUAMETATEX\ it's gone. There one can set up the machinery to be \UTF8 aware
too.}

Core components of a \TEX\ engine are hyphenating words, applying
inter|-|character kerns and building ligatures. In traditional \TEX\ engines those
processes are interwoven into the par builder but in \LUATEX\ these are separate
stages. The original approach is the reason that there is a relation between the
input encoding and the font encoding: the character in the input is the slot used
in a reference to a glyph. When producing the final result (e.g.\ \PDF) there can
also be a mapping to an index in a font resource.

\starttyping
input A [tex ->] font slot A [backend ->] glyph index A
\stoptyping

The mapping that \TEX\ does is normally one|-|to|-|one but an input character can
undergo some transformation. For instance a character beyond \ASCII\ 126 can be
made active and expand to some character number that then becomes the font slot.
So, it is the expansion (or meaning) of a character that ends up as a numeric
reference in the glyph node. Virtual fonts can introduce yet another remapping
but that's only visible in the backend.

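As an illustration of such a transformation, here is a minimal plain \TEX\ sketch
for an eight bit engine. It is not taken from any macro package and rests on two
assumptions made up for the example: the input uses code page 437, where the a
with diaeresis is byte \type {"84}, and the font uses the Cork (T1) encoding,
where that character sits in slot \type {"E4}.

\starttyping[option=TEX]
% make the input byte "84 (a with diaeresis in code page 437) active
\catcode"84=\active

% let it expand to a reference to slot "E4 of the current (T1 encoded) font
\def^^84{\char"E4 }
\stoptyping
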
Actually, in \LUATEX\ the same happens but in practice there is no need to go
active because (at least in \CONTEXT) we assume a \UNICODE\ path so there the
font slot is the \UNICODE\ code point that we get from the \UTF8 input.

In the eight bit universe macro packages (have to) provide all kinds of means to
deal with (from the perspective of English) special characters. For instance,
\type {\"a} would put a diaeresis on top of the a or even better, refer to a
character in the encoding that the chosen font provides. Because there are some
limitations to what can go in an eight bit font, and because in different
countries the used \TEX\ fonts evolved more or less independently, we ended up
with quite some different variants of fonts. It was only with the Latin Modern
project that this became better. It is interesting that, when we consider the
fact that such a font often also has hardly used symbols (like registered or
copyright), coming up with an encoding vector that covers most (latin based)
European languages (scripts) is not impossible. \footnote {And indeed in the
Latin Modern project we came up with one but it was already too late for it to
become popular.} Special symbols could simply go into a dedicated font, also
because these are always accessed via a macro so who cares about the input. It
never happened.

Keep in mind that when \UTF8 is used with eight bit engines, \CONTEXT\ will
convert sequences of characters into a slot in a font (depending on the font
encoding used which itself depends on the coverage needed). For this every first
(possible) byte of a multibyte \UTF\ sequence is an active character, which is no
big deal because these are outside the \ASCII\ range. Normal \ASCII\ characters
are single byte \UTF\ sequences and fall through without treatment.

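The idea can be sketched in plain \TEX\ for an eight bit engine as follows. This
is only an illustration and not the actual \MKII\ code: the macro names that start
with \type {utf8:195:} are made up for the example, the slots assume a Cork (T1)
encoded font, and the second byte is assumed to have its default catcode (other).
The lead byte \type {"C3} is made active and picks up the next byte, after which
a lookup by name yields the font slot. A combination without an entry silently
gives nothing here.

\starttyping[option=TEX]
% the lead byte "C3 starts the two byte sequences for U+00C0 .. U+00FF
\catcode"C3=\active

% grab the second byte and look up a macro named after both byte values
\def^^c3#1{\csname utf8:195:\number`#1\endcsname}

% hypothetical table entries: second byte (in decimal) to T1 font slot
\expandafter\def\csname utf8:195:164\endcsname{\char"E4 } % C3 A4 = ä
\expandafter\def\csname utf8:195:182\endcsname{\char"F6 } % C3 B6 = ö
\expandafter\def\csname utf8:195:188\endcsname{\char"FC } % C3 BC = ü
\stoptyping
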
Anyway, in \CONTEXT\ \MKII\ we dealt with this by supporting mixed encodings,
depending on the (local) language, referencing the relevant font. It permits
users to enter the text in their preferred input encoding and also get the words
properly hyphenated. But we can leave these \MKII\ details behind.

\stopsectionlevel

\startsectionlevel[title=The heritage]

In \MKIV\ we got rid of input and font encodings, although one can still load
files in a specific code page. \footnote {I'm not sure if users ever depend on an
input encoding different from \UTF8.} We also kept the means to enter special
characters, if only because text editors seldom support(ed) a wide range of
visual editing of those. This is why we still have

\starttyping[option=TEX]
\"u \^a \v{s} \AE \ij \eacute \oslash
\stoptyping

and many more. The ones with one character names are rather common in the \TEX\
community but it is definitely a weird mix of symbols. The next two are kind of
outdated: these days you delegate that to the font handler, where turning them
into \quote {single} character references depends on what the font offers, how it
is set up with respect to (for instance) ligatures, and even might depend on
language or script.

The ones with the long names are partly tradition, but as we have a lot of them,
in \MKII\ they actually serve a purpose. These verbose names are used in the so
called encoding vectors and are part of the \UTF\ expansion vectors. They are
also used in labels so that we have a good indication of what goes in there:
remember that in those times editors often didn't show characters, unless the
font for display had them, or the operating system somehow provided them from
another font. These verbose names are used for latin, greek and cyrillic and for
some other scripts and symbols. They take up quite a bit of hash space and space
in the format file. \footnote {In \MKII\ we have an abstract front|-|end with
respect to encodings and also an abstract backend with respect to supported
drivers but both approaches no longer make sense today.}

\stopsectionlevel

\startsectionlevel[title=The \LMTX\ approach]

In the process of tagging all (public) macros in \LMTX\ (which happened in
2020|-|2021) I wondered if we should keep these one character macros, the
references to special characters and the verbose ones. When asked on the mailing
list it became clear that users still expect the short ones to be present, often
just because old \BIBTEX\ files are used that might need them. However, in \MKIV\
and \LMTX\ we load \BIBTEX\ files in a way that turns these special character
references into proper \UTF8 input so it makes a weak argument. Anyway, although
they could go, for now we keep them because users expect them. However, in \LMTX\
the implementation is somewhat different now, a bit more efficient in terms of
hash and memory, potentially a bit less efficient in runtime, but no one will
notice that.

A new command has been introduced, the very short \type {\chr}.

\startbuffer
\chr {à} \chr {á} \chr {ä}
\chr {`a} \chr {'a} \chr {"a}
\chr {a acute} \chr {a grave} \chr {a umlaut}
\chr {aacute}  \chr {agrave}  \chr {aumlaut}
\stopbuffer

\typebuffer[option=TEX]

In the first line the composed character is entered using two characters, a base
and a so called mark. Actually, one doesn't have to use \type {\chr} in that case
because \CONTEXT\ does already collapse characters for you. The second line looks
like the shortcuts \type {\`}, \type {\'} and \type {\"}. The third and fourth
lines could eventually replace the more symbolic long names, if we feel the need.
Watch out: in \UNICODE\ input the marks come {\em after} the base character.

\startlines \getbuffer \stoplines

Currently the repertoire is somewhat limited but it can easily be extended. It
all depends on user needs (doing Greek and Cyrillic for instance). The reason why
we actually save code deep down is that the helpers for this have always been
there. \footnote {So if needed I can port this approach back to \MKIV, but for
now we keep it as is because we then have a reference.}

The \type {\"} commands are now just aliases to more verbose and less hackish
looking macros:

\starttabulate[|||||]
    \NC \type {\withgrave}        \NC \withgrave       {a} \NC \type {\`} \NC \`{a} \NC \NR
    \NC \type {\withacute}        \NC \withacute       {a} \NC \type {\'} \NC \'{a} \NC \NR
    \NC \type {\withcircumflex}   \NC \withcircumflex  {a} \NC \type {\^} \NC \^{a} \NC \NR
    \NC \type {\withtilde}        \NC \withtilde       {a} \NC \type {\~} \NC \~{a} \NC \NR
    \NC \type {\withmacron}       \NC \withmacron      {a} \NC \type {\=} \NC \={a} \NC \NR
    \NC \type {\withbreve}        \NC \withbreve       {e} \NC \type {\u} \NC \u{e} \NC \NR
    \NC \type {\withdot}          \NC \withdot         {c} \NC \type {\.} \NC \.{c} \NC \NR
    \NC \type {\withdieresis}     \NC \withdieresis    {e} \NC \type {\"} \NC \"{e} \NC \NR
    \NC \type {\withring}         \NC \withring        {u} \NC \type {\r} \NC \r{u} \NC \NR
    \NC \type {\withhungarumlaut} \NC \withhungarumlaut{u} \NC \type {\H} \NC \H{u} \NC \NR
    \NC \type {\withcaron}        \NC \withcaron       {e} \NC \type {\v} \NC \v{e} \NC \NR
    \NC \type {\withcedilla}      \NC \withcedilla     {e} \NC \type {\c} \NC \c{e} \NC \NR
    \NC \type {\withogonek}       \NC \withogonek      {e} \NC \type {\k} \NC \k{e} \NC \NR
\stoptabulate

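In running text the verbose commands and their short aliases can be mixed with
direct \UTF8 input. A small test file that only uses commands from the table
could look as follows; on each line the three spellings are expected to give the
same result, provided that the font has the character:

\starttyping[option=TEX]
\starttext
    \withacute{a}    \'{a} á \par
    \withdieresis{e} \"{e} ë \par
    \withcaron{e}    \v{e} ě
\stoptext
\stoptyping
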
Not all fonts have these special characters. Most natural is to have them
available as precomposed single glyphs, but it can be that they are just two
shapes with the marks anchored to the base. It can even be that the font somehow
overlays them, assuming (roughly) equal widths. The \type {compose} font feature
in \CONTEXT\ can normally handle most cases well.

An occasional ugly rendering doesn't matter that much: better to have something
than nothing. But when it's the main language (script) that needs them you'd
better look for a font that handles them. When in doubt, in \CONTEXT\ you can
enable checking:

\starttabulate[|l|l|]
    \BC command                           \BC equivalent to \NC \NR
    \NC \type {\checkmissingcharacters}   \NC \type{\enabletrackers[fonts.missing]} \NC \NR
    \NC \type {\removemissingcharacters}  \NC \type{\enabletrackers[fonts.missing=remove]} \NC \NR
    \NC \type {\replacemissingcharacters} \NC \type{\enabletrackers[fonts.missing=replace]} \NC \NR
    \NC \type {\handlemissingcharacters}  \NC \type{\enabletrackers[fonts.missing={decompose,replace}]} \NC \NR
\stoptabulate

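When trying out a new font, a test file along the following lines might help. It
is a sketch only: the feature set name \type {composed} is made up for the
example, it is assumed that the feature is switched on with \type {compose=yes},
and \type {Serif} stands for whatever font is being checked.

\starttyping[option=TEX]
% a feature set that inherits from default and adds the compose feature
\definefontfeature[composed][default][compose=yes]

% report missing characters and show a replacement symbol for them
\replacemissingcharacters

\starttext
    \definedfont[Serif*composed]
    \withogonek{e} \withhungarumlaut{u} \withring{u}
\stoptext
\stoptyping
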
The decompose variant will try to turn a composed character into its components
so that at least you get something. If that fails it will inject a replacement
symbol that stands out so that you can check it. The console also mentions
missing glyphs. You don't need to enable this by default \footnote {There is some
overhead involved here.} but you might occasionally do it when you use a font for
the first time.

In \LMTX\ this mechanism has been upgraded so that replacements follow the shape
and are actually real characters. The decomposition has not yet been ported back
to \MKIV.

The full list of commands can be queried when a tracing module is loaded:

\startbuffer
\usemodule[s][characters-combinations]

\showcharactercombinations
\stopbuffer

\typebuffer

We get this list:

\getbuffer

Some combinations are special for \CONTEXT\ because \UNICODE\ doesn't specify
decomposition for all composed characters.

\stopsectionlevel

\stopdocument

% on an old machine, so consider them just relative measures
%
% mkiv  lmtx
%
% 0.012 0.009 % faster core code
% 0.028 0.036 % different io code path
% 0.055 0.043 % different io code path / faster core code
% 0.156 0.129 % more efficient resolving
% 0.153 0.119 % more efficient resolving
%
% \ifdefined\withdieresis\else\let\withdieresis\"\fi % for mkiv
%
% \setbox0\hpack{\testfeatureonce{100000}{ü}}                \par \elapsedtime \par % direct
% \setbox0\hpack{\testfeatureonce{100000}{ü}}                \par \elapsedtime \par % composed (input)
% \setbox0\hpack{\testfeatureonce{100000}{u{}̈}}              \par \elapsedtime \par % overlay
% \setbox0\hpack{\testfeatureonce{100000}{\withdieresis{u}}} \par \elapsedtime \par % official also \"u
% \setbox0\hpack{\testfeatureonce{100000}{\" u}}             \par \elapsedtime \par % alias of previous