% language=us runpath=texruns:manuals/ontarget
% ontarget-active.tex, last modification: 2024-01-16 10:21

\startcomponent ontarget-active

\environment ontarget-style

\usemodule[system-tokens]

\startchapter[title={Active characters}]

Each character in \TEX\ has a so-called category code. Most are of category
\quote {letter} or \quote {other character} but some have a special meaning, like
\quote {superscript} or \quote {subscript} or \quote {math shift}. Of course the
backslash is special too and it has the \quote {escape} category.

A single character can also be a command, in which case it has category \quote
{active}. In \CONTEXT\ the \type {|} is an example of that. It grabs an argument
delimited by yet another such (active) bar and handles that argument as a compound
character.

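As a small illustration of the compound mechanism (the words are only
placeholders), the text between two bars tells \CONTEXT\ what to put between the
parts of a compound word, while an empty pair should give the default compound
hyphen:

\starttyping
intra|-|word   % explicit hyphen as compound character
this|/|that    % a slash as compound character
intra||word    % empty argument: the default compound hyphen
\stoptyping
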
From the perspective of \CONTEXT\ we have a couple of challenges with respect to
active characters.

\startitemize
\startitem
    We want to limit the number of special symbols so we only really have to deal with the
    active bar and tilde. Both have a history starting with \MKII.
\stopitem
\startitem
    There are cases where we don't want them to be active, most noticeably in
    math and verbatim. This means that we have to make sure that they are not
    active there but, in nested exceptions (for instance when we flush a page
    halfway verbatim), are made active again.
\stopitem
\startitem
    In text we always had catcode regimes to deal with this (which is actually
    why in \LUATEX\ efficient catcode tables were one of the first native
    features to implement). This involves some namespace management.
\stopitem
\startitem
    In math we have to fall back on a different meaning which adds another
    (meaning) axis alongside catcode regimes: in math we use the same catcode
    regime as in text so we have a mode dependent meaning on top of the catcode
    regime specific one.
\stopitem
\startitem
    In math we have this special active class|/|character definition value \type
    {"8000} that makes characters active in math only. We use(d) that for permitting
    regular hat and underscore characters in text mode but let them act as
    superscript and subscript triggers in math mode.
\stopitem
\startitem
    Active characters travel in a special way through the system: they are
    actually stored as macro calls in token lists and macro bodies. This normally
    goes unnoticed (and is not that different from other catcodes being frozen in
    macros).
\stopitem
\stopitemize

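The math-only variant mentioned in the list can be sketched in traditional \TEX\
terms as follows. This is not the actual \CONTEXT\ code, only a minimal
illustration that assumes the underscore still has its subscript catcode at the
start, and the name \type {\mysubscript} is made up for the occasion:

\starttyping
\let\mysubscript=_     % remember the subscript token (hypothetical name)
\catcode`\_=12         % in text an underscore is just a character
\mathcode`\_="8000     % in math it behaves as if it were active
\begingroup
\catcode`\_=13         % make it active so that we can define it
\gdef_{\mysubscript}   % the meaning that math mode picks up
\endgroup
\stoptyping
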
So far we could always comfortably implement whatever we wanted but sometimes the
code was not that pretty. Because part of the \LUAMETATEX\ project is to make
code cleaner, I started wondering if we could come up with a better mechanism for
dealing with active characters, especially in math. Among the other reasons were:
less tracing clutter, a bit more natural approach, and fewer intercepts for
special cases. Of course we have to be compatible. Some first experiments were
promising but as usual it took a while to identify all the cases we have to deal
with. At times I wondered if I should go forward but as I stepwise adapted the
\CONTEXT\ code to the experiment there was no way back. I did however reject
experiments that put active characters in the catcode table namespaces.

In \LUATEX\ (and its predecessors) internally active characters are stored as a
reference to a control sequence, although a \type {\show} or trace will report
the character as \quote {name}. For example:

\startbuffer
\catcode `!=\activecatcode
\def !{whatever} % we also have \letcharcode
\def\foo{x!x}
\stopbuffer

\typebuffer

is stored as (cs, cmd, chr):

\start
\getbuffer
\luatokentable\foo
\stop

However, when we want some more hybrid approach, a text versus math mix, we need
to postpone resolving into a control sequence. Examples are macro bodies and
token registers. When we flag a character (with \type {\amcode}) as being of a
different catcode than active in math mode, we get the following:

\startbuffer
\amcode`! \othercatcode
\catcode `!=\activecatcode
\def !{whatever}
\def\foo{x!x}
\stopbuffer
\typebuffer

\start
\getbuffer
\luatokentable\foo
\stop

The difference is that here we get the active character in the body of the macro.
Interesting is that this is not something that the parser is prepared for so the
main loop now has to catch active characters. This is no big deal but also not
something to neglect. The same is true for serialization of tokens.

Other situations where we need to be clever are for instance when we try to enter
math mode. In math mode we want the (in text) active character as math character
and a convenient test is checking the mode. However, when we see \type {$} we are
not yet in math mode and as \TEX\ looks for a potential next \type {$}, an
active character that it grabs should not resolve into a reference to a command.
The reason for that is that when \TEX\ pushes back the token (because it doesn't
see a \type {$}) we need it to be an active character and not a control sequence.
If it were a control sequence we would see it as such in math mode which is not
what we intended. It is one of these cases where \TEX\ does not roundtrip. Similar
cases occur when \TEX\ looks ahead for (what makes a) number and doesn't see one
which then results in a push back. Actually, there are many look ahead and push
back moments in the source.

\startbuffer
text: \def\foo{x|!|x}

\meaningasis\foo

\luatokentable\foo

$x\foo x$ \foo
\stopbuffer

\typebuffer \start\getbuffer\stop

\startbuffer
math: $\gdef\oof{x|!|x}$

\meaningasis\oof

\luatokentable\oof

$x\oof x$ \oof
\stopbuffer

\typebuffer \start\getbuffer\stop

\startbuffer
toks: \scratchtoks{x|!|x}

\detokenize\expandafter{\the\scratchtoks}

\luatokentable\scratchtoks

$x\the\scratchtoks x$ \the\scratchtoks
\stopbuffer

\typebuffer \start\getbuffer\stop

A good test case for \CONTEXT\ is:

\startbuffer
\def\foo{x|!|x||x}

 x|!|x||x + \foo
$x|!|x||x + \foo$
\stopbuffer

\typebuffer

Here we expect bars in math mode but the compound mechanism to be applied in text
mode:

\startlines\getbuffer\stoplines

So the bottom line is this:

\startitemize
\startitem
    Active characters should behave as expected, which means that they get
    replaced by references to commands.
\stopitem
\startitem
    When the \type {\amcode} is set, this signals the engine to delay that
    replacement and retain the active character.
\stopitem
\startitem
    When the moment is there the engine either expands it as command (text mode)
    or injects the alternative meaning (math mode) based on the catcode given by
    \type {\amcode}. There we support letters, other characters, super- and
    subscripts and alignment codes. The rest we simply ignore (for now).
\stopitem
\stopitemize

Of course you can abuse this mechanism and also retain the character's active
property in text mode by simply setting the \type {\amcode}. We'll see how that
works out. Actually this mechanism was provided in the first place to get around
the \type {"8000} limitations! So here is another cheat:

\starttyping
\catcode `^ \othercatcode       % so a ^ is just that
\amcode  `^ \superscriptcatcode % but a ^ in math signals a superscript
\stoptyping

So, the \type {a} in \type {\amcode} stands for both \quote {active} and \quote
{alternative}. As mentioned, because we distinguish between math and text mode we
no longer need to adapt the meaning of active commands: think of using \type
{\mathtext} in a formula where we leave math mode and then need to use the text
meaning of the bar, just as outside the formula.

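A hypothetical formula, assuming that the bar is active in text and has its \type
{\amcode} set, shows both meanings side by side. The bars around the \type {x}
are expected to end up as math characters while the ones inside \type {\mathtext}
should trigger the compound mechanism. The content is only a placeholder:

\starttyping
$ |x| + \mathtext{intra|-|word} $
\stoptyping
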
In the end, because we only have a few active characters and no user ever
demanded namespaces, that mechanism was declared obsolete. There is no need to
keep code around that is not really used any more.

% Although this mechanism works okay, there is a pitfall. When you define a macro, and
% \type {\amcode} is set, the active character is stored as such. That means that doing
% something like this is likely to fail:
%
% \starttyping
% \def\whatever{\let~\space}
% \stoptyping
%
% when the tilde is active as well as has a \type {\amcode} set. However,
%
% \starttyping
% \def\whatever{\letcharcode\tildeasciicode\space}
% \stoptyping
%
% will work just fine.

Internally an active character is stored in the hash that also stores regular
control sequences. The character becomes a \UTF\ string prefixed by the \UTF\
value of \type {0xFFFF}, which is a noncharacter in \UNICODE. The \type {\csactive}
primitive is a variant on \type {\csstring} that returns this hash key. Its
companion \type {\expandactive} (a variant on \type {\expand}) can be used to
inject the related control sequence. If \type {\csactive} is not followed by an
active character it expands to just the prefix, as does \type {\Uchar"FFFF} but a
bit of abstraction makes sense.

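A minimal sketch of how these primitives could be compared, assuming a protected
active tilde with its \type {\amcode} set, as in \CONTEXT. The macro names are
arbitrary, the comments only paraphrase the description above, and the exact
trace depends on the engine version:

\starttyping
\edef\fooA{~}              % should keep the active character as such
\edef\fooB{\expandactive~} % should inject the related control sequence
\edef\fooC{\csactive~}     % the hash key: the U+FFFF prefix and the character

\luatokentable\fooA
\luatokentable\fooB
\luatokentable\fooC
\stoptyping
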
% control sequence: xxxx
% 271731  13  126  active char
% control sequence: xxxx
% 271732  135    0  protected call  ~
% control sequence: xxxx
% 271734   12  65535  other char      (U+0FFFF)
% 408124  135      0  protected call  ~

\stopchapter

\stopcomponent
253