% evenmore-keywords.tex (size: 18 Kb, last modification: 2021-10-28 13:50)
% language=us runpath=texruns:manuals/evenmore

% Talking of keywords: Jacob Collier, Count The People is definitely an example
% of showing keywords and no way that the fonts used there are done by tex:
%
% https://www.youtube.com/watch?v=icplHV25fqs

\environment evenmore-style

\startcomponent evenmore-keywords

\startchapter[title=Keywords]

Some primitives in \TEX\ can take one or more optional keywords and|/|or keywords
followed by one or more values. In traditional \TEX\ this concerns a handful of
primitives, in \PDFTEX\ there are plenty of backend|-|related primitives,
\LUATEX\ introduced optional keywords to some math constructs and attributes to
boxes, and \LUAMETATEX\ adds some more too. The keyword scanner in \TEX\ is
rather special. Keywords are used in cases like:

\starttyping
\hbox spread 10cm {...}
\advance\scratchcounter by 10
\vrule width 3cm height 1ex
\stoptyping

Sometimes there are multiple keywords, as with rules, in which case you can
imagine a case like:

\starttyping
\vrule width 3cm depth 1ex width 10cm depth 0ex height 1ex\relax
\stoptyping

Here we add a \type {\relax} to end the scanning. If we don't do that and the
rule specification is followed by arbitrary (read:\ unpredictable) text, the next
word might be a valid keyword: when followed by a dimension (unlikely) it will
happily be read as a directive, and when not followed by a dimension an error
message will show up. Sometimes the scanning is more restricted, as with glue
where the optional \type {plus} and \type {minus} have to come in that order, but
when they are missing, again a word from the text can be picked up if one doesn't
explicitly end with a \type {\relax} or some other token.

\starttyping
\scratchskip = 10pt plus 10pt minus 10pt % okay
\scratchskip = 10pt plus 10pt            % okay
\scratchskip = 10pt minus 10pt           % okay
\scratchskip = 10pt minus 10pt plus 10pt % typesets "plus 10pt"
\scratchskip = 10pt plus whatever        % an error
\stoptyping

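To make the danger of an unterminated glue specification concrete, here is a
minimal sketch. The macro name \type {\MySkip} is hypothetical; the point is only
that a macro ending in a glue specification keeps scanning into whatever follows
it:

\starttyping
\def\MySkip{\hskip 10pt}       % hypothetical macro, glue not terminated

a\MySkip plus 5pt is too much   % "plus 5pt" silently becomes stretch
a\MySkip plus or minus a bit    % error: a dimension is expected after "plus"

\def\MySkip{\hskip 10pt\relax} % the \relax ends the scanning

a\MySkip plus 5pt is too much   % now "plus 5pt is too much" is typeset
\stoptyping
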
The scanner is case insensitive, so the following specifications are all valid:

\starttyping
\hbox To 10cm {To}
\hbox TO 10cm {TO}
\hbox tO 10cm {tO}
\hbox to 10cm {to}
\stoptyping

It happens that keywords are always simple English words so the engine uses a
cheap check deep down, just offsetting to uppercase, but of course that will not
work for arbitrary \UTF-8\ (as used in \LUATEX) and it's also unrelated to the
upper- and lowercase codes as \TEX\ knows them.

The above lines scan for the keyword \type {to} and after that for a dimension.
While keyword scanning is case tolerant, dimension scanning is period tolerant:

\starttyping
\hbox to 10cm   {10cm}
\hbox to 10.0cm {10.0cm}
\hbox to .0cm   {.0cm}
\hbox to .cm    {.cm}
\hbox to 10.cm  {10.cm}
\stoptyping

These are all valid and according to the specification; even the single period is
okay, although it looks funny. It would not be hard to intercept that but I guess
that when \TEX\ was written anything that could harm performance was avoided.
One can even argue in favor of this tolerance for cases like:

\starttyping
\hbox to \first.\second cm {.cm}
\stoptyping

Here \type {\first} and|/|or \type {\second} can be empty. Most users won't
notice these side effects of scanning numbers anyway.

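As a small illustration of why the tolerant fraction scanning can be convenient,
assume that the two (hypothetical) macros \type {\first} and \type {\second} hold
the digits before and after the period. All of the following are accepted,
including the case where both are empty:

\starttyping
\def\first{12}\def\second{5}
\hbox to \first.\second cm {...} % scanned as 12.5cm

\def\first{}\def\second{5}
\hbox to \first.\second cm {...} % scanned as .5cm

\def\first{}\def\second{}
\hbox to \first.\second cm {...} % scanned as .cm, so 0cm
\stoptyping
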
The reason for writing up this discussion of keywords is the following. Optional
keyword scanning is kind of costly, not so much now, but more so decades ago
(which led to some interesting optimizations, as we'll see). For instance, in the
first line below, there is no keyword. The scanner sees a \type {1} and, as that
cannot start the keyword, pushes that character back into the input.

\starttyping
\advance\scratchcounter 10
\advance\scratchcounter by 10
\stoptyping

In the case of:

\starttyping
\scratchskip 10pt plux
\stoptyping

it has to push back the four scanned tokens \type {plux}. Now, in the engine
there are lots of cases where lookahead happens and when a condition is not
satisfied, the just|-|read token is pushed back. Incidentally, when picking up
the next token triggered some expansion, it's not the original next token that
gets pushed back, but the first token seen after the expansion. Pushing back
tokens is not that inefficient, although it involves allocating a token and
pushing and popping input stacks (we're talking of a mix of reading from file,
token memory, \LUA\ prints, etc.)\ but it always takes a little time and memory.
In \LUATEX\ there are more keywords for boxes, and there we have loops too: in a
box specification one or more optional attributes are scanned before the optional
\type {to} or \type {spread}, so again there can be push back when no more \type
{attr} keywords are seen.

\starttyping
\hbox attr 1 98 attr 2 99 to 1cm{...}
\stoptyping

In \LUAMETATEX\ there is even more optional keyword scanning, but we leave that
for now and just show one example:

\starttyping
\hbox spread 10em {\hss
    \hbox orientation 0 yoffset  1mm to 2em   {up}\hss
    \hbox                            to 2em {here}\hss
    \hbox orientation 0 xoffset -1mm to 2em {down}\hss
}
\stoptyping

Although one cannot mess too much with these low|-|level scanners there was room
for some optimization, so the penalty we pay for more keyword scanning in
\LUAMETATEX\ is not that high. (I try to compensate when adding features that
have a possible performance hit with some gain elsewhere.)

It will be no surprise that there can be interesting side effects to keyword
scanning. For instance, using the two character keyword \type {by} in an \type
{\advance} can be more efficient because nothing needs to be pushed back. The
same is true for the sometimes optional equal:

\starttyping
\scratchskip = 10pt
\stoptyping

Similar impacts on efficiency can be found in the way the end of a number is
seen: basically anything not resolving to a number (or digit) ends it. (For
these, assume a following token will terminate the number if needed; we're
focusing on the spaces here.)

\starttyping
\scratchcounter 10%          space not seen, ends \cs
\scratchcounter =10%         no push back of optional =
\scratchcounter = 10%        extra optional space gobble
\scratchcounter = 10 %       efficient ending of number scanning
\scratchcounter = 10\relax % depending on engine less efficient
\stoptyping

In the above examples scanning the number involves: skipping over spaces,
checking for an optional equal, skipping over spaces, scanning for a sign,
checking for an optional octal or hexadecimal trigger (single or double quote
character), and scanning the number till a non|-|digit is seen. In the case of
dimensions there is fraction scanning as well as unit scanning too.

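The steps just listed can be seen at work in a few assignments. This is only a
sketch that uses the \type {\scratchcounter} register from the earlier examples;
the values in the comments follow from the standard rules for number scanning:

\starttyping
\scratchcounter =  10 %   spaces around the optional equal are skipped: 10
\scratchcounter = - 10 %  a sign, with an optional space after it: -10
\scratchcounter = -- 10 % signs accumulate, two minus signs cancel: 10
\scratchcounter = '17 %   a single quote triggers octal: 15
\scratchcounter = "1F %   a double quote triggers hexadecimal: 31
\stoptyping
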
In any case, the equal is optional and kind of a keyword. Having an equal can be
more efficient than not having one, again due to push back in case of no equal
being seen. In the process spaces have been skipped, so add to the overhead the
scanning for optional spaces. In \LUAMETATEX\ all that has been optimized a bit.
By the way, in dimension scanning \type {pt} is actually a keyword and as there
are several units possible quite some push back can happen there, but we
scan for the most likely candidates first.

All that said, we're now ready for a surprise. The keyword scanner gets a string
that it will test for, say, \type {to} in case of a box specification. It then
will fetch tokens from whatever provides the input. A token encodes a so|-|called
command and a character and can be related to a control sequence. For instance,
the character \type {t} becomes a letter command with related value \number`t.
So, we have three properties: the command code, the character code and the
control sequence code. Now, instead of checking if the command code is a letter
or other character (two checks) a fast check happens for the control sequence
code being zero. If that is the case, the character code is compared. In practice
that works out well because the characters that make up a keyword are in the
range \number"41--\number"5A\ and \number"61--\number"7A, and all other character
codes are either below that (the ones that relate to primitives where the
character code is actually a subcommand of a limited range) or much larger
numbers that, for instance, indicate an entry in some array, where the first
useful index is above the mentioned ranges.

The surprise is in the fact that there is no checking for letters or other
characters, so this is why the following code will work too: \footnote {No longer
in \LUAMETATEX\ where we do a somewhat more robust check.}

\starttyping
\catcode `O= 1 \hbox tO 10cm {...} % { begingroup
\catcode `O= 2 \hbox tO 10cm {...} % } endgroup
\catcode `O= 3 \hbox tO 10cm {...} % $ mathshift
\catcode `O= 4 \hbox tO 10cm {...} % & alignment
\catcode `O= 6 \hbox tO 10cm {...} % # parameter
\catcode `O= 7 \hbox tO 10cm {...} % ^ superscript
\catcode `O= 8 \hbox tO 10cm {...} % _ subscript
\catcode `O=11 \hbox tO 10cm {...} %   letter
\catcode `O=12 \hbox tO 10cm {...} %   other
\stoptyping

In the first line, if we changed the catcode of \type {T} (instead of \type {O}),
it would give an error because \TEX\ sees a begin group character (category code
1) and starts the group, but as a second character in a keyword (\type {O}) it's
okay because \TEX\ will not look at the category code.

Of course only the cases \type {11} and \type {12} make sense in practice.
Messing with the category codes of regular letters this way will definitely give
problems with processing normal text. In a case like:

\starttyping
{\catcode `o=3 \hbox to 10cm {oeps}} % $ mathshift {oeps}
{\catcode `O=3 \hbox to 10cm {Oeps}} % $ mathshift {$eps}
\stoptyping

we have several issues: the primitive control sequence \type {\hbox} has an \type
{o} so \TEX\ will stop after \type {\hb} which can be undefined or a valid macro
and what happens next is hard to predict. Using uppercase will work but then the
content of the box is bad because there the \type {O} enters math. Now consider:

\starttyping
{\catcode `O=3 \hbox tO 10cm {Oeps Oeps}} % {$eps $eps}
\stoptyping

This will work because there are now two \type {O}'s in the box, so we have
balanced inline math triggers. But how does one explain that to a user? (Who
probably doesn't understand where an error message comes from in the first
place.) Anyway, this kind of tolerance is still not pretty, so in \LUAMETATEX\ we
now check for the command code and stick to letters and other characters. On
today's machines (and even on my by now ancient workhorse) the performance hit
can be neglected.

In fact, by intercepting the weird cases we also avoid an unnecessary case check
when we fall through the zero control sequence test. Of course that also means
that the above mentioned category code trickery doesn't work any more: only
letters and other characters are now valid in keyword scanning. Now, it can be
that some macro programmer actually used those side effects, but apart from some
macro hacker being hurt because mastery of those details can no longer be shown
off, it is users that we care more for, don't we?

To be sure, the above mentioned performance of keyword and equal scanning is not
that relevant in practice. But for the record, here are some timings on a laptop
with an i7-3849\cap{QM} processor using \MINGW\ binaries on a 64-bit \MSWINDOWS\
10 system. Each time is the average of five runs of a million such assignments
or advancements.

\starttabulate[|l|c|c|c|]
\FL
\NC one million times                    \NC terminal       \NC \LUAMETATEX\ \NC \LUATEX \NC \NR
\ML
\NC \type {\advance\scratchcounter 1}    \NC space          \NC 0.068 \NC 0.085 \NC \NR
\NC \type {\advance\scratchcounter 1}    \NC \type {\relax} \NC 0.135 \NC 0.149 \NC \NR
\NC \type {\advance\scratchcounter by 1} \NC space          \NC 0.087 \NC 0.099 \NC \NR
\NC \type {\advance\scratchcounter by 1} \NC \type {\relax} \NC 0.155 \NC 0.161 \NC \NR
\NC \type {\scratchcounter 1}            \NC space          \NC 0.057 \NC 0.096 \NC \NR
\NC \type {\scratchcounter 1}            \NC \type {\relax} \NC 0.125 \NC 0.151 \NC \NR
\NC \type {\scratchcounter=1}            \NC space          \NC 0.063 \NC 0.080 \NC \NR
\NC \type {\scratchcounter=1}            \NC \type {\relax} \NC 0.131 \NC 0.138 \NC \NR
\LL
\stoptabulate

We differentiate here between using a space as terminal or a \type {\relax}. The
latter is a bit less efficient because more code is involved in resolving the
meaning of the control sequence (which eventually boils down to nothing) but
nevertheless, these are not timings that one can lose sleep over, especially when
the rest of a decent \TEX\ run is taken into account. And yes, \LUAMETATEX\
(\LMTX) is a bit faster here than \LUATEX, but I would be disappointed if that
weren't the case.

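Timings like these can be collected with something like the following sketch. It
assumes the \CONTEXT\ helpers \type {\testfeatureonce} and \type {\elapsedtime};
the exact setup that was used for the table is not recorded here, so take this as
one possible way to repeat the experiment and not as the original test file:

\starttyping
\testfeatureonce{1000000}{\advance\scratchcounter 1 }      \elapsedtime\crlf
\testfeatureonce{1000000}{\advance\scratchcounter 1\relax} \elapsedtime\crlf
\testfeatureonce{1000000}{\advance\scratchcounter by 1 }   \elapsedtime\crlf
\testfeatureonce{1000000}{\scratchcounter=1 }              \elapsedtime\crlf
\stoptyping
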
% luametatex:

% \luaexpr{(0.068+0.070+0.069+0.067+0.068)/5} 0.068\crlf
% \luaexpr{(0.137+0.132+0.136+0.137+0.134)/5} 0.135\crlf
% \luaexpr{(0.085+0.088+0.084+0.089+0.087)/5} 0.087\crlf
% \luaexpr{(0.145+0.160+0.158+0.156+0.154)/5} 0.155\crlf
% \luaexpr{(0.060+0.055+0.059+0.055+0.056)/5} 0.057\crlf
% \luaexpr{(0.118+0.127+0.128+0.122+0.130)/5} 0.125\crlf
% \luaexpr{(0.063+0.062+0.067+0.061+0.063)/5} 0.063\crlf
% \luaexpr{(0.127+0.128+0.133+0.128+0.140)/5} 0.131\crlf

% luatex:

% \luaexpr{(0.087+0.090+0.083+0.081+0.086)/5} 0.085\crlf
% \luaexpr{(0.150+0.151+0.146+0.154+0.145)/5} 0.149\crlf
% \luaexpr{(0.100+0.092+0.113+0.094+0.098)/5} 0.099\crlf
% \luaexpr{(0.162+0.165+0.161+0.160+0.157)/5} 0.161\crlf
% \luaexpr{(0.093+0.101+0.086+0.100+0.098)/5} 0.096\crlf
% \luaexpr{(0.147+0.151+0.160+0.144+0.151)/5} 0.151\crlf
% \luaexpr{(0.076+0.085+0.088+0.073+0.078)/5} 0.080\crlf
% \luaexpr{(0.136+0.138+0.142+0.135+0.140)/5} 0.138\crlf

After the \CONTEXT\ 2020 meeting I entered another round of staring at the code.
One of the decisions made at that meeting was to drop the \type {nd} and \type
{nc} units as they were never official. That made me (again) wonder if that bit
of the code could be done nicer as there is a mix of scanning units like \type
{pt}, \type {bp} and \type {cm}, fillers like \type {fi} and \type {fill}, pseudo
units like \type {ex} and \type {em}, special interception of \type {mu}, as well
as the \type {plus} and \type {minus} parsing for glue. That code was already
redone a bit so that there was less push back of tokens, which had the side
effect of dimension scanning being some 50\% faster than in \LUATEX.

The same is true for scanning rule specs and scanning the box properties. In the
latter case part of the optimization came from not checking properties that
already had been set, or only scanning them when for instance the \type
{orientation} flag had been set (a new option in \LUAMETATEX\ with an additional
four offset and move parameters). Also, some options, like the target dimensions,
came after scanning the new ones. Again, this was quite a bit faster than in
\LUATEX, not that it is noticeable on a normal run. All is mixed with skipping
spacers and relax tokens plus quitting at a brace.

Similar mixed scanning happens in some of the (new) math commands, but these are
less critical. Actually, some new commands had to be used there because for
instance \type {\over} takes any character as valid argument and keywords would
definitely be incompatible there.

Anyway, I started wondering if some of this could be done differently and finally
decided to use a method that I already played with years ago. The main reason for
not using it was that I wanted to remain compatible with the way traditional
\TEX\ scans. However, as we have many more keywords we already are no longer
compatible in that department and the alternative implementation makes the code
look nicer and has the benefit of being (more than) twice as fast. And when I run
into issues in \CONTEXT\ I should just fix sloppy code.

The compatibility issue is not really a problem when you consider the following
cases.

\starttyping
\hbox reverse attr 123 456 orientation 4 xoffset 10pt spread 10cm { }
\hrule xoffset 10pt width 10cm depth 3mm
\hskip 3pt plus 2pt minus 1pt
\stoptyping

In the original approach these three cases each have their own special side
effects. In the case of a \type {\hbox} the scanning stops at a relax or left
brace. An unknown keyword gives an error. So, there is no real benefit in pushing
back tokens here. The order matters: the \type {spread} or \type {to} comes last.

In the case of a \type {\hrule} the scanning stops when the keyword is not one of
the known ones. This has the side effect that when such a rule definition is
hidden in a macro and followed by for instance \type {width} without unit one
gets an error, and when a unit is given the rule can come out different than
expected and the text is gone. For that reason a rule specification like this is
often closed by \type {\relax} (any command that doesn't expand to a keyword
works too). Here keywords can occur multiple times. As we have additional
keywords, lookahead becomes even more of an issue (not that \type {xoffset} is a
likely candidate).

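A minimal sketch of that problem, using a hypothetical macro \type {\MyRule} that
hides a rule specification (the name and the dimensions are only there for the
sake of illustration):

\starttyping
\def\MyRule{\hrule height 1pt}       % hypothetical macro, not terminated

\MyRule width matters here           % error: "width" wants a dimension
\MyRule width 2cm is enough          % a 2cm wide rule, "width 2cm" is gone

\def\MyRule{\hrule height 1pt\relax} % the \relax ends the scanning

\MyRule width 2cm is enough          % full width rule, all text is typeset
\stoptyping
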
The last example is special in a different way: order matters in the sense that a
\type {minus} specifier can follow a \type {plus} but not the reverse. And only
one \type {plus} and \type {minus} can be given. Again one can best finish this
specification with something that doesn't look like a keyword, so often one will
see a \type {\relax}.

The advantage of the new method is that the order doesn't matter any more and
that using a keyword multiple times overloads earlier settings. And this is
consistent for all commands that used keywords (with a few exceptions in math
where keywords drive later parsing and for font definitions where we need to be
compatible). We give a slightly better error message: we mention the expected
keyword. Another side effect is that any character that is a legal start of a
known keyword will trigger further parsing and issue an error message when it
fails. Indeed, \LUAMETATEX\ has no mercy.

In practice the mentioned special effects mean that a macro package will not run
into trouble with boxes because unknown keywords make it crash and that rules and
glue are terminated in a way that prevents lookahead. The new method kind of
assumes this and one can argue that when something breaks one has to fix the
macro code. Macro writers know that one cannot predict what users come up with
and that users also don't look into the macros and therefore they take
precautions. Also, more rigorous parsing hopefully results in a better message.

And yes, when I ran the test suite there was indeed a case where I had to add a
\type {\relax}, but I can live with that. As long as users don't notice it.

Now, one of the interesting properties of the slightly different scanning is
that we can do this:

\starttyping
\hbox to 4cm attr 123 456 reverse to 3cm {...}
\stoptyping

So, we have a less strict order and we can overload arguments too. We'll see how
this will be applied in \CONTEXT.

\stopchapter

\stopcomponent

% another nice example: \the is expanded so we get the old value

% \scratchskip = 10pt plus 1fill     \the\scratchskip  % old value
% \scratchskip = 10pt plus 1fill    [\the\scratchskip] % new value
% \scratchskip = 10pt plus 1fi l l  [\the\scratchskip] % also works
396