% language=us runpath=texruns:manuals/evenmore

\environment evenmore-style

\startcomponent evenmore-parsing

\startchapter[title=Parsing]

The macro mechanism in \TEX\ is quite powerful and once you understand the
concept of mixing parameters and delimiters you can do a lot with it. I assume
that you know what we're talking about, otherwise quit reading. When grabbing
arguments, there are a few catches.

\startitemize
\startitem
    When they are used, delimiters are mandatory: \TEX\ will go on reading an
    argument till the (current) delimiter condition is met. This means that when
    you forget one you end up with way more in the argument than expected or even
    run out of input.
\stopitem
\startitem
    Because specified arguments and delimiters are mandatory, when you want to
    parse input, you often need multi|-|step macros that first pick up the to be
    parsed input, and then piecewise fetch snippets. Bogus delimiters have to be
    appended to the original in order to catch a runaway argument and checking
    has to be done to get rid of them when all is ok.
\stopitem
\stopitemize

The first item can be illustrated as follows:

\starttyping[option=TEX]
\def\foo[#1]{...}
\stoptyping

When \type {\foo} gets expanded \TEX\ first looks for a \type {[} and then starts
collecting tokens for parameter \type {#1}. It stops doing that when a \type {]}
is seen. So,

\starttyping[option=TEX]
\starttext
    \foo[whatever
\stoptext
\stoptyping

will for sure give an error. When collecting tokens, \TEX\ doesn't expand them so
the \type {\stoptext} is just turned into a token that gets appended.

The second item is harder to explain (or grasp):

\starttyping[option=TEX]
\def\foo[#1=#2]{(#1/#2)}
\stoptyping

Here we expect a key and a value, so these will work:

\starttyping[option=TEX]
\foo[key=value]
\foo[key=]
\stoptyping

while these will fail:

\starttyping[option=TEX]
\foo[key]
\foo[]
\stoptyping

unless we have:

\starttyping[option=TEX]
\foo[key]=]
\foo[]=]
\stoptyping

But, when processing the result, we then need to analyze the found arguments and
correct for them being wrong. For instance, argument \type {#1} can become \type
{]} or here \type {key]}. When indeed a valid key|/|value combination is given we
need to get rid of the two \quote {fixup} tokens \type{=]}. Normally we will have
multiple key|/|value pairs separated by a comma, and in practice we only need to
catch the missing equal sign because we can ignore empty cases. There are plenty
of examples (rather old code but also more modern variants) in the \CONTEXT\
code base.
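
As a hedged illustration of this fixup approach in traditional \TEX, one can
append two equal signs and a terminator and then check what ended up in the third
argument. The names \type {\samplepair}, \type {\dosamplepair}, \type
{\sampletest}, \type {\sampleempty} and \type {\samplestop} are made up for this
sketch and are not part of \CONTEXT:

\starttyping[option=TEX]
% hypothetical helpers, not taken from the ConTeXt code base
\def\sampleempty{}

% #1 is a single "key=value" candidate; two bogus equal signs are appended
\def\samplepair#1{\dosamplepair#1==\samplestop}

% #3 is empty when the input had no equal sign of its own
\def\dosamplepair#1=#2=#3\samplestop
  {\def\sampletest{#3}%
   \ifx\sampletest\sampleempty
     (no equal sign: #1)%
   \else
     (#1/#2)%
   \fi}

\samplepair{key=value}
\samplepair{key=}
\samplepair{key}
\stoptyping

The terminator \type {\samplestop} only serves as a delimiter and never gets
expanded. The price is an extra macro call and a test per pair, and a value that
itself contains an equal sign has to be wrapped in braces.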

I will now show some new magic that is available in \LUAMETATEX\ as experimental
code. It will be tested in \LMTX\ for a while and might evolve in the process.

\startbuffer
\def\foo#1=#2,{(#1/#2)}

\foo 1=2,\ignorearguments
\foo 1=2\ignorearguments
\foo 1\ignorearguments
\foo \ignorearguments
\stopbuffer

\typebuffer[option=TEX]

Here we pick up a key and value separated by an equal sign. We end the input with
a special signal command: \type {\ignorearguments}. This tells the parser to quit
scanning. So, we get this, without any warning with respect to a missing
delimiter or running away:

\getbuffer

The implementation is actually fairly simple and does not add much overhead.
Alternatives (and I pondered a few) are just too messy, would remind me too much
of those awful expression syntaxes, and definitely impact performance of macro
expansion, therefore: a no|-|go.

Using this new feature, we can implement a key value parser that handles a
sequence of pairs. The prototypes used to get here only made use of this one new
feature and therefore still had to do some testing of the results. But, after
looking at the code, I decided that a few more helpers could make better looking
code. So this is what I ended up with:

\startbuffer
\def\grabparameter#1=#2,%
  {\ifarguments\or\or
     % (\whatever/#1/#2)\par%
     \expandafter\def\csname\namespace#1\endcsname{#2}%
     \expandafter\grabnextparameter
   \fi}

\def\grabnextparameter
  {\expandafterspaces\grabparameter}

\def\grabparameters[#1]#2[#3]%
  {\def\namespace{#1}%
   \expandafterspaces\grabparameter#3\ignorearguments\ignorearguments}
\stopbuffer

\typebuffer[option=TEX]

Now, this one actually does what the \CONTEXT\ \type {\getparameters} command
does: setting variables in a namespace. Because \CONTEXT\ is a parameter driven
macro package, these kinds of macros have been part of it since the beginning.
There are some variants and we also need to deal with the multilingual interface.
Actually, \MKIV\ (and therefore \LMTX) do things a bit differently, but the same
principles apply.

The \type {\ignorearguments} quits the scanning. Here we need two because we
actually quit twice: once when the last value is picked up without a trailing
comma, and once more when the next call of \type {\grabparameter} finds nothing
left to scan. The \type {\expandafterspaces} can be implemented in traditional
\TEX\ macros but I thought it was nice to have it this way; the fact that I only
now added it has more to do with cosmetics. One could use the already somewhat
older extension \type {\futureexpandis} (which expands the second or third token
depending on whether it sees the first, in this variant ignoring spaces) or a
bunch of good old primitives to do the same. The new conditional \type
{\ifarguments} can be used to act upon the number of arguments given. It reflects
the most recently expanded macro. There is also a \type {\lastarguments}
primitive that provides the number of arguments.
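
A minimal sketch of how \type {\ifarguments} can be used on its own, assuming
that it behaves like \type {\ifcase} fed with the number of arguments that were
picked up (which is how it is used above). The macro \type {\samplecount} is made
up for this illustration:

\starttyping[option=TEX]
% hypothetical macro; needs LuaMetaTeX
\def\samplecount#1=#2,%
  {\ifarguments
     (nothing)%          assumed: no argument was picked up
   \or
     (only a key: #1)%   assumed: scanning stopped before the equal sign
   \or
     (#1/#2)%            both arguments were picked up
   \fi}

\samplecount key=value,\ignorearguments
\samplecount key\ignorearguments
\samplecount \ignorearguments
\stoptyping

Because the feature is experimental, the counts that are reported for the
incomplete cases are best checked against the engine at hand.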

So, what are the benefits? You might think that it is about performance, but in
practice there are not that many parameter settings going on. When I process the
\LUAMETATEX\ manual, one or more parameters are set only some 5000 times. And
even in a way more complex document that I asked my colleague to run I was a bit
disappointed that only some 30.000 cases were reported. I know of users who have
documents with hundreds of thousands of cases, but compared to the rest of
processing this is not where the performance bottleneck is. \footnote {Think of
thousands of pages of tables with cell settings applied.} This means that a
change in implementation like the above is not paying off in significantly better
runtime: all these low level mechanisms in \CONTEXT\ have been very well
optimized over the years. And faster machines made old bottlenecks go away
anyway. Take this use case:

\starttyping[option=TEX]
\grabparameters
  [foo]
  [key0=value0,
   key1=value1,
   key2=value2,
   key3=value3]
\stoptyping

After this, parameters can be accessed with:

\starttyping[option=TEX]
\def\getvalue#1#2{\csname#1#2\endcsname}
\stoptyping

used as:

\starttyping[option=TEX]
\getvalue{foo}{key2}
\stoptyping

which takes care of characters normally not permitted in macro names, like the
digits in this example. Of course some namespace protection can be added, like
adding a colon between the namespace and the key, but let's take just this one.
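
Such a colon protected variant could look as follows. This is only a sketch: the
names \type {\samplegrabparameter}, \type {\samplegrabnextparameter} and \type
{\samplegetvalue} are made up, and the setter is the one shown before with
nothing but the colon added:

\starttyping[option=TEX]
% hypothetical variants of the macros above, using "namespace:key" names
\def\samplegrabparameter#1=#2,%
  {\ifarguments\or\or
     \expandafter\def\csname\namespace:#1\endcsname{#2}%
     \expandafter\samplegrabnextparameter
   \fi}

\def\samplegrabnextparameter
  {\expandafterspaces\samplegrabparameter}

% the getter has to use the same separator
\def\samplegetvalue#1#2{\csname#1:#2\endcsname}
\stoptyping

With this, namespace \type {foo} with key \type {key2} ends up as \type
{foo:key2}, so it can no longer clash with a namespace \type {fo} that has a key
\type {okey2}.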

Some 10.000 expansions of the grabber take 0.045 seconds on my machine while the
original \type {\getparameters} takes 0.090 so although for this case we're twice
as fast, the 0.045 difference will not be noticed on a real run. After all, when
these parameters are set some action will take place. Also, we don't actually use
this macro for collecting settings with the \type {\setupsomething} commands, so
the additional overhead that is involved adds a baseline to performance that can
turn any gain into noise. But some users might notice some gain. Of course this
observation might change once we apply this trickery in more places than
parameter parsing, because I have to admit that there might be other places in
the support macros where we can benefit: less code, double performance. But these
are all support macros that made sense in \MKII\ and not that much in \MKIV\ or
\LMTX\ and are kept just for convenience and backward compatibility. Think of
some list processing macros. So, as a kind of nostalgic trip I decided to rewrite
some low level macros anyway, if only to see what is no longer used and|/|or to
make the code base somewhat (c)leaner.
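
A rough way to repeat such a measurement could be the following. It is a sketch
that assumes that the \CONTEXT\ helpers \type {\testfeatureonce} and \type
{\elapsedtime} are available in the version at hand. The numbers will of course
differ per machine:

\starttyping[option=TEX]
% a rough timing loop; \testfeatureonce and \elapsedtime are assumed to be
% provided by ConTeXt, \grabparameters is the macro defined above
\testfeatureonce{10000}
  {\grabparameters
     [foo]
     [key0=value0,
      key1=value1,
      key2=value2,
      key3=value3]}

grabber: \elapsedtime\ seconds
\stoptyping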

Elsewhere I introduce the \type {#0} argument indicator. That one just
gobbles the argument and does not store a token list on the stack. It saves some
memory access and token recycling when arguments are not used. Another special
indicator is \type {#+}. That one will flag an argument to be passed as|-|is,
that is, with its braces kept. The \type {#-} variant will simply discard an
argument and move on. The following examples demonstrate this:

\startbuffer
\def\foo    [#1]{\detokenize{#1}}
\def\ofo    [#0]{\detokenize{#1}}
\def\oof    [#+]{\detokenize{#1}}
\def\fof[#1#-#2]{\detokenize{#1#2}}
\def\fff[#1#0#3]{\detokenize{#1#3}}

\meaning\foo\ : <\foo[{123}]> \crlf
\meaning\ofo\ : <\ofo[{123}]> \crlf
\meaning\oof\ : <\oof[{123}]> \crlf
\meaning\fof\ : <\fof[123]>   \crlf
\meaning\fff\ : <\fff[123]>   \crlf
\stopbuffer

\typebuffer[option=TEX]

This gives:

{\tttf \getbuffer}

% \getcommalistsize[a,b,c]   \commalistsize\par
% \getcommalistsize[{a,b,c}] \commalistsize\par

When playing with new features like the one described here, it makes sense to use
them in existing macros so that they get well tested. Some of the low level
system files come in different versions: for \MKII, \MKIV\ and \LMTX. The \MKII\
files often also have the older implementations, so they are also good for
looking at the history. The \LMTX\ files can be leaner and meaner than the \MKIV\
files because they use the latest features. \footnote {Some 70 primitives present
in \LUATEX\ are not in \LUAMETATEX. On the other hand there are also about 70 new
primitives. Of those gone, most concerned the backend, fonts or no longer
relevant features from other engines. Of those new, some are really new
primitives (conditionals, expansion magic), some control previously hardwired
behaviour, some give access to properties of for instance boxes, and some are
just variants of existing ones but with options for control.}

When I was rewriting some of these low level \MKIV\ macros using the newer features,
at some point I wondered why I still had to jump through some hoops. Why not just
add some more primitives to deal with that? After all, \LUATEX\ and \LUAMETATEX\
already have more primitives that are helpful in parsing, so a few dozen more lines
don't hurt, as long as these primitives are generic and not too specific. In this
particular case we talk about three new conditionals (in addition to the already
present comparison primitives):

\starttyping[option=TEX]
\ifhastok    <token>       {<token list>}
\ifhastoks  {<token list>} {<token list>}
\ifhasxtoks {<token list>} {<token list>}
\stoptyping

You can probably guess what they do from their names: they check if a token or
token list occurs in the given token list. The last one is the expandable variant
of the second one. The first one is the fast one. When playing with these I
decided to redo the set checker. In \MKII\ that one is done in good old \TEX, in
\MKIV\ we use \LUA. So, how about going back to \TEX ?

\starttyping[option=TEX]
\ifhasxtoks {cd} {abcdef}
\stoptyping

This check is true. Such a plain match doesn't work well with a comma separated
list, because it can also hit part of an entry, but there is a way out:

\starttyping[option=TEX]
\ifhasxtoks {,cd,} {,ab,cd,ef,}
\stoptyping
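
For comparison, the same wrapping trick can be played in traditional \TEX\ with a
delimited argument. The following is only a sketch and not the actual \MKII\
code. The names \type {\sampleinset}, \type {\dosampleinset}, \type
{\samplemarker} and \type {\samplestop} are made up:

\starttyping[option=TEX]
% hypothetical macros, plain TeX style
\def\samplemarker{\samplemarker} % only compared with \ifx, never expanded

% #1: the element to look for, #2: the comma separated list
\def\sampleinset#1#2%
  {\def\dosampleinset##1,#1,##2##3\samplestop
     {\ifx##2\samplemarker NO\else YES\fi}%
   \dosampleinset,#2,#1,\samplemarker\samplestop}

\sampleinset{cd}{ab,cd,ef}
\sampleinset{xx}{ab,cd,ef}
\stoptyping

The element is appended once more, followed by a marker. When the first match is
found in the list itself, the token after it is not the marker. This costs a
definition per check, and because the comparison is literal a space after a comma
makes the match fail.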

However, when I applied that a user reported that it didn't handle optional
spaces after commas. So how do we deal with such optional character tokens?

\startbuffer
\def\setcontains#1#2{\ifhasxtoks{,#1,}{,#2,}}

\ifcondition\setcontains{cd}{ab,cd,ef}YES \else NO \fi
\ifcondition\setcontains{cd}{ab, cd, ef}YES \else NO \fi
\stopbuffer

\typebuffer[option=TEX]

We get:

\getbuffer

The \type {\ifcondition} is an old one. When nested in a condition it will be
seen as an \type {\if...} by the fast skipping scanner, but when expanded it will
go on and a following macro has to expand to a proper condition. That said, we
can take care of the optional space by entering some new territory. Look at this:

\startbuffer
\def\setcontains#1#2{\ifhasxtoks{,\expandtoken 9 "20 #1,}{,#2,}}

\ifcondition\setcontains{cd}{ab,cd,ef}YES \else NO \fi
\ifcondition\setcontains{cd}{ab, cd, ef}YES \else NO \fi
\stopbuffer

\typebuffer[option=TEX]

We get:

\getbuffer

So how does that work? The \type {\expandtoken} injects a space token with
catcode~9 which means that it is in the to be ignored category. When a to be
ignored token is seen, and the to be checked token is a character (letter, other,
space or ignored) then the character code will be compared. When they match, we
move on, otherwise we just skip over the ignored token (here the space).

In the \CONTEXT\ code base there are already files that are specific for \MKIV\
and \LMTX. The most visible difference is that we use the \type {\orelse}
primitive to construct nicer test trees, and we also use some of the additional
\type {\future...} and \type {\expandafter...} features. The extensions discussed
here make for the most recent differences (we're talking end of May 2020).

After implementing this trick I decided to look at the macro definition mechanism
one more time and see if I could also use this there. Before I demonstrate
the next feature, I will again show the argument extensions, this time with
a fourth variant:

\startbuffer[definitions]
\def\TestA#1#2#3{{(#1)(#2)(#3)}}
\def\TestB#1#0#3{(#1)(#2)(#3)}
\def\TestC#1#+#3{(#1)(#2)(#3)}
\def\TestD#1#-#2{(#1)(#2)}
\stopbuffer

\typebuffer[definitions][option=TEX] \getbuffer[definitions]

The last one specifies a to be trashed argument: \type {#-}. It goes further
than the second one (\type {#0}) which still keeps a reference. This is why in
this last case the third argument gets number \type {2}. The meanings of these
four are:

\startlines \tttf
\meaning\TestA
\meaning\TestB
\meaning\TestC
\meaning\TestD
\stoplines

There are some subtle differences between these variants, as you can see from
the following examples:

\startbuffer[usage]
\TestA1{\red 2}3
\TestB1{\red 2}3
\TestC1{\red 2}3
\TestD1{\red 2}3
\stopbuffer

\typebuffer[usage][option=TEX]

Here you also see the side effect of keeping the braces. The zero argument (\type
{#0}) is ignored, and the trashed argument (\type {#-}) can't even be accessed.

\startlines \tttf \getbuffer[usage] \stoplines

In the next example we see two delimiters being used, a comma and a space, but
they have catcode~9 which flags them as ignored. This is a signal for the parser
that both the comma and the space can be skipped. The zero arguments are still on
the parameter stack, but the trashed ones result in a smaller stack, not that
the latter matters much on today's machines.

\startbuffer
\normalexpanded {
    \def\noexpand\foo
    \expandtoken 9 "2C % comma
    \expandtoken 9 "20 % space
    #1=#2]%
}{(#1)(#2)}
\stopbuffer

\typebuffer[option=TEX] \getbuffer

This means that the next three expansions won't bark:

\startbuffer
\foo,key=value]
\foo, key=value]
\foo  key=value]
\stopbuffer

\typebuffer[option=TEX]

or expanded:

\startlines \tttf \getbuffer \stoplines

Now, why didn't I add these primitives long ago already? After all, I already
added dozens of new primitives over the years. To quote Andrew Cuomo, what
follows now are opinions, not facts.

Decades ago, when \TEX\ showed up, there was no Internet. I remember that I got
my first copy on floppy disks. Computers were slow and memory was limited. The
\TEX book was the main resource and writing macros was a kind of art. One could
not look up solutions, so trial and error was a valid way to go. Figuring out
what was efficient in terms of memory consumption and runtime was often needed
too. I remember meetings where one was not taken seriously when not talking in the
right \quote {token}, \quote {node}, \quote {stomach} and \quote {mouth} speak.
Suggesting extensions could end up in being told that there was no need because
all could be done in macros or even arguments of the \quotation {who needs that}
kind. I must admit that nowadays I wonder to what extent that was related to
extensions taking away some of the craftsmanship and showing off. In a way it is
no surprise that (even trivial to implement) extensions never surfaced. Of course
then the question is: will extensions that once were considered not of importance
be used today? We'll see.

Let's end by saying that, as with other experiments, I might port some of the new
features in \LUAMETATEX\ to \LUATEX, but only after they have become stable and
have been tested in \LMTX\ for quite a while.

\stopchapter

\stopcomponent