% language=us runpath=texruns:manuals/lowlevel
% lowlevel-tokens.tex, last modification: 2024-01-16 10:21

\environment lowlevel-style

\usemodule[system-tokens]

\startdocument
  [title=tokens,
   color=middleblue]

\startsectionlevel[title=Introduction]

Most users don't need to know anything about tokens, but it happens that when
\TEX ies meet in person (user group meetings) or online (support platforms),
there always seem to pop up folks who love token speak. When you try to explain
something to a user it makes sense to talk in terms of characters, but then those
token speakers can jump in and start correcting you. In the past I have been
puzzled by this because, when one can write a decent macro that does the job
well, it really doesn't matter if one knows about tokens. Of course one should
never make the assumption that token speakers really know \TEX\ that well or can
come up with better solutions than users, but that is another matter. \footnote
{Talking about fashion: it would be more impressive to talk about \TEX\ and
friends as a software stack than calling it a distribution. Today, it's all about
marketing.}

That said, because the word \quote {token} does pop up in documents about \TEX, I
will try to give a little insight here. But for using \TEX\ it's mostly
irrelevant. The descriptions below for sure won't match the proper token speak
criteria, which is why at a presentation for the 2020 user meeting I used the
title \quotation {Tokens as I see them.}

\stopsectionlevel

\startsectionlevel[title=What are tokens]

Both the words \quote {node} and \quote {token} are quite common in programming
and also rather old, which is proven by the fact that they are also used in the
\TEX\ source. A node is a storage container that is part of a linked list. When
you input the characters \type {tex} the three characters become part of the
current linked list. They become \quote {character} nodes (or in \LUATEX\ speak
\quote {glyph} nodes) with properties like the font and the character referred
to. But before that happens, the three characters in the input, \type {t}, \type
{e} and \type {x}, are interpreted as in this case being just that: characters.
When you enter \type {\TeX} the input processor first sees a backslash and
because that has a special meaning in \TEX\ it will read the following characters
and when done does a lookup in its internal hash table to see what it actually
is: a macro that assembles the word \TEX\ in uppercase with special kerning and a
shifted (therefore boxed) \quote {E}. When you enter \type {$} \TEX\ will look
ahead for a second one in order to determine display math, push back the found
token when there is no match and then enter inline math mode.
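
As a small illustration of how differently these inputs are seen, one can ask for
the meaning of the next token. This is only a sketch and it assumes the regular
\CONTEXT\ catcode regime, where \type {$} is a math shift character:

\starttyping[option=TEX]
\meaning t     % a character token: reported as a letter
\meaning \TeX  % a control sequence: looked up in the hash, reported as a macro
\meaning $     % a math shift character
\stoptyping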

A token is internally just a 32 bit number that encodes what \TEX\ has seen. It
is the assembled token that travels through the system, gets stored, interpreted
and often discarded afterwards. So, the character \quote {e} in our example gets
tagged as such and encoded in this number in a way that the intention can be
derived later on.

Now, the way \TEX\ looks at these tokens can differ. In some cases it will just
look at this (32 bit) number, for instance when checking for a specific token,
which is fast, but sometimes it needs to know some detail. The mentioned integer
actually encodes a command (opcode) and a so called char code (operand). The
second name is somewhat confusing because in many cases that code is not
representing a character but that is not that relevant here. When you look at the
source code of a \TEX\ engine it is enough to know that a char can also be a sub
command.

\startlinecorrection[blank]
    \setupTABLE[each][align=middle]
    \setupTABLE[c][1][width=44mm]
    \setupTABLE[c][2][width=4em]
    \setupTABLE[c][3][width=11mm]
    \setupTABLE[c][4][width=33mm]
    \bTABLE
        \bTR
            \bTD token \eTD
            \bTD[frame=off] = \eTD
            \bTD cmd   \eTD
            \bTD chr   \eTD
        \eTR
    \eTABLE
\stoplinecorrection

Back to the three characters: these become tokens where the command code
indicates that it is a letter and the char code stores what letter we have at
hand, and in the case of \LUATEX\ and \LUAMETATEX\ these are \UNICODE\ values.
Contrary to the traditional 8 bit \TEX\ engine, in the \UNICODE\ engines an \UTF\
sequence is read, but these multiple bytes still become one number that will be
encoded in the token number. In order to determine that something is a letter the
engine has to be told (which is what a macro package does when it sets up the
engine). For instance, digits are so called other characters and the backslash is
called escape. Every \TEX\ user knows that curly braces are special and so are
dollar symbols and hashes. If this rings a bell, and you relate this to catcodes,
you can indeed assume that the command codes of these tokens have the same
numbers as the catcodes. Given that \UNICODE\ has plenty of character slots you
can imagine that combining 16 catcode commands with all the possible \UNICODE\
values makes a large repertoire of tokens.
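
The relation with catcodes can be checked from the \TEX\ end. The following lines
are a sketch that assumes the regular \CONTEXT\ catcode regime; the numbers in the
comments are the traditional category codes:

\starttyping[option=TEX]
\the\catcode`\\  % 0  : escape
\the\catcode`\{  % 1  : begin group
\the\catcode`\$  % 3  : math shift
\the\catcode`\#  % 6  : parameter
\the\catcode`a   % 11 : letter
\the\catcode`1   % 12 : other character
\stoptyping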

There are more commands than the 16 basic character related ones: in
\LUAMETATEX\ we have just over 150 command codes (\LUATEX\ has a few more but
they are also organized differently). Each of these codes can have a sub
command. For instance the primitives \type {\vbox} and \type {\hbox} are both a
\type {make_box_cmd} (we use the symbolic name here) and in \LUAMETATEX\ the
first one has sub command code 9 (\type {vbox_code}) and the second one has code
10 (\type {hbox_code}). There are twelve primitives that are in the same
category. The many primitives that make up the core of the engine are grouped in
a way that permits processing similar ones with one function and also makes it
possible to distinguish between the way commands are handled, for instance with
respect to expansion.

Now, before we move on it is important to know that all these codes are in fact
abstract numbers. Although it is quite likely that engines that are derived from
each other have similar numbers (just more) this is not the case for \LUAMETATEX.
Because the internals have been opened up (even more than in \LUATEX) the command
and char codes have been reorganized in such a way that exposure is consistent.
We could not use some of the reuse and remap tricks that the other engines use
because it would simply be too confusing (and demand real in depth knowledge of
the internals). This is also the reason why development took some time. You
probably won't notice it from the current source but it was a very stepwise
process. We not only had to make sure that all kept working (\CONTEXT\ \LMTX\ and
\LUAMETATEX\ were pretty usable during the process), but also had to
(re)consider intermediate choices.

So, input is converted into tokens, in most cases one|-|by|-|one. When a token is
assembled, it either gets stored (deliberately or as part of some look ahead
scanning), or it immediately gets (what is called:) expanded. Depending on what
the command is, some action is triggered. For instance, a character gets appended
to the node list immediately. An \type {\hbox} command will start assembling a
box with its own node list that then gets some treatment: if this primitive was a
follow up on \type {\setbox} it will get stored, otherwise it might end up in the
current node list as a so called hlist node. Commands that relate to registers have
\type {0xFFFF} char codes because that is how many registers we have per category.
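
The difference between storing a box and appending it directly can be seen in a
minimal sketch like the following, which assumes that \type {\scratchbox} is
available as a scratch box register (as it is in \CONTEXT):

\starttyping[option=TEX]
\setbox\scratchbox\hbox{tex}  % follow up on \setbox: the box gets stored
\the\wd\scratchbox            % the stored box can be measured
\box\scratchbox               % only now it ends up in the current node list
\hbox{tex}                    % no \setbox: appended right away as hlist node
\stoptyping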

When a token gets stored for later processing it becomes part of a larger data
structure, a so called memory word. These memory words are taken from a large
pool of words and they store a token and additional properties. The info field
contains the token value, the mentioned command and char. When there is no linked
list, the link can actually be used to store a value, something that in
\LUAMETATEX\ we actually do.

\startlinecorrection[blank]
    \setupTABLE[each][align=middle]
    \setupTABLE[c][1][width=8mm]
    \setupTABLE[c][2][width=64mm]
    \setupTABLE[c][3][width=64mm]
    \bTABLE
        \bTR \bTD 1 \eTD \bTD info \eTD \bTD link \eTD \eTR
        \bTR \bTD 2 \eTD \bTD info \eTD \bTD link \eTD \eTR
        \bTR \bTD 3 \eTD \bTD info \eTD \bTD link \eTD \eTR
        \bTR \bTD n \eTD \bTD info \eTD \bTD link \eTD \eTR
    \eTABLE
\stoplinecorrection

When for instance we say \typ {\toks 0 {tex}} the scanner sees an escape,
followed by 4 letters (\type {toks}) and the escape triggers a lookup of the
primitive (or macro or \unknown) with that name, in this case a primitive
assignment command. The found primitive (its property gets stored in the token)
triggers scanning for a number and when that is successful scanning of a brace
delimited token list starts. The three characters become three letter tokens and
these are a linked list of the mentioned memory words. This list then gets stored
in token register zero. The input sequence \typ {\the \toks 0} will push back a
copy of this list into the input.
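
A minimal sketch of this register round trip, using the same token register
zero (inside a group, so that the scratch register is restored afterwards):

\starttyping[option=TEX]
\begingroup
\toks 0 {tex}              % three letter tokens, stored as a linked list
\the \toks 0               % a copy is pushed back into the input: tex
\the \toks 0 \the \toks 0  % the register itself is untouched: textex
\endgroup
\stoptyping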

In addition to the token memory pool, there is also a table of equivalents. That
one is part of a larger table of memory words where \TEX\ stores all it needs to
store. The 16 groups of character commands are virtual, storing these makes no
sense, so the first real entries are all these registers (count, dimension, skip,
box, etc). The rest is taken up by possible hash entries.

\startlinecorrection[blank]
    \bTABLE
        \bTR \bTD[ny=4] main hash \eTD \bTD null control sequence              \eTD \eTR
        \bTR                           \bTD 128K hash entries                  \eTD \eTR
        \bTR                           \bTD frozen control sequences           \eTD \eTR
        \bTR                           \bTD special sequences (undefined)      \eTD \eTR
        \bTR \bTD[ny=7] registers \eTD \bTD  17 internal & 64K user glues      \eTD \eTR
        \bTR                           \bTD   4 internal & 64K user mu glues   \eTD \eTR
        \bTR                           \bTD  12 internal & 64K user tokens     \eTD \eTR
        \bTR                           \bTD   2 internal & 64K user boxes      \eTD \eTR
        \bTR                           \bTD 116 internal & 64K user integers   \eTD \eTR
        \bTR                           \bTD   0 internal & 64K user attribute  \eTD \eTR
        \bTR                           \bTD  22 internal & 64K user dimensions \eTD \eTR
        \bTR \bTD specifications  \eTD \bTD   5 internal &   0 user            \eTD \eTR
        \bTR \bTD extra hash      \eTD \bTD additional entries (grows dynamic) \eTD \eTR
    \eTABLE
\stoplinecorrection

So, a letter token \type {t} is just that, a token. A token referring to a register
is again just a number, but its char code points to a slot in the equivalents table.
A macro, which we haven't discussed yet, is actually just a token list. When a name
lookup happens the hash table is consulted and that one runs in parallel to part of the
table of equivalents. When there is a match, the corresponding entry in the equivalents
table points to a token list.

\startlinecorrection[blank]
    \setupTABLE[each][align=middle]
    \setupTABLE[c][1][width=16mm]
    \setupTABLE[c][2][width=64mm]
    \setupTABLE[c][3][width=64mm]
    \bTABLE
        \bTR \bTD 1     \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR
        \bTR \bTD 2     \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR
        \bTR \bTD n     \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR
        \bTR \bTD n + 1 \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR
        \bTR \bTD n + 2 \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR
        \bTR \bTD n + m \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR
    \eTABLE
\stoplinecorrection

It sounds complex and it actually also is somewhat complex. It is not made easier
by the fact that we also track information related to grouping (saving and
restoring), need reference counts for copies of macros and token lists, sometimes
store information directly instead of via links to token lists, etc. And again
one cannot compare \LUAMETATEX\ with the other engines. Because we did away with
some of the limitations of the traditional engine we not only could save some
memory but in the end also simplify matters (we're 32/64 bit after all). On the one
hand some traditional speedups were removed but these have been compensated by
improvements elsewhere, so overall processing is more efficient.

\startlinecorrection[blank]
    \setupTABLE[each][align=middle]
    \setupTABLE[c][1][width=8mm]
    \setupTABLE[c][2][width=32mm]
    \setupTABLE[c][3][width=16mm]
    \setupTABLE[c][4][width=16mm]
    \setupTABLE[c][5][width=64mm]
    \bTABLE
        \bTR \bTD 1 \eTD \bTD level \eTD \bTD type \eTD \bTD flag \eTD \bTD value \eTD \eTR
        \bTR \bTD 2 \eTD \bTD level \eTD \bTD type \eTD \bTD flag \eTD \bTD value \eTD \eTR
        \bTR \bTD 3 \eTD \bTD level \eTD \bTD type \eTD \bTD flag \eTD \bTD value \eTD \eTR
        \bTR \bTD n \eTD \bTD level \eTD \bTD type \eTD \bTD flag \eTD \bTD value \eTD \eTR
    \eTABLE
\stoplinecorrection

So, here \LUAMETATEX\ differs from other engines because it combines two tables,
which is possible because we have at least 32 bits. There are at most \type
{0xFFFF} levels but we need at most \type {0xFF} types. In \LUAMETATEX\ macros
can have extra properties (flags) and these also need one byte. Contrary to the
other engines, \type {\protected} macros are native and have their own command
code, but \type {\tolerant} macros duplicate that (so we have four distinct macro
commands). All other properties, like the \type {\permanent} ones, are stored in
the flags.
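
What it means for a macro to be protected shows up in an expanding definition.
The macro names in this sketch are made up for the occasion:

\starttyping[option=TEX]
\def\NormalMacro{expanded}
\protected\def\ProtectedMacro{not expanded}

\edef\Result{\NormalMacro\space\ProtectedMacro}

\meaning\Result  % the body has the text of \NormalMacro but still \ProtectedMacro
\stoptyping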

Because a macro starts with a reference count we have some room in the info field
to store information about it having arguments or not. It is these details that
make \LUAMETATEX\ a bit more efficient in terms of memory usage and performance
than its ancestor \LUATEX. But as with the other changes, it was a very stepwise
process in order to keep the system compatible and working.

\stopsectionlevel

\startsectionlevel[title=Some implementation details]

Sometimes there is a special head token at the start. This makes for easier
appending of extra tokens. In traditional \TEX\ node lists are forward linked, in
\LUATEX\ they are double linked \footnote {On the agenda of \LUAMETATEX\ is to
use this property in the underlying code, that doesn't yet profit from this and
therefore keeps previous pointers in store.}. Token lists are always forward
linked. Shared token lists use the head node for a reference count.

For various reasons original \TEX\ uses global variables for temporary lists. This
is for instance needed when we expand (nested) and need to report issues. But in
\LUATEX\ we often just serialize lists and using local variables makes more
sense. One of the first things done in \LUAMETATEX\ was to group all global
variables in (still global) structures but well isolated. That also made it
possible to actually get rid of some globals.

Because \TEX\ had to run on machines that we nowadays consider rather limited, it
had to be sparse and efficient. There are quite some optimizations to limit code
and memory consumption. The engine also does its own memory management. Freed
token memory words are collected in a cache and reused but they can get scattered
which is not that bad, apart from maybe cache hits. In \LUAMETATEX\ we stay as
close to original \TEX\ as possible but there have been some improvements. The
\LUA\ interfaces force us to occasionally divert from the original, and that in
fact might lead to some retrofit but the original documentation still mostly
applies. However, keep in mind that in \LUATEX\ we store much more in nodes (each
has a prev pointer and an attribute list pointer and for instance glyph nodes
have some 20 extra fields compared to traditional \TEX\ character nodes).

\stopsectionlevel

\startsectionlevel[title=Other data management]

There is plenty going on in \TEX\ when it processes your input, just to mention a
few:

\startitemize[packed]
\startitem Grouping is handled by a nesting stack. \stopitem
\startitem Nested conditionals (\type {\if...}) have their own stack. \stopitem
\startitem The values before assignments are saved on the save stack. \stopitem
\startitem Also other local changes (housekeeping) end up on the save stack. \stopitem
\startitem Token lists and macro aliases have reference pointers (reuse). \stopitem
\startitem Attributes, being linked node lists, have their own management. \stopitem
\stopitemize

In all these subsystems tokens or references to tokens can play a role. Reading a
single character from the input can trigger a lot of action. A curly brace tagged
as begin group command will push the grouping level and from then on registers
and some other quantities that are changed will be stored on the save stack
so that after the group ends they can be restored. When primitives take keywords,
and no match happens, tokens are pushed back into the input which introduces a
new input level (also some stack). When numbers are read a token that represents
no digit is pushed back too and macro packages use numbers and dimensions a lot.
It is a surprise that \TEX\ is so fast.
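
The save stack at work can be seen in a few lines. This sketch uses the
\CONTEXT\ scratch register \type {\scratchcounter}; the values in the comments
follow from the grouping rules described above:

\starttyping[option=TEX]
\scratchcounter 1
\begingroup
    \scratchcounter 2         % the old value 1 goes onto the save stack
    \the\scratchcounter       % 2
\endgroup                     % the saved value is restored
\the\scratchcounter           % 1
\begingroup
    \global\scratchcounter 3  % global: the value survives the group
\endgroup
\the\scratchcounter           % 3
\stoptyping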

\stopsectionlevel

\startsectionlevel[title=Macros]

There is a distinction between primitives, the built in commands, and macros, the
commands defined by users. A primitive relates to a command code and char code
but macros are, unless they are made an alias to something else, like a \type
{\countdef} or \type {\let} does, basically pointers to a token list. There is
some additional data stored that makes it possible to parse and grab arguments.

When we have a control sequence (macro) \type {\controlsequence} the name is
looked up in the hash table. When found its value will point to the table of
equivalents. As mentioned, that table keeps track of the cmd and points to a
token list (the meaning). We saw that this table also stores the current level
of grouping and flags.

If we say, in the input, \typ {\hbox to 10pt {x\hss}}, the box is assembled as we
go and when it is appended to the current node list there are no tokens left.
When scanning this, the engine literally sees a backslash and the four letters
\type {hbox}. However when we have this:

\starttyping[option=TEX]
\def\MyMacro{\hbox to 10pt {x\hss}}
\stoptyping

the \type {\hbox} has become one memory word which has a token representing the
\type {\hbox} primitive plus a link to the next token. The space after a control
sequence is gobbled so the next two tokens, again stored in linked memory words,
are letter tokens, followed by a space, two other and two letter tokens for the
dimension. Then we have a space, a brace, a letter, a primitive and a brace. The
about 20 characters in the input became about a dozen memory words, each two
times four bytes, so in terms of memory usage we end up with quite a bit more.
However, when \TEX\ runs over that list it only has to interpret the token values
because the scanning and conversion already happened. So, the space that a macro
takes is more than compensated by efficient reprocessing.
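
As a worked count for the body of \type {\MyMacro}, and assuming that each
memory word indeed takes two times four bytes, the stored list can be tallied
like this (the grouping into rows is only for readability):

\starttyping
\hbox     1  primitive (the space after it is gobbled)
t o       2  letters
(space)   1  spacer
1 0       2  other characters
p t       2  letters
(space)   1  spacer
{         1  begin group
x         1  letter
\hss      1  primitive
}         1  end group
--------------------------------------------
         13  tokens, 13 x 8 = 104 bytes
\stoptyping

The input itself, \typ {\hbox to 10pt {x\hss}}, is 21 characters long.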

\stopsectionlevel

\startsectionlevel[title=Looking at tokens]

When you say \type {\tracingall} you will see what the engine does: read input,
expand primitives and macros, typesetting etc.\ You might need to set \type
{\tracingonline} to get a bit more output on the console. One way to look at
macros is to use the \type {\meaning} command, so if we have:

\startbuffer[definition]
\permanent\protected\def\MyMacro#1#2{Do #1 or #2!}
\stopbuffer

\startbuffer[meaning]
\meaning    \MyMacro
\meaningless\MyMacro
\meaningfull\MyMacro
\stopbuffer

\typebuffer[definition][option=TEX]

we can say this:

\typebuffer[meaning][option=TEX]

and get:

{\getbuffer[definition]\startlines\tttf \getbuffer[meaning]\stoplines}

You get less when you ask for the meaning of a primitive, just its name. The
\type {\meaningfull} primitive gives the most information. In \LUAMETATEX\
protected macros are first class commands: they have their own command code. In
the other engines they are just regular macros with an initial token indicating
that they are protected. There are specific command codes for \type {\outer} and
\type {\long} macros but we dropped that in \LUAMETATEX. Instead we have \type
{\tolerant} macros but that's another story. The flags that were mentioned can
mark macros in a way that permits overload protection as well as some special
treatment in otherwise tricky cases (like alignments). The overload related flags
permit a rather granular way to prevent users from redefining macros and such.
They are set via prefixes, and add to that repertoire: we have 14 prefixes but
only some eight deal with flags (we can add more if really needed). Probably the
most well known prefix is \type {\global} and that one will not become a flag: it
has immediate effect.

For the above definition, the \type {\showluatokens} command will show a meaning
on the console.

\starttyping[option=TEX]
\showluatokens\MyMacro
\stoptyping

% {\getbuffer[definition]\getbuffer}

This gives the next list, where the first column is the address of the token, the
second one the command code, and the third one the char code. When there are
arguments involved, the list of what needs to get matched is shown.

\starttyping
permanent protected control sequence: MyMacro
501263  19   49  match                argument 1
501087  19   50  match                argument 2
385528  20    0  end match
--------------
501090  11   68  letter               D (U+00044)
 30833  11  111  letter               o (U+0006F)
500776  10   32  spacer
385540  21    1  parameter reference
112057  10   32  spacer
431886  11  111  letter               o (U+0006F)
 30830  11  114  letter               r (U+00072)
 30805  10   32  spacer
500787  21    2  parameter reference
213412  12   33  other char           ! (U+00021)
\stoptyping

In the next subsections I will give some examples. This time we use a
helper defined in a module:

\starttyping[option=TEX]
\usemodule[system-tokens]
\stoptyping

\startsectionlevel[title=Example 1: in the input]

\startbuffer
\luatokentable{1 \bf{2} 3\what {!}}
\stopbuffer

\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer}

\stopsectionlevel

\startsectionlevel[title=Example 2: in the input]

\startbuffer
\luatokentable{a \the\scratchcounter b \the\parindent \hbox to 10pt{x}}
\stopbuffer

\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer}

\stopsectionlevel

\startsectionlevel[title=Example 3: user registers]

\startbuffer
\scratchtoks{foo \framed{\red 123}456}

\luatokentable\scratchtoks
\stopbuffer

\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer}

\stopsectionlevel

\startsectionlevel[title=Example 4: internal variables]

\startbuffer
\luatokentable\everypar
\stopbuffer

\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer}

\stopsectionlevel

\startsectionlevel[title=Example 5: macro definitions]

\startbuffer
\protected\def\whatever#1[#2](#3)\relax
  {oeps #1 and #2 & #3 done ## error}

\luatokentable\whatever
\stopbuffer

\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer}

\stopsectionlevel

\startsectionlevel[title=Example 6: commands]

\startbuffer
\luatokentable\startitemize
\luatokentable\stopitemize
\stopbuffer

\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer}

\stopsectionlevel

\startsectionlevel[title=Example 7: commands]

\startbuffer
\luatokentable\doifelse
\stopbuffer

\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer }

\stopsectionlevel

\startsectionlevel[title=Example 8: nothing]

\startbuffer
\luatokentable\relax
\stopbuffer

\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer }

\stopsectionlevel

\startsectionlevel[title=Example 9: hashes]

\startbuffer
\edef\foo#1#2{(#1)(\letterhash)(#2)}  \luatokentable\foo
\stopbuffer

\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer }

\stopsectionlevel

\startsectionlevel[title=Example 10: nesting]

\startbuffer
\def\foo#1{\def\foo##1{(#1)(##1)}}  \luatokentable\foo
\stopbuffer

\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer }

\stopsectionlevel

\startsectionlevel[title=Remark]

In all these examples the numbers are to be seen as abstractions. Some command
codes and sub command codes might change as the engine evolves. This is why the
\LUAMETATEX\ engine has lots of \LUA\ functions that provide information about
what number represents what command.

\stopsectionlevel

\stopsectionlevel

\stopdocument