evenmore-tokens.tex /size: 18 Kb    last modification: 2021-10-28 13:50
1% language=us runpath=texruns:manuals/evenmore
2
3% TODO: copy_node_list : return tail
4% TODO: grabnested
5
6\environment evenmore-style
7
8\startcomponent evenmore-tokens
9
10\startchapter[title=Tokens]
11
12{\em This is mostly a wrapup of some developments, and definitely not a tutorial.}
13
14Talking deep down \TEX\ is talking about tokens and nodes. Roughly spoken, from
15the perspective of the user, tokens are what goes in and stays in (as macro,
16token list of whatever) and nodes is what get produced and eventually results in
17output. A character in the input becomes one token (before expansion) and a
18control sequence like \type {\foo} also is turned into a token. Tokens can be
19linked into lists. This actually means that in the engine we can talk of tokens
20in two ways: the single item with properties that trigger actions, or as compound
21item with that item and a pointer to the next token (called link). In \LUA\ speak
22token memory can be seen as:
23
24\starttyping
25fixmem = {
26    { info, link },
27    { info, link },
28    { info, link },
29    { info, link },
30    ....
31}
32\stoptyping
33
34Both are 32 bit integers. The \type {info} is a combination of a command code (an
35operator) and a so called chr code (operand) and these determine its behaviour.
36For instance the command code can indicate an integer register and the chr code
37then indicates the number of that register. So, like:
38
39\starttyping
40fixmem = {
41    { { cmd, chr}, index_into_fixmem },
42    { { cmd, chr}, index_into_fixmem },
43    { { cmd, chr}, index_into_fixmem },
44    { { cmd, chr}, index_into_fixmem },
45    ....
46}
47\stoptyping
48
49
50In the following line the characters that make three words are tokens (letters),
51so are the space (spacer), the curly braces (begin- and endgroup token) and the
52bold face switch (which becomes one token which resolves to a token list of
53tokens that trigger actions (in this case switching to a bolder font).
54
55\starttyping
56foo {\bf bar} foo
57\stoptyping
58
59When \TEX\ reads a line of input tokens are expanded immediately but a sequence
60can also become part fo a macro body or token list. Here we have $3_{\type{foo}}
61+ 1 + 1_{\type+{+} + 1_{\type{\bf}} + 3_{\type{bar}} + 1_{\type+}+} + 1 +
623_{\type{foo}} = 14$ tokens.
63
64A control sequence normally starts with a backslash. Some are built in, these are
65called primitives, and others are defined by the macro package or the user. There
66is a lookup table that relates the tokenized control sequence to some action. For
67instance:
68
69\starttyping
70\def\foo{foo}
71\stoptyping
72
73creates an entry that leads (directly or following a hash chain) to the three
74letter token list. Every time the input sees \type {\foo} it gets resolved to
75that list via a hash lookup. However, once internalized and part of a token list,
76it is a direct reference. On the other hand,
77
78\starttyping
79\the\count0
80\stoptyping
81
82triggers the \type {\the} action that relates to this control sequence, which
83then reads a next token and operates on that. That next token itself expects a
84number as follow up. In the end the value of \type {\count0} is found and that
85one is also in the so called equivalent lookup table, in what \TEX\ calls
86specific regions.
87
88\starttyping
89equivalents = {
90    { level, type, value },
91    { level, type, value },
92    { level, type, value },
93    ...
94}
95\stoptyping
96
97The value is in most cases similar to the info (cmd & chr) field in fixmem, but
98one difference is that counters, dimensions etc directly store their value, which
99is why we sometimes need the type separately, for instance in order to reclaim
100memory for glue or node specifications. It sound complicated and it is, but as
101long as you get a rough idea we can continue. Just keep in mind that tokens
102sometimes get expanded on the fly, and sometimes just get stored.
103
104There are a lot of primitives and each has a unique info. The same is true for
105characters (each category has its own command code, so regular letters can be
106distinguished from other tokens, comment signs, math triggers etc). All important
107basic bits are in table of equivalents: macros as well as registers although the
108meaning of a macro and content of token lists lives in the fixmem table and
109the content of boxes in so called node lists (nodes have their own memory).
110
111In traditional \TEX\ the lookup table for primitives, registers and macros is as
112compact as can be: it is an array of so called 32 bit memory words. These can be
113divided into halfs and quarters, so in the source you find terms like \type
114{halfword} and \type {quarterword}. The lookup table is a hybrid:
115
116\starttyping
117[level 8] [type 8] [value 16] | [equivalent 32]
118[level 8] [type 8] [value 16] | [equivalent 32]
119[level 8] [type 8] [value 16] | [equivalent 32]
120...
121\stoptyping
122
123The mentioned counters and such are directly encoded in an equivalent and the
124rest is a combination of level, type and value. The level is used for the
125grouping, and in for instance \PDFTEX\ there can therefore be at most 255 levels.
126In \LUATEX\ we use a wider model. There we have 64 bit memory words which means
127that we have way more levels and don't need to have this dual nature:
128
129\starttyping
130[level 16] [type 16] [value 32]
131[level 16] [type 16] [value 32]
132[level 16] [type 16] [value 32]
133...
134\stoptyping
135
136We already showed a \LUA\ representation. The type in this table is what a
137command code is in an \quote {info} field. In such a token the integer encodes
138the command as well as a value (called chr). In the lookup table the type is the
139command code. When \TEX\ is dealing with a control sequences it looks at the
140type, otherwise it filters the command from the token integer. This means that a
141token cannot store an integer (or dimension), but the lookup table actually can
142do that. However, commands can limit the range, for instance characters are bound
143by what \UNICODE\ permits.
144
145Internally, \LUATEX\ still uses these ranges of fast accessible registers, like
146counters, dimensions and attributes. However, we saw that in \LUATEX\ they don't
147overlap with the level and type. In \LUATEX, at least till version 1.13 we still
148have the shadow array for levels but in \LUAMETATEX\ we just use those in the
149equivalents lookup table. If you look in the \PASCAL\ source you will notice that
150arrays run from \type {[somemin ... somemax]} which in the \CCODE\ source would
151mean using offsets. Actually, the shadow array starts at zero so we waste the
152part that doesn't need shadowing. It is good to remind ourselves that traditional
153\TEX\ is 8 bit character based.
154
155The equivalents lookup table has all kind of special ranges (combined into
156regions of similar nature, in \TEX\ speak), like those for lowercase mapping,
157specific catcode mappings, etc.\ but we're still talking of $n \times 256$
158entries. In \LUATEX\ all these mappings are in dedicated sparse hash tables
159because we need to support the full \UNICODE\ repertoire. This means that, while
160on the one hand \LUATEX\ uses more memory for the lookup table the number of
161slots can be less. But still there was the waste of the shadow level table: I
162didn't calculate the exact saving of ditching that one, but I bet it came close
163to what was available as total memory for programs and data on the first machines
164that I used for running \TEX. But \unknown\ after more than a decade of \LUATEX\
165we now reclaimed that space in \LUAMETATEX. \footnote {Don't expect a gain in
166performance, although using less memory might pay back on a virtual machine or
167when \TEX\ has to share the \CPU\ cache.}
168
169Now, in case you're interested (and actually I just write it down because I don't
170want to forget it myself) the lookup table in \LUAMETATEX\ is layout as follows
171
172\starttabulate
173\NC the hash table                    \NC \NC \NR
174\NC some frozen primitives            \NC \NC \NR
175\NC current and defined fonts         \NC one slot + many pointers \NC \NR
176\NC undefined control sequence        \NC one slot \NC \NR
177\NC internal and register glue        \NC pointer to node \NC \NR
178\NC internal and register muglue      \NC pointer to node \NC \NR
179\NC internal and register toks        \NC pointer to token list \NC \NR
180\NC internal and register boxes       \NC pointer to node list \NC \NR
181\NC internal and register counts      \NC value in token \NC \NR
182\NC internal and register attributes  \NC value in token \NC \NR
183\NC internal and register dimens      \NC value in token \NC \NR
184\NC some special data structures      \NC pointer to node list \NC  \NC \NR
185\NC the (runtime) extended hash table \NC  \NC \NR
186\stoptabulate
187
188Normally a user doesn't need to know anything about these specific properties of
189the engine and it might comfort you to know that for a long time I could stay
190away from these details. One difference with the other engines is that we have
191internal variables and registers split more explicitly. The special data
192structures have their own slots and are not just put somewhere (semi random). The
193initialization is bit more granular in that we properly set the types (cmd codes)
194for registers which in turn is possible because for instance we're able to
195distinguish glue types. This is all part of coming up with a bit more consistent
196interface to tokens from the \LUA\ end. It also permits diagnostics.
197
198Anyway, we now are ready for some more details about tokens. You don't need to
199understand all of it in order to define decent macros. But when you are using
200\LUATEX\ and do want to mess around here is some insight. Assume we have defined
201these macros:
202
203\startluacode
204    local alsoraw = false
205    function documentdata.StartShowTokens(rawtoo)
206        context.starttabulate { "|T|rT|lT|rT|rT|rT|" .. (rawtoo and "rT|" or "") }
207        context.BC()
208        context.BC() context("cmd")
209        context.BC() context("name")
210        context.BC() context("chr")
211        context.BC() context("cs")
212        if rawtoo then
213            context.BC() context("rawchr")
214        end
215        context.BC() context.NR()
216        context.SL()
217        alsoraw = rawtoo
218    end
219    function documentdata.StopShowTokens()
220        context.stoptabulate()
221    end
222    function documentdata.ShowToken(name)
223        local cmd, chr, cs = token.get_cmdchrcs(name)
224        local _,   raw, _  = token.get_cmdchrcs(name,true)
225        context.NC() context("\\string\\"..name)
226        context.NC() context(cmd)
227        context.NC() context(tokens.commands[cmd])
228        context.NC() context(chr)
229        context.NC() context(cs)
230        if alsoraw and chr ~= raw then
231            context.NC() context(raw)
232        end
233        context.NC() context.NR()
234    end
235\stopluacode
236
237\startbuffer
238\def\MacroA{a} \def\MacroB{b}
239\def\macroa{a} \def\macrob{b}
240\def\MACROa{a} \def\MACROb{b}
241\stopbuffer
242
243\typebuffer \getbuffer
244
245How does that end up internally?
246
247\startluacode
248    documentdata.StartShowTokens(true)
249    documentdata.ShowToken("scratchcounterone")
250    documentdata.ShowToken("scratchcountertwo")
251    documentdata.ShowToken("scratchdimen")
252    documentdata.ShowToken("scratchtoks")
253    documentdata.ShowToken("scratchcounter")
254    documentdata.ShowToken("letterpercent")
255    documentdata.ShowToken("everypar")
256    documentdata.ShowToken("%")
257    documentdata.ShowToken("pagegoal")
258    documentdata.ShowToken("pagetotal")
259    documentdata.ShowToken("hangindent")
260    documentdata.ShowToken("hangafter")
261    documentdata.ShowToken("dimdim")
262    documentdata.ShowToken("relax")
263    documentdata.ShowToken("dimen")
264    documentdata.ShowToken("stoptext")
265    documentdata.ShowToken("MacroA")
266    documentdata.ShowToken("MacroB")
267    documentdata.ShowToken("MacroC")
268    documentdata.ShowToken("macroa")
269    documentdata.ShowToken("macrob")
270    documentdata.ShowToken("macroc")
271    documentdata.ShowToken("MACROa")
272    documentdata.ShowToken("MACROb")
273    documentdata.ShowToken("MACROc")
274    documentdata.StopShowTokens()
275\stopluacode
276
277We show the raw chr value but in the \LUA\ interface these are normalized to for
278instance proper register indices. This is because the raw numbers can for
279instance be indices into memory or some \UNICODE\ reference with catcode specific
280bits set. But, while these indices are real and stable, these offsets can
281actually change when the implementation changes. For that reason, in \LUAMETATEX\
282we can better talk of command codes as main indicator and:
283
284\starttabulate
285\NC subcommand       \NC for tokens that have variants, like \type {\ifnum} \NC \NR
286\NC register indices \NC for the 64K register banks, like \type {\count0} \NC \NR
287\NC internal indices \NC for internal variables like \type {\parindent} \NC \NR
288\NC characters       \NC specific \UNICODE\ slots combined with catcode \NC \NR
289\NC pointers         \NC to token lists, macros, \LUA\ functions, nodes \NC \NR
290\stoptabulate
291
292This so called \type {cs} number is a pointer into the table of equivalents. That
293number results comes from the hash table. A macro name, when scanned the first
294time, is still a sequence of bytes. This sequence is used to compute a hash
295number, which is a pointer to a slot in the lower part of the hash (lookup)
296table. That slot points to a string and a next hash entry in the higher end. A
297lookup goes as follows:
298
299\startitemize[n,packed]
300    \startitem
301        compute the index into the hash table from the string
302    \stopitem
303    \startitem
304        goto the slot with that index and compare the \type {string} field
305    \stopitem
306    \startitem
307        when there is no match goto the slot indicated by the \type {next} field
308    \stopitem
309    \startitem
310        compare again and keep following \type {next} fields till there is no
311        follow up
312    \stopitem
313    \startitem
314        optionally create a new entry
315    \stopitem
316    \startitem
317        use the index of that entry as index in the table of equivalents
318    \stopitem
319\stopitemize
320
321So, in \LUA\ speak, we have:
322
323\starttyping
324hashtable = {
325    -- lower part, accessed via the calculated hash number
326    { stringpointer, nextindex },
327    { stringpointer, nextindex },
328    ...
329    -- higher part, accessed by following nextindex
330    { stringpointer, nextindex },
331    { stringpointer, nextindex },
332    ...
333}
334\stoptyping
335
336Eventually, after following a lookup chain in the hash tabl;e, we end up at
337pointer to the equivalents lookup table that we already discussed. From then on
338we're talking tokens. When you're lucky, the list is small and you have a quick
339match. The maximum initial hash index is not that large, around 64K (double that
340in \LUAMETATEX), so in practice there will often be some indirect
341(multi|-|compare) match but increasing the lower end of the hash table might
342result in less string comparisons later on, but also increases the time to
343calculate the initial hash needed for accessing the lower part. Here you can sort
344of see that:
345
346\startbuffer
347\dostepwiserecurse{`a}{`z}{1}{
348    \expandafter\def\csname whatever\Uchar#1\endcsname
349      {}
350}
351\dostepwiserecurse{`a}{`z}{1}{
352    \expandafter\let\csname somemore\Uchar#1\expandafter\endcsname
353        \csname whatever\Uchar#1\endcsname
354}
355\stopbuffer
356
357\typebuffer \getbuffer
358
359\startluacode
360    documentdata.StartShowTokens(true)
361    for i=utf.byte("a"),utf.byte("z") do
362        documentdata.ShowToken("whatever"..utf.char(i))
363        documentdata.ShowToken("somemore"..utf.char(i))
364    end
365    documentdata.StopShowTokens()
366\stopluacode
367
368The command code indicates a macro and the action related to it is an expandable
369call. We have no sub command \footnote {We cheat a little here because chr
370actually is an index into token memory but we don't show them as such.} so that
371column shows zeros. The fifth column is the hash entry which can bring us back to
372the verbose name as needed in reporting while the last column is the index to
373into token memory (watch the duplicates for \type {\let} macros: a ref count is
374kept in order to be able to manage such shared references). When you look a the
375cs column you will notice that some numbers are close which (I think) in this
376case indicates some closeness in the calculated hash name and followed chain.
377
378It will be clear that it is best to not make any assumptions with respect to the
379numbers which is why, in \LUAMETATEX\ we sort of normalize them when accessing
380properties.
381
382\starttabulate
383\NC field      \NC meaning \NC \NR
384\FL
385\NC command    \NC operator \NC \NR
386\NC cmdname    \NC internal name of operator \NC \NR
387\NC index      \NC sanitized operand \NC \NR
388\NC mode       \NC original operand  \NC \NR
389\NC csname     \NC associated name   \NC \NR
390\NC id         \NC the index in token memory (a virtual address) \NC \NR
391\NC tok        \NC the integer representation \NC \NR
392\ML
393\NC active     \NC true when an active character \NC \NR
394\NC expandable \NC true when expandable command \NC \NR
395\NC protected  \NC true when a protected command \NC \NR
396\NC frozen     \NC true when a frozen command \NC \NR
397\NC user       \NC true when a user defined command \NC \NR
398\LL
399\stoptabulate
400
401When a control sequence is an alias to an existing primitive, for instance
402made by \type {\let}, the operand (chr) picked up from its meaning. Take this:
403
404\startbuffer
405\newif\ifmyconditionone
406\newif\ifmyconditiontwo
407
408                    \meaning\ifmyconditionone    \crlf
409                    \meaning\ifmyconditiontwo    \crlf
410                    \meaning\myconditiononetrue  \crlf
411                    \meaning\myconditiontwofalse \crlf
412\myconditiononetrue \meaning\ifmyconditionone    \crlf
413\myconditiontwofalse\meaning\ifmyconditiontwo    \crlf
414\stopbuffer
415
416\typebuffer \getbuffer
417
418Internally this is:
419
420\startluacode
421    documentdata.StartShowTokens(false)
422    documentdata.ShowToken("ifmyconditionone")
423    documentdata.ShowToken("ifmyconditiontwo")
424    documentdata.ShowToken("iftrue")
425    documentdata.ShowToken("iffalse")
426    documentdata.StopShowTokens()
427\stopluacode
428
429The whole list of available commands is given below. Once they are stable the \LUAMETATEX\ manual
430will document the accessors. In this chapter we use:
431
432\starttyping
433kind, min, max, fixedvalue token.get_range("primitive")
434cmd, chr, cs  = token.get_cmdchrcs("primitive")
435\stoptyping
436
437The kind of command is given in the first column, which can have the following values:
438
439\starttabulate[|l|l|p|]
440\NC 0 \NC no        \NC not accessible \NC \NR
441\NC 1 \NC regular   \NC possibly with subcommand \NC \NR
442\NC 2 \NC character \NC the \UNICODE\ slot is encodes in the the token \NC \NR
443\NC 3 \NC register  \NC this is an indexed register (zero upto 64K) \NC \NR
444\NC 4 \NC internal  \NC this is an internal register (range given) \NC \NR
445\NC 5 \NC reference \NC this is a reference to a node, \LUA\ function, etc. \NC \NR
446\NC 6 \NC data      \NC a general data entry (kind of private) \NC \NR
447\NC 7 \NC token     \NC a token reference (that can have a followup) \NC \NR
448\stoptabulate
449
450\usemodule[system-tokens]
451
452\start \switchtobodyfont[7pt] \showsystemtokens \stop
453
454\stopchapter
455
456\stopcomponent
457