% language=us runpath=texruns:manuals/lowlevel \environment lowlevel-style \usemodule[system-tokens] \startdocument [title=tokens, color=middleblue] \startsectionlevel[title=Introduction] Most users don't need to know anything about tokens but it happens that when \TEX ies meet in person (users group meetings), or online (support platforms) there always seem to pop up folks who love token speak. When you try to explain something to a user it makes sense to talk in terms of characters but then those token speakers can jump in and start correcting you. In the past I have been puzzled by this because, when one can write a decent macro that does the job well, it really doesn't matter if one knows about tokens. Of course one should never make the assumption that token speakers really know \TEX\ that well or can come up with better solutions than users but that is another matter. \footnote {Talking about fashion: it would be more impressive to talk about \TEX\ and friends as a software stack than calling it a distribution. Today, it's all about marketing.} That said, because in documents about \TEX\ the word \quote {token} does pop up I will try to give a little insight here. But for using \TEX\ it's mostly irrelevant. The descriptions below for sure won't match the proper token speak criteria which is why at a presentation for the 2020 user meeting I used the title \quotation {Tokens as I see them.} \stopsectionlevel \startsectionlevel[title=What are tokens] Both the words \quote {node} and \quote {token} are quite common in programming and also rather old which is proven by the fact that they also are used in the \TEX\ source. A node is a storage container that is part of a linked list. When you input the characters \type {tex} the three characters become part of the current linked list. They become \quote {character} nodes (or in \LUATEX\ speak \quote {glyph} nodes) with properties like the font and the character referred to. But before that happens, the three characters in the input \type {t}, \type {e} and \type {x}, are interpreted as in this case being just that: characters. When you enter \type {\TeX} the input processors first sees a backslash and because that has a special meaning in \TEX\ it will read following characters and when done does a lookup in it's internal hash table to see what it actually is: a macro that assembled the word \TEX\ in uppercase with special kerning and a shifted (therefore boxed) \quote {E}. When you enter \type {$} \TEX\ will look ahead for a second one in order to determine display math, push back the found token when there is no match and then enter inline math mode. A token is internally just a 32 bit number that encodes what \TEX\ has seen. It is the assembled token that travels through the system, get stored, interpreted and often discarded afterwards. So, the character \quote {e} in our example gets tagged as such and encoded in this number in a way that the intention can be derived later on. Now, the way \TEX\ looks at these tokens can differ. In some cases it will just look at this (32 bit) number, for instance when checking for a specific token, which is fast, but sometimes it needs to know some detail. The mentioned integer actually encodes a command (opcode) and a so called char code (operand). The second name is somewhat confusing because in many cases that code is not representing a character but that is not that relevant here. When you look at the source code of a \TEX\ engine it is enough to know that a char can also be a sub command. \startlinecorrection[blank] \setupTABLE[each][align=middle] \setupTABLE[c][1][width=44mm] \setupTABLE[c][2][width=4em] \setupTABLE[c][3][width=11mm] \setupTABLE[c][4][width=33mm] \bTABLE \bTR \bTD token \eTD \bTD[frame=off] = \eTD \bTD cmd \eTD \bTD chr \eTD \eTR \eTABLE \stoplinecorrection Back to the three characters: these become tokens where the command code indicates that it is a letter and the char code stores what letter we have at hand and in the case of \LUATEX\ and \LUAMETATEX\ these are \UNICODE\ values. Contrary to the traditional 8 bit \TEX\ engine, in the \UNICODE\ engines an \UTF\ sequence is read, but these multiple bytes still become one number that will be encoded in the token number. In order to determine that something is a letter the engine has to be told (which is what a macro package does when it sets up the engine). For instance, digits are so called other characters and the backslash is called escape. Every \TEX\ user knows that curly braces are special and so are dollar symbols and hashes. If this rings a bell, and you relate this to catcodes, you can indeed assume that the command codes of these tokens have the same numbers as the catcodes. Given that \UNICODE\ has plenty of characters slots you can imagine that combining 16 catcode commands with all the possible \UNICODE\ values makes a large repertoire of tokens. There are more commands than the 16 basic characters related ones, in \LUAMETATEX\ we have just over 150 command codes (\LUATEX\ has a few more but they are also organized differently). Each of these codes can have a sub command, For instance the primitives \type {\vbox} and \type {\hbox} are both a \type {make_box_cmd} (we use the symbolic name here) and in \LUAMETATEX\ the first one has sub command code 9 (\type {vbox_code}) and the second one has code 10 (\type {hbox_code}). There are twelve primitives that are in the same category. The many primitives that make up the core of the engine are grouped in a way that permits processing similar ones with one function and also makes it possible to distinguish between the way commands are handled, for instance with respect to expansion. Now, before we move on it is important to know that al these codes are in fact abstract numbers. Although it is quite likely that engines that are derived from each other have similar numbers (just more) this is not the case for \LUAMETATEX. Because the internals have been opened up (even more than in \LUATEX) the command and char codes have been reorganized in a such a way that exposure is consistent. We could not use some of the reuse and remap tricks that the other engines use because it would simply be too confusing (and demand real in depth knowledge of the internals). This is also the reason why development took some time. You probably won't notice it from the current source but it was a very stepwise process. We not only had to make sure that all kept working (\CONTEXT\ \LMTX\ and \LUAMETATEX\ were pretty useable during the process), but also had to (re)consider intermediate choices. So, input is converted into tokens, in most cases one|-|by|-|one. When a token is assembled, it either gets stored (deliberately or as part of some look ahead scanning), or it immediately gets (what is called:) expanded. Depending on what the command is, some action is triggered. For instance, a character gets appended to the node list immediately. An \type {\hbox} command will start assembling a box which its own node list that then gets some treatment: if this primitive was a follow up on \type {\setbox} it will get stored, otherwise it might end up in the current node list as so called hlist node. Commands that relate to registers have \type {0xFFFF} char codes because that is how many registers we have per category. When a token gets stored for later processing it becomes part of a larger data structure, a so called memory word. These memory words are taken from a large pool of words and they store a token and additional properties. The info field contains the token value, the mentioned command and char. When there is no linked list, the link can actually be used to store a value, something that in \LUAMETATEX\ we actually do. \startlinecorrection[blank] \setupTABLE[each][align=middle] \setupTABLE[c][1][width=8mm] \setupTABLE[c][2][width=64mm] \setupTABLE[c][3][width=64mm] \bTABLE \bTR \bTD 1 \eTD \bTD info \eTD \bTD link \eTD \eTR \bTR \bTD 2 \eTD \bTD info \eTD \bTD link \eTD \eTR \bTR \bTD 3 \eTD \bTD info \eTD \bTD link \eTD \eTR \bTR \bTD n \eTD \bTD info \eTD \bTD link \eTD \eTR \eTABLE \stoplinecorrection When for instance we say \typ {\toks 0 {tex}} the scanner sees an escape, followed by 4 letters (\type {toks}) and the escape triggers a lookup of the primitive (or macro or \unknown) with that name, in this case a primitive assignment command. The found primitive (its property gets stored in the token) triggers scanning for a number and when that is successful scanning of a brace delimited token list starts. The three characters become three letter tokens and these are a linked list of the mentioned memory words. This list then gets stored in token register zero. The input sequence \typ {\the \toks 0} will push back a copy of this list into the input. In addition to the token memory pool, there is also a table of equivalents. That one is part of a larger table of memory words where \TEX\ stores all it needs to store. The 16 groups of character commands are virtual, storing these makes no sense, so the first real entries are all these registers (count, dimension, skip, box, etc). The rest is taken up by possible hash entries. \startlinecorrection[blank] \bTABLE \bTR \bTD[ny=4] main hash \eTD \bTD null control sequence \eTD \eTR \bTR \bTD 128K hash entries \eTD \eTR \bTR \bTD frozen control sequences \eTD \eTR \bTR \bTD special sequences (undefined) \eTD \eTR \bTR \bTD[ny=7] registers \eTD \bTD 17 internal & 64K user glues \eTD \eTR \bTR \bTD 4 internal & 64K user mu glues \eTD \eTR \bTR \bTD 12 internal & 64K user tokens \eTD \eTR \bTR \bTD 2 internal & 64K user boxes \eTD \eTR \bTR \bTD 116 internal & 64K user integers \eTD \eTR \bTR \bTD 0 internal & 64K user attribute \eTD \eTR \bTR \bTD 22 internal & 64K user dimensions \eTD \eTR \bTR \bTD specifications \eTD \bTD 5 internal & 0 user \eTD \eTR \bTR \bTD extra hash \eTD \bTD additional entries (grows dynamic) \eTD \eTR \eTABLE \stoplinecorrection So, a letter token \type {t} is just that, a token. A token referring to a register is again just a number, but its char code points to a slot in the equivalents table. A macro, which we haven't discussed yet, is actually just a token list. When a name lookup happens the hash table is consulted and that one runs in parallel to part of the table of equivalents. When there is a match, the corresponding entry in the equivalents table points to a token list. \startlinecorrection[blank] \setupTABLE[each][align=middle] \setupTABLE[c][1][width=16mm] \setupTABLE[c][2][width=64mm] \setupTABLE[c][3][width=64mm] \bTABLE \bTR \bTD 1 \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR \bTR \bTD 2 \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR \bTR \bTD n \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR \bTR \bTD n + 1 \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR \bTR \bTD n + 2 \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR \bTR \bTD n + m \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR \eTABLE \stoplinecorrection It sounds complex and it actually also is somewhat complex. It is not made easier by the fact that we also track information related to grouping (saving and restoring), need reference counts for copies of macros and token lists, sometimes store information directly instead of via links to token lists, etc. And again one cannot compare \LUAMETATEX\ with the other engines. Because we did away with some of the limitations of the traditional engine we not only could save some memory but in the end also simplify matters (we're 32/64 bit after all). On the one hand some traditional speedups were removed but these have been compensated by improvements elsewhere, so overall processing is more efficient. \startlinecorrection[blank] \setupTABLE[each][align=middle] \setupTABLE[c][1][width=8mm] \setupTABLE[c][2][width=32mm] \setupTABLE[c][3][width=16mm] \setupTABLE[c][4][width=16mm] \setupTABLE[c][5][width=64mm] \bTABLE \bTR \bTD 1 \eTD \bTD level \eTD \bTD type \eTD \bTD flag \eTD \bTD value \eTD \eTR \bTR \bTD 2 \eTD \bTD level \eTD \bTD type \eTD \bTD flag \eTD \bTD value \eTD \eTR \bTR \bTD 3 \eTD \bTD level \eTD \bTD type \eTD \bTD flag \eTD \bTD value \eTD \eTR \bTR \bTD n \eTD \bTD level \eTD \bTD type \eTD \bTD flag \eTD \bTD value \eTD \eTR \eTABLE \stoplinecorrection So, here \LUAMETATEX\ differs from other engines because it combines two tables, which is possible because we have at least 32 bits. There are at most \type {0xFFFF} levels but we need at most \type {0xFF} types. in \LUAMETATEX\ macros can have extra properties (flags) and these also need one byte. Contrary to the other engines, \type {\protected} macros are native and have their own command code, but \type {\tolerant} macros duplicate that (so we have four distinct macro commands). All other properties, like the \type {\permanent} ones are stored in the flags. Because a macro starts with a reference count we have some room in the info field to store information about it having arguments or not. It is these details that make \LUAMETATEX\ a bit more efficient in terms of memory usage and performance than its ancestor \LUATEX. But as with the other changes, it was a very stepwise process in order to keep the system compatible and working. \stopsectionlevel \startsectionlevel[title=Some implementation details] Sometimes there is a special head token at the start. This makes for easier appending of extra tokens. In traditional \TEX\ node lists are forward linked, in \LUATEX\ they are double linked \footnote {On the agenda of \LUAMETATEX\ is to use this property in the underlying code, that doesn't yet profit from this and therefore keep previous pointers in store.}. Token lists are always forward linked. Shared token lists use the head node for a reference count. For various reasons original \TEX\ uses global variables temporary lists. This is for instance needed when we expand (nested) and need to report issues. But in \LUATEX\ we often just serialize lists and using local variables makes more sense. One of the first things done in \LUAMETATEX\ was to group all global variables in (still global) structures but well isolated. That also made it possible to actually get rid of some globals. Because \TEX\ had to run on machines that we nowadays consider rather limited, it had to be sparse and efficient. There are quite some optimizations to limit code and memory consumption. The engine also does its own memory management. Freed token memory words are collected in a cache and reused but they can get scattered which is not that bad, apart from maybe cache hits. In \LUAMETATEX\ we stay as close to original \TEX\ as possible but there have been some improvements. The \LUA\ interfaces force us to occasionally divert from the original, and that in fact might lead to some retrofit but the original documentation still mostly applies. However, keep in mind that in \LUATEX\ we store much more in nodes (each has a prev pointer and an attribute list pointer and for instance glyph nodes have some 20 extra fields compared to traditional \TEX\ character nodes). \stopsectionlevel \startsectionlevel[title=Other data management] There is plenty going on in \TEX\ when it processes your input, just to mention a few: \startitemize[packed] \startitem Grouping is handled by a nesting stack. \stopitem \startitem Nested conditionals (\type {\if...}) have their own stack. \stopitem \startitem The values before assignments are saved on the save stack. \stopitem \startitem Also other local changes (housekeeping) ends up in the save stack. \stopitem \startitem Token lists and macro aliases have references pointers (reuse). \stopitem \startitem Attributes, being linked node lists, have their own management. \stopitem \stopitemize In all these subsystems tokens or references to tokens can play a role. Reading a single character from the input can trigger a lot of action. A curly brace tagged as begin group command will push the grouping level and from then on registers and some other quantities that are changed will be stored on the save stack so that after the group ends they can be restored. When primitives take keywords, and no match happens, tokens are pushed back into the input which introduces a new input level (also some stack). When numbers are read a token that represents no digit is pushed back too and macro packages use numbers and dimensions a lot. It is a surprise that \TEX\ is so fast. \stopsectionlevel \startsectionlevel[title=Macros] There is a distinction between primitives, the build in commands, and macros, the commands defined by users. A primitive relates to a command code and char code but macros are, unless they are made an alias to something else, like a \type {\countdef} or \type {\let} does, basically pointers to a token list. There is some additional data stored that makes it possible to parse and grab arguments. When we have a control sequence (macro) \type {\controlsequence} the name is looked up in the hash table. When found its value will point to the table of equivalents. As mentioned, that table keeps track of the cmd and points to a token list (the meaning). We saw that this table also stores the current level of grouping and flags. If we say, in the input, \typ {\hbox to 10pt {x\hss}}, the box is assembled as we go and when it is appended to the current node list there are no tokens left. When scanning this, the engine literally sees a backslash and the four letters \type {hbox}. However when we have this: \starttyping[option=TEX] \def\MyMacro{\hbox to 10pt {x\hss}} \stoptyping the \type {\hbox} has become one memory word which has a token representing the \type {\hbox} primitive plus a link to the next token. The space after a control sequence is gobbled so the next two tokens, again stored in a linked memory word, are letter tokens, followed by two other and two letter tokens for the dimensions. Then we have a space, a brace, a letter, a primitive and a brace. The about 20 characters in the input became a dozen memory words each two times four bytes, so in terms of memory usage we end up with quite a bit more. However, when \TEX\ runs over that list it only has to interpret the token values because the scanning and conversion already happened. So, the space that a macro takes is more than compensated by efficient reprocessing. \stopsectionlevel \startsectionlevel[title=Looking at tokens] When you say \type {\tracingall} you will see what the engine does: read input, expand primitives and macros, typesetting etc.\ You might need to set \type {\tracingonline} to get a bit more output on the console. One way to look at macros is to use the \type {\meaning} command, so if we have: \startbuffer[definition] \permanent\protected\def\MyMacro#1#2{Do #1 or #2!} \stopbuffer \startbuffer[meaning] \meaning \MyMacro \meaningless\MyMacro \meaningfull\MyMacro \stopbuffer \typebuffer[definition][option=TEX] we can say this: \typebuffer[meaning][option=TEX] and get: {\getbuffer[definition]\startlines\tttf \getbuffer[meaning]\stoplines} You get less when you ask for the meaning of a primitive, just its name. The \type {\meaningfull} primitive gives the most information. In \LUAMETATEX\ protected macros are first class commands: they have their own command code. In the other engines they are just regular macros with an initial token indicating that they are protected. There are specific command codes for \type {\outer} and \type {\long} macros but we dropped that in \LUAMETATEX . Instead we have \type {\tolerant} macros but that's another story. The flags that were mentioned can mark macros in a way that permits overload protection as well as some special treatment in otherwise tricky cases (like alignments). The overload related flags permits a rather granular way to prevent users from redefining macros and such. They are set via prefixes, and add to that repertoire: we have 14 prefixes but only some eight deal with flags (we can add more if really needed). The probably most wel known prefix is \type {\global} and that one will not become a flag: it has immediate effect. For the above definition, the \type {\showluatokens} command will show a meaning on the console. \starttyping[option=TEX] \showluatokens\MyMacro \stoptyping % {\getbuffer[definition]\getbuffer} This gives the next list, where the first column is the address of the token, the second one the command code, and the third one the char code. When there are arguments involved, the list of what needs to get matched is shown. \starttyping permanent protected control sequence: MyMacro 501263 19 49 match argument 1 501087 19 50 match argument 2 385528 20 0 end match -------------- 501090 11 68 letter D (U+00044) 30833 11 111 letter o (U+0006F) 500776 10 32 spacer 385540 21 1 parameter reference 112057 10 32 spacer 431886 11 111 letter o (U+0006F) 30830 11 114 letter r (U+00072) 30805 10 32 spacer 500787 21 2 parameter reference 213412 12 33 other char ! (U+00021) \stoptyping In the next subsections I will give some examples. This time we use helper defined in a module: \starttyping[option=TEX] \usemodule[system-tokens] \stoptyping \startsectionlevel[title=Example 1: in the input] \startbuffer \luatokentable{1 \bf{2} 3\what {!}} \stopbuffer \typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer} \stopsectionlevel \startsectionlevel[title=Example 2: in the input] \startbuffer \luatokentable{a \the\scratchcounter b \the\parindent \hbox to 10pt{x}} \stopbuffer \typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer} \stopsectionlevel \startsectionlevel[title=Example 3: user registers] \startbuffer \scratchtoks{foo \framed{\red 123}456} \luatokentable\scratchtoks \stopbuffer \typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer} \stopsectionlevel \startsectionlevel[title=Example 4: internal variables] \startbuffer \luatokentable\everypar \stopbuffer \typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer} \stopsectionlevel \startsectionlevel[title=Example 5: macro definitions] \startbuffer \protected\def\whatever#1[#2](#3)\relax {oeps #1 and #2 & #3 done ## error} \luatokentable\whatever \stopbuffer \typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer} \stopsectionlevel \startsectionlevel[title=Example 6: commands] \startbuffer \luatokentable\startitemize \luatokentable\stopitemize \stopbuffer \typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer} \stopsectionlevel \startsectionlevel[title=Example 7: commands] \startbuffer \luatokentable\doifelse \stopbuffer \typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer } \stopsectionlevel \startsectionlevel[title=Example 8: nothing] \startbuffer \luatokentable\relax \stopbuffer \typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer } \stopsectionlevel \startsectionlevel[title=Example 9: hashes] \startbuffer \edef\foo#1#2{(#1)(\letterhash)(#2)} \luatokentable\foo \stopbuffer \typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer } \stopsectionlevel \startsectionlevel[title=Example 10: nesting] \startbuffer \def\foo#1{\def\foo##1{(#1)(##1)}} \luatokentable\foo \stopbuffer \typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer } \stopsectionlevel \startsectionlevel[title=Remark] In all these examples the numbers are to be seen as abstractions. Some command codes and sub command codes might change as the engine evolves. This is why the \LUAMETATEX\ engine has lots of \LUA\ functions that provide information about what number represents what command. \stopsectionlevel \stopsectionlevel \stopdocument