% language=us runpath=texruns:manuals/evenmore

% TODO: copy_node_list : return tail
% TODO: grabnested

\environment evenmore-style

\startcomponent evenmore-tokens

\startchapter[title=Tokens]

{\em This is mostly a wrapup of some developments, and definitely not a
tutorial.}

Talking deep down \TEX\ is talking about tokens and nodes. Roughly speaking,
from the perspective of the user, tokens are what goes in and stays in (as
macro, token list or whatever) and nodes are what gets produced and eventually
results in output.

A character in the input becomes one token (before expansion) and a control
sequence like \type {\foo} is also turned into a token. Tokens can be linked
into lists. This means that in the engine we can talk about tokens in two ways:
as a single item with properties that trigger actions, or as a compound item
made from that item and a pointer to the next token (called a link). In \LUA\
speak token memory can be seen as:

\starttyping
fixmem = {
    { info, link },
    { info, link },
    { info, link },
    { info, link },
    ....
}
\stoptyping

Both are 32 bit integers. The \type {info} is a combination of a command code
(an operator) and a so called chr code (operand) and these determine its
behaviour. For instance the command code can indicate an integer register and
the chr code then indicates the number of that register. So, like:

\starttyping
fixmem = {
    { { cmd, chr }, index_into_fixmem },
    { { cmd, chr }, index_into_fixmem },
    { { cmd, chr }, index_into_fixmem },
    { { cmd, chr }, index_into_fixmem },
    ....
}
\stoptyping

In the following line the characters that make up the three words are tokens
(letters), and so are the space (spacer), the curly braces (begin- and endgroup
tokens) and the bold face switch (which becomes one token that resolves to a
token list of tokens that trigger actions, in this case switching to a bolder
font).

\starttyping
foo {\bf bar} foo
\stoptyping

When \TEX\ reads a line of input, tokens are expanded immediately, but a
sequence can also become part of a macro body or token list. Here we have
$3_{\type{foo}} + 1 + 1_{\type+{+} + 1_{\type{\bf}} + 3_{\type{bar}} + 1_{\type+}+} + 1 + 3_{\type{foo}} = 14$
tokens.

A control sequence normally starts with a backslash. Some are built in, these
are called primitives, and others are defined by the macro package or the user.
There is a lookup table that relates the tokenized control sequence to some
action. For instance:

\starttyping
\def\foo{foo}
\stoptyping

creates an entry that leads (directly or following a hash chain) to the three
letter token list. Every time the input sees \type {\foo} it gets resolved to
that list via a hash lookup. However, once internalized and part of a token
list, it is a direct reference. On the other hand,

\starttyping
\the\count0
\stoptyping

triggers the \type {\the} action that relates to this control sequence, which
then reads a next token and operates on that. That next token itself expects a
number as follow up. In the end the value of \type {\count0} is found, and that
one also sits in the so called equivalents lookup table, in what \TEX\ calls
specific regions.

\starttyping
equivalents = {
    { level, type, value },
    { level, type, value },
    { level, type, value },
    ...
}
\stoptyping

The value is in most cases similar to the info (cmd & chr) field in fixmem, but
one difference is that counters, dimensions etc.\ directly store their value,
which is why we sometimes need the type separately, for instance in order to
reclaim memory for glue or node specifications.
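To make that linking a bit more concrete: here is a purely illustrative sketch,
in the same \LUA\ metaphor, of how the three letter tokens in the body of
\type {\foo} could be chained. The indices are invented and the command code of
a letter is simply its catcode (11) in traditional speak; the real values
depend on the engine.

\starttyping
-- illustrative only: the body of \def\foo{foo} as three linked fixmem entries
fixmem = {
    [100] = { info = { cmd = 11, chr = string.byte("f") }, link = 101 },
    [101] = { info = { cmd = 11, chr = string.byte("o") }, link = 102 },
    [102] = { info = { cmd = 11, chr = string.byte("o") }, link = 0   }, -- 0 ends the list
}
\stoptyping

The entry for \type {\foo} in the equivalents table then just stores a pointer
to the head of such a list.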
It sounds complicated and it is, but as long as you get a rough idea we can
continue. Just keep in mind that tokens sometimes get expanded on the fly, and
sometimes just get stored. There are a lot of primitives and each has a unique
info. The same is true for characters: each category has its own command code,
so regular letters can be distinguished from other tokens, comment signs, math
triggers, etc. All important basic bits are in the table of equivalents, macros
as well as registers, although the meaning of a macro and the content of token
lists live in the fixmem table, and the content of boxes in so called node
lists (nodes have their own memory).

In traditional \TEX\ the lookup table for primitives, registers and macros is
as compact as can be: it is an array of so called 32 bit memory words. These
can be divided into halves and quarters, so in the source you find terms like
\type {halfword} and \type {quarterword}. The lookup table is a hybrid:

\starttyping
[level 8] [type 8] [value 16] | [equivalent 32]
[level 8] [type 8] [value 16] | [equivalent 32]
[level 8] [type 8] [value 16] | [equivalent 32]
...
\stoptyping

The mentioned counters and such are directly encoded in an equivalent and the
rest is a combination of level, type and value. The level is used for grouping,
and in for instance \PDFTEX\ there can therefore be at most 255 levels. In
\LUATEX\ we use a wider model. There we have 64 bit memory words, which means
that we have way more levels and don't need this dual nature:

\starttyping
[level 16] [type 16] [value 32]
[level 16] [type 16] [value 32]
[level 16] [type 16] [value 32]
...
\stoptyping

We already showed a \LUA\ representation. The type in this table is what a
command code is in an \quote {info} field. In such a token the integer encodes
the command as well as a value (called chr). In the lookup table the type is
the command code. When \TEX\ is dealing with a control sequence it looks at the
type, otherwise it filters the command from the token integer. This means that
a token cannot store an integer (or dimension), but the lookup table actually
can do that. However, commands can limit the range, for instance characters are
bound by what \UNICODE\ permits.

Internally, \LUATEX\ still uses these ranges of fast accessible registers, like
counters, dimensions and attributes. However, we saw that in \LUATEX\ they
don't overlap with the level and type. In \LUATEX, at least till version 1.13,
we still have the shadow array for levels, but in \LUAMETATEX\ we just use
those in the equivalents lookup table. If you look in the \PASCAL\ source you
will notice that arrays run from \type {[somemin ... somemax]}, which in the
\CCODE\ source would mean using offsets. Actually, the shadow array starts at
zero, so we waste the part that doesn't need shadowing.

It is good to remind ourselves that traditional \TEX\ is 8 bit character based.
The equivalents lookup table has all kinds of special ranges (combined into
regions of similar nature, in \TEX\ speak), like those for lowercase mapping,
specific catcode mappings, etc., but we're still talking of $n \times 256$
entries. In \LUATEX\ all these mappings are in dedicated sparse hash tables
because we need to support the full \UNICODE\ repertoire. This means that,
while on the one hand \LUATEX\ uses more memory for the lookup table, on the
other hand the number of slots can be less.
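From the \LUA\ end you can see that these mappings indeed cover the full
\UNICODE\ range. A small sketch using the standard \type {tex} accessors; the
values shown assume a format (like \CONTEXT) that initializes the codes from
its character tables:

\starttyping
-- illustration: the code mappings accept any Unicode slot, not just 0-255
print(tex.getlccode(0x00C5))          -- 229: Å lowercases to å
print(tex.getcatcode(utf.byte("A")))  -- 11: a letter
tex.setlccode(0x1E9E, 0x00DF, 0x1E9E) -- map capital sharp s explicitly
\stoptyping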
But still there was the waste of the shadow level table. I didn't calculate the
exact saving of ditching that one, but I bet it came close to what was
available as total memory for programs and data on the first machines that I
used for running \TEX. But \unknown\ after more than a decade of \LUATEX\ we
now reclaimed that space in \LUAMETATEX. \footnote {Don't expect a gain in
performance, although using less memory might pay back on a virtual machine or
when \TEX\ has to share the \CPU\ cache.}

Now, in case you're interested (and actually I just write it down because I
don't want to forget it myself) the lookup table in \LUAMETATEX\ is laid out as
follows:

\starttabulate
\NC the hash table                    \NC                          \NC \NR
\NC some frozen primitives            \NC                          \NC \NR
\NC current and defined fonts         \NC one slot + many pointers \NC \NR
\NC undefined control sequence        \NC one slot                 \NC \NR
\NC internal and register glue        \NC pointer to node          \NC \NR
\NC internal and register muglue      \NC pointer to node          \NC \NR
\NC internal and register toks        \NC pointer to token list    \NC \NR
\NC internal and register boxes       \NC pointer to node list     \NC \NR
\NC internal and register counts      \NC value in token           \NC \NR
\NC internal and register attributes  \NC value in token           \NC \NR
\NC internal and register dimens      \NC value in token           \NC \NR
\NC some special data structures      \NC pointer to node list     \NC \NR
\NC the (runtime) extended hash table \NC                          \NC \NR
\stoptabulate

Normally a user doesn't need to know anything about these specific properties
of the engine and it might comfort you to know that for a long time I could
stay away from these details. One difference with the other engines is that we
have internal variables and registers split more explicitly. The special data
structures have their own slots and are not just put somewhere (semi random).
The initialization is a bit more granular in that we properly set the types
(cmd codes) for registers, which in turn is possible because, for instance, we
are able to distinguish glue types. This is all part of coming up with a bit
more consistent interface to tokens from the \LUA\ end. It also permits
diagnostics.

Anyway, we are now ready for some more details about tokens. You don't need to
understand all of it in order to define decent macros. But when you are using
\LUATEX\ and do want to mess around, here is some insight. Assume we have
defined these macros:

\startluacode
local alsoraw = false

function documentdata.StartShowTokens(rawtoo)
    context.starttabulate { "|T|rT|lT|rT|rT|rT|" .. (rawtoo and "rT|" or "") }
    context.BC()
    context.BC() context("cmd")
    context.BC() context("name")
    context.BC() context("chr")
    context.BC() context("cs")
    if rawtoo then
        context.BC() context("rawchr")
    end
    context.BC()
    context.NR()
    context.SL()
    alsoraw = rawtoo
end

function documentdata.StopShowTokens()
    context.stoptabulate()
end

function documentdata.ShowToken(name)
    local cmd, chr, cs = token.get_cmdchrcs(name)
    local _,   raw, _  = token.get_cmdchrcs(name,true)
    context.NC() context("\\string\\"..name)
    context.NC() context(cmd)
    context.NC() context(tokens.commands[cmd])
    context.NC() context(chr)
    context.NC() context(cs)
    if alsoraw and chr ~= raw then
        context.NC() context(raw)
    end
    context.NC() context.NR()
end
\stopluacode

\startbuffer
\def\MacroA{a} \def\MacroB{b}
\def\macroa{a} \def\macrob{b}
\def\MACROa{a} \def\MACROb{b}
\stopbuffer

\typebuffer \getbuffer

How does that end up internally?
\startluacode
documentdata.StartShowTokens(true)
documentdata.ShowToken("scratchcounterone")
documentdata.ShowToken("scratchcountertwo")
documentdata.ShowToken("scratchdimen")
documentdata.ShowToken("scratchtoks")
documentdata.ShowToken("scratchcounter")
documentdata.ShowToken("letterpercent")
documentdata.ShowToken("everypar")
documentdata.ShowToken("%")
documentdata.ShowToken("pagegoal")
documentdata.ShowToken("pagetotal")
documentdata.ShowToken("hangindent")
documentdata.ShowToken("hangafter")
documentdata.ShowToken("dimdim")
documentdata.ShowToken("relax")
documentdata.ShowToken("dimen")
documentdata.ShowToken("stoptext")
documentdata.ShowToken("MacroA")
documentdata.ShowToken("MacroB")
documentdata.ShowToken("MacroC")
documentdata.ShowToken("macroa")
documentdata.ShowToken("macrob")
documentdata.ShowToken("macroc")
documentdata.ShowToken("MACROa")
documentdata.ShowToken("MACROb")
documentdata.ShowToken("MACROc")
documentdata.StopShowTokens()
\stopluacode

We show the raw chr value, but in the \LUA\ interface these are normalized to,
for instance, proper register indices. This is because the raw numbers can for
instance be indices into memory or some \UNICODE\ reference with catcode
specific bits set. But, although these indices are real, the offsets can
actually change when the implementation changes. For that reason, in
\LUAMETATEX\ we had better talk of command codes as the main indicator,
combined with:

\starttabulate
\NC subcommands      \NC for tokens that have variants, like \type {\ifnum} \NC \NR
\NC register indices \NC for the 64K register banks, like \type {\count0}   \NC \NR
\NC internal indices \NC for internal variables like \type {\parindent}     \NC \NR
\NC characters       \NC specific \UNICODE\ slots combined with a catcode   \NC \NR
\NC pointers         \NC to token lists, macros, \LUA\ functions, nodes     \NC \NR
\stoptabulate

This so called \type {cs} number is a pointer into the table of equivalents.
That number comes from the hash table. A macro name, when scanned for the
first time, is still a sequence of bytes. This sequence is used to compute a
hash number, which is a pointer to a slot in the lower part of the hash
(lookup) table. That slot points to a string and to a next hash entry in the
higher end. A lookup goes as follows:

\startitemize[n,packed]
\startitem
    compute the index into the hash table from the string
\stopitem
\startitem
    go to the slot with that index and compare the \type {string} field
\stopitem
\startitem
    when there is no match, go to the slot indicated by the \type {next} field
\stopitem
\startitem
    compare again and keep following \type {next} fields till there is no
    follow up
\stopitem
\startitem
    optionally create a new entry
\stopitem
\startitem
    use the index of that entry as index into the table of equivalents
\stopitem
\stopitemize

So, in \LUA\ speak, we have:

\starttyping
hashtable = {
    -- lower part, accessed via the calculated hash number
    { stringpointer, nextindex },
    { stringpointer, nextindex },
    ...
    -- higher part, accessed by following nextindex
    { stringpointer, nextindex },
    { stringpointer, nextindex },
    ...
}
\stoptyping

Eventually, after following a lookup chain in the hash table, we end up at a
pointer into the equivalents lookup table that we already discussed. From then
on we're talking tokens. When you're lucky, the list is small and you have a
quick match.
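Just to illustrate the procedure, here is a sketch in terms of that toy
representation. The \type {compute_hash} function stands for whatever the
engine really does with the bytes of the name, and of course the real code
walks arrays of memory words, not \LUA\ tables:

\starttyping
local function lookup(hashtable, name)
    local index = compute_hash(name) -- assumed: some function of the bytes
    while true do
        local slot = hashtable[index]
        if slot[1] == name then      -- compare the stringpointer field
            return index             -- this index also points into equivalents
        elseif slot[2] ~= 0 then     -- there is a nextindex
            index = slot[2]          -- follow the chain into the higher part
        else
            return nil               -- no match (or: create a new entry here)
        end
    end
end
\stoptyping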
The maximum initial hash index is not that large, around 64K (double that in
\LUAMETATEX), so in practice there will often be some indirect (multi|-|compare)
match. Increasing the lower end of the hash table might result in fewer string
comparisons later on, but it also increases the time needed to calculate the
initial hash for accessing the lower part. Here you can sort of see that:

\startbuffer
\dostepwiserecurse{`a}{`z}{1}{
    \expandafter\def\csname whatever\Uchar#1\endcsname {}
}
\dostepwiserecurse{`a}{`z}{1}{
    \expandafter\let\csname somemore\Uchar#1\expandafter\endcsname
                    \csname whatever\Uchar#1\endcsname
}
\stopbuffer

\typebuffer \getbuffer

\startluacode
documentdata.StartShowTokens(true)
for i=utf.byte("a"),utf.byte("z") do
    documentdata.ShowToken("whatever"..utf.char(i))
    documentdata.ShowToken("somemore"..utf.char(i))
end
documentdata.StopShowTokens()
\stopluacode

The command code indicates a macro and the action related to it is an
expandable call. We have no subcommand \footnote {We cheat a little here
because chr actually is an index into token memory, but we don't show it as
such.} so that column shows zeros. The fifth column is the hash entry, which
can bring us back to the verbose name as needed in reporting, while the last
column is the index into token memory (watch the duplicates for \type {\let}
macros: a reference count is kept in order to be able to manage such shared
references). When you look at the cs column you will notice that some numbers
are close, which (I think) in this case indicates some closeness in the
calculated hash name and followed chain. It will be clear that it is best not
to make any assumptions with respect to these numbers, which is why in
\LUAMETATEX\ we sort of normalize them when accessing properties.

\starttabulate
\NC field      \NC meaning \NC \NR
\FL
\NC command    \NC operator \NC \NR
\NC cmdname    \NC internal name of the operator \NC \NR
\NC index      \NC sanitized operand \NC \NR
\NC mode       \NC original operand \NC \NR
\NC csname     \NC associated name \NC \NR
\NC id         \NC the index in token memory (a virtual address) \NC \NR
\NC tok        \NC the integer representation \NC \NR
\ML
\NC active     \NC true when an active character \NC \NR
\NC expandable \NC true when an expandable command \NC \NR
\NC protected  \NC true when a protected command \NC \NR
\NC frozen     \NC true when a frozen command \NC \NR
\NC user       \NC true when a user defined command \NC \NR
\LL
\stoptabulate

When a control sequence is an alias to an existing primitive, for instance made
by \type {\let}, the operand (chr) is picked up from its meaning. Take this:

\startbuffer
\newif\ifmyconditionone
\newif\ifmyconditiontwo
\meaning\ifmyconditionone    \crlf
\meaning\ifmyconditiontwo    \crlf
\meaning\myconditiononetrue  \crlf
\meaning\myconditiontwofalse \crlf
\myconditiononetrue \meaning\ifmyconditionone \crlf
\myconditiontwofalse\meaning\ifmyconditiontwo \crlf
\stopbuffer

\typebuffer \getbuffer

Internally this is:

\startluacode
documentdata.StartShowTokens(false)
documentdata.ShowToken("ifmyconditionone")
documentdata.ShowToken("ifmyconditiontwo")
documentdata.ShowToken("iftrue")
documentdata.ShowToken("iffalse")
documentdata.StopShowTokens()
\stopluacode

The whole list of available commands is given below. Once they are stable the
\LUAMETATEX\ manual will document the accessors.
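Before we get to those, a quick sketch of how you can peek at the fields from
the previous table yourself. A \type {token.create} call takes a name and gives
back a token whose fields can be inspected; what you get for a given control
sequence of course depends on the engine version and the format:

\starttyping
local t = token.create("parindent")

print(t.command, t.cmdname) -- the operator and its internal name
print(t.index)              -- the sanitized operand
print(t.csname)             -- parindent
print(t.expandable)         -- false: an internal dimension is not expandable
print(t.protected, t.frozen)
\stoptyping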
In this chapter we use:

\starttyping
kind, min, max, fixedvalue = token.get_range("primitive")
cmd, chr, cs               = token.get_cmdchrcs("primitive")
\stoptyping

The kind of command is given in the first column, which can have the following
values:

\starttabulate[|l|l|p|]
\NC 0 \NC no        \NC not accessible \NC \NR
\NC 1 \NC regular   \NC possibly with a subcommand \NC \NR
\NC 2 \NC character \NC the \UNICODE\ slot is encoded in the token \NC \NR
\NC 3 \NC register  \NC this is an indexed register (zero up to 64K) \NC \NR
\NC 4 \NC internal  \NC this is an internal register (range given) \NC \NR
\NC 5 \NC reference \NC this is a reference to a node, \LUA\ function, etc. \NC \NR
\NC 6 \NC data      \NC a general data entry (kind of private) \NC \NR
\NC 7 \NC token     \NC a token reference (that can have a followup) \NC \NR
\stoptabulate

\usemodule[system-tokens]

\start
    \switchtobodyfont[7pt]
    \showsystemtokens
\stop

\stopchapter

\stopcomponent