% language=us runpath=texruns:manuals/luametatex

\environment luametatex-style

\startdocument[title=Tokens]

\startsection[title={Introduction}]

If a \TEX\ programmer talks tokens (and nodes) the average user can safely ignore
it. Often it is enough to know that your input is tokenized, which means that one
or more characters in the input got converted into some efficient internal
representation that then travels through the system and triggers actions. When
you see an error message with \TEX\ code, the reverse happened: tokens were
converted back into commands that resemble the (often expanded) input.

There are not that many examples here because the functions discussed here are
often not used directly but instead integrated into somewhat more convenient
interfaces. However, in due time more examples might show up here.

\stopsection

\startsection[title={\LUA\ token representation}]

A token is a 32 bit integer that encodes a command and a value, index, reference
or whatever goes with a command. The input is converted into tokens and the
bodies of macros are stored as linked lists of tokens. In the latter case we
combine a token and a next pointer in what is called a memory word. If we see
tokens in \LUA\ we don't get the integer but a userdata object that comes with
accessors.

Unless you're into very low level programming the likelihood of encountering
tokens is low. But related to tokens is scanning, so that is what we cover here
in more detail.

\stopsection

\startsection[title={Helpers}]

\startsubsection[title={Basics}]

References to macros are stored in a table along with some extra properties but
in the end they travel around as tokens. The same is true for characters: they
are also encoded in a token. We have three ways to create a token:

\starttyping[option=LUA]
function token.create ( <t:integer> value )
    return <t:token> -- userdata
end

function token.create ( <t:integer> value, <t:integer> command)
    return <t:token> -- userdata
end

function token.create ( <t:string> csname )
    return <t:token> -- userdata
end
\stoptyping

An example of the first variant is \type {token.create(65)}. When we
print (inspect) this in \CONTEXT\ we get:

\starttyping[option=LUA]
<lua token : 476151 == letter 65>={
 ["category"]="letter",
 ["character"]="A",
 ["id"]=476151,
}
\stoptyping

If we say \type {token.create(65,12)} instead we get:

\starttyping[option=LUA]
<lua token : 476151 == other_char 65>={
 ["category"]="other",
 ["character"]="A",
 ["id"]=476151,
}
\stoptyping

An example of the third call is \type {token.create("relax")}. This time we get:

\starttyping[option=LUA]
<lua token : 580111 == relax : relax 0>={
 ["active"]=false,
 ["cmdname"]="relax",
 ["command"]=16,
 ["csname"]="relax",
 ["expandable"]=false,
 ["frozen"]=false,
 ["id"]=580111,
 ["immutable"]=false,
 ["index"]=0,
 ["instance"]=false,
 ["mutable"]=false,
 ["noaligned"]=false,
 ["permanent"]=false,
 ["primitive"]=true,
 ["protected"]=false,
 ["tolerant"]=false,
}
\stoptyping

Another example is \type {token.create("dimen")}:

\starttyping[option=LUA]
<lua token : 467905 == dimen : register 3>={
 ["active"]=false,
 ["cmdname"]="register",
 ["command"]=121,
 ["csname"]="dimen",
 ["expandable"]=false,
 ["frozen"]=false,
 ["id"]=467905,
 ["immutable"]=false,
 ["index"]=3,
 ["instance"]=false,
 ["mutable"]=false,
 ["noaligned"]=false,
 ["permanent"]=false,
 ["primitive"]=true,
 ["protected"]=false,
 ["tolerant"]=false,
}
\stoptyping

The most important properties are \type {command} and \type {index} because the
combination determines what it does. The macros (here primitives) have a lot of
extra properties. These are discussed in the low level manuals.

You can check if something is a token with the next function; when a token is
passed the return value is the string literal \type {token}.

\starttyping[option=LUA]
function token.type ( <t:whatever> )
    return <t:string> "token" | <t:nil>
end
\stoptyping

A maybe more natural test is:

\starttyping[option=LUA]
function token.istoken ( <t:whatever> )
    return <t:boolean> -- success
end
\stoptyping

Internally we can see variables like \type {cmd}, \type {chr}, \type {tok} and
such, where the latter is a combination of the first two. The \type {create}
variant that takes two integers relates to this. Of course you need to know what
the magic numbers are. Passing weird numbers can give side effects so don't
expect too much help with that. You need to know what you're doing. The best way
to explore the way these internals work is to just look at how primitives or
macros or \type {\chardef}'d commands are tokenized. Just create a known one and
inspect its fields. A variant that ignores the current catcode table is:

\startbuffer
\protected\def\MyMacro#1{\dimen 0 = \numexpr #1 + 10 \relax}
\stopbuffer

\typebuffer % \showluatokens\MyMacro

A macro like this is actually a little program:

\starttyping
467922   19   49  match                argument 1
580083   20    0  end match
--------------
467931  121    3  register             dimen
580013   12   48  other char           0 (U+00030)
582314   10   32  spacer
582312   12   61  other char           = (U+0003D)
580193   10   32  spacer
582783   81   75  some item            numexpr
582310   21    1  parameter reference
190952   10   32  spacer
582785   12   43  other char           + (U+0002B)
476151   10   32  spacer
580190   12   49  other char           1 (U+00031)
582265   12   48  other char           0 (U+00030)
467939   10   32  spacer
580045   16    0  relax                relax
\stoptyping

The first column shows indices in token memory where we have a token combined
with a next pointer. So, in slot \type {467931} we have both a token and a
pointer to slot \type {580013}.

There is another way to create a token.

\starttyping[option=LUA]
function token.new ( <t:string> command, <t:integer> value )
    return <t:token>
end

function token.new ( <t:integer> value, <t:integer> command )
    return <t:token>
end
\stoptyping

Watch the order of arguments. We now have four ways to create a token

\starttyping[option=LUA]
<lua token : 580087 == letter 65>={
 ["category"]="letter",
 ["character"]="A",
 ["id"]=580087,
}
\stoptyping

namely:

\starttyping[option=LUA]
token.new("letter",65)
token.new(65,11)
token.create(65,11)
token.create(65)
\stoptyping

You can test if a control sequence is defined with:

\starttyping[option=LUA]
function token.isdefined ( <t:string> t )
    return <t:boolean> -- success
end
\stoptyping
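
For instance, the following sketch only creates a reference when the control
sequence exists:

\starttyping[option=LUA]
-- "relax" is a primitive, so this normally reports true
if token.isdefined("relax") then
    local t = token.create("relax")
end
\stoptyping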

The engine was never meant to be this open which means that in various places the
assumption is that tokens are valid. However, it is possible to create tokens that
make little sense in some contexts and can even make the system crash. When
possible we catch this but checking everywhere would bloat the code and harm
performance. Compare this to changing a few bytes in a binary that at some point
can create havoc.

\stopsubsection

\startsubsection[title={Getters}]

The userdata objects have a virtual interface that permits access by fieldname.
Alternatively you can use one of the getters.

% function token.gettok ( ) -- obsolete end

\starttyping[option=LUA]
function token.getcommand ( <t:token> t ) return <t:integer> end
function token.getindex   ( <t:token> t ) return <t:integer> end
function token.getcmdname ( <t:token> t ) return <t:string>  end
function token.getcsname  ( <t:token> t ) return <t:string>  end
function token.getid      ( <t:token> t ) return <t:integer> end
function token.getactive  ( <t:token> t ) return <t:boolean> end
\stoptyping
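
As an illustration, here is a small sketch that applies a few of these getters to
the primitive token shown before:

\starttyping[option=LUA]
local t = token.create("relax")
-- for the \relax token shown above this reports: 16  relax  relax
print(token.getcommand(t), token.getcsname(t), token.getcmdname(t))
\stoptyping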

If you want to know what the possible values are, you can use:

\starttyping[option=LUA]
function token.getrange (
    <t:token> | <t:integer>
)
    return
        <t:integer>, -- first
        <t:integer>  -- last
end
\stoptyping

We can ask for the individual macro properties, but you can also just fetch the
bit set that describes them.

\starttyping[option=LUA]
function token.getexpandable ( <t:token> t ) return <t:boolean> end
function token.getprotected  ( <t:token> t ) return <t:boolean> end
function token.getfrozen     ( <t:token> t ) return <t:boolean> end
function token.gettolerant   ( <t:token> t ) return <t:boolean> end
function token.getnoaligned  ( <t:token> t ) return <t:boolean> end
function token.getprimitive  ( <t:token> t ) return <t:boolean> end
function token.getpermanent  ( <t:token> t ) return <t:boolean> end
function token.getimmutable  ( <t:token> t ) return <t:boolean> end
function token.getinstance   ( <t:token> t ) return <t:boolean> end
function token.getconstant   ( <t:token> t ) return <t:boolean> end
\stoptyping

The bit set can be fetched with:

\starttyping[option=LUA]
function token.getflags ( <t:token> t )
    return <t:integer> -- bit set
end
\stoptyping

The possible flags are:

\startthreerows
\getbuffer[engine:syntax:flagcodes]
\stopthreerows

The number of parameters of a macro can be queried with:

\starttyping[option=LUA]
function token.getparameters ( <t:token> t )
    return <t:integer>
end
\stoptyping

The three properties that are used to identify a token can be fetched with:

\starttyping[option=LUA]
function token.getcmdchrcs ( <t:token> t )
    return
        <t:integer>, -- command (cmd)
        <t:integer>, -- value   (chr)
        <t:integer>  -- index   (cs)
end
\stoptyping

A simpler call is:

\starttyping[option=LUA]
function token.getcstoken ( <t:string> csname )
    return <t:integer> -- token number
end
\stoptyping
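
A sketch that combines the two calls above:

\starttyping[option=LUA]
-- the cmd and chr of the \relax primitive, plus its index (cs)
local cmd, chr, cs = token.getcmdchrcs(token.create("relax"))
-- the integer that encodes the \relax token itself
local tok = token.getcstoken("relax")
\stoptyping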

A table with relevant properties of a token (or control sequence) can be fetched
with:

\starttyping[option=LUA]
function token.getfields ( <t:token> token )
    return <t:table> -- fields
end

function token.getfields ( <t:string> csname )
    return <t:table> -- fields
end
\stoptyping

\stopsubsection

\startsubsection[title={Setters}]

The \type {setmacro} function can be called with a varying number of arguments,
where the prefix list comes last. Examples of prefixes are \type {global} and \type
{protected}.

\starttyping[option=LUA]
function token.setmacro (
    <t:string> csname
)
    -- no return values
end

function token.setmacro (
    <t:integer> catcodetable,
    <t:string>  csname
)
    -- no return values
end

function token.setmacro (
    <t:string> csname,
    <t:string> content
)
    -- no return values
end

function token.setmacro (
    <t:integer> catcodetable,
    <t:string>  csname,
    <t:string>  content
)
    -- no return values
end

function token.setmacro (
    <t:string> csname,
    <t:string> content,
    <t:string> prefix
 -- there can be more prefixes
)
    -- no return values
end

function token.setmacro (
    <t:integer> catcodetable,
    <t:string>  csname,
    <t:string>  content,
    <t:string>  prefix
 -- there can be more prefixes
)
    -- no return values
end
\stoptyping
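
A sketch of the string variant with prefixes (the macro name is just for
illustration); the \type {getmacro} helper discussed next fetches the body back:

\starttyping[option=LUA]
-- comparable to: \protected\gdef\MyWidth{123pt}
token.setmacro("MyWidth", "123pt", "global", "protected")
-- fetch the body back as a string: 123pt
print(token.getmacro("MyWidth"))
\stoptyping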

A macro can also be queried:

\starttyping[option=LUA]
function token.getmacro (
    <t:string>  csname,
    <t:boolean> preamble,
    <t:boolean> onlypreamble
)
    return <t:string>
end
\stoptyping

The various arguments determine what you get:

\startbuffer
\def\foo#1{foo: #1}

\ctxlua{context.type(token.getmacro("foo"))}
\ctxlua{context.type(token.getmacro("foo",true))}
\ctxlua{context.type(token.getmacro("foo",false,true))}
\stopbuffer

\typebuffer

We get:

\startlines
\getbuffer
\stoplines

The meaning can be fetched as string or table:

\starttyping[option=LUA]
function token.getmeaning (
    <t:string>  csname
)
    return <t:string>
end

function token.getmeaning (
    <t:string>  csname,
    <t:true>    astable,
    <t:boolean> subtables,
    <t:boolean> originalindices -- special usage
)
    return <t:table>
end
\stoptyping

The name says it:

\starttyping[option=LUA]
function token.undefinemacro ( <t:string> csname )
    -- no return values
end
\stoptyping

Expanding a macro happens in a \quote {local control} context which makes it
immediate, that is, while running \LUA\ code.

\starttyping[option=LUA]
function token.expandmacro ( <t:string> csname )
    -- no return values
end
\stoptyping

This means that:

\startbuffer
\def\foo{\scratchdimen100pt \edef\oof{\the\scratchdimen}}
% used in:
\startluacode
token.expandmacro("foo")
context(token.getmacro("oof"))
\stopluacode
\stopbuffer

\typebuffer

gives:\inlinebuffer, because when \typ {getmacro} is called the expansion has
been performed. You can consider this a sort of subrun (local to the main control
loop).

The next helper creates a token that refers to a \LUA\ function with an entry in
the table that you can access with \typ {lua.getfunctionstable}. It is the
companion to \type {\luadef}. When the first (and only) argument is true the size
will be preset to the value of \typ {texconfig.functionsize}.

\starttyping[option=LUA]
function token.setlua (
    <t:string>  csname,
    <t:integer> id,
    <t:string>  prefix
 -- there can be more prefixes
)
    return <t:token>
end
\stoptyping
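
A sketch, assuming that slot \type {1001} in the functions table is (still) free
and that the macro name is just for illustration:

\starttyping[option=LUA]
local functions = lua.getfunctionstable()

functions[1001] = function()
    tex.print("Hi there!")
end

-- afterwards \MyLuaCall acts as a (protected) macro that runs the function
token.setlua("MyLuaCall", 1001, "protected")
\stoptyping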

%   function token.setinteger   -- can go ... also in texlib
%   function token.getinteger   -- can go ... also in texlib
%   function token.setdimension -- can go ... also in texlib
%   function token.getdimension -- can go ... also in texlib

\stopsubsection

\startsubsection[title={Writers}]

In the \type {tex} library we have various ways to print something back to the
input and these print helpers in most cases also accept tokens. The \type
{token.putnext} function is rather tolerant with respect to its arguments and
there can be multiple. As with most prints, a new input level is created.

\starttyping[option=LUA]
function token.putnext ( <t:string> | <t:number> | <t:token> | <t:table> )
    -- no return values
end
\stoptyping

Here are some examples. We save some scanned tokens and flush them:

\starttyping[option=LUA]
local t1 = token.scannext()
local t2 = token.scannext()
local t3 = token.scannext()
local t4 = token.scannext()
-- watch out, we flush in sequence
token.putnext { t1, t2 }
-- but this one gets pushed in front
token.putnext ( t3, t4 )
\stoptyping

When we scan \type {wxyz!} we get \type {yzwx!} back. The argument is either a
table with tokens or a list of tokens. The \type {token.expand} function will
trigger expansion but what happens really depends on what you're doing where.

This putter is actually a bit more flexible because the following input also
works out okay:

\startbuffer
\def\foo#1{[#1]}

\directlua {
    local list = { 101, 102, 103, token.create("foo"), "{abracadabra}" }
    token.putnext("(the)")
    token.putnext(list)
    token.putnext("(order)")
    token.putnext(unpack(list))
    token.putnext("(is reversed)")
}
\stopbuffer

\typebuffer

We get this: \blank {\tt \inlinebuffer} \blank So, strings get converted to
individual tokens according to the current catcode regime and numbers become
characters, also according to this regime. A lower level, single token push back
is the next one; it does the same as when \TEX\ itself puts a token back into
the input, something that for instance happens when an integer is scanned and the
last scanned token is not a digit.

\starttyping[option=LUA]
function token.putback ( <t:token> )
    -- no return values
end
\stoptyping

You can force an \quote {expand step} with the following function. What happens
depends on the input and the scanner state \TEX\ is in.

\starttyping[option=LUA]
function token.expand ( )
    -- no return values
end
\stoptyping

\stopsubsection

\startsubsection[title={Scanning}]

The token library provides means to intercept the input and deal with it at the
\LUA\ level. The library provides a basic scanner infrastructure that can be used
to write macros that accept a wide range of arguments. This interface is kept
general on purpose, and performance is quite okay, so one can build additional
parsers without too much overhead. It's up to macro package writers to see how
they can benefit from this as the main principle behind \LUAMETATEX\ is to
provide a minimal set of tools and no solutions. The scanner functions are
probably the most intriguing.

We start with token scanners. The first one just reads the next token from the
current input (file, token list, \LUA\ output) while the second variant expands
the next token, which can push back results and make us enter a new input level,
and then reads a token from what is then the input.

\starttyping[option=LUA]
function token.scannext ( )
    return <t:token>
end

function token.scannextexpanded ( )
    return <t:token>
end
\stoptyping

This is a simple scanner that picks up a character:

\starttyping[option=LUA]
function token.scannextchar ( )
    return <t:string>
end
\stoptyping

We can look ahead, that is: pick up a token and push a copy back into the input.
The second helper first expands the upcoming token and the third one is the peek
variant of \type {scannextchar}.

\starttyping[option=LUA]
function token.peeknext ( )
    return <t:token>
end

function token.peeknextexpanded ( )
    return <t:token>
end

function token.peeknextchar ( )
    return <t:token>
end
\stoptyping

We can skip tokens with the following two helpers, where the second one first
expands the upcoming token.

\starttyping[option=LUA]
function token.skipnext ( )
    -- no return values
end

function token.skipnextexpanded ( )
    -- no return values
end
\stoptyping

The next token can be converted into a combination of command and value. The
second variant shown below first expands the upcoming token.

\starttyping[option=LUA]
function token.scancmdchr ( )
    return
        <t:integer>, -- command a.k.a. cmd
        <t:integer>  -- value   a.k.a. chr
end

function token.scancmdchrexpanded ( )
    return
        <t:integer>, -- command a.k.a. cmd
        <t:integer>  -- value   a.k.a. chr
end
\stoptyping

We have two keyword scanners. The first scans how \TEX\ does it: a mixture of
lower- and uppercase. The second is case sensitive.

\starttyping[option=LUA]
function token.scankeyword ( <t:string> keyword )
    return <t:boolean> -- success
end

function token.scankeywordcs ( <t:string> keyword )
    return <t:boolean> -- success
end
\stoptyping
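
A sketch of keyword driven scanning, here for an optional \type {scale}
specification followed by an integer (using the integer scanner discussed below;
the keyword name is just for illustration):

\starttyping[option=LUA]
local scale = 1000
if token.scankeyword("scale") then
    -- accept an optional equal sign before the number
    scale = token.scaninteger(true)
end
\stoptyping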

The integer, dimension and glue scanners take an extra optional argument that
signals that an optional equal sign is permitted. The next function errors when
the integer exceeds the maximum that \TEX\ likes: \number \maxcount .

\starttyping[option=LUA]
function token.scaninteger ( <t:boolean> optionalequal )
    return <t:integer>
end
\stoptyping
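
A typical usage pattern is a macro that lets \LUA\ pick up its argument, a
sketch (the macro name is made up):

\starttyping
\def\MyDouble{\directlua{context(2 * token.scaninteger())}}

\MyDouble 21 % typesets 42
\stoptyping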

Cardinals are unsigned integers:

\starttyping[option=LUA]
function token.scancardinal ( <t:boolean> optionalequal )
    return <t:cardinal>
end
\stoptyping

When an integer or dimension is wrapped in curly braces, like \type {{123}} and
\type {{4.5pt}}, you can use one of the next two. Of course unwrapped integers
and dimensions are also read.

\starttyping[option=LUA]
function token.scanintegerargument ( <t:boolean> optionalequal )
    return <t:integer>
end

function token.scandimensionargument (
    <t:boolean> infinity,
    <t:boolean> mu,
    <t:boolean> optionalequal
)
    return <t:integer>
end
\stoptyping

When we scan for a float, we also accept an exponent, so \type {123.45} and
\type {-1.23e45} are valid:

% \cldcontext{type(token.scanfloat())} 1.23
% \cldcontext{type(token.scanfloat())} 1.23e100

\starttyping[option=LUA]
function token.scanfloat ( )
    return <t:number>
end
\stoptyping

Contrary to the previous scanner, here we don't handle the exponent:

\starttyping[option=LUA]
function token.scanreal ( )
    return <t:number>
end
\stoptyping

In \LUA\ a very precise representation of a float is the hexadecimal notation. In
addition to regular floating point, optionally with an exponent, you can also
have \type {0x1.23p45}.

% \cldcontext{"\letterpercent q",token.scanluanumber()} 0x1.23p45

\starttyping[option=LUA]
function token.scanluanumber ( )
    return <t:number>
end
\stoptyping

Integers can be signed:

\starttyping[option=LUA]
function token.scanluainteger ( )
    return <t:integer>
end
\stoptyping

while cardinals (\MODULA2 speak) are unsigned:

\starttyping[option=LUA]
function token.scanluacardinal ( )
    return <t:cardinal>
end
\stoptyping

The next scanner picks up a scaled value; here \type {122.345} gives:

\cldcontext{token.scanscale()} 122.345

\starttyping[option=LUA]
function token.scanscale ( )
    return <t:integer>
end
\stoptyping

A posit is (in \LUAMETATEX) a float packed into an integer, but contrary to a
scaled value it can have exponents. Here \type {12.34} gives {\tttf
\cldcontext{token.scanposit()} 12.34} and here \type {12.34e5} gives {\tttf
\cldcontext{token.scanposit()}12.34e5}. Because we have integers we can store
them in \LUAMETATEX\ float registers. Optionally you can return a float instead
of the integer that encodes the posit.

\starttyping[option=LUA]
function token.scanposit (
    <t:boolean> optionalequal,
    <t:boolean> float
)
    return <t:integer> | <t:float>
end
\stoptyping

In (traditional) \TEX\ we don't really have floats. If we enter for instance a
dimension in point units, we actually scan for two 16 bit integers that will be
packed into a 32 bit integer. The next scanner expects a number plus a unit, like
\type {pt}, \type {cm} and \type {em}, but also handles user defined units, like
\type {tw} in \CONTEXT.

\starttyping[option=LUA]
function token.scandimension (
    <t:boolean> infinity,
    <t:boolean> mu,
    <t:boolean> optionalequal
)
    return <t:integer>
end
\stoptyping

A glue (spec) is a dimension with optional stretch and|/|or shrink, like \typ {12pt plus
4pt minus 2pt} or \typ {10pt plus 1 fill}. The glue scanner returns five values:

\starttyping[option=LUA]
function token.scanglue (
    <t:boolean> mu,
    <t:boolean> optionalequal
)
    return
        <t:integer>, -- amount
        <t:integer>, -- stretch
        <t:integer>, -- shrink
        <t:integer>, -- stretchorder
        <t:integer>  -- shrinkorder
end

function token.scanglue (
    <t:boolean> mu,
    <t:boolean> optionalequal,
    <t:true>
)
    return {
        <t:integer>, -- amount
        <t:integer>, -- stretch
        <t:integer>, -- shrink
        <t:integer>, -- stretchorder
        <t:integer>  -- shrinkorder
    }
end
\stoptyping
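
A sketch that shows both calls; the second one returns the five values packed in
a table:

\starttyping[option=LUA]
-- picks up something like: 3pt plus 2pt minus 1pt
local amount, stretch, shrink, stretchorder, shrinkorder = token.scanglue()
-- the same, but packed in a table
local spec = token.scanglue(false, false, true)
\stoptyping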

The skip scanner does the same but returns a \type {gluespec} node:

\starttyping[option=LUA]
function token.scanskip (
    <t:boolean> mu,
    <t:boolean> optionalequal
)
    return <t:node> -- gluespec
end
\stoptyping

There are several token scanners, for instance one that returns a table:

\starttyping[option=LUA]
function token.scantoks (
    <t:boolean> macro,
    <t:boolean> expand
)
    return <t:table> -- tokens
end
\stoptyping

Here \type {token.scantoks()} will return \type {{123}} as:

\starttyping[option=LUA]
{
 "<lua token : 589866 == other_char 49>",
 "<lua token : 589867 == other_char 50>",
 "<lua token : 589870 == other_char 51>",
}
\stoptyping

The next variant returns a token list:

\starttyping[option=LUA]
function token.scantokenlist (
    <t:boolean> macro,
    <t:boolean> expand
)
    return <t:token> -- tokenlist
end
\stoptyping

Here we get the head of a token list:

\starttyping[option=LUA]
<lua token : 590083 => 169324 : refcount>={
 ["active"]=false,
 ["cmdname"]="escape",
 ["command"]=0,
 ["expandable"]=false,
 ["frozen"]=false,
 ["id"]=590083,
 ["immutable"]=false,
 ["index"]=0,
}
\stoptyping

This scans a single character token with specified catcode (bit) sets:

\starttyping[option=LUA]
function token.scancode ( <t:integer> catcodes )
    return <t:string> -- character
end
\stoptyping

This scans a single character token with catcode letter or other:

\starttyping[option=LUA]
function token.scantokencode ( )
    return <t:token>
end
\stoptyping

The difference between \typ {scanstring} and \typ {scanargument} is that the
first returns a string given between \type {{}}, as \type {\macro} or as a
sequence of characters with catcode 11 or 12, while the second also accepts a
\type {\cs} which then gets expanded one level unless we force further expansion.

\starttyping[option=LUA]
function token.scanstring ( <t:boolean> expand )
    return <t:string>
end

function token.scanargument ( <t:boolean> expand )
    return <t:string>
end
\stoptyping

So the \type {scanargument} function expands the given argument. When a braced
argument is scanned, expansion can be prohibited by passing \type {false}
(default is \type {true}). In case of a control sequence passing \type {false}
will result in a one|-|level expansion (the meaning of the macro).

The string scanner scans for something between curly braces and expands on the
way, or when it sees a control sequence it will return its meaning. Otherwise it
will scan characters with catcode \type {letter} or \type {other}. So, given the
following definition:

\startbuffer
\def\oof{oof}
\def\foo{foo-\oof}
\stopbuffer

\typebuffer \getbuffer

we get:

\starttabulate[|l|Tl|l|]
\FL
\BC name \BC result \NC \NR
\TL
\NC \type {\directlua{token.scanstring()}{foo}} \NC \directlua{context("{\\red\\type {"..token.scanstring().."}}")} {foo} \NC full expansion \NC \NR
\NC \type {\directlua{token.scanstring()}foo}   \NC \directlua{context("{\\red\\type {"..token.scanstring().."}}")} foo   \NC letters and others \NC \NR
\NC \type {\directlua{token.scanstring()}\foo}  \NC \directlua{context("{\\red\\type {"..token.scanstring().."}}")}\foo   \NC meaning \NC \NR
\LL
\stoptabulate

The \type {\foo} case only gives the meaning, but one can pass an already
expanded definition (\type {\edef}'d). In the case of the braced variant one can
of course use the \type {\detokenize} and \prm {unexpanded} primitives since
there we do expand.

A variant is the following, which gives a bit more control over what doesn't get
expanded:

\starttyping[option=LUA]
function token.scantokenstring (
    <t:boolean> noexpand,
    <t:boolean> noexpandconstant,
    <t:boolean> noexpandparameters
)
    return <t:string>
end
\stoptyping

Here's one that can scan a delimited argument:

\starttyping[option=LUA]
function token.scandelimited (
    <t:integer> leftdelimiter,
    <t:integer> rightdelimiter,
    <t:boolean> expand
)
    return <t:string>
end
\stoptyping

A word is a sequence of what \TEX\ calls letters and other characters. The
optional \type {keep} argument ensures that trailing space and \type {\relax}
tokens are pushed back into the input.

\starttyping[option=LUA]
function token.scanword ( <t:boolean> keep )
    return <t:string>
end
\stoptyping

Here we do the same but only accept letters:

\starttyping[option=LUA]
function token.scanletters ( <t:boolean> keep )
    return <t:string>
end
\stoptyping

The next one picks up a key, as for instance used in key|/|value parsing:

\starttyping[option=LUA]
function token.scankey ( )
    return <t:string>
end
\stoptyping

We can pick up a string that stops at a specific character with the next
function, which accepts two such sentinels (think of a comma and a closing
bracket).

\starttyping[option=LUA]
function token.scanvalue ( <t:integer> one, <t:integer> two )
    return <t:string>
end
\stoptyping

This returns a single (\UTF) character. Special input like backslashes, hashes,
etc.\ are interpreted as characters.

\starttyping[option=LUA]
function token.scanchar ( )
    return <t:string>
end
\stoptyping

This scanner looks for a control sequence and if found returns the name.
Optionally leading spaces can be skipped.

\starttyping[option=LUA]
function token.scancsname ( <t:boolean> skipspaces )
    return <t:string> | <t:nil>
end
\stoptyping
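
A sketch in the spirit of the other examples, where the scanned name gets typeset
(the macro name is made up):

\starttyping
\def\MyMeta{\directlua{context(token.scancsname(true))}}

\MyMeta \foo % typesets: foo
\stoptyping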

The related \type {scancstoken} helper returns an integer instead:

\starttyping[option=LUA]
function token.scancstoken ( <t:boolean> skipspaces )
    return <t:integer> | <t:nil>
end
\stoptyping

This is a straightforward simple scanner that expands the next token if needed:

\starttyping[option=LUA]
function token.scantoken ( )
    return <t:token>
end
\stoptyping

The next scanner picks up a box specification and returns a \type {[h|v]list}
node. There are two possible calls. The first variant expects a \type {\hbox},
\type {\vbox} etc. The second variant scans for an explicitly passed box type:
\type {hbox}, \type {vbox} or \type {dbox}.

\starttyping[option=LUA]
function token.scanbox ( )
    return <t:node> -- box
end

function token.scanbox ( <t:string> boxtype )
    return <t:node> -- box
end
\stoptyping

This scans and returns a so called \quote {detokenized} string:

\starttyping[option=LUA]
function token.scandetokened ( <t:boolean> expand )
    return <t:string>
end
\stoptyping

In the next function we check if a specific character with catcode
letter or other is picked up.

\starttyping[option=LUA]
function token.isnextchar ( <t:integer> charactercode )
    return <t:boolean>
end
\stoptyping
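
A sketch, checking for an optional \type {[} (character code 91):

\starttyping[option=LUA]
if token.isnextchar(91) then
    -- a [ with catcode letter or other is seen next
else
    -- something else is in the input
end
\stoptyping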

\stopsubsection

\startsubsection[title={Gobbling}]

You can gobble up an integer or dimension with the following helpers. An error is
silently ignored.

\starttyping[option=LUA]
function token.gobbleinteger ( <t:boolean> optionalequal )
    -- no return values
end

function token.gobbledimension ( <t:boolean> optionalequal )
    -- no return values
end
\stoptyping

This is a nested gobbler:

\starttyping[option=LUA]
function token.gobble ( <t:token> left, <t:token> right )
    -- no return values
end
\stoptyping

and this is a nested grabber that returns a string:

\starttyping[option=LUA]
function token.grab ( <t:token> left, <t:token> right )
    return <t:string>
end
\stoptyping
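
A sketch where two \quote {other} character tokens serve as delimiters:

\starttyping[option=LUA]
local left  = token.create(string.byte("("), 12)
local right = token.create(string.byte(")"), 12)
-- throw away a balanced (...) group from the input
token.gobble(left, right)
-- or fetch such a group as a string instead:
-- local s = token.grab(left, right)
\stoptyping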

\stopsubsection

\startsubsection[title={Macros}]

This is a nasty one. It picks up two tokens. Then it checks if the next character
matches the argument and if so, it pushes the first token back into the input,
otherwise the second.

\starttyping[option=LUA]
function token.futureexpand ( <t:integer> charactercode )
    -- no return values
end
\stoptyping

The \type {pushmacro} and \type {popmacro} functions are still experimental and
can be used to get and set an existing macro. The push call returns a userdata
object and the pop takes such a userdata object. These objects have no accessors
and are to be seen as abstractions.

\starttyping[option=LUA]
function token.pushmacro ( <t:string> csname )
    return <t:userdata>
end

function token.pushmacro ( <t:integer> token )
    return <t:userdata> -- entry
end
\stoptyping

\starttyping[option=LUA]
function token.popmacro ( <t:userdata> entry )
    -- return todo
end
\stoptyping

This saves a \LUA\ function index on the save stack. When a group is closed the
function will be called.

\starttyping[option=LUA]
function token.savelua ( <t:integer> functionindex, <t:boolean> backtrack )
    -- no return values
end
\stoptyping
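
A sketch, again assuming a free slot in the functions table:

\starttyping[option=LUA]
local functions = lua.getfunctionstable()

functions[1002] = function()
    -- called when the current group gets closed
    print("group closed")
end

token.savelua(1002)
\stoptyping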

The next function serializes a token list:

\starttyping[option=LUA]
function token.serialize ( <t:token> tokenlist )
    return <t:string>
end
\stoptyping

The function is somewhat picky so we give an example in \CONTEXT\ speak:

\startbuffer
\startluacode
    local t = token.scantokenlist()
    local s = token.serialize(t)
    context.type(tostring(t)) context.par()
    context.type(s)           context.par()
    context(s)                context.par()
\stopluacode {before\hskip10pt after}
\stopbuffer

\typebuffer

The serializer expects a token list as scanned by \typ {scantokenlist}, which
starts with a token that points to the list and maintains a reference count. That
count is irrelevant in this context but is used in the engine to prevent
duplicates; for instance the \type {\let} primitive just points to the original
and bumps the count.

\startlines
\getbuffer
\stoplines

You can interpret a string as \TEX\ input with embedded macros expanded, unless
they are unexpandable.

\starttyping[option=LUA]
function token.getexpansion ( <t:string> code )
    return <t:string> -- result
end
\stoptyping

Here is an example:

\startbuffer
          \def\foo{foo}
\protected\def\oof{oof}

\startluacode
context.type(token.getexpansion("test \relax"))
context.par()
context.type(token.getexpansion("test \\relax{!} \\foo\\oof"))
\stopluacode
\stopbuffer

\typebuffer

Watch how the single backslash actually is a \LUA\ escape that results in
a newline:

\startlines
\getbuffer
\stoplines

You can also specify a catcode table identifier:

\starttyping[option=LUA]
function token.getexpansion (
    <t:integer> catcodetable,
    <t:string>  code
)
    return <t:string> -- result
end
\stoptyping

\stopsubsection

\startsubsection[title={Information}]

In some cases you signal to \LUA\ what data type is involved. The list of known
types is available with:

\starttyping[option=LUA]
function token.getfunctionvalues ( )
    return <t:table>
end
\stoptyping

\startthreerows
\getbuffer[engine:syntax:functioncodes]
\stopthreerows

The names of the commands are made available with:

\starttyping[option=LUA]
function token.getcommandvalues ( )
    return <t:table>
end
\stoptyping

\starttworows
\getbuffer[engine:syntax:commandcodes]
\stoptworows

The complete list of primitives can be fetched with the next one:

\starttyping[option=LUA]
function token.getprimitives ( )
    return {
        { <t:integer>, <t:integer>, <t:string> }, -- command, value, name
        ...
    }
end
\stoptyping

The numbers shown below can change if we add or reorganize primitives, although
this seldom happens. The list gives an impression of how primitives are grouped.

\showengineprimitives[2]

This is a curious one: it returns the number of steps that a hash lookup took:

\starttyping[option=LUA]
function token.locatemacro ( <t:string> name )
    return <t:integer> -- steps
end
\stoptyping

We used this helper when deciding on a reasonable hash size. Of the many
primitives there are a few that need more than one lookup step:

\startluacode
local p = token.getprimitives()
local d = { { }, { }, { }, { } }
local n = {  0 ,  0 ,  0 ,  0  }
table.sort(p,function(a,b) return a[3] < b[3] end)
for i=1,#p do
    local m = p[i][3]
    local s = token.locatemacro(m)
    if n[s] then
        if s > 1 then
            table.insert(d[s],m)
        end
        n[s] = n[s] + 1
    else
        print(">>>>>>>>>>>>>>>>>>>>>>>>>> check",s)
    end
end
context.starttabulate { "|c|r|lpT|" }
context.FL()
context.BC() context("steps")
context.BC() context("total")
context.BC() context("macros")
context.NC() context.NR()
context.TL()
for i=1,4 do
    local di = d[i]
    local ni = n[i]
    if ni > 0 then
        context.NC() context(i)
        context.NC() context(ni)
        context.NC() if ni > 20 then context.unknown() else context("% t",di) end
        context.NC() context.NR()
    end
end
context.LL()
context.stoptabulate()
\stopluacode

\stopsubsection

\stopsection

\stopdocument


% The \type {scanword} scanner can be used to implement for instance a number
% scanner. An optional boolean argument can signal that a trailing space or \type
% {\relax} should be gobbled:
%
% \starttyping
% function token.scannumber(base)
%     return tonumber(token.scanword(),base)
% end
% \stoptyping
%
% This scanner accepts any valid \LUA\ number so it is a way to pick up floats
% in the input.
%
% You can use the \LUA\ interface as follows:
%
% \starttyping
% \directlua {
%     function mymacro(n)
%         ...
%     end
% }
%
% \def\mymacro#1{%
%     \directlua {
%         mymacro(\number\dimexpr#1)
%     }%
% }
%
% \mymacro{12pt}
% \mymacro{\dimen0}
% \stoptyping
%
% You can also do this:
%
% \starttyping
% \directlua {
%     function mymacro()
%         local d = token.scandimen()
%         ...
%     end
% }
%
% \def\mymacro{%
%     \directlua {
%         mymacro()
%     }%
% }
%
% \mymacro 12pt
% \mymacro \dimen0
% \stoptyping
%
% It is quite clear from looking at the code what the first method needs as
% argument(s). For the second method you need to look at the \LUA\ code to see what
% gets picked up. Instead of passing from \TEX\ to \LUA\ we let \LUA\ fetch from
% the input stream.
%
% In the first case the input is tokenized and then turned into a string, then it
% is passed to \LUA\ where it gets interpreted. In the second case only a function
% call gets interpreted but then the input is picked up by explicitly calling the
% scanner functions. These return proper \LUA\ variables so no further conversion
% has to be done. This is more efficient but in practice (given what \TEX\ has to
% do) this effect should not be overestimated. For numbers and dimensions it saves
% a bit but for passing strings conversion to and from tokens has to be done anyway
% (although we can probably speed up the process in later versions if needed).

% When scanning for the next token you need to keep in mind that we're not scanning
% like \TEX\ does: expanding, changing modes and doing things as it goes. When we
% scan with \LUA\ we just pick up tokens. Say that we have:
%
% \pushmacro\oof \let\oof\undefined
%
% \starttyping
% \oof
% \stoptyping
%
% but \type {\oof} is undefined. Normally \TEX\ will then issue an error message.
% However, when we have:
%
% \starttyping
% \def\foo{\oof}
% \stoptyping
%
% We get no error, unless we expand \type {\foo} while \type {\oof} is still
% undefined. What happens is that as soon as \TEX\ sees an undefined macro it will
% create a hash entry and when later it gets defined that entry will be reused. So,
% \type {\oof} really exists but can be in an undefined state.
%
% \startbuffer[demo]
% oof        : \directlua{tex.print(token.scancsname())}\oof
% foo        : \directlua{tex.print(token.scancsname())}\foo
% myfirstoof : \directlua{tex.print(token.scancsname())}\myfirstoof
% \stopbuffer
%
% \startlines
% \getbuffer[demo]
% \stoplines
%
% This was entered as:
%
% \typebuffer[demo]
%
% The reason that you see \type {oof} reported and not \type {myfirstoof} is that
% \type {\oof} was already used in a previous paragraph.
%
% If we now say:
%
% \startbuffer
% \def\foo{}
% \stopbuffer
%
% \typebuffer \getbuffer
%
% we get:
%
% \startlines
% \getbuffer[demo]
% \stoplines
%
% And if we say
%
% \startbuffer
% \def\foo{\oof}
% \stopbuffer
%
% \typebuffer \getbuffer
%
% we get:
%
% \startlines
% \getbuffer[demo]
% \stoplines
%
% When scanning from \LUA\ we are not in a mode that defines (undefined) macros at
% all. There we just get the real primitive undefined macro token.
%
% \startbuffer
% \directlua{local t = token.scannext() tex.print(t.id.." "..t.tok)}\myfirstoof
% \directlua{local t = token.scannext() tex.print(t.id.." "..t.tok)}\mysecondoof
% \directlua{local t = token.scannext() tex.print(t.id.." "..t.tok)}\mythirdoof
% \stopbuffer
%
% \startlines
% \getbuffer
% \stoplines
%
% This was generated with:
%
% \typebuffer
%
% So, we do get a unique token because after all we need some kind of \LUA\ object
% that can be used and garbage collected, but it is basically the same one,
% representing an undefined control sequence.
%
% \popmacro\oof
