% mk-tokenspeak.tex /size: 8254 b    last modification: 2023-12-21 09:43
% language=us

\startcomponent mk-tokenspeak

\environment mk-environment

\chapter {Token speak}

\subject{tokenization}

Most \TEX\ users only deal with (keyed in) characters and (produced) output. Some
will play with boxes, skips and kerns or maybe even leaders (repeated sequences
of the former). Others will be grateful that macro package writers take care of
such things.

Macro writers on the other hand deal with properties of characters, like catcodes
and a truckload of other codes, and with lists made out of boxes, skips, kerns and
penalties, but even they cannot look much deeper into \TEX's internals. Their
deeper understanding comes from reading the \TEX book or even looking at the
source code.

When someone enters the magic world of \TEX\ and starts asking around a bit,
he or she will at some point get confronted with the concept of \quote {tokens}.
A token is what ends up in \TEX\ after characters have entered its machinery.
Sometimes it even seems that one is only considered a qualified macro writer if
one can talk the right token||speak. So what are those magic tokens and how can
\LUATEX\ shed light on this?

In a moment we will show examples of how \LUATEX\ turns characters into tokens,
but when looking at those sequences, you need to keep a few things in mind:

\startitemize[packed]
\startitem
    A sequence of characters that starts with an escape symbol (normally this is
    the backslash) is looked up in the hash table (which relates those names to
    meanings) and replaced by its reference. Such a reference is much faster than
    looking up the sequence each time.
\stopitem
\startitem
    Characters can have special meanings, for instance a dollar is often used to
    enter and exit math mode, and a percent symbol starts a comment and hides
    everything following it on the same line. These meanings are determined by
    the character's catcode.
\stopitem
\startitem
    All the characters that will end up actually typeset have catcode \quote
    {letter} or \quote {other} assigned. A sequence of items with catcode
    \quote{letter} is considered a word and can potentially become hyphenated.
\stopitem
\stopitemize
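
The rules above can be mimicked in a few lines of plain \LUA. What follows is
only a toy model and not how \TEX\ itself is implemented: the catcode table is
deliberately tiny, the names \type {catcode}, \type {lookup} and \type {tokenize}
are made up for this sketch, and real tokenization knows more states and rules
than are shown here.

\starttyping
-- Toy model only, all names are hypothetical.

-- A tiny catcode table; characters not listed are 11 (letter) when they
-- are ASCII letters and 12 (other) otherwise.
local catcodes = {
    ["\\"] = 0,  -- escape
    ["{"]  = 1,  -- begin group
    ["}"]  = 2,  -- end group
    ["$"]  = 3,  -- math shift
    [" "]  = 10, -- space
    ["%"]  = 14, -- comment
}

local function catcode(c)
    if catcodes[c] then
        return catcodes[c]
    elseif c:match("%a") then
        return 11
    else
        return 12
    end
end

-- Stand-in for the hash table: a name is stored once and from then on
-- represented by its reference (here simply a number).
local hash, nofhashed = { }, 0

local function lookup(name)
    if not hash[name] then
        nofhashed = nofhashed + 1
        hash[name] = nofhashed
    end
    return hash[name]
end

local function tokenize(line)
    local tokens, i, n = { }, 1, #line
    while i <= n do
        local c  = line:sub(i,i)
        local cc = catcode(c)
        if cc == 0 then
            -- escape: the name is a run of letters or one other character
            local name = line:match("^%a+", i + 1)
            if name then
                i = i + 1 + #name
                -- spaces after such a name are skipped
                while i <= n and catcode(line:sub(i,i)) == 10 do
                    i = i + 1
                end
            else
                name = line:sub(i + 1, i + 1)
                i = i + 2
            end
            tokens[#tokens + 1] = { kind = "cs", name = name, ref = lookup(name) }
        elseif cc == 14 then
            break -- comment: the rest of the line is ignored
        else
            tokens[#tokens + 1] = { kind = "char", chr = c:byte(), cat = cc }
            i = i + 1
        end
    end
    return tokens
end

local t = tokenize("Hi there!") -- letters, one space, one other character
\stoptyping

In this model a control sequence that is used twice ends up with the same
\type {ref} value, which is the point of the first item above.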

\subject{examples}

We will now provide a few examples of how \TEX\ sees your input.

\starttyping
Hi there!
\stoptyping

\starttokens[demo]Hi there!\stoptokens \setups{ShowCollect}

Here we see three kinds of tokens. At this stage a space is still recognizable as
such but later this will become a skip. In our current setup, the exclamation
mark is not a letter.

\starttyping
Hans \& Taco use Lua\TeX \char 33\relax
\stoptyping

\starttokens[demo]Hans \& Taco use Lua\TeX \char 33\relax\stoptokens \setups{ShowCollect}

Here we see a few new tokens, a \quote {char\_given} and a \quote {call}. The
first represents a \type {\chardef}, i.e.\ a reference to a character slot in a
font, and the second one a macro that will expand to the \TEX\ logo. Watch how
the space after a control sequence is eaten up. The exclamation mark is a direct
reference to character slot~33.

\starttyping
\noindent {\bf Hans} \par \hbox{Taco} \endgraf
\stoptyping

\starttokens[demo]\noindent {\bf Hans} \par \hbox{Taco} \endgraf\stoptokens \setups{ShowCollect}

As you can see, some primitives and macros that are bound to them (like \type
{\endgraf}) have an internal representation on top of their name.

\starttyping
before \dimen2=10pt after \the\dimen2
\stoptyping

\starttokens[demo]before \dimen2=10pt after \the\dimen2\stoptokens \setups{ShowCollect}

As you can see, registers are not explicitly named; one needs the associated
register code to determine what kind of register it is (a dimension in our case).

\starttyping
before \inframed[width=3cm]{whatever} after
\stoptyping

\starttokens[demo]before \inframed[width=3cm]{whatever} after\stoptokens \setups{ShowCollect}

Even when control sequences are collapsed into a reference, we still end up with
many tokens, and because each token has three properties (cmd, chr and id), in
practice we end up with more memory used after tokenization.
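
As a rough illustration of that bookkeeping, the following plain \LUA\ sketch
counts characters and tokens for the input shown above. It assumes that each
control word collapses into one token and that every other character becomes one
token, and it counts stored values, not bytes. The variable names are made up for
this sketch.

\starttyping
-- Rough count only: a control word (backslash plus letters) is assumed
-- to become one token, every other character one token.
local input = "before \\inframed[width=3cm]{whatever} after"

-- replace each control word by a single placeholder byte
local collapsed = input:gsub("\\%a+", "\1")

local nofcharacters = #input         -- 43 characters keyed in
local noftokens     = #collapsed     -- 35 tokens in this model
local nofvalues     = 3 * noftokens  -- 105 values: cmd, chr and id per token

print(nofcharacters, noftokens, nofvalues)
\stoptyping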

\starttyping
compound|-|word
\stoptyping

\starttokens[demo]compound|-|word\stoptokens \setups{ShowCollect}

This example uses an active character to handle compound words (a \CONTEXT\
feature).

\starttyping
hm, \directlua 0 { tex.sprint("Hello World!") }
\stoptyping

\starttokens[demo]hm, \directlua 0 { tex.sprint("Hello World!") }\stoptokens \setups{ShowCollect}

The previous example shows what happens when we include a bit of \LUA\ code
\unknown\ it is just seen as regular input, but when the string is passed to
\LUA, only the chr property is passed, so we can no longer distinguish between
letters and other characters.
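
This loss of information can be mimicked in plain \LUA. The token records below
are hypothetical (just a catcode and a character code, with made up field names)
and \type {flatten} is not a \LUATEX\ function. The sketch only shows that two
token lists that differ in catcodes end up as the same string once only the chr
property is kept.

\starttyping
-- Hypothetical token records: cat is the catcode, chr the character code.
local asletters = {
    { cat = 11, chr = string.byte("H") },
    { cat = 11, chr = string.byte("i") },
    { cat = 11, chr = string.byte("!") }, -- exclamation mark as letter
}

local asother = {
    { cat = 11, chr = string.byte("H") },
    { cat = 11, chr = string.byte("i") },
    { cat = 12, chr = string.byte("!") }, -- exclamation mark as other
}

-- Keep only the chr property of each token.
local function flatten(tokens)
    local result = { }
    for i = 1, #tokens do
        result[i] = string.char(tokens[i].chr)
    end
    return table.concat(result)
end

-- Both lists give the same string, so the catcodes cannot be recovered.
print(flatten(asletters) == flatten(asother))
\stoptyping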

A macro definition converts to tokens as follows.

\starttokens[demo]\def\Test#1#2{[#2][#1]} \Test{A}{B}\stoptokens \setups{ShowCollect}

As we already mentioned, a token has three properties. More details can be found
in the reference manual so we will not go into much detail here.

{\bf The original interceptor for tokens has been replaced by a more powerful
scanning mechanism. The following text is no longer applicable but kept as
historic reference. The new token scanner is discussed in later articles.}

% keep text formatted as it is now:

\starttyping[color=]

A most simple callback is:

\starttyping
callback.register('token_filter', token.get_next)
\stoptyping

In principle you can call \type {token.get_next} anytime you want
to intercept a token. In that case you can feed back tokens into
\TEX\ by using a trick like:

\starttyping
function tex.printlist(data)
   callback.register('token_filter', function ()
       callback.register('token_filter', nil)
       return data
    end)
end
\stoptyping

Another example of usage is:

\starttyping
callback.register('token_filter', function ()
    local t = token.get_next()
    local cmd, chr, id = t[1], t[2], t[3]
    -- do something with cmd, chr, id
    return { cmd, chr, id }
end)
\stoptyping

There is a whole repertoire of related functions, one is \type
{token.create}, which can be used as:

\starttyping
tex.printlist{
    token.create("hbox"),
    token.create(utf.byte("{"),  1),
    token.create(utf.byte("?"), 12),
    token.create(utf.byte("}"),  2),
}
\stoptyping

This results in: \ctxlua {
    tex.printlist{
        token.create("hbox"),
        token.create(utf.byte("{"),  1),
        token.create(utf.byte("?"), 12),
        token.create(utf.byte("}"),  2),
    }
}

While playing with this we made a few auxiliary functions that
permit things like:

\starttyping
tex.printlist ( table.unnest ( {
    tokens.hbox,
    tokens.bgroup,
    tokens.letters("12345"),
    tokens.egroup,
} ) )
\stoptyping

Unnesting is needed because the result of the \type {letters} call
is a table, and the \type {printlist} function wants a flattened
table.

The result looks like: \ctxlua {
    local t = table.unnest {
        tokens.hbox,
        tokens.bgroup,
        tokens.letters("12345"),
        tokens.egroup,
    }
    tex.printlist(t)
    tokens.collectors.show(t)
}

In practice, manipulating tokens or constructing lists of tokens
this way is rather cumbersome, but at least we now have some
kind of access, if only for illustrative purposes.

\starttyping
\hbox{12345\hbox{54321}}
\stoptyping

can also be done by saying:

\starttyping
tex.sprint("\\hbox{12345\\hbox{54321}}")
\stoptyping

or under \CONTEXT's basic catcode regime:

\starttyping
tex.sprint(tex.ctxcatcodes, "\\hbox{12345\\hbox{54321}}")
\stoptyping

If you like it the hard way:

\starttyping
tex.printlist ( table.unnest ( {
    tokens.hbox,
        tokens.bgroup,
            tokens.letters("12345"),
            tokens.hbox,
                tokens.bgroup,
                    tokens.letters(string.reverse("12345")),
                tokens.egroup,
        tokens.egroup
} ) )
\stoptyping

This method may attract those who dislike the traditional \TEX\
syntax for doing the same thing. Okay, a careful reader will
notice that reversing the string in \TEX\ takes a bit more
trickery, so \unknown

\stoptyping

% end of verbose text

{\bf The \type {tokens} etc.\ examples shown here make no sense anyway as we have
a more extensive interface to the macro language: \type {context}.}

\stopcomponent
