% language=us

\startcomponent mk-performance

\environment mk-environment

\chapter{How about performance}

\subject{remark}

The previous chapters already spent some words on performance
and memory usage. By the time that Taco and I were implementing,
discussing and testing the callbacks related to node lists, we were
already convinced that in all areas covered so far (file management,
handling input characters, dealing with fonts, conversion to tokens,
string and table manipulation, etc.) the \TEX||\LUA\ pair was up to
the task. And so we were quite confident that processing nodes was
not only an important aspect of \LUATEX\ but also quite feasible
in terms of performance (after all we needed it in order to deal
with advanced typesetting of Arabic). While Taco was dealing with the
\TEX\ side of the story, I was experimenting with possible
mechanisms at the \LUA\ end.

At the same time I got the opportunity to speed up the \METAPOST\
to \PDF\ converter, and both activities involved some timing. Here
I report some of the observations that we made in this process.

\subject{parsing}

Expressions in \LUA\ are powerful and definitely faster than regular
expressions found in other languages, but they have some limits. Most
noticeable is the lack of alternation. In \RUBY\ one can say:

\starttyping
str = "there is no gamma in here, just a beta"

if str =~ /(alph|bet|delt)a/ then
    print($1)
end
\stoptyping

but in \LUA\ you need a few more lines:

\starttyping
str = "there is no gamma in here, just a beta"

for _, v in ipairs({'alpha','beta','delta'}) do
    local s = str:match(v)
    if s then
        print(s)
        break
    end
end
\stoptyping

Interestingly, up to now I haven't really missed alternation, but
it may well be that the lack of it drove me to come up with
different solutions. For \CONTEXT\ \MKIV\ the \METAPOST\ to \PDF\
converter has been rewritten in \LUA. This is a prelude to direct
\LUA\ output from \METAPOST, but I needed the exercise. It was
among the first \LUA\ code in \MKIV.

Progressive (sequential) parsing of the data is an option, and this
is done in \MKII\ using pure \TEX. We collect words and compare them
to \POSTSCRIPT\ directives and act accordingly. The messy parts
are scanning the preamble, which has specials to be dealt with as
well as lots of unpredictable code to skip, and the \type
{fshow} command which adds text to a graphic. But really dirty are
the code fragments that deal with setting the line width and
penshapes, so cleaning those up takes some time.

In \LUA\ a different approach is taken. There is an \type {mp} table
which collects a lot of functions that more or less reflect the
output of \METAPOST. The functions take care of generating the right
\PDF\ code and also handle the transformations needed because of
the differences between \POSTSCRIPT\ and \PDF.

The sequential \POSTSCRIPT\ that comes from \METAPOST\ is
collected in one string and converted using \type {gsub} into a
sequence of \LUA\ function calls. Before this can be done, some
cleanup takes place. The resulting string is then executed as
\LUA\ code.

As an example:

\starttyping
1 0 0 2 0 0 curveto
\stoptyping

becomes

\starttyping
mp.curveto(1,0,0,2,0,0)
\stoptyping

which results in:

\starttyping
\pdfliteral{1 0 0 2 0 0 c}
\stoptyping

In between, the path is stored and transformed, which is needed in
the case of penshapes, where some \POSTSCRIPT\ feature is used that
is not available in \PDF.

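
The following is a minimal sketch of such a \type {gsub} based
rewrite. It is not the actual \MKIV\ code, just an illustration
that handles the hypothetical case of \type {curveto} alone:

\starttyping
mp = { }

function mp.curveto(...)
    -- the real converter would generate a \pdfliteral here
    print("c", ...)
end

local data = "1 0 0 2 0 0 curveto"

-- rewrite "<numbers> curveto" into "mp.curveto(<numbers>)"
data = data:gsub("([%d%.%- ]+)curveto", function(args)
    args = args:gsub("%s+$", "")
    args = args:gsub("%s+", ",")
    return "mp.curveto(" .. args .. ")"
end)

-- the rewritten string is executed as lua code
local chunk = assert((loadstring or load)(data))
chunk() -- prints: c 1 0 0 2 0 0
\stoptyping
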
During the development of \LUATEX\ a new feature was added to
\LUA: \type {lpeg}. With \type {lpeg} you can define text scanners. In
fact, you can build parsers for languages quite conveniently, so
without doubt we will see it show up all over \MKIV.

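
As a teaser, this is what the alternation example from the
beginning of this chapter could look like with \type {lpeg}. The
anywhere||search idiom in the grammar below is a common \type {lpeg}
pattern, not code taken from \MKIV:

\starttyping
local lpeg = require("lpeg")

local str = "there is no gamma in here, just a beta"

-- alternation: alpha, beta or delta, with the match captured
local word = lpeg.C((lpeg.P("alph") + "bet" + "delt") * "a")

-- either match the word here, or skip one byte and try again
local finder = lpeg.P { word + 1 * lpeg.V(1) }

print(finder:match(str)) -- beta
\stoptyping
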
Since I needed an exercise to get accustomed to \type {lpeg}, I
rewrote the mentioned converter. I'm sure that a better implementation
than mine is possible (after all, \POSTSCRIPT\ is a language) but
I went for a speedy solution. The following table shows some timings
(in seconds).

\starttabulate[|c|c|l|]
\HL
\NC \tt gsub \NC \tt lpeg \NC test \NC \NR
\HL
\NC  2.5     \NC 0.5      \NC 100 times test graphic \NC \NR
\NC  9.2     \NC 1.9      \NC 100 times big graphic  \NC \NR
\HL
\stoptabulate

The test graphic has about everything that \METAPOST\ can output,
including special tricks that deal with transparency and shading. The
big one is just four copies of the test graphic.

So, the \type {lpeg} based variant is about 5~times faster than the
original variant. I'm not saying that the original implementation is
that brilliant, but a fivefold improvement is rather nice, especially
when you consider that \type {lpeg} is still experimental and each
version performs better. The tests were done with \type {lpeg}
version 0.5, which performs slightly faster than its predecessor.

It's worth mentioning that the original \type {gsub} based variant
was already a bit improved compared to its first implementation.
There we collected the \TEX\ (\PDF) code in a table and passed it
in its concatenated form to \TEX. Because the \LUA\ to \TEX\
interface is by now quite efficient, we can just pass the
intermediate results directly to \TEX.

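
The difference can be sketched as follows. The function names are
hypothetical and the actual converter is more involved:

\starttyping
mp = mp or { }

-- first implementation: collect everything, flush once at the end
local result = { }

function mp.curveto_v1(a,b,c,d,e,f)
    result[#result+1] = string.format(
        "\\pdfliteral{%s %s %s %s %s %s c}",a,b,c,d,e,f)
end

-- flushing: tex.print(table.concat(result," "))

-- later implementation: pass each intermediate result directly
function mp.curveto_v2(a,b,c,d,e,f)
    tex.sprint(string.format(
        "\\pdfliteral{%s %s %s %s %s %s c}",a,b,c,d,e,f))
end
\stoptyping
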
\subject{file io}

The repertoire of functions that deal with individual characters
in \LUA\ is small. This does not bother us too much because the
individual character is not what \TEX\ is mostly dealing with. A
character or sequence of characters becomes a token (internally
represented by a table) and tokens result in nodes (again
tables, but larger). There are many more tokens involved than
nodes: in \CONTEXT\ a ratio of 200~tokens to 1~node is not
uncommon. A letter like \type {x} becomes a token, but the control
sequence \type {\command} also ends up as one token. Later on,
this \type {x} may become a character node, possibly surrounded
by some kerning. The input characters \type {width} result in
5~tokens, but may not end up as nodes at all, for instance when
they are part of a key|/|value pair in the argument to a command.

Just as there is no guaranteed one||to||one relationship between
input characters and tokens, there is no straight relation between
tokens and nodes. When dealing with input it is good to keep in mind
that because of these interpretation stages one can never say that
1~megabyte of input characters ends up as 1~million of something in
memory. Just think of how many megabytes of macros get stored in a
format file much smaller than the sum of bytes.

We only deal with characters or sequences of bytes when reading
from an input medium. There are many ways to deal with the input.
For instance one can process the input lines as \TEX\ sees them,
in which case \TEX\ takes care of the \UTF\ input. When we're
dealing with other input encodings we can hook code into the file
openers and readers and convert the raw data ourselves. We can for
instance read in a file as a whole, convert it using the normal
expression handlers or the byte(pair) iterators that \LUATEX\
provides, or we can go really low level using native \LUA\ code,
as in:

\starttyping
do
    -- return the next byte (a one-character string),
    -- or nil at the end of the file
    local function nextbyte(f)
        return f:read(1)
    end

    function io.bytes(f)
        return nextbyte, f
    end
end

local f = assert(io.open("somefile.dat", "rb"))
for b in io.bytes(f) do
    do_something(b)
end
f:close()
\stoptyping

Of course in practice one will need to integrate this into one of the
reader callbacks, but the principle stays the same. In case you wonder
if calling a function for each byte is fast enough \unknown\ it's more than
fast enough for normal purposes, especially if we keep in mind that other
tasks, like reading, preparing and dealing with fonts, or processing
token lists, take way more time. You can be sure that when half a second
of runtime is spent on reading a file, processing may take minutes. If one
wants to squeeze more performance out of this part, it's always an option
to write special libraries for that, but this is beyond standard \LUATEX.
We found out that the speed of loading data from files in \LUA\ is
mostly related to the small size of \LUA's file buffer. Reading data stored
in tables is extremely fast, and even faster when precompiled into bytecode.

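
The following sketch hints at what that last remark means: a small
table is serialized as \LUA\ source, compiled once, and dumped as
bytecode; loading the bytecode skips the parsing stage. In practice
one would write the bytecode to a file once and reuse it:

\starttyping
local data = { 1, 2, 3, 4, 5 }

-- serialize the table as lua source
local source = "return { " .. table.concat(data, ", ") .. " }"

-- compile once and dump the precompiled form
local chunk    = assert((loadstring or load)(source))
local bytecode = string.dump(chunk)

-- loading bytecode is faster than parsing source
local reloaded = assert((loadstring or load)(bytecode))()
print(reloaded[4]) -- 4
\stoptyping
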
\subject{tables}

When Taco and I were experimenting with the callbacks that intercept
tokens and nodes, we wondered what the impact would be on performance.
Although in \MKIV\ we allocate quite some memory due to font handling,
we were pretty sure that handling \TEX's internal lists could also
have an impact. Data related to fonts is not always subjected to
garbage collection, simply because it has to be available permanently.
List processing on the other hand involves a lot of temporarily allocated
tables. During a run a really huge number of tokens passes through the
machinery. When digested, they become nodes. For testing we normally use
this document (with the name \type {mk.tex}) and at the time of writing
this, it has some 48 pages.

This document is of moderate complexity, but not as complex as
the documents that I normally process; those have lots of graphics,
layers, structural elements, maybe a bit of \XML\ parsing, etc.
Nevertheless, we're talking of some 24~million tokens entering the engine
for 50 pages of text. In contrast, the number of nodes is small:
only 120~thousand, but the tables making up the nodes are more complex than
token tables (with three numbers per token). When all tokens are intercepted
and returned unchanged, on my machine the run becomes progressively slower and
memory usage grows from 75M to 112M. There is room for improvement there,
especially in the garbage collector.

Side note: quite a few of these tokens result from macro expansion. Also,
when a \type {\command} is used in the input, the callback passes it as one
token. A command stored in a format is already tokenized, but a command read
from the input is tokenized when read, so behind each reported token there
can be a few more input characters, but their number is negligible compared
to the tokens originating from the macro package.

The token callback is rather slow when used for a whole document. However,
this is typically a callback that will only be used in very special
situations and for a controlled number of tokens. The node callback on the
other hand can be set permanently. Fortunately the number of nodes is
relatively small. The overhead of a simple node handler that just counts
nodes is around 5\%, but most common manipulations of node lists don't
take much more time. For instance, experiments with adding kerns around
punctuation (a French speciality) hardly take time, resolving ligatures is
not really noticeable, and applying inter||character spacing to a whole document
is not that slow either. Actually, the last example is kind of special
because it more than doubles the size of the node lists. Inserting or removing
table elements is relatively slow when tables are large, but there are some
ways around this, one of which is sketched below.

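
One workaround, assuming the order of operations permits it, is to
avoid in||place insertion and removal altogether and to build a new
table in a single pass:

\starttyping
-- copy the elements that we want to keep into a new table,
-- instead of calling table.remove on a large table repeatedly
local function filtered(list, keep)
    local result, n = { }, 0
    for i=1,#list do
        local v = list[i]
        if keep(v) then
            n = n + 1
            result[n] = v
        end
    end
    return result
end
\stoptyping

For the record, a counting node handler as mentioned above can look
as follows. This sketch uses the node interface as it later
stabilized; the experimental table based interface used at the time
of writing differed:

\starttyping
local count = 0

callback.register("pre_linebreak_filter", function(head)
    for n in node.traverse(head) do
        count = count + 1
    end
    return true -- keep the list as it is
end)
\stoptyping
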
One of the reasons why whole||document token handling is slow is that
each token is a three||element table, so the garbage collector has to work
rather hard. The efficiency of this process is also platform dependent (or
maybe compiler specific). Manipulating the garbage collector parameters
does not improve performance, unless one makes the collector so lazy that
it is effectively disabled, at the cost of a lot of memory.

However, when we started dealing with nodes, I gave tuning the collector
another try and on the mentioned test document the following observations
were made when manipulating the step multiplier:

\starttabulate[|c|c|c|]
\HL
\NC \bf step \NC \bf runtime (s) \NC \bf memory \NC \NR
\HL
\NC 200 \NC 24.0 \NC 80.5M \NC \NR
\NC 175 \NC 21.0 \NC 78.2M \NC \NR
\NC 150 \NC 22.0 \NC 74.6M \NC \NR
\NC 160 \NC 22.0 \NC 74.6M \NC \NR
\NC 165 \NC 21.0 \NC 77.6M \NC \NR
\NC 125 \NC 21.5 \NC 89.2M \NC \NR
\NC 100 \NC 21.5 \NC 88.4M \NC \NR
\HL
\stoptabulate

As a result, I decided to set the \type {stepmul} variable to~165.

\starttyping
\ctxlua{collectgarbage("setstepmul", 165)}
\stoptyping

However, when we were testing the new \type {lpeg} based \METAPOST\
converter, we ran into problems. For table intensive operations,
temporarily disabling the garbage collector gave a significant boost
in speed. While testing performance we used the following loop:

\starttyping
\dorecurse {2000} {
    \setbox \scratchbox \hbox \bgroup
        \convertMPtoPDF{test-mps-procset.mps}{1}{1}
    \egroup
}
\stoptyping

In such a loop, turning the garbage collector on and off is disastrous. Because
no other \LUA\ calls happen between these calls, the garbage collector is never
invoked at all. As a result, memory grew from the baseline of 45M to 120M and
processing became progressively slower. I found out that restarting the collector
before each conversion kept memory usage low and the speed also remained okay.

\starttyping
\ctxlua{collectgarbage("restart")}
\stoptyping

Further experiments taught us that it makes sense to restart the collector at
each shipout and before table||intensive operations. On \type {mk.tex} this
results in a memory usage of 74M (at the end of the run) and a runtime of
21~seconds.

Concerning nodes and speed|/|allocation issues, we need to be aware of
the fact that this was still somewhat experimental; in the final version
of \LUATEX\ callbacks may occur at different places, and lists may be subjected
to parsing multiple times at different moments and locations (for instance when
we start dealing with attributes, an upcoming new feature).

Back to tokens. The reason why applying the callback to every token takes a
while has to do with the fact that each token goes through the associated
function. If you want to have an idea of what this means for 24~million
tokens, just run the following \LUA\ code:

\starttyping
for i=1,24 do
    print(i)
    for j=1,1000*1000 do
        local t = { 1, 2, 3 }
    end
end
print(os.clock())
\stoptyping

This takes some 60 seconds on my machine. The following code
runs about three times faster because the table does not have
to be allocated each time.

\starttyping
t = { 1, 2, 3 }
for i=1,24 do
    print(i)
    for j=1,1000*1000 do
        t[1]=4 t[2]=5 t[3]=6
    end
end
print(os.clock())
\stoptyping

Imagine this code to be interwoven with other code and \TEX\ doing
things with the tokens it gets back. The memory pool will be
scattered and garbage collecting will become more difficult.

However, in practice one will only apply token handling
to a marked piece of the input data. It is for this reason
that the callback is not:

\starttyping
callback.register('token_filter', function(t)
    return t
end )
\stoptyping

but instead

\starttyping
callback.register('token_filter', function()
    return token.get_next()
end )
\stoptyping

This gives the opportunity to fetch more than one token and
keep fetching until a criterion is met (for instance a sentinel).

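
A collector along the following lines could be used for that. This
is a minimal sketch: it assumes the \type {token.csname_name} helper
from the token library, and the sentinel name is purely hypothetical:

\starttyping
-- collect tokens until the (hypothetical) sentinel shows up
local function collectuntil(sentinel)
    local t = { }
    while true do
        local tok = token.get_next()
        if token.csname_name(tok) == sentinel then
            break
        end
        t[#t+1] = tok
    end
    return t
end

-- usage: local collected = collectuntil("stopcollect")
\stoptyping
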
Because \type {token.get_next} is not bound to the callback you
can fetch tokens anytime you want and only use the callback to
feed tokens back into \TEX. In \CONTEXT\ \MKIV\ there is some
support for collecting and flushing tokens present. Here is a
trivial example:

\starttyping
\def\SwapChars{\directlua 0 {
    do
        local t = { token.get_next(), token.get_next() }
        callback.register('token_filter', function()
            callback.register('token_filter', nil)
            return { t[2], t[1] }
        end )
    end
}}

\SwapChars HH \SwapChars TH
\stoptyping

Collecting tokens can take place inside the callback but also
outside. This also gives you the opportunity to collect them
in efficient ways and keep an eye on the memory demands.

Of course using \TEX\ directly takes less code:

\starttyping
\def\SwapChars#1#2{#2#1}
\stoptyping

The example shown here involves so few tokens that running
it takes no noticeable time. Here we show this definition in
tokenized form:

\starttokens[demo]\def\SwapChars#1#2{#2#1}\stoptokens \setups{ShowCollect}

\stopcomponent