% language=us

\startcomponent about-speed

\environment about-environment

\startchapter[title=Speed]

\startsection[title=Introduction]

In the \quote {mk} and \type {hybrid} progress reports I have spent some words
on speed. Why is speed so important?

In the early days of \CONTEXT\ I often had to process documents with thousands of
pages and hundreds of thousands of hyperlinks. You can imagine that this took a
while, especially when all kinds of ornaments had to be added to the page:
backgrounds, buttons with their own backgrounds and offsets, hyperlink colors
dependent on their state, etc. Given that multiple runs were needed, this could
mean that you'd leave the machine running all night in order to get the final
document.

It was the time when computers got twice the speed with each iteration of
hardware, so I suppose that it would run substantially faster on my current
laptop, an old Dell M90 workhorse. Of course a recently added SSD adds a boost
as well. But still, processing such documents on a machine with an 8 MHz 286
processor and 640 kilobytes of memory was close to impossible. But, when I
compare the speed of the Core Duo M90 with for instance an M4600 with an i5 \CPU\
running at the same clock speed as the M90, I see a factor of two improvement at
most. Of course going for an extremely clocked desktop will be much faster, but
we're no longer seeing a tenfold speedup every few years. On the contrary: we see
a shift to multiple cores, often running at a lower clock speed, with the
assumption that threaded applications are used. This scales perfectly for web
services and graphic manipulations but not so much for \TEX. If we want to go
faster, we need to see where we can be more efficient within more or less frozen
clock speeds.

Of course there are some developments that help us. First of all, for programs
like \TEX\ clever caching of files by the operating system helps a lot. Memory
keeps getting faster and \CPU\ caches become larger too. For large documents with
lots of resources an SSD works out great. As \LUA\ uses floating point, speedups
in that area also help with \LUATEX. We use virtual machines for \TEX\ related
services and for some reason that works out quite well, as the underlying
operating system does lots of housekeeping in parallel. But, with all that maxed
out, we finally end up at the software itself, and in \TEX\ this boils down to a
core of compiled code along with lots of macro expansions and interpreted \LUA\
code.

In the end, the question remains what causes excessive runtimes. Is it the nature
of the \TEX\ expansion engine? Is it bad macro writing? Is there too much
overhead? If you notice how fast processing the \TEX\ book goes on modern
hardware it is clear that the core engine is not the problem. It's no big deal to
get 100 pages per second on documents that use a relatively simple page builder
and have macros that lack a flexible user interface.

Take the following example:

\starttyping
\starttext
\dorecurse{1000}{test\page}
\stoptext
\stoptyping

We do nothing special here. We use the default Latin Modern fonts and process
single words. No burden is put on the pagebuilder either. This way we get, on a
2.33 GHz T7600 \CPU, a performance of 185 pages per second. \footnote {In this
case the mingw version was used. A version using the native \WINDOWS\ compiler
runs somewhat faster, although this depends on the compiler options. \footnote
{We've noticed that sometimes the mingw binaries are faster than native binaries,
but sometimes they're slower.} With \LUAJITTEX\ the 185 pages per second becomes
195 on a 1000 page document.} The estimated \LUA\ overhead in this 1000 page
document is some 1.5 to 2 seconds. The following table shows the performance on
such a test document with different numbers of pages, in pps (reported pages per
second).

\starttabulate[|r|r|]
\HL
\NC \bf \# pages \NC \bf pps \NC \NR
\HL
\NC     1 \NC   2 \NC \NR
\NC    10 \NC  15 \NC \NR
\NC   100 \NC  90 \NC \NR
\NC  1000 \NC 185 \NC \NR
\NC 10000 \NC 215 \NC \NR
\HL
\stoptabulate

The startup time, measured on a zero page document, is 0.5 seconds. This includes
loading the format, loading the embedded \LUA\ scripts and initializing them,
initializing and loading the file database, locating and loading some runtime
files and loading the absolute minimum number of fonts: a regular and a math
Latin Modern. A few years before this writing that was more than a second, and
the gain is due to a slightly faster \LUA\ interpreter as well as improvements in
\CONTEXT.

So why does this matter at all, if on a larger document the startup time can be
neglected? It does because when I have to implement a style for a project or am
developing some functionality, a fast edit||run||preview cycle is a must, if only
because even a wait of a few seconds feels uncomfortable. On the other hand, when
I process a manual of say 150 pages, which uses some tricks to explain matters, I
don't care if the processing rate is between 5 and 15 pages per second, simply
because you get (done) what you asked for. It mostly has to do with feeling
comfortable.

There is one thing to keep in mind: such measurements can vary over time, as they
depend on several factors. Even in the trivial case we need to:

\startitemize[packed]
\startitem
    load macros and \LUA\ code
\stopitem
\startitem
    load additional files
\stopitem
\startitem
    initialize the system, think of fonts and languages
\stopitem
\startitem
    package the pages, which includes reverting to global document states
\stopitem
\startitem
    create the final output stream (\PDF)
\stopitem
\stopitemize

The simple one word per page test is not that slow, and normally for 1000 pages we
measure around 200 pps. However, due to some small speedups (that somehow add up),
in three months' time I could gain a lot:

\starttabulate[|r|r|r|r|]
\HL
\NC \bf \# pages \NC \bf January \NC \bf April \NC \bf May\rlap{\quad(2013)} \NC \NR
\HL
\NC     1 \NC   2 \NC   2 \NC   2 \NC \NR
\NC    10 \NC  15 \NC  17 \NC  17 \NC \NR
\NC   100 \NC  90 \NC 109 \NC 110 \NC \NR
\NC  1000 \NC 185 \NC 234 \NC 259 \NC \NR
\NC 10000 \NC 215 \NC 258 \NC 289 \NC \NR
\HL
\stoptabulate

Among the improvements in April were a faster output to the console (first
prototyped in \LUA, later done in the \LUATEX\ engine itself), and a couple of
low level \LUA\ optimizations. In May a dirty (maybe too tricky) global document
state restore trick was introduced. Although these changes give a nice speed
bump, they will mostly go unnoticed in more realistic documents. There we are
happy if we end up in the 20 pps range. So, in practice a more than 10 percent
speedup between January and April is just a dream. \footnote {If you wonder why I
still bother with such things: sometimes speedups are just a side effect of
trying to accomplish something else, like less verbose output in full tracing
mode.}

There are many cases where it does matter to squeeze out every second possible.
We run workflows where some six documents are generated from one source. If we
forget about the initial overhead of fetching the source from a remote server
\footnote {In the user interface we report the time it takes to fetch the source
so that the typesetter can't be blamed for delays.} gaining half a second per
document (if we start fresh, each needs two runs at least) means that the user
will see the first result one second faster and have them all six seconds sooner
than before. In that case it makes sense to identify bottlenecks in the more high
level mechanisms.

And this is why during the development of \CONTEXT\ and the transition from
\MKII\ to \MKIV\ quite some time has been spent on avoiding bottlenecks. And, at
this point we can safely conclude that, in spite of more advanced functionality,
the current version of \MKIV\ runs faster than the \MKII\ versions in most cases,
especially if you take the additional functionality into account (like \UNICODE\
input and fonts).

\stopsection

\startsection[title=The \TEX\ engine]

Writing inefficient macros is not that hard. If they are used only a few times,
for instance when setting up properties, it plays no role. But if they're expanded
many times it may make a difference. Because use and development of \CONTEXT\
went hand in hand we always made sure that the overhead was kept at a minimum.

\startsubject[title=The parbuilder]

There are a couple of places where document processing in a traditional \TEX\
engine gets a performance hit. Let's start with the parbuilder. Although the
paragraph builder is quite fast, it can be responsible for a decent amount of
runtime. It is also a fact that the parbuilders of the engines derived from the
original \TEX\ are more complex. For instance, \OMEGA\ adds bidirectionality to
the picture, which involves some extra checking as well as more nodes in the list.
The \PDFTEX\ engine provides protrusion and expansion, and as that feature was
primarily a topic of research it was never optimized.

In \LUATEX\ the parbuilder is a mixture of the \PDFTEX\ and \OMEGA\ builders,
adapted to the fact that hyphenation, ligature building, kerning and breaking a
paragraph into lines have been split into separate stages. The protrusion and
expansion code is still there, but for a few years already I have had alternative
code for \LUATEX\ that simplifies the implementation and could in principle give
a speed boost as well; till now, however, we never found time to adapt the
engine. Take the following test code:

\ifdefined\tufte \else \let\tufte\relax \fi

\starttyping
\testfeatureonce{100}{\setbox0\hbox{\tufte \par}} \tufte \par
\stoptyping

In \MKIV\ we use \LUA\ for doing fonts, so when we measure this bit we get the
time used for typesetting our \type {\tufte} quote without breaking it into
lines. A normal \LUATEX\ run needs 0.80 seconds and a \LUAJITTEX\ run takes 0.47
seconds. \footnote {All measurements are on a Dell M90 laptop running Windows 8.
I keep using this machine because it has a decent high res 4:3 screen. It's the
same machine Luigi Scarso and I used when experimenting with \LUAJITTEX.}

\starttyping
\testfeatureonce{100}{\setbox0\vbox{\tufte \par}} \tufte \par
\stoptyping

In this case \LUATEX\ needs 0.80 seconds and \LUAJITTEX\ needs 0.50 seconds, and
as we now break the list into lines, we can deduce that close to zero seconds are
needed to break 100 samples. This (often used) sample text has the interesting
property that it has many hyphenation points and always gives multiple hyphenated
lines. So the parbuilder, if no protrusion and expansion are used, is really fast!

\starttyping
\startparbuilder[basic]
  \testfeatureonce{100}{\setbox0\vbox{\tufte \par}} \tufte \par
\stopparbuilder
\stoptyping

Here we kick in our \LUA\ version of the parbuilder. This takes 1.50 seconds for
\LUATEX\ and 0.90 seconds for \LUAJITTEX. So, \LUATEX\ needs 0.70 seconds to
break the quote into lines while \LUAJITTEX\ needs 0.43. If we stick to stock
\LUATEX, this means that a medium complex paragraph needs 0.007 seconds of \LUA\
time, and that is not a time to be worried about. Of course these numbers are not
that accurate, but the measurements are consistent over multiple runs for a
specific combination of \LUATEX\ and \MKIV. On a more modern machine it's
probably also close to zero.

These measurements demonstrate that we should add some nuance to the assumption
that parbuilding takes time. For this we need to distinguish between traditional
\TEX\ and \LUATEX. In traditional \TEX\ you build a horizontal box or a vertical
box. In \TEX\ speak these are called horizontal and vertical lists. The main text
flow is a special case and called the main vertical list, but in this perspective
you can consider it to be like a vertical box.

Each vertical box is split into lines. These lines are packed into horizontal
boxes. In traditional \TEX\ constructing a list starts with turning references to
characters into glyphs and ligatures. Kerns get inserted between characters if
the font requests that. When a vertical box is split into lines, discretionary
nodes get inserted (hyphenation) and when font expansion or protrusion is enabled
extra fonts with expanded dimensions get added.

So, in the case of a vertical box, building the paragraph is not really
distinguished from ligaturing, kerning and hyphenation, which means that the
timing of this process is somewhat fuzzy. Also, after the lines are identified,
some final packing of the lines happens and the result gets added to a vertical
list.

In \LUATEX\ all these stages are split into hyphenation, ligature building,
kerning, line breaking and finalizing. When the callbacks are not enabled the
normal machinery kicks in, but still the stages are clearly separated. In the
case of \CONTEXT\ the font ligaturing and kerning get preceded by so called node
mode font handling. This means that we have extra steps, and there can be even
more steps before and afterwards. And, hyphenation always happens on the whole
list, contrary to traditional \TEX, which interweaves this. Keep in mind that,
because we can box and unbox and in that process add extra text, the whole
process can get repeated several times for the same list. Of course already
treated glyphs and kerns are normally kept as they are.
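
To give an impression of that separation, here is a minimal sketch in plain
\LUATEX\ terms (not the \MKIV\ code, which manages its callbacks itself); the
function name \type {prepare} is made up for the example:

\starttyping
-- the stages that a traditional engine interweaves are separately
-- callable in LuaTeX (a sketch, not the MkIV implementation)

local function prepare(head)
    lang.hyphenate(head)          -- inject discretionary nodes
    head = node.ligaturing(head)  -- build ligatures (base mode)
    head = node.kerning(head)     -- add font kerns (base mode)
    return head                   -- line breaking happens afterwards
end

callback.register("pre_linebreak_filter", prepare)
\stoptyping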

So, because in \LUATEX\ the process of splitting into lines is separated, we can
safely conclude that it is really fast, definitely compared to all the font
related steps. So, let's go back to the tests and do the following:

\starttyping
\testfeatureonce{1000}{\setbox0\hbox{\tufte}}

\testfeatureonce{1000}{\setbox0\vbox{\tufte}}

\startparbuilder[basic]
    \testfeatureonce{1000}{\setbox0\vbox{\tufte}}
\stopparbuilder
\stoptyping

We've put the text into a macro so that we don't have interference from reading
files. The test wrapper does the timing. The following measurements are somewhat
rough, but repetition gives similar results. \footnote {Before and between runs
we do a garbage collection.}

\starttabulate[|c|c|c|c|c|]
\HL
\NC   \NC \bf engine \NC \bf method \NC \bf normal \NC \bf hz \NC \NR % comment
\HL
\NC 1 \NC luatex     \NC tex hbox   \NC ~9.64      \NC ~9.64  \NC \NR % baseline font feature processing, hyphenation etc: 9.74
\NC 2 \NC            \NC tex vbox   \NC ~9.84      \NC 10.16  \NC \NR % 0.20 linebreak / 0.52 with hz -> 0.32 hz overhead (150pct more)
\NC 3 \NC            \NC lua vbox   \NC 17.28      \NC 18.43  \NC \NR % 7.64 linebreak / 8.79 with hz -> 1.33 hz overhead ( 20pct more)
\HL
\NC 4 \NC luajittex  \NC tex hbox   \NC ~6.33      \NC ~6.33  \NC \NR % baseline font feature processing, hyphenation etc: 6.33
\NC 5 \NC            \NC tex vbox   \NC ~6.53      \NC ~6.81  \NC \NR % 0.20 linebreak / 0.48 with hz -> 0.28 hz overhead (expected 0.32)
\NC 6 \NC            \NC lua vbox   \NC 11.06      \NC 11.81  \NC \NR % 4.53 linebreak / 5.28 with hz -> 0.75 hz overhead
\HL
\stoptabulate

In line~1 we see the baseline: hyphenation, processing fonts and hpacking takes
9.64 seconds. In the second line we see that breaking the 1000 paragraphs costs
some 0.20 seconds, and when expansion is enabled an extra 0.32 seconds is needed.
This means that expansion takes about 150\% more runtime. If we delegate the task
to \LUA\ we need 7.64 seconds for breaking into lines, which cannot be neglected
but is still okay given the fact that we break 1000 paragraphs. But it is
interesting to see that our alternative expansion routine only adds 1.33 seconds,
which is less than 20\%. It must be said that the built|-|in method is not that
efficient by design, if only because it started out differently, as part of
research.

When measured three months later, the numbers for regular \LUATEX\ (at that time
version 0.77) with the latest \CONTEXT\ were: 8.52, 8.72 and 15.40 seconds for
the normal run, which demonstrates that we should not draw too many conclusions
from such measurements. It's the overall picture that matters.

As with earlier timings, if we use \LUAJITTEX\ we see that the runtime of \LUA\
is much lower (due to the faster virtual machine). Of course we're still 20 times
slower than the built|-|in method, but only 10 times slower when we use
expansion. To put these numbers in perspective: 5 seconds for 1000 paragraphs.

\starttyping
\setupbodyfont[dejavu]

\starttext
  \dontcomplain \dorecurse{1000}{\tufte\par}
\stoptext
\stoptyping

This results in 295 pages in the default layout and takes 17.8 seconds, or 16.6
pages per second. Expansion is not enabled.

\starttyping
\starttext
\startparbuilder[basic]
    \dontcomplain \dorecurse{1000}{\tufte\par}
\stopparbuilder
\stoptext
\stoptyping

That one takes 24.7 seconds and runs at 11.9 pages per second. This is indeed
slower, but on a somewhat more modern machine I expect better results. We should
also realize that with Dejavu being a relatively large font, a difficult
paragraph like the Tufte example gives overfull boxes, which in turn is an
indication that quite some alternative breaks are tried.

When typeset with Latin Modern we don't get overfull boxes, and it is interesting
that the native method needs less time (15.9 seconds or 14.1 pages per second)
while the \LUA\ variant also runs a bit faster: 23.4 seconds or 9.5 pages per
second. The number of pages is 223 because this font is smaller by design.

When we disable hyphenation the Dejavu variant takes 16.5 (instead of 17.8)
seconds and the \LUA\ variant 23.1 (instead of 24.7) seconds, so this process is
not that demanding.

For typesetting so many paragraphs without anything special it makes no sense to
bother with a \LUA\ based parbuilder. I must admit that I never had to typeset
novels, so all my 300 page runs take much longer. Anyway, when at some point we
introduce alternative parbuilding to \CONTEXT, the speed penalty is probably
acceptable.

Just to indicate that predictions are fuzzy: when we put a \type {\blank} between
the paragraphs we end up with 313 pages, and the traditional method takes 18.3
seconds while \LUA\ needs 23.6 seconds. One reason for this is that the whitespace
is also handled by \LUA, and in the pagebuilder we do some finalizing, so we
suddenly get interference from other processes (as well as the garbage collector).
Again an indication that we should not bother too much about speed. I try to make
sure that the \LUA\ (as well as \TEX) code is reasonably efficient, so in
practice it's the document style that is a more important factor than the
parbuilder, be it the traditional one or the \LUA\ variant.

\stopsubject

\startsubject[title=Copying boxes]

As soon as you start enhancing the page with headers and footers and backgrounds
in \CONTEXT, you will see that the pps rate drops. This is partly due to the fact
that suddenly quite some macro expansion takes place in order to check what needs
to happen (like font and color switches, offsets, overlays, etc). But what has
more impact is that we might end up with copying boxes, and that takes time. Also,
by wrapping and repackaging boxes, we add additional levels of recursion in the
postprocessing code.
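
The following trivial sketch (not taken from the page builder) shows where the
time goes: reusing a box by copying duplicates its whole node list, while moving
it is cheap but leaves the register void.

\starttyping
\setbox\scratchbox\hbox{some page ornament}

\copy\scratchbox % duplicates the node list: reusable, but costs time
\box\scratchbox  % just moves the list: cheap, but the box is now void
\stoptyping

Ornaments and backgrounds that show up on every page necessarily follow the
first pattern.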

\stopsubject

\startsubject[title=Macro expansion]

Taco and I once calculated that \MKII\ spends some 4\% of its time in accessing
the hash table. This is a clear indication that quite some macro expansion goes
on. Due to the fact that when I rewrote \MKII\ into \MKIV\ I no longer had to
take memory and other limitations into account, the codebase looks quite
different. There we do have more expansion in the mechanism that deals with
settings, but the bodies of macros are much smaller and fewer parameters are
passed. So, the overall performance is better.

\stopsubject

\startsubject[title=Fonts]

Using a font has several aspects. First you have to define an instance. Then, when
you use it for the first time, the font gets loaded from storage, initialized and
passed to \TEX. All these steps are quite optimized. If we process the following
file:

\starttyping
\setupbodyfont[dejavu]

\starttext
    regular, {\it italic}, {\bf bold ({\bi italic})} and $m^a_th$
\stoptext
\stoptyping

we get reported:

\starttabulate[||T|]
\NC \type{loaded fonts}    \NC xits-math.otf xits-mathbold.otf \NC \NR
\NC                        \NC dejavuserif-bold.ttf dejavuserif-bolditalic.ttf \NC \NR
\NC                        \NC dejavuserif-italic.ttf dejavuserif.ttf \NC \NR
\NC \type{fonts load time} \NC 0.374 seconds \NC \NR
\NC \type{runtime}         \NC 1.014 seconds, 0.986 pages/second \NC \NR
\stoptabulate

So, six fonts are loaded, and because XITS is used we also preload the math bold
variant. Loading of text fonts is delayed, but in order to initialize math we
need to preload the math fonts.

If we don't define a bodyfont, a default set gets loaded: Latin Modern. In that
case we get:

\starttabulate[||T|]
\NC \type{loaded fonts}    \NC latinmodern-math.otf \NC \NR
\NC                        \NC lmroman10-bolditalic.otf lmroman12-bold.otf \NC \NR
\NC                        \NC lmroman12-italic.otf lmroman12-regular.otf \NC \NR
\NC \type{fonts load time} \NC 0.265 seconds \NC \NR
\NC \type{runtime}         \NC 0.874 seconds, 1.144 pages/second \NC \NR
\stoptabulate

Before we had native \OPENTYPE\ Latin Modern math fonts, it took slightly longer
because we had to load many small \TYPEONE\ fonts and assemble a virtual math font.

As soon as you start mixing more fonts and/or load additional weights and styles
you will see these times increase. But if you use an already loaded font with a
different feature set or scaled differently, the burden is rather low. It is safe
to say that at this moment loading fonts is not a bottleneck.

Applying fonts can be more demanding. For instance if you typeset Arabic or
Devanagari, the amount of node and font juggling definitely influences the total
runtime. As the code is rather optimized there is not much we can do about it.
It's the price that comes with flexibility. As far as I can tell, getting the
same results with \PDFTEX\ (if possible at all) or \XETEX\ does not take less
time. If you've split up your document into separate files you will seldom run
more than a dozen pages at a time, which is still bearable.

If you are for instance typesetting a dictionary|-|like document, it does not make
sense to do all font switches by switching body fonts. Just defining a couple of
font instances makes more sense and comes at no cost. As this mechanism is
already quite efficient given the complexity involved, you should not expect
impressive speedups in this area.
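
A minimal sketch of that approach (the names \type {DictEntry} and \type
{DictMeaning} are made up for the example); a defined font instance is resolved
once, after which using it is just a macro call:

\starttyping
\definefont [DictEntry]   [SerifBold*default at 9pt]
\definefont [DictMeaning] [Serif*default at 9pt]

\starttext
    {\DictEntry speed}  {\DictMeaning rate of motion; swiftness} \par
    {\DictEntry speedy} {\DictMeaning fast, quick, rapid}        \par
\stoptext
\stoptyping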

\stopsubject

\startsubject[title=Manipulations]

The main manipulation that I have to do is to process \XML\ into something
readable. Using the built||in parser and mapper already has some advantages,
and if applied in the right way it's also rather efficient. The more you restrict
your queries, the better.

Text manipulations using \LUA\ are often quite fast and seldom the reason for
slow processing. You can do lots of things at the \LUA\ end and still have all
the \CONTEXT\ magic by using the \type {context} namespace and function.
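
As an illustration, a small sketch of that interface (nothing more than an
example): the \type {context} function pushes text into the \TEX\ input, and
entries in its namespace correspond to commands.

\starttyping
\startluacode
local words = { "alpha", "beta", "gamma" }
for i, word in ipairs(words) do
    context.bold(word)            -- expands to \bold{...} at the TeX end
    context(" is entry %s. ", i)  -- formatted text into the TeX stream
end
\stopluacode
\stoptyping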

\stopsubject

\startsubject[title=Multipass]

You can try to save 1 second on a 20 second run, but that is not that impressive
if you need to process the document three times in order to get your cross
references right. Okay, you'd save 3 seconds, but to get a result you still need
some 60 seconds (unless you have already run the document before). If you have a
predictable workflow you might know in advance that you only need two runs, in
which case you can enforce that with \type {--runs=2}. Furthermore you can try to
optimize the style by getting rid of redundant settings and inefficient font
switches. But no matter what we optimize, unless we have a document with no cross
references, sectioning and positioning, you often end up with the extra run,
although \CONTEXT\ tries to minimize the number of runs needed.
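
For the record, on the command line that looks as follows, assuming the \type
{context} runner script is used:

\starttyping
context --runs=2 myproject.tex
\stoptyping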

\stopsubject

\startsubject[title=Trial runs]

Some mechanisms, like extreme tables, need multiple passes, and all but the last
one are tagged as trial runs. Because in many cases only dimensions matter, we
can disable some time consuming code in such a case. For instance, at some point
Alan Braslau and I found out that the new chemical manual ran really slow, mainly
due to the tens of thousands of \METAPOST\ graphics. Adding support for trial
runs to the chemical structure macros gave a fourfold improvement. The manual is
still a slow|-|runner, but that is simply because it has so many runtime
generated graphics.
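
The principle can be sketched as follows (a simplified illustration, not the
actual chemistry code): during a trial pass only dimensions matter, so an
expensive graphic can be replaced by a cheap placeholder of (roughly) the same
size.

\starttyping
\def\MyStructureGraphic
  {\iftrialtypesetting
     % trial pass: only the size is needed, so use a cheap rule
     \blackrule[width=30mm,height=20mm,color=white]
   \else
     % final pass: produce the real (expensive) graphic
     \startMPcode
         draw fullcircle xyscaled (30mm,20mm) withpen pencircle scaled 1mm ;
     \stopMPcode
   \fi}
\stoptyping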

\stopsubject

\stopsection

\startsection[title=The \METAPOST\ library]

When the \METAPOST\ library got included we saw a drastic speedup in processing
documents with lots of graphics. However, when \METAPOST\ got different number
systems (native, double and decimal) the changed memory model immediately led to
a slowdown. On one 150 page manual with a graphic on each page I saw the
\METAPOST\ runtime go up from about half a second to more than 5 seconds. In
this case I was able to rewrite some core \METAFUN\ macros to better suit the new
model, but you might not be so lucky. So more careful coding is needed. Of course,
if you only have a few graphics, you can just ignore the change.

\stopsection

\startsection[title=The \LUA\ interpreter]

Where the \TEX\ part of \LUATEX\ is compiled, the \LUA\ code gets interpreted,
converted into bytecode, and run by the virtual machine. \LUA\ is by design quite
portable, which means that the virtual machine is not optimized for a specific
target. The \LUAJIT\ interpreter on the other hand is written in assembler and
available for only some platforms, but its virtual machine is about twice as
fast. The just||in||time part of \LUAJIT\ is not of much help and can even slow
down processing.

When we moved from \LUA~5.1 to 5.2 we found out that there was some speedup, but
it's hard to say why. There have been changes in the way strings are dealt with
(\LUA\ hashes strings) and we use lots of strings, really lots. There have been
changes in the garbage collector, and during a run lots of garbage needs to be
collected. There are some fundamental changes in so called environments, and who
knows what impact that has.

If you ever tried to measure the performance of \LUA, you probably have noticed
that it is quite fast. This means that it makes no sense to optimize code that
gets visited only occasionally. But some of the \CONTEXT\ code gets exercised a
lot, for instance all code that deals with fonts. We use attributes a lot, and
checking them is, for good reason, not the fastest code. But given the often
advanced functionality that it makes possible, we're willing to pay the price.
It's also functionality that you seldom need all at the same time, and for
straightforward text only documents all that code is never executed.

When writing \TEX\ or \LUA\ code I spend a lot of time making it as efficient as
possible in terms of performance and memory usage. The sole reason for that is
that we happen to process documents where a lot of functionality is combined, so
if many small speed||ups accumulate to a noticeable performance gain it's worth
the effort.

So, where does \LUA\ influence runtime? First of all we use \LUA\ to deal with all
in- and output as well as with locating files in the \TEX\ directory structure.
Because that code is partly shared with the script manager (\type {mtxrun}) it is
optimized, but some more is possible if needed. It is already not the easiest
code to read, so I don't want to introduce even more obscurity.

Quite some code deals with loading, preparing and caching fonts. That code is
mostly optimized for memory usage, although speed is also okay. This code is only
called when a font is loaded for the first time (after an update). After that,
loading is a matter of milliseconds. When a text gets typeset and fonts are
processed in so called node mode, depending on the script and|/|or enabled
features, a substantial amount of time is spent in \LUA. There is still some
complexity in dealing with inserting kerns, but a future \LUATEX\ will carry
kerning in the glyph node, so there we can gain some runtime.

If a page has 4000 characters and if font features as well as other manipulations
demand 10 runs over the text, we have 40.000 checks of nodes and potential
actions. Each involves an id check, maybe a subtype check, maybe some attribute
checking and possibly some action. So, if we have 200.000 (or more) function
calls per page at the \TEX\ end it might add up to a lot. Around the time that we
went to \LUA~5.2 and played with \LUAJITTEX, the node accessors were sped up.
This indeed gave a measurable speedup, but not on an average document, only on
the more extreme documents or features. Because the \MKIV\ \LUA\ code goes from
experimental to production to final, some improvements are made in the process,
but there is not much to gain there. We just have to wait till computers get
faster, \CPU\ caches get bigger, branch prediction improves, floating point
calculations take less time, memory gets speedier, and flash storage is the
standard.
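
To make that concrete, here is a minimal sketch (not the \MKIV\ code) of what
one such run over a node list looks like at the \LUA\ end; the attribute number
\type {12345} is made up:

\starttyping
local glyph_id = node.id("glyph")

local function handler(head)
    -- one of the many runs: an id check, an attribute check and
    -- possibly some action per node
    for n in node.traverse_id(glyph_id, head) do
        local a = node.has_attribute(n, 12345) -- made-up attribute number
        if a then
            -- apply whatever the attribute asks for
        end
    end
    return head
end
\stoptyping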

The \LUA\ code is plugged into the \TEX\ machinery via callbacks. For
instance each time a box is built several callbacks are triggered, even if it's
an empty box or just an extra wrapper. Take for instance this:

\starttyping
\hbox \bgroup
    \hskip \zeropoint
    \hbox \bgroup
        test
    \egroup
    \hskip \zeropoint
\egroup
\stoptyping

Of course you won't come up with this code yourself as it doesn't do much good,
but macros that you use can definitely produce it. For instance, the two zero
skips can be a left and a right margin that just happen to be zero. For 10.000
iterations I measured 0.78 seconds, while the next one takes 0.62 seconds:

\starttyping
\hbox \bgroup
    \hbox \bgroup
        test
    \egroup
\egroup
\stoptyping

Why is this? One reason is that a zero skip still results in a node, and the more
nodes we have, the more memory (de)allocation takes place and the more nodes in
the list need to be checked. Of course the relative difference is less when we
have more text. So how can we improve this? The following variant, at the cost of
some testing, takes just as much time as the short one.

\starttyping
\hbox \bgroup
    \hbox \bgroup
        \scratchdimen\zeropoint
        \ifdim\scratchdimen=\zeropoint\else\hskip\scratchdimen\fi
        test
        \ifdim\scratchdimen=\zeropoint\else\hskip\scratchdimen\fi
    \egroup
\egroup
\stoptyping

As does this one, but the longer the text, the slower it gets as one of the two
copies needs to be skipped.

\starttyping
\hbox \bgroup
    \hbox \bgroup
        \scratchdimen\zeropoint
        \ifdim\scratchdimen=\zeropoint
            test%
        \else
            \hskip\scratchdimen
            test%
            \hskip\scratchdimen
        \fi
    \egroup
\egroup
\stoptyping

Of course most speedup is gained when we don't package at all, that is, when we
test before we package, but such an optimization is seldom realistic because much
more goes on and we cannot check for everything. Also, 10.000 is a lot, while 0.10
seconds is something we can live with. By the way, compare the following:

\starttyping
\hbox \bgroup
    \hskip\zeropoint
    test%
    \hskip\zeropoint
\egroup

\hbox \bgroup
    \kern\zeropoint
    test%
    \kern\zeropoint
\egroup
\stoptyping

The first variant is less efficient than the second one, because a skip
effectively is a glue node pointing to a specification node, while a kern is just
a simple node with the width stored in it. \footnote {On the \LUATEX\ agenda is
moving the glue spec into the glue node.} I must admit that I seldom keep in mind
to use kerns instead of skips when possible, if only because one needs to be sure
to be in the right mode, horizontal or vertical, so additional commands might be
needed.

\stopsection

\startsection[title=Macros]

Are macros a bottleneck? In practice not really. Of course we have optimized the
core \CONTEXT\ macros pretty well, but one reason for that is that we have a
rather extensive set of configuration and definition mechanisms that rely heavily
on inheritance. Where possible all that code is written in a way that macro
expansion won't hurt too much. Because of this users themselves can be more
liberal in their coding. There is a lot going on deep down, and if you turn on
macro tracing you can get horrified. But not all shown code paths are entered.
During the move (and rewrite) from \MKII\ to \MKIV\ quite some bottlenecks that
resulted from limitations of machines and memory have been removed, and as a
result the macro expansion part is somewhat faster, which nicely compensates for
the fact that we have a more advanced but slower inheritance subsystem.
Readability of code and speed are probably nicely balanced by now.

Once a macro is read in, its internal representation is pretty efficient. For
instance, references to macro names are just pointers into a hash table. Of
course, when a macro is seen in your source, that name has to be looked up, but
that's a fast action. Using short names in the running text therefore doesn't
really speed up processing much. Switching font sets on the other hand does take
time, as quite some checking happens and the related macros are pretty extensive.
However, once a font is loaded, references to it are pretty fast. Just keep in
mind that if you define something inside a group, in most cases it gets forgotten
at the end of the group. So, if you need something more often, just define it at
the outer level.
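
A trivial sketch of that last remark (\type {\MySnippet} is a made-up name):

\starttyping
\bgroup
    \def\MySnippet{whatever}% local: forgotten as soon as the group ends
\egroup

\def\MySnippet{whatever}% outer level: defined once, stays available
\stoptyping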

\stopsection

\startsection[title=Optimizing code]

Optimizing code only makes sense if it is used very often and called frequently,
or when the problem to solve is demanding. An example of code that gets exercised
often is page building, where we pack together many layout elements. Font
switches can also be time consuming, if defined wrongly. These happen for
instance for formulas, marked words, cross references, margin notes, footnotes
(often a complete bodyfont switch), table cells, etc. Yet another case is the
clever vertical spacing that happens between structural elements. All these
mechanisms are reasonably optimized.

I can safely say that deep down \CONTEXT\ is not that inefficient, given what it
has to do. But when a style for instance does redundant or unnecessary massive
font switches, you are wasting runtime. I dare to say that instead of trying to
speed up code (for instance by redefining macros) you can better spend the time
on making styles efficient. For instance, having 10 \type {\blank}'s in a row
will work out rather well but takes time. If you know that a section head has no
raised or lowered text and no math, you can consider using \type {\definefont} to
define the right size (especially if it is a special size) instead of defining
an extra bodyfont size and switching to that, as a bodyfont switch includes
setting up related sizes and math.
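
A minimal sketch of that advice (the name \type {SectionFont} and the size are
made up); the first setup triggers a complete bodyfont switch for every head,
the second one only switches to a single predefined font:

\starttyping
% expensive: sets up serif, sans, mono and math in several related sizes
\setuphead[section][style={\switchtobodyfont[14.4pt]\bf}]

% cheaper when the head contains no math and no size-relative text
\definefont[SectionFont][SerifBold*default at 14.4pt]
\setuphead[section][style=\SectionFont]
\stoptyping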

It might sound like using \LUA\ for some tasks makes \CONTEXT\ slower, but this
is not true. Of course it's hard to prove, because by now we also have more
advanced font support, cleaner math mechanisms, and additional features,
especially in structure related mechanisms. There are also mechanisms that are
faster, for instance extreme tables (a follow up on natural tables) and mixed
column modes. Of course on the previously mentioned 300 pages of simple
paragraphs with simple Latin text the \PDFTEX\ engine is much faster than
\LUATEX, also because simple fonts are used. But for many of today's documents
this engine is no longer an option. For instance in our \XML\ processing in
multiple languages, \LUATEX\ beats \PDFTEX. There is not that much left to
optimize, so most speedup has to come from faster machines. And this is not much
different from the past: processing a 300 page document on a 4.7 MHz 8086
architecture was not much fun, and we're not even talking of advanced macros
here. Faster machines made more clever and user friendly systems possible, but
at the cost of runtime, so even if machines have become many times faster,
processing still takes time. On the other hand, \CONTEXT\ will not become more
complex than it is now, so from now on we can benefit from faster \CPU's, memory
and storage.

\stopsection

\stopchapter

\stopcomponent