% language=us runpath=texruns:manuals/musings

% \usemodule[abbreviations-smallcaps]

\startcomponent musings-performance

\environment musings-style

\startchapter[title={Performance again}]

\startlines \setupalign[flushright]
Hans Hagen
Hasselt NL
February 2020 (public 2023)
\stoplines

\startsection[title=Introduction]

In a \MAPS\ article of 2019 I tried to answer the question \quote {Is \TEX\
really slow?}. A while after it was published, a user on the Dutch \TEX\ mailing
list posted a comment stating that in his experience the \LUATEX\ engine in
combination with \LATEX\ was terribly slow: one page per second for a Japanese
text. It was also slower than \PDFTEX\ with English, but for Japanese it was
close to unusable. The alternative, using a Japanese \TEX\ engine, was not an
option due to lack of support for certain images.

In order to check this claim I ran a test in \CONTEXT. Even on my 8 year old
laptop I could get 45 pages per second for full page Japanese texts (6
paragraphs of 300 characters each per page): 167 pages took just under 4
seconds. Typesetting Japanese involves specific spacing and line break handling.
So, naturally the question arises: why the difference? Frans Goddijn wondered if
I could explain a bit more about that, so here we go.

In the mentioned article I already explained what factors play a role, and the
macro package is one of them. It is hard to say to what extent inefficient
macros or a complex layout influence the runtime, but in my experience it is
pretty hard to get speeds as low as 1 page per second. On an averagely complex
document like the \LUATEX\ manual (lots of verbatim and tables, but nothing else
demanding apart from color being used and a unique \METAPOST\ graphic per page)
I get at least a comfortable 20 pages per second.

I can imagine that for a \TEX\ user who sees other programs on a computer do
complex things fast, the performance of \TEX\ is puzzling. But, where for
instance rendering videos can benefit from specific features of (video)
processors, multiple cores, or just aggressive optimization by compilers of
(nested) loops and manipulations of arrays of bytes, this is not the case for
\TEX. This program processes everything in sequence, there is not much
repetition that can be optimized, it cannot exploit the processor in special
ways, and the compiler cannot do that many optimizations.

I can't answer why a \LATEX\ run is slower than a \CONTEXT\ run. Actually, one
persistent story has always been that \CONTEXT\ was the slow one in comparison.
But maybe it helps to know a bit about what happens deep down in \TEX\ and how
macro code can play a role in performance. In doing so I will simplify things a
bit.

\stopsection

\startsection[title=Text and nodes]

The \TEX\ machinery takes input and turns it into an internal representation
that can then be turned into a visual one, ending up as \PDF. So say that we
have this:

\starttyping
hello
\stoptyping

In a regular programming language this is a string with five characters. When the
string is manipulated it is basically still a sequence of bytes in memory. In
\TEX, if this is meant as text, at some point the internal representation is a so
called node list:

\starttyping
[h] -> [e] -> [l] -> [l] -> [o]
\stoptyping

In traditional \TEX\ these are actually character nodes. They have a few properties,
like what font the character is from and what the character code is (0 up to 255).
At some point \TEX\ will turn that list into a glyph list. Say that we have this:

\starttyping
efficient
\stoptyping

This will eventually become seven nodes:

\starttyping
[e] -> [ffi] -> [c] -> [i] -> [e] -> [n] -> [t]
\stoptyping

The ffi ligature is a glyph node which also keeps the information that this one
character was made from three.

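In \LUATEX\ this can actually be inspected at the \LUA\ end. Here is a minimal
sketch (plain \LUATEX; macro packages like \CONTEXT\ wrap callback registration
in their own mechanisms) that walks the typeset lines and reports what each
ligature was built from, assuming a font for which the engine itself builds the
ligatures:

\starttyping
-- a sketch: report ligatures and their components after line breaking
local glyph_id = node.id("glyph")
local hlist_id = node.id("hlist")

local function show_ligatures(head)
    for line in node.traverse_id(hlist_id, head) do
        for n in node.traverse_id(glyph_id, line.head) do
            if n.components then
                -- the components field holds the original glyphs
                local t = { }
                for c in node.traverse_id(glyph_id, n.components) do
                    t[#t+1] = unicode.utf8.char(c.char)
                end
                print("ligature made from: " .. table.concat(t))
            end
        end
    end
    return head
end

callback.register("post_linebreak_filter", show_ligatures)
\stoptyping
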
In \LUATEX\ it is different, and this is one of the reasons for it being slower. We
stick to the first example:

\starttyping
[h] <-> [e] <-> [l] <-> [l] <-> [o]
\stoptyping

So, instead of pointing only to the next node, we also point back to the
previous one: we have a doubly linked list. This means that all over the program
these extra links have to be maintained too. They are not used by \TEX\ itself,
but they are handy at the \LUA\ end. And instead of only having the font as a
property, there is much more. The \TEX\ program can deal with multiple languages
at the same time and this relates to hyphenation. In traditional \TEX\ there are
language nodes that indicate a switch to another language. But in \LUATEX\ that
property is kept with each glyph node. Actually, even specific language
properties, like the left and right hyphenation minima and whether uppercase
characters should be hyphenated, are kept with these nodes. Spaces are turned
into glue nodes, and these nodes are also larger than in regular \TEX\ engines.

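To get an impression of what each glyph node carries around, here is a small
sketch (again plain \LUATEX) that prints some of these properties just before a
paragraph gets broken into lines; the \type {lang}, \type {left}, \type {right}
and \type {uchyph} fields are the per|-|glyph language data mentioned above:

\starttyping
-- a sketch: show per-glyph properties in a paragraph
local glyph_id = node.id("glyph")

local function show_glyphs(head)
    for n in node.traverse_id(glyph_id, head) do
        -- character, font, language, hyphenation minima, uppercase flag
        print(n.char, n.font, n.lang, n.left, n.right, n.uchyph)
    end
    return true -- keep the list as it is
end

callback.register("pre_linebreak_filter", show_glyphs)
\stoptyping
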
So, in \LUATEX, when a character goes from the input into a node, a more complex
data structure has to be set up, and that larger data structure also takes more
memory. That in turn means that caching (close to the \CPU) is affected. Add to
that the fact that we operate on 32 bit character values, which also comes with
higher memory demands.

We mentioned that a traditional engine goes from one state of the node list into
another (the ligature building). Actually this is an integrated process: a lot
happens on the fly. If something is put into a \type {\hbox} no hyphenation
takes place, only ligature building and kerning. When a paragraph is typeset,
hyphenation happens on demand, in places where it makes sense.

In \LUATEX\ these stages are split. A node list is {\em always} hyphenated.
Hyphenation, ligature building and kerning are {\em three} separate steps. So,
there is always more hyphenation going on than in a traditional \TEX\ engine: we
get more discretionary nodes, and again these take more memory than before;
also, the more nodes we have, the more it impacts performance down the line. The
reason for this split is that each step can be intercepted and replaced by a
\LUA\ driven one. In practice, with modern \OPENTYPE\ fonts, that is what
happens: these are dealt with (or at least managed) in \LUA. For Japanese the
built|-|in ligature building and kerning certainly don't apply: the work is
delegated and this comes at a price. Japanese needs no hyphenation, but instead
characters are treated with respect to their neighbors and glue nodes are
injected when needed. This is something that \LUA\ code is used for, so here
performance is determined by how well the plugged|-|in code behaves. It can be
inefficient, but it can also be so clever that it just takes a bit of time to
complete.

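In a plain \LUATEX\ setup that interception looks as follows. Each registered
function replaces a built|-|in step, and the defaults remain available as
helpers, so a minimal sketch can simply call them back; this is, roughly, the
hook where an \OPENTYPE\ handler or a Japanese spacing routine plugs in:

\starttyping
-- a sketch: take over the three stages but delegate to the built-in code
callback.register("hyphenate", function(head, tail)
    -- here one could filter or replace; we just run the default
    lang.hyphenate(head)
end)
callback.register("ligaturing", function(head, tail)
    node.ligaturing(head) -- built-in (base mode) ligature building
end)
callback.register("kerning", function(head, tail)
    node.kerning(head)    -- built-in (base mode) font kerning
end)
\stoptyping
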
I didn't mention another property of nodes yet: attributes. Each node that has
some meaning in the node list (glyphs, kerns, glue, penalties, discretionaries,
\unknown, these terms should ring bells for a \TEX\ user) has a pointer to an
attribute list. Often these lists are the same for neighboring nodes, but they
can differ. If a macro package sets 10 attributes, then there will be lists of
ten attribute nodes (plus some overhead) active. When values change, copies are
made with the change applied. Grouping complicates this even further. All this
has an impact on performance. Not only do these lists need to be managed, but
when they are consulted at the \LUA\ end (they are meant as a communication
channel with that bit of the engine) these lists have to be interpreted. It all
adds up to more runtime. There is nothing like that in traditional \TEX, but
there the extra macro juggling needed to achieve the same effects can cause a
performance hit.

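A sketch of how that communication works; the attribute number 101 is an
arbitrary choice here, and again this is plain \LUATEX:

\starttyping
-- a sketch: set an attribute and consult it at the Lua end
tex.setattribute(101, 5) -- nodes made from now on carry attribute 101 = 5

local function check(head)
    for n in node.traverse(head) do
        local v = node.has_attribute(n, 101)
        if v then
            print("node with attribute 101 set to", v)
        end
    end
    return true
end

callback.register("pre_linebreak_filter", check)
\stoptyping
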
\stopsection

\startsection[title=Macros and tokens]

When you define a macro like this:

\starttyping
\def\MyMacro#1{\hbox{here: #1!}}
\stoptyping

the \TEX\ engine will parse this as follows (we keep it simple):

\starttabulate[|Tc|l|]
\NC \string\def               \NC primitive token \NC \NR
\NC \string\MyMacro           \NC user macro pointing to: \NC \NR
\NC \char\hashasciicode 1     \NC argument list of length 1 and no delimiters \NC \NR
\NC \char\leftbraceasciicode  \NC open brace token \NC \NR
\NC \string\hbox              \NC hbox primitive token \NC \NR
\NC h                         \NC letter token h \NC \NR
\NC e                         \NC letter token e \NC \NR
\NC r                         \NC letter token r \NC \NR
\NC e                         \NC letter token e \NC \NR
\NC :                         \NC other token : \NC \NR
\NC                           \NC space token \NC \NR
\NC \char\hashasciicode 1     \NC reference to argument \NC \NR
\NC !                         \NC other token ! \NC \NR
\NC \char\rightbraceasciicode \NC close brace token \NC \NR
\stoptabulate

The \type {\def} is eventually lost, and the meaning of the macro is stored as a
linked list of tokens that gets bound to the user macro \type {\MyMacro}. The
details of how this list is stored internally can differ a bit per engine but
the idea remains the same. If you compare the tokens of a traditional \TEX\
engine with those of \LUATEX, the main difference is in their size: those in
\LUATEX\ take more memory and again that impacts performance.

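The \LUA\ end gives a peek at this: a control sequence, once looked up, is a
single token with a command code, while the stored meaning is a separate token
list. A small sketch (the \type {get_macro} helper is available in recent
\LUATEX\ versions):

\starttyping
-- a sketch: inspect the macro defined above
local t = token.create("MyMacro")
print(t.command, t.cmdname)       -- the internal command code and its name
print(token.get_macro("MyMacro")) -- the replacement text of the macro
\stoptyping
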
\stopsection

\startsection[title=Processing]

Now, for a moment we step aside and look at a regular programming language, like
\PASCAL, the language \TEX\ is written in, or \CCODE\ that is used for \LUATEX.
The high level definitions, using the syntax of the language, get compiled into
low level machine code: a sequence of instructions for the \CPU. When doing so
the compiler can try to optimize the code. When the program is executed all the
\CPU\ has to do is fetch the instructions and execute them, which in turn can
lead to fetching data from memory. Successive generations of \CPU s have become
more clever in handling this: predicting what might happen, (pre)fetching data
from memory, etc.

When you look at scripting languages, again a high level syntax is used, but
after interpretation it becomes compact so called bytecode: a sequence of
instructions for a virtual machine that itself is a compiled program. The
virtual machine fetches the bytes and acts upon them. It also deals with
managing memory and variables. There is not much optimization going on there,
certainly not when the language permits dynamically changing function calls and
such. Here performance is not only influenced by the virtual machine but also by
the quality of the original code (the scripts). In \LUATEX\ we're talking about
\LUA\ here, a scripting language that is actually considered to be pretty fast.

Sometimes bytecode can be compiled just in time into low level machine code, but
for \LUATEX\ that doesn't work out well. Much \LUA\ code is executed only once
or a few times, so it simply doesn't pay off. Apart from that there are other
limitations to this (in itself impressive) technology, so I will not go into
more detail.

So how does \TEX\ work? It is important to realize that we have a mix of input
and macros. The engine interprets that on the fly. A character enters the input
and \TEX\ has to look at it in the perspective of what it expects. Is it just a
character? Is it part of a control sequence that started (normally) with a
backslash? Does it have a special meaning, like triggering math mode? When a
macro is defined, it gets stored as a linked list of tokens, and when it gets
called the engine has to expand that meaning. In the process some actions
themselves kind of generate input. When that happens a new level of input is
entered and further expansion takes place. Sometimes \TEX\ looks ahead and, when
not satisfied, pushes something back into the input, which again introduces a
new level. A lot can happen when a macro gets expanded. If you want to see this,
just add \type {\tracingall} at the top of your file: you will be surprised! You
will not see how often tokens get pushed and popped, but you can see how much
got expanded and how often local changes get restored. By the way, here is
something to think about:

\starttyping
\count4=123
\advance \count4 by 123
\stoptyping

If this is in your running text, the scanner sees \type {\count} and then
triggers the code that handles it. That code expects a register number, here
that is the \type {4}. Then it checks if there is an optional \type {=}, which
means that it has to look ahead. In the second line it checks for the optional
keyword \type {by}. This optional scanning has a side effect: when the next
token is {\em not} an equal sign or keyword, it has to push back what it just
read (we enter a new input level) and go forward. It then scans a number. That
number ends with a space or \type {\relax} or something else that is not a
number. Again, some pushing back onto the input can happen. In fact, if instead
of \type {4} we have a macro indicating the register number, intermediate
expansion takes place. So, even these simple lines already involve a lot of
action! Now, say that we have this:

\starttyping
% \newcount \scratchcounter % done once
\scratchcounter 123
\scratchcounter =123
\advance\scratchcounter by 123
\advance\scratchcounter 123
\stoptyping

Can you predict which is more efficient? If this operation doesn't happen
frequently, performance wise there is no real difference between the variants
with and without \type {=} and with and without \type {by}. This is because
\TEX\ is pretty fast in tokenizing its input and in interpreting the already
stored token lists that contain these commands. But given what we said before,
when you talk of millions of such assignments, adding the equal sign and \type
{by} {\em could} actually be faster because there is no pushing back onto the
input stack involved. It probably makes no sense to take this into account when
writing macros, but just keep in mind that performance is in the details.

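These scanners are also exposed at the \LUA\ end, which makes the look|-|ahead
and push|-|back explicit. A sketch of a \LUA\ driven variant (the name \type
{MyAdvance} is made up for the occasion), to be triggered with \type {\directlua}
while \TEX\ is reading the input:

\starttyping
-- a sketch: mimic the \advance scanner with Lua token scanners
function MyAdvance()
    -- scan_keyword looks ahead; when "by" is absent the tokens just
    -- read are pushed back onto the input
    local ok = token.scan_keyword("by")
    -- scan_int expands while scanning and stops at the first token
    -- that can not be part of a number
    local n = token.scan_int()
    print(ok, n)
end
\stoptyping
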
% written in 2020, the next added in January 2023

Actually, contrary to what you might expect, \type {\scratchcounter} is not even
a counter in \CONTEXT, and in \LUAMETATEX\ we can also do this:

\starttyping
% \newinteger\scratchcounter % done once
\scratchcounter 123
\scratchcounter =123
\advanceby\scratchcounter 123
\stoptyping

Which means that, because this counter is defined as a so called \quotation
{constant integer}, it avoids some indirectness (going via a counter register),
and because \type {\advanceby} doesn't scan for a keyword, the code above runs
faster anyway.

This model of expansion is very different from compiled code or bytecode. To
some extent you can consider the list of tokens that makes up a macro to be
bytecode, but instead of a sequence of bytes it is a linked list, and that in
itself carries a performance penalty. Depending on how macros expand, the engine
can be hopping all over the token memory following that list. That means that
quite likely the data that gets accessed is not in your \CPU\ cache, so
performance cannot benefit from that. And the expanding machinery itself is not
a simple loop messing around with variables: it accesses code all over the
place! Text gets hyphenated, fonts get applied, material gets boxed, paragraphs
get constructed, pages get built. We're not moving a blob of bits around (as in
a video) but constantly manipulating small amounts of memory scattered around
the memory space.

Now, where a traditional \TEX\ engine works on 8 bit characters and smaller
tokens, the 32 bit \LUATEX\ works on larger chunks. Although macro names are
stored as single symbolic units, there are moments when the real (serialized to
characters) name is used, for instance with \type {\csname}. When that happens,
the singular token becomes a list, so for instance the (stored) token \type
{\foo} becomes a temporary three token list (actually four if you also count the
initial reference token). Those three tokens become three characters in a string
that is then used in the hash lookup. There are plenty of cases where such
temporary string variables are allocated and filled. Compare:

\starttyping
\def\foo{\hello}
\stoptyping

Here the macro \type {\foo} has just a one token reference to \type {\hello}
because that's how a macro reference gets stored. But in

\starttyping
\def\foo{\csname hello\endcsname}
\stoptyping

we have two plus five tokens to access what effectively is \type {\hello}. Each
character token has to be converted to a byte in the assembled string. Now, it
must be said that in practice this is still pretty fast, but when we have longer
names, and especially when we have \UTF8 characters in there, it can come at a
price. It really depends on how your macro package works and sometimes you just
pay the price of progress. Buying a faster machine is then the solution because
often we're not talking of extreme performance loss here. And modern \CPU s can
juggle bytes quite efficiently. Actually, when we move to 64 bit architectures,
\LUATEX's data structures fit those quite well. As a side note: when you run a
32 bit binary on a 64 bit architecture there can even be a price paid for that
when you use \LUATEX. Just move on!

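For the curious, the cost of such by|-|name lookups can be made visible at the
\LUA\ end with a rough micro|-|benchmark; this assumes \type {\hello} is
defined, and the numbers will of course differ per machine:

\starttyping
-- a sketch: repeated lookup by name involves assembling a string
-- and a hash lookup every time
local create = token.create
local start  = os.clock()
for i = 1, 1000000 do
    local t = create("hello")
end
print(os.clock() - start)
\stoptyping
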
\stopsection

\startsection[title=Management]

Before we even reach the point where some content gets typeset, much can happen:
the engine has to start up. It is quite common that a macro package uses a
memory dump so that macros don't have to be parsed each run. In traditional
engines hyphenation patterns are stored in the memory dump as well. And some
macro packages can put fonts in it. All kinds of details, like upper- and
lowercase codes, can get stored too. In \LUATEX\ fonts and patterns are normally
kept out of the dump. That dump itself is already much larger because we have 32
bit characters instead of 8 bit ones, so more memory is used. There are also new
concepts, like catcode tables, that take space. Math is 32 bit too, so more
codes related to math are stored. Actually the format is so much larger that
\LUATEX\ compresses it. Anyway, it has an impact on startup time. It is not that
much, but when you measure differences on a one page document the overhead of
getting \LUATEX\ up and running will definitely impact the measurement.

The same is true for the backend. A traditional engine uses (normally) \TYPEONE\
fonts while \LUATEX\ relies on \OPENTYPE\ ones. So, the backend has to do more
work. The impact is normally only visible when the document is finalized: there
can be a slightly larger hiccup after the last page. So, when you measure one
page performance, this again pollutes the pages per second figure.

\stopsection

\startsection[title=Summary]

So, to come back to the observation that \LUATEX\ is slower than \PDFTEX: at
least for \CONTEXT\ we can safely conclude that indeed \PDFTEX\ is faster when
we talk about a standard English document, with \TEX\ \ASCII\ input, where we
can do with traditional small fonts, with only some kerning and simple
ligatures. But as soon as we deal with for instance \XML, have different
languages and scripts, have more demanding layouts, use color and images, and
maybe even use features that we were not aware of and therefore didn't require
in former times, the \LUATEX\ engine (and for \CONTEXT\ its \LUAMETATEX\ follow
up) performs way better than \PDFTEX. And how about support for hyperlinks,
protrusion and expansion, tagging for the sake of accessibility, new layers of
abstraction, etc.? The impact on performance can differ a lot per engine (and
probably also per macro package). So, there is no simple answer to or
explanation for the observed slow \LATEX\ run on Japanese text, apart from
saying: look at the whole picture. We have more complex tokens, nodes, scripts
and languages, fonts, macros, demands on the machinery, etc. Maybe it is just
the price you pay for all that.

\stopsection

\stopchapter

\stopcomponent