% language=us runpath=texruns:manuals/musings

% \usemodule[abbreviations-smallcaps]

\startcomponent musings-performance

\environment musings-style

\startchapter[title={Performance again}]

\startlines
\setupalign[flushright]
Hans Hagen
Hasselt NL
February 2020 (public 2023)
\stoplines

\startsection[title=Introduction]

In a \MAPS\ article of 2019 I tried to answer the question \quote {Is \TEX\
really slow?}. A while after it was published, a user on the Dutch \TEX\
mailing list posted a comment stating that in his experience the \LUATEX\
engine in combination with \LATEX\ was terribly slow: one page per second for a
Japanese text. It was also slower than \PDFTEX\ with English input, but for
Japanese it was close to unusable. The alternative, using a Japanese \TEX\
engine, was not an option due to lack of support for certain images.

In order to check this claim I ran a test in \CONTEXT. Even on my 8 year old
laptop I could get 45 pages per second for full page Japanese texts (6
paragraphs of 300 characters each per page): 167 pages took just less than 4
seconds. Typesetting Japanese involves specific spacing and line break
handling. So, naturally the question arises: why the difference? Frans Goddijn
wondered if I could explain a bit more about that, so here we go.

In the mentioned article I already explained what factors play a role and the
macro package is one of them. It is hard to say to what extent inefficient
macros or a complex layout influence the runtime, but my experience is that it
is pretty hard to get speeds as low as 1 page per second. On an average complex
document like the \LUATEX\ manual (lots of verbatim and tables, but nothing
else demanding apart from color being used and a unique \METAPOST\ graphic per
page) I get at least a comfortable 20 pages per second.

I can imagine that for a \TEX\ user who sees other programs on a computer do
complex things fast, the performance of \TEX\ is puzzling. But, where for
instance rendering videos can benefit from specific features of (video)
processors, multiple cores, or just aggressive optimization by compilers of
(nested) loops and manipulation of arrays of bytes, this is not the case for
\TEX. This program processes everything in sequence; there is not much
repetition that can be optimized, it cannot exploit the processor in special
ways, and the compiler cannot do that many optimizations.

I can't answer why a \LATEX\ run is slower than a \CONTEXT\ run. Actually, one
persistent story has always been that \CONTEXT\ was the slow one in comparison.
But maybe it helps to know a bit about what happens deep down in \TEX\ and how
macro code can play a role in performance. When doing that I will simplify
things a bit.

\stopsection

\startsection[title=Text and nodes]

The \TEX\ machinery takes input and turns that into some representation that
can be turned into a visual representation, ending up as \PDF. So say that we
have this:

\starttyping
hello
\stoptyping

In a regular programming language this is a string with five characters. When
the string is manipulated it is basically still a sequence of bytes in memory.
In \TEX, if this is meant as text, at some point the internal representation is
a so-called node list:

\starttyping
[h] -> [e] -> [l] -> [l] -> [o]
\stoptyping

In traditional \TEX\ these are actually character nodes. They have a few
properties, like what font the character is from and what the character code is
(0 up to 255). At some point \TEX\ will turn that list into a glyph list.
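If you want to see such a list yourself, you can box a word and ask the engine
to show it in the log file. A quick experiment, using generic primitives and
\type {\maxdimen} as a lazy way to say \quotation {show everything}:

\starttyping
\showboxbreadth\maxdimen % show all nodes in a list
\showboxdepth \maxdimen % also recurse into nested lists
\setbox0\hbox{hello} % typeset the word into a box register
\showbox0 % dump the node list of that box to the log
\stoptyping

The log then shows one entry per glyph, each with its font and character.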
Say that we have this:

\starttyping
efficient
\stoptyping

This will eventually become seven nodes:

\starttyping
[e] -> [ffi] -> [c] -> [i] -> [e] -> [n] -> [t]
\stoptyping

The ffi ligature is a glyph node which also keeps information about this one
character being made from three.

In \LUATEX\ it is different, and this is one of the reasons for it being
slower. We stick to the first example:

\starttyping
[h] <-> [e] <-> [l] <-> [l] <-> [o]
\stoptyping

So, instead of pointing to the next node, we also point back to the previous
one: we have a doubly linked list. This means that all over the program we need
to maintain these extra links too. They are not used by \TEX\ itself, but they
are handy at the \LUA\ end.

But, instead of only having the font as a property, there is much more. The
\TEX\ program can deal with multiple languages at the same time and this
relates to hyphenation. In traditional \TEX\ there are language nodes that
indicate a switch to another language. But in \LUATEX\ that property is kept
with each glyph node. Actually, even specific language properties like the
hyphen min and max values and whether uppercase words should be hyphenated are
kept with these nodes. Spaces are turned into glue nodes, and these nodes are
also larger than in regular \TEX\ engines.

So, in \LUATEX, when a character goes from the input into a node, a more
complex data structure has to be set up, and the larger data structure also
takes more memory. That in turn means that caching (close to the \CPU) gets
influenced. Add to that the fact that we operate on 32 bit character values,
which also comes with higher memory demands.

We mentioned that a traditional engine goes from one state of the node list
into another (the ligature building). Actually this is an integrated process: a
lot happens on the fly. If something is put into an \type {\hbox} no
hyphenation takes place, only ligature building and kerning. When a paragraph
is typeset, hyphenation happens on demand, in places where it makes sense.

In \LUATEX\ these stages are split. A node list is {\em always} hyphenated.
This step as well as ligature building and kerning are {\em three} separate
steps. So, there is always more hyphenation going on than in a traditional
\TEX\ engine: we get more discretionary nodes and again these take more memory
than before; also, the more nodes we have, the more it will impact performance
down the line. The reason for this split is that each step can be intercepted
and replaced by a \LUA\ driven one. In practice, with modern \OPENTYPE\ fonts
that is what happens: these are dealt with (or at least managed) in \LUA.

For Japanese, the built|-|in ligature building and kerning certainly don't
apply: the work is delegated and this comes at a price. Japanese needs no
hyphenation but instead characters are treated with respect to their neighbors
and glue nodes are injected when needed. This is something that \LUA\ code is
used for, so here performance is determined by how well the plugged in code
behaves. It can be inefficient but it can also be so clever that it just takes
a bit of time to complete.

I didn't mention another property of nodes: attributes. Each node that has some
meaning in the node list (glyphs, kerns, glue, penalties, discretionaries,
\unknown, these terms should ring bells for a \TEX\ user) has a pointer to an
attribute list. Often these lists are the same for neighboring nodes, but they
can be different. If a macro package sets 10 attributes, then there will be
lists of ten attribute nodes (plus some overhead) active.
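As a minimal illustration at the primitive level (a macro package normally
wraps this in an interface, so take this as a sketch):

\starttyping
\attribute2=5 % nodes made from now on get attribute 2 with value 5
\hbox{foo} % the glyph nodes in this box point to that attribute list
\attribute2=-"7FFFFFFF % this special value unsets the attribute again
\stoptyping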
When attribute values change, copies of such lists are made with the change
applied. Grouping complicates this even a little more. This has an impact on
performance. Not only do these lists need to be managed; when they are
consulted at the \LUA\ end (they are meant as communication with that bit of
the engine) these lists also have to be interpreted. It all adds up to more
runtime. There is nothing like that in traditional \TEX, but there the extra
macro juggling needed to achieve the same effects can cause a performance hit.

\stopsection

\startsection[title=Macros and tokens]

When you define a macro like this:

\starttyping
\def\MyMacro#1{\hbox{here: #1!}}
\stoptyping

the \TEX\ engine will parse this as follows (we keep it simple):

\starttabulate[|Tc|l|]
\NC \string\def \NC primitive token \NC \NR
\NC \string\MyMacro \NC user macro pointing to: \NC \NR
\NC \char\hashasciicode 1 \NC argument list of length 1 and no delimiters \NC \NR
\NC \char\leftbraceasciicode \NC open brace token \NC \NR
\NC \string\hbox \NC hbox primitive token \NC \NR
\NC h \NC letter token h \NC \NR
\NC e \NC letter token e \NC \NR
\NC r \NC letter token r \NC \NR
\NC e \NC letter token e \NC \NR
\NC : \NC other token : \NC \NR
\NC \NC space token \NC \NR
\NC \char\hashasciicode 1 \NC reference to argument \NC \NR
\NC ! \NC other token ! \NC \NR
\NC \char\rightbraceasciicode \NC close brace token \NC \NR
\stoptabulate

The \type {\def} is eventually lost, and the meaning of the macro is stored as
a linked list of tokens that gets bound to the user macro \type {\MyMacro}. The
details about how this list is stored internally can differ a bit per engine
but the idea remains. If you compare tokens of a traditional \TEX\ engine with
\LUATEX, the main difference is in the size: those in \LUATEX\ take more memory
and again that impacts performance.

\stopsection

\startsection[title=Processing]

Now, for a moment we step aside and look at a regular programming language,
like \PASCAL, the language \TEX\ is written in, or \CCODE\ that is used for
\LUATEX. The high level definitions, using the syntax of the language, get
compiled into low level machine code: a sequence of instructions for the \CPU.
When doing so the compiler can try to optimize the code. When the program is
executed all the \CPU\ has to do is fetch the instructions and execute them,
which in turn can lead to fetching data from memory. Successive versions of
\CPU's have become more clever in handling this: predicting what might happen,
(pre)fetching data from memory, etc.

When you look at scripting languages, again a high level syntax is used, but
after interpretation it becomes compact, so-called bytecode: a sequence of
instructions for a virtual machine that itself is a compiled program. The
virtual machine fetches the bytes and acts upon them. It also deals with
managing memory and variables. There is not much optimization going on there,
certainly not when the language permits dynamically changing function calls and
such. Here performance is not only influenced by the virtual machine but also
by the quality of the original code (the scripts).

In \LUATEX\ we're talking \LUA\ here, a scripting language that is actually
considered to be pretty fast. Sometimes bytecode can be compiled Just In Time
into low level machine code but for \LUATEX\ that doesn't work out well. Much
\LUA\ code is executed only once or a few times so it simply doesn't pay off.
Apart from that there are other limitations with this (in itself impressive)
technology so I will not go into more detail.
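To get a feel for how those two worlds meet: in \LUATEX\ a chunk of \LUA\ code
enters through the \type {\directlua} primitive, gets compiled to bytecode by
the virtual machine, runs, and can push material back into the \TEX\ input. A
trivial example:

\starttyping
\directlua {
    -- this chunk is compiled to bytecode and then executed
    tex.print("typeset from Lua") -- feeds text back into the TeX input
}
\stoptyping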
So how does \TEX\ work? It is important to realize that we have a mix of input
and macros. The engine interprets that on the fly. A character enters the input
and \TEX\ has to look at it in the perspective of what it expects. Is it just a
character? Is it part of a control sequence that started (normally) with a
backslash? Does it have a special meaning, like triggering math mode?

When a macro is defined, it gets stored as a linked list of tokens and when it
gets called the engine has to expand that meaning. In the process some actions
themselves generate new input. When that happens a new level of input is
entered and further expansion takes place. Sometimes \TEX\ looks ahead and,
when not satisfied, pushes something back into the input, which again
introduces a new level.

A lot can happen when a macro gets expanded. If you want to see this, just add
\type {\tracingall} at the top of your file: you will be surprised! You will
not see how often tokens get pushed and popped but you can see how much got
expanded and how often local changes get restored.

By the way, here is something to think about:

\starttyping
\count4=123
\advance \count4 by 123
\stoptyping

If this is in your running text, the scanner sees \type {\count} and then
triggers the code that handles it. That code expects a register number, here
that is the \type {4}. Then it checks if there is an optional \type {=} which
means that it has to look ahead. In the second line it checks for the optional
keyword \type {by}. This optional scanning has a side effect: when the next
token is {\em not} an equal sign or keyword, it has to push back what it just
read (we enter a new input level) and go forward. It then scans a number. That
number ends with a space or \type {\relax} or something else not being a
number. Again, some push back onto the input can happen. In fact, when instead
of \type {4} we have a macro indicating the register number, intermediate
expansion takes place. So, even these simple lines already involve a lot of
action!

Now, say that we have this:

\starttyping
% \newcount \scratchcounter % done once
\scratchcounter 123
\scratchcounter =123
\advance\scratchcounter by 123
\advance\scratchcounter 123
\stoptyping

Can you predict what is more efficient? If this operation doesn't happen
frequently, performance wise there is no real difference between the variants
with and without \type {=} and with and without \type {by}. This is because
\TEX\ is pretty fast in tokenizing its input and in interpreting its already
stored token lists that have these commands. But given what we said before,
when you talk of millions of such assignments, adding the equal sign and \type
{by} {\em could} actually be faster because there is no pushing back onto the
input stack involved. It probably makes no sense to take this into account when
writing macros but just keep in mind that performance is in the details.

% written in 2020, the next added in January 2023

Actually, contrary to what you might expect, \type {\scratchcounter} is not
even a counter in \CONTEXT, and in \LUAMETATEX\ we can also do this:

\starttyping
% \newinteger\scratchcounter % done once
\scratchcounter 123
\scratchcounter =123
\advanceby\scratchcounter 123
\stoptyping

Which means that because this counter is defined as a so-called \quotation
{constant integer} it avoids some indirectness (to a counter register) and
because \type {\advanceby} doesn't scan for a keyword the code above runs
faster anyway.
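If you want to get a feel for such differences yourself, a crude experiment is
to repeat an assignment millions of times and compare runtimes. A minimal
sketch in plain \TEX\ syntax, where only the relative difference between the
variants means something:

\starttyping
\newcount\testcount
\newcount\loopcount
\loop
    \advance\testcount by 1 % compare a run with: \advance\testcount 1
    \advance\loopcount 1
\ifnum\loopcount<10000000
\repeat
\stoptyping

Keep in mind that the loop itself also takes time, so whatever difference there
is gets diluted in the total runtime.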
This model of expansion is very different from compiled code or bytecode. To
some extent you can consider the list of tokens that makes up a macro to be
bytecode, but instead of a sequence of bytes it is a linked list. That itself
comes with a performance penalty. Depending on how macros expand, the engine
can be hopping all over the token memory following that list. That means that
quite likely the data that gets accessed is not in your \CPU\ cache, so
performance cannot benefit from it. The exception is of course the expanding
machinery itself, but that one is not a simple loop messing around with
variables: it accesses code all over the place! Text gets hyphenated, fonts get
applied, material gets boxed, paragraphs constructed, pages built. We're not
moving a blob of bits around (as in a video) but we're constantly manipulating
small amounts of memory scattered around memory space.

Now, where a traditional \TEX\ engine works on 8 bit characters and smaller
tokens, the 32 bit \LUATEX\ works on larger chunks. Although a macro name is
stored as a single symbolic unit, there are moments when its real (serialized
to characters) name is used, for instance with \type {\csname}. When that
happens, the singular token becomes a list, so for instance the (stored) token
\type {\foo} becomes a temporary three token list (actually four if you also
count the initial reference token). Those three tokens become three characters
in a string that then is used in the hash lookup. There are plenty of cases
where such temporary string variables are allocated and filled. Compare:

\starttyping
\def\foo{\hello}
\stoptyping

Here the macro \type {\foo} has just a one token reference to \type {\hello}
because that's how a macro reference gets stored. But in

\starttyping
\def\foo{\csname hello\endcsname}
\stoptyping

we have two plus five tokens to access what effectively is \type {\hello}. Each
character token has to be converted to a byte in the assembled string. Now it
must be said that in practice this is still pretty fast, but when we have
longer names and especially when we have \UTF8 characters in there it can come
at a price.
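A common trick to avoid paying for that conversion over and over again is to
resolve the name just once, here with a generic \TEX\ idiom (not \CONTEXT\
specific):

\starttyping
% the \csname lookup happens once, at definition time:
\expandafter\let\expandafter\foo\csname hello\endcsname
\stoptyping

After this, \type {\foo} is simply the same single token as \type {\hello} and
no string needs to be assembled when it is used.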
It really depends on how your macro package works and sometimes you just pay
the price of progress. Buying a faster machine is then the solution because
often we're not talking of extreme performance loss here. And modern \CPU's can
juggle bytes quite efficiently. Actually, when we go to 64 bit architectures,
\LUATEX's data structures fit quite well with that. As a side note: when you
run a 32 bit binary on a 64 bit architecture there can even be a price paid for
that when you use \LUATEX. Just move on!

\stopsection

\startsection[title=Management]

Before we can even reach the point that some content becomes typeset, much can
happen: the engine has to start up. It is quite common that a macro package
uses a memory dump so that macros don't have to be parsed each run. In
traditional engines hyphenation patterns are stored in the memory dump as well.
And some macro packages can put fonts in it. All kinds of details, like upper-
and lowercase codes, can get stored too.

In \LUATEX\ fonts and patterns are normally kept out of the dump. That dump
itself is much larger already because we have 32 bit characters instead of 8
bit, so more memory is used. There are also new concepts, like catcode tables,
that take space. Math is 32 bit too, so more codes related to math are stored.
Actually the format is so much larger that \LUATEX\ compresses it. Anyway, it
has an impact on startup time. That impact is not that much, but when you
measure differences on a one page document the overhead in getting \LUATEX\ up
and running will definitely influence the measurement.

The same is true for the backend. A traditional engine (normally) uses
\TYPEONE\ fonts and \LUATEX\ relies on \OPENTYPE. So, the backend has to do
more work. The impact is normally only visible when the document is finalized:
there can be a slightly larger hiccup after the last page. So, when you measure
one page performance, that again pollutes the pages per second figure.

\stopsection

\startsection[title=Summary]

So, to come back to the observation that \LUATEX\ is slower than \PDFTEX: at
least for \CONTEXT\ we can safely conclude that indeed \PDFTEX\ is faster when
we talk about a standard English document, with \TEX\ \ASCII\ input, where we
can do with traditional small fonts, with only some kerning and simple
ligatures. But as soon as we deal with for instance \XML, have different
languages and scripts, have more demanding layouts, use color and images, and
maybe even features that we were not aware of and therefore didn't require in
former times, the \LUATEX\ engine (and for \CONTEXT\ its \LUAMETATEX\ follow
up) performs way better than \PDFTEX. And how about support for hyperlinks,
protrusion and expansion, tagging for the sake of accessibility, new layers of
abstraction, etc.? The impact on performance can differ a lot per engine (and
probably also per macro package).

So, there is no simple answer and explanation for the observed slow \LATEX\ run
on Japanese text, apart from what we can say here: look at the whole picture.
We have more complex tokens, nodes, scripts and languages, fonts, macros,
demands on the machinery, etc. Maybe it is just the price you are paying for
that.

\stopsection

\stopchapter

\stopcomponent