% language=us runpath=texruns:manuals/followingup

\environment followingup-style

\startcomponent followingup-format

\startchapter[title={The format file}]

It is interesting when someone compares macro packages using parameters like
the size of a format file, the output of \type {\tracingall}, or startup time
to make some point. The point I want to make here is that unless you know
exactly what goes on in a run that involves a real document, which can itself
involve multiple runs, such a comparison is rather pointless. For sure I do
benchmark, but I can only draw conclusions about what I (can) know (about).
Yes, benchmarking your own work makes sense, but doing that in comparison to
what you consider comparable variants assumes knowledge of more than your own
work and objectives. For instance, when you load only a few fonts, typeset one
page and don't do anything that demands any processing or multiple runs, you
basically don't measure anything. More interesting are the differences between
10 or 500 pages, a few font calls or tens of thousands, no color or extensive
usage of color and other properties, interfacing, including inheritance of
document constructs, etc. And even then, when comparing macro packages, it is
kind of tricky to deduce much from what you observe. You really need to know
what is going on inside and also how that relates to, for instance, adaptive
font scaling. You can have a fast startup, but if a user needs one tikz
picture, loading that package alone will make you forget the initial startup
time. You always pay a price for advanced features and integration! And we
didn't even talk about the operating system caching files, running on a
network share, sharing processors among virtual machines, etc.

Pointless comparing is also true for looking at the log file when enabling
\type {\tracingall}. When a macro package loads stuff at startup you can be
sure that the log file is larger. When a font or language is loaded for the
first time, or maybe when math is set up, plenty of lines can get dumped.
Advanced analysis of conditions and trial runs come at a price too. And
eventually, when a box is shown, the configured depth and breadth really
matter, and it might also be that the engine provides much more (verbose)
detail. So, such a comparison is again pointless. It can also backfire. Over
the decades of developing \CONTEXT\ I have heard people working on systems
make claims like \quotation {We prefer not to \unknown} or \quotation {It is
better to do it this way \unknown} or (often about operating systems)
\quotation {It is bad that \unknown}, just to see the same being done years
later in the presumed better alternative. I can have a good laugh about that:
all that do-this and don't-do-that advice backfiring.

That brings us to the format file. When you make a \CONTEXT\ format with the
English user interface, with interfacing being a feature that itself
introduces overhead, the \LUATEX\ engine will show this at the end:

\starttyping
Beginning to dump on file cont-en.fmt
 (format=cont-en 2021.6.9)
48605 strings using 784307 bytes
1050637 memory locations dumped; current usage is 414&523763
44974 multiletter control sequences
\font\nullfont=nullfont
0 preloaded fonts
\stoptyping

The file itself is quite large: 11,129,903 bytes. However, it is actually much
larger than that, because the format file is compressed! The real size is
19,399,216 bytes.
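Because the format file that \LUATEX\ writes is zlib compressed, you can check
that real size yourself by decompressing it. Here is a minimal sketch,
assuming the \type {gzip} library that comes bundled with \LUATEX\ (the
filename is just an example, so adapt the path to your setup):

\starttyping
-- report the uncompressed size of a luatex format file; this assumes
-- the gzip library that luatex bundles
local f = gzip.open("cont-en.fmt", "rb")
if f then
    local data = f:read("*a") -- decompress the lot in one go
    f:close()
    print("real size: " .. #data) -- 19399216 in the case above
end
\stoptyping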
Not taking that into account when comparing the size of format files is kind
of bad, because compression directly relates to what resources a format uses
and how usage is distributed over the available memory blobs. The \LUATEX\
engine does some optimizations and saves the data sparsely, but the more holes
you create, the worse it gets. For instance, the large character vectors are
compartmentalized in order to handle \UNICODE\ efficiently, so the used memory
relates to what you define: do you set up all catcodes or just a subset? Maybe
you delay some initialization to after the format is loaded, in which case a
smaller format file gets compensated by more memory usage and initialization
time afterwards. Maybe your temporary macros create holes in the token array.
The memory that is configured in the configuration files also matters. Some
memory blobs are saved at their configured size, others dismiss the top part
that is not used when saving the format but allocate the lot when the format
is loaded. That means that memory usage in for instance \LUATEX\ can be much
larger than a format file suggests. Keep in mind that a format file is
basically a memory dump.

Now, how does \LUAMETATEX\ compare to \LUATEX? Again we will look at the size
of the format file, but you need to keep in mind that for various reasons the
\LMTX\ macros are somewhat more efficient than the \MKIV\ ones; in the
meantime some new mechanisms were added, which adds more \TEX\ and \LUA\ code,
but I still expect (at least for now) a smaller format file. However, when we
create the format we see this (reformatted):

\starttyping
Dumping format 'cont-en.fmt 2021.6.9' in file 'cont-en.fmt':
tokenlist compacted from 489733 to 488204 entries,
1437 potentially aliased lua call/value entries,
max string length 69,
16 fingerprint + 16 engine + 28 preamble + 836326 stringpool
+ 10655 nodes + 3905660 tokens + 705300 equivalents
+ 23072 math codes + 493024 text codes + 38132 primitives
+ 497352 hashtable + 4 fonts + 10272 math + 1008 language
+ 180 insert + 10305643 bytecodes + 12 housekeeping
= 16826700 total.
\stoptyping

This looks quite different from the \LUATEX\ output. Here we report more
detail: for each blob we mention the number of bytes used. The final result is
a file that takes 16,826,700 bytes. That number should be compared with the
19,399,216 bytes for \LUATEX. So, we need less indeed. But, when we compress
the \LUAMETATEX\ format we get 5,913,932 bytes, which is much less than the
11,129,903 byte compressed file that the \LUATEX\ engine makes of it. One
reason for using level 3 zip compression in \LUATEX\ is that (definitely when
we started) it loads faster. It adds to creating the dump but doesn't really
influence loading, although that depends a bit on the compiler used. It is not
easy to see from these numbers what goes on, but when you consider the fact
that we mostly store 32 bit numbers, it will also be clear that many can be
zero or have two or three zero bytes. There's a lot of repetition involved! So
let's look at some of these numbers.
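First a quick sanity check: the blobs reported above indeed add up to the
reported total. The following trivial \LUA\ sketch just collects the byte
counts from the dump and sums them:

\starttyping
-- the byte counts per blob, copied from the dump shown above
local blobs = {
    fingerprint =       16, engine       =     16, preamble  =      28,
    stringpool  =   836326, nodes        =  10655, tokens    = 3905660,
    equivalents =   705300, mathcodes    =  23072, textcodes =  493024,
    primitives  =    38132, hashtable    = 497352, fonts     =       4,
    math        =    10272, language     =   1008, inserts   =     180,
    bytecodes   = 10305643, housekeeping =     12,
}
local total = 0
for _, bytes in next, blobs do
    total = total + bytes
end
print(total) -- 16826700, matching the reported total
\stoptyping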
The mentioning of token list compaction relates to getting rid of holes in
memory. Each token takes 8 bytes: 4 for the token identifier, internally
called a cmd and chr, and 4 for a value, like an integer or dimension value, a
glue pointer, a pointer to a next token, etc. In our case compaction doesn't
save that much: the 1529 entries reclaimed here amount to a mere 12232 bytes.
The mentioning of potentially aliased \LUA\ call|/|value entries is more a
warning: because the \LUA\ engine starts fresh each run, you cannot store its
\quote {pointers}, and because hashes are randomized this means that you need
to delay initialization to startup time, definitely for function tokens.

Strings in \TEX\ can be pretty long, but in practice they aren't. In \CONTEXT\
the maximum string length is 69. This makes it possible to use one byte for
registering the string length instead of four, which saves quite a bit. Of
course one large string will spoil this game. The fingerprint, engine,
preamble and later housekeeping bytes can be neglected, but the string pool
cannot: these are the bytes that make up the strings. The bytes are stored in
the format but become dynamically allocated when loaded. The \LUATEX\ engine
and its successor don't really have a pool.

Now comes a confusing number. There are not tens of thousands of nodes
allocated. A node is just a pointer into a large array, so node references are
actually just indices. Their size varies from 2 slots to 25; the largest are
par nodes, while shape nodes are allocated dynamically. So what gets reported
is the number of bytes that nodes take. Each node slot takes 8 bytes, so a
glyph node of 12 slots takes 96 bytes, while a glue spec node (think skip
registers) takes 5 slots or 40 bytes. These are amounts of memory that were
not realistic when \TEX\ was written. For the record: in \LUATEX\ glue spec
nodes are not shared, so we have many more.
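In other words, the reported node figure is just slot arithmetic. A few
concrete cases, using the slot counts mentioned above:

\starttyping
 2 slots * 8 bytes =  16 bytes   (the smallest nodes)
 5 slots * 8 bytes =  40 bytes   (a glue spec node)
12 slots * 8 bytes =  96 bytes   (a glyph node)
25 slots * 8 bytes = 200 bytes   (a par node, the largest)
\stoptyping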
The majority of \TEX\ related dump data is for tokens, and here we need
3905660 bytes, which means some 488K tokens (each reported value also includes
some overhead). The memory used for the table of equivalents makes for some
88K of them. This table relates to macros (their names and content). Keep in
mind that (math) character references are also macros.

The next sections that get loaded are math and text codes. These are the
compartmentalized character properties mentioned before. The number of math
codes is not that large (because we delay much of math), but the text codes
are plenty; think of lc, uc, sf, hj, catcodes, etc. Compared to \LUATEX\ we
have more categories but use less space, because we have a more granular
storage model. Optimizing that bit really paid off, also because we have more
vectors.

The way primitives and macro names get resolved is pretty much the same in all
engines, but by using the fact that we operate in 32 bit I could actually get
rid of some parallel tables that handle save and restore. Some optimizations
relate to the fact that the register ranges are part of the game, so basically
we have some holes in there when they are not used. I guess this is why \ETEX\
uses a sparse model for the registers above 255. What also saved a lot is that
we don't need to store font names, because these are available in another way;
even in \LUATEX\ that takes a large, basically useless, chunk. The memory that
a macro without parameters consumes is 8 bytes smaller, and in \CONTEXT\ we
have lots of these.

We don't really store fonts, so that section is small, but we do store the
math parameters, and there is not much we can save there. We also have more
such parameters in \LUAMETATEX, so there we might actually use more storage.
The information related to languages is also minimal, because patterns and
exceptions are loaded at runtime. A new category (compared to \LUATEX) is
inserts, because in \LUAMETATEX\ we can use an alternative (not register
based) variant. As you can see from the 180 bytes used, \CONTEXT\ is indeed
using that variant.

That leaves a large block of more than 10 million bytes that relates to \LUA\
byte code. A large part of that is the huge \LUA\ character table that
\CONTEXT\ uses. The implementation of font handling also takes quite a bit,
and we're not even talking of all the auxiliary \LUA\ modules, \XML\
processing, etc. If \CONTEXT\ were to load that on demand, which is nearly
always possible, the format file would be much smaller, but one would pay for
it later. Loading the (some 600) \LUA\ byte code chunks of course takes some
time, as does initialization, but not much.

All that said, the reason why we have a large format file can be understood
well if one considers what goes in there. The \CONTEXT\ format files for
\PDFTEX\ and \XETEX\ are 3.3 and 4.7 MB respectively, which is smaller but not
that much when you consider the fact that there is no \LUA\ code stored, that
there are fewer character tables, and that an \ETEX\ register model is used.

But a format file is not the whole story. Runtime memory usage also comes at a
price. The current memory settings of \CONTEXT\ are as follows; these values
get reported when a format has been generated and can be queried at runtime at
any moment:

\starttabulate[|l|r|r|r|r|]
\BC           \BC       max \BC      min \BC      set \BC     stp \BC \NR
\HL
\BC string    \NC   2097152 \NC   150000 \NC   500000 \NC  100000 \NC \NR
\BC pool      \NC 100000000 \NC 10000000 \NC 20000000 \NC 1000000 \NC \NR
\BC hash      \NC   2097152 \NC   150000 \NC   250000 \NC  100000 \NC \NR
\BC lookup    \NC   2097152 \NC   150000 \NC   250000 \NC  100000 \NC \NR
\BC node      \NC  50000000 \NC  1000000 \NC  5000000 \NC  500000 \NC \NR
\BC token     \NC  10000000 \NC  1000000 \NC 10000000 \NC  250000 \NC \NR
\BC buffer    \NC 100000000 \NC  1000000 \NC 10000000 \NC 1000000 \NC \NR
\BC input     \NC    100000 \NC    10000 \NC   100000 \NC   10000 \NC \NR
\BC file      \NC      2000 \NC      500 \NC     2000 \NC     250 \NC \NR
\BC nest      \NC     10000 \NC     1000 \NC    10000 \NC    1000 \NC \NR
\BC parameter \NC    100000 \NC    20000 \NC   100000 \NC   10000 \NC \NR
\BC save      \NC    500000 \NC   100000 \NC   500000 \NC   10000 \NC \NR
\BC font      \NC    100000 \NC      250 \NC      250 \NC     250 \NC \NR
\BC language  \NC     10000 \NC      250 \NC      250 \NC     250 \NC \NR
\BC mark      \NC     10000 \NC       50 \NC       50 \NC      50 \NC \NR
\BC insert    \NC       500 \NC       10 \NC       10 \NC      10 \NC \NR
\stoptabulate

The maxima are what can be used at most. Apart from the magic number 2097152,
all these maxima can be bumped at compile time, but if you need more, you
might wonder if your approach to rendering makes sense. The minima are what
always gets allocated, and again these are hard coded defaults. The size can
be configured and is normally the same as the minima, but we use larger values
in \CONTEXT. The step is how much an initial memory blob will grow when more
is needed than is currently available. The last four entries show that we
don't start out with many fonts (especially when we use the \CONTEXT\ compact
font model not that many are needed) and that, because \CONTEXT\ implements
marks in a different way, we actually don't need them. We do use the new
insert properties storage model, and for now the set sizes are enough for what
we need.
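As mentioned, these values can be queried at runtime. A minimal sketch,
assuming the \type {status} library interface as present in \LUATEX\ (the
exact field names differ somewhat between engines):

\starttyping
\startluacode
-- print what the engine currently reports about its memory state;
-- status.list() returns a table of counters and (configured) sizes
for name, value in table.sortedhash(status.list()) do
    texio.write_nl(name .. " : " .. tostring(value))
end
\stopluacode
\stoptyping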
In practice a \LUAMETATEX\ run uses less memory than a \LUATEX\ one, not only
because memory allocation is more dynamic, but also because of other
optimizations. When the compact font model is used (something \CONTEXT\ does)
even less memory is needed. Even this claim should be made with care: whenever
I discuss the use of resources, one needs to limit the conclusions to
\CONTEXT. I can't speak for other macro packages, simply because I don't know
the internals and the design decisions made and their impact on the
statistics. As a teaser I show the impact of some definitions:

\starttyping
\chardef     \MyFooA 1234
\Umathchardef\MyFooB "1 "0 "1234
\Umathcode   1 2 3 4
\def         \MyFooC {ABC}
\def         \MyFooD #1{A#1C}
\def         \MyFooE {\directlua{print("some lua")}}
\stoptyping

The stringpool grows because we store the names (here they are of equal
length). Only symbolic definitions bump the hashtable and equivalents. And
with definitions that have text inside, the number of bytes taken by tokens
grows fast, because every character in that linked list takes 8 bytes: 4 for
the character with its catcode state and 4 for the link to the next token. For
instance, the three characters in the body of \type {\MyFooC} account for
exactly 24 extra token bytes (3 tokens of 8 bytes) in the table below.

\starttabulate[|l|r|r|r|r|r|]
\BC                       \BC stringpool \BC  tokens \BC equivalents \BC hashtable \BC    total \NC \NR
\HL
\NC                       \NC     836408 \NC 3906124 \NC      705316 \NC    497396 \NC 16828987 \NC \NR
\NC \type {\chardef}      \NC     836415 \NC 3906116 \NC      705324 \NC    497408 \NC 16829006 \NC \NR
\NC \type {\Umathchardef} \NC     836422 \NC 3906116 \NC      705324 \NC    497420 \NC 16829025 \NC \NR
\NC \type {\Umathcode}    \NC     836422 \NC 3906124 \NC      705324 \NC    497420 \NC 16829033 \NC \NR
\NC \type {\def} (no arg) \NC     836429 \NC 3906148 \NC      705332 \NC    497428 \NC 16829080 \NC \NR
\NC \type {\def} (arg)    \NC     836436 \NC 3906196 \NC      705340 \NC    497440 \NC 16829155 \NC \NR
\NC \type {\def} (text)   \NC     836443 \NC 3906372 \NC      705348 \NC    497452 \NC 16829358 \NC \NR
\stoptabulate

So, every time a user wants some feature (some extra checking, a warning,
color or font support for some element) that results in a trivial extension to
the core, it can bump the size of the format file more than you think. Of
course, when it leads to some overhaul, sharing code can actually make the
format shrink too. I hope it is clear now that there really is not much to
deduce from the bare numbers. Just try to imagine what:

\starttyping
\definefilesynonym
  [type-imp-newcomputermodern-book.mkiv]
  [type-imp-newcomputermodern.mkiv]
\stoptyping

adds to the format. Convenience has a price.

\stopchapter

\stopcomponent

% Some bonus content:

When processing a thousand paragraphs of \type {tufte.tex}, staying below 4
seconds all|-|in (just over 60 pages per second) looks okay. But it doesn't
say that much. Outputting 1000 pages in 2 seconds tells a bit about the
overhead per page, but again, in practice things work out differently. So what
do we need to consider?

\startitemize

\startitem
    Check what macros and resources are preloaded and what always gets loaded
    at runtime.
\stopitem

\startitem
    After a first run it's likely that the operating system has resources in
    its cache, so start measuring after a few runs.
\stopitem

\startitem
    Best run a test many times and take the average runtime.
\stopitem

\startitem
    Simple macro performance tests can be faster than real usage because the
    related bytes are in \CPU\ cache memory. So one can only use such tests to
    check a specific improvement (or a hit due to added functionality).
\stopitem

\startitem
    The size of the used \TEX\ tree can matter. The file databases need to be
    loaded and consulted.
\stopitem

\startitem
    The binary matters: is it optimized, does it load libraries, is it 64 bit
    or not?
\stopitem

\startitem
    Local and|/|or global font definitions can hit performance, and when a
    style does many redundant switches it might suffer too. Of course that is
    only the case when font switching is adaptive.
\stopitem

\startitem
    The granularity of subsystems impacts performance: advanced color support,
    inheritance used in mechanisms, abstraction combined with extensive
    support for features; it all matters.
\stopitem

\startitem
    The more features one enables, the more they will impact performance, as
    does preprocessing the input (normalizing, bidi checking, etc).
\stopitem

\startitem
    It matters how the page (and layout) dimensions are defined. Although
    language doesn't really play a role (apart from possible hyphenation),
    specific scripts might.
\stopitem

\stopitemize

These are just a few points, but it might be clear that I don't take
comparisons too seriously, simply because it's real runs that matter. As long
as we're in the runtime comfort zone we're okay. You can run tests within the
domain of a macro package, but comparing macro packages doesn't make that much
sense. It can even backfire, especially when claims were made about what
should or should not be in a kernel (while later violating that), or when
relying on old stories (or rumors) about a variant macro package being slow.
(The same is true for comparing one's favorite operating system with others.)
Yes, the \CONTEXT\ format file is huge and performance is less than that of,
for instance, plain \TEX. If that is a problem and not a virtue, then make
sure your own alternative will never end up like that. And just don't come to
conclusions about a system that you don't really know.