% language=us

\startcomponent hybrid-optimize

\environment hybrid-environment

\startchapter[title={Optimizations again}]

\startsection [title={Introduction}]

Occasionally we do some timing on new functionality in either
\LUATEX\ or \MKIV, so here's another wrapup.

\stopsection

\startsection [title={Font loading}]

In \CONTEXT\ we cache font data in a certain way. Loading a font from the cache
takes hardly any time. However, preparation takes more time as well as more
memory, as we need to go from the fontforge ordering to one we can use. In \MKIV\
we have several font tables:

\startitemize[packed]
\startitem
    The original fontforge table: this one is only loaded once and converted to
    another representation that is cached.
\stopitem
\startitem
    The cached font representation that is the basis for further manipulations.
\stopitem
\startitem
    In base mode this table is converted to an (optionally cached) scaled \TFM\
    table that is passed to \TEX.
\stopitem
\startitem
    In node mode a limited scaled version is passed to \TEX. As with base mode,
    this table is kept in memory so that we can access the data.
\stopitem
\startitem
    When processing features in node mode additional (shared) subtables are
    created that extend the memorized cached table.
\stopitem
\stopitemize

This model is already quite old and dates from the beginning of \MKIV. Future
versions might use different derived tables but for the moment we need all this
data if only because it helps us with the development.

The regular method to construct a font suitable for \TEX, whether base mode or
node mode is used in \MKIV, is to load the font as a table using \type
{to_table}, a \type {fontloader} method. This means that all information is
available (and can be manipulated). In \MKIV\ this table is converted to another
one and in the process new entries are added and existing ones are freed. Quite
some garbage collection and table resizing takes place in the process. In the
cached instance we share identical tables so there we can gain a lot of memory
and avoid garbage collection.

The difference in usage is as follows:

\starttyping
do
  local f = fontloader.open("somefont.otf") -- allocates font object
  local t = fontloader.to_table(f)          -- allocates table
  fontloader.close(f)                       -- frees font object
  for index, glyph in pairs(t.glyphs) do
    local width = glyph.width               -- accesses table value
  end
end                                         -- frees table
\stoptyping

Here \type {t} is a complete \LUA\ table and it can get quite large: script fonts
like Zapfino (for latin) or Husayni (for arabic) have lots of alternate shapes
and a lot of feature related information, fonts meant for \CJK\ usage have tens
of thousands of glyphs, and math fonts like Cambria have many glyphs and math
specific information.

\starttyping
do
  local f = fontloader.open("somefont.otf") -- allocates font object
  for index=0, f.glyphmax-1 do
    local glyph = f.glyphs[index]           -- assigns user data object
    if glyph then
      local width = glyph.width             -- calls virtual table value
    end
  end
  fontloader.close(f)                       -- frees font object
end
\stoptyping

In this case there is no big table, and \type {glyph} is a so-called userdata
object. Its entries are created when asked for. So, where in the first example
the \type {width} of a glyph is a number, in the second case it is a function
disguised as a virtual key that will return a number. In the first case you can
change the width, in the second case you can't.
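
The difference between a real entry and a virtual key can be mimicked in plain
\LUA\ with a metatable. The following is only a sketch: the real font object is
userdata implemented in C inside \LUATEX, and the names \type {virtualglyph} and
\type {raw} are made up for this illustration.

\starttyping
-- Illustration only: a read-only proxy whose fields are looked up on demand.
-- The real glyph object is C userdata; 'virtualglyph' and 'raw' are
-- hypothetical names.
local function virtualglyph(raw)
  return setmetatable({ }, {
    __index = function(_, key)
      return raw[key]                        -- value is fetched when asked for
    end,
    __newindex = function(_, key)
      error("glyph field '" .. tostring(key) .. "' is read-only")
    end,
  })
end

local plain   = { width = 500 }              -- first case: a real table
local virtual = virtualglyph { width = 500 } -- second case: virtual keys

plain.width = 600                            -- allowed
local ok = pcall(function() virtual.width = 600 end) -- refused

print(plain.width, virtual.width, ok)
\stoptyping

Each access to \type {virtual.width} involves a function call, which is the kind
of overhead that the timings later in this chapter are about.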

This means that if you want to keep the data around you need to copy it into
another table, but you can do that stepwise and selectively. Alternatively you
can keep the font object in memory. As some glyphs can carry a lot of data you
can imagine that when you only need to access the width, the userdata method is
more efficient. On the other hand, if you need access to all information, the
first method is more interesting as less overhead is involved.
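
A selective copy could look as follows. This is a minimal sketch, not the code
used in \MKIV: \type {collectwidths} is a hypothetical helper and the stand|-|in
table only imitates the \type {glyphmax} and \type {glyphs} fields of a font
object so that the example also runs outside \LUATEX.

\starttyping
-- Copy only what is needed from a (virtual) font object into a real table.
-- Works on anything that offers 'glyphmax' and a 'glyphs' array indexed
-- from zero; 'collectwidths' is a hypothetical helper name.
local function collectwidths(f)
  local widths = { }
  for index = 0, f.glyphmax - 1 do
    local glyph = f.glyphs[index]
    if glyph then
      widths[index] = glyph.width            -- now an ordinary number
    end
  end
  return widths
end

-- Inside LuaTeX one would pass the object returned by fontloader.open;
-- here a small stand-in table is used (an assumption for illustration).
local standin = {
  glyphmax = 3,
  glyphs   = { [0] = { width = 250 }, [2] = { width = 600 } },
}

local widths = collectwidths(standin)
widths[0] = 300                              -- the copy can be manipulated
\stoptyping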

In the userdata variant only the parent table and its glyph subtable are
virtualized, as are entries in an optional subfonts table. So, if you ask for the
kerns table of a glyph you will get a real table as it makes no sense to
virtualize it. A way in between would have been to request tables per glyph but
as we will see there is no real benefit in that while it would further complicate
the code.

When in \LUATEX\ 0.63 the loaded font object became partially virtual it was time
to revise the loading code to see if we could benefit from this.

In the following tables we distinguish three cases: the original but adapted
loading code \footnote {For practical reasons we share as much code as possible
between the methods so some reorganization was needed.}, already a few years old,
the new sparse loading code, using the userdata approach and no longer a raw
table, and a mixed approach where we still use the raw table but instead of
manipulating that one, construct a new one from it. It should be noted that in
the process of integrating the new method the traditional method suffered.

First we tested Oriental \TEX's Husayni font. This one has lots of features, many
lookups, and quite some glyphs. Keep in mind that the times concern the
preparation and not the reload from the cache, which is more or less negligible.
The memory consumption is a snapshot of the current run just after the font has
been loaded. Peak memory is what bothers most users. Later we will explain what
the values in parentheses refer to.

\starttabulate[|l|c|c|c|]
\FL
\NC              \NC \bf used memory \NC \bf peak memory \NC \bf font loading time \NC \NR
\TL
\NC \bf table    \NC 113 MB (102)    \NC 118 MB (117)    \NC 1.8 sec (1.9)         \NC \NR
\NC \bf mixed    \NC 114 MB (103)    \NC 119 MB (117)    \NC 1.9 sec (1.9)         \NC \NR
\NC \bf sparse   \NC 117 MB (104)    \NC 121 MB (120)    \NC 1.9 sec (2.0)         \NC \NR
\NC \bf cached   \NC ~75 MB          \NC ~80 MB          \NC 0.4 sec               \NC \NR
\NC \bf baseline \NC ~67 MB          \NC ~71 MB          \NC 0.3 sec               \NC \NR
\LL
\stoptabulate

So, here the new method is not offering any advantages. As this is a font we use
quite a lot during development, any loading variant will do the job with similar
efficiency.

Next comes Cambria, a font that carries lots of glyphs and has extensive support
for math. In order to provide a complete bodyfont setup some six instances are
loaded. Interesting is that the original module needs 3.9 seconds instead of 6.4.
This is probably due to a different ordering of code, which might influence the
garbage collector: it looks like in the reorganized code the garbage collector
kicks in a few times during the font loading. Already long ago we found out that
this is also somewhat platform dependent.

\starttabulate[|l|c|c|c|]
\FL
\NC              \NC \bf used memory \NC \bf peak memory \NC \bf font loading time \NC \NR
\TL
\NC \bf table    \NC 155 MB (126)    \NC 210 MB (160)    \NC 6.4 sec (6.8)         \NC \NR
\NC \bf mixed    \NC 154 MB (130)    \NC 210 MB (160)    \NC 6.3 sec (6.7)         \NC \NR
\NC \bf sparse   \NC 140 MB (123)    \NC 199 MB (144)    \NC 6.4 sec (6.8)         \NC \NR
\NC \bf cached   \NC ~90 MB          \NC ~94 MB          \NC 0.6 sec               \NC \NR
\NC \bf baseline \NC ~67 MB          \NC ~71 MB          \NC 0.3 sec               \NC \NR
\LL
\stoptabulate

Here the sparse method reports less memory usage. There is no other gain as there
is a lot of access to glyph data due to the fact that this font is rather
advanced. More virtualization would probably work against us here.

Being a \CJK\ font, the somewhat feature|-|dumb but large AdobeSongStd-Light has
lots of glyphs. In previous tables we already saw values in parentheses: these
are values measured with implicit calls to the garbage collector before writing
the font to the cache. For this font much more memory is used. Garbage collection
has a positive impact on memory consumption but drastic consequences for runtime.
Eventually it's the cached timing that matters and that is a constant factor, but
even then it can disturb users if a first run after an update takes so much
time.

\starttabulate[|l|c|c|c|]
\FL
\NC              \NC \bf used memory \NC \bf peak memory \NC \bf font loading time \NC \NR
\TL
\NC \bf table    \NC 180 MB (125)    \NC 185 MB (172)    \NC 4.4 sec (4.5)         \NC \NR
\NC \bf mixed    \NC 190 MB (144)    \NC 194 MB (181)    \NC 4.4 sec (4.7)         \NC \NR
\NC \bf sparse   \NC 153 MB (119)    \NC 232 MB (232)    \NC 8.7 sec (8.9)         \NC \NR
\NC \bf cached   \NC ~96 MB          \NC 100 MB          \NC 0.7 sec               \NC \NR
\NC \bf baseline \NC ~67 MB          \NC ~71 MB          \NC 0.3 sec               \NC \NR
\LL
\stoptabulate

Peak memory is quite high for the sparse method which is due to the fact that we
have only glyphs (but many) so we have lots of access and small tables being
created and collected. I suspect that in a regular run the loading time is much
lower for the sparse case because this is just too much of a difference.

The last test loaded 40 variants of Latin Modern. Each font has a reasonable
number of glyphs (covering the latin script takes some 400--600 glyphs), the
normal amount of kerning, but hardly any features. Reloading these 40 fonts takes
about a second.

\starttabulate[|l|c|c|c|]
\FL
\NC              \NC \bf used memory \NC \bf peak memory \NC \bf font loading time \NC \NR
\TL
\NC \bf table    \NC 204 MB (175)    \NC 213 MB (181)    \NC 13.1 sec (16.4)       \NC \NR
\NC \bf mixed    \NC 195 MB (168)    \NC 205 MB (174)    \NC 13.4 sec (16.5)       \NC \NR
\NC \bf sparse   \NC 198 MB (165)    \NC 202 MB (170)    \NC 13.4 sec (16.6)       \NC \NR
\NC \bf cached   \NC 147 MB          \NC 151 MB          \NC ~1.7 sec              \NC \NR
\NC \bf baseline \NC ~67 MB          \NC ~71 MB          \NC ~0.3 sec              \NC \NR
\LL
\stoptabulate

The old method wins in runtime and this makes it hard to decide which strategy to
follow. Again the numbers in parentheses show what happens when we do an extra
garbage collection sweep after packaging the font instance. A few more sweeps in
other spots will bring down memory a few megabytes but at the cost of quite some
runtime. The original module that uses the table approach is 3~seconds faster
than the current one. As the code is essentially the same but organized
differently, again we suspect the garbage collector to be the culprit.
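
The trade|-|off between an extra sweep and runtime can be explored with a few
lines of plain \LUA. This is a sketch and not the way the numbers in these tables
were obtained: it only looks at memory managed by \LUA, and \type {measure} and
\type {task} are made|-|up names.

\starttyping
-- Measure runtime and Lua memory of a task, optionally forcing a garbage
-- collection sweep before the memory is sampled. 'measure' and 'task' are
-- hypothetical names; collectgarbage("count") reports kilobytes.
local function measure(task, sweep)
  local starttime = os.clock()
  task()
  if sweep then
    collectgarbage("collect")                -- the extra sweep costs runtime
  end
  local runtime = os.clock() - starttime
  local memory  = collectgarbage("count") / 1024
  return runtime, memory                     -- seconds, megabytes
end

-- A stand-in workload that creates many small tables, a bit like
-- converting glyph data.
local function task()
  local t = { }
  for i = 1, 100000 do
    t[i] = { width = i }
  end
end

local r1, m1 = measure(task, false)
local r2, m2 = measure(task, true)
print(string.format("no sweep: %.2f sec %.1f MB", r1, m1))
print(string.format("sweep   : %.2f sec %.1f MB", r2, m2))
\stoptyping

The actual values depend on the machine, the \LUA\ version and the state of the
collector, so no output is shown here.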

So when we came this far, Taco and I did some further tests and on his machine
Taco ran a profiler on some of the tests. He posted the following conclusion to
the \LUATEX\ mailing list:

\startnarrower
It seems that the userdata access is useful if {\em but only if} you are very low
on memory. In other cases, it just adds extra objects to be garbage collected,
which makes the collector slower. That is on top of extra time spent on the
actual calls, and even worse: those extra gc objects tend to be scattered around
in memory, resulting in extra minor page faults (cpu cache misses) and all that
has a noticeable effect on run speed: the metatable based access is 20--30\%
slower than the old massive \type {to_table}.

Therefore, there seems little point in expanding the metadata functionality any
further. What is there will stay, but adding more metadata objects appears to be
a waste of time on all sides.
\stopnarrower

This leaves us with a question: should we replace the old module by the
experimental one? It makes sense to do this as in practice users will not be
harmed much. Fonts are cached and loading a cached font is not influenced. The
new module leaves the choice to the user. He or she can decide to limit memory
usage (for cache building) by using directives:

\starttyping
\enabledirectives[fonts.otf.loader.method=table]
\enabledirectives[fonts.otf.loader.method=mixed]
\enabledirectives[fonts.otf.loader.method=sparse]

\enabledirectives[fonts.otf.loader.cleanup]
\enabledirectives[fonts.otf.loader.cleanup=1]
\enabledirectives[fonts.otf.loader.cleanup=2]
\enabledirectives[fonts.otf.loader.cleanup=3]
\stoptyping

The cleanup has three levels and each level adds a garbage collection sweep (in a
different spot). Of course three sweeps per font that is prepared for caching
have quite some impact on performance. If your computer has enough memory it
makes no sense to use any of these directives. For the record: these directives
are not available in the generic (plain \TEX) variant, at least not in the short
term. As Taco mentions, cache misses can have drastic consequences and we ran
into that years ago already when support for \OPENTYPE\ math was added to
\LUATEX: all of a sudden and for no apparent reason passing a font table to \TEX\
became twice as slow on my machine. This is comparable with the new, reorganized
table loader being slower than the old one. Eventually I'll get back that time,
which is unlikely to happen with the userdata variant where there is no way to
bring down the number of function calls and intermediate table creation.

The previously shown values concern all font handling, including creating,
caching, reloading, creating a scaled instance and passing the data to \TEX. In
that process quite some garbage collection can happen and that obscures the real
values. However, in \MKIV\ we report the conversion time when a font gets cached
so that the user at least sees something happening. These timings are on a per
font basis. Watch the following values:

\starttabulate[|l|l|l|]
\FL
\NC             \NC \bf table                     \NC \bf sparse                    \NC \NR
\TL
\NC \bf song    \NC 3.2                           \NC 3.6                           \NC \NR
\NC \bf cambria \NC 4.9 (0.9 1.0 0.9 1.1 0.5 0.5) \NC 5.6 (1.1 1.1 1.0 1.2 0.6 0.6) \NC \NR
\NC \bf husayni \NC 1.2                           \NC 1.3                           \NC \NR
\LL
\stoptabulate

In the case of Cambria several fonts are loaded including subfonts from
\TRUETYPE\ containers. This shows that the table variant is definitely faster. It
might be that later this is compensated by additional garbage collection, but
that would be even worse for the sparse case if userdata were used more
extensively. These values better reflect what Taco measured in the profiler.
Improvements to the garbage collector are more likely to happen than a drastic
speed up in function calls so the table variant is still a safe bet.

There are a few places where the renewed code can be optimized so these numbers
are not definitive. Also, the loader code was not the only code adapted. As we
cannot manipulate the main table in the userdata variant, the code related to
patches and extra features like \type {tlig}, \type {trep} and \type {anum} had
to be rewritten as well: more code and a bit closer to the final table format.

\starttabulate[|l|c|c|]
\FL
\NC            \NC \bf table         \NC \bf sparse        \NC \NR
\TL
\NC \bf hybrid \NC 310 MB / 10.3 sec \NC 285 MB / 10.5 sec \NC \NR
\NC \bf mk     \NC 884 MB / 47.5 sec \NC 878 MB / 48.7 sec \NC \NR
\LL
\stoptabulate

The timings in the previous table concern runs of a few documents where the \type
{mk} loads quite some large and complex fonts. The runs are timed with an empty
cache so all fonts are preprocessed. The memory consumption is the peak load as
reported by the task manager and we need to keep in mind that \LUA\ allocates
more than it needs. These values are so high because fonts are created; a regular
run takes less memory. Interesting is that for \type {mk} the original
implementation performs better but the difference is about a second, which again
indicates that the garbage collector is a major factor. Timing only the total
runtime gives:

\starttabulate[|l|c|c|c|c|]
\FL
\NC        \NC \bf cached \NC \bf original \NC \bf table \NC \bf sparse \NC \NR
\TL
\NC \bf mk \NC 38.1 sec   \NC 75.5 sec     \NC 77.2 sec  \NC 80.8 sec   \NC \NR
\LL
\stoptabulate

Here we used the system timer while in previous tables we used the values as
reported by the timers built in \MKIV\ (and only reported the font loading
times).

The timings above are taken on my laptop running Windows 7 and this is not that
good a platform for precise timings. Taco's measurements were done with
specialized tools and should be trusted more. It indeed looks like the current
level of userdata support is about the best compromise one can get.

{\em In the process I also experimented with virtualizing the final \TFM\ table,
thereby simulating the upcoming virtualization of that table in \LUATEX.
Interesting is that for instance for \type {mk.pdf} memory consumption went
down by 20\%, but that document is non|-|typical and loads many fonts,
including virtual punk fonts. However, as access to that table happens
infrequently, virtualization makes much sense there, again only at the toplevel
of the characters subtable.}

\stopsection

\startsection [title={Hyperlinks}]

At \PRAGMA\ we have a long tradition of creating highly interactive documents. I
still remember the days that processing a 20.000 page document with numerous
menus and buttons on each page took a while to get finished, especially if each
page had a \METAPOST\ graphic as well.

On a regular computer a document with so many links is no real problem. After
all, the \PDF\ format is designed in such a way that only part of the content has
to be loaded. However, half a million hyperlinks do demand some memory.

Recently I had to make a document that targets one of these tablets and it is
no secret that tablets (and e-readers) don't have that much memory. As in
\CONTEXT\ \MKIV\ we have a bit more control over the backend, it will be no
surprise that we are able to deal with such issues more comfortably than in
\MKII.

That specific document (part of a series) contained 1100 pages and each page has
a navigation menu as well as an alphabetic index into the register. There is a
table of contents referring to about 200 chapters and these are backlinked to the
table of contents. There are also some 200 images and tables that end up
elsewhere and again are crosslinked. Of course there is the usual bunch of inline
hyperlinks. So, in total this document has some 32.000 hyperlinks. The input is a
3.03 MB \XML\ file.

\starttabulate[|l|c|c|]
\FL
\NC                                                 \NC \bf size \NC \bf one run \NC \NR
\TL
\NC \bf don't optimize                              \NC 5.76 MB  \NC 59.4 sec    \NC \NR
\NC \bf prefer page references over named ones      \NC 5.66 MB  \NC 56.2 sec    \NC \NR
\NC \bf aggressively share similar references       \NC 5.19 MB  \NC 60.2 sec    \NC \NR
\NC \bf optimize page as well as similar references \NC 5.11 MB  \NC 56.5 sec    \NC \NR
\NC \bf disable all interactive features            \NC 4.19 MB  \NC 42.7 sec    \NC \NR
\LL
\stoptabulate

So, by aggressively sharing hyperlinks and turning all internal named
destinations into page destinations we bring down the size noticeably and even
have a faster run. It is for this reason that aggressive sharing is enabled by
default. If you don't want it, you can disable it with:

\starttyping
\disabledirectives[references.sharelinks]
\stoptyping
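
The idea behind sharing is simple: identical link specifications are stored once
and reused. The following is a minimal sketch of that idea and not the actual
backend code: \type {sharedlink}, \type {cache} and the specification strings are
invented for the occasion, and a counter stands in for the creation of a \PDF\
object.

\starttyping
-- Sharing identical link specifications: each distinct specification is
-- stored once and later requests get the same object number back. The
-- names 'sharedlink', 'cache' and the specification strings are made up.
local cache, nofobjects = { }, 0

local function sharedlink(specification)
  local n = cache[specification]
  if not n then
    nofobjects = nofobjects + 1              -- stands in for a new PDF object
    n = nofobjects
    cache[specification] = n
  end
  return n
end

local a = sharedlink("page 12")
local b = sharedlink("page 40")
local c = sharedlink("page 12")              -- reuses the first object
\stoptyping

With many menus pointing to the same few pages, most requests end up in the
cache, which is where a reduction in file size would come from.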

Currently we use names for internal (automatically generated) links. We can force
page links for them but still use names for explicit references so that we can
reach them from external documents; this is called mixed mode. When no references
from outside are needed, you can force pagelinks. At some point mixed mode can
become the default.

\starttyping
\enabledirectives[references.linkmethod=page]
\stoptyping

The possible values are \type {page}, \type {mixed} and \type {names}, with \type
{yes} being equivalent to \type {page}. The \MKII\ way of setting this is still
supported:

\starttyping
\setupinteraction[page=yes]
\stoptyping

We could probably gain quite some more bytes by turning all repetitive elements
into shared graphical objects but it only makes sense to spend time on that when
a project really needs it (and pays for it). There is up to one megabyte of
(compressed) data related to menus and other screen real estate that qualifies
for this but it might not be worth the trouble.

The reason for trying to minimize the amount of hyperlink related metadata
(annotations in \PDF\ terminology) is that on tablets with not that much memory
(and no virtual memory) we don't want to keep too much of that (redundant) data
in memory. And indeed, the optimized document feels more responsive than the
dirty version, but that could as well be related to the viewing applications.

\stopsection

\startsection[title=Constants]

Not every optimization saves memory or runtime. Some are more a matter of
adapting to changed circumstances. When \TEX\ had only 256 registers one had to
find ways to get round this. For instance counters are quite handy and you could
quickly run out of them. In \CONTEXT\ there are two ways to deal with this.
Instead of a real count register you can use a macro:

\starttyping
\newcounter \somecounter
\increment  \somecounter
\decrement (\somecounter,4)
\stoptyping

In \MKIV\ many such pseudo counters have been replaced by real ones which is
somewhat faster in usage.

Often one needs a constant and a convenient way to define such a frozen counter
is:

\starttyping
\chardef \myconstant 10
\ifnum \myvariable = \myconstant ....
\ifcase \myconstant ...
\stoptyping

This is both efficient and fast and works out well because \TEX\ treats them as
numbers in comparisons. However, it is somewhat clumsy, as constants have nothing
to do with characters. This is why all such definitions have been replaced by:

\starttyping
\newconstant \myconstant 10
\setconstant \myconstant 12
\ifnum \myvariable = \myconstant ....
\ifcase \myconstant ...
\stoptyping

We use count registers which means that when you set a constant, you can just
assign the new value directly or use the \type {\setconstant} macro.

We already had an alternative for conditionals:

\starttyping
\newconditional \mycondition
\settrue \mycondition
\setfalse \mycondition
\ifconditional \mycondition
\stoptyping

These will also be adapted to counts but first we need a new primitive.

The advantage of these changes is that at the \LUA\ end we can consult as well as
change these values. This means that in the end much more code will be adapted.
Especially changing the constants resulted in quite some cosmetic changes in the
core code.
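
As an illustration of consulting and changing such a value at the \LUA\ end, here
is a small sketch. In \LUATEX\ the \type {tex.count} table accepts the name of a
register defined with \type {\countdef}; that \type {\myconstant} is reachable
this way is an assumption based on it being a count register. Outside \LUATEX\ a
stand|-|in table takes over so that the fragment still runs.

\starttyping
-- Consulting and changing a count based constant from Lua. That the
-- constant is reachable by name is an assumption; the stand-in table is
-- only used when this runs in a plain Lua interpreter.
local tex = tex or { count = { myconstant = 10 } }

local before = tex.count.myconstant          -- consult the value
tex.count.myconstant = before + 2            -- change it
local after  = tex.count.myconstant
\stoptyping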

\stopsection

\startsection[title=Definitions]

Another recent optimization became possible when setting lccodes cum suis and
some math definitions could be done at the \LUA\ end. All these initializations
already took place at the \LUA\ end, but till then we were just writing \TEX\
code back to \TEX; now we stay at the \LUA\ end. This not only looks nicer, but
also results in slightly less memory usage during format generation (a few
percent). Making a format also takes a few tenths of a second less (again a few
percent). The reason why less memory is needed is that instead of writing tens of
thousands of \type {\lccode} related commands to \TEX\ we now set the values
directly. As writes to \TEX\ are collected, quite a number of tokens got cached.
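
The two approaches can be sketched as follows. This is not the \MKIV\ code: \type
{viatex} and \type {direct} are hypothetical names, and the stand|-|ins for \type
{tex.sprint} and \type {tex.lccode} are only there so that the fragment also runs
in a plain \LUA\ interpreter.

\starttyping
-- Two ways to initialize lowercase codes from Lua. 'viatex' and 'direct'
-- are hypothetical names; tex.sprint and tex.lccode are LuaTeX interfaces,
-- replaced by stand-ins when this runs in a plain Lua interpreter.
local written = { }
local tex = tex or {
  sprint = function(s) written[#written+1] = s end,
  lccode = { },
}

-- old approach: write TeX code back to TeX, which has to tokenize it
local function viatex(char, lower)
  tex.sprint(string.format("\\lccode %d=%d ", char, lower))
end

-- new approach: set the value directly
local function direct(char, lower)
  tex.lccode[char] = lower
end

viatex(0x41, 0x61)                           -- A -> a
direct(0x42, 0x62)                           -- B -> b
\stoptyping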

All such small improvements make \CONTEXT\ \MKIV\ run smoother with each
advance of \LUATEX. We do have a wishlist for further improvements but so far we
managed to improve stepwise instead of putting too much pressure on \LUATEX\
development.

\stopsection

\stopchapter

\stopcomponent