% language=us

\startcomponent hybrid-optimize

\environment hybrid-environment

\startchapter[title={Optimizations again}]

\startsection [title={Introduction}]

Occasionally we do some timing on new functionality in either
\LUATEX\ or \MKIV, so here's another wrapup.

\stopsection

\startsection [title={Font loading}]

In \CONTEXT\ we cache font data in a certain way. Loading a font from the cache
takes hardly any time. However, preparation takes more time as well as memory as
we need to go from the fontforge ordering to one we can use. In \MKIV\ we have
several font tables:

\startitemize[packed]
\startitem
    The original fontforge table: this one is only loaded once and converted to
    another representation that is cached.
\stopitem
\startitem
    The cached font representation that is the basis for further manipulations.
\stopitem
\startitem
    In base mode this table is converted to an (optionally cached) scaled \TFM\
    table that is passed to \TEX.
\stopitem
\startitem
    In node mode a limited scaled version is passed to \TEX. As with base mode,
    this table is kept in memory so that we can access the data.
\stopitem
\startitem
    When processing features in node mode additional (shared) subtables are
    created that extend the memorized cached table.
\stopitem
\stopitemize

This model is already quite old and dates from the beginning of \MKIV. Future
versions might use different derived tables but for the moment we need all this
data if only because it helps us with the development.

The regular method to construct a font suitable for \TEX, whether using base
mode or node mode in \MKIV, is to load the font as a table using \type
{to_table}, a \type {fontloader} method. This means that all information is
available (and can be manipulated). In \MKIV\ this table is converted to another
one and in the process new entries are added and existing ones are freed. Quite
some garbage collection and table resizing takes place in the process. In the
cached instance we share identical tables so there we can gain a lot of memory
and avoid garbage collection.

The difference in usage is as follows:

\starttyping
do
  local f = fontloader.open("somefont.otf") -- allocates font object
  local t = fontloader.to_table(f)          -- allocates table
  fontloader.close(f)                       -- frees font object
  for index, glyph in pairs(t.glyphs) do
    local width = glyph.width               -- accesses table value
  end
end                                         -- frees table
\stoptyping

Here \type {t} is a complete \LUA\ table and it can get quite large: script fonts
like Zapfino (for latin) or Husayni (for arabic) have lots of alternate shapes
and much feature related information, fonts meant for \CJK\ usage have tens of
thousands of glyphs, and math fonts like Cambria have many glyphs and math
specific information.

\starttyping
do
  local f = fontloader.open("somefont.otf") -- allocates font object
  for index=0, f.glyphmax-1 do
    local glyph = f.glyphs[index]           -- assigns user data object
    if glyph then
      local width = glyph.width             -- calls virtual table value
    end
  end
  fontloader.close(f)                       -- frees font object
end
\stoptyping

In this case there is no big table, and \type {glyph} is a so|-|called userdata
object. Its entries are created when asked for. So, where in the first example
the \type {width} of a glyph is a number, in the second case it is a function
disguised as a virtual key that will return a number. In the first case you can
change the width, in the second case you can't.
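
The difference between a real field and a virtual key can be imitated in plain
\LUA. The following is only a sketch: it uses a regular table with a metatable
instead of real userdata, and the names \type {newglyph} and \type {raw} are made
up for this illustration. It is not how the \type {fontloader} library is
implemented.

\starttyping
-- hypothetical stand-in for a userdata glyph: nothing is stored in the
-- object itself, every key is resolved by a function call
local function newglyph(raw)
  return setmetatable({ }, {
    __index = function(_, key)
      return raw[key]                    -- value is fetched when asked for
    end,
    __newindex = function(_, key)
      error("glyph field '" .. key .. "' is read only")
    end,
  })
end

local plain   = { width = 500 }          -- like an entry in the to_table result
local virtual = newglyph { width = 500 } -- like an entry in f.glyphs

plain.width = 600                        -- a real table can be changed
print(plain.width, virtual.width)        -- prints 600 and 500

local ok = pcall(function() virtual.width = 600 end)
print(ok)                                -- false: the assignment is refused
\stoptyping

Each access to \type {virtual.width} is a function call, which is the overhead
that shows up when many fields of many glyphs are consulted.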

This means that if you want to keep the data around you need to copy it into
another table but you can do that stepwise and selectively. Alternatively you can
keep the font object in memory. As some glyphs can have much data you can imagine
that when you only need to access the width, the userdata method is more
efficient. On the other hand, if you need access to all information, the first
method is more interesting as less overhead is involved.
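
Copying stepwise and selectively could look as follows. This is a minimal sketch
that only uses the fields shown above (\type {glyphmax}, \type {glyphs} and \type
{width}). It has to run inside \LUATEX, the file name is a placeholder, and the
function name \type {collectwidths} is made up for the occasion.

\starttyping
-- runs inside LuaTeX only; "somefont.otf" is a placeholder
local function collectwidths(filename)
  local f = fontloader.open(filename)
  if not f then
    return nil                            -- font could not be opened
  end
  local widths = { }
  for index = 0, f.glyphmax - 1 do
    local glyph = f.glyphs[index]         -- userdata, created on demand
    if glyph then
      widths[index] = glyph.width         -- copy only what we need
    end
  end
  fontloader.close(f)                     -- the copied numbers survive
  return widths
end

local widths = collectwidths("somefont.otf")
\stoptyping

The resulting table is an ordinary \LUA\ table, so it can be changed and kept
around after the font object has been closed.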

In the userdata variant only the parent table and its glyph subtable are
virtualized, as are entries in an optional subfonts table. So, if you ask for the
kerns table of a glyph you will get a real table as it makes no sense to
virtualize it. A way in between would have been to request tables per glyph but
as we will see there is no real benefit in that while it would further complicate
the code.

When in \LUATEX\ 0.63 the loaded font object became partially virtual it was time
to revisit the loading code to see if we could benefit from this.

In the following tables we distinguish three cases: the original but adapted
loading code \footnote {For practical reasons we share as much code as possible
between the methods so some reorganization was needed.}, already a few years old,
the new sparse loading code, using the userdata approach and no longer a raw
table, and a mixed approach where we still use the raw table but instead of
manipulating that one, construct a new one from it. It must be noted that in
the process of integrating the new method the traditional method suffered.

First we tested Oriental \TEX's Husayni font. This one has lots of features, many
lookups, and quite some glyphs. Keep in mind that the times concern the
preparation and not the reload from the cache, which is more or less negligible.
The memory consumption is a snapshot of the current run just after the font has
been loaded. Peak memory is what bothers most users. Later we will explain what
the values between parentheses refer to.

\starttabulate[|l|c|c|c|]
\FL
\NC              \NC \bf used memory \NC \bf peak memory \NC \bf font loading time \NC \NR
\TL
\NC \bf table    \NC 113 MB (102)    \NC 118 MB (117)    \NC 1.8 sec (1.9)         \NC \NR
\NC \bf mixed    \NC 114 MB (103)    \NC 119 MB (117)    \NC 1.9 sec (1.9)         \NC \NR
\NC \bf sparse   \NC 117 MB (104)    \NC 121 MB (120)    \NC 1.9 sec (2.0)         \NC \NR
\NC \bf cached   \NC ~75 MB          \NC ~80 MB          \NC 0.4 sec               \NC \NR
\NC \bf baseline \NC ~67 MB          \NC ~71 MB          \NC 0.3 sec               \NC \NR
\LL
\stoptabulate

So, here the new method is not offering any advantages. As this is a font we use
quite a lot during development, any loading variant will do the job with similar
efficiency.

Next comes Cambria, a font that carries lots of glyphs and has extensive support
for math. In order to provide a complete bodyfont setup some six instances are
loaded. Interesting is that the original module needs 3.9 seconds instead of 6.4
which is probably due to a different ordering of code which might influence the
garbage collector and it looks like in the reorganized code the garbage collector
kicks in a few times during the font loading. Already long ago we found out that
this is also somewhat platform dependent.

\starttabulate[|l|c|c|c|]
\FL
\NC              \NC \bf used memory \NC \bf peak memory \NC \bf font loading time \NC \NR
\TL
\NC \bf table    \NC 155 MB (126)    \NC 210 MB (160)    \NC 6.4 sec (6.8)         \NC \NR
\NC \bf mixed    \NC 154 MB (130)    \NC 210 MB (160)    \NC 6.3 sec (6.7)         \NC \NR
\NC \bf sparse   \NC 140 MB (123)    \NC 199 MB (144)    \NC 6.4 sec (6.8)         \NC \NR
\NC \bf cached   \NC ~90 MB          \NC ~94 MB          \NC 0.6 sec               \NC \NR
\NC \bf baseline \NC ~67 MB          \NC ~71 MB          \NC 0.3 sec               \NC \NR
\LL
\stoptabulate

Here the sparse method reports less memory usage. There is no other gain as there
is a lot of access to glyph data due to the fact that this font is rather
advanced. More virtualization would probably work against us here.

Being a \CJK\ font, the somewhat feature|-|dumb but large AdobeSongStd-Light has
lots of glyphs. In previous tables we already saw values between parentheses:
these are values measured with explicit calls to the garbage collector before
writing the font to the cache. For this font much more memory is used. Garbage
collection has a positive impact on memory consumption but drastic consequences
for runtime. Eventually it's the cached timing that matters and that is a
constant factor but even then it can disturb users if a first run after an update
takes so much time.

\starttabulate[|l|c|c|c|]
\FL
\NC              \NC \bf used memory \NC \bf peak memory \NC \bf font loading time \NC \NR
\TL
\NC \bf table    \NC 180 MB (125)    \NC 185 MB (172)    \NC 4.4 sec (4.5)         \NC \NR
\NC \bf mixed    \NC 190 MB (144)    \NC 194 MB (181)    \NC 4.4 sec (4.7)         \NC \NR
\NC \bf sparse   \NC 153 MB (119)    \NC 232 MB (232)    \NC 8.7 sec (8.9)         \NC \NR
\NC \bf cached   \NC ~96 MB          \NC 100 MB          \NC 0.7 sec               \NC \NR
\NC \bf baseline \NC ~67 MB          \NC ~71 MB          \NC 0.3 sec               \NC \NR
\LL
\stoptabulate

Peak memory is quite high for the sparse method which is due to the fact that we
have only glyphs (but many) so we have lots of access and small tables being
created and collected. I suspect that in a regular run the loading time is much
lower for the sparse case because this is just too much of a difference.

The last test loaded 40 variants of Latin Modern. Each font has a reasonable
number of glyphs (covering the latin script takes some 400--600 glyphs), the
normal amount of kerning, but hardly any features. Reloading these 40 fonts takes
about a second.

\starttabulate[|l|c|c|c|]
\FL
\NC              \NC \bf used memory \NC \bf peak memory \NC \bf font loading time \NC \NR
\TL
\NC \bf table    \NC 204 MB (175)    \NC 213 MB (181)    \NC 13.1 sec (16.4)       \NC \NR
\NC \bf mixed    \NC 195 MB (168)    \NC 205 MB (174)    \NC 13.4 sec (16.5)       \NC \NR
\NC \bf sparse   \NC 198 MB (165)    \NC 202 MB (170)    \NC 13.4 sec (16.6)       \NC \NR
\NC \bf cached   \NC 147 MB          \NC 151 MB          \NC ~1.7 sec              \NC \NR
\NC \bf baseline \NC ~67 MB          \NC ~71 MB          \NC ~0.3 sec              \NC \NR
\LL
\stoptabulate

The old method wins in runtime and this makes it hard to decide which strategy to
follow. Again the numbers between parentheses show what happens when we do an
extra garbage collection sweep after packaging the font instance. A few more
sweeps in other spots will bring down memory a few megabytes but at the cost of
quite some runtime. The original module that uses the table approach is 3~seconds
faster than the current one. As the code is essentially the same but organized
differently again we suspect the garbage collector to be the culprit.

So when we came this far, Taco and I did some further tests and on his machine
Taco ran a profiler on some of the tests. He posted the following conclusion to
the \LUATEX\ mailing list:

\startnarrower
It seems that the userdata access is useful if {\em but only if} you are very low
on memory. In other cases, it just adds extra objects to be garbage collected,
which makes the collector slower. That is on top of extra time spent on the
actual calls, and even worse: those extra gc objects tend to be scattered around
in memory, resulting in extra minor page faults (cpu cache misses) and all that
has a noticeable effect on run speed: the metatable based access is 20--30\%
slower than the old massive \type {to_table}.

Therefore, there seems little point in expanding the metadata functionality any
further. What is there will stay, but adding more metadata objects appears to be
a waste of time on all sides.
\stopnarrower

This leaves us with a question: should we replace the old module by the
experimental one? It makes sense to do this as in practice users will not be
harmed much. Fonts are cached and loading a cached font is not influenced. The
new module leaves the choice to the user. He or she can decide to limit memory
usage (for cache building) by using directives:

\starttyping
\enabledirectives[fonts.otf.loader.method=table]
\enabledirectives[fonts.otf.loader.method=mixed]
\enabledirectives[fonts.otf.loader.method=sparse]

\enabledirectives[fonts.otf.loader.cleanup]
\enabledirectives[fonts.otf.loader.cleanup=1]
\enabledirectives[fonts.otf.loader.cleanup=2]
\enabledirectives[fonts.otf.loader.cleanup=3]
\stoptyping

The cleanup has three levels and each level adds a garbage collection sweep (in a
different spot). Of course three sweeps per font that is prepared for caching has
quite some impact on performance. If your computer has enough memory it makes no
sense to use any of these directives. For the record: these directives are not
available in the generic (plain \TEX) variant, at least not in the short term. As
Taco mentions, cache misses can have drastic consequences and we've run into that
years ago already when support for \OPENTYPE\ math was added to \LUATEX: all of a
sudden and for no apparent reason passing a font table to \TEX\ became twice as
slow on my machine. This is comparable with the new, reorganized table loader
being slower than the old one. Eventually I'll get back that time, which is
unlikely to happen with the userdata variant where there is no way to bring down
the number of function calls and intermediate table creation.
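
The trade|-|off between an extra sweep and runtime can be observed in plain \LUA.
The next sketch is not the loader code: the helper \type {measure} and the
stand|-|in task \type {prepare} are hypothetical, and the numbers that it reports
depend on the machine and the \LUA\ version.

\starttyping
-- hypothetical helper: run a task, optionally followed by a full
-- garbage collection, and report elapsed time and memory in use
local function measure(task, sweep)
  local starttime = os.clock()
  local result = task()
  if sweep then
    collectgarbage("collect")            -- the extra sweep costs runtime
  end
  local seconds = os.clock() - starttime
  local kilobytes = collectgarbage("count")
  return result, seconds, kilobytes
end

-- a stand-in for font preparation: many small short-lived tables
local function prepare()
  local total = 0
  for i = 1, 100000 do
    local t = { i, i + 1 }
    total = total + #t
  end
  return total
end

local _, t1, m1 = measure(prepare, false)
local _, t2, m2 = measure(prepare, true)
print(string.format("no sweep: %.3f sec, %.0f KB", t1, m1))
print(string.format("sweep   : %.3f sec, %.0f KB", t2, m2))
\stoptyping

With the sweep the reported memory is normally lower, while the time spent in the
collector is added to the runtime of the task.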

The previously shown values concern everything that happens with fonts: creating,
caching, reloading, creating a scaled instance and passing the data to \TEX. In
that process quite some garbage collection can happen and that obscures the real
values. However, in \MKIV\ we report the conversion time when a font gets cached
so that the user at least sees something happening. These timings are on a per
font basis. Watch the following values:

\starttabulate[|l|l|l|]
\FL
\NC             \NC \bf table                     \NC \bf sparse                    \NC \NR
\TL
\NC \bf song    \NC 3.2                           \NC 3.6                           \NC \NR
\NC \bf cambria \NC 4.9 (0.9 1.0 0.9 1.1 0.5 0.5) \NC 5.6 (1.1 1.1 1.0 1.2 0.6 0.6) \NC \NR
\NC \bf husayni \NC 1.2                           \NC 1.3                           \NC \NR
\LL
\stoptabulate

In the case of Cambria several fonts are loaded including subfonts from
\TRUETYPE\ containers. This shows that the table variant is definitely faster. It
might be that later this is compensated by additional garbage collection but that
would even worsen the sparse case were more extensive userdata be used. These
values better reflect what Taco measured in the profiler. Improvements to the
garbage collector are more likely to happen than a drastic speed up in function
calls so the table variant is still a safe bet.

There are a few places where the renewed code can be optimized so these numbers
are not definitive. Also, the loader code was not the only code adapted. As we
cannot manipulate the main table in the userdata variant, the code related to
patches and extra features like \type {tlig}, \type {trep} and \type {anum} had
to be rewritten as well: more code and a bit closer to the final table format.

\starttabulate[|l|c|c|]
\FL
\NC            \NC \bf table         \NC \bf sparse        \NC \NR
\TL
\NC \bf hybrid \NC 310 MB / 10.3 sec \NC 285 MB / 10.5 sec \NC \NR
\NC \bf mk     \NC 884 MB / 47.5 sec \NC 878 MB / 48.7 sec \NC \NR
\LL
\stoptabulate

The timings in the previous table concern runs of a few documents where \type
{mk} loads quite some large and complex fonts. The runs are timed with an empty
cache so all fonts are preprocessed. The memory consumption is the peak load as
reported by the task manager and we need to keep in mind that \LUA\ allocates
more than it needs. Keep in mind that these values are so high because fonts are
created. A regular run takes less memory. Interesting is that for \type {mk} the
original implementation performs better but the difference is about a second
which again indicates that the garbage collector is a major factor. Timing only
the total runtime gives:

\starttabulate[|l|c|c|c|c|]
\FL
\NC        \NC \bf cached \NC \bf original \NC \bf table \NC \bf sparse \NC \NR
\TL
\NC \bf mk \NC 38.1 sec   \NC 75.5 sec     \NC 77.2 sec  \NC 80.8 sec   \NC \NR
\LL
\stoptabulate

Here we used the system timer while in previous tables we used the values as
reported by the timers built in \MKIV\ (and only reported the font loading
times).

The timings above are taken on my laptop running Windows 7 and this is not that
good a platform for precise timings. Taco's measurements were done with
specialized tools and should be trusted more. It indeed looks like the current
level of userdata support is about the best compromise one can get.

{\em In the process I also experimented with virtualizing the final \TFM\ table,
thereby simulating the upcoming virtualization of that table in \LUATEX.
Interesting is that for (for instance) \type {mk.pdf} memory consumption went
down by 20\% but that document is non|-|typical and loads many fonts,
including virtual punk fonts. However, as access to that table happens
infrequently virtualization makes much sense there, again only at the toplevel
of the characters subtable.}

\stopsection

\startsection [title={Hyperlinks}]

At \PRAGMA\ we have a long tradition of creating highly interactive documents. I
still remember the days that processing a 20,000 page document with numerous
menus and buttons on each page took a while to get finished, especially if each
page has a \METAPOST\ graphic as well.

On a regular computer a document with so many links is no real problem. After
all, the \PDF\ format is designed in such a way that only the partial content has
to be loaded. However, half a million hyperlinks do demand some memory.

Recently I had to make a document that targets one of these tablets and it is
no secret that tablets (and e-readers) don't have that much memory. As in
\CONTEXT\ \MKIV\ we have a bit more control over the backend, it will be no
surprise that we are able to deal with such issues more comfortably than in
\MKII.

That specific document (part of a series) contained 1100 pages and each page has
a navigation menu as well as an alphabetic index into the register. There is a
table of contents referring to about 200 chapters and these are backlinked to the
table of contents. There are also some 200 images and tables that end up
elsewhere and again are crosslinked. Of course there is the usual bunch of inline
hyperlinks. So, in total this document has some 32,000 hyperlinks. The input is a
3.03 MB \XML\ file.

\starttabulate[|l|c|c|]
\FL
\NC                                                 \NC \bf size \NC \bf one run \NC \NR
\TL
\NC \bf don't optimize                              \NC 5.76 MB  \NC 59.4 sec    \NC \NR
\NC \bf prefer page references over named ones      \NC 5.66 MB  \NC 56.2 sec    \NC \NR
\NC \bf aggressively share similar references       \NC 5.19 MB  \NC 60.2 sec    \NC \NR
\NC \bf optimize page as well as similar references \NC 5.11 MB  \NC 56.5 sec    \NC \NR
\NC \bf disable all interactive features            \NC 4.19 MB  \NC 42.7 sec    \NC \NR
\LL
\stoptabulate

So, by aggressively sharing hyperlinks and turning all internal named
destinations into page destinations we bring down the size noticeably and even
have a faster run. It is for this reason that aggressive sharing is enabled by
default. If you don't want it, you can disable it with:

\starttyping
\disabledirectives[references.sharelinks]
\stoptyping
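
The idea behind sharing can be sketched in a few lines of \LUA. This is an
illustration only and not the backend code: the function \type {linkobject}, its
arguments and the way the key is composed are assumptions. The point is that
links with the same specification map onto one object instead of many.

\starttyping
-- hypothetical sketch: identical link specifications share one object
local shared  = { }   -- maps a specification string onto an object number
local objects = 0     -- number of objects that would end up in the file

local function linkobject(destination, width, height)
  local key = string.format("%s:%s:%s", destination, width, height)
  local n = shared[key]
  if not n then
    objects = objects + 1
    n = objects
    shared[key] = n
  end
  return n
end

-- the same button used twice gives one object, another target a new one
local a = linkobject("contents", 100, 20)
local b = linkobject("contents", 100, 20)
local c = linkobject("index",    100, 20)
print(a, b, c, objects)   -- prints 1 1 2 2
\stoptyping

A navigation menu that is repeated on each of the 1100 pages is a typical case
where such a lookup pays off.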

Currently we use names for internal (automatically generated) links. We can force
page links for them but still use names for explicit references so that we can
reach them from external documents; this is called mixed mode. When no references
from outside are needed, you can force page links. At some point mixed mode can
become the default.

\starttyping
\enabledirectives[references.linkmethod=page]
\stoptyping

The possible values are \type {page}, \type {mixed}, \type {names} and \type
{yes}, the last one being equivalent to \type {page}. The \MKII\ way of setting
this is still supported:

\starttyping
\setupinteraction[page=yes]
\stoptyping

We could probably gain quite some more bytes by turning all repetitive elements
into shared graphical objects but it only makes sense to spend time on that when
a project really needs it (and pays for it). There is up to one megabyte of
(compressed) data related to menus and other screen real estate that qualifies
for this but it might not be worth the trouble.

The reason for trying to minimize the amount of hyperlink related metadata (in
\PDF\ terminology annotations) is that on tablets with not that much memory (and
no virtual memory) we don't want to keep too much of that (redundant) data in
memory. And indeed, the optimized document feels more responsive than the dirty
version, but that could as well be related to the viewing applications.

\stopsection

\startsection[title=Constants]

Not every optimization saves memory or runtime. Some are more optimizations due
to changes in circumstances. When \TEX\ had only 256 registers one had to find
ways to get round this. For instance counters are quite handy and you could
quickly run out of them. In \CONTEXT\ there are two ways to deal with this.
Instead of a real count register you can use a macro:

\starttyping
\newcounter \somecounter
\increment  \somecounter
\decrement (\somecounter,4)
\stoptyping

In \MKIV\ many such pseudo counters have been replaced by real ones which is
somewhat faster in usage.

Often one needs a constant and a convenient way to define such a frozen counter
is:

\starttyping
\chardef \myconstant 10
\ifnum \myvariable = \myconstant ....
\ifcase \myconstant ...
\stoptyping

This is both efficient and fast and works out well because \TEX\ treats them as
numbers in comparisons. However, it is somewhat clumsy, as constants have nothing
to do with characters. This is why all such definitions have been replaced by:

\starttyping
\newconstant \myconstant 10
\setconstant \myconstant 12
\ifnum \myvariable = \myconstant ....
\ifcase \myconstant ...
\stoptyping

We use count registers which means that when you set a constant, you can just
assign the new value directly or use the \type {\setconstant} macro.

We already had an alternative for conditionals:

\starttyping
\newconditional \mycondition
\settrue \mycondition
\setfalse \mycondition
\ifconditional \mycondition
\stoptyping

These will also be adapted to counts but first we need a new primitive.

The advantage of these changes is that at the \LUA\ end we can consult as well as
change these values. This means that in the end much more code will be adapted.
Especially changing the constants resulted in quite some cosmetic changes in the
core code.
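
Consulting and changing such a value at the \LUA\ end could look like this. It is
a sketch under assumptions: it only works inside \LUATEX\ (for instance between
\type {\startluacode} and \type {\stopluacode}), it assumes that \type
{\myconstant} has been defined with \type {\newconstant} as shown before and
that it is indeed a count register, and it uses the \type {tex.count} table that
accepts the name of such a register as key.

\starttyping
-- LuaTeX only; assumes \newconstant \myconstant at the TeX end
local value = tex.count.myconstant      -- consult the current value
if value ~= 12 then
  tex.count.myconstant = 12             -- comparable to \setconstant
end
\stoptyping

With a \type {\chardef} based constant this kind of access by name is not
available, which is one reason for moving to registers.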

\stopsection

\startsection[title=Definitions]

Another recent optimization became possible when setting lccodes cum suis and
some math definitions could be done at the \LUA\ end. As all these
initializations take place at the \LUA\ end, till then we were just writing \TEX\
code back to \TEX, but now we stay at the \LUA\ end. This not only looks nicer,
but also results in slightly less memory usage during format generation (a few
percent). Making a format also takes a few tenths of a second less (again a few
percent). The reason why less memory is needed is that instead of writing tens of
thousands of \type {\lccode} related commands to \TEX\ we now set the value
directly. As writes to \TEX\ are collected, quite an amount of tokens get cached.
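
The two approaches can be put side by side. The following is a sketch and not the
actual initialization code: it runs inside \LUATEX\ only, the range (uppercase
A--Z mapped onto lowercase a--z) is just an example, and the old approach is
shown as a comment.

\starttyping
-- LuaTeX only; the range is just an example (A-Z onto a-z)
for uc = 0x41, 0x5A do
  local lc = uc + 0x20
  -- old approach: write TeX code back to TeX, where it ends up as tokens
  -- tex.sprint(string.format("\\lccode %d=%d ", uc, lc))
  -- new approach: set the value directly, no tokens involved
  tex.lccode[uc] = lc
end
\stoptyping

For a handful of characters the difference is not measurable, but with tens of
thousands of assignments the collected tokens add up.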

All such small improvements make \CONTEXT\ \MKIV\ run smoother with each
advance of \LUATEX. We do have a wishlist for further improvements but so far we
managed to improve stepwise instead of putting too much pressure on \LUATEX\
development.

\stopsection

\stopchapter

\stopcomponent
