% mk-structure.tex (size: 19 Kb, last modification: 2023-12-21 09:43)
% language=us

\usemodule[narrowtt]

\environment mk-environment

\startcomponent mk-structure

\chapter{Everything structure}

At the time of this writing, \CONTEXT\ \MKIV\ spends some 50\% of
its time in \LUA. There are several reasons for this.

\startitemize[packed]
\item All \IO\ goes via \LUA, including messages and logging. This includes
      file searching, which used to be done by the \KPSE\ library.
\item Much font handling is done by \LUA\ too, for instance \OPENTYPE\ features
      are completely handled by \LUA.
\item Because \TEX\ is highly optimized, its influence on runtime is less
      prominent. Even if we delegate some tasks to \LUA, \TEX\ still has
      work to do.
\stopitemize

Among the reported statistics of a 242 page version of \type
{mk.pdf} (not containing this chapter) we find the following:

\startntyping
input load time           - 0.094 seconds
startup time              - 0.905 seconds (including runtime option file processing)
jobdata time              - 0.140 seconds saving, 0.062 seconds loading
fonts load time           - 5.413 seconds
xml load time             - 0.000 seconds, lpath calls: 46, cached calls: 31
lxml load time            - 0.000 seconds preparation, backreferences: 0
mps conversion time       - 0.000 seconds
node processing time      - 1.747 seconds including kernel
kernel processing time    - 0.343 seconds
attribute processing time - 2.075 seconds
language load time        - 0.109 seconds, n=4
graphics processing time  - 0.109 seconds including tex, n=7
metapost processing time  - 0.484 seconds, loading: 0.016 seconds, execution: 0.203 seconds, n: 65
current memory usage      - 332 MB
loaded patterns           - gb:gb:pat:exc:3 nl:nl:pat:exc:4 us:us:pat:exc:2
control sequences         - 34245 of 165536
callbacks                 - direct: 235579, indirect: 18665, total: 254244 (1050 per page)
runtime                   - 25.818 seconds, 242 processed pages, 242 shipped pages, 9.373 pages/second
\stopntyping

The startup time includes initial font loading (we don't store fonts
in the format). Jobdata time involves loading and saving multipass data
used for tables of contents, references, positioning, etc. The time needed
for loading fonts is over 5 seconds due to the fact that we load a couple of
really large and complex fonts. Node processing time is mostly related to
\OPENTYPE\ feature support. The kernel processing time refers to hyphenation
and line breaking, for which (of course) we use \TEX. Direct callbacks are
implicit calls to \LUA, using \type {\directlua}, while the indirect calls
concern overloaded \TEX\ functions and callbacks triggered by \TEX\ itself.

Depending on the system load on my laptop, the throughput is around
10 pages per second for this document, which is due to the fact
that some font trickery takes place using a few Arabic fonts, some
Chinese, a bunch of \METAPOST\ punk instances, Zapfino, etc.

The times reported are accumulated times and contain quite some
accumulated rounding errors, so assuming that the operating system
rounds up the times, the totals in practice might be higher. So,
looking at the numbers, you might wonder if the load on \LUA\ will
become even larger. This is not necessarily the case. Some tasks can be done
better in \LUA\ but not always with less code, especially when we
want to extend functionality and to provide more robust solutions.
Also, even if we win some processing time we might as well waste
it in interfacing between \TEX\ and \LUA. For instance, we can
delegate pretty printing to \LUA, but most documents don't contain
verbatim at all. We can handle section management by \LUA, but how
many section headers does a document have?

When the future of \TEX\ is discussed, among the ideas presented
is to let \TEX\ stick to typesetting and implement it as a
component (or library) on top of a (maybe dedicated) language.
This might sound like a nice idea, but eventually we will end up
with some kind of user interface and a substantial amount of code
dedicated to dealing with fonts, structure, character management,
math etc.

In the process of converting \CONTEXT\ to \MKIV\ we try to use
each language (\TEX, \LUA, \METAPOST) for what it is best suited.
Instead of starting from scratch, we start with existing code
and functionality, because we need a running system. Eventually we
might find \TEX's role as a language being reduced to (or maybe we can
better talk of \quote {focused on}) mostly aspects of
typesetting, but \CONTEXT\ as a whole will not be much different
from the perspective of the user.

So, this is how the transition of \CONTEXT\ takes place:

\startitemize[packed]
\item We started with replacing isolated bits and pieces of code
      where \LUA\ is a more natural candidate, like file \IO\ and encoding
      issues.
\item We implement new functionality, for instance \OPENTYPE\
      and \TYPEONE\ support.
\item We reimplement mechanisms that are not as efficient as we want them
      to be, like buffers and verbatim.
\item We add new features, for instance tree based \XML\ processing.
\item After evaluating we reimplement again when needed (or when \LUATEX\
      evolves).
\stopitemize

Yet another transition is the one we will discuss next:

\startitemize[packed]
\item We replace complex mechanisms by new ones where we separate
      management and typesetting.
\stopitemize

This is a not so trivial effort because it affects many aspects of \CONTEXT\ and
as such we need to adapt a lot of code at the same time, namely all things
related to structure:

\startitemize[packed]
\item sectioning (chapters, sections, etc)
\item numbering (pages, itemize, enumeration, floats, etc)
\item marks (used for headers and footers)
\item lists (tables of contents, lists of floats, sorted lists)
\item registers (including collapsing of page ranges)
\item cross referencing (to text as well as pages)
\item notes (footnotes, endnotes, etc)
\stopitemize

All these mechanisms are somehow related. A section head can occur
in a list, can be cross referenced, might be shown in a header and
of course can have a number. Such a number can have multiple
components (1.A.3) where each component can have its own
conversion and rendering (fonts, colors), and we can selectively
show fewer components. In tables of contents we may or may not want
to see all components, separators etc. Such a table can be generated at each
level, which demands filtering mechanisms. The same is true for
registers. There we have page numbers too, and these may be
prefixed by section numbers, possibly rendered differently than
the original section number.

Much of this is possible in \CONTEXT\ \MKII, but the code that
deals with this is not always nice and clean and right from the start
of the \LUATEX\ project it has been on the agenda to clean it up. The code
evolved over time and functionality was added when needed. But, the projects
that we deal with demand more (often local) control over the
components of a number.

What makes structure related data complex is that we need to keep
track of each aspect in order to be able to reproduce the
rendering in for instance a table of contents, where we also may
want to change some of the aspects (for instance separators in a
different color). Another pending issue is \XML\ and although we
could normally deal with this quite well, it started making sense
to make all multi|-|pass data (registers, tables of contents,
sorted lists, references, etc.) more \XML\ aware. This is a
somewhat hairy task, if only because we need to switch between
\TEX\ mode and \XML\ mode when needed and at the same time keep an
eye on unwanted expansion: do we keep structure in the content or
not?

Rewriting the code that deals with these aspects of typesetting is
the first step in a separation of code in \MKII\ and \MKIV. Until
now we tried to share much code, but this no longer makes sense.
Also, at the \CONTEXT\ conference in Bohinj (2008) it was decided
that given the development of \MKIV, it made sense to freeze
\MKII\ (apart from bug fixes and minor extensions). This decision
opens the road to more drastic changes. We will roll back some of
the splits in code that made sharing code possible and just
replace whole components of \CONTEXT. This also gives
us the opportunity to review code more drastically than until now
in the perspective of \ETEX.

Because this stage in the rewrite of \CONTEXT\ might bring some
compatibility issues with it (especially for users who use the
more obscure tuning options), I will discuss some of the changes
here. A bit of understanding might make users more tolerant.

The core data structure that we need to deal with is a number, which
can be constructed in several ways.

\def\NotaBeneR{\inframed[frame=off,background=color,backgroundcolor=mktransparentred]}
\def\NotaBeneG{\inframed[frame=off,background=color,backgroundcolor=mktransparentgreen]}
\def\NotaBeneB{\inframed[frame=off,background=color,backgroundcolor=mktransparentblue]}
\def\NotaBeneY{\inframed[frame=off,background=color,backgroundcolor=mktransparentyellow]}
\def\NotaBeneS{\inframed[frame=off,background=color,backgroundcolor=mktransparentgray]}

\starttabulate[|l|l|]
\NC sectioning   \NC \NotaBeneR{1.A.2.II} some title \NC \NR
\NC pagenumber   \NC page \NotaBeneR{1.A}\NotaBeneG{--}\NotaBeneB{23} \NC \NR
\NC reference    \NC in chapter \NotaBeneR{2.II} \NC \NR
\NC marking      \NC \NotaBeneR{A}: some title with preceding number \NC \NR
\NC contents     \NC \NotaBeneR{2.II} some title with some page number \NotaBeneR{1.A}\NotaBeneG{--}\NotaBeneB{23} \NC \NR
\NC index        \NC some word \NotaBeneB{23}, \NotaBeneR{A}\NotaBeneG{--}\NotaBeneB{42}---\NotaBeneR{B}\NotaBeneG{--}\NotaBeneB{48} \NC \NR
\NC itemize      \NC \NotaBeneY{a} first item \NotaBeneY{a.1} subitem item \NC \NR
\NC enumerate    \NC example \NotaBeneR{1.A.2.II}\NotaBeneG{.}\NotaBeneY{a} \NC \NR
\NC floatcaption \NC figure \NotaBeneR{1}\NotaBeneG{--}\NotaBeneB{2} \NC \NR
\NC footnotes    \NC note \NotaBeneS{\symbol[3]} \NC \NR
\stoptabulate

In this table we see how numbers are composed:

\starttabulate[|l|p|]
\NC \NotaBeneR{section number} \NC It has several components, separated by symbols
                                   and with an optional final symbol \NC \NR
\NC \NotaBeneG{separator}      \NC This can be different for each level and can
                                   have dedicated rendering options \NC \NR
\NC \NotaBeneB{page number}    \NC That can be preceded by a (partial) sectionnumber
                                   and separated from the page number by another symbol \NC \NR
\NC \NotaBeneY{counter}        \NC It can be preceded by a (partial) sectionnumber and
                                   can also have subnumbers with its own separation
                                   properties \NC \NR
\NC \NotaBeneS{symbol}         \NC Sometimes numbers get represented by symbols in which
                                   case we use pagewise restarting symbol sets \NC \NR
\stoptabulate

Say that at some point we store a section number and/or page
number. With the number we need to store information about the
conversion (number, character, roman numeral, etc) and the
separators, including their rendering. However, when we reuse that
stored information we might want to discard some components and/or
use a different rendering. In traditional \CONTEXT\ we have
control over some aspects but due to the way numbers are stored
for later reuse this control is limited.

Say that we have cloned a section head as follows:

\starttyping
\definehead[MyHead][section]
\stoptyping

This is used as:

\starttyping
\MyHead[example]{Example}
\stoptyping

In \MKII\ we save a list entry (which has the number, the title
and a reference to the page) and a reference to the number,
the title and the page (tagged \type {example}). Page numbers are
stored in such a way that we can filter at specific section
levels. This permits local tables of contents.

The entry in the multi|-|pass data file looks as follows (we collect all
multi|-|pass data in one file):

\starttyping
\mainreference{}{example}{2--0-1-1-0-0-0-0--1}{1}{{I.I}{Example}}%
\listentry{MyHead}{2}{I.I}{Example}{2--0-1-1-0-0-0-0--1}{1}%
\stoptyping

In \MKIV\ we store more information and use tables for that. Currently
the entry looks as follows:

\starttyping
structure.lists.collected={
 {
   ...
 },
 {
  metadata={
   catcodes=4,
   coding="tex",
   internal=2,
   kind="section",
   name="MyHead",
   reference="example",
  },
  pagenumber={
   numbers={ 1, 1, 0 },
  },
  sectionnumber={
   conversion="R",
   conversionset="default",
   numbers={ 0, 2 },
   separatorset="default",
  },
  sectiontitle={
   label="MyHead",
   title="Example",
  },
 },
 {
  ...
 },
}
\stoptyping
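
To make the role of these fields a bit more concrete, the following plain
\LUA\ sketch shows how a stored \type {sectionnumber} subtable could be turned
back into a typeset number. It is not the actual \MKIV\ code: the function
names \type {toroman} and \type {compose}, the converter table, the reading of
\type {R} as uppercase roman numerals and the rule that a zero component is
skipped are all assumptions made for this illustration.

\starttyping
-- Hypothetical converters, keyed by the value of the conversion field.
local function toroman(n)
    local symbols = {
        { 1000, "M" }, { 900, "CM" }, { 500, "D" }, { 400, "CD" },
        { 100, "C" }, { 90, "XC" }, { 50, "L" }, { 40, "XL" },
        { 10, "X" }, { 9, "IX" }, { 5, "V" }, { 4, "IV" }, { 1, "I" },
    }
    local result = ""
    for _, pair in ipairs(symbols) do
        while n >= pair[1] do
            result = result .. pair[2]
            n = n - pair[1]
        end
    end
    return result
end

local converters = {
    n = tostring,
    R = toroman,
    A = function(n) return string.char(64 + n) end, -- 1 -> "A", valid for 1..26
}

-- entry       : a table shaped like the sectionnumber subtable
-- separator   : string put between components (defaults to ".")
-- first, last : optional range of components to keep
local function compose(entry, separator, first, last)
    local convert = converters[entry.conversion] or tostring
    local parts = { }
    for i = first or 1, last or #entry.numbers do
        local n = entry.numbers[i]
        if n and n > 0 then -- assumption: zero means that the level is not set
            parts[#parts + 1] = convert(n)
        end
    end
    return table.concat(parts, separator or ".")
end

local stored = {
    conversion    = "R",
    conversionset = "default",
    numbers       = { 0, 2 },
    separatorset  = "default",
}

print(compose(stored)) -- II
\stoptyping

Passing a range, as in \type {compose(entry,"-",2,3)}, keeps only some of the
components, which is one way to think about discarding components when the
stored number is reused.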

There can be much more information in each of the subtables. For
instance, the \type {pagenumber} and \type {sectionnumber}
subtables can have \type {prefix}, \type {separatorset},
\type{conversion}, \type {conversionset}, \type {stopper}, \type
{segments} and \type {connector} fields, and the \type {metadata}
table can contain information about the \XML\ root document so
that associated filtering and handling can be reconstructed. With the
section title we store information about the preceding label text
(seldom used, think of \quote{Part B}).

This entry is used for lists as well as cross referencing.
Actually, the stored information is also used for markings
(running heads). This means that these mechanisms must be able to
distinguish between where and how information is stored.
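
Local lists, such as a table of contents per chapter, come down to filtering
the collected entries on a section number prefix. A minimal sketch in plain
\LUA, assuming entries shaped like the one shown before; the sample data and
the function name \type {filter} are made up for this example and the real
filtering is likely to be more involved:

\starttyping
-- Sample entries, reduced to the fields that matter here (hypothetical data).
local collected = {
    { sectionnumber = { numbers = { 1 } },    sectiontitle = { title = "One" } },
    { sectionnumber = { numbers = { 1, 1 } }, sectiontitle = { title = "One.A" } },
    { sectionnumber = { numbers = { 1, 2 } }, sectiontitle = { title = "One.B" } },
    { sectionnumber = { numbers = { 2 } },    sectiontitle = { title = "Two" } },
    { sectionnumber = { numbers = { 2, 1 } }, sectiontitle = { title = "Two.A" } },
}

-- Keep the entries that lie below the section identified by prefix,
-- for example { 1 } for everything inside the first chapter.
local function filter(list, prefix)
    local result = { }
    for _, entry in ipairs(list) do
        local numbers = entry.sectionnumber.numbers
        local match = #numbers > #prefix -- only deeper levels qualify
        for level = 1, #prefix do
            if numbers[level] ~= prefix[level] then
                match = false
                break
            end
        end
        if match then
            result[#result + 1] = entry
        end
    end
    return result
end

for _, entry in ipairs(filter(collected, { 1 })) do
    print(entry.sectiontitle.title) -- One.A and One.B
end
\stoptyping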

These tables look rather verbose and indeed they are. We end up
with much larger multi|-|pass data files but fortunately loading them
is quite efficient. Serializing, on the other hand, might cost some time,
which is compensated by the fact that we no longer store
information in token lists associated with nodes in \TEX's lists
and in the future we might even move more data handling to the
\LUA\ end. Also, in future versions we will share similar data
(like page number information) more efficiently.

Storing data at the \LUA\ end also has consequences for the
typesetting. When specific data is needed a call to \LUA\ is
necessary. In the future we might offer both push and pull methods
(\LUA\ pushing information to the typesetting code versus \LUA\
triggering typesetting code). For lists we pull, and for registers
we currently push. Depending on our experiences we might change
these strategies.

A side effect of the rewrite is that we force more consistency.
For instance, you see a \type {conversion} field in the list. This
is the old way of defining the way a number gets converted. The
modern approach is to use sets. Because we now have a more
stringent inheritance model at the user interface level, this
might lead to incompatible conversions at lower levels (when
unset). Instead of cooking up some nasty compatibility hacks, we
accept some incompatibility, if only because users have to adapt
their styles to new font technology anyway. And for older
documents there is still \MKII.

Instead of introducing many extra configuration variables (for each
level of sectioning) we introduce sets. These replace some of the
existing parameters and are the follow up on some (undocumented)
precursor of sets. Examples of sets are:

\starttyping
\definestructureseparatorset [default][][.]
\definestructureconversionset[default][][numbers]
\definestructureresetset     [default][][0]
\definestructureprefixset    [default][section-2,section-3][]
\definestructureseparatorset [appendix][][.]
\definestructureconversionset[appendix][Romannumerals,Characters][]
\definestructureresetset     [appendix][][0]
\stoptyping

The third parameter is the default value. The sets that relate to typesetting
can have a rendering specification:

\starttyping
\definestructureseparatorset
  [demosep]
  [demo->!,demo->?,demo->*,demo->@]
  [demo->/]
\stoptyping

Here we apply \type{demo} to each of the separators as well as to the
default. The renderer is defined with:

\starttyping
\defineprocessor[demo][style=\bfb,color=red]
\stoptyping
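
As an illustration of what handling such sets involves, here is a small plain
\LUA\ sketch. The table layout and the function names \type {splitprocessor}
and \type {resolve} are assumptions for this example, not the \MKIV\
implementation. The sketch only shows the lookup of a level with a fallback to
the default, and the splitting of a specification like \type {demo->!} into a
processor name and a value.

\starttyping
-- A set is modelled as a list of per level entries plus a default
-- (the third argument of the set definitions shown above).
local sets = {
    demosep = {
        entries = { "demo->!", "demo->?", "demo->*", "demo->@" },
        default = "demo->/",
    },
}

-- Split "processor->value" into two parts; a plain value has no processor.
local function splitprocessor(str)
    local processor, value = string.match(str, "^(.-)%->(.*)$")
    if processor then
        return processor, value
    end
    return nil, str
end

-- Return processor and value for a level, using the default when the
-- level has no entry of its own.
local function resolve(name, level)
    local set = sets[name]
    if not set then
        return nil, nil
    end
    local entry = set.entries[level]
    if entry == nil or entry == "" then
        entry = set.default
    end
    return splitprocessor(entry)
end

local processor, separator = resolve("demosep", 2) -- "demo" and "?"
local fallback, default    = resolve("demosep", 7) -- "demo" and "/"
\stoptyping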

You can imagine that, although this is quite possible in \TEX,
dealing with sets, splitting them, handling the rendering, etc.\
is easier in \LUA\ than in \TEX. Of course the code still looks
somewhat messy, if only because the problem is messy. Part of this
mess is related to the fact that we might have to specify all
components that make up a number.

\starttabulate
\NC section    \NC section number as part of head  \NC \NR
\NC list       \NC section number as part of list entry  \NC \NR
\NC            \NC section number as part of page number prefix \NC \NR
\NC            \NC (optionally prefixed) page number \NC \NR
\NC counter    \NC section number as part of counter prefix  \NC \NR
\NC            \NC (optionally prefixed) counter value(s) \NC \NR
\NC pagenumber \NC section number as part of page number \NC \NR
\NC            \NC pagenumber components (realpage, page, subpage) \NC \NR
\stoptabulate

As a result we have up to three sets of parameters:

\starttabulate
\NC section    \NC \type{section*} \NC \NR
\NC list       \NC \type{section*} \type{prefix*} \type{page*} \NC \NR
\NC counter    \NC \type{section*} \type{number*} \NC \NR
\NC pagenumber \NC \type{prefix*} \type{page*} \NC \NR
\stoptabulate

When reimplementing the structure related commands, we also have
to take mechanisms into account that relate to them. For instance,
index sorter code is also used for sorted lists, so when we adapt
one mechanism we also have to adapt the other. The same is true
for cross references, that are used all over the place. It helps
that for the moment we can omit the more obscure interaction
related mechanisms, if only because users will seldom use them.
Such mechanisms are also related to the backend and we're not yet
in the stage where we upgrade the backend code. In case you wonder
why references can be such a problematic area, think of the
following:

\starttyping
\goto{here}[page(10),StartSound{ping},StartVideo{demo}]
\goto{there}[page(10),VideLayer{example},JS(SomeScript{hi world})]
\goto{anywhere}[url(mypreviouslydefinedurl)]
\stoptyping

The \CONTEXT\ cross reference mechanism permits mixed usage of simple
hyperlinks (jump to some page) and more advanced viewer actions like
showing widgets and running \JAVASCRIPT\ code. And even a simple
reference like:

\starttyping
\at{here and there}[somefile::sometarget]
\stoptyping

involves some code because we need to handle the three words as
well as the outer reference. \footnote {Currently \CONTEXT\ does
its own splitting of multiword references, and does so by reusing
hyperlink resources in the backend format. This might change in
the future.} The reason why we need to reimplement referencing
along with structure lies in the fact that for some structure
components (like section headers and float references) we no
longer store cross reference information separately but filter it
from the data stored in the list (see example before).
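
To give an idea of the kind of parsing that such reference specifications ask
for, here is a plain \LUA\ sketch that splits a specification into its actions
and separates an outer part (before the \type {::}) from the inner reference.
The function names \type {splitactions} and \type {splitouter} are
hypothetical and the sketch ignores many cases that a real implementation has
to deal with.

\starttyping
-- Split a reference specification on commas that are not nested inside
-- parentheses or braces.
local function splitactions(spec)
    local actions, depth, start = { }, 0, 1
    for i = 1, #spec do
        local c = string.sub(spec, i, i)
        if c == "(" or c == "{" then
            depth = depth + 1
        elseif c == ")" or c == "}" then
            depth = depth - 1
        elseif c == "," and depth == 0 then
            actions[#actions + 1] = string.sub(spec, start, i - 1)
            start = i + 1
        end
    end
    actions[#actions + 1] = string.sub(spec, start)
    return actions
end

-- Separate an optional outer part ("somefile") from the inner reference.
local function splitouter(spec)
    local outer, inner = string.match(spec, "^(.-)::(.*)$")
    if outer then
        return outer, inner
    end
    return nil, spec
end

local actions = splitactions("page(10),StartSound{ping},StartVideo{demo}")
print(#actions) -- 3

local outer, inner = splitouter("somefile::sometarget")
print(outer, inner)
\stoptyping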

The \LUA\ code involved in dealing with the more complex
references shown here is much more flexible and robust than the
original \TEX\ code. This is a typical example of where the
accumulated time spent on the \TEX\ based solution is large
compared to the time spent on the \LUA\ variant. It's like driving
200 km by car through hilly terrain and wondering how one did that
in earlier times. Just like today's scenery is not by definition better
than yesterday's, \MKIV\ code is not always better than \MKII\ code.

\stopcomponent