% mk-goingbeta.tex, size: 18 Kb, last modification: 2023-12-21 09:43
% language=us

\startcomponent mk-goingbeta

\environment mk-environment

\doifmodeelse {tug} {

    \title {Lua\TeX\ going beta}

    \subject{by Hans Hagen \& Taco Hoekwater}

    This is Chapter~XI from \notabene {\CONTEXT, from \MKII\ to \MKIV}, a document
    that describes our explorations, experiments and decisions made while
    we develop \LUATEX.

    \blank[3*big]

} {

    \chapter {Going beta}

}

\subject{introduction}

We're closing in on the day that we will go beta with \LUATEX\ (end of July
2007). By now we have a rather good picture of its potential and to what
extent \LUATEX\ will solve some of our persistent problems. Let's first
summarize our reasons for and objectives with \LUATEX.

\startitemize

\item The world has moved from 8~bits to 32~bits and more, and this is
quite noticeable in the arena of fonts. Although \TYPEONE\ fonts could host
more than 256 glyphs, the associated technology was limited to 256. The advent
of \OPENTYPE\ fonts will make it easier to support multiple languages at the
same time without the need to switch fonts at awkward times.

\item At the same time \UNICODE\ is replacing 8~bit based encoding vectors and
code pages (input regimes). The most popular and rather efficient \UTF8 encoding
has become a de facto standard in document encoding and interchange.

\item Although we can do really neat tricks with \TEX, given some nasty programming,
we are touching the limits of its possibilities. In order for it to survive we
need to extend the engine but not at the cost of base compatibility.

\item Coding solutions in a macro language is fine, but sometimes you long for a more
procedural approach. Manipulating text, handling \IO, interfacing \unknown\ the
technology moves on and we need to move along too.

\stopitemize

Hence \LUATEX: a merge of the mainstream traditional \TEX\ engines, stripped of
broken or incomplete features and opened up to an embedded \LUA\ scripting engine.

We will describe the impact of this new engine by starting from its core components
reflected in the specific \LUA\ interface libraries. Missing here is embedded support
for \METAPOST, because it's not yet there (apart from the fact that we use \LUA\ to
convert \METAPOST\ graphics into \TEX). Also missing is the interfacing to the \PDF\
backend, which is also on the agenda for later. Special extensions, for instance those
dealing with runtime statistics, are also not discussed. Since we use \CONTEXT\ as
a testbed, we will refer to the \LUATEX\ aware version of this macro package, \MKIV, but
most conclusions are rather generic.

\subject{tex internals}

In order to manipulate \TEX's data structures, we need access to all its registers.
Early in the development, dimensions and counters were already accessible, and when
the token and node interfaces were implemented, those registers were interfaced as well.

Those who read the previous chapters will have noticed that we hardly discussed this
option. The reason is that we didn't yet need that access much in order to implement
font support and list processing. After all, most of the data that we need to access and
manipulate is not in the registers at all. Information meant for \LUA\ can be stored
in \LUA\ data structures. In fact, the basic call

\starttyping
\directlua 0 {some lua code}
\stoptyping

has proven to be a pretty good starting point and the fact that one can print back to
the \TEX\ engine overcomes the need to store results in shared variables.

\starttyping
\def\valueofpi{\directlua0{tex.sprint(math.pi)}}
\stoptyping

The number of such direct calls is not that large anyway. More often a call to \LUA\
will be initiated by a callback, i.e.\ a hook into the \TEX\ machinery.
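
As a hedged illustration of printing back rather than sharing variables, the
following sketch computes a value in \LUA\ and hands the result to \type
{tex.sprint}. The helper name \type {rounded_pi} is made up for this example, and
the fallback definition of \type {tex} is only an assumption that lets the snippet
run in a stand||alone \LUA\ interpreter as well.

\starttyping
-- Outside LuaTeX there is no global 'tex' table, so fall back to a stand-in
-- that writes to the terminal (an assumption made for this sketch only).
local tex = tex or { sprint = function(s) io.write(s) end }

-- Hypothetical helper: pi rounded to the given number of decimals, as a string.
local function rounded_pi(decimals)
    local scale = 10 ^ decimals
    return tostring(math.floor(math.pi * scale + 0.5) / scale)
end

-- In LuaTeX the string ends up in TeX's input, as if it had been typed there.
tex.sprint(rounded_pi(4))
\stoptyping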

What will be the impact of this access on \CONTEXT\ \MKIV? This is hard to tell yet. In a
later stage of the development, when parts of the \TEX\ machinery will be rewritten in
order to get rid of the current global nature of many variables, we will gain more
control over and access to \TEX's internals. Core functionality will be isolated, can be
extended and|/|or overloaded, and at that moment access to internals will be needed
much more. But certainly that will go beyond the current registers and variables.

\subject{callbacks}

These are the spine of \LUATEX: here both worlds communicate with each other. A callback
is a place in the \TEX\ kernel where some information is passed to \LUA\ and some result
is returned that is then used along the road. The reference manual mentions them all and
we will not repeat them here. It is interesting that in \MKIV\ most of them are used, and
for tasks that are rather natural to their place and function.

\starttyping
callback.register("tex_wants_to_do_this",
    function(a,b,c) -- but use lua to do it instead
        -- do whatever you like with a, b and c
        return a, b, c
    end
)
\stoptyping

The impact of callbacks on \MKIV\ is big. They provide us with a way to solve persistent
problems or reimplement existing solutions in more convenient ways. Because we tested
realistic functionality on real (moderately complex) documents using a pretty large
macro package, we can safely conclude that callbacks are quite efficient. Step by step
\LUA\ kicks in, in order to:

\startitemize[packed]
\item influence the input medium so that it provides a sequence of \UTF\ characters
\item manipulate the stream of characters that will be turned into a list of tokens
\item convert the list of tokens into another list of tokens
\item enhance the list of nodes that will be turned into a typeset paragraph
\item tweak the mechanisms that come into play when lines are constructed
\item finalize the result that will end up in the output medium
\stopitemize
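
The first steps in this list can be sketched with a callback that sees every
input line before \TEX\ tokenizes it. The sketch assumes the \type
{process_input_buffer} callback as described in the \LUATEX\ reference manual; the
function name \type {dots_to_unknown} and the replacement rule are made up for the
example, and the guard around the registration is only there so that the code
also runs outside \LUATEX.

\starttyping
-- Hypothetical filter: turn three literal dots into a call to \unknown.
local function dots_to_unknown(line)
    -- gsub returns two values; the parentheses keep only the new string
    return (line:gsub("%.%.%.", "\\unknown "))
end

-- Register the filter only when the LuaTeX callback library is present.
if callback then
    callback.register("process_input_buffer", dots_to_unknown)
end
\stoptyping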

It is interesting that manipulating tokens is less useful than it may look at first
sight. This has to do with the fact that it's (mostly) an expanded stream and at that
time we've lost some information or need to do quite some coding in order to analyze
the information and act upon it.

Will \CONTEXT\ users see any of this? Chances are small that they will, although we
will provide hooks so that they can add special code themselves. A user activating
a callback carries some danger, since it may overload already existing functionality.
Chaining functionality in a callback also has drawbacks, if only that one may be
confronted with already processed results and|/|or may destroy these results in
unpredictable ways. So, as with most low level \TEX\ features, \CONTEXT\ users will
work with more abstract interfaces.
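
The drawback of chaining can be made concrete with a small, engine independent
sketch. The names \type {append_action} and \type {run_chain} are hypothetical
and do not refer to an existing \MKIV\ interface; the point is only that each
action receives what its predecessors left behind.

\starttyping
local actions = { }

-- Hypothetical helper: add a function to the end of the chain.
local function append_action(action)
    actions[#actions + 1] = action
end

-- Run all actions in order; an action that returns nil leaves the data as is.
local function run_chain(data)
    for i = 1, #actions do
        data = actions[i](data) or data
    end
    return data
end

append_action(function(s) return s:upper() end)
-- This action looks for lowercase "tex" but never finds it, because the
-- previous action has already uppercased the whole string.
append_action(function(s) return (s:gsub("tex", "TeX")) end)

print(run_chain("luatex")) -- the second action has no effect here
\stoptyping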

\subject{in- and output}

In \MKIV\ we will no longer use the \KPSE\ library directly. Instead we use a
reimplementation in \LUA\ that not only is more efficient, but also more powerful:
it can read from \ZIP\ files, use protocols, be more clever in searching, reencode
the input streams when needed, etc. The impact on \MKIV\ is large. Most \TEX\ code
that deals with input reencoding has gone away and is replaced by \LUA\ code.

Although it is not directly related to reading from the input medium, at that stage
we also replaced verbatim handling code. Such (often messy) catcode related situations
are now handled more flexibly, thanks to fast catcode table switching (a new
\LUATEX\ feature) and features like syntax highlighting can be made neater.

Buffers, a quite old but frequently used feature of \CONTEXT, are now kept in
memory instead of files. This speeds up runs. Auxiliary data, aka multi||pass
information, will no longer be stored in \TEX\ files but in \LUA\ files. In
\CONTEXT\ we have one such auxiliary file and in \MKII\ this file is selectively
filtered, but in \MKIV\ we will be less careful with memory and load all that
data once. Such speed improvements compensate for the fact that \LUATEX\ is somewhat
slower than its ancestor \PDFTEX. (Actually, the fact that \LUATEX\ is a bit
slower than \PDFTEX\ is mostly due to the fact that it has \ALEPH\ code on
board.)
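
Storing multi||pass data as \LUA\ code means that loading it is just a matter of
running it. The following is a minimal sketch of that idea, not the actual \MKIV\
format: the function \type {serialize} and the reference names in the table are
invented, and only flat tables with string keys and number or string values are
handled.

\starttyping
-- Lua 5.1 has loadstring, later versions use load for strings as well.
local compile = loadstring or load

-- Hypothetical serializer: turn a flat table into a Lua chunk.
local function serialize(t)
    local keys = { }
    for k in pairs(t) do keys[#keys + 1] = k end
    table.sort(keys) -- stable output, independent of hash order
    local lines = { "return {" }
    for _, k in ipairs(keys) do
        local v = t[k]
        local value = type(v) == "number" and tostring(v) or string.format("%q", v)
        lines[#lines + 1] = string.format("    [%q] = %s,", k, value)
    end
    lines[#lines + 1] = "}"
    return table.concat(lines, "\n")
end

-- Made-up page references as they might be collected during a run.
local chunk  = serialize { ["sec:callbacks"] = 3, ["sec:fonts"] = 5 }
-- In a real run the chunk would be written to a file and read back later.
local loaded = compile(chunk)()

print(loaded["sec:fonts"]) -- 5
\stoptyping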

Users often wonder why there are so many temporary files, but these mostly relate
to \METAPOST\ support. These will go away once we have \METAPOST\ as a library.

In a similar way support for \XML\ will be enriched. We already have experimental
loaders, filters and other code, and integration is on the agenda. Since \CONTEXT\ uses
\XML\ for some subsystems, this may have some impact.

Other \IO\ related improvements involve debugging, error handling and logging. We can pop
up helpers and debug screens (\MKIV\ can produce \XHTML\ output and then launch a
browser). Users can choose more verbose logging of \IO\ and ask for log data to be
formatted in \XML. These parts need some additional work, because in the end we will
also reimplement and extend \TEX's error handling.

Another consequence of this is that we will be able to package \TEX\ more
conveniently. We can put all the files that are needed into a \ZIP\ file so that we only
need to ship that \ZIP\ file and a binary.

\subject{font readers}

Handling \OPENTYPE\ involves more than just loading yet another font format. Of course
loading an \OPENTYPE\ file is a necessity but we need to do more. Such fonts come with
features. Features can involve replacing one representation of a character by another
one, or combining sequences into other sequences, and finally resolving them to one or
more glyphs.

Given the numerous options we will have to spend quite some time on extending \CONTEXT\
with new features. Instead of defining more and more font instances (the traditional \TEX\ way
of doing things) we will provide feature switching. In the end this will make
the often confusing font mechanisms less complex for the user to understand. Instead of
for instance loading an extra font (set) that provides old style numerals, we will
decouple this completely from fonts and provide it as yet another property of a piece
of text. The good news is that much of the most important machinery is already in
place (ligature building and such). Here we also have to decide what we let \TEX\ do
and what we do by processing node lists. For instance kerning and ligature building
can either be done by \TEX\ or by \LUA. Given the fact that \TEX\ does some juggling
with character kerning while determining hyphenation points, we can as well disable
\TEX's kerning and let \LUA\ handle it. Thereby \TEX\ only has to deal with paragraph
building. (After all, we need to leave \TEX\ some core functionality to deal with.)

Another everlasting burden on macro writers and users is dealing with character
representations missing from a font. Of course, since we use named glyphs in
\CONTEXT\ \MKII\ already, much of this can be hidden, but in \MKIV\ we can
create virtual fonts on the fly and keep thinking in terms of characters and
glyphs instead of dealing with boxes and other structures that don't go well with,
for instance, hyphenating words.

This brings us to hyphenation, historically bound to fonts in traditional \TEX. This
dependency will go away. In \MKII\ we have shipped \UTF8\ based patterns for some time
already and these can be conveniently used in \MKIV\ too. We experimented with using
hyphenated word lists and this looks promising. You may expect more advanced ways of
dealing with words, hyphenation and paragraph building in the near future. When we
presented the first version of \LUATEX\ a few years ago, we only had the basic \type
{\directlua} call available and could do a bit of string manipulation on the input. A
fancy demo was to color wrongly spelled words. Now we can do that more robustly on the
node lists.
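
A string based approach in the spirit of that early demo can be approximated in a
few lines. This is a sketch under assumptions: the word list \type {known} and the
function \type {mark_unknown} are invented, and the pattern also catches letters
that belong to control sequences, which is one reason why working on node lists
is more robust.

\starttyping
-- Hypothetical word list; a real one would be loaded from a file.
local known = { the = true, quick = true, fox = true }

-- Wrap every word that is not in the list in a ConTeXt color command.
local function mark_unknown(line)
    return (line:gsub("%a+", function(word)
        if not known[word:lower()] then
            return "\\color[red]{" .. word .. "}"
        end
        -- returning nothing keeps the matched word unchanged
    end))
end

print(mark_unknown("the quikc fox")) -- the \color[red]{quikc} fox
\stoptyping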

Loading and preparing fonts for usage in \LUATEX, or actually \MKIV\ because this depends
on the macro package, takes some runtime. For this reason we introduced caching
into \MKIV: data that is used frequently is written to a cache and converted to \LUA\
bytecode. Loading the converted files is incredibly fast. Of course there is a price to
pay: disk space, but that comes cheap these days. Also, it may as well be compensated
by the fact that we can kick out many redundant files from the core \TEX\ distributions
(metric files for instance).
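
The conversion to bytecode relies on plain \LUA\ functionality. The sketch below
shows the round trip with \type {string.dump}; the table with font dimensions is
made up, and writing the bytecode to a cache file is left out.

\starttyping
-- Lua 5.1 has loadstring, later versions use load for strings as well.
local compile = loadstring or load

-- Made-up font data in source form, as a loader might have produced it.
local source = "return { ascender = 750, descender = -250 }"

-- Convert the compiled chunk to bytecode; this string would go into the cache.
local bytecode = string.dump(compile(source))

-- Loading bytecode skips parsing; the chunk loader accepts binary chunks too.
local font = compile(bytecode)()

print(font.ascender) -- 750
\stoptyping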

\subject{token handlers}

Do we need to handle tokens? So far in experimental \MKIV\ code we only used these hooks
to demonstrate what \TEX\ does with your characters. For a while we also constructed
token lists when we wanted to inject \type {\pdfliteral} code in node lists, but that
became obsolete when automatic string to token conversion was introduced in the node
conversion code. Now we inject literal whatsit nodes. It may be worth noticing that
playing with token lists gave us some good insight into bottlenecks, because quite some
small table allocation and garbage collection goes on.

\subject{nodes and attributes}

These are the most promising new features. In themselves, nodes are not new, nor are
attributes. In some sense when we use primitives like \type {\hbox}, \type {\vskip},
\type {\lastpenalty} the result is a node, but we can only control and inspect their
properties within hard coded bounds. We cannot really look into boxes, and the last
penalty may be obscured by a whatsit (a mark, a special, a write, etc.). Attributes
could be faked with marks and macro based stacks of states. Native attributes
are more powerful and each node can carry a truckload of them.

With \LUATEX, all of a sudden we can look into \TEX's internals and manipulate
them. Although I don't claim to be a real expert on these internals, even after
over a decade of \TEX\ programming, I'm sometimes surprised by what I find there.
When we are playing with these interfaces, we often run into situations
where we need to add many print statements to the \LUA\ code in order to find
out what \TEX\ is returning. It all has to do with the way \TEX\ collects
information and when it decides to act. In regular \TEX\ much goes unnoticed, but
when one has for instance a callback that deals with page building there are many
places where this gets called and some of these places need special treatment.

Undoubtedly this will have a huge impact on \CONTEXT\ \MKIV. Instead of parsing
an input stream, we can now manipulate node lists in order to achieve (slight)
inter||character spacing which is often needed in sectioning titles. The nice
thing about this new approach is that we no longer have interference from
characters that need multiple tokens (input characters) in order to be
constructed, which complicates parsing (needed to split glyphs in \MKII).

Signaling where to letterspace is done with the mentioned attributes. There can be
many of them and they behave like fonts: they obey grouping, travel with the nodes
and are therefore insensitive to box and page splitting. They can be set at the
\TEX\ end but need to be handled at the \LUA\ side. One may wonder what kind
of macro packages would be around had \TEX\ had attributes right from its start.
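
The principle can be shown with a plain \LUA\ model of a node list. This is not
the \LUATEX\ node library: the list is an ordinary array, and the field names
\type {id}, \type {spaced} and \type {kern} as well as the function \type
{letterspace} are assumptions made for the sketch. The \type {spaced} field plays
the role of an attribute set at the \TEX\ end.

\starttyping
-- Insert a kern between two adjacent glyphs that both carry the attribute.
local function letterspace(list, amount)
    local result = { }
    for i, current in ipairs(list) do
        result[#result + 1] = current
        local nxt = list[i + 1]
        if nxt and current.id == "glyph" and nxt.id == "glyph"
               and current.spaced and nxt.spaced then
            result[#result + 1] = { id = "kern", kern = amount }
        end
    end
    return result
end

local title = {
    { id = "glyph", char = "T", spaced = true },
    { id = "glyph", char = "E", spaced = true },
    { id = "glyph", char = "X", spaced = true },
    { id = "glyph", char = "!" }, -- attribute not set, so no kern before it
}

local spaced = letterspace(title, 0.5)
print(#spaced) -- 6: four glyphs and two kerns
\stoptyping

In this model kerns only end up between glyphs, so nothing is added before the
first or after the last character.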

In \MKII\ letterspacing is handled by parsing the input and injecting skips.
Another approach would be to use a font where each character has more kerns or space
around it (a virtual font can do that). But that would not only demand knowledge of
what fonts need that treatment, but also many more fonts and generating them is
no fun for users. In \PDFTEX\ there is a letterspace feature, where virtual fonts
are generated on the fly, and with such an approach one has to compensate for the
first and last character in a line, in order to get rid of the left- and
rightmost added space (being part of the glyph). The solution where nodes are
manipulated does not put that burden upon the user.

Another example of node processing is adding specific kerns around some punctuation
symbols, as is customary in French. You don't want to know what it takes to do that
in traditional \TEX, but if I mention the fact that colons become active characters
you can imagine the nightmare. The hours of hacking, and maybe even days of dealing with
mechanisms that make these active colons workable in places where colons are used
for something other than text, look even more like wasted time if you consider that
it takes only a few lines of code in \MKIV. Currently we let \CONTEXT\ support good
old \TEX\ (represented by \PDFTEX), \XETEX\ (a \UNICODE\ and \OPENTYPE\ aware variant)
and \LUATEX\ by shared and dedicated \MKII\ and \MKIV\ code.

Vertical spacing can be a pain. Okay, currently \MKII\ has a rather sophisticated way to
deal with vertical spacing in ways that give documents a consistent look and feel, but
every now and then we run into border cases that cannot be dealt with simply because
we cannot look back in time. This is needed because \TEX\ adds content to the main
vertical list and then it's gone from our view. Take for instance section titles. We don't
want them dangling at the bottom of a page. But at the same time we want itemized lists
to look good, i.e.\ keep items together in some situations. Graphics that follow a section
title pose similar problems. Adding penalties helps but these may come too late, or
even worse, they may obscure previous skips which then cannot be dealt with by successive
skips. To simplify the problem: take a skip of 12pt, followed by a penalty, followed by
another skip of 24pt. In \CONTEXT\ this has to become a penalty followed by one skip
of 24pt.
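
The simplified case can be written down as a small function over a list of items.
The sketch assumes one possible rule, namely that the largest skip wins and that
the penalty moves to the front; the real \CONTEXT\ logic is more involved, and the
names \type {collapse}, \type {kind} and \type {amount} are invented.

\starttyping
-- Collapse a run of skips and penalties into at most one penalty and one skip.
local function collapse(items)
    local penalty, skip
    for _, item in ipairs(items) do
        if item.kind == "penalty" then
            penalty = item -- assumption: the last penalty seen is kept
        elseif item.kind == "skip" then
            if not skip or item.amount > skip.amount then
                skip = item -- assumption: the largest skip wins
            end
        end
    end
    local result = { }
    if penalty then result[#result + 1] = penalty end
    if skip then result[#result + 1] = skip end
    return result
end

-- 12pt skip, penalty, 24pt skip (amounts in points; penalty value made up)
local result = collapse {
    { kind = "skip", amount = 12 },
    { kind = "penalty", amount = 10000 },
    { kind = "skip", amount = 24 },
}

print(#result, result[1].kind, result[2].amount)
\stoptyping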

Dealing with this in the page builder is rather easy. Okay, due to the way \TEX\ adds
content to the page stream, we need to collect, treat and flush, but currently this
works all right. In \CONTEXT\ \MKIV\ we will have skips with three additional properties:
priority over other skips, penalties, and a category (think of: ignore, force,
replace, add).

When we experimented with these kinds of things we quickly decided that additional
experiments with grid snapping also made sense. These mechanisms are among the more
complex ones in \CONTEXT. A simple snap feature took a few lines of \LUA\ code and
hooking it into \MKIV\ was not that complex either. Eventually we will reimplement
all vertical spacing and grid snapping code of \MKII\ in \LUA. Because one of
\CONTEXT's column mechanisms is grid aware, we may as well adapt that and|/|or implement
an additional mechanism.
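
The core of a simple snap can be tiny. In this sketch, which is not the \MKIV\
implementation, a height is rounded up to the next multiple of the grid distance;
the function name \type {snap} and the use of plain numbers for dimensions are
assumptions.

\starttyping
-- Round a height up to the next multiple of the grid step (same unit for both).
local function snap(height, step)
    return math.ceil(height / step) * step
end

print(snap(30, 12)) -- 36
print(snap(24, 12)) -- 24, already on the grid
\stoptyping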

A side effect of being able to do this in \LUATEX\ is that the code taken from \PDFTEX\
is cleaned up: all (recently added) static kerning code is removed (inter||character
spacing, pre- and post character kerning, experimental code that can fix the heights
and depths of lines, etc.). The core engine will only deal with dynamic features,
like \HZ\ and protruding.

So, the impact on \MKIV\ of nodes and attributes is pretty big! Horizontal spacing issues,
vertical spacing and grid snapping are just a few of the things we will reimplement. Other
things are line numbering and multiple content streams with synchronization; both are
already present in \MKII\ but we can do a better job in \MKIV.

\subject{generic code}

In the previous text \MKIV\ was mentioned often, but some of the features are rather
generic in nature. So, how generically can interfaces be implemented? When the \MKIV\ code
has matured, much of the \LUA\ and glue||to||\TEX\ code will be generic in nature.
Eventually \CONTEXT\ will become a top layer on what we internally call \METATEX, a
collection of kernel modules that one can use to build specialized macro packages.
To some extent \METATEX\ can be for \LUATEX\ what plain is for \TEX. But whether and how
fast this will become reality depends on the amount of time that we (and other members of
the \CONTEXT\ development team) can allocate to this.

\stopcomponent