% language=us
% mk-breakingapart.tex (size: 14 Kb, last modification: 2023-12-21 09:43)

\startcomponent mk-breakingapart

\environment mk-environment

\chapter{Breaking apart}

[todo: mention changes to hyphenchar etc]

Because the long|-|term objective is to have control over all aspects of the
typesetting, quite some effort went into opening up one of the cornerstones
of \TEX: breaking paragraphs into lines. And because this is closely related
to hyphenating words, this effort also meant that we had to deal with ligature
building and kerning.

This is best explained with an example. Imagine that we have the following
sentence:\footnote {The World Without Us, Alan Weisman; a quote from Richard
Thomson in the chapter Polymers are Forever.}

\startnarrower \setupalign[nothyphenated]
We imagined it was being ground down smaller and smaller, into a kind of
powder. And we realized that smaller and smaller could lead to bigger and
bigger problems.
\stopnarrower

With the current language settings for US English this can be hyphenated
as follows:

\startnarrower
{\forgetall \hyphenatedpar{We imagined it was being ground down smaller and
smaller, into a kind of powder. And we realized that smaller and smaller
could lead to bigger and bigger problems.}}
\stopnarrower

So, when breaking a paragraph into lines, \TEX\ has a few options, but here
actually not that many. If we permit two|-|character snippets, we can get:

\startnarrower \lefthyphenmin=2 \righthyphenmin=2
{\forgetall \hyphenatedpar{We imagined it was being ground down smaller and
smaller, into a kind of powder. And we realized that smaller and smaller
could lead to bigger and bigger problems.}}
\stopnarrower

If we revert to UK English, we get:

\startnarrower
{\forgetall \uk \hyphenatedpar{We imagined it was being ground down smaller and
smaller, into a kind of powder. And we realized that smaller and smaller
could lead to bigger and bigger problems.}}
\stopnarrower

or, more tolerant,

\startnarrower \lefthyphenmin=2 \righthyphenmin=2
{\forgetall \uk \hyphenatedpar{We imagined it was being ground down smaller and
smaller, into a kind of powder. And we realized that smaller and smaller
could lead to bigger and bigger problems.}}
\stopnarrower

or with Dutch patterns:

\startnarrower
{\forgetall \nl \hyphenatedpar{We imagined it was being ground down smaller and
smaller, into a kind of powder. And we realized that smaller and smaller
could lead to bigger and bigger problems.}}
\stopnarrower

The code in traditional \TEX\ that deals with hyphenation and linebreaks is rather
interwoven. There is a relationship between the font encoding and the way patterns
are encoded. A few years after \TEX\ was written, support for multiple languages was
added, which resulted in a mix of (kind of global) language settings (no nodes) and
language nodes in the node lists. Traditionally it roughly works as follows:

\startitemize

\item The input \type {We imagined it} is tokenized and turned into glyph nodes. If
non|-|\ASCII\ characters are used (like precomposed accented characters) there may be
a translation step: macros or active characters can insert \type {\char} commands or
map onto other characters, for instance input byte 123 can become byte 198 which in
turn ends up as a reference in a glyph node to a font slot. Whatever method is used to
go from input to glyph node, eventually we have a reference to a position in a font.
Unfortunately we had only 256 such slots per font.

\item When it's time to break a paragraph into lines, traditional \TEX\ walks over
the list, reconstructs words and inserts hyphenation points. In the process,
inter|-|character kerns that are already injected need to be removed and reinserted,
and ligatures have to be decomposed and recomposed. The magic of hyphenation is
controlled by discretionary nodes. These specify what to do when a word is hyphenated.
Take for instance the Dutch word \type {effe} which hyphenated becomes \type {ef-fe}
so the \type {ff} either stays, or is split into \type {f-} and \type {f}.

\item Because a glyph node is bound to a font, there is a relationship with the
font encoding. Because there is no one 8-bit encoding that suits all languages, we
may end up with several instances of a font in one document (used for different
languages) and each time we switch language and|/|or font, we also have to enable
a suitable set of patterns (in a matching encoding).

\stopitemize
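
As an illustration of the second item: the choice that the hyphenator records for
\type {effe} can also be written by hand with the \type {\discretionary} primitive.
This is only a sketch of what normally gets inserted automatically. The three
arguments are the pre|-|break text, the post|-|break text and the text that is used
when no break occurs.

\starttyping
% explicit discretionary for the Dutch word "effe":
% \discretionary{pre-break}{post-break}{no-break}
e\discretionary{f-}{f}{ff}e
\stoptyping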

You can imagine that this may lead to moderately complex mechanisms in macro packages.
For instance, in \CONTEXT, to each language multiple font encodings can be bound and
a switch of fonts (with related encoding) also results in a switch to a suitable set
of patterns. But in \MKIV\ things are done differently.

First of all, we got rid of font encodings by exclusively using \UNICODE. We already
were using \UTF\ encoded patterns (so that we could load them under different font
encodings) so fewer patterns had to be loaded per language. That happened even before
the \LUATEX\ development arrived at hyphenation.

Before that effort started, Taco and I already played a bit with alternative
hyphenation methods. For instance, we took large word lists with hyphenation points
inserted. Taco wrote a loader (\LUA\ could not handle the large tables as function
return value) and I made some hyphenation code in \LUA. Surprisingly we found out that
it was pretty efficient, although we didn't have the weighted hyphenation points
that patterns may provide. Basically we simulated the \type {\hyphenation} command.

While we went back to fonts, Taco's colleague Nanning wrote the first version of a new
hyphenation storage mechanism, so when about half a year later we were ready to deal with the
linebreak mechanisms, one of the key components was more or less ready. Where fonts forced me to
write quite some \LUA\ code (still not finished), the new hyphenation
mechanisms could be supported rather easily, if only because the framework was already
kind of present (written during the experiments). Even better, when splitting the old
code into \MKII\ and new \MKIV\ code, I could do most housekeeping in \LUA, and only
needed a minimal amount of \TEX\ interfacing (partly redundant because of the shared
interface). The new mechanism also was no longer bound to the format, which means
that we could postpone loading of the patterns to runtime. Instead of the still
supported traditional loading of patterns and exceptions, we load them under \LUA\
control. This gave me yet another nice exercise in using \type {lpeg} (\LUA's string
parser).

With a new pattern loader in place, Taco started separating the hyphenation, ligature
building and kerning. Each stage now has its own callback and each stage has an
associated \LUA\ function, so that one can create a different order of execution or
integrate it in other node parsing activities, most notably the handling of
\OPENTYPE\ features.
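
A minimal sketch of how the three stages can be hooked into, assuming a plain
\LUATEX\ run (in \CONTEXT\ the callbacks are managed by the macro package) and the
callback and function names as given in the \LUATEX\ reference manual. Each callback
simply calls the corresponding built|-|in function, so the result should match the
default behaviour. The comments are \TEX\ comments and never reach \LUA, and there
are no empty lines inside \type {\directlua} because these would end up as
\type {\par} tokens.

\starttyping
\directlua {
    % stage 1: insert discretionary nodes
    callback.register("hyphenate", function(head,tail)
        lang.hyphenate(head,tail)
    end)
    % stage 2: build ligatures
    callback.register("ligaturing", function(head,tail)
        node.ligaturing(head,tail)
    end)
    % stage 3: insert font kerns
    callback.register("kerning", function(head,tail)
        node.kerning(head,tail)
    end)
}

We imagined it was being ground down smaller and smaller.

\bye
\stoptyping

Swapping the function bodies, or calling other node manipulating code before or
after them, is then a matter of changing these three functions.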

When I was trying to integrate this into the already existing node processing sequences,
some nasty tricks were needed in order to feed the hyphenation function. At that
moment it was still partly modelled after the traditional \TEX\ way, which boiled down
to the following. As soon as the hyphenation function is invoked, it needs to know what
the current language is. This information is not stored in the node list, only mid
paragraph language switches are stored. Due to the fact that much information in \TEX\
is global (well, in \LUATEX\ less and less) this complicates matters. Because in \MKIV\
hyphenation, ligature building and kerning are done differently (due to \OPENTYPE) we
used the hyphenation callback to collect the language parameters so that we could use
them when we called the hyphenation function later. This can definitely be qualified as
an ugly hack.

Before we discuss how this was solved, we summarize the state of affairs. In \LUATEX\
we now have a sequence of callbacks related to paragraph building and in between not
much happens any more.

\startitemize[packed]
\item hyphenation
\item ligaturing
\item kerning
\item preparing linebreaking
\item linebreaking
\item finishing linebreaking
\stopitemize

Before, we only had:

\startitemize[packed]
\item preparing linebreaking
\stopitemize

and this is where \MKIV\ hooks in its code. The first three are disabled by
associating them with dummy functions. I'm still not sure how the last two will
fit in, especially because there is some interplay between \OPENTYPE\ features
and linebreaking, like alternative glyphs at the end of the line. Because the
\HZ\ and protruding mechanisms also will be supported we may as well end up with
a mechanism for alternative glyphs built into the linebreak algorithm.

Back to the current situation. What made matters even more complicated was the
fact that we need to manipulate node lists while building horizontal material
(hpacking) as well as for paragraphs (pre|-|linebreaking). Compare the following
two situations. In the first case the hbox is packaged and hyphenation is not
needed.

\starttyping
text \hbox {text} text
\stoptyping

However, when we unbox the content, hyphenation needs to be applied.

\starttyping
\setbox0=\hbox{text} text \unhbox0\ text
\stoptyping
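
One way to cover both situations, sketched here for plain \LUATEX\ and not taken
from the \MKIV\ sources, is to switch off the three stage callbacks with dummy
functions and to register one shared handler for both the \type {hpack_filter} and
the \type {pre_linebreak_filter} callback. The names \type {dummy} and \type
{process} are made up for this example, and protecting the glyphs at the end is just
one possible way to tag them as already treated.

\starttyping
\directlua {
    % disable the built-in stages by registering dummy functions
    local function dummy(head,tail) end
    callback.register("hyphenate",  dummy)
    callback.register("ligaturing", dummy)
    callback.register("kerning",    dummy)
    % shared handler (hypothetical name): run the stages ourselves
    local function process(head)
        lang.hyphenate(head)
        head = node.ligaturing(head)
        head = node.kerning(head)
        node.protect_glyphs(head) % tag the glyphs as processed
        return head
    end
    % the extra arguments that both callbacks pass are ignored
    callback.register("hpack_filter",         process)
    callback.register("pre_linebreak_filter", process)
}

text \hbox {text} text

\setbox0=\hbox{text} text \unhbox0\ text

\bye
\stoptyping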

[I need to check the next]

Traditional \TEX\ does not look at all potential hyphenation points, but only around
places that have a high probability as line|-|end. \LUATEX\ just hyphenates the whole
list. Although the function can be used selectively over a range, in \MKIV\ we see no
reason for this and hyphenate whole lists.

The new hyphenation routine not only operates on the whole list, but can also be made
transparent for uppercase characters. Because we assume \UNICODE, lowercase codes are
no longer stored with the patterns (an \ETEX\ extension). The usual left- and
righthyphenmin control is still there. The first word of a paragraph is no longer
ignored in the process.

Because the stages are separated now, the opportunity was there to separate between
characters and glyphs. As with traditional \TEX, only characters are taken into
account when hyphenating, so how do we distinguish between the two? The subtype (a
property of each node) already registered if we were dealing with a ligature or not.
Taco and Nanning had decided to treat the subtype as a bitset and after a bit of
testing and Skyping we came to the conclusion that we needed an easy way to tag a
glyph node as being \quote {already processed}. Keep in mind that as in the unhboxed
example, the unhboxed content is already treated (hpack callback). If you wonder why
we have these two moments of treatment think of this: if you put something in a box
and want to know its dimensions, all font related features need to be applied. If the
box is inserted as is, it can be recognized (an hlist or vlist node) and safely skipped
in the prelinebreak handling. However, when it is unhboxed, we want to avoid
reprocessing. Normally reprocessing will be prevented because the glyph nodes are
mixed with kerns and ligatures are already built, but we can best play safe.
Once we're done with processing a list (which can involve many passes, depending on
what treatment is needed) we can tag the glyph nodes as \quote {done} by adding 256
to the subtype. We can then test on this property in callbacks while at the same time
built-in functions like those responsible for hyphenation ignore this high bit.

The transition from character to glyph is also done by changing bits in the subtype.
At some point we need to set the subtype so that it reflects the node being a glyph,
ligature or other special type (there are a few more types inherited from Omega). I
know that this all sounds complicated, but in \MKIV\ we now roughly do the following
(of course this may and probably will change):

\startitemize[packed]
\item attribute driven manipulations (for instance case change)
\item language driven manipulations (spell checking, hyphenation)
\item font driven treatments, mostly features (ligature building, kerning)
\item turn characters into glyphs (so that they will not be hyphenated again)
\item normal ligaturing routine (currently still needed for non|-|\OPENTYPE\ fonts, may
      become obsolete)
\item normal kerning routine (currently still needed for non|-|\OPENTYPE\ fonts, may
      become obsolete)
\item attribute driven manipulations (special spacing and kerning)
\stopitemize

When no callbacks are used, turning characters into glyphs happens automatically behind
the scenes. When using callbacks (as in \MKIV) this needs to be done explicitly
(but there is a helper function for this).
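
In current \LUATEX\ versions this helper is presumably \type {node.protect_glyphs}
(with \type {node.unprotect_glyphs} as its counterpart). That identification is an
assumption based on the \LUATEX\ reference manual. A small sketch for plain \LUATEX\
that writes the subtype of a freshly created glyph node to the log, before and after
the call:

\starttyping
\directlua {
    local g = node.new("glyph")
    texio.write_nl("subtype before: " .. g.subtype)
    % turn the character into a glyph by setting the high bit
    node.protect_glyphs(g)
    texio.write_nl("subtype after: " .. g.subtype)
    node.free(g)
}

\bye
\stoptyping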

So, by now \LUATEX\ can determine which glyph nodes play a role in hyphenation but still
we have this \quote {what language are we in} problem. As usual in the development of
\LUATEX, these fundamental changes took place in a setting where Taco and I are in a
persistent state of Skyping, and it did not take much time to decide that in order to
make the callbacks usable, it made sense to move the language related information
to the glyph node as well, i.e.\ the number of the language object (patterns and
exceptions), the left and right min values, and the boolean that tells how to treat
uppercase characters. Each is now accessible in the usual way (by key). The penalty in
additional memory is zero because it's stored along with the subtype bitset. By going this
route, the ugly hack mentioned before could be removed as well.
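
A sketch of what \quote {by key} can look like, again for plain \LUATEX. The field
names \type {lang}, \type {left}, \type {right} and \type {uchyph} are those given in
the \LUATEX\ reference manual, so they are assumptions as far as this text goes. The
callback only reports the values in the log and leaves the list as it is.

\starttyping
\directlua {
    local glyph_id = node.id("glyph")
    callback.register("pre_linebreak_filter", function(head)
        for g in node.traverse_id(glyph_id,head) do
            % language number, left and right minima, uppercase flag
            texio.write_nl(
                "char "    .. tostring(g.char)  ..
                " lang "   .. tostring(g.lang)  ..
                " left "   .. tostring(g.left)  ..
                " right "  .. tostring(g.right) ..
                " uchyph " .. tostring(g.uchyph)
            )
        end
        return true % leave the list untouched
    end)
}

\lefthyphenmin=2 \righthyphenmin=3 We imagined it.

\bye
\stoptyping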

In the process of finalizing the code, discretionary nodes got a slightly different
implementation. Originally they were organized as follows (ff is a ligature):

\starttyping
con-text == [c][o](pre=n-,post=,replace=1)[n][t][e][x][t]
effe     == [e](pre=f-,post=f,replace=1)[ff][e]
\stoptyping

So, a discretionary node contained information about what to put at the end of the broken
line and what to put in front of the next line, as well as the number of following nodes
in the list to skip when such a linebreak occurred. Because this leads to rather messy code,
especially when ligatures are involved, the decision was made to change the replacement
counter into a node list holding those (optionally) to be replaced nodes.

\starttyping
con-text == [c][o](pre=n-,post=,replace=n)[t][e][x][t]
effe     == [e](pre=f-,post=f,replace=ff)[e]
\stoptyping
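
With the replacement stored as a node list, a discretionary can be inspected like any
other list. The following sketch for plain \LUATEX\ assumes the field names \type
{pre}, \type {post} and \type {replace} from the \LUATEX\ reference manual and writes
the character codes found in each of the three lists to the log. The function \type
{codes} is a helper made up for this example.

\starttyping
\directlua {
    local glyph_id = node.id("glyph")
    local disc_id  = node.id("disc")
    % hypothetical helper: character codes in a (possibly empty) list
    local function codes(list)
        local t = { }
        if list then
            for g in node.traverse_id(glyph_id,list) do
                table.insert(t,g.char)
            end
        end
        return table.concat(t," ")
    end
    callback.register("pre_linebreak_filter", function(head)
        for d in node.traverse_id(disc_id,head) do
            texio.write_nl(
                "pre: "      .. codes(d.pre)  ..
                " post: "    .. codes(d.post) ..
                " replace: " .. codes(d.replace)
            )
        end
        return true % leave the list untouched
    end)
}

e\discretionary{f-}{f}{ff}e context

\bye
\stoptyping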

This is much cleaner, but a consequence of this change was that all \MKIV\ node manipulation
code written so far had to be reviewed.

Of course we need to spend a few words on performance. We keep doing performance tests
but currently we only remove bottlenecks that bother us. Later in the development
optimization will take place in the code. One reason is that the code changes, another
reason is that large portions of \PASCAL\ code are turned into \CCODE. Because
integrating these changes (apart from preparations) took place within a few weeks, we
could reasonably well compare the old and the new hyphenation mechanisms using our
(evolving) manuals and surprisingly the performance was certainly not worse than before.

\stopcomponent