\environment luametatexstyle

\startdocument[title=Languages]

\startsection[title={Introduction}]
8
Although languages play an important role in a macro package, that doesn't mean
that \TEX\ is busy with them. The engine only needs to know how to hyphenate, and
for that a number that identifies which patterns to use is sufficient. All the
action happens in the hyphenator: which characters make up words, how many
characters are kept at the left and right, which symbols end up at the end or
beginning of a line, which input sequences combine into (normally) dashes, how we
penalize a hyphenation point, etc.

Where in regular \TEX\ we have special nodes that signal a language switch, and
some shared variables that determine the details mentioned above, in \LUATEX\
every glyph carries the language information, including those minima. In
\LUAMETATEX\ we put even more in a glyph by using a bitset of options. We also
have some more character code bound properties. The \LUATEX\ engines store the
current state in the glyph and discretionary nodes.

You can find more practical information about languages in the \CONTEXT\ manuals
than in this document because users seldom go low level. Before we discuss these
low level aspects anyway, we discuss how we got this far; for that we borrow from
the \LUATEX\ and \LUAMETATEX\ manuals.
28
29\stopsection
30
31\startsection[title={Evolution}]
32
\LUATEX's internal handling of the characters and glyphs that eventually become
typeset is quite different from the way \TEX82 handles those same objects. The
easiest way to explain the difference is to focus on unrestricted horizontal mode
(i.e.\ paragraphs) and hyphenation first. Later on, it will be easy to deal
with the differences that occur in horizontal and math modes.
38
39In \TEX82, the characters you type are converted into \type {char} node records
40when they are encountered by the main control loop. \TEX\ attaches and processes
41the font information while creating those records, so that the resulting \quote
42{horizontal list} contains the final forms of ligatures and implicit kerning.
43This packaging is needed because we may want to get the effective width of for
44instance a horizontal box. No hyphenation is needed in that case.
45
When it becomes necessary to hyphenate words in a paragraph, \TEX\ converts (one
word at a time) the \type {char} node records into a string by replacing ligatures
with their components and ignoring the kerning. Then it runs the hyphenation
algorithm on this string, and converts the hyphenated result back into a \quote
{horizontal list} that is subsequently spliced back into the paragraph stream.
Keep in mind that the paragraph may contain unboxed horizontal material, which
then already contains ligatures and kerns, and the words therein are part of the
hyphenation process.

Let's stress this: before \LUATEX, ligaturing and kerning took place during input,
and hyphenation, combined with temporarily juggling ligatures and kerns, took
place while building the paragraph. It's a selective process where hyphenation
only takes place where it is expected to influence the line breaks.
59
60Those \type {char} node records are somewhat misnamed, as they are glyph
61positions in specific fonts, and therefore not really \quote {characters} in the
62linguistic sense. In \TEX82 there is no language information inside the \type
63{char} node records at all. Instead, language information is passed along using
64\type {language whatsit} nodes inside the horizontal list.
65
In \LUATEX\ and thereby \LUAMETATEX\ the situation is quite different. The
characters you type are always converted into \type {glyph} node records with a
special subtype to identify them as being intended as linguistic characters.
\LUATEX\ stores the needed language information in those records, but does not do
any font|-|related processing at the time of node creation. It only stores the
index of the current font and a reference to a character in that font.

When it becomes necessary to typeset a paragraph, \LUATEX\ first inserts all
hyphenation points right into the whole node list. Next, it processes all the
font information in the whole list, creating ligatures and adjusting kerning, and
finally it adjusts all the subtype identifiers so that the records are \quote
{glyph nodes} from now on. Actually, in \LUAMETATEX\ the subtype is no longer used
to store the state but that is not relevant here.

In \LUAMETATEX\ we also have this separation but there is more control over when
hyphenation is applied, what becomes en and em dashes, how penalties kick in,
etc. There are some additional callbacks that can manipulate words as they are
encountered and exceptions can be handled differently.
84
85\stopsection
86
87\startsection[title={Characters, glyphs and discretionaries},reference=charsandglyphs]
88
\TEX82 (including \PDFTEX) differentiates between \type {char} nodes and \type
{lig} nodes. The former are simple items that contain nothing but a \quote
{character} and a \quote {font} field, and they live in the same memory as
tokens do. The latter also contain a list of components and a subtype
indicating whether the ligature is the result of a word boundary, and they are
stored in the same place as other nodes like boxes, kerns and glue.

In \LUAMETATEX\ we no longer keep the list of components with the glyph node
because we have to deal with more advanced scenarios in \quote {node mode}, for
instance attaching vowels to stepwise constructed ligatures. Also, in
\OPENTYPE\ ligatures are just a many to one mapping, and the kind of ligatures
that we see in \TEX\ fonts are in \OPENTYPE\ often achieved by kerning
substituted single glyphs.

In \LUATEX, these two types are merged into one, somewhat larger structure called
a \type {glyph} node. Besides the old character, font, and component fields
there are a few more, like \quote {attr}. These nodes also contain a subtype
that codes four main types and two additional ghost types. For ligatures,
multiple bits can be set at the same time (in case of a single|-|glyph word).
109
\startitemize
    \startitem
        \type {character}, for characters to be hyphenated: the lowest bit
        (bit 0) is set to 1.
    \stopitem
    \startitem
        \type {glyph}, for specific font glyphs: the lowest bit (bit 0) is
        not set.
    \stopitem
    \startitem
        \type {ligature}, for constructed ligatures: bit 1 is set.
    \stopitem
\stopitemize

But while \TEX82 has this construct, deconstruct and reconstruct model, in
\LUATEX\ we don't do that, so in the end this made little sense and we dropped
it. We still have a (small) protection field that fulfills the job of signaling
that we're done with processing glyphs.
128
We now arrive at languages. The \type {glyph} nodes also contain language data,
split into four items that were current when the node was created: the \type
{\setlanguage} (15~bits), \type {\lefthyphenmin} (8~bits), \type {\righthyphenmin}
(8~bits), and \type {\uchyph} (1~bit). In \LUAMETATEX\ we just use small
dedicated fields instead.

Incidentally, \LUATEX\ allows 16383 separate languages, and words can be 256
characters long. The language is stored with each character. You can set
\type {\firstvalidlanguage} to for instance~1 and thereby make language~0
an ignored hyphenation language. In \LUAMETATEX\ we have a more reasonable
allowance because we don't expect that many languages in one document, but we do
permit longer words.
141
142The new primitive \type {\hyphenationmin} can be used to signal the minimal length
143of a word. This value is stored with the (current) language.
144
145Because the \type {\uchyph} value is saved in the actual nodes, its handling is
146subtly different from \TEX82: changes to \type {\uchyph} become effective
147immediately, not at the end of the current partial paragraph. But this is true
148for more properties: for instance we store a penalty in a discretionary node and
149freeze glue in spaces, of course all at the price of using more memory.
150
Typeset boxes now always have their language information embedded in the nodes
themselves, so there is no longer a possible dependency on the surrounding
language settings. In \TEX82, a mid|-|paragraph statement like \type {\unhbox0}
would process the box using the current paragraph language unless there was a
\type {\setlanguage} issued inside the box. In \LUATEX, all language variables
are already frozen. Also, every list is hyphenated so that the font handler can
do its job taking that into account.
158
159In traditional \TEX\ the process of hyphenation is driven by \type {\lccode}s. In
160\LUATEX\ we made this dependency less strong. There are several strategies
161possible. When you do nothing, the currently used \type {\lccode}s are used, when
162loading patterns, setting exceptions or hyphenating a list.
163
164When you set \type {\savinghyphcodes} to a value greater than zero the current set
165of \type {\lccode}s will be saved with the language. In that case changing a \type
166{\lccode} afterwards has no effect. However, you can adapt the set with:
167
168\starttyping
\hjcode`a=`a
170\stoptyping
171
172This change is global which makes sense if you keep in mind that the moment that
173hyphenation happens is (normally) when the paragraph or a horizontal box is
174constructed. When \type {\savinghyphcodes} was zero when the language got
175initialized you start out with nothing, otherwise you already have a set.
176
When a \type {\hjcode} is greater than 0 but less than 32 the value indicates the
length to be used. In the following examples we map a character (\type {x}) onto
another one in the patterns and tell the engine that \type {œ} counts as two
characters. Because traditionally zero itself is reserved for inhibiting
hyphenation, a value of 32 counts as zero.
182
183Here are some examples (we assume that French patterns are used):
184
\starttabulate[|l|l|r|]
\NC                                  \NC \type{foobar} \NC \type{foo-bar} \NC \NR
\NC \type{\hjcode`x=`o}              \NC \type{fxxbar} \NC \type{fxx-bar} \NC \NR
\NC \type{\lefthyphenmin3}           \NC \type{œdipus} \NC \type{œdi-pus} \NC \NR
\NC \type{\lefthyphenmin4}           \NC \type{œdipus} \NC \type{œdipus}  \NC \NR
\NC \type{\hjcode`œ=2}               \NC \type{œdipus} \NC \type{œdi-pus} \NC \NR
\NC \type{\hjcode`i=32 \hjcode`d=32} \NC \type{œdipus} \NC \type{œdipus}  \NC \NR
192\NC
193\stoptabulate
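
As a small sketch, the individual settings from the table can be collected in
one place. The comments only restate what is said above, and the values are
illustrative rather than recommended:

\starttyping
\hjcode`x=`o % x is matched as o in the patterns
\hjcode`œ=2  % œ counts as two characters
\hjcode`i=32 % 32 counts as zero, which inhibits hyphenation
\stoptyping

Keep in mind that these assignments are global, so they are best done once at
the moment the language is set up.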
194
195Carrying all this information with each glyph would give too much overhead and
196also make the process of setting up these codes more complex. A solution with
197\type {\hjcode} sets was considered but rejected because in practice the current
198approach is sufficient and it would not be compatible anyway.
199
200Beware: the values are always saved in the format, independent of the setting
201of \type {\savinghyphcodes} at the moment the format is dumped.
202
We also have \type {\hccode} or hyphen code. A character can be set to non zero
to indicate that it should be regarded as a visible hyphenation point. These
examples show how that works (it is the second bit in \type {\hyphenationmode}
that does the magic but we set them all here):
207
208\startbuffer
209{\hsize 1mm \hccode"2014 \zerocount \hyphenationmode "0000000 xxx\emdash xxx \par}
210{\hsize 1mm \hccode"2014 "2014\relax \hyphenationmode "0000000 xxx\emdash xxx \par}
211
212{\hsize 1mm \hccode"2014 \zerocount \hyphenationmode "FFFFFFF xxx\emdash xxx \par}
213{\hsize 1mm \hccode"2014 "2014\relax \hyphenationmode "FFFFFFF xxx\emdash xxx \par}
214
215{\hyphenationmode "0000000 xxxxxxxxx \par}
216{\hyphenationmode "FFFFFFF xxxxxxxxx \par}
217\stopbuffer
218
219\typebuffer
220
221Here we assign the code point because who knows what future extensions will
222bring. As with the other codes you can also set them from \LUA. The feature is
223experimental and might evolve when \CONTEXT\ users come up with reasonable
224demands.
225
226\startpacked \getbuffer \stoppacked
227
228A boundary node normally would mark the end of a word which interferes with for
229instance discretionary injection. For this you can use the \type {\wordboundary}
230as a trigger. Here are a few examples of usage:
231
232\startbuffer
233discretediscrete
234\stopbuffer
235\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower
236\startbuffer
237discrete\discretionary{}{}{}discrete
238\stopbuffer
239\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower
240\startbuffer
241discrete\wordboundary\discretionary{}{}{}discrete
242\stopbuffer
243\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower
244\startbuffer
245discrete\wordboundary\discretionary{}{}{}\wordboundary discrete
246\stopbuffer
247\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower
248\startbuffer
249discrete\wordboundary\discretionary{}{}{}\wordboundary discrete
250\stopbuffer
251\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower
252
We only accept an explicit hyphen when there is a preceding glyph, and we skip a
sequence of explicit hyphens since that normally indicates a \type {--} or \type
{---} ligature. Because these dashes are ligatures in base fonts, handling such a
sequence could in the worst case result in bad node lists later on, due to messed
up ligature building. This is a side effect of separating the hyphenation,
ligaturing and kerning steps.

The start and end of a sequence of characters is signalled by a \type {glue}, \type
{penalty}, \type {kern} or \type {boundary} node. But by default also a \type
{hlist}, \type {vlist}, \type {rule}, \type {dir}, \type {whatsit}, \type {insert}, and
\type {adjust} node indicates a start or end. You can omit the last set from the
test by setting flags in \type {\hyphenationmode}:
265
266\starttworows
267\getbuffer[engine:syntax:hyphenationcodes]
268\stoptworows
269
270The word start is determined as follows:
271
\starttabulate[|l|l|]
\FL
\BC node      \BC behaviour \NC \NR
\TL
\BC boundary  \NC yes when wordboundary \NC \NR
\BC hlist     \NC when the start bit is set \NC \NR
\BC vlist     \NC when the start bit is set \NC \NR
\BC rule      \NC when the start bit is set \NC \NR
\BC dir       \NC when the start bit is set \NC \NR
\BC whatsit   \NC when the start bit is set \NC \NR
\BC glue      \NC yes \NC \NR
\BC math      \NC skipped \NC \NR
\BC glyph     \NC exhyphenchar (one only): yes (so not for \type {--} or \type {---}) \NC \NR
\BC otherwise \NC yes \NC \NR
\LL
\stoptabulate
288
289The word end is determined as follows:
290
\starttabulate[|l|l|]
292\FL
293\BC node \BC behaviour \NC \NR
294\TL
295\BC boundary \NC yes \NC \NR
296\BC glyph \NC yes when different language \NC \NR
297\BC glue \NC yes \NC \NR
298\BC penalty \NC yes \NC \NR
299\BC kern \NC yes when not italic (for some historic reason) \NC \NR
300\BC hlist \NC when the end bit is set \NC \NR
301\BC vlist \NC when the end bit is set \NC \NR
302\BC rule \NC when the end bit is set \NC \NR
303\BC dir \NC when the end bit is set \NC \NR
304\BC whatsit \NC when the end bit is set \NC \NR
305\BC ins \NC when the end bit is set \NC \NR
306\BC adjust \NC when the end bit is set \NC \NR
307\LL
308\stoptabulate
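
The next lines are a minimal sketch of how the two strict flags could be
enabled locally. It assumes that neither flag is set yet, so that adding the
value is the same as setting the bit, and that the constants \type
{\strictstarthyphenationcode} and \type {\strictendhyphenationcode} can be used
where a number is expected. The test words are just an illustration.

\starttyping
\bgroup
    % assumption: both bits are still unset in the current mode
    \advance\hyphenationmode \strictstarthyphenationcode
    \advance\hyphenationmode \strictendhyphenationcode
    one\null two \par
\egroup
\stoptyping

Because the mode is changed inside a group, the previous value is restored
after the paragraph has been built.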
309
310\in {Figures} [hb:1] upto \in [hb:5] show some examples. In all cases we set the
311min values to 1 and make sure that the words hyphenate at each character.
312
313\hyphenation{one two}
314
\def\SomeTest#1#2%
  {\lefthyphenmin  \plusone
   \righthyphenmin \plusone
   \parindent      \zeropoint
   \everypar       \emptytoks
   \dontcomplain
   \hbox to 2cm {%
     \vtop {%
       \hsize 1pt
       \advance\hyphenationmode#1\relax
       #2
       \par}}}
327
328\startplacefigure[reference=hb:1,title={\type{one}}]
329 \startcombination[4*1]
330 {\SomeTest {0}{one}} {\type {0}}
331 {\SomeTest {64}{one}} {\type {64}}
332 {\SomeTest{128}{one}} {\type{128}}
333 {\SomeTest{192}{one}} {\type{192}}
334 \stopcombination
335\stopplacefigure
336
337\startplacefigure[reference=hb:2,title={\type{one\null two}}]
338 \startcombination[4*1]
339 {\SomeTest {0}{one\null two}} {\type {0}}
340 {\SomeTest {64}{one\null two}} {\type {64}}
341 {\SomeTest{128}{one\null two}} {\type{128}}
342 {\SomeTest{192}{one\null two}} {\type{192}}
343 \stopcombination
344\stopplacefigure
345
346\startplacefigure[reference=hb:3,title={\type{\null one\null two}}]
347 \startcombination[4*1]
348 {\SomeTest {0}{\null one\null two}} {\type {0}}
349 {\SomeTest {64}{\null one\null two}} {\type {64}}
350 {\SomeTest{128}{\null one\null two}} {\type{128}}
351 {\SomeTest{192}{\null one\null two}} {\type{192}}
352 \stopcombination
353\stopplacefigure
354
355\startplacefigure[reference=hb:4,title={\type{one\null two\null}}]
356 \startcombination[4*1]
357 {\SomeTest {0}{one\null two\null}} {\type {0}}
358 {\SomeTest {64}{one\null two\null}} {\type {64}}
359 {\SomeTest{128}{one\null two\null}} {\type{128}}
360 {\SomeTest{192}{one\null two\null}} {\type{192}}
361 \stopcombination
362\stopplacefigure
363
364\startplacefigure[reference=hb:5,title={\type{\null one\null two\null}}]
365 \startcombination[4*1]
366 {\SomeTest {0}{\null one\null two\null}} {\type {0}}
367 {\SomeTest {64}{\null one\null two\null}} {\type {64}}
368 {\SomeTest{128}{\null one\null two\null}} {\type{128}}
369 {\SomeTest{192}{\null one\null two\null}} {\type{192}}
370 \stopcombination
371\stopplacefigure
372
In traditional \TEX\ ligature building and hyphenation are interwoven with the
line break mechanism. In \LUATEX\ these phases are isolated. As a consequence we
deal differently with (a sequence of) explicit hyphens. We have already added
some control over aspects of the hyphenation, and yet another one concerns
automatic hyphens (e.g.\ \type {-} characters in the input).
378
379Hyphenation and discretionary injection is driven by a mode parameter which is
380a bitset made from the following values, some of which we saw in the previous
381examples.
382
\starttabulate[|rT|p|]
\NC \tohexadecimal \normalhyphenationcode           \NC honour (normal) \type{\discretionary}s \NC \NR
\NC \tohexadecimal \automatichyphenationcode        \NC turn \type {-} into (automatic) discretionaries \NC \NR
\NC \tohexadecimal \explicithyphenationcode         \NC turn \type {\-} into (explicit) discretionaries \NC \NR
\NC \tohexadecimal \syllablehyphenationcode         \NC hyphenate (syllable) according to language \NC \NR
\NC \tohexadecimal \uppercasehyphenationcode        \NC hyphenate uppercase characters too (replaces \type {\uchyph}) \NC \NR
\NC \tohexadecimal \compoundhyphenationcode         \NC permit break at an explicit hyphen (border cases) \NC \NR
\NC \tohexadecimal \strictstarthyphenationcode      \NC traditional \TEX\ compatibility wrt the start of a word \NC \NR
\NC \tohexadecimal \strictendhyphenationcode        \NC traditional \TEX\ compatibility wrt the end of a word \NC \NR
\NC \tohexadecimal \automaticpenaltyhyphenationcode \NC use \type {\automatichyphenpenalty} \NC \NR
\NC \tohexadecimal \explicitpenaltyhyphenationcode  \NC use \type {\explicithyphenpenalty} \NC \NR
\NC \tohexadecimal \permitgluehyphenationcode       \NC turn glue in discretionaries into kerns \NC \NR
\NC \tohexadecimal \permitallhyphenationcode        \NC okay, let's be even more tolerant in discretionaries \NC \NR
\NC \tohexadecimal \permitmathreplacehyphenationcode \NC and again we're more permissive \NC \NR
\NC \tohexadecimal \lazyligatureshyphenationcode    \NC controls how successive explicit discretionaries are handled in base mode \NC \NR
\NC \tohexadecimal \forcecheckhyphenationcode       \NC treat all discretionaries equal when breaking lines (in all three passes) \NC \NR
\NC \tohexadecimal \forcehandlerhyphenationcode     \NC kick in the handler (experiment) \NC \NR
\NC \tohexadecimal \feedbackcompoundhyphenationcode \NC feedback compound snippets \NC \NR
\stoptabulate
402
403Some of these options are still experimental, simply because not all aspects and
404side effects have been explored. You can find some experimental use cases in
405\CONTEXT.
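
Because the mode is a bitset, a value can be composed by adding the constants
from the table. The following is only a sketch: it assumes that these constants
behave like plain integers inside \type {\numexpr} (their use with \type
{\tohexadecimal} suggests so), and the selection of flags is an arbitrary
example, not a recommended default.

\starttyping
\hyphenationmode \numexpr
    \normalhyphenationcode    % honour \discretionary
  + \automatichyphenationcode % - becomes an automatic discretionary
  + \explicithyphenationcode  % \- becomes an explicit discretionary
  + \syllablehyphenationcode  % apply the language patterns
\relax
\stoptyping

Each constant occurs only once here, so the sum equals the bitwise combination
of the four flags.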
406
407There are also \type {\discretionaryoptions}. Some are set by the engine:
408
409\starttworows
410\getbuffer[engine:syntax:discoptioncodes]
411\stoptworows
412
413\stopsection
414
415\startsection[title={Controlling hyphenation}]
416
The \typ {\hyphenationmin} parameter can be used to set the minimal word length,
so setting it to a value of~$5$ means that only words of 6 characters and more
will be hyphenated, of course within the constraints of the \typ {\lefthyphenmin}
and \typ {\righthyphenmin} values (as stored in the glyph node). This primitive
accepts a number and stores the value with the language.
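
A minimal sketch of how these three parameters can be set together; the values
are chosen for illustration only:

\starttyping
\hyphenationmin 5 % stored with the current language
\lefthyphenmin  2 % stored in each glyph node
\righthyphenmin 3 % stored in each glyph node
\stoptyping

With these settings a word of five characters is left alone, while a longer
word keeps at least two characters before and three characters after a break.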
422
The \type {\noboundary} command used to inject a whatsit node but now injects
a normal node with type \type {boundary} and subtype~0. In addition you can say:
425
426\starttyping
427x\boundary 123\relax y
428\stoptyping
429
This has the same effect but the subtype is now~1 and the value~123 is stored.
The traditional ligature builder still sees this as a cancel boundary directive
but at the \LUA\ end you can implement different behaviour. The added benefit of
passing this value is a side effect of the generalization. The subtypes~2 and~3
are used to control protrusion and word boundaries in hyphenation and have
related primitives.
436
437\stopsection
438
439\startsection[title={The main control loop}]
440
In \LUATEX's main loop, almost all input characters that are to be typeset are
converted into \type {glyph} node records with subtype \quote {character}, but
there are a few exceptions.
444
445\startitemize[n]
446
447\startitem
    The \type {\accent} primitive creates nodes with subtype \quote {glyph}
    instead of \quote {character}: one for the actual accent and one for the
    accentee. The primary reason for this is that \type {\accent} in \TEX82 is
    explicitly dependent on the current font encoding, so it would not make much
    sense to attach a new meaning to the primitive's name, as that would
    invalidate many old documents and macro packages. A secondary reason is that
    in \TEX82, \type {\accent} prohibits hyphenation of the current word. Since
    in \LUATEX\ hyphenation only takes place on \quote {character} nodes, it is
    possible to achieve the same effect. Of course, modern \UNICODE\ aware macro
    packages will not use the \type {\accent} primitive at all but try to map
    directly on composed characters.

    This change of meaning did happen with \type {\char}, that now generates
    \quote {glyph} nodes with a character subtype. In traditional \TEX\ there was
    a strong relationship between the 8|-|bit input encoding, hyphenation and
    glyphs taken from a font. In \LUATEX\ we have \UTF\ input, and in most cases
    this maps directly to a character in a font, apart from glyph replacement in
    the font engine. If you want to access arbitrary glyphs in a font directly
    you can always use \LUA\ to do so, because fonts are available as \LUA\
    tables.
468\stopitem
469
470\startitem
471 All the results of processing in math mode eventually become nodes with
472 \quote {glyph} subtypes. In fact, the result of processing math is just
473 a regular list of glyphs, kerns, glue, penalties, boxes etc.
474\stopitem
475
476\startitem
    Automatic discretionaries are handled differently. \TEX82 inserts an empty
    discretionary after sensing an input character that matches the \type
    {\hyphenchar} in the current font. This test is wrong in our opinion: whether
    or not hyphenation takes place should not depend on the current font, it is a
    language property. \footnote {When \TEX\ showed up we didn't have \UNICODE\
    yet and being limited to eight bits meant that one sometimes had to
    compromise between supporting character input, glyph rendering and
    hyphenation.}

    The \type {\defaulthyphenchar} parameter is used as fallback when defining a
    font where that one is not explicitly set.

    In \LUATEX, it works like this: if it senses a string of input characters
    that matches the value of the new integer parameter \type {\exhyphenchar}, it
    will insert an explicit discretionary after that series of nodes. Initially
    \TEX\ sets the \type {\exhyphenchar=`\-}. Incidentally, this is a global
    parameter instead of a language|-|specific one because it may be useful to
    change the value depending on the document structure instead of the text
    language.

    The insertion of discretionaries after a sequence of explicit hyphens happens
    at the same time as the other hyphenation processing, {\it not\/} inside the
    main control loop.

    The only use \LUATEX\ has for \type {\hyphenchar} is at the check whether a
    word should be considered for hyphenation at all. If the \type {\hyphenchar}
    of the font attached to the first character node in a word is negative, then
    hyphenation of that word is abandoned immediately. This behaviour is added
    for backward compatibility only, and \type {\hyphenchar=-1} should not be
    used as a means of preventing hyphenation in new \LUATEX\ documents.
506\stopitem
507
508\startitem
509 The \type {\setlanguage} command no longer creates whatsits. The meaning of
510 \type {\setlanguage} is changed so that it is now an integer parameter like all
511 others. That integer parameter is used in \type {glyph} node creation to add
512 language information to the glyph nodes. In conjunction, the \type {\language}
513 primitive is extended so that it always also updates the value of \type
514 {\setlanguage}.
515\stopitem
516
517\startitem
518 The \type {\noboundary} command (that prohibits word boundary processing
519 where that would normally take place) now does create nodes. These nodes are
520 needed because the exact place of the \type {\noboundary} command in the
521 input stream has to be retained until after the ligature and font processing
522 stages.
523\stopitem
524
525\startitem
    There is no longer a \type {mainloop} label in the code. Remember that
    \TEX82 did quite a lot of processing while adding \type {charnodes} to the
    horizontal list? For speed reasons, it handled that processing code outside
    of the \quote {main control} loop, and only the first character of any \quote
    {word} was handled by that \quote {main control} loop. In \LUATEX, there is
    no longer a need for that (all hard work is done later), and the (now very
    small) bits of character|-|handling code have been moved back inline. When
    \type {\tracingcommands} is on, this is visible because the full word is
    reported, instead of just the initial character.
535\stopitem
536
537\stopitemize
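
Since \type {\exhyphenchar}, mentioned in the list above, is an ordinary integer
parameter, it can be assigned like any other. The lines below are a sketch of
two alternative assignments. The second one is only an illustration and assumes
that the document uses the \UNICODE\ hyphen (U+2010) as its explicit hyphen and
that the font provides it.

\starttyping
\exhyphenchar=`\-   % the initial value: the ASCII hyphen
\exhyphenchar="2010 % alternative (assumption): the Unicode hyphen
\stoptyping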
538
539Because we tend to make hard coded behavior configurable a few new primitives
540have been added:
541
542\starttyping
543\automatichyphenpenalty
544\explicithyphenpenalty
545\stoptyping
546
547These relate to:
548
549\starttyping
550\automaticdiscretionary
551\explicitdiscretionary
552\stoptyping
553
554The usage of these penalties is controlled by the \type {\hyphenationmode} flags
555{\tt0x\tohexadecimal\automaticpenaltyhyphenationcode } and
556{\tt0x\tohexadecimal\explicitpenaltyhyphenationcode} and when these are not set \typ
557{\exhyphenpenalty} is used.
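
As an illustration, the following sketch gives the two kinds of discretionaries
different penalties and then enables the related flags. The penalty values are
arbitrary, and adding the constants assumes that the two bits are not yet set in
the current mode.

\starttyping
\automatichyphenpenalty 100 % for discretionaries that result from -
\explicithyphenpenalty  200 % for discretionaries that result from \-

\hyphenationmode \numexpr
    \hyphenationmode
  + \automaticpenaltyhyphenationcode
  + \explicitpenaltyhyphenationcode
\relax
\stoptyping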
558
559You can use the \type {\tracinghyphenation} variable to get a bit more information
560about what happens.
561
\starttabulate[|lT|l|]
563\FL
564\BC value \BC effect \NC\NR
565\TL
566\NC 1 \NC report redundant pattern (happens by default in \LUATEX) \NC\NR
567\NC 2 \NC report words that reach the hyphenator and got treated \NC\NR
568\NC 3 \NC show the result of a hyphenated word (a node list) \NC\NR
569\LL
570\stoptabulate
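
For instance, to see which words reach the hyphenator you could put the
following line in front of the text that you want to inspect. This is a sketch;
where the reports end up (log file, console) depends on the macro package.

\starttyping
\tracinghyphenation 2 % report words that reach the hyphenator
\stoptyping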
571
572\stopsection
573
574\startsection[title={Loading patterns and exceptions},reference=patternsexceptions]
575
576Although we keep the traditional approach towards hyphenation (which is still
577superior) the implementation of the hyphenation algorithm in \LUATEX\ is quite
578different from the one in \TEX82.
579
After expansion, the argument for \type {\patterns} has to be proper \UTF-8 with
individual patterns separated by spaces; no \type {\char} commands or commands
defined by \type {\chardef} are allowed. The current implementation is quite
strict and will reject all non|-|\UNICODE\ characters. Likewise, the expanded
argument for \type {\hyphenation} also has to be proper \UTF-8, but here a bit of
extra syntax is provided:
586
587\startitemize[n]
\startitem
    Three sets of arguments in curly braces (\type {{}{}{}}) indicate a desired
    complex discretionary, with arguments as in the \type {\discretionary}
    command in normal document input.
\stopitem
\startitem
    A \type {-} indicates a desired simple discretionary, cf.\ \type {\-} and
    \type {\discretionary{-}{}{}} in normal document input.
\stopitem
\startitem
    Internal command names are ignored. This rule is provided especially for \type
    {\discretionary}, but it also helps to deal with \type {\relax} commands that
    may sneak in.
\stopitem
\startitem
    An \type {=} indicates a (non|-|discretionary) hyphen in the document input.
\stopitem
605\stopitemize
606
The expanded argument is first converted back to a space|-|separated string while
dropping the internal command names. This string is then converted into a
dictionary by a routine that creates key|-|value pairs by converting the other
listed items. It is important to note that the keys in an exception dictionary
can always be generated from the values. Here are a few examples:

\starttabulate[|l|l|l|]
\FL
\BC value                  \BC implied key (input) \BC effect \NC\NR
\TL
\NC \type {ta-ble}         \NC table  \NC \type {ta\-ble} ($=$ \type {ta\discretionary{-}{}{}ble}) \NC\NR
\NC \type {ba{k-}{}{c}ken} \NC backen \NC \type {ba\discretionary{k-}{}{c}ken} \NC\NR
\LL
\stoptabulate
621
622The resultant patterns and exception dictionary will be stored under the language
623code that is the present value of \type {\language}.
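
The following sketch shows how exceptions like these could be stored for one
specific language. The language number is hypothetical and only serves the
example. The group keeps the \type {\language} switch local, while the
exceptions themselves are stored with the language.

\starttyping
\begingroup
    \language 1 % hypothetical language number
    \hyphenation{ta-ble ba{k-}{}{c}ken}
\endgroup
\stoptyping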
624
In the last line of the table, you see there is no \type {\discretionary} command
in the value: the command is optional in the \TEX|-|based input syntax. The
underlying reason for that is that it is conceivable that a whole dictionary of
words is stored as a plain text file and loaded into \LUATEX\ using one of the
functions in the \LUA\ \type {language} library. This loading method is quite a bit
faster than going through the \TEX\ language primitives, but some (most?) of that
speed gain would be lost if it had to interpret command sequences while doing so.

It is possible to specify extra hyphenation points in compound words by using
\type {{-}{}{-}} for the explicit hyphen character (replace \type {-} by the
actual explicit hyphen character if needed). For example, this matches the word
\quote {multi-word-boundaries} and allows an extra break in between \quote
{boun} and \quote {daries}:

\starttyping
\hyphenation{multi{-}{}{-}word{-}{}{-}boun-daries}
\stoptyping
642
643The motivation behind the \ETEX\ extension \type {\savinghyphcodes} was that
644hyphenation heavily depended on font encodings. This is no longer true in
645\LUATEX, and the corresponding primitive is basically ignored. Because we now
646have \type {\hjcode}, the case related codes can be used exclusively for \type
647{\uppercase} and \type {\lowercase}.
648
The three curly brace pair pattern in an exception can be somewhat unexpected so
we will try to explain it by example. The pattern \type {foo{}{}{x}bar}
creates a lookup \type {fooxbar} and the pattern \type {foo{}{}{}bar} creates
\type {foobar}. Then, when a hit happens there is a replacement text (\type {x})
or none. Because we introduced penalties in discretionary nodes, the exception
syntax now also can take a penalty specification. The value between square brackets
is a multiplier for \type {\exceptionpenalty}. Here we have set that parameter to
10000, so a multiplier of \type {[3]} effectively gives 30000 in the example.
657
\def\ShowSample#1#2
  {\startlinecorrection[blank]
   \hyphenation{#1}
   \exceptionpenalty=10000
   \bTABLE[foregroundstyle=type]
     \bTR
       \bTD[align=middle,nx=4] \type{#1} \eTD
     \eTR
     \bTR
       \bTD[align=middle] \type{10em} \eTD
       \bTD[align=middle] \type {3em} \eTD
       \bTD[align=middle] \type {0em} \eTD
       \bTD[align=middle] \type {6em} \eTD
     \eTR
     \bTR
       \bTD[width=10em]\vtop{\hsize 10em 123 #2 123\par}\eTD
       \bTD[width=10em]\vtop{\hsize 3em 123 #2 123\par}\eTD
       \bTD[width=10em]\vtop{\hsize 0em 123 #2 123\par}\eTD
       \bTD[width=10em]\vtop{\setupalign[verytolerant,stretch]\rmtf\hsize 6em 123 #2 #2 #2 #2 123\par}\eTD
     \eTR
   \eTABLE
   \stoplinecorrection}

\ShowSample{x{a}{b}{}x{a}{b}{}x{a}{b}{}x{a}{b}{}xx}{xxxxxx}
\ShowSample{x{a}{b}{}x{a}{b}{}[3]x{a}{b}{}[1]x{a}{b}{}xx}{xxxxxx}

\ShowSample{z{a}{b}{z}{a}{b}{z}{a}{b}{z}{a}{b}{z}z}{zzzzzz}
\ShowSample{z{a}{b}{z}{a}{b}{z}[3]{a}{b}{z}[1]{a}{b}{z}z}{zzzzzz}
686
\stopsection

\startsection[title={Applying hyphenation}]

The internal structures \LUATEX\ uses for the insertion of discretionaries in
words are very different from the ones in \TEX82, and that means there are some
noticeable differences in handling as well.

First and foremost, there is no \quote {compressed trie} involved in hyphenation.
The algorithm still reads pattern files generated by \PATGEN, but \LUATEX\ uses a
finite state hash to match the patterns against the word to be hyphenated. This
algorithm is based on the \quote {libhnj} library used by \OPENOFFICE, which in
turn is inspired by \TEX.
700
There are a few differences between \LUATEX\ and \TEX82 that are a direct result
of the implementation:

\startitemize
\startitem
    \LUATEX\ happily hyphenates the full \UNICODE\ character range.
\stopitem
\startitem
    Pattern and exception dictionary size is limited by the available memory
    only; all allocations are done dynamically. The trie-related settings in
    \type {texmf.cnf} are ignored.
\stopitem
\startitem
    Because there is no \quote {trie preparation} stage, language patterns never
    become frozen. This means that the primitive \type {\patterns} (and its \LUA\
    counterpart \type {language.patterns}) can be used at any time, not only in
    ini\TEX.
\stopitem
\startitem
    Only the string representation of \type {\patterns} and \type {\hyphenation} is
    stored in the format file. At format load time, they are simply
    re-evaluated. It follows that there is no real reason to preload languages
    in the format file. In fact, it is usually not a good idea to do so. It is
    much smarter to load patterns no sooner than the first time they are actually
    needed.
\stopitem
\startitem
    \LUATEX\ uses the language-specific variables \type {\prehyphenchar} and \type
    {\posthyphenchar} in the creation of implicit discretionaries, instead of
    \TEX82's \type {\hyphenchar}, and the values of the language-specific
    variables \type {\preexhyphenchar} and \type {\postexhyphenchar} for explicit
    discretionaries (instead of \TEX82's empty discretionary).
\stopitem
\startitem
    The values of the two counters related to hyphenation, \type {\hyphenpenalty}
    and \type {\exhyphenpenalty}, are now stored in the discretionary nodes. This
    permits a local overload for explicit \type {\discretionary} commands. The
    value current when the hyphenation pass is applied is used. When no callbacks
    are used this is compatible with traditional \TEX. When you apply the \LUA\
    \type {language.hyphenate} function the current values are used.
\stopitem
\startitem
    The hyphenation exception dictionary is maintained as a key-value hash, and
    that is also dynamic, so the \type {hyphsize} setting is not used either.
\stopitem
\stopitemize
747
Because we store penalties in the disc node, the \type {\discretionary} command has
been extended to accept an optional penalty specification, so you can do the
following:

\startbuffer
\hsize1mm
1:foo{\hyphenpenalty 10000\discretionary{}{}{}}bar\par
2:foo\discretionary penalty 10000 {}{}{}bar\par
3:foo\discretionary{}{}{}bar\par
\stopbuffer

\typebuffer

This results in:

\blank \start \getbuffer \stop \blank
764
Inserted characters and ligatures inherit their attributes from the nearest glyph
node item (usually the preceding one, but the following one for the items
inserted at the left-hand side of a word).

Word boundaries are no longer implied by font switches, but by language switches.
One word can have two separate fonts and still be hyphenated correctly (but it
cannot have two different languages: the \type {\setlanguage} command forces a
word boundary).

All languages start out with \type {\prehyphenchar=`\-}, \type {\posthyphenchar=0},
\type {\preexhyphenchar=0} and \type {\postexhyphenchar=0}. When you assign a
value to one of these four parameters, you are actually changing the settings
for the current \type {\language}; this behaviour is compatible with \type {\patterns}
and \type {\hyphenation}.
779
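
As a sketch of how these four parameters might be set for the current language
(the values are illustrative assumptions, not recommended defaults):

\starttyping
% all four assignments apply to the current \language
\prehyphenchar    = `\-  % implicit break: hyphen at the end of the line
\posthyphenchar   = 0    % implicit break: nothing at the start of the next line
\preexhyphenchar  = 0    % explicit break: nothing extra at the end of the line
\postexhyphenchar = `\-  % explicit break: a hyphen at the start of the next line
\stoptyping

The last assignment only illustrates the mechanism; whether a repeated hyphen is
wanted depends on the typographic conventions of the language, and on how the
macro package sets up explicit hyphens.
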
\LUATEX\ also hyphenates the first word in a paragraph. Words can be up to 256
characters long (up from 64 in \TEX82). Longer words are ignored right now, but
eventually either the limitation will be removed or perhaps it will become
possible to silently ignore the excess characters (this is what happens in
\TEX82, but there the behaviour cannot be controlled).

If you are using the \LUA\ function \type {language.hyphenate}, you should be aware
that this function expects to receive a list of \quote {character} nodes. It will
not operate properly in the presence of \quote {glyph}, \quote {ligature}, or
\quote {ghost} nodes, nor does it know how to deal with kerning.
790
\stopsection

\startsection[title={Applying ligatures and kerning}]

We discuss this base mode aspect here because in traditional \TEX\ the process is
interwoven with hyphenation. After all possible hyphenation points have been
inserted in the list, \LUATEX\ will process the list to convert the \quote
{character} nodes into \quote {glyph} and \quote {ligature} nodes. This is
actually done in two stages: first all ligatures are processed, then all kerning
information is applied to the result list. But those two stages are somewhat
dependent on each other: if the used font makes it possible to do so, the
ligaturing stage adds virtual \quote {character} nodes to the word boundaries in
the list. While doing so, it removes and interprets \type {\noboundary} nodes.
The kerning stage deletes those word boundary items after it is done with them,
and it does the same for \quote {ghost} nodes. Finally, at the end of the kerning
stage, all remaining \quote {character} nodes are converted to \quote {glyph}
nodes.

This separation is worth mentioning because, if you overrule from \LUA\ only one
of the two callbacks related to font handling, then you have to make sure you
perform the tasks normally done by \LUATEX\ itself in order to make sure that the
other, non-overruled, routine continues to function properly.
813
Although we could improve the situation, the reality is that in modern \OPENTYPE\
fonts ligatures can be constructed in many ways: by replacing a sequence of
characters by one glyph, or by selectively replacing individual glyphs, or by
kerning, or any combination of these. Add to that contextual analysis and it will
be clear that we have to let \LUA\ do that job instead. The generic font handler
that we provide (which is part of \CONTEXT) distinguishes between base mode
(which essentially is what we describe here and which delegates the task to \TEX)
and node mode (which deals with more complex fonts).

In so called base mode, where \TEX\ does the work, the ligature construction
(normally) goes in small steps. An \type {f} followed by an \type {f} becomes an
\type {ff} ligature and that one followed by an \type {i} can become a \type
{ffi} ligature. The situation can be complicated by hyphenation points between
these characters. When there are several in a ligature, collapsing happens. Flag
{\tt 0x\tohexadecimal \lazyligatureshyphenationcode} in the \typ
{\hyphenationmode} variable determines if this happens lazily or greedily, i.e.\
the first hyphen wins or the last one does. In practice a \CONTEXT\ user won't
have to deal with this because most fonts are processed in node mode.
832
\stopsection

\startsection[title={Breaking paragraphs into lines}]

This code is almost unchanged, but because of the above-mentioned changes with
respect to discretionaries and ligatures, line breaking will potentially be
different from traditional \TEX. The actual line breaking code is still based on
the \TEX82 algorithms, and there can be no discretionaries inside of
discretionaries. But, as patterns evolve and font handling can influence
discretionaries, you need to be aware of the fact that long term consistency is
not an engine matter only.

However, a situation where a discretionary would end up inside another one is now
fairly common in \LUATEX, due to the changes to the ligaturing mechanism. Also,
the \LUATEX\ discretionary nodes are implemented slightly differently from the
\TEX82 nodes: the \type {nobreak} text is now embedded inside the disc node,
where previously these nodes kept their place in the horizontal list. In
traditional \TEX\ the discretionary node contains a counter indicating how many
nodes to skip, but in \LUATEX\ we store the pre, post and replace text in the
discretionary node.
852
The combined effect of these two differences is that \LUATEX\ does not always use
all of the potential breakpoints in a paragraph, especially when fonts with many
ligatures are used. Of course kerning also complicates matters here. In practice
that doesn't matter much because the par builder has enough solution space due to
spaces; it's not as if all of a sudden we wonder why paragraphs look worse.

The \typ {\doublehyphendemerits} and \typ {\finalhyphendemerits} parameters play
a role in the par builder: they discourage line breaks that result in two or
more hyphenated lines in a row, or in a hyphen at the end of the pre-last line.
These are not bound to a language.
863
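
For reference, a minimal sketch using the values that plain \TEX\ assigns to
these two parameters; a macro package may well choose different ones:

\starttyping
\doublehyphendemerits 10000 % extra demerits when two consecutive lines end in a hyphen
\finalhyphendemerits   5000 % extra demerits when the second-last line ends in a hyphen
\stoptyping
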
\stopsection

\startsection[title={The \type {language} library}][library=lang]

This library provides the interface to the internal structure representing a
language, and the associated functions.

\starttyping [option=LUA]
function language.new ( <t:nil> <t:integer> identifier )
    return <t:userdata> language
end
\stoptyping
878
This function creates a new userdata object. An object of type \type {<language>}
is the first argument to most of the other functions in the \type {language}
library. These functions can also be used as if they were object methods, using
the colon syntax. Without an argument, the next available internal id number will
be assigned to this object. With an argument, an object will be created that links
to the internal language with that id number. The next function returns the
internal \type {\language} id number that an object refers to:

\starttyping [option=LUA]
function language.id ( <t:userdata> language )
    return <t:integer> identifier
end
\stoptyping
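
The following sketch shows how these two functions might be combined in a
\CONTEXT\ run. It assumes that the \type {language} library is reachable from
\type {\startluacode}; the variable names are hypothetical.

\starttyping
\startluacode
    -- without an argument the next free internal id is taken
    local l  = language.new()
    -- the colon syntax is equivalent to language.id(l)
    local id = l:id()
    context("new language object with id %i", id)
\stopluacode
\stoptyping
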
892
You can load exceptions with:

\starttyping [option=LUA]
function language.hyphenation ( <t:userdata> language, <t:string> list )
    -- no return value
end
\stoptyping

When no string is given (as in the next example) a string with all exceptions is
returned.

\starttyping [option=LUA]
function language.hyphenation ( <t:userdata> language )
    return <t:string> list
end
\stoptyping

So this function either returns the current hyphenation exceptions for this
language, or adds new ones. The syntax of the string is explained in \in
{section} [patternsexceptions].
913
This call clears the exception dictionary (string) for this language:

\starttyping [option=LUA]
function language.clearhyphenation ( <t:userdata> language )
    -- no return value
end
\stoptyping

The next function creates a hyphenation key from the supplied hyphenation value.
The syntax of the argument string is explained in \in {section}
[patternsexceptions]. The function is useful if you want to do something else
based on the words in a dictionary file, like spell-checking.

\starttyping [option=LUA]
function language.clean ( <t:userdata> language, <t:string> str )
    return <t:string> cln
end

function language.clean ( <t:string> str )
    return <t:string> cln
end
\stoptyping
936
This adds additional patterns for this language object, or returns the current
set. The syntax of this string is explained in \in {section}
[patternsexceptions].

\starttyping [option=LUA]
function language.patterns ( <t:userdata> language, <t:string> list )
    -- no return value
end
\stoptyping

The registered list can be fetched with:

\starttyping [option=LUA]
function language.patterns ( <t:userdata> language )
    return <t:string> list
end
\stoptyping

This can be used to clear the pattern dictionary for a language:

\starttyping [option=LUA]
function language.clearpatterns ( <t:userdata> language )
    -- no return value
end
\stoptyping
962
This function sets (or gets) the value of the \TEX\ parameter
\type {\hyphenationmin}.

\starttyping [option=LUA]
function language.hyphenationmin ( <t:userdata> language, <t:integer> n )
    -- no return value
end
\stoptyping

\starttyping [option=LUA]
function language.hyphenationmin ( <t:userdata> language )
    return <t:integer> n
end
\stoptyping
978
These two are used to get or set the \quote {prebreak} and \quote
{postbreak} hyphen characters for implicit hyphenation in this language. The
initial values are decimal 45 (hyphen) and decimal 0 (indicating emptiness).

\starttyping [option=LUA]
function language.prehyphenchar  ( <t:userdata> language, <t:integer> n ) end
function language.posthyphenchar ( <t:userdata> language, <t:integer> n ) end

function language.prehyphenchar  ( <t:userdata> language ) return <t:integer> n end
function language.posthyphenchar ( <t:userdata> language ) return <t:integer> n end
\stoptyping

These get or set the \quote {prebreak} and \quote {postbreak} hyphen
characters for explicit hyphenation in this language. Both are initially
decimal 0 (indicating emptiness).

\starttyping [option=LUA]
function language.preexhyphenchar  ( <t:userdata> language, <t:integer> n ) end
function language.postexhyphenchar ( <t:userdata> language, <t:integer> n ) end

function language.preexhyphenchar  ( <t:userdata> language ) return <t:integer> n end
function language.postexhyphenchar ( <t:userdata> language ) return <t:integer> n end
\stoptyping
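
As a sketch, the setters and getters above could be used as follows. It merely
restates the initial values mentioned in the text; the fresh language object
and the variable name are assumptions made for the sake of the example.

\starttyping
\startluacode
    local l = language.new()   -- hypothetical fresh language object
    l:prehyphenchar(0x2D)      -- hyphen (decimal 45) before an implicit break
    l:posthyphenchar(0)        -- nothing after an implicit break
    l:preexhyphenchar(0)       -- nothing before an explicit break
    l:postexhyphenchar(0)      -- nothing after an explicit break
    context("prehyphenchar: %i", l:prehyphenchar())
\stopluacode
\stoptyping
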
1002
The next call inserts hyphenation points (discretionary nodes) in a node list. If
\type {tail} is given as argument, processing stops on that node. Currently,
\type {success} is always true if \type {head} (and optionally \type {tail}) are
proper nodes, regardless of possible other errors.

\starttyping [option=LUA]
function language.hyphenate ( <t:node> head, <t:node> tail )
    return <t:boolean> success
end
\stoptyping

Hyphenation works only on \quote {characters}, a special subtype of all the glyph
nodes with the node subtype having the value \type {1}. Glyph nodes with
different subtypes are not processed. See \in {section} [charsandglyphs] for more
details.
1018
The following two commands can be used to set or query a \type {\hjcode}:

\starttyping [option=LUA]
function language.sethjcode (
    <t:userdata> language,
    <t:number>   character,
    <t:number>   usedchar
)
    -- no return value
end

function language.gethjcode (
    <t:userdata> language,
    <t:number>   character
)
    return <t:number> usedchar
end
\stoptyping

There are similar functions for \type {\hccode}:

\starttyping [option=LUA]
function language.sethccode (
    <t:userdata> language,
    <t:number>   character,
    <t:number>   usedchar
)
    -- no return value
end

function language.gethccode (
    <t:userdata> language,
    <t:number>   character
)
    return <t:number> usedchar
end
\stoptyping
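
A possible use, given here as a sketch only: it assumes that \type {usedchar} is
the character that takes the place of \type {character} when words are matched
against the patterns, and it uses a hypothetical fresh language object.

\starttyping
\startluacode
    local l = language.new()   -- hypothetical fresh language object
    -- assumption: let U+2019 (right single quotation mark) be looked up
    -- as U+0027 (apostrophe) during hyphenation
    language.sethjcode(l, 0x2019, 0x0027)
    context("hjcode of U+2019: %i", language.gethjcode(l, 0x2019))
\stopluacode
\stoptyping
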
1056
\stopsection

\startsection[title=Math]

For the record we mention that in math you can also have discretionaries:

\starttyping
$ 2x \mathdiscretionary{}{}{} 1 = 3y $
\stoptyping

These actually do relate to languages, but they are not stored in the language
data and have to be handled by the macro package. It will be clear that there is
quite a bit involved because we have spacing and penalties driven by math classes.
1070
\stopsection

\startsection[title=Tracing]

There are several trackers in \CONTEXT\ that can show where hyphenation was considered and
where it got applied, but this is really macro package dependent. There is also a built in
tracing command: \typ {\tracinghyphenation}. When you say:

\starttyping
\tracinghyphenation2
\tracingonline 2
\stoptyping

you get something like this:

\starttyping[option=]
1:3: [language: not hyphenated There]
1:3: [language: hyphenated several at 1 positions]
1:3: [language: hyphenated trackers at 1 positions]
1:3: [language: not hyphenated where]
1:3: [language: hyphenated hyphenation at 2 positions]
1:3: [language: hyphenated considered at 2 positions]
1:3: [language: not hyphenated where]
1:3: [language: hyphenated applied at 1 positions]
1:3: [language: hyphenated really at 1 positions]
1:3: [language: not hyphenated macro]
1:3: [language: hyphenated package at 1 positions]
1:3: [language: hyphenated dependent at 2 positions]
1:3: [language: not hyphenated There]
1:3: [language: not hyphenated built]
1:3: [language: hyphenated tracing at 1 positions]
1:3: [language: hyphenated command at 1 positions]
1:3: [language: hyphenated tracinghyphenation at 3 positions]
\stoptyping
1105
\startbuffer
Higher values give more details, like the pre, post and replace lists so that
output is rather noisy. Contrary to \type {\tracinghyphenation} is verbatim we do
permit it \type {\tracinghyphenation} to be hyphenated.
\stopbuffer

\typebuffer

renders as:

\getbuffer

and traces as:

\starttyping[option=]
1:3: [language: hyphenated renders at 1 positions]
1:4: [language: not hyphenated Higher]
1:4: [language: hyphenated values at 1 positions]
1:4: [language: hyphenated details at 1 positions]
1:4: [language: hyphenated replace at 1 positions]
1:4: [language: not hyphenated lists]
1:4: [language: hyphenated output at 1 positions]
1:4: [language: not hyphenated rather]
1:4: [language: not hyphenated noisy]
1:4: [language: hyphenated Contrary at 1 positions]
1:4: [language: hyphenated verbatim at 2 positions]
1:4: [language: hyphenated permit at 1 positions]
1:4: [language: hyphenated hyphenated at 2 positions]
1:3: [language: not hyphenated traces]
\stoptyping

\stopsection
1141
\stopdocument