luametatex-languages.tex /size: 43 Kb    last modification: 2023-12-21 09:43
1% language=us runpath=texruns:manuals/luametatex
2
3\environment luametatex-style
4
5\startcomponent luametatex-languages
6
7\startchapter[reference=languages,title={Languages, characters, fonts and glyphs}]
8
9\startsection[title={Introduction}]
10
11\topicindex {languages}
12
13\LUATEX's internal handling of the characters and glyphs that eventually become
14typeset is quite different from the way \TEX82 handles those same objects. The
15easiest way to explain the difference is to focus on unrestricted horizontal mode
16(i.e.\ paragraphs) and hyphenation first. Later on, it will be easy to deal
17with the differences that occur in horizontal and math modes.
18
19In \TEX82, the characters you type are converted into \type {char} node records
20when they are encountered by the main control loop. \TEX\ attaches and processes
21the font information while creating those records, so that the resulting \quote
22{horizontal list} contains the final forms of ligatures and implicit kerning.
This packaging is needed because we may want to get the effective width of, for
instance, a horizontal box.
25
When it becomes necessary to hyphenate words in a paragraph, \TEX\ converts (one
word at a time) the \type {char} node records into a string by replacing ligatures
28with their components and ignoring the kerning. Then it runs the hyphenation
29algorithm on this string, and converts the hyphenated result back into a \quote
30{horizontal list} that is consecutively spliced back into the paragraph stream.
31Keep in mind that the paragraph may contain unboxed horizontal material, which
32then already contains ligatures and kerns and the words therein are part of the
33hyphenation process.
34
35Those \type {char} node records are somewhat misnamed, as they are glyph
36positions in specific fonts, and therefore not really \quote {characters} in the
37linguistic sense. There is no language information inside the \type {char} node
38records at all. Instead, language information is passed along using \type
39{language whatsit} nodes inside the horizontal list.
40
41In \LUATEX, the situation is quite different. The characters you type are always
42converted into \nod {glyph} node records with a special subtype to identify them
43as being intended as linguistic characters. \LUATEX\ stores the needed language
44information in those records, but does not do any font|-|related processing at
45the time of node creation. It only stores the index of the current font and a
46reference to a character in that font.
47
48When it becomes necessary to typeset a paragraph, \LUATEX\ first inserts all
49hyphenation points right into the whole node list. Next, it processes all the
50font information in the whole list (creating ligatures and adjusting kerning),
51and finally it adjusts all the subtype identifiers so that the records are \quote
52{glyph nodes} from now on.
53
54\stopsection
55
56\startsection[title={Characters, glyphs and discretionaries},reference=charsandglyphs]
57
58\topicindex {characters}
59\topicindex {glyphs}
60\topicindex {hyphenation}
61
62\TEX82 (including \PDFTEX) differentiates between \type {char} nodes and \type
63{lig} nodes. The former are simple items that contained nothing but a \quote
64{character} and a \quote {font} field, and they lived in the same memory as
65tokens did. The latter also contained a list of components, and a subtype
66indicating whether this ligature was the result of a word boundary, and it was
67stored in the same place as other nodes like boxes and kerns and glues. In
68\LUAMETATEX\ we no longer keep the list of components with the glyph node.
69
In \LUATEX, these two types are merged into one, somewhat larger structure called
a \nod {glyph} node. Besides the old character, font, and component fields there
are a few more, like \quote {attr}, that we will see in \in {section}
[glyphnodes]. These nodes also contain a subtype that encodes four main types and
two additional ghost types. For ligatures, multiple bits can be set at the same
time (in case of a single|-|glyph word).
76
77\startitemize
78    \startitem
79        \type {character}, for characters to be hyphenated: the lowest bit
80        (bit 0) is set to 1.
81    \stopitem
82    \startitem
83        \nod {glyph}, for specific font glyphs: the lowest bit (bit 0) is
84        not set.
85    \stopitem
86    \startitem
        \type {ligature}, for constructed ligatures: bit 1 is set.
88    \stopitem
89\stopitemize
90
91The \nod {glyph} nodes also contain language data, split into four items that
92were current when the node was created: the \prm {setlanguage} (15~bits), \prm
93{lefthyphenmin} (8~bits), \prm {righthyphenmin} (8~bits), and \prm {uchyph}
94(1~bit).
95
96Incidentally, \LUATEX\ allows 16383 separate languages, and words can be 256
characters long. The language is stored with each character. You can set
\prm {firstvalidlanguage} to for instance~1, which makes language~0
an ignored hyphenation language.
100
101The new primitive \prm {hyphenationmin} can be used to signal the minimal length
102of a word. This value is stored with the (current) language.
103
104Because the \prm {uchyph} value is saved in the actual nodes, its handling is
105subtly different from \TEX82: changes to \prm {uchyph} become effective
106immediately, not at the end of the current partial paragraph.
107
108Typeset boxes now always have their language information embedded in the nodes
109themselves, so there is no longer a possible dependency on the surrounding
110language settings. In \TEX82, a mid|-|paragraph statement like \type {\unhbox0}
111would process the box using the current paragraph language unless there was a
112\prm {setlanguage} issued inside the box. In \LUATEX, all language variables
113are already frozen.
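
As a sketch of this freezing, the following can be used to check that unboxed
material keeps the settings that were active when it was typeset. The language
numbers~1 and~2 are hypothetical and we assume that patterns have been loaded
for both.

\starttyping
% language 1 and its hyphenmin values are stored in the glyphs of box 0
\setbox0\hbox{\language 1 \lefthyphenmin 2 \righthyphenmin 3 hyphenation}

% the surrounding paragraph uses language 2 and other hyphenmin values
\language 2 \lefthyphenmin 3 \righthyphenmin 3
some text \unhbox0\ and more text\par
\stoptyping

When the paragraph is hyphenated, the word that came from the box should still
be treated as language~1 material.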
114
In traditional \TEX\ the process of hyphenation is driven by \type {lccode}s. In
\LUATEX\ we made this dependency less strong. There are several strategies
possible. When you do nothing, the currently active \type {lccode}s are used
when loading patterns, setting exceptions or hyphenating a list.
119
120When you set \prm {savinghyphcodes} to a value greater than zero the current set
121of \type {lccode}s will be saved with the language. In that case changing a \type
122{lccode} afterwards has no effect. However, you can adapt the set with:
123
124\starttyping
125\hjcode`a=`a
126\stoptyping
127
128This change is global which makes sense if you keep in mind that the moment that
129hyphenation happens is (normally) when the paragraph or a horizontal box is
130constructed. When \prm {savinghyphcodes} was zero when the language got
131initialized you start out with nothing, otherwise you already have a set.
132
When a \prm {hjcode} is greater than 0 but less than 32, the value indicates the
length to be used for that character. In the following examples we map a
character (\type {x}) onto another one in the patterns and tell the engine that
\type {œ} counts as two characters. Because traditionally zero itself is reserved
for inhibiting hyphenation, a value of 32 counts as zero.
138
139Here are some examples (we assume that French patterns are used):
140
141\starttabulate[||||]
142\NC                                  \NC \type{foobar} \NC \type{foo-bar} \NC \NR
143\NC \type{\hjcode`x=`o}              \NC \type{fxxbar} \NC \type{fxx-bar} \NC \NR
144\NC \type{\lefthyphenmin3}           \NC \type{œdipus} \NC \type{œdi-pus} \NC \NR
145\NC \type{\lefthyphenmin4}           \NC \type{œdipus} \NC \type{œdipus}  \NC \NR
146\NC \type{\hjcode`œ=2}               \NC \type{œdipus} \NC \type{œdi-pus} \NC \NR
147\NC \type{\hjcode`i=32 \hjcode`d=32} \NC \type{œdipus} \NC \type{œdipus}  \NC \NR
148\NC
149\stoptabulate
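
The settings in the first column can be combined in a document as follows. This
is only a sketch: it assumes that French patterns are active and that the \prm
{hjcode}s are under user control as described above.

\starttyping
\hjcode`x=`o   % x is matched against the patterns as if it were o
\hjcode`œ=2    % œ counts as two characters
\hjcode`i=32   % 32 counts as zero, so this inhibits hyphenation
\stoptyping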
150
151Carrying all this information with each glyph would give too much overhead and
152also make the process of setting up these codes more complex. A solution with
153\type {hjcode} sets was considered but rejected because in practice the current
154approach is sufficient and it would not be compatible anyway.
155
156Beware: the values are always saved in the format, independent of the setting
157of \prm {savinghyphcodes} at the moment the format is dumped.
158
A boundary node would normally mark the end of a word, which interferes with,
for instance, discretionary injection. For this you can use \prm {wordboundary}
as a trigger. Here are a few examples of usage:
162
163\startbuffer
164    discrete---discrete
165\stopbuffer
166\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower
167\startbuffer
168    discrete\discretionary{}{}{---}discrete
169\stopbuffer
170\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower
171\startbuffer
172    discrete\wordboundary\discretionary{}{}{---}discrete
173\stopbuffer
174\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower
175\startbuffer
176    discrete\wordboundary\discretionary{}{}{---}\wordboundary discrete
177\stopbuffer
178\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower
179\startbuffer
180    discrete\wordboundary\discretionary{---}{}{}\wordboundary discrete
181\stopbuffer
182\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower
183
We only accept an explicit hyphen when there is a preceding glyph, and we skip a
sequence of explicit hyphens since that normally indicates a \type {--} or \type
{---} ligature. Because these dashes are ligatures in base fonts, treating them
as hyphens could, in the worst case, mess up ligature building and result in bad
node lists later on. This is a side effect of separating the hyphenation,
ligaturing and kerning steps.
190
The start and end of a sequence of characters is signalled by a \nod {glue}, \nod
{penalty}, \nod {kern} or \nod {boundary} node. By default a \nod {hlist}, \nod
{vlist}, \nod {rule}, \nod {dir}, \nod {whatsit}, \nod {insert} or \nod {adjust}
node also indicates a start or end. You can omit this last set from the test by
setting flags in \prm {hyphenationmode}:
196
197\starttabulate[|c|l|]
198\DB value      \BC behaviour \NC \NR
199\TB
200\NC            \NC not strict \NC \NR
201\NC \type{64}  \NC strict start \NC \NR
202\NC \type{128} \NC strict end \NC \NR
203\NC \type{192} \NC strict start and strict end \NC \NR
204\LL
205\stoptabulate
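
A minimal sketch of how these values can be applied locally, using the same
approach as the test macro for the figures in this section. It assumes that
neither bit is set in the current \prm {hyphenationmode}, because \type
{\advance} adds the value rather than setting bits.

\starttyping
\begingroup
  \advance\hyphenationmode 192 % 64 (strict start) + 128 (strict end)
  one\null two\par
\endgroup
\stoptyping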
206
207The word start is determined as follows:
208
209\starttabulate[|l|l|]
210\DB node      \BC behaviour \NC \NR
211\TB
212\BC boundary  \NC yes when wordboundary \NC \NR
213\BC hlist     \NC when the start bit is set \NC \NR
214\BC vlist     \NC when the start bit is set \NC \NR
215\BC rule      \NC when the start bit is set \NC \NR
216\BC dir       \NC when the start bit is set \NC \NR
217\BC whatsit   \NC when the start bit is set \NC \NR
218\BC glue      \NC yes \NC \NR
219\BC math      \NC skipped \NC \NR
\BC glyph     \NC yes for a single \prm {exhyphenchar} (so not for \type {--} or \type {---}) \NC \NR
221\BC otherwise \NC yes \NC \NR
222\LL
223\stoptabulate
224
225The word end is determined as follows:
226
227\starttabulate[|l|l|]
228\DB node      \BC behaviour \NC \NR
229\TB
230\BC boundary  \NC yes \NC \NR
231\BC glyph     \NC yes when different language \NC \NR
232\BC glue      \NC yes \NC \NR
233\BC penalty   \NC yes \NC \NR
234\BC kern      \NC yes when not italic (for some historic reason) \NC \NR
235\BC hlist     \NC when the end bit is set \NC \NR
236\BC vlist     \NC when the end bit is set \NC \NR
237\BC rule      \NC when the end bit is set \NC \NR
238\BC dir       \NC when the end bit is set \NC \NR
239\BC whatsit   \NC when the end bit is set \NC \NR
240\BC ins       \NC when the end bit is set \NC \NR
241\BC adjust    \NC when the end bit is set \NC \NR
242\LL
243\stoptabulate
244
\in {Figures} [hb:1] up to \in [hb:5] show some examples. In all cases we set
\prm {lefthyphenmin} and \prm {righthyphenmin} to 1 and make sure that the words
hyphenate at each character.
247
248\hyphenation{o-n-e t-w-o}
249
250\def\SomeTest#1#2%
251  {\lefthyphenmin  \plusone
252   \righthyphenmin \plusone
253   \parindent      \zeropoint
254   \everypar       \emptytoks
255   \dontcomplain
256   \hbox to 2cm {%
257     \vtop {%
258       \hsize 1pt
259       \advance\hyphenationmode#1\relax
260       #2
261       \par}}}
262
263\startplacefigure[reference=hb:1,title={\type{one}}]
264    \startcombination[4*1]
265        {\SomeTest  {0}{one}} {\type  {0}}
266        {\SomeTest {64}{one}} {\type {64}}
267        {\SomeTest{128}{one}} {\type{128}}
268        {\SomeTest{192}{one}} {\type{192}}
269    \stopcombination
270\stopplacefigure
271
272\startplacefigure[reference=hb:2,title={\type{one\null two}}]
273    \startcombination[4*1]
274        {\SomeTest  {0}{one\null two}} {\type  {0}}
275        {\SomeTest {64}{one\null two}} {\type {64}}
276        {\SomeTest{128}{one\null two}} {\type{128}}
277        {\SomeTest{192}{one\null two}} {\type{192}}
278    \stopcombination
279\stopplacefigure
280
281\startplacefigure[reference=hb:3,title={\type{\null one\null two}}]
282    \startcombination[4*1]
283        {\SomeTest  {0}{\null one\null two}} {\type  {0}}
284        {\SomeTest {64}{\null one\null two}} {\type {64}}
285        {\SomeTest{128}{\null one\null two}} {\type{128}}
286        {\SomeTest{192}{\null one\null two}} {\type{192}}
287    \stopcombination
288\stopplacefigure
289
290\startplacefigure[reference=hb:4,title={\type{one\null two\null}}]
291    \startcombination[4*1]
292        {\SomeTest  {0}{one\null two\null}} {\type  {0}}
293        {\SomeTest {64}{one\null two\null}} {\type {64}}
294        {\SomeTest{128}{one\null two\null}} {\type{128}}
295        {\SomeTest{192}{one\null two\null}} {\type{192}}
296    \stopcombination
297\stopplacefigure
298
299\startplacefigure[reference=hb:5,title={\type{\null one\null two\null}}]
300    \startcombination[4*1]
301        {\SomeTest  {0}{\null one\null two\null}} {\type  {0}}
302        {\SomeTest {64}{\null one\null two\null}} {\type {64}}
303        {\SomeTest{128}{\null one\null two\null}} {\type{128}}
304        {\SomeTest{192}{\null one\null two\null}} {\type{192}}
305    \stopcombination
306\stopplacefigure
307
308In traditional \TEX\ ligature building and hyphenation are interwoven with the
309line break mechanism. In \LUATEX\ these phases are isolated. As a consequence we
310deal differently with (a sequence of) explicit hyphens. We already have added
311some control over aspects of the hyphenation and yet another one concerns
312automatic hyphens (e.g.\ \type {-} characters in the input).
313
Hyphenation and discretionary injection are driven by a mode parameter, \prm
{hyphenationmode}, which is a bitset made from the following values, some of
which we saw in the previous examples.
317
318\starttabulate[|l|p|]
319\NC \number \normalhyphenationcode            \NC honour (normal) \prm{discretionary}'s \NC \NR
320\NC \number \automatichyphenationcode         \NC turn \type {-} into (automatic) discretionaries \NC \NR
321\NC \number \explicithyphenationcode          \NC turn \type {\-} into (explicit) discretionaries \NC \NR
322\NC \number \syllablehyphenationcode          \NC hyphenate (syllable) according to language \NC \NR
\NC \number \uppercasehyphenationcode         \NC hyphenate uppercase characters too (replaces \prm {uchyph}) \NC \NR
324\NC \number \compoundhyphenationcode          \NC permit break at an explicit hyphen (border cases) \NC \NR
325\NC \number \strictstarthyphenationcode       \NC traditional \TEX\ compatibility wrt the start of a word \NC \NR
326\NC \number \strictendhyphenationcode         \NC traditional \TEX\ compatibility wrt the end of a word \NC \NR
327\NC \number \automaticpenaltyhyphenationcode  \NC use \prm {automatichyphenpenalty} \NC \NR
328\NC \number \explicitpenaltyhyphenationcode   \NC use \prm {explicithyphenpenalty} \NC \NR
329\NC \number \permitgluehyphenationcode        \NC turn glue in discretionaries into kerns \NC \NR
330\NC \number \permitallhyphenationcode         \NC okay, let's be even more tolerant in discretionaries \NC \NR
331\NC \number \permitmathreplacehyphenationcode \NC and again we're more permissive \NC \NR
332\NC \number \lazyligatureshyphenationcode     \NC controls how successive explicit discretionaries are handled in base mode \NC \NR
333\NC \number \forcecheckhyphenationcode        \NC treat all discretionaries equal when breaking lines (in all three passes) \NC \NR
334\NC \number \forcehandlerhyphenationcode      \NC kick in the handler (experiment) \NC \NR
335\NC \number \feedbackcompoundhyphenationcode  \NC feedback compound snippets \NC \NR
336\stoptabulate
337
338Some of these options are still experimental, simply because not all aspects and
339side effects have been explored. You can find some experimental use cases in
340\CONTEXT.
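
The named codes can be added up to compose a mode. The following sketch assumes
that the code names from the table are available as numeric constants (they are
used that way in this manual). It only enables user specified break points, so
words are not hyphenated by patterns.

\starttyping
\hyphenationmode \numexpr
    \normalhyphenationcode      % \discretionary
  + \automatichyphenationcode   % -
  + \explicithyphenationcode    % \-
\relax
\stoptyping

Because the assignment replaces the whole bitset, flags that are not listed are
switched off.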
341
342\stopsection
343
344\startsection[title={Controlling hyphenation}]
345
346\startsubsection[title={\prm {hyphenationmin}}]
347
348\topicindex {languages}
349\topicindex {hyphenation}
350
351This primitive can be used to set the minimal word length, so setting it to a value
352of~$5$ means that only words of 6 characters and more will be hyphenated, of course
353within the constraints of the \prm {lefthyphenmin} and \prm {righthyphenmin}
354values (as stored in the glyph node). This primitive accepts a number and stores
355the value with the language.
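
For example, following the description above, this setting (the value is
arbitrary) keeps short words in the current language intact:

\starttyping
\hyphenationmin 5 % stored with the current language
\stoptyping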
356
357\stopsubsection
358
359\startsubsection[title={\prm {boundary}, \prm {noboundary}, \prm {protrusionboundary} and \prm {wordboundary}}]
360
The \prm {noboundary} command formerly injected a whatsit node but now injects a
normal node with type \nod {boundary} and subtype~0. In addition you can say:
363
364\starttyping
365x\boundary 123\relax y
366\stoptyping
367
368This has the same effect but the subtype is now~1 and the value~123 is stored.
369The traditional ligature builder still sees this as a cancel boundary directive
370but at the \LUA\ end you can implement different behaviour. The added benefit of
371passing this value is a side effect of the generalization. The subtypes~2 and~3
372are used to control protrusion and word boundaries in hyphenation and have
373related primitives.
374
375\stopsubsection
376
377\stopsection
378
379\startsection[title={The main control loop}]
380
381\topicindex {main loop}
382\topicindex {hyphenation}
383\topicindex {hyphenation+tracing}
384
385In \LUATEX's main loop, almost all input characters that are to be typeset are
386converted into \nod {glyph} node records with subtype \quote {character}, but
387there are a few exceptions.
388
389\startitemize[n]
390
391\startitem
392    The \prm {accent} primitive creates nodes with subtype \quote {glyph}
393    instead of \quote {character}: one for the actual accent and one for the
394    accentee. The primary reason for this is that \prm {accent} in \TEX82 is
395    explicitly dependent on the current font encoding, so it would not make much
396    sense to attach a new meaning to the primitive's name, as that would
397    invalidate many old documents and macro packages. A secondary reason is that
398    in \TEX82, \prm {accent} prohibits hyphenation of the current word. Since
399    in \LUATEX\ hyphenation only takes place on \quote {character} nodes, it is
400    possible to achieve the same effect. Of course, modern \UNICODE\ aware macro
401    packages will not use the \prm {accent} primitive at all but try to map
402    directly on composed characters.
403
404    This change of meaning did happen with \prm {char}, that now generates
405    \quote {glyph} nodes with a character subtype. In traditional \TEX\ there was
406    a strong relationship between the 8|-|bit input encoding, hyphenation and
407    glyphs taken from a font. In \LUATEX\ we have \UTF\ input, and in most cases
408    this maps directly to a character in a font, apart from glyph replacement in
409    the font engine. If you want to access arbitrary glyphs in a font directly
    you can always use \LUA\ to do so, because fonts are available as \LUA\
    tables.
412\stopitem
413
414\startitem
415    All the results of processing in math mode eventually become nodes with
416    \quote {glyph} subtypes. In fact, the result of processing math is just
417    a regular list of glyphs, kerns, glue, penalties, boxes etc.
418\stopitem
419
420\startitem
421    Automatic discretionaries are handled differently. \TEX82 inserts an empty
422    discretionary after sensing an input character that matches the \prm
423    {hyphenchar} in the current font. This test is wrong in our opinion: whether
424    or not hyphenation takes place should not depend on the current font, it is a
425    language property. \footnote {When \TEX\ showed up we didn't have \UNICODE\
426    yet and being limited to eight bits meant that one sometimes had to
    compromise between supporting character input, glyph rendering and hyphenation.}
428
429    In \LUATEX, it works like this: if \LUATEX\ senses a string of input
430    characters that matches the value of the new integer parameter \prm
431    {exhyphenchar}, it will insert an explicit discretionary after that series of
    nodes. Initially the engine sets \type {\exhyphenchar=`\-}. Incidentally, this
    is a global parameter instead of a language|-|specific one because it may be
434    useful to change the value depending on the document structure instead of the
435    text language.
436
437    The insertion of discretionaries after a sequence of explicit hyphens happens
438    at the same time as the other hyphenation processing, {\it not\/} inside the
439    main control loop.
440
441    The only use \LUATEX\ has for \prm {hyphenchar} is at the check whether a
442    word should be considered for hyphenation at all. If the \prm {hyphenchar}
443    of the font attached to the first character node in a word is negative, then
444    hyphenation of that word is abandoned immediately. This behaviour is added
    for backward compatibility only, and \type {\hyphenchar=-1} should not be
    used as a means of preventing hyphenation in new \LUATEX\ documents.
447\stopitem
448
449\startitem
450    The \prm {setlanguage} command no longer creates whatsits. The meaning of
451    \prm {setlanguage} is changed so that it is now an integer parameter like all
    others. That integer parameter is used in \nod {glyph} node creation to add
453    language information to the glyph nodes. In conjunction, the \prm {language}
454    primitive is extended so that it always also updates the value of \prm
455    {setlanguage}.
456\stopitem
457
458\startitem
459    The \prm {noboundary} command (that prohibits word boundary processing
460    where that would normally take place) now does create nodes. These nodes are
461    needed because the exact place of the \prm {noboundary} command in the
462    input stream has to be retained until after the ligature and font processing
463    stages.
464\stopitem
465
466\startitem
467    There is no longer a \type {main_loop} label in the code. Remember that
468    \TEX82 did quite a lot of processing while adding \type {char_nodes} to the
469    horizontal list? For speed reasons, it handled that processing code outside
470    of the \quote {main control} loop, and only the first character of any \quote
471    {word} was handled by that \quote {main control} loop. In \LUATEX, there is
472    no longer a need for that (all hard work is done later), and the (now very
473    small) bits of character|-|handling code have been moved back inline. When
474    \prm {tracingcommands} is on, this is visible because the full word is
475    reported, instead of just the initial character.
476\stopitem
477
478\stopitemize
479
480Because we tend to make hard coded behaviour configurable a few new primitives
481have been added:
482
483\starttyping
484\hyphenpenaltymode
485\automatichyphenpenalty
486\explicithyphenpenalty
487\stoptyping
488
The usage of these penalties is controlled by the \prm {hyphenationmode} flags
\number\automaticpenaltyhyphenationcode\space and
\number\explicitpenaltyhyphenationcode; when these are not set, \prm
{exhyphenpenalty} is used.
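
A sketch with arbitrary penalty values follows. It sets the mode from scratch,
so any other flag that is needed has to be added to the sum.

\starttyping
\automatichyphenpenalty 500 % used at breaks after a - in the input
\explicithyphenpenalty  200 % used at breaks at \-

\hyphenationmode \numexpr
    \normalhyphenationcode
  + \automatichyphenationcode
  + \explicithyphenationcode
  + \syllablehyphenationcode
  + \automaticpenaltyhyphenationcode
  + \explicitpenaltyhyphenationcode
\relax
\stoptyping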
493
494You can use the \prm {tracinghyphenation} variable to get a bit more information
495about what happens.
496
497\starttabulate[|lT|l|]
498\DB value \BC effect \NC\NR
499\TB
500\NC 1     \NC report redundant pattern (happens by default in \LUATEX) \NC\NR
501\NC 2     \NC report words that reach the hyphenator and got treated \NC\NR
502\NC 3     \NC show the result of a hyphenated word (a node list) \NC\NR
503\LL
504\stoptabulate
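
Tracing can be enabled locally. In this sketch (the sample text is arbitrary)
the words of one paragraph are reported when they reach the hyphenator:

\starttyping
\begingroup
  \tracinghyphenation 2
  some text that will be hyphenated\par
\endgroup
\stoptyping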
505
506\stopsection
507
508\startsection[title={Loading patterns and exceptions},reference=patternsexceptions]
509
510\topicindex {hyphenation}
511\topicindex {hyphenation+patterns}
512\topicindex {hyphenation+exceptions}
513\topicindex {patterns}
514\topicindex {exceptions}
515
516Although we keep the traditional approach towards hyphenation (which is still
517superior) the implementation of the hyphenation algorithm in \LUATEX\ is quite
518different from the one in \TEX82.
519
520After expansion, the argument for \prm {patterns} has to be proper \UTF8 with
521individual patterns separated by spaces, no \prm {char} or \prm {chardef}d
522commands are allowed. The current implementation is quite strict and will reject
523all non|-|\UNICODE\ characters. Likewise, the expanded argument for \prm
524{hyphenation} also has to be proper \UTF8, but here a bit of extra syntax is
525provided:
526
527\startitemize[n]
528\startitem
529    Three sets of arguments in curly braces (\type {{}{}{}}) indicate a desired
    complex discretionary, with arguments as in the \prm {discretionary} command
    in normal document input.
532\stopitem
533\startitem
534    A \type {-} indicates a desired simple discretionary, cf.\ \type {\-} and
535    \type {\discretionary{-}{}{}} in normal document input.
536\stopitem
537\startitem
538    Internal command names are ignored. This rule is provided especially for \prm
539    {discretionary}, but it also helps to deal with \prm {relax} commands that
540    may sneak in.
541\stopitem
542\startitem
543    An \type {=} indicates a (non|-|discretionary) hyphen in the document input.
544\stopitem
545\stopitemize
546
547The expanded argument is first converted back to a space|-|separated string while
548dropping the internal command names. This string is then converted into a
549dictionary by a routine that creates key|-|value pairs by converting the other
550listed items. It is important to note that the keys in an exception dictionary
551can always be generated from the values. Here are a few examples:
552
553\starttabulate[|l|l|l|]
554\DB value                  \BC implied key (input) \BC effect \NC\NR
555\TB
556\NC \type {ta-ble}         \NC table               \NC \type {ta\-ble} ($=$ \type {ta\discretionary{-}{}{}ble}) \NC\NR
557\NC \type {ba{k-}{}{c}ken} \NC backen              \NC \type {ba\discretionary{k-}{}{c}ken} \NC\NR
558\LL
559\stoptabulate
560
561The resultant patterns and exception dictionary will be stored under the language
562code that is the present value of \prm {language}.
563
564In the last line of the table, you see there is no \prm {discretionary} command
565in the value: the command is optional in the \TEX-based input syntax. The
566underlying reason for that is that it is conceivable that a whole dictionary of
567words is stored as a plain text file and loaded into \LUATEX\ using one of the
568functions in the \LUA\ \type {language} library. This loading method is quite a bit
569faster than going through the \TEX\ language primitives, but some (most?) of that
570speed gain would be lost if it had to interpret command sequences while doing so.
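
The two exceptions from the table can be entered in one go. This sketch assumes
that the current value of \prm {language} is the language that should receive
them:

\starttyping
\hyphenation{ta-ble ba{k-}{}{c}ken}
\stoptyping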
571
572It is possible to specify extra hyphenation points in compound words by using
573\type {{-}{}{-}} for the explicit hyphen character (replace \type {-} by the
574actual explicit hyphen character if needed). For example, this matches the word
\quote {multi|-|word|-|boundaries} and allows an extra break between \quote
576{boun} and \quote {daries}:
577
578\starttyping
579\hyphenation{multi{-}{}{-}word{-}{}{-}boun-daries}
580\stoptyping
581
582The motivation behind the \ETEX\ extension \prm {savinghyphcodes} was that
583hyphenation heavily depended on font encodings. This is no longer true in
584\LUATEX, and the corresponding primitive is basically ignored. Because we now
585have \prm {hjcode}, the case related codes can be used exclusively for \prm
586{uppercase} and \prm {lowercase}.
587
588The three curly brace pair pattern in an exception can be somewhat unexpected so
we will try to explain it by example. The pattern \type {foo{}{}{x}bar}
590creates a lookup \type {fooxbar} and the pattern \type {foo{}{}{}bar} creates
591\type {foobar}. Then, when a hit happens there is a replacement text (\type {x})
592or none. Because we introduced penalties in discretionary nodes, the exception
syntax now also can take a penalty specification. The value between square brackets
is a multiplier for \prm {exceptionpenalty}. Here we have set \prm
{exceptionpenalty} to 10000, so a multiplier of~3 effectively gives 30000 in the
example.
596
597\def\ShowSample#1#2%
598  {\startlinecorrection[blank]
599   \hyphenation{#1}%
600   \exceptionpenalty=10000
601   \bTABLE[foregroundstyle=type]
602     \bTR
603       \bTD[align=middle,nx=4] \type{#1} \eTD
604     \eTR
605     \bTR
606       \bTD[align=middle] \type{10em} \eTD
607       \bTD[align=middle] \type {3em} \eTD
608       \bTD[align=middle] \type {0em} \eTD
609       \bTD[align=middle] \type {6em} \eTD
610     \eTR
611     \bTR
612       \bTD[width=10em]\vtop{\hsize 10em 123 #2 123\par}\eTD
613       \bTD[width=10em]\vtop{\hsize  3em 123 #2 123\par}\eTD
614       \bTD[width=10em]\vtop{\hsize  0em 123 #2 123\par}\eTD
615       \bTD[width=10em]\vtop{\setupalign[verytolerant,stretch]\rmtf\hsize 6em 123 #2 #2 #2 #2 123\par}\eTD
616     \eTR
617   \eTABLE
618   \stoplinecorrection}
619
620\ShowSample{x{a-}{-b}{}x{a-}{-b}{}x{a-}{-b}{}x{a-}{-b}{}xx}{xxxxxx}
621\ShowSample{x{a-}{-b}{}x{a-}{-b}{}[3]x{a-}{-b}{}[1]x{a-}{-b}{}xx}{xxxxxx}
622
623\ShowSample{z{a-}{-b}{z}{a-}{-b}{z}{a-}{-b}{z}{a-}{-b}{z}z}{zzzzzz}
624\ShowSample{z{a-}{-b}{z}{a-}{-b}{z}[3]{a-}{-b}{z}[1]{a-}{-b}{z}z}{zzzzzz}
625
626\stopsection
627
628\startsection[title={Applying hyphenation}]
629
630\topicindex {hyphenation+how it works}
631\topicindex {hyphenation+discretionaries}
632\topicindex {discretionaries}
633
The internal structures \LUATEX\ uses for the insertion of discretionaries in
words are very different from the ones in \TEX82, and that means there are some
636noticeable differences in handling as well.
637
638First and foremost, there is no \quote {compressed trie} involved in hyphenation.
639The algorithm still reads pattern files generated by \PATGEN, but \LUATEX\ uses a
640finite state hash to match the patterns against the word to be hyphenated. This
641algorithm is based on the \quote {libhnj} library used by \OPENOFFICE, which in
642turn is inspired by \TEX.
643
644There are a few differences between \LUATEX\ and \TEX82 that are a direct result
645of the implementation:
646
647\startitemize
648\startitem
649    \LUATEX\ happily hyphenates the full \UNICODE\ character range.
650\stopitem
651\startitem
652    Pattern and exception dictionary size is limited by the available memory
653    only, all allocations are done dynamically. The trie|-|related settings in
654    \type {texmf.cnf} are ignored.
655\stopitem
656\startitem
657    Because there is no \quote {trie preparation} stage, language patterns never
658    become frozen. This means that the primitive \prm {patterns} (and its \LUA\
659    counterpart \type {language.patterns}) can be used at any time, not only in
660    ini\TEX.
661\stopitem
662\startitem
663    Only the string representation of \prm {patterns} and \prm {hyphenation} is
664    stored in the format file. At format load time, they are simply
665    re|-|evaluated. It follows that there is no real reason to preload languages
666    in the format file. In fact, it is usually not a good idea to do so. It is
667    much smarter to load patterns no sooner than the first time they are actually
668    needed.
669\stopitem
670\startitem
671    \LUATEX\ uses the language-specific variables \prm {prehyphenchar} and \prm
672    {posthyphenchar} in the creation of implicit discretionaries, instead of
673    \TEX82's \prm {hyphenchar}, and the values of the language|-|specific
674    variables \prm {preexhyphenchar} and \prm {postexhyphenchar} for explicit
675    discretionaries (instead of \TEX82's empty discretionary).
676\stopitem
677\startitem
    The values of the two counters related to hyphenation, \prm {hyphenpenalty}
    and \prm {exhyphenpenalty}, are now stored in the discretionary nodes. This
    permits a local overload for explicit \prm {discretionary} commands. The
    value current when the hyphenation pass is applied is used. When no callbacks
    are used this is compatible with traditional \TEX. When you apply the \LUA\
    \type {language.hyphenate} function the current values are used.
684\stopitem
685\startitem
    The hyphenation exception dictionary is maintained as a key|-|value hash, and
    that is also dynamic, so the \type {hyph_size} setting is not used either.
688\stopitem
689\stopitemize
690
691Because we store penalties in the disc node the \prm {discretionary} command has
692been extended to accept an optional penalty specification, so you can do the
693following:
694
695\startbuffer
696\hsize1mm
6971:foo{\hyphenpenalty 10000\discretionary{}{}{}}bar\par
6982:foo\discretionary penalty 10000 {}{}{}bar\par
6993:foo\discretionary{}{}{}bar\par
700\stopbuffer
701
702\typebuffer
703
704This results in:
705
706\blank \start \getbuffer \stop \blank
707
708Inserted characters and ligatures inherit their attributes from the nearest glyph
709node item (usually the preceding one, but the following one for the items
710inserted at the left-hand side of a word).
711
Word boundaries are no longer implied by font switches, but by language switches.
One word can have two separate fonts and still be hyphenated correctly, but it
cannot have two different languages: the \prm {setlanguage} command forces a
word boundary.
716
All languages start out with \type {\prehyphenchar=`\-}, \type {\posthyphenchar=0},
\type {\preexhyphenchar=0} and \type {\postexhyphenchar=0}. When you assign a
value to one of these four parameters, you are actually changing the setting
for the current \prm {language}. This behaviour is compatible with \prm {patterns}
and \prm {hyphenation}.
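
As a minimal sketch of the paragraph above, the following lines spell out the
initial values explicitly. The values themselves are taken from the text; the
layout and comments are only an illustration.

\starttyping
% Sketch: each assignment changes the setting of the current \language only.
\prehyphenchar    `\-  % pre-break character for implicit discretionaries
\posthyphenchar   0    % post-break character for implicit discretionaries (empty)
\preexhyphenchar  0    % pre-break character for explicit discretionaries (empty)
\postexhyphenchar 0    % post-break character for explicit discretionaries (empty)
\stoptyping

Because the settings are stored with the language, it is assumed here that
switching to another \prm {language} brings that language's own four values
into effect.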
722
723\LUATEX\ also hyphenates the first word in a paragraph. Words can be up to 256
724characters long (up from 64 in \TEX82). Longer words are ignored right now, but
725eventually either the limitation will be removed or perhaps it will become
726possible to silently ignore the excess characters (this is what happens in
727\TEX82, but there the behaviour cannot be controlled).
728
729If you are using the \LUA\ function \type {language.hyphenate}, you should be aware
730that this function expects to receive a list of \quote {character} nodes. It will
731not operate properly in the presence of \quote {glyph}, \quote {ligature}, or
732\quote {ghost} nodes, nor does it know how to deal with kerning.
733
734\stopsection
735
736\startsection[title={Applying ligatures and kerning}]
737
738\topicindex {ligatures}
739\topicindex {kerning}
740
741After all possible hyphenation points have been inserted in the list, \LUATEX\
742will process the list to convert the \quote {character} nodes into \quote {glyph}
743and \quote {ligature} nodes. This is actually done in two stages: first all
ligatures are processed, then all kerning information is applied to the resulting
list. But those two stages are somewhat dependent on each other: if the font used
746makes it possible to do so, the ligaturing stage adds virtual \quote {character}
747nodes to the word boundaries in the list. While doing so, it removes and
748interprets \prm {noboundary} nodes. The kerning stage deletes those word
749boundary items after it is done with them, and it does the same for \quote
750{ghost} nodes. Finally, at the end of the kerning stage, all remaining \quote
751{character} nodes are converted to \quote {glyph} nodes.
752
753This separation is worth mentioning because, if you overrule from \LUA\ only one
754of the two callbacks related to font handling, then you have to make sure you
755perform the tasks normally done by \LUATEX\ itself in order to make sure that the
756other, non|-|overruled, routine continues to function properly.
757
Although we could improve the situation, the reality is that in modern \OPENTYPE\
fonts ligatures can be constructed in many ways: by replacing a sequence of
characters by one glyph, or by selectively replacing individual glyphs, or by
kerning, or any combination of these. Add to that contextual analysis and it will
be clear that we have to let \LUA\ do that job instead. The generic font handler
that we provide (which is part of \CONTEXT) distinguishes between base mode
(which essentially is what we describe here and which delegates the task to \TEX)
and node mode (which deals with more complex fonts).
766
In so called base mode, where \TEX\ does the work, the ligature construction
(normally) goes in small steps. An \type {f} followed by an \type {f} becomes an
\type {ff} ligature and that one followed by an \type {i} can become a \type
{ffi} ligature. The situation can be complicated by hyphenation points between
these characters. When there are several in a ligature, collapsing happens. Flag
{\tttf "\uchexnumbers {\lazyligatureshyphenationcode}} in the \prm
{hyphenationmode} variable determines if this happens lazily or greedily, i.e.\ the
first hyphen wins or the last one does. In practice a \CONTEXT\ user won't have
to deal with this because most fonts are processed in node mode.
776
777\stopsection
778
779\startsection[title={Breaking paragraphs into lines}]
780
781\topicindex {linebreaks}
782\topicindex {paragraphs}
783\topicindex {discretionaries}
784
785This code is almost unchanged, but because of the above|-|mentioned changes with
786respect to discretionaries and ligatures, line breaking will potentially be
787different from traditional \TEX. The actual line breaking code is still based on
788the \TEX82 algorithms, and there can be no discretionaries inside of
789discretionaries. But, as patterns evolve and font handling can influence
790discretionaries, you need to be aware of the fact that long term consistency is
791not an engine matter only.
792
But that situation is now fairly common in \LUATEX, due to the changes to the
ligaturing mechanism. Also, the \LUATEX\ discretionary nodes are implemented
slightly differently from the \TEX82 nodes: the \type {no_break} text is now
embedded inside the disc node, where previously these nodes kept their place in
the horizontal list. In traditional \TEX\ the discretionary node contains a
counter indicating how many nodes to skip, but in \LUATEX\ we store the pre, post
and replace text in the discretionary node.
800
801The combined effect of these two differences is that \LUATEX\ does not always use
802all of the potential breakpoints in a paragraph, especially when fonts with many
803ligatures are used. Of course kerning also complicates matters here.
804
805\stopsection
806
807\startsection[title={The \type {language} library}][library=lang]
808
809\startsubsection[title={\type {new} and \type {id}}]
810
811\topicindex {languages+library}
812
813\libindex {new}
814\libindex {id}
815
816This library provides the interface to \LUATEX's structure representing a
817language, and the associated functions.
818
819\startfunctioncall
820<language> l = language.new()
821<language> l = language.new(<number> id)
822\stopfunctioncall
823
This function creates a new userdata object. An object of type \type {<language>}
is the first argument to most of the other functions in the \type {language}
library. These functions can also be used as if they were object methods, using
the colon syntax. Without an argument, the next available internal id number will
be assigned to this object. With an argument, an object will be created that links
to the internal language with that id number.
830
831\startfunctioncall
832<number> n = language.id(<language> l)
833\stopfunctioncall
834
835The number returned is the internal \prm {language} id number this object refers
836to.
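
The two calls can be combined as in the following minimal sketch. The variable
names are arbitrary, and the use of \type {print} to show the results is only
an assumption about how one might inspect them.

\starttyping
\startluacode
local l = language.new()   -- no argument: the next available id is assigned
local n = language.id(l)   -- the internal \language id this object refers to
local m = language.new(n)  -- with argument: links to the language with id n

-- the colon syntax mentioned above; l:id() is assumed to match language.id(l)
print(n, l:id(), language.id(m))
\stopluacode
\stoptyping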
837
838\stopsubsection
839
840\startsubsection[title={\type {hyphenation}}]
841
842\libindex {hyphenation}
843
You can query or load exceptions with:

\startfunctioncall
<string> n = language.hyphenation(<language> l)
language.hyphenation(<language> l, <string> n)
\stopfunctioncall

This either returns the current hyphenation exceptions for this language, or adds
new ones. When no string is given (the first example) a string with all
exceptions is returned. The syntax of the string is explained in~\in {section}
[patternsexceptions].

\stopsubsection

\startsubsection[title={\type {clearhyphenation} and \type {clean}}]

\libindex {clearhyphenation}
\libindex {clean}
864
865\startfunctioncall
866language.clearhyphenation(<language> l)
867\stopfunctioncall
868
869This call clears the exception dictionary (string) for this language.
870
871\startfunctioncall
872<string> n = language.clean(<language> l, <string> o)
873<string> n = language.clean(<string> o)
874\stopfunctioncall
875
876This function creates a hyphenation key from the supplied hyphenation value. The
877syntax of the argument string is explained in \in {section} [patternsexceptions].
878This function is useful if you want to do something else based on the words in a
879dictionary file, like spell|-|checking.
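
A minimal sketch of how the exception functions fit together is given below.
The two words are arbitrary examples written in the usual \prm {hyphenation}
syntax, and printing the results is only one possible way to inspect them.

\starttyping
\startluacode
local l = language.new()

language.hyphenation(l, "ta-ble as-so-ciate")  -- add two exceptions
print(language.hyphenation(l))                 -- query what is stored now
print(language.clean(l, "as-so-ciate"))        -- the key derived from one value

language.clearhyphenation(l)                   -- empty the dictionary again
\stopluacode
\stoptyping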
880
881\stopsubsection
882
883\startsubsection[title={\type {patterns} and \type {clearpatterns}}]
884
885\libindex {patterns}
886\libindex {clearpatterns}
887
888\startfunctioncall
889<string> n = language.patterns(<language> l)
890language.patterns(<language> l, <string> n)
891\stopfunctioncall
892
893This adds additional patterns for this language object, or returns the current
894set. The syntax of this string is explained in \in {section}
895[patternsexceptions].
896
897\startfunctioncall
898language.clearpatterns(<language> l)
899\stopfunctioncall
900
901This can be used to clear the pattern dictionary for a language.
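
The following sketch shows the calls in sequence. The patterns are toy
patterns made up for this illustration, not taken from a real pattern file.

\starttyping
\startluacode
local l = language.new()

language.patterns(l, ".ab1c 1de 2fg.")  -- add three (made up) patterns
print(language.patterns(l))             -- query the current set
language.clearpatterns(l)               -- clear the pattern dictionary
\stopluacode
\stoptyping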
902
903\stopsubsection
904
905\startsubsection[title={\type {hyphenationmin}}]
906
907\libindex {hyphenationmin}
908
909This function sets (or gets) the value of the \TEX\ parameter
910\type {\hyphenationmin}.
911
912\startfunctioncall
<number> n = language.hyphenationmin(<language> l)
914language.hyphenationmin(<language> l, <number> n)
915\stopfunctioncall
916
917\stopsubsection
918
919\startsubsection[title={\type {[pre|post][ex|]hyphenchar}}]
920
921\libindex {prehyphenchar}
922\libindex {posthyphenchar}
923\libindex {preexhyphenchar}
924\libindex {postexhyphenchar}
925
926\startfunctioncall
927<number> n = language.prehyphenchar(<language> l)
928language.prehyphenchar(<language> l, <number> n)
929
930<number> n = language.posthyphenchar(<language> l)
931language.posthyphenchar(<language> l, <number> n)
932\stopfunctioncall
933
These two are used to get or set the \quote {pre|-|break} and \quote
{post|-|break} hyphen characters for implicit hyphenation in this language. The
initial values are decimal 45 (hyphen) and decimal~0 (indicating emptiness).
937
938\startfunctioncall
939<number> n = language.preexhyphenchar(<language> l)
940language.preexhyphenchar(<language> l, <number> n)
941
942<number> n = language.postexhyphenchar(<language> l)
943language.postexhyphenchar(<language> l, <number> n)
944\stopfunctioncall
945
These get or set the \quote {pre|-|break} and \quote {post|-|break} hyphen
characters for explicit hyphenation in this language. Both are initially
decimal~0 (indicating emptiness).
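
As a small sketch, the four accessors can be used as follows. The choice of
\type {U+2010} as replacement hyphen is only an illustration and assumes that
the font in use provides that character.

\starttyping
\startluacode
local l = language.new()

-- query the values for implicit hyphenation
print(language.prehyphenchar(l), language.posthyphenchar(l))

-- use U+2010 (HYPHEN) as pre-break character for implicit hyphenation
language.prehyphenchar(l, 0x2010)

-- keep both characters for explicit hyphenation empty
language.preexhyphenchar(l, 0)
language.postexhyphenchar(l, 0)
\stopluacode
\stoptyping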
949
950\stopsubsection
951
952\startsubsection[title={\type {hyphenate}}]
953
954\libindex {hyphenate}
955
956The next call inserts hyphenation points (discretionary nodes) in a node list. If
957\type {tail} is given as argument, processing stops on that node. Currently,
958\type {success} is always true if \type {head} (and \type {tail}, if specified)
959are proper nodes, regardless of possible other errors.
960
961\startfunctioncall
962<boolean> success = language.hyphenate(<node> head)
963<boolean> success = language.hyphenate(<node> head, <node> tail)
964\stopfunctioncall
965
Hyphenation works only on \quote {characters}, a special subtype of all the glyph
nodes with the node subtype having the value \type {1}. Glyph nodes with
different subtypes are not processed. See \in {section} [charsandglyphs] for
more details.
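
A minimal sketch of a wrapper around this call is shown below. The function
name \type {hyphenatelist} is hypothetical, and it is assumed that \type {head}
(and the optional \type {tail}) belong to a list that still contains
unprocessed \quote {character} nodes.

\starttyping
\startluacode
-- hypothetical helper: hyphenate a list, optionally stopping at tail
local function hyphenatelist(head, tail)
    if tail then
        return language.hyphenate(head, tail)
    else
        return language.hyphenate(head)
    end
end
\stopluacode
\stoptyping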
970
971\stopsubsection
972
973\startsubsection[title={\type {[set|get]hjcode}}]
974
975\libindex {sethjcode}
976\libindex {gethjcode}
977
978The following two commands can be used to set or query hj codes:
979
980\startfunctioncall
981language.sethjcode(<language> l, <number> char, <number> usedchar)
982<number> usedchar = language.gethjcode(<language> l, <number> char)
983\stopfunctioncall
984
When you set a hjcode, the current sets get initialized unless the set was already
initialized due to \prm {savinghyphcodes} being larger than zero.
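
The following sketch uses both calls. Mapping the right single quote onto the
apostrophe is only an illustration of what one might do with hj codes.

\starttyping
\startluacode
local l = language.new()

-- let U+2019 (right single quote) be seen as U+0027 (apostrophe)
language.sethjcode(l, 0x2019, 0x0027)

-- query the code that is used for U+2019 now
print(language.gethjcode(l, 0x2019))
\stopluacode
\stoptyping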
987
\stopsubsection

\startsubsection[title={\prm {hccode} and \type {[set|get]hccode}}]
989
The \prm {hccode} of a character can be set to a nonzero value to indicate that
it should be regarded as a visible hyphenation point. These examples show how
that works (it is the second bit in \prm {hyphenationmode} that does the magic
but we set them all here):
994
995\startbuffer
996{\hsize 1mm \hccode"2014 \zerocount  \hyphenationmode "0000000 xxx\emdash xxx \par}
997{\hsize 1mm \hccode"2014 "2014\relax \hyphenationmode "0000000 xxx\emdash xxx \par}
998
999{\hsize 1mm \hccode"2014 \zerocount  \hyphenationmode "FFFFFFF xxx\emdash xxx \par}
1000{\hsize 1mm \hccode"2014 "2014\relax \hyphenationmode "FFFFFFF xxx\emdash xxx \par}
1001
1002{\hyphenationmode "0000000 xxx--xxx---xxx \par}
1003{\hyphenationmode "FFFFFFF xxx--xxx---xxx \par}
1004\stopbuffer
1005
1006\typebuffer
1007
1008Here we assign the code point because who knows what future extensions will
1009bring. As with the other codes you can also set them from \LUA. The feature is
1010experimental and might evolve when \CONTEXT\ users come up with reasonable
1011demands.
1012
1013\startpacked \getbuffer \stoppacked
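
A hedged sketch of the \LUA\ route follows. The function names are taken from
the title of this subsection, but the argument order is an assumption modelled
on \type {language.sethjcode} and is not confirmed here.

\starttyping
\startluacode
local l = language.new()

-- assumed signature: (language, character, value), as with sethjcode
language.sethccode(l, 0x2014, 0x2014)  -- regard the emdash as visible hyphen
print(language.gethccode(l, 0x2014))
\stopluacode
\stoptyping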
1014
1015\stopsubsection
1016
1017\stopsection
1018
1019\stopchapter
1020
1021\stopcomponent
1022
1023% \parindent0pt \hsize=1.1cm
1024% 12-34-56 \par
1025% 12-34-\hbox{56} \par
1026% 12-34-\vrule width 1em height 1.5ex \par
1027% 12-\hbox{34}-56 \par
1028% 12-\vrule width 1em height 1.5ex-56 \par
1029% \hjcode`\1=`\1 \hjcode`\2=`\2 \hjcode`\3=`\3 \hjcode`\4=`\4 \vskip.5cm
1030% 12-34-56 \par
1031% 12-34-\hbox{56} \par
1032% 12-34-\vrule width 1em height 1.5ex \par
1033% 12-\hbox{34}-56 \par
1034% 12-\vrule width 1em height 1.5ex-56 \par
1035
1036