% languages-options.tex, last modification: 2023-12-21 09:43
% language=us

\startcomponent languages-options

\environment languages-environment

\startchapter[title=Options][color=darkblue]

\startsection[title=Introduction]

Hyphenation of words is controlled by so-called patterns. The hyphenator takes a
word and tries to match parts of it with patterns that describe where a hyphen
can be injected. Preferred and discouraged injection points accumulate to a
score that in the end determines where so-called discretionary nodes get
injected in the list of glyphs that make up a word. The patterns are language
specific.

This mechanism is agnostic when it comes to the characters involved: they are
just numbers. However, when in a next step font features like ligature building
and kerning are applied, we also have to deal with language specific properties
(and meanings). Often a ligature at the boundary of a composed word can make
reading confusing and has to be avoided. Some of that can be controlled by the
font when it implements language specific features, but because that approach is
not based on a dictionary it is more about playing safe and prevention than about
quality.

In the next sections a mechanism is discussed that also uses patterns. This time
it is about controlling fonts as well as how hyphenation patterns are applied.
This process kicks in before hyphenation is applied but it definitely has to be
seen as part of that same process. It is integrated in the hyphenation machinery
and acts as a preprocessor with the possibility to feed back and move forward.
The implementation is such that when it's not used there is no performance
penalty.
\footnote {There are by now plenty of alternative approaches to these problems
but after some discussion about the pros and cons of each this new mechanism was
made. I admit that the fun factor played a role. It is also one of the things we
can do in \LUAMETATEX\ without worrying about a possible negative impact on
\LUATEX\ users other than \CONTEXT .}

There are several predefined operations that are characterized by keywords and
shortcuts and collected in an option list that is part of a language goodie file.
Examples can be found in the distribution in files with the suffix \type {llg}
(\LUA\ language goodie). The framework of such a file is:

\starttyping
return {
    name       = "whatever",
    version    = "1.00",
    comment    = "Goodies for experiments and demo.",
    author     = "Hans Hagen",
    copyright  = "ConTeXt development team",
    options    = {
        { ... },
        ........
        { ... },
    }
}
\stoptyping

These options will eventually result in patterns that are bound to words,
think of:

\starttabulate[|T||||]
\NC effe     \NC \type {foo|bar}   \NC \type {..|..}     \NC inhibit ligature \NC \NR
\NC foobar   \NC \type {foo=bar}   \NC \type {...=...}   \NC inhibit kerning  \NC \NR
\NC somemore \NC \type {some+more} \NC \type {....+....} \NC compound word    \NC \NR
\stoptabulate

The whole repertoire is:

\starttabulate[||T|]
\NC \type {a|b} \NC a:norightligature, b:noleftligature \NC \NR
\NC \type {a=b} \NC a:norightkern, b:noleftkern         \NC \NR
\NC \type {a<b} \NC b:noleftkern                        \NC \NR
\NC \type {a>b} \NC a:norightkern                       \NC \NR
\NC \type {a+b} \NC a:compound:b                        \NC \NR
\stoptabulate

Later we will see how some can be combined. An option can be defined using entries
in a subtable:

\starttabulate[|T|||]
\NC patterns   \NC hash            \NC \type {[snippet] = "replacement pattern"} \NC \NR
\NC words      \NC string          \NC string of words, separated by whitespace \NC \NR
\NC prefixes   \NC string          \NC snippets that combine with words (at the start) \NC \NR
\NC suffixes   \NC string          \NC snippets that combine with words (at the end) \NC \NR
\NC matches    \NC array or number \NC a number or table indicating which match matters \NC \NR
\NC actions    \NC hash            \NC \type {[character] = "action(s)"} \NC \NR
\NC characters \NC string          \NC permitted characters (additional hjcodes) \NC \NR
\NC return     \NC integer         \NC what to do next \NC \NR
\stoptabulate

The default return value is~2 but there are some more:

\starttabulate[|T||]
\NC 0 \NC go to the next (valid) word \NC \NR
\NC 1 \NC restart \NC \NR
\NC 2 \NC exceptions and after that patterns \NC \NR
\NC 3 \NC patterns \NC \NR
\stoptabulate

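Because \type {return} is a reserved word in \LUA, it cannot be written as a
bare key in a table constructor and needs the bracketed string form. The next
fragment is only a sketch: it assumes that the key is spelled as in the table
of entries above, and the pattern and word are just an illustration.

\starttyping
{
    patterns = {
        ff = "f|f",
    },
    words = [[
        effe
    ]],
    -- bracketed key: a bare 'return = 0' is a Lua syntax error
    ["return"] = 0, -- 0: go to the next (valid) word
}
\stoptyping
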
There are some safeguards built in that force a restart. For instance, when a word
is replaced a restart is enforced unless we skip the word. A restart will not
permit a second replacement (after all we need to avoid endless loops).

In a multi|-|line word list, lines that start with a comment trigger are skipped.
Such a trigger is \LUA's double dash or the usual \TEX\ percent sign.

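The handling of comment lines can be pictured with the following plain \LUA\
sketch. It is not the code used in the distribution: the function name \type
{wordlist} is made up for this illustration and the real parser might treat
edge cases differently.

\starttyping
-- Collect the words of a multi-line list, skipping lines that start
-- with a comment trigger ('--' or '%'). A sketch, not the real parser.
local function wordlist(list)
    local result = { }
    for line in string.gmatch(list, "[^\r\n]+") do
        -- strip leading and trailing whitespace
        local text = string.match(line, "^%s*(.-)%s*$")
        if not (string.find(text, "^%-%-") or string.find(text, "^%%")) then
            for word in string.gmatch(text, "%S+") do
                result[#result+1] = word
            end
        end
    end
    return result
end

local words = wordlist [[
    ijverig
 -- fijn -- to ligature fi or ij, that's the question
    effe
]]

print(table.concat(words, " ")) -- ijverig effe
\stoptyping
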
\stopsection

\startsection[title=Inhibiting]

The next definition replaces \type {ff} by \type {f|f} in the words given and
eventually blocks a ligature.

\starttyping
{
    patterns = {
        ff  = "f|f",
    },
    words = [[
        effe
    ]],
}
\stoptyping

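As a rough mental model (not the actual implementation) one can think of the
\type {patterns} hash as a set of literal replacements that are applied to each
word in the list. The helper \type {annotate} below is hypothetical and only
meant to show what the annotated word looks like. It is kept to a single
pattern because the order in which \type {pairs} visits several patterns is
undefined.

\starttyping
-- Apply each snippet = replacement pair literally to a word.
-- A sketch under the assumptions stated above, not the real code.
local function annotate(word, patterns)
    for snippet, replacement in pairs(patterns) do
        -- escape punctuation so that the snippet is matched literally
        local find = string.gsub(snippet, "%p", "%%%0")
        -- escape percent signs so that the replacement is taken literally
        local repl = string.gsub(replacement, "%%", "%%%%")
        word = string.gsub(word, find, repl)
    end
    return word
end

print(annotate("effe",    { ff = "f|f" })) -- ef|fe
print(annotate("effeffe", { ff = "f|f" })) -- ef|fef|fe
\stoptyping
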
Some fonts provide the \type {ij} ligature or do some special kerning between
these characters (something Dutch). Because it depends on the font logic whether
a dedicated replacement or kerning is used, this is an example where we inhibit
both:

\starttyping
{
    patterns = {
        ij = "i|j",
    },
    actions = {
        ["|"] = "nokern noligature",
    },
    words = [[
        ijverig
     -- fijn -- to ligature fi or ij, that's the question
    ]],
}
\stoptyping

A more extensive definition is the following. Here we explicitly define that only
the first match in a word gets treated. We not only block ligatures but also
kerns.

\starttyping
{
    patterns = {
        ff  = "f|f",
    },
    matches = { 1 },
    actions = {
        ["|"] = "noligature nokern"
    },
    words = [[
        effe
        effeffe
    ]],
}
\stoptyping

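The effect of \type {matches} can be sketched in plain \LUA\ as follows. The
function \type {annotateselected} is made up for this illustration and assumes
that the snippet contains no characters that are special in \LUA\ patterns. The
real code may work differently.

\starttyping
-- Replace only those occurrences of a snippet whose index is listed
-- in 'matches'. A sketch, not the code from the distribution.
local function annotateselected(word, snippet, replacement, matches)
    local wanted = { }
    for _, index in ipairs(matches) do
        wanted[index] = true
    end
    local n = 0
    -- the parentheses drop the match count that gsub also returns
    return (string.gsub(word, snippet, function()
        n = n + 1
        if wanted[n] then
            return replacement
        end
        -- returning nothing keeps the original text of this match
    end))
end

print(annotateselected("effeffe", "ff", "f|f", { 1 }))    -- ef|feffe
print(annotateselected("effeffe", "ff", "f|f", { 1, 2 })) -- ef|fef|fe
\stoptyping
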
You can also omit the pattern when you inject specifiers yourself:

\starttyping
{
    actions = {
        ["|"] = "noligature nokern"
    },
    words = [[
        ef|fe
        ef|fef|fe
    ]],
}
\stoptyping

You can also use different shortcuts:

\starttyping
{
    actions = {
        ["1"] = "noligature",
        ["2"] = "nokern"
    },
    words = [[
        ef1fe
        ef1fef2fe
    ]],
}
\stoptyping

Although I cannot come up with a nice example, there can be reasons for
inhibiting kerns. Here we inhibit kerns left of the upcoming character:

\starttyping
{
    patterns = {
        fo = "f<o",
        rm = "r<m",
    },
    words = [[
        information
    ]],
}
\stoptyping

And here we inhibit kerns right of the previous and left of the upcoming
character:

\starttyping
{
    patterns = {
        th = "t=h",
    },
    words = [[
        thrive
    ]],
}
\stoptyping

220
221Just look in the files in the distribution for realistic examples, like
222
223\starttyping
224{
225    patterns = {
226        fi = "f|i",
227    },
228    words = [[
229        deafish dwarfish elfish oafish selfish
230    ]],
231    suffixes = [[
232        ness ly
233    ]]
234}
235\stoptyping
236
237where we block ligatures in 15 words. There's also a \type {prefixes} key.
238
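The count of 15 follows when each of the five words is taken as is and also
combined with each of the two suffixes. The next sketch shows that arithmetic
in plain \LUA. The function \type {expand} is hypothetical and the real
implementation may construct the list differently.

\starttyping
-- Combine every word with every suffix and keep the bare word too.
-- A sketch of the counting only, not the code from the distribution.
local function expand(words, suffixes)
    local result = { }
    for word in string.gmatch(words, "%S+") do
        result[#result+1] = word
        for suffix in string.gmatch(suffixes, "%S+") do
            result[#result+1] = word .. suffix
        end
    end
    return result
end

local list = expand("deafish dwarfish elfish oafish selfish", "ness ly")

print(#list)   -- 15
print(list[2]) -- deafishness
\stoptyping
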
\stopsection

\startsection[title=Replacements]

Replacements are probably not used that much but here is one for German. Not
only is the uppercase variant of ß seldom used, many fonts also don't provide
it, so we can best replace it:

\starttyping
{
    characters = "ẞ", -- uppercase ß, not visible in all verbatim fonts
    patterns   = {
        ["ẞ"] = "SS", -- key is uppercase ß
    },
}
\stoptyping

255
256Here we define that character as valid, something that normally is done with the
257patterns but patterns don't have them. If we do not specify it here, the
258hyphenator will skip this word. For the record: this can also be done with a font
259feature that decomposes the character.
260
\stopsection

\startsection[title=Compound words]

You might want to suppress ligatures and maybe even kerning when compound words
are involved.

\starttyping
{
    patterns = {
        ff = "f+f",
    },
    words = [[
        aaaaffaaaa
        bbffbb
    ]],
}
\stoptyping

Again you can also inject the specifier yourself:

\starttyping
{
    words = [[
        aaaaf+faaaa
        bbf+fbb
    ]],
}
\stoptyping

But patterns make sense when you have a large list (that might come from some
other source than yourself).

The next specification will turn two times three \type {bla}'s into a compound
word but also make sure that we have at least 4 characters left and right of a
potential break.

\starttyping
{
    left  = 4,
    right = 4,
    words = [[
        blablabla+blablabla
    ]],
}
\stoptyping

\stopsection

\startsection[title=Performance]

Although these mechanisms introduce overhead, the performance hit in \LMTX\ is
not that large. This is because the number of words in a document is limited and
\LUA\ is fast enough.

\stopsection

\startsection[title=Plugins]

{\em This interface is preliminary but for the record I put an example here
anyway.}

\starttyping
local n = 0
function document.myhack(original)
    n = n + 1
    print(n,original)
    return original
end

languages.installhandler("de","document.myhack")
\stoptyping

One can manipulate a text as in:

\starttyping
function document.myhack(original)
    local t = utf.split(original)
    t = table.reverse(t)
    local f = t[#t] -- first character of the original word
    local l = t[1]  -- last character of the original word
    if characters.upper(f) == f then
        -- the word started with an uppercase character: move the casing
        t[1]  = characters.upper(l)
        t[#t] = characters.lower(f)
    end
    return table.concat(t)
end

languages.installhandler("en","document.myhack")
\stoptyping

The text will be fed again into the hyphenator and treated in the normal way.
There are some safeguards against the text being processed twice.

\stopsection

\startsection[title=Embedded definitions]

You can also embed definitions in the source file:

\starttyping
\startlanguageoptions[de]
    Zapf|innovation
\stoplanguageoptions
\stoptyping

\stopsection

\startsection[title=Exceptions]

When you set exceptions in a goodie file, it will use the plugin mechanism to
check for them. This is a bit more efficient than using the internal checker,
which actually also goes via a \LUA\ hash.

\starttyping
{
    exceptions = [[
        a-very{-}{-}{w}eird{1}{2}{3}(w)ord
    ]],
}
\stoptyping

Watch out: when you specify a discretionary replacement three braced values are
passed: the pre, post and replace text. The replace text is used in the lookup,
unless you add a string between parentheses, which then will be used instead. A
digit between brackets will apply a penalty according to the following logic (in
the engine): A zero digit results in \type {\hyphenpenalty}, otherwise the
digits~1 up to~9 will be used as multiplier for \type {\exceptionpenalty} when
that value is larger than 100000, otherwise \type {\exceptionpenalty} is used.

\stopsection

\startsection[title=Tracing]

The following tracker can be used:

\starttyping
\enabletrackers[languages.goodies]
\stoptyping

In addition the style \type {languages-goodies} implements some tracing options.
You can just run that one to see what it does.

The engine itself also has a tracing option: \type {\tracinghyphenation}. When
set to zero nothing is shown, when set to one redundant patterns will be
reported. A value of two reports what words get fed into the hyphenator and if
they got hyphenated. A value of three gives more detail: when a word gets
hyphenated the relevant (resulting) part of the node list is shown. You need to
set \type {\tracingonline} to a value larger than zero to get this reported to
the console. Expect lots of extra output to the console for large documents but
it can be revealing.

\stopsection

\stopchapter

\stopcomponent
419
420%D Musical timestamp: end Match 2021: running into Joe Parrish's amazing
421%D interpretation of Stravinsky's "Rite of Spring" on guitars.
422%D
423%D Also on YT: The Rite of Spring by London Symphony Orchestra (conducted
424%D by Simon Rattle).
425