% language=us runpath=texruns:manuals/languages
% languages-sorting.tex, last modification: 2021-10-28 13:50

\startcomponent languages-sorting

\environment languages-environment

\startchapter[title=Sorting][color=darkblue]

\startsection[title=Introduction]

Sorting is complex, not so much for texts that are only in English, Dutch,
German and the like, but there are languages and scripts that are more
demanding. There are several complications:

\startitemize

    \startitem
        There can be characters that have accents, like à, á, â, ã, ä
        \unknown\ that have a base shape a, and in an index these often end up
        close to each other. The order can differ per language.
    \stopitem

    \startitem
        There are uppercase and lowercase words and there can be different
        expectations about them being mixed or separated.
    \stopitem

    \startitem
        Some scripts have characters that are combinations, like Æ, and
        one might want to see them as one character or as two, in which case
        the second one also obeys the sorting order. The shape can dominate
        here.
    \stopitem

    \startitem
        Some scripts, like Japanese, are a combination of several scripts
        and sorting then depends on normalization.
    \stopitem

    \startitem
        When there are many glyphs, like in Chinese, the order can depend
        on the complexity of the glyph, and when we're lucky that order is
        reflected in the numeric character order.
    \stopitem

\stopitemize

Often the rules are somewhat strict and one can doubt if the same rules would
have been imposed if computers had been developed earlier. Given the discussions
one can also doubt if the rules are really consistent or just there because
someone (or a group) with influence set the standard (not that different from
grammar). So, when we deal with sorting, we do that in such a way that users can
(to some extent) influence the outcome. After all, one important aspect of
typesetting and organizing content is that the user gets the feeling of control,
and a deviation from a standard can be part of that. The reader will often not
notice these details.

In the next sections we will explore the way sorting is done in \CONTEXT. The
method evolved over a few decades. In \MKII\ sorting happened between runs and it
was just part of the processing of a document that users never really saw in
action. Sorting just happened and few users will have noticed that we moved from
a \MODULA\ program to a \PERL\ script and ended up with a \RUBY\ script. In fact,
there is a \LUA\ replacement but it never got tested well because we moved on to
\MKIV. There everything happens inside the engine using \LUA. Some principles
stayed the same but we are more flexible now.

\stopsection

\startsection[title=How it works]

How does sorting work out? Take these words:

\startlines
abracadabra
abräcàdábra
àbracádabrä
ábracadàbra
äbrácadabrà
\stoplines

As long as they end up in an order where the reader can find them, we're okay.
After all, we're pretty good at pattern recognition.

There are probably many ways to implement a sorter but the one we use is more or
less a follow-up on the one we had for over a decade, which was the result of an
evolution based on user demand. It boils down to cleaning up the string in such a
way that it can be split into meaningful characters. One can argue that we should
use some kind of standardized sorting method, but the problem is that we always
have to deal with, for instance, embedded \TEX\ commands and mixed content such
as numbers. And users using the same language can have different opinions about
the rules too.

A word (or sequence of words) is split into characters. Because there can be
\TEX\ commands in there, some cleanup happens beforehand. After that we create
several lists with numbers that will be compared when sorting two entries.

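To make this more concrete, here is a minimal sketch in plain \LUA\ of what such
a split could look like. It is not the code that \CONTEXT\ uses: the function
name \type {split} and the fields it returns are made up for this illustration,
lowercasing is limited to ASCII letters, and the sketch relies on the \type
{utf8} library that comes with \LUA\ 5.3 and later.

\starttyping
-- A toy splitter: one entry per character, in three parallel lists.
local function split(str)
    local ch, uc, zc = { }, { }, { }
    for _, code in utf8.codes(str) do
        local n = #ch + 1
        ch[n] = utf8.char(code) -- raw character
        uc[n] = code            -- unicode number
        -- lowercase number; only ASCII letters are handled in this sketch
        if code >= 0x41 and code <= 0x5A then
            zc[n] = code + 0x20
        else
            zc[n] = code
        end
    end
    return { ch = ch, uc = uc, zc = zc }
end

local s = split("Abra")
print(table.concat(s.ch," ")) -- A b r a
print(table.concat(s.uc," ")) -- 65 98 114 97
print(table.concat(s.zc," ")) -- 97 98 114 97
\stoptyping

Two entries that differ only in case get the same \type {zc} list here, while
their \type {uc} lists still tell them apart.
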
\startluacode

-- local ignoredoffset     = sorters.constants.ignoredoffset
-- local replacementoffset = sorters.constants.replacementoffset
-- local digitsoffset      = sorters.constants.digitsoffset
-- local digitsmaximum     = sorters.constants.digitsmaximum

local context = context

local utfchar    = utf.char
local utfbyte    = utf.byte
local concat     = table.concat
local gsub       = string.gsub
local formatters = string.formatters

local f_char = formatters["%s"]
local f_byte = formatters["x%02X"]

-- short descriptions of the lists that the splitter produces
local meaning = {
    ch = "raw character",
    mm = "minus mapping",
    zm = "zero  mapping",
    pm = "plus  mapping",
    mc = "lowercase - 1",
    zc = "lowercase",
    pc = "lowercase + 1",
    uc = "unicode",
}

-- typeset one table row: the key, its meaning and the entries of the list
local function show(s,key,bodyfont)
    local c = s[key]
    local t = { }
    for i=1,#c do
        local ci = c[i]
        if type(ci) == "string" then
            t[i] = f_char(ci)
        else
            t[i] = f_byte(ci)
        end
    end
    t = concat(t,"~")
    context.NC() context.maincolor() context(key)
    context.NC() context.maincolor() context(meaning[key])
    context.NC() if bodyfont then context.switchtobodyfont{bodyfont} end context(t)
    context.NC() context.NR()
end

-- split a string for the given language and show all lists in a table
function document.ShowSortSplit(str,language,bodyfont)
    language = language or "en" -- also used in the table header below
    sorters.setlanguage(language)
    local s = sorters.splitters.utf(str)
    context.starttabulate{ "|Tl|Tlj2|Tp|" }
        context.FL()
        context.NC()
        context.NC() context.maincolor() context(language)
        context.NC() if bodyfont then context.switchtobodyfont{bodyfont} end context.maincolor() context(str)
        context.NC() context.NR()
        context.ML()
        show(s,"ch",bodyfont)
        show(s,"uc")
        show(s,"zc")
        show(s,"mc")
        show(s,"pc")
        show(s,"zm")
        show(s,"mm")
        show(s,"pm")
        context.LL()
    context.stoptabulate()
end

\stopluacode

We can best demonstrate this with a few examples. As usual, an English language
example is trivial.

\ctxlua{document.ShowSortSplit("abracadabra","en")}

When we add an uppercase character we get a slightly different outcome:

\ctxlua{document.ShowSortSplit("Abracadabra","en")}

Some characters will be split, like \type {æ}:

\ctxlua{document.ShowSortSplit("æsop","en")}

It gets more complex when language specific demands kick in. Compare an English,
a German and an Austrian split:

\ctxlua{document.ShowSortSplit("Abräcàdábra","en")}
\ctxlua{document.ShowSortSplit("Abräcàdábra","de")}
\ctxlua{document.ShowSortSplit("Abräcàdábra","de-at")}

The way a character gets replaced, like \type {ä} into \type {ae}, is defined in
\type {sort-lan.lua} using \LUA\ tables. We will not explain all the obscure
details here; most of the work is already done, so users are not bothered by
these definitions. New ones can often be made by copying and adapting an
existing one.

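The effect of such a replacement can be imitated in a few lines of plain \LUA.
The table below is a hypothetical one, written for this illustration in the
spirit of a German definition; it is not a copy of what \type {sort-lan.lua}
contains, and the function name \type {replace} is made up as well.

\starttyping
-- hypothetical replacement pairs, not taken from sort-lan.lua
local replacements = {
    ["ä"] = "ae",
    ["ö"] = "oe",
    ["ü"] = "ue",
    ["ß"] = "ss",
}

-- replace each UTF-8 character that has an entry, keep all others
local function replace(str)
    return (string.gsub(str,utf8.charpattern,replacements))
end

print(replace("abräcadabra")) -- abraecadabra
\stoptyping

Because the replacement happens before the string is split, the sorter afterwards
only sees the two characters \type {a} and \type {e}.
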
The sorting itself is specified by a sequence:

\starttabulate[|TlCT{maincolor}|Tl|]
\NC default \NC zc,pc,zm,pm,uc \NC \NR
\NC before  \NC mm,mc,uc       \NC \NR
\NC after   \NC pm,mc,uc       \NC \NR
\NC first   \NC pc,mm,uc       \NC \NR
\NC last    \NC mc,mm,uc       \NC \NR
\stoptabulate

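One way to read such a sequence is as a list of tie breakers: the first key
decides, unless both entries are equal on it, in which case the next key is
consulted, and so on. The plain \LUA\ sketch below shows only that idea. The
names \type {comparelists} and \type {compare} and the hand-written entries are
assumptions made for the illustration, and the real comparison in \CONTEXT\ may
differ in its details.

\starttyping
-- compare two lists of numbers, return -1, 0 or 1
local function comparelists(a,b)
    for i=1,math.min(#a,#b) do
        if a[i] < b[i] then
            return -1
        elseif a[i] > b[i] then
            return 1
        end
    end
    if #a < #b then
        return -1
    elseif #a > #b then
        return 1
    end
    return 0
end

-- walk over the keys of a sequence until one of them decides
local function compare(a,b,sequence)
    for _, key in ipairs(sequence) do
        local result = comparelists(a[key],b[key])
        if result ~= 0 then
            return result
        end
    end
    return 0
end

-- hand-made entries for "ab" and "Ab": equal when lowercased (zc)
-- but different in their unicode numbers (uc)
local ab = { zc = { 97, 98 }, uc = { 97, 98 } }
local Ab = { zc = { 97, 98 }, uc = { 65, 98 } }

print(compare(ab,Ab,{ "zc" }))       -- 0
print(compare(ab,Ab,{ "zc", "uc" })) -- 1
\stoptyping

A function like \type {compare} can be wrapped in a closure and passed to \type
{table.sort}, so that changing the sequence changes the order without touching
the entries themselves.
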
The raw character is what we get after the (language specific) replacement has
been applied, and the \UNICODE\ numbers are used when comparing. Lowercasing is
done using the \UNICODE\ lowercase code, but one can define language specific
ones too. The plus and minus variants can be used to force lowercase before or
after uppercase. The mapping is based on an alphabet specification, so it can
differ per language, and again we provide plus and minus values that depend on
case. When a character has no case we use shapes instead. For instance, the shape
of \type {à} is \type {a}. Digits are treated specially and currently get an
offset so that they end up last in the sort order.

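The plus and minus variants can be illustrated with a small plain \LUA\ sketch.
It assumes that an uppercase character gets the number of its lowercase
counterpart minus or plus one, while a lowercase character keeps its own number,
which is how the labels \type {lowercase - 1} and \type {lowercase + 1} in the
tables above read. That interpretation, the scaling by two (so that neighbouring
letters cannot collide) and the name \type {casekeys} are assumptions of this
sketch, which only covers ASCII letters.

\starttyping
-- return the minus, zero and plus case numbers of an ASCII string
local function casekeys(str)
    local mc, zc, pc = { }, { }, { }
    for i=1,#str do
        local b = string.byte(str,i)
        local l = string.byte(string.lower(string.char(b)))
        zc[i] = 2*l
        if l ~= b then
            -- uppercase: shift around the lowercase number
            mc[i] = 2*l - 1
            pc[i] = 2*l + 1
        else
            -- lowercase or caseless: no shift
            mc[i] = 2*l
            pc[i] = 2*l
        end
    end
    return { mc = mc, zc = zc, pc = pc }
end

local upper = casekeys("A")
local lower = casekeys("a")

print(upper.mc[1] <  lower.mc[1]) -- true: with mc uppercase sorts first
print(upper.pc[1] >  lower.pc[1]) -- true: with pc uppercase sorts last
print(upper.zc[1] == lower.zc[1]) -- true: with zc case is ignored
\stoptyping
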
\defineregister[jindex]

\startbuffer
ぱあ \jindex{ぱあ}
ぱー \jindex{ぱー}
ぱぁ \jindex{ぱぁ}
\stopbuffer

{\switchtobodyfont[ipaex]\startlines\typebuffer\stoplines}

This three-entry index\jindex{ぱあ}\jindex{ぱー}\jindex{ぱぁ} should be sorted in the order:
{\switchtobodyfont[ipaex]\ruledhbox{ぱー}\enspace\ruledhbox{ぱぁ}\enspace\ruledhbox{ぱあ}}.

{\mainlanguage[jp]\switchtobodyfont[ipaex]\placeregister[jindex][language=jp,n=1,method=default]}
{\mainlanguage[jp]\switchtobodyfont[ipaex]\placeregister[jindex][language=jp,n=1,method=zm]}

\ctxlua{document.ShowSortSplit("ぱあ","jp","ipaex")}
\ctxlua{document.ShowSortSplit("ぱー","jp","ipaex")}
\ctxlua{document.ShowSortSplit("ぱぁ","jp","ipaex")}

{\em To be continued!}

\stopsection

% ぱー $\prec$ ぱぁ $\prec$ ぱあ

\startsection[title=Special usage]

The following example demonstrates how you can trick the sorter into doing other
things: \footnote {The \type {replacementlist} helper is the result of a request
by John Grasty on the mailing list.}

\startbuffer
\startluacode
    local list = {
        -- old testament
        "Genesis", "Exodus", "Leviticus", "Numbers", "Deuteronomy", "Joshua",
        "Judges", "Ruth", "1 Samuel", "2 Samuel", "1 Kings", "2 Kings",
        "1 Chronicles", "2 Chronicles", "Ezra", "Nehemiah", "Esther", "Job",
        "Psalms", "Proverbs", "Ecclesiastes", "Canticles", "Isaiah", "Jeremiah",
        "Lamentations", "Ezekiel", "Daniel", "Hosea", "Joel", "Amos", "Obadiah",
        "Jonah", "Micah", "Nahum", "Habakkuk", "Zephaniah", "Haggai",
        "Zechariah", "Malachi",
        -- new testament
        "Matthew", "Mark", "Luke", "John", "Acts", "Romans", "1 Corinthians",
        "2 Corinthians", "Galatians", "Ephesians", "Philippians", "Colossians",
        "1 Thessalonians", "2 Thessalonians", "1 Timothy", "2 Timothy", "Titus",
        "Philemon", "Hebrews", "James", "1 Peter", "2 Peter", "1 John", "2 John",
        "3 John", "Jude", "Revelation",
    }

    sorters.definitions["bible"] = {
        replacements = sorters.replacementlist(list),
    }
\stopluacode

\defineregister
  [booksort]
  [language=bible,
   n=3,
   criterium=text,
   indicator=no]
\stopbuffer

\typebuffer \getbuffer

We use this as follows:

\startbuffer
One   \booksort{Genesis+5.2}
Two   \booksort{Exodus+2}
Three \booksort{Genesis+45}
Four  \booksort{Philemon+2}
Five  \booksort{John+45}
Six   \booksort{1 John 1+45}
Seven \booksort{2 John 2+45}

\placeregister
  [booksort]
  [language=bible]
\stopbuffer

\typebuffer

which gives:

\getbuffer

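The principle behind this trick, sorting by position in a list instead of by
alphabet, can also be shown in isolation. The following plain \LUA\ sketch does
not use \type {sorters.replacementlist}; the helper \type {ranked} and the
shortened list are made up for the illustration.

\starttyping
-- map each name to its position in the list
local function ranked(list)
    local rank = { }
    for i, name in ipairs(list) do
        rank[name] = i
    end
    return rank
end

local rank = ranked { "Genesis", "Exodus", "John", "1 John", "2 John" }

local entries = { "2 John", "Exodus", "John", "Genesis", "1 John" }

-- sort by rank instead of by character
table.sort(entries,function(a,b)
    return rank[a] < rank[b]
end)

print(table.concat(entries,", "))
-- Genesis, Exodus, John, 1 John, 2 John
\stoptyping
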
\stopsection

\stopchapter

\stopcomponent