\startcomponent languagessorting

\environment languagesenvironment

\startchapter[title=Sorting][color=darkblue]

\startsection[title=Introduction]

Sorting is complex. Texts that only use English, Dutch, German and similar
languages are not that hard, but there are languages and scripts that are more
demanding. There are several complications:

\startitemize

    \startitem
        There can be characters that have accents, like à, á, â, ã, ä
        \unknown\ that share the base shape a, and in an index these often end
        up close to each other. The order can differ per language.
    \stopitem

    \startitem
        There are upper and lowercase words and there can be different
        expectations about them being mixed or separated.
    \stopitem

    \startitem
        Some scripts have characters that are combinations, like Æ, and
        one might want to see them as one character or as two, in which case
        the second one obeys the sorting order too. The shape can dominate
        here.
    \stopitem

    \startitem
        Some scripts, like Japanese, are a combination of several scripts
        and sorting then depends on normalization.
    \stopitem

    \startitem
        When there are many glyphs, like in Chinese, the order can depend
        on the complexity of the glyph and when we're lucky that order is
        reflected in the numeric character order.
    \stopitem

\stopitemize

Often the rules are somewhat strict and one can wonder if the same rules would
have been imposed if computers had been developed earlier. Given the discussions
one can also doubt if the rules are really consistent or just there because
someone (or a group) with influence set the standard (not so much different from
grammar). So, if we deal with sorting, we do that in such a way that users can
(to some extent) influence the outcome. After all, one important aspect of
typesetting and organizing content is that the user gets the feeling of control
and a deviation from a standard can be part of that. The reader will often not
notice these details.

In the next sections we will explore the way sorting is done in \CONTEXT. The
method evolved over a few decades. In \MKII\ sorting happened between runs and it
was just part of the processing of a document that users never really saw in
action. Sorting just happened and few users will have noticed that we moved from
a \MODULA\ program to a \PERL\ script and ended up with a \RUBY\ script. In fact,
there is a \LUA\ replacement but it never got tested well because we moved on to
\MKIV. There everything happens inside the engine using \LUA. Some principles
stayed the same but we are more flexible now.

\stopsection

\startsection[title=How it works]

How does sorting work out? Take these words:

\startlines
abracadabra
abräcàdábra
àbracádabrä
ábracadàbra
äbrácadabrà
\stoplines

As long as they end up in an order where the reader can find them, we're okay.
After all, we're pretty good at pattern recognition.

There are probably many ways to implement a sorter but the one we use is more or
less a follow-up on the one we had for over a decade, which was the result of an
evolution based on user demand. It boils down to cleaning up the string in such a
way that it can be split into meaningful characters. One can argue that we should
use some kind of standardized sorting method but the problem is that we always
have to deal with, for instance, embedded \TEX\ commands and mixed content such
as numbers. And users using the same language can have different opinions about
the rules too.

A word (or sequence of words) is split into characters. Because there can be
\TEX\ commands in there some cleanup happens beforehand. After that we create
several lists with numbers that will be compared when sorting two entries.
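
The idea of comparing lists of numbers can be illustrated with a minimal sketch
in plain \LUA\ (5.3 or later, because it uses the \type {utf8} library). This is
not the \CONTEXT\ implementation: the names \type {split} and \type {compare}
are made up for this illustration, no cleanup of \TEX\ commands takes place, and
only one list (the raw code points) is created per word.

\starttyping
-- A simplified stand-in for the splitter: turn a UTF-8 string into a
-- list of code points.
local function split(str)
    local t = { }
    for _, code in utf8.codes(str) do
        t[#t+1] = code
    end
    return t
end

-- Compare two lists of numbers: the first difference decides, and when
-- one list is a prefix of the other the shorter one comes first.
local function compare(a,b)
    for i=1,math.min(#a,#b) do
        if a[i] ~= b[i] then
            return a[i] < b[i]
        end
    end
    return #a < #b
end

local words = { "abracadabra", "abra", "abbey" }

table.sort(words, function(a,b)
    return compare(split(a),split(b))
end)

-- words is now { "abbey", "abra", "abracadabra" }
\stoptyping

A real sorter would split each entry once and store the lists with the entry
instead of splitting again for every comparison.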

\startluacode

-- Helper for this chapter: show how the sorter splits a string into the
-- lists that are compared when sorting.

local context    = context

local utfchar    = utf.char
local utfbyte    = utf.byte
local concat     = table.concat
local gsub       = string.gsub
local formatters = string.formatters

local f_char     = formatters["%s"]
local f_byte     = formatters["x%02X"]

local meaning = {
    ch = "raw character",
    mm = "minus mapping",
    zm = "zero mapping",
    pm = "plus mapping",
    mc = "lowercase - 1",
    zc = "lowercase",
    pc = "lowercase + 1",
    uc = "unicode",
}

-- Typeset one table row: the key, its meaning and the values in the split.
local function show(s,key,bodyfont)
    local c = s[key]
    local t = { }
    for i=1,#c do
        local ci = c[i]
        if type(ci) == "string" then
            t[i] = f_char(ci)
        else
            t[i] = f_byte(ci)
        end
    end
    t = concat(t,"~")
    context.NC() context.maincolor() context(key)
    context.NC() context.maincolor() context(meaning[key])
    context.NC() if bodyfont then context.switchtobodyfont{bodyfont} end context(t)
    context.NC() context.NR()
end

-- Split a string for the given language and typeset all lists in a table.
function document.ShowSortSplit(str,language,bodyfont)
    sorters.setlanguage(language or "en")
    local s = sorters.splitters.utf(str)
    context.starttabulate{ "|Tl|Tlj2|Tp|" }
    context.FL()
    context.NC()
    context.NC() context.maincolor() context(language)
    context.NC() if bodyfont then context.switchtobodyfont{bodyfont} end context.maincolor() context(str)
    context.NC() context.NR()
    context.ML()
    show(s,"ch",bodyfont)
    show(s,"uc")
    show(s,"zc")
    show(s,"mc")
    show(s,"pc")
    show(s,"zm")
    show(s,"mm")
    show(s,"pm")
    context.LL()
    context.stoptabulate()
end

\stopluacode

We can best demonstrate this with a few examples. As usual, an English language
example is trivial.

\ctxlua{document.ShowSortSplit("abracadabra","en")}

When we add an uppercase character we get a slightly different outcome:

\ctxlua{document.ShowSortSplit("Abracadabra","en")}

Some characters will be split, like \type {æ}:

\ctxlua{document.ShowSortSplit("æsop","en")}

It gets more complex when language specific demands kick in. Compare an English,
German and Austrian split:

\ctxlua{document.ShowSortSplit("Abräcàdábra","en")}
\ctxlua{document.ShowSortSplit("Abräcàdábra","de")}
\ctxlua{document.ShowSortSplit("Abräcàdábra","de-at")}

The way a character gets replaced, like \type {ä} into \type {ae}, is defined in
\type {sort-lan.lua} using \LUA\ tables. We will not explain all the obscure
details here; most of the work is already done, so users are not bothered by
these definitions. And new ones can often be made by copying and adapting an
existing one.
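
To give an impression of what such a replacement step amounts to, here is a
minimal sketch in plain \LUA. The table layout (a list of pairs) and the
function name \type {replace} are assumptions made for this illustration, so
consult the definitions that ship with \CONTEXT\ before writing your own.

\starttyping
-- Hypothetical replacement pairs in the spirit of a German definition;
-- the tables in the distribution may be organized differently.
local replacements = {
    { "ä", "ae" },
    { "ö", "oe" },
    { "ü", "ue" },
    { "ß", "ss" },
}

-- Apply every pair in order. The patterns used here contain no magic
-- characters, so a plain gsub is sufficient.
local function replace(str)
    for i=1,#replacements do
        local r = replacements[i]
        str = string.gsub(str,r[1],r[2])
    end
    return str
end

print(replace("Abräcadabra")) -- Abraecadabra
\stoptyping

The result of such a step is what shows up as the raw characters in the tables
above.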

The sorting itself is specified by a sequence:

\starttabulate[TlCT{maincolor}Tl]
\NC default \NC zc,pc,zm,pm,uc \NC \NR
\NC before  \NC mm,mc,uc       \NC \NR
\NC after   \NC pm,mc,uc       \NC \NR
\NC first   \NC pc,mm,uc       \NC \NR
\NC last    \NC mc,mm,uc       \NC \NR
\stoptabulate

The raw character is what we get after the (language specific) replacement has
been applied and the unicodes are used when comparing. Lowercasing is done using
the \UNICODE\ lowercase code, but one can define language specific ones too. The
plus and minus variants can be used to force lowercase before or after uppercase.
The mapping is based on an alphabet specification, so it can differ per language,
and again we also provide plus and minus values that depend on case. When a
character has no case we use shapes instead. For instance, the shape of \type
{à} is \type {a}. Digits are treated specially and currently get an offset so
that they end up last in the sort order.
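
How a sequence drives a comparison can be sketched in plain \LUA\ as follows.
This is an illustration under assumptions, not the actual sorter: it handles
\ASCII\ only, it assumes that the minus and plus values differ from the
lowercase value for uppercase characters only, and the names \type {split},
\type {compare} and \type {sorted} as well as the two-key sequences are made up
for the occasion.

\starttyping
-- Create three lists per word: lowercase (zc), lowercase - 1 (mc) and
-- lowercase + 1 (pc). Only uppercase characters get shifted values.
local function split(str)
    local k = { zc = { }, mc = { }, pc = { } }
    for i=1,#str do
        local b = string.byte(str,i)
        local l = string.byte(string.lower(string.char(b)))
        local upper = l ~= b
        k.zc[i] = l
        k.mc[i] = upper and l - 1 or l
        k.pc[i] = upper and l + 1 or l
    end
    return k
end

-- Walk over the sequence: the first list that differs decides.
local function compare(a,b,sequence)
    for s=1,#sequence do
        local ka, kb = a[sequence[s]], b[sequence[s]]
        for i=1,math.min(#ka,#kb) do
            if ka[i] ~= kb[i] then
                return ka[i] < kb[i]
            end
        end
        if #ka ~= #kb then
            return #ka < #kb
        end
    end
    return false
end

local function sorted(list,sequence)
    local entries = { }
    for i=1,#list do
        entries[i] = { word = list[i], split = split(list[i]) }
    end
    table.sort(entries, function(a,b)
        return compare(a.split,b.split,sequence)
    end)
    local result = { }
    for i=1,#entries do
        result[i] = entries[i].word
    end
    return result
end

local list = { "Beta", "alpha", "beta" }

local lowerfirst = sorted(list, { "zc", "pc" }) -- alpha beta Beta
local upperfirst = sorted(list, { "zc", "mc" }) -- alpha Beta beta
\stoptyping

In both cases the lowercase list groups the entries and the second list in the
sequence only breaks the tie between \type {beta} and \type {Beta}.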

\defineregister[jindex]

\startbuffer
ぱあ \jindex{ぱあ}
ぱー \jindex{ぱー}
ぱぁ \jindex{ぱぁ}
\stopbuffer

{\switchtobodyfont[ipaex]\startlines\typebuffer\stoplines}

This three-entry index\jindex{ぱあ}\jindex{ぱー}\jindex{ぱぁ} should be sorted in the order:
{\switchtobodyfont[ipaex]\ruledhbox{ぱー}\enspace\ruledhbox{ぱぁ}\enspace\ruledhbox{ぱあ}}.

{\mainlanguage[jp]\switchtobodyfont[ipaex]\placeregister[jindex][language=jp,n=1,method=default]}
{\mainlanguage[jp]\switchtobodyfont[ipaex]\placeregister[jindex][language=jp,n=1,method=zm]}

\ctxlua{document.ShowSortSplit("ぱあ","jp","ipaex")}
\ctxlua{document.ShowSortSplit("ぱー","jp","ipaex")}
\ctxlua{document.ShowSortSplit("ぱぁ","jp","ipaex")}

{\em To be continued!}

\stopsection

\startsection[title=Special usage]

The following example demonstrates how you can trick the sorter into doing other
things: \footnote {The \type {replacementlist} helper is the result of a request
by John Grasty on the mailing list.}

\startbuffer
\startluacode
    local list = {

        "Genesis", "Exodus", "Leviticus", "Numbers", "Deuteronomy", "Joshua",
        "Judges", "Ruth", "1 Samuel", "2 Samuel", "1 Kings", "2 Kings",
        "1 Chronicles", "2 Chronicles", "Ezra", "Nehemiah", "Esther", "Job",
        "Psalms", "Proverbs", "Ecclesiastes", "Canticles", "Isaiah", "Jeremiah",
        "Lamentations", "Ezekiel", "Daniel", "Hosea", "Joel", "Amos", "Obadiah",
        "Jonah", "Micah", "Nahum", "Habakkuk", "Zephaniah", "Haggai",
        "Zechariah", "Malachi",

        "Matthew", "Mark", "Luke", "John", "Acts", "Romans", "1 Corinthians",
        "2 Corinthians", "Galatians", "Ephesians", "Philippians", "Colossians",
        "1 Thessalonians", "2 Thessalonians", "1 Timothy", "2 Timothy", "Titus",
        "Philemon", "Hebrews", "James", "1 Peter", "2 Peter", "1 John", "2 John",
        "3 John", "Jude", "Revelation",
    }

    sorters.definitions["bible"] = {
        replacements = sorters.replacementlist(list),
    }
\stopluacode

\defineregister
  [booksort]
  [language=bible,
   n=3,
   criterium=text,
   indicator=no]
\stopbuffer

\typebuffer \getbuffer

We use this as follows:

\startbuffer
One   \booksort{Genesis5.2}
Two   \booksort{Exodus2}
Three \booksort{Genesis45}
Four  \booksort{Philemon2}
Five  \booksort{John45}
Six   \booksort{1 John 145}
Seven \booksort{2 John 245}

\placeregister
  [booksort]
  [language=bible]
\stopbuffer

\typebuffer

which gives:

\getbuffer
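
One way to think about such a replacement list is that every name in the list
gets replaced by a single character whose code reflects the position of that
name in the list, so that the canonical order of the books becomes a numeric
order. The following plain \LUA\ sketch (5.3 or later) works that way, but keep
in mind that it is a guess at the principle and not the code of the helper
itself: the function is a local, hypothetical variant and the offset is an
arbitrary value in a private use area chosen for this example.

\starttyping
-- Assumed offset in a private use area, only meant for this sketch.
local offset = 0xF0000

-- Hypothetical variant of a replacement list builder: entry i of the
-- list is mapped onto the character with code offset + i.
local function replacementlist(list)
    local replacements = { }
    for i=1,#list do
        replacements[i] = { list[i], utf8.char(offset + i) }
    end
    return replacements
end

local books = replacementlist { "Genesis", "Exodus", "Leviticus" }

-- The character that stands for "Exodus" has a higher code than the one
-- that stands for "Genesis", so "Exodus" sorts after "Genesis".
local genesis = utf8.codepoint(books[1][2])
local exodus  = utf8.codepoint(books[2][2])
\stoptyping

With such pairs an entry like \type {Exodus2} starts with a character that sorts
after the one at the start of \type {Genesis45}, whatever the alphabetic order
of the names is.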

\stopsection

\stopchapter

\stopcomponent