% xml-mkiv-filtering.tex / size: 7906 b / last modification: 2021-10-28 13:50
% language=us runpath=texruns:manuals/xml

\environment xml-mkiv-style

\startcomponent xml-mkiv-filtering

\startchapter[title={Filtering content}]

\startsection[title={\TEX\ versus \LUA}]

It will not come as a surprise that we can access \XML\ files from \TEX\ as well
as from \LUA. In fact there are two methods to deal with \XML\ in \LUA. First
there are the low level \XML\ functions in the \type {xml} namespace. On top of
those functions there is a set of functions in the \type {lxml} namespace that
deals with \XML\ in a more \TEX ie way. Most of these have similar commands at
the \TEX\ end.

\startbuffer
\startxmlsetups first:demo:one
  \xmlfilter {#1} {artist/name[text()='Randy Newman']/..
    /albums/album[position()=3]/command(first:demo:two)}
\stopxmlsetups

\startxmlsetups first:demo:two
  \blank \start \tt
    \xmldisplayverbatim{#1}
  \stop \blank
\stopxmlsetups

\xmlprocessfile{demo}{music-collection.xml}{first:demo:one}
\stopbuffer

\typebuffer

This gives the following snippet of verbatim \XML\ code. The indentation matches
the indentation in the whole \XML\ file. \footnote {The (probably outdated)
\XML\ file contains the collection stored on my slimserver instance. You can use
\type {mtxrun --script flac} to generate such files.}

% \doifmodeelse {atpragma} {
%     \getbuffer
% } {
    \typefile{xml-mkiv-01.xml}
% }

An alternative written in \LUA\ looks as follows:

\startbuffer
\blank \start \tt \startluacode
  local m = lxml.load("mine","music-collection.xml") -- m == lxml.id("mine")
  local p = "artist/name[text()='Randy Newman']/../albums/album[position()=4]"
  local l = lxml.filter(m,p) -- returns a list (with one entry)
  lxml.displayverbatim(l[1])
\stopluacode \stop \blank
\stopbuffer

\typebuffer

This produces:

% \doifmodeelse {atpragma} {
%     \getbuffer
% } {
    \typefile{xml-mkiv-02.xml}
% }

You can mix both methods, but in practice we use the \TEX\ commands in regular
styles and the mixture in modules, for instance in those dealing with \MATHML\
and cals tables. For complex matters you can write your own finalizers (the last
action to be taken in a match) in \LUA\ and use them at the \TEX\ end.
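
As a minimal sketch of such a finalizer, assume that finalizers for the \TEX\
end can be registered in the \type {xml.finalizers.tex} table and that they
receive the list of collected elements. The name \type {albumcount} and the
setup \type {first:demo:three} are made up for this illustration:

\starttyping
\startluacode
  -- hypothetical finalizer: typeset the number of matched elements
  function xml.finalizers.tex.albumcount(collected)
      context(tostring(collected and #collected or 0))
  end
\stopluacode

\startxmlsetups first:demo:three
  \xmlfilter {#1} {artist/name[text()='Randy Newman']/..
    /albums/album/albumcount()}
\stopxmlsetups
\stoptyping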

\stopsection

\startsection[title={A few details}]

In \CONTEXT, setups are a rather common variant on macros (\TEX\ commands) but
with their own namespace. An example of a setup is:

\starttyping
\startsetups doc:print
  \setuppapersize[A4][A4]
\stopsetups

\startsetups doc:screen
  \setuppapersize[S6][S4]
\stopsetups
\stoptyping

Given the previous definitions, later on we can say something like:

\starttyping
\doifmodeelse {paper} {
  \setup[doc:print]
} {
  \setup[doc:screen]
}
\stoptyping
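
The test above only takes the first branch when the \type {paper} mode is
active. A minimal sketch of enabling it in the document itself (passing \type
{--mode=paper} to the \type {context} runner is an alternative):

\starttyping
\enablemode[paper] % must come before the \doifmodeelse test
\stoptyping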

Another example is:

\starttyping
\startsetups[doc:header]
  \getmarking[chapter]
  \space
  --
  \space
  \pagenumber
\stopsetups
\stoptyping

in combination with:

\starttyping
\setupheadertexts[\setup{doc:header}]
\stoptyping

Here the advantage is that instead of ending up with an unreadable header
definition, we use a nicely formatted setup. An important property of setups and
the reason why they were introduced long ago is that spaces and newlines are
ignored in the definition. This means that we don't have to worry about so-called
spurious spaces but it also means that when we do want a space, we have to use
the \type {\space} command.

The only difference between setups and \XML\ setups is that the latter get an
argument (\type {#1}) that reflects the current node in the \XML\ tree.
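
A minimal sketch of that argument being passed along, using only the \type
{\xmlfilter} and \type {\xmlflush} commands shown in this chapter; the setup names
\type {demo:albums} and \type {demo:album} are made up for this illustration:

\starttyping
\startxmlsetups demo:albums
  % here #1 is the node that was handed to this setup
  \xmlfilter {#1} {albums/album/command(demo:album)}
\stopxmlsetups

\startxmlsetups demo:album
  % here #1 is one matching album node
  \xmlflush{#1} \par
\stopxmlsetups
\stoptyping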

\stopsection

\startsection[title={CDATA}]

What to do with \type {CDATA}? There are a few methods at the \LUA\ end for
dealing with it but here we just mention how you can influence the rendering.
There are four macros that play a role here:

\starttyping
\unexpanded\def\xmlcdataobeyedline {\obeyedline}
\unexpanded\def\xmlcdataobeyedspace{\strut\obeyedspace}
\unexpanded\def\xmlcdatabefore     {\begingroup\tt}
\unexpanded\def\xmlcdataafter      {\endgroup}
\stoptyping

Technically you can overload them but beware of side effects. Normally you won't
see much \type {CDATA} and whenever you do, it involves special data that needs
very special treatment anyway.
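
A minimal sketch of such an overload, which only switches to a smaller
monospaced font (\type {\ttx}) and keeps the grouping of the default
definitions intact; whether this has side effects depends on your style:

\starttyping
\unexpanded\def\xmlcdatabefore{\begingroup\ttx} % opens a group
\unexpanded\def\xmlcdataafter {\endgroup}       % closes that same group
\stoptyping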

\stopsection

\startsection[title={Entities}]

As usual with any way of encoding documents you need escapes in order to encode
the characters that are used in tagging the content, embedding comments, escaping
special characters in strings (in programming languages), etc. In \XML\ this
means that in order to encode characters like \type {<} you need an escape like
\type {&lt;} and in order then to encode an \type {&} you need \type {&amp;}.

In a typesetting workflow using a programming language like \TEX, another problem
shows up. There we have different special characters, like \type {$ $} for triggering
math, but also the backslash, braces etc. Even one such special character is already
enough to have yet another escaping mechanism at work.

Ideally a user should not worry about these issues but it helps in figuring out
problems when you know what happens under the hood. Also it is good to know that
in the code there are several ways to deal with these issues. Take the following
document:

\starttyping
<text>
    Here we have a bit of a &lt;&mess;&gt;:

    # &#35;
    % &#37;
    \ &#92;
    { &#123;
    | &#124;
    } &#125;
    ~ &#126;
</text>
\stoptyping

When the file is read the \type {&lt;} entity will be replaced by \type {<} and
the \type {&gt;} by \type {>}. The numeric entities will be replaced by the
characters they refer to. The \type {&mess;} is kind of special. We do preload
a huge list of more or less standardized entities but \type {mess} is not in
there. However, it is possible to have it defined in the document preamble, like:

\starttyping
<!DOCTYPE dummy SYSTEM "dummy.dtd" [
    <!ENTITY mess "what a mess" >
]>
\stoptyping

or even this:

\starttyping
<!DOCTYPE dummy SYSTEM "dummy.dtd" [
    <!ENTITY mess "<p>what a mess</p>" >
]>
\stoptyping

You can also define it in your document style using one of:

\startxmlcmd {\cmdbasicsetup{xmlsetentity}}
    replaces entity with name \cmdinternal {cd:name} by \cmdinternal {cd:text}
\stopxmlcmd

\startxmlcmd {\cmdbasicsetup{xmltexentity}}
    replaces entity with name \cmdinternal {cd:name} by \cmdinternal {cd:text}
    typeset under a \TEX\ regime
\stopxmlcmd
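
For instance, the \type {mess} entity from the sample document could be given
plain replacement text in the style. This is a minimal sketch that assumes the
two arguments are the name and the text, as described above:

\starttyping
\xmlsetentity {mess} {what a mess}
\stoptyping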

Such a definition will always have a higher priority than the one defined
in the document. Anyway, when the document is read in, all entities are
resolved and those that need a special treatment because they map to some
text are stored in such a way that we can roundtrip them. As a consequence,
as soon as the content gets pushed into \TEX, we not only need to intercept
special characters but also have to make sure that the following works:

\starttyping
\xmltexentity {tex} {\TEX}
\stoptyping

Here the backslash starts a control sequence while in regular content a
backslash is just that: a backslash.

Special characters are really special when we have to move text around
in a \TEX\ ecosystem.

\starttyping
<text>
    <title>About #3</title>
</text>
\stoptyping

If we map and define title as follows:

\starttyping
\startxmlsetups xml:title
    \title{\xmlflush{#1}}
\stopxmlsetups
\stoptyping

normally something like \type {\xmlflush {id::123}} will be written to the
auxiliary file and in most cases that is quite okay, but if we have this:

\starttyping
\setuphead[title][expansion=yes]
\stoptyping

then we don't want the \type {#} to end up as a hash, because later on \TEX\
can get very confused when it sees what looks like an argument in an
unexpected place. This is solved by escaping the hash like this:

\starttyping
About \Ux{23}3
\stoptyping

The \type {\Ux} command will convert its hexadecimal argument into a
character. Of course one then needs to typeset such a text under a \TEX\
character regime but that is normally the case anyway.
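
The numeric entities in the sample document earlier in this section are decimal,
while \type {\Ux} takes a hexadecimal number, so the hash, decimal 35, becomes
\type {23}. A few more conversions, worked out by hand as an illustration:

\starttyping
\Ux{23} % decimal  35, the hash
\Ux{25} % decimal  37, the percent sign
\Ux{5C} % decimal  92, the backslash
\Ux{7B} % decimal 123, the left brace
\Ux{7D} % decimal 125, the right brace
\stoptyping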

\stopsection

\stopchapter

\stopcomponent