% mk-xml.tex / size: 21 Kb / last modification: 2023-12-21 09:43
% language=us

% \startluacode
%     xml.trace_lpath = true
% \stopluacode

\startcomponent mk-xml

\environment mk-environment

\chapter{XML revisioned}
12
13{\em The code dealing with \XML\ is evolving and the following
14text might be outdated. So, in case of doubt, check the manual.}
15
16\subject{the parser}
17
For quite a while \CONTEXT\ has had built-in support for \XML\ processing, and
at \PRAGMA\ we use this extensively. One of the first things I tried to deal
with in \LUA\ was \XML, and now that we have \LUATEX\ up and running it's
time to investigate this a bit more. First we'll have a look at the basic
functions, the \LUA\ side of the game.
23
24We load an \XML\ file as follows (the \type {document} namespace
25is predefined in \CONTEXT):
26
27\startbuffer
28\startluacode
29    document.xml = document.xml or { } -- define namespace
30    document.xml = xml.load("mk-xml.xml") -- load the file
31\stopluacode
32\stopbuffer
33
34\typebuffer \getbuffer
35
36The loader constructs a table representing the document structure, including
37whitespace, so let's serialize the code and see what shows up:
38
39\startbuffer
40\startluacode
41    local prn = xml.newhandlers { handle = tex.sprint }
42    tex.sprint("\\starttyping")
43    xml.serialize(document.xml, prn)
44    tex.sprint("\\stoptyping")
45\stopluacode
46\stopbuffer
47
48\typebuffer
49
50In the first version of the serializer, we could pass extra function
51arguments that controlled the way content was processed. This method
52has now been replaced by handlers. In this example we create a
53simple handler where the \type {handle} function is responsible
54for the final print.
55
56\getbuffer
57
This already gives us a rather basic way to manipulate documents, and
this method is not even that slow because we bypass \TEX\ reading from
file.
61
62\startbuffer
63\startluacode
64    local str = "<l> <w>hello</w> <w>world</w> </l>"
65    local prn = xml.newhandlers { handle = tex.sprint }
66    tex.sprint("\\starttyping")
67    xml.serialize(xml.convert(str),prn)
68    tex.sprint("\\stoptyping")
69\stopluacode
70\stopbuffer
71
72\typebuffer
73
Watch the extra print argument: we need this because otherwise the
verbatim mode will not work out well.
76
77\getbuffer
78
79You need to keep in mind that in these examples we print to \TEX\ under
80the current catcode regime.
81
You can save an \XML\ table with the command:
83
84\starttyping
85\startluacode
86    xml.save(document.xml,"newfile.xml")
87\stopluacode
88\stoptyping
89
90These examples show that you have access to \XML\ files from
91within your document. If you want to convert the table to just a
92string, you can use \type {xml.tostring}. Actually, this method is
93automatically used for occasions where \LUA\ wants to print an
94\XML\ table or wants to join string snippets. However, as we are
95inside \TEX, we need to print to \TEX\ instead of the console or
96file. For this we use specialized handlers.
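
As a small illustration of the string route, the following sketch converts an
inline snippet and writes the serialized result to the console instead of to
\TEX. It only uses \type {xml.convert} and \type {xml.tostring} as described
above; the variable names are just for illustration.

\starttyping
\startluacode
    -- parse a string into an XML table (same helper as used above)
    local root = xml.convert("<l> <w>hello</w> <w>world</w> </l>")
    -- turn the table back into a plain Lua string
    local str  = xml.tostring(root)
    -- print goes to the console, not into the TeX input stream
    print(str)
\stopluacode
\stoptyping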
97
98The reason why I wrote the \XML\ parser is that we need it in the
99utilities (so it has to provide access to the content of elements)
100as well as in the text processing (so it needs to provide some
101manipulation features). To serve both we have implemented a subset
102of what standard \XML\ tools qualify as path based searching.
103
104\startbuffer
105\startluacode
106    xml.sprint(xml.first(document.xml, "/one/three/some"))
107\stopluacode
108\stopbuffer
109
110\typebuffer
111
112The result of this snippet is the content of the first element
113that matches the specification: \quote{\getbuffer}. As you can
114see, this comes out rather verbose. The reason for this is that we
115need to enter \XML\ mode in order to get such a snippet
116interpreted.
117
118Below we give a few more variants, this time
119we use a generic filter:
120
121\startbuffer
122\startluacode
123    xml.sprint(xml.filter(document.xml, "/one/three/some"))
124\stopluacode
125\stopbuffer
126
127\typebuffer result: \astype{\getbuffer}
128
129\startbuffer
130\startluacode
131    xml.sprint(xml.filter(document.xml, "/one/three/some/first()"))
132\stopluacode
133\stopbuffer
134
135\typebuffer result: \astype{\getbuffer}
136
137\startbuffer
138\startluacode
139    xml.sprint(xml.filter(document.xml, "/one/three/some[1]"))
140\stopluacode
141\stopbuffer
142
143\typebuffer result: \astype{\getbuffer}
144
145\startbuffer
146\startluacode
147    xml.sprint(xml.filter(document.xml, "/one/three/some[-1]"))
148\stopluacode
149\stopbuffer
150
151\typebuffer result: \astype{\getbuffer}
152
153\startbuffer
154\startluacode
155    xml.sprint(xml.filter(document.xml, "/one/three/some/texts()"))
156\stopluacode
157\stopbuffer
158
159\typebuffer result: \astype{\getbuffer}
160
161\startbuffer
162\startluacode
163    xml.sprint(xml.filter(document.xml, "/one/three/some[2]/text()"))
164\stopluacode
165\stopbuffer
166
167\typebuffer result: \astype{\getbuffer}
168
These lines show only some of the variants. There are more than these and
we will extend the repertoire over time. If needed you can define
additional handlers.
172
173\subject{performance}
174
Before we continue with more examples, a few remarks about
performance. The first version of the parser was an enhanced
version of the one presented in the \LUA\ book: support for
namespaces, processing instructions, comments, cdata and doctype,
remapping and a few more things. When playing with the parser I
was quite satisfied with the performance. However, when I started
experimenting with 40~megabyte files, the preprocessing (needed
for the special elements) started to become more noticeable. For
smaller files its 40\% overhead is not that disturbing, but for
large files \unknown\
185
The current version uses \LPEG. We follow the same approach as
before (a stack, a top element and such), but this time parsing is about
twice as fast, which is mostly due to the fact that we don't have
to prepare the stream for cdata, doctype etc. Loading the
mentioned large file took 12.5 seconds (1.5 for file io and the
rest for tree building) on my laptop (a 2.3 GHz Core Duo running
Windows Vista). With the \LPEG\ implementation we got that down to
less than 7.3 seconds. Loading the 14 interface definition files (2.6
megabytes) went down from 1.05 seconds to 0.55 seconds. Namespace
related issues take some 10\% of this.
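
Timings like these can be reproduced with a few lines of \LUA. The following
is only a sketch: the file name is hypothetical, and \type {os.clock} measures
processor time, so the numbers can differ from wall clock measurements.

\starttyping
\startluacode
    local name  = "large-file.xml"   -- hypothetical test file
    local start = os.clock()         -- processor time before loading
    local root  = xml.load(name)     -- parse the file into a table
    local stop  = os.clock()         -- processor time after loading
    print(string.format("loading %s took %.3f seconds",name,stop-start))
\stopluacode
\stoptyping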
196
197Of course these numbers might change over time. For instance, we
198now have the second implementation of the filter mechanism which
199is more advanced and maybe somewhat slower on some tasks.
200
201\subject{patterns}
202
We will not implement complete \XPATH\ functionality, but only the
features that make sense for documents that are well structured
and need to be typeset. In addition we (will) implement text
manipulation functions. Of course speed is also a consideration
when implementing such mechanisms.
208
The following list is not complete (after all, here we only give an
impression of the development) but it shows what kind of patterns we support.
211
212\nonknuthmode
213
214\starttabulate[|l|c|l|]
215\NC \bf pattern                        \NC \bf supported \NC \bf comment              \NC \NR
216\HL
217\NC \type{a}                           \NC \star         \NC not anchored             \NC \NR
218\NC \type{!a}                          \NC \star         \NC not anchored,negated     \NC \NR
219\NC \type{a/b}                         \NC \star         \NC anchored on preceding    \NC \NR
220\NC \type{/a/b}                        \NC \star         \NC anchored (current root)  \NC \NR
221\NC \type{^a/c}                        \NC \star         \NC anchored (current root)  \NC \NR
222\NC \type{^^/a/c}                      \NC todo          \NC anchored (document root) \NC \NR
223\NC \type{a/*/b}                       \NC \star         \NC one wildcard             \NC \NR
224\NC \type{a//b}                        \NC \star         \NC many wildcards           \NC \NR
225\NC \type{a/**/b}                      \NC \star         \NC many wildcards           \NC \NR
226\NC \type{.}                           \NC \star         \NC ignored self             \NC \NR
227\NC \type{..}                          \NC \star         \NC parent                   \NC \NR
228\NC \type{a[5]}                        \NC \star         \NC index upwards            \NC \NR
229\NC \type{a[-5]}                       \NC \star         \NC index downwards          \NC \NR
230\NC \type{a[position()=5]}             \NC maybe         \NC                          \NC \NR
231\NC \type{a[first()]}                  \NC maybe         \NC                          \NC \NR
232\NC \type{a[last()]}                   \NC maybe         \NC                          \NC \NR
233\NC \type{(b|c|d)}                     \NC \star         \NC alternates (one of)      \NC \NR
234\NC \type{b|c|d}                       \NC \star         \NC alternates (one of)      \NC \NR
235\NC \type{!(b|c|d)}                    \NC \star         \NC not one of               \NC \NR
236\NC \type{a/(b|c|d)/e/f}               \NC \star         \NC anchored alternates      \NC \NR
237\NC \type{(c/d|e)}                     \NC not likely    \NC nested subpaths          \NC \NR
238\NC \type{a/b[@bla]}                   \NC \star         \NC any value of             \NC \NR
239\NC \type{a/b/@bla}                    \NC \star         \NC any value of             \NC \NR
240\NC \type{a/b[@bla='oeps']}            \NC \star         \NC equals value             \NC \NR
241\NC \type{a/b[@bla=='oeps']}           \NC \star         \NC equals value             \NC \NR
242\NC \type{a/b[@bla<>'oeps']}           \NC \star         \NC different value          \NC \NR
243\NC \type{a/b[@bla!='oeps']}           \NC \star         \NC different value          \NC \NR
244\TB
245\NC \type{...../attribute(id)}         \NC \star         \NC                          \NC \NR
246\NC \type{...../attributes()}          \NC \star         \NC                          \NC \NR
247\NC \type{...../text()}                \NC \star         \NC                          \NC \NR
248\NC \type{...../texts()}               \NC \star         \NC                          \NC \NR
249\NC \type{...../first()}               \NC \star         \NC                          \NC \NR
250\NC \type{...../last()}                \NC \star         \NC                          \NC \NR
251\NC \type{...../index(n)}              \NC \star         \NC                          \NC \NR
252\NC \type{...../position(n)}           \NC \star         \NC                          \NC \NR
253\TB
254\NC \type{root::}                      \NC \star         \NC                          \NC \NR
255\NC \type{parent::}                    \NC \star         \NC                          \NC \NR
256\NC \type{child::}                     \NC \star         \NC                          \NC \NR
257\NC \type{ancestor::}                  \NC \star         \NC                          \NC \NR
258\NC \type{preceding-sibling::}         \NC not soon      \NC                          \NC \NR
259\NC \type{following-sibling::}         \NC not soon      \NC                          \NC \NR
\NC \type{preceding-sibling-or-self::} \NC not soon      \NC                          \NC \NR
\NC \type{following-sibling-or-self::} \NC not soon      \NC                          \NC \NR
\NC \type{descendant::}                \NC \star         \NC                          \NC \NR
\NC \type{descendant-or-self::}        \NC \star         \NC                          \NC \NR
264\NC \type{preceding::}                 \NC not soon      \NC                          \NC \NR
265\NC \type{following::}                 \NC not soon      \NC                          \NC \NR
266\NC \type{self::node()}                \NC not soon      \NC                          \NC \NR
267\NC \type{id("tag")}                   \NC not soon      \NC                          \NC \NR
268\NC \type{node()}                      \NC not soon      \NC                          \NC \NR
269\stoptabulate
270
271This list shows that it is also possible to ask for more matches at
272once. Namespaces are supported (including a wildcard) and there are
273mechanisms for namespace remapping.
274
275\startbuffer
276\startluacode
277    lxml.concat(document.xml,"/one/(three|five)/some",", "," and ")
278\stopluacode
279\stopbuffer
280
281\typebuffer
282
283We get: \astype{\getbuffer} and if we say:
284
285\startbuffer
286\startluacode
287    lxml.concat(document.xml,"/one/(three|five)/some",", "," and ",
288        true)
289\stopluacode
290\stopbuffer
291
292\typebuffer
293
294We get: \quote {\getbuffer}.
295
Watch how we use the \type {lxml} namespace here! This is where the
functions that pipe their result to \TEX\ live.
298
299\startbuffer
300\startluacode
301    lxml.count(document.xml,"/one/(three|five)/some")
302\stopluacode
303\stopbuffer
304
There are several helper functions, like \type {xml.count} (used here in its
\type {lxml} form), which in this case returns~\getbuffer.
307
308\typebuffer
309
Functions like this give the opportunity to loop over lists of elements
by index.
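
A loop by index could look as follows. This is a sketch that assumes
\type {xml.count} returns the number of matches as a \LUA\ number (the
\type {lxml} variant used above pipes it to \TEX\ instead). It combines the
count with the index filter shown earlier.

\starttyping
\startluacode
    local pattern = "/one/three/some"
    -- assumption: xml.count returns a number instead of printing it
    local n = xml.count(document.xml,pattern)
    for i=1,n do
        -- select the i-th match with an index filter like some[1]
        xml.sprint(xml.filter(document.xml,pattern .. "[" .. i .. "]"))
        if i < n then
            tex.sprint(", ") -- separator between matches
        end
    end
\stopluacode
\stoptyping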
312
313\subject{manipulations}
314
315We can manipulate elements too. The next code will add some elements
316at specific locations.
317
318\startbuffer
319\startluacode
320    xml.before(document.xml,"xml:///one/three/some","<be>ok</be>")
321    xml.after (document.xml,"xml:///one/three/some","<af>ok</af>")
322    tex.sprint("\\starttyping")
323    xml.sprint(lxml.filter(document.xml,"/one/three"))
324    tex.sprint("\\stoptyping")
325\stopluacode
326\stopbuffer
327
328\typebuffer
329
330And indeed, we suddenly have a couple of \quote {ok}'s there:
331
332\getbuffer
333
Of course we can also delete elements:
335
336\startbuffer
337\startluacode
338    xml.delete(document.xml,"/one/three/some")
339    xml.delete(document.xml,"/one/three/af")
340    tex.sprint("\\starttyping")
341    xml.sprint(lxml.filter(document.xml,"/one/three"))
342    tex.sprint("\\stoptyping")
343\stopluacode
344\stopbuffer
345
346\typebuffer
347
348Now we have:
349
350\getbuffer
351
352Replacing an element is also possible. The replacement can be a
353table (representing elements) or a string which is then converted
354into a table first.
355
356\startbuffer
357\startluacode
358    xml.replace(document.xml,"/one/three/be","<mid>done</mid>")
359    tex.sprint("\\starttyping")
360    xml.sprint(lxml.filter(document.xml,"/one/three"))
361    tex.sprint("\\stoptyping")
362\stopluacode
363\stopbuffer
364
365\typebuffer
366
367And indeed we get:
368
369\getbuffer
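
The table variant of a replacement can be sketched in the same way. Here we
assume that the result of \type {xml.convert}, which we used before to parse a
string, is acceptable as a replacement table; the element name \type {final}
is made up for this example.

\starttyping
\startluacode
    -- build the replacement as a table instead of passing a string
    local new = xml.convert("<final>done</final>")
    -- replace the element that was inserted in the previous step
    xml.replace(document.xml,"/one/three/mid",new)
\stopluacode
\stoptyping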
370
These are just a few features of the library. I will add some more (rather) generic
manipulators and extend the functionality of the existing ones. Also, there will
be a few manipulation functions that come in handy when preparing texts for
processing with \TEX\ (most of the \XML\ that I deal with is rather dirty and needs
some cleanup).
376
377\subject{streaming trees}
378
Eventually we will provide a series of convenient macros that offer an
alternative for most of the \MKII\ code. In \MKII\ we have a streaming parser, which
boils down to attaching macros to elements. This includes a mechanism for saving
and restoring data, but this is not always convenient because one also has to
intercept elements that need to be hidden.
384
In \MKIV\ we do things differently. First we load the complete document in memory (a
\LUA\ table). Then we flush the elements that we want to process. We can associate
setups with elements using the filters mentioned before. We can either use \TEX\ or
use \LUA\ to manipulate content. Instead of a streaming parser we now have a mixture
of streaming and tree manipulation available. Interestingly, the \XML\ loader
is pretty fast and piping data to \TEX\ is also efficient. Since we no longer need to
manipulate the elements in \TEX\ we gain processing time too, so in practice we now
have much faster \XML\ processing available.
393
394To give you an idea we show a few commands:
395
396\startbuffer
397\xmlload {main}{mk-xml.xml}
398\stopbuffer
399
400\typebuffer \getbuffer
401
402So that we can do things like (there are and will be a few more):
403
404\starttabulate[|l|l|l|]
405\NC \bf command        \NC \bf arguments                        \NC \bf result                          \NC \NR
406\NC \type {\xmlfirst}  \NC \type {{main} {/one/three/some}}     \NC \xmlfirst{main}{/one/three/some}    \NC \NR
407\NC \type {\xmllast }  \NC \type {{main} {/one/three/some}}     \NC \xmllast {main}{/one/three/some}    \NC \NR
408\NC \type {\xmlindex}  \NC \type {{main} {/one/three/some} {2}} \NC \xmlindex{main}{/one/three/some}{2} \NC \NR
409\stoptabulate
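
In running text these commands can be used as in the following sketch, which
assumes that the file has been loaded under the name \type {main} as shown
above.

\starttyping
\xmlload {main}{mk-xml.xml}

The first match is \xmlfirst{main}{/one/three/some}, the last one is
\xmllast{main}{/one/three/some} and the second one is
\xmlindex{main}{/one/three/some}{2}.
\stoptyping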
410
There is a set of about 30 commands that operate on the tree: loading, flushing,
filtering, associating setups and code in modules to elements. For instance when
one uses so called cals||tables, the processing is automatically activated when the
namespace can be resolved. Processing is collected in setups, and those that are
registered are processed after loading the tree. In the following example we register
a handler for content that needs to end up bold.
417
418\starttyping
419\startxmlsetups xml:mysetups
420    \xmlsetsetup{\xmldocument}{bold|bf}{xml:handlebold}
421\stopxmlsetups
422
423\xmlregistersetup{xml:mysetups}
424
425\startxmlsetups xml:handlebold
426    \dontleavehmode
427    \bgroup
428    \bf
429    \xmlflush{#1}
430    \egroup
431\stopxmlsetups
432\stoptyping
433
434In this example \type {#1} represents the root of the subtree. Say that we
435want to process an index entry which is coded as follows:
436
437\starttyping
438<index>
439    <entry>whatever</entry>
440    <key>whatever</key>
441</index>
442\stoptyping
443
444We register an additional handler (here the \type {*} is a shortcut for
445using the element's tag as setup name):
446
447\starttyping
448\startxmlsetups xml:mysetups
449    \xmlsetsetup{\xmldocument}{bold|bf}{xml:handlebold}
450    \xmlsetsetup{\xmldocument}{index}{*}
451\stopxmlsetups
452
453\xmlregistersetup{xml:mysetups}
454
455\startxmlsetups index
456    \index[\xmlfirst{#1}{key}]{\xmlfirst{#1}{entry}}
457\stopxmlsetups
458\stoptyping
459
460In practice \MKIV\ definitions are more compact than the comparable
461\MKII\ ones, especially for more complex constructs (tables and such).
462
463\starttyping
464\defineXMLenvironment
465  [index]
466  {\bgroup
467   \defineXMLsave[key]%
468   \defineXMLsave[entry]}
469  {\index[\XMLflush{key}]{\XMLflush{entry}}%
470   \egroup}
471\stoptyping
472
This looks compact, but keep in mind that we also need to get rid of
spurious spaces and when the code grows, we usually use setups to separate
the definition from the code. In any case, the \MKII\ solution involves
a few definitions as well as saving the content of elements. This is often
much more costly than the \MKIV\ method where we only locate and flush
content. Of course the document is stored in memory, but that happens
pretty fast: storing the 14~files (2~per interface) that define the \CONTEXT\
user interface takes 0.85 seconds on a 2.3 GHz Core Duo (Windows Vista) which
is not that bad if you take into account that we're talking of 2.7 megabytes
of highly structured data (many elements and attributes, not that much text).
Loading one of these files using \MKII\ code (for storing elements) takes
many more seconds.
485
486I didn't do extensive speed tests yet but for normal streamed
487processing of simple documents the penalty of loading the tree can be
488neglected. When comparing traditional \MKII\ code like:
489
490\starttyping
491\defineXMLargument   [title][id=] {\subject[\XMLop{at}]}
492\defineXMLenvironment[p]          {} {\par}
493
494\starttext
495    \processXMLfilegrouped{testspeed.xml}
496\stoptext
497\stoptyping
498
499with its \MKIV\ counterpart:
500
501\starttyping
502\startxmlsetups document
503    \xmlsetsetup\xmldocument{title|p}{*}
504\stopxmlsetups
505
506\xmlregistersetup{document}
507
508\startxmlsetups title
509    \section[\xmlatt{#1}{id}]{\xmlcontent{#1}{/}}
510\stopxmlsetups
511
512\startxmlsetups p
513    \xmlflush{#1}\endgraf
514\stopxmlsetups
515
516\starttext
517    \processXMLfilegrouped{testspeed.xml}
\stoptext
\stoptyping

I found that processing a one megabyte file with some 400 sections
takes the same runtime for both approaches. However, as soon as more
complex manipulations enter the game the \MKIV\ method starts taking
less time. Think of the manipulations needed for \MATHML\ or converting
tables into something that \CONTEXT\ can handle. Also, when we deal
with documents where we need to ignore large portions or shuffle content
around, the traditional method also has to store data in memory and in
that case \MKII\ code always loses to \MKIV\ code. Of course any speed
we gain in handling \XML\ is lost on processing complex fonts and
attributes but there we gain in quality.
532
533Another advantage of the \MKIV\ mechanisms is that we suddenly have so called
534fully expandable \XML\ handling. All manipulations take place in \LUA\ and
535there is no interfering code at the \TEX\ end.
536
537\subject{examples}
538
539For the path freaks we now show what patterns lead to. For this we will
540use the following \XML\ data:
541
542\startbuffer[xml]
543<?xml version='1.0' ?>
544<a>
545    <?what is this?>
546    <b>
547        <c n='x'>c1</c><d>d1</d>
548    </b>
549    <b>
550        <c n='y'>c2</c><d>d2</d>
551    </b>
552    <?what is that?>
553    <c><d>d3</d></c>
554    <c n='y'><d>d4</d></c>
555    <c><d>d5</d></c>
556</a>
557\stopbuffer
558
559\typebuffer[xml]
560
561\xmlloadbuffer{xml}{xml}
562
563\startluacode
564    function document.ShowResultOfPattern(root,pattern)
565        local ok = false
566        for r,d,k in xml.elements(lxml.id(root),pattern) do
567            tex.print(xml.tostring(d[k]))
568            tex.sprint(tex.ctxcatcodes,"\\par")
569            ok = true
570        end
571        if not ok then
572            tex.sprint("no match")
573            tex.sprint(tex.ctxcatcodes,"\\par")
574        end
575    end
576\stopluacode
577
578Here come the examples:
579
580\definehead[example][subsubject]
581\setuphead[example][style=\tt,before=\blank,after=\nowhitespace]
582
583\def\ShowResultOfPattern#1#2%
584  {\example{#2}
585   \startpacked \tttf
586   \ctxlua{document.ShowResultOfPattern("#1","#2")}
587   \stoppacked}
588
589\ShowResultOfPattern{xml}{a/b/c}
590\ShowResultOfPattern{xml}{/a/b/c}
591\ShowResultOfPattern{xml}{b/c}
592\ShowResultOfPattern{xml}{c}
593\ShowResultOfPattern{xml}{a/*/c}
594\ShowResultOfPattern{xml}{a/**/c}
595\ShowResultOfPattern{xml}{a//c}
596\ShowResultOfPattern{xml}{a/*/*/c}
597\ShowResultOfPattern{xml}{*/c}
598\ShowResultOfPattern{xml}{**/c}
599\ShowResultOfPattern{xml}{a/../*/c}
600\ShowResultOfPattern{xml}{a/../c}
601\ShowResultOfPattern{xml}{c[@n='x']}
602\ShowResultOfPattern{xml}{c[@n]}
603\ShowResultOfPattern{xml}{c[@n='y']}
604\ShowResultOfPattern{xml}{c[1]}
605\ShowResultOfPattern{xml}{b/c[1]}
606\ShowResultOfPattern{xml}{a/c[1]}
607\ShowResultOfPattern{xml}{a/c[-1]}
608\ShowResultOfPattern{xml}{c[1]}
609\ShowResultOfPattern{xml}{c[-1]}
610\ShowResultOfPattern{xml}{pi::}
611\ShowResultOfPattern{xml}{pi::what}
612
613\stopcomponent
614