% language=us runpath=texruns:manuals/xml

\environment xml-mkiv-style

\startcomponent xml-mkiv-expressions

\startchapter[title={Expressions and filters}]

\startsection[title={path expressions}]

In the previous chapters we used \cmdinternal {cd:lpath} expressions, which are a
variant of \type {xpath} expressions as in \XSLT\ but in this case more geared
towards usage in \TEX. This mechanism will be extended when the need arises.

A path is a sequence of matches. A simple path expression is:

\starttyping
a/b/c/d
\stoptyping

Here each \type {/} goes one level deeper. We can go backwards in a lookup with
\type {..}:

\starttyping
a/b/../d
\stoptyping

We can also combine lookups, as in:

\starttyping
a/(b|c)/d
\stoptyping

A negated lookup is preceded by a \type {!}:

\starttyping
a/(b|c)/!d
\stoptyping

A wildcard is specified with a \type {*}:

\starttyping
a/(b|c)/!d/e/*/f
\stoptyping

In addition to these tag based lookups we can use attributes:

\starttyping
a/(b|c)/!d/e/*/f[@type=whatever]
\stoptyping

An \type {@} as first character means that we are dealing with an attribute.
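
To make this concrete, the following sketch hands such a path to \type {\xmlall},
a macro that is used later in this chapter. The setup name \type {demo:attribute}
and the element names are hypothetical and only serve as an illustration:

\starttyping
\startxmlsetups demo:attribute
    \xmlall{#1}{/a/b[@type=whatever]}
\stopxmlsetups
\stoptyping
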
Within the square brackets there can be boolean expressions:

\starttyping
a/(b|c)/!d/e/*/f[@type=whatever and @id>100]
\stoptyping

You can use functions as in:

\starttyping
a/(b|c)/!d/e/*/f[something(text()) == "oeps"]
\stoptyping

There are a couple of predefined functions:

\starttabulate[|l|l|p|]
\NC \type{rootposition} \type{order} \NC number \NC the index of the matched root element (kind of special) \NC \NR
\NC \type{position}                  \NC number \NC the current index of the matched element in the match list \NC \NR
\NC \type{match}                     \NC number \NC the current index of the matched element sub list with the same parent \NC \NR
\NC \type{first}                     \NC number \NC \NC \NR
\NC \type{last}                      \NC number \NC \NC \NR
\NC \type{index}                     \NC number \NC the current index of the matched element in its parent list \NC \NR
\NC \type{firstindex}                \NC number \NC \NC \NR
\NC \type{lastindex}                 \NC number \NC \NC \NR
\NC \type{element}                   \NC number \NC the element's index \NC \NR
\NC \type{firstelement}              \NC number \NC \NC \NR
\NC \type{lastelement}               \NC number \NC \NC \NR
\NC \type{text}                      \NC string \NC the textual representation of the matched element \NC \NR
\NC \type{content}                   \NC table  \NC the node of the matched element \NC \NR
\NC \type{name}                      \NC string \NC the full name of the matched element: namespace and tag \NC \NR
\NC \type{namespace} \type{ns}       \NC string \NC the namespace of the matched element \NC \NR
\NC \type{tag}                       \NC string \NC the tag of the matched element \NC \NR
\NC \type{attribute}                 \NC string \NC the value of the attribute with the given name of the matched element \NC \NR
\stoptabulate

There are fundamental differences between \type {position}, \type {match} and
\type {index}. Each step results in a new list of matches. The \type {position}
is the index in this new (possibly intermediate) list. The \type {match} is also
an index in this list but related to the specific match of element names. The
\type {index} refers to the location in the parent element.

Say that we have:

\starttyping
.1. .1. .2. .2. .3. .3.
\stoptyping

The following then applies:

\starttabulate[|l|l|]
\NC \type {collection/resources/manual[position()==1]/paper} \NC \type{.1.} \NC \NR
\NC \type {collection/resources/manual[match()==1]/paper}    \NC \type{.1.} \type{.3.} \NC \NR
\NC \type {collection/resources/manual/paper[index()==1]}    \NC \type{.2.} \NC \NR
\stoptabulate

In most cases the \type {position} test is more restrictive than the \type
{match} test.

You can pass your own functions too. Such functions are defined in the \type
{xml.expressions} namespace. We have defined a few shortcuts:

\starttabulate[|l|l|]
\NC \type {find(str,pattern)} \NC \type{string.find} \NC \NR
\NC \type {contains(str)}     \NC \type{string.find} \NC \NR
\NC \type {oneof(str,...)}    \NC is \type{str} in list \NC \NR
\NC \type {upper(str)}        \NC \type{characters.upper} \NC \NR
\NC \type {lower(str)}        \NC \type{characters.lower} \NC \NR
\NC \type {number(str)}       \NC \type{tonumber} \NC \NR
\NC \type {boolean(str)}      \NC \type{toboolean} \NC \NR
\NC \type {idstring(str)}     \NC removes leading hash \NC \NR
\NC \type {name(index)}       \NC full tag name \NC \NR
\NC \type {tag(index)}        \NC tag name \NC \NR
\NC \type {namespace(index)}  \NC namespace of tag \NC \NR
\NC \type {text(index)}       \NC content \NC \NR
\NC \type {error(str)}        \NC quit and show error \NC \NR
\NC \type {quit()}            \NC quit \NC \NR
\NC \type {print()}           \NC print message \NC \NR
\NC \type {count(pattern)}    \NC number of matches \NC \NR
\NC \type {child(pattern)}    \NC take child that matches \NC \NR
\stoptabulate

You can also use normal \LUA\ functions as long as you make sure that you pass
the right arguments. There are a few predefined variables available inside such
functions.

\starttabulate[|Tl|l|p|]
\NC \type{list}  \NC table   \NC the list of matches \NC \NR
\NC \type{l}     \NC number  \NC the current index in the list of matches \NC \NR
\NC \type{ll}    \NC element \NC the current element that matched \NC \NR
\NC \type{order} \NC number  \NC the position of the root of the path \NC \NR
\stoptabulate

The given expression between \type {[]} is converted to a \LUA\ expression so you
can use the usual operators:

\starttyping
== ~= <= >= < > not and or ()
\stoptyping

In addition, \type {=} is equivalent to \type {==} and \type {!=} is the same as
\type {~=}. If you mess up the expression, you will quite likely get a \LUA\
error message.

\stopsection

\startsection[title={css selectors}]

\startbuffer[selector-001]
b.one b.two b.one.two b.three b#first c d e d e d e e d f @foo = bar @bar = foo @bar = foo1 @bar = foo2 @bar = foo3 @bar = foo+4 g g gg d g gg f g gg f.one g g gg f.two g gg f.three g f.one g f.three @whatever = four five six
\stopbuffer

\xmlloadbuffer{selector-001}{selector-001}

\startxmlsetups xml:selector:demo
    \advance\scratchcounter\plusone
    \inleftmargin{\the\scratchcounter}\ignorespaces\xmlverbatim{#1}\par
\stopxmlsetups

\unexpanded\def\showCSSdemo#1#2%
  {\blank
   \textrule{\tttf#2}
   \startlines
   \dontcomplain
   \tttf \obeyspaces
   \scratchcounter\zerocount
   \xmlcommand{#1}{#2}{xml:selector:demo}
   \stoplines
   \blank}

The \CSS\ approach to filtering is a bit different from the path based one and is
supported too. In fact, you can combine both methods. Depending on what you
select, the \CSS\ one can be a little bit faster too. It has the advantage that
one can select more in one go but at the same time looks a bit less attractive.
This method was added just to show that it can be done but might be useful too. A
selector is given between curly braces (after all \CSS\ uses them and they have
no function yet in the parser):

\starttyping
\xmlall{#1}{{foo bar .whatever, bar foo .whatever}}
\stoptyping

The following methods are supported:

\starttabulate[|T||]
\NC element                         \NC all tags element \NC \NR
\NC element-1 > element-2           \NC all tags element-2 with parent tag element-1 \NC \NR
\NC element-1 + element-2           \NC all tags element-2 immediately preceded by tag element-1 \NC \NR
\NC element-1 ~ element-2           \NC all tags element-2 preceded by tag element-1 \NC \NR
\NC element-1 element-2             \NC all tags element-2 inside tag element-1 \NC \NR
\NC [attribute]                     \NC has attribute \NC \NR
\NC [attribute=value]               \NC attribute equals value \NC \NR
\NC [attribute\lettertilde =value]  \NC attribute contains value (space is separator) \NC \NR
\NC [attribute\letterhat ="value"]  \NC attribute starts with value \NC \NR
\NC [attribute\letterdollar="value"] \NC attribute ends with value \NC \NR
\NC [attribute*="value"]            \NC attribute contains value \NC \NR
\NC .class                          \NC has class \NC \NR
\NC \letterhash id                  \NC has id \NC \NR
\NC :nth-child(n)                   \NC the child at index n \NC \NR
\NC :nth-last-child(n)              \NC the child at index n from the end \NC \NR
\NC :first-child                    \NC the first child \NC \NR
\NC :last-child                     \NC the last child \NC \NR
\NC :nth-of-type(n)                 \NC the match at index n \NC \NR
\NC :nth-last-of-type(n)            \NC the match at index n from the end \NC \NR
\NC :first-of-type                  \NC the first match \NC \NR
\NC :last-of-type                   \NC the last match \NC \NR
\NC :only-of-type                   \NC the only match or nothing \NC \NR
\NC :only-child                     \NC the only child or nothing \NC \NR
\NC :empty                          \NC only when empty \NC \NR
\NC :root                           \NC the whole tree \NC \NR
\stoptabulate

The next pages show some examples. For that we use the demo file:

\typebuffer[selector-001]

The class and id selectors often only make sense in \HTML\ like documents but
they are supported nevertheless. They are after all just shortcuts for filtering
by attribute. The class filtering is special in the sense that it checks for a
class in a list of classes given in an attribute.
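
As a sketch of what this shortcut amounts to, and assuming that the list of
classes lives in an attribute named \type {class}, the following two calls are
expected to select the same elements. The second one spells out the attribute
test, using the space separated variant from the table above:

\starttyping
\xmlall{#1}{{.one}}
\xmlall{#1}{{[class~="one"]}}
\stoptyping
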

\showCSSdemo{selector-001}{{.one}}
\showCSSdemo{selector-001}{{.one, .two}}
\showCSSdemo{selector-001}{{.one, .two, \letterhash first}}

Attributes can be filtered by presence, value, partial value and such. Quotes are
optional but we advise using them.

\showCSSdemo{selector-001}{{[foo], [bar=foo]}}
\showCSSdemo{selector-001}{{[bar\lettertilde=foo]}}
\showCSSdemo{selector-001}{{[bar\letterhat="foo"]}}
\showCSSdemo{selector-001}{{[whatever\lettertilde="five"]}}

You can of course combine the methods as in:

\showCSSdemo{selector-001}{{g f .one, g f .three}}
\showCSSdemo{selector-001}{{g > f .one, g > f .three}}
\showCSSdemo{selector-001}{{d + e}}
\showCSSdemo{selector-001}{{d ~ e}}
\showCSSdemo{selector-001}{{d ~ e, g f .one, g f .three}}

You can also negate the result by using \type {:not} on a simple expression:

\showCSSdemo{selector-001}{{:not([whatever\lettertilde="five"])}}
\showCSSdemo{selector-001}{{:not(d)}}

The child and match selectors are also supported:

\showCSSdemo{selector-001}{{a:nth-child(3)}}
\showCSSdemo{selector-001}{{a:nth-last-child(3)}}
\showCSSdemo{selector-001}{{g:nth-of-type(3)}}
\showCSSdemo{selector-001}{{g:nth-last-of-type(3)}}
\showCSSdemo{selector-001}{{a:first-child}}
\showCSSdemo{selector-001}{{a:last-child}}
\showCSSdemo{selector-001}{{e:first-of-type}}
\showCSSdemo{selector-001}{{gg d:only-of-type}}

Instead of numbers you can also give the \type {an} and \type {an+b} formulas
as well as the \type {odd} and \type {even} keywords:

\showCSSdemo{selector-001}{{a:nth-child(even)}}
\showCSSdemo{selector-001}{{a:nth-child(odd)}}
\showCSSdemo{selector-001}{{a:nth-child(3n+1)}}
\showCSSdemo{selector-001}{{a:nth-child(2n+3)}}

There are a few special cases:

\showCSSdemo{selector-001}{{g:empty}}
\showCSSdemo{selector-001}{{g:root}}
\showCSSdemo{selector-001}{{*}}

Combining the \CSS\ methods with the regular ones is possible:

\showCSSdemo{selector-001}{{g gg f .one}}
\showCSSdemo{selector-001}{g/gg/f[@class='one']}
\showCSSdemo{selector-001}{g/{gg
f .one}}

\startbuffer[selector-002]
title 1 title 2 title 3 title 4
\stopbuffer

For the next examples we use this file:

\typebuffer[selector-002]

\xmlloadbuffer{selector-002}{selector-002}

When we filter from this (not too well structured) tree we can use both methods
to achieve the same:

\showCSSdemo{selector-002}{{document title .one, document title .three}}
\showCSSdemo{selector-002}{/document/title[(@class='one') or (@class='three')]}

However, imagine this file:

\startbuffer[selector-003]
title 1 title 1.1 title 2 title 2.1 title 3 title 3.1 title 4 title 4.1
\stopbuffer

\typebuffer[selector-003]

\xmlloadbuffer{selector-003}{selector-003}

The next filter is easier with the \CSS\ selector methods because these
accumulate independent (simple) expressions:

\showCSSdemo{selector-003}{{document title .one + subtitle, document title .two + subtitle}}

Watch how we get the output in document order. Because we render a sequential
document, a combined filter will trigger a sorting pass.

\stopsection

\startsection[title={functions as filters}]

At the \LUA\ end a whole \cmdinternal {cd:lpath} expression results in a (set of)
node(s) with its environment, but that is hardly usable in \TEX. Think of code
like:

\starttyping
for e in xml.collected(xml.load('text.xml'),"title") do
    -- e = the element that matched
end
\stoptyping

The older variant shown next is still supported, but the previous variant is
preferred.

\starttyping
for r, d, k in xml.elements(xml.load('text.xml'),"title") do
    -- r = root of the title element
    -- d = data table
    -- k = index in data table
end
\stoptyping

Here \type {d[k]} points to the \type {title} element and in this case all titles
in the tree pass by. In practice this kind of code is encapsulated in function
calls, like those returning elements one by one, or returning the first or last
match. The result is then fed back into \TEX, possibly after being altered by an
associated setup. We've seen the wrappers to such functions already in a previous
chapter.
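
A minimal sketch of such an encapsulation, using only the iterator shown above,
could look as follows. The helper name \type {firstmatch} is hypothetical and not
part of the official interface, and the file name is the one used in the example
above:

\starttyping
\startluacode
-- hypothetical helper: return the first element that matches the pattern
local function firstmatch(root,pattern)
    for e in xml.collected(root,pattern) do
        return e -- leave the loop at the first match
    end
    return nil -- nothing matched
end

local root  = xml.load('text.xml')
local title = firstmatch(root,"title")

if title then
    -- inspect the internal table of the matched element
    print(table.serialize(title))
end
\stopluacode
\stoptyping
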

In addition to the previously discussed expressions, one can add so called
filters to the expression, for instance:

\starttyping
a/(b|c)/!d/e/text()
\stoptyping

In a filter, the last part of the \cmdinternal {cd:lpath} expression is a
function call. The previous example returns the text of each element \type {e}
that results from matching the expression. When running \TEX\ the following
functions are available. Some are also available when using pure \LUA. In \TEX\
you can often use one of the macros like \type {\xmlfirst} instead of a \type
{\xmlfilter} with finalizer \type {first()}. The filter can be somewhat faster
but that is hardly noticeable.

\starttabulate[|l|l|p|]
\NC \type {context()}  \NC string \NC the serialized text with \TEX\ catcode regime \NC \NR
%NC \type {ctxtext()}  \NC string \NC \NC \NR
\NC \type {function()} \NC string \NC depends on the function \NC \NR
%
\NC \type {name()}     \NC string \NC the (remapped) namespace \NC \NR
\NC \type {tag()}      \NC string \NC the name of the element \NC \NR
\NC \type {tags()}     \NC list   \NC the names of the element \NC \NR
%
\NC \type {text()}     \NC string \NC the serialized text \NC \NR
\NC \type {upper()}    \NC string \NC the serialized text uppercased \NC \NR
\NC \type {lower()}    \NC string \NC the serialized text lowercased \NC \NR
\NC \type {stripped()} \NC string \NC the serialized text stripped \NC \NR
\NC \type {lettered()} \NC string \NC the serialized text only letters (cf.
\UNICODE) \NC \NR
%
\NC \type {count()} \NC number \NC the number of matches \NC \NR
\NC \type {index()} \NC number \NC the matched index in the current path \NC \NR
\NC \type {match()} \NC number \NC the matched index in the preceding path \NC \NR
%
%NC \type {lowerall()} \NC string \NC \NC \NR
%NC \type {upperall()} \NC string \NC \NC \NR
%
\NC \type {attribute(name)}          \NC content \NC returns the attribute with the given name \NC \NR
\NC \type {chainattribute(name)}     \NC content \NC idem, but backtracks till one is found \NC \NR
\NC \type {command(name)}            \NC content \NC expands the setup with the given name for each found element \NC \NR
\NC \type {position(n)}              \NC content \NC processes the \type {n}\high{th} instance of the found element \NC \NR
\NC \type {all()}                    \NC content \NC processes all instances of the found element \NC \NR
%NC \type {default}                  \NC content \NC all \NC \NR
\NC \type {reverse()}                \NC content \NC idem in reverse order \NC \NR
\NC \type {first()}                  \NC content \NC processes the first instance of the found element \NC \NR
\NC \type {last()}                   \NC content \NC processes the last instance of the found element \NC \NR
\NC \type {concat(...)}              \NC content \NC concatenates the matches \NC \NR
\NC \type {concatrange(from,to,...)} \NC content \NC concatenates a range of matches \NC \NR
\NC \type {depth()}                  \NC number  \NC the depth in the tree of the found element \NC \NR
\stoptabulate

The extra arguments of the concatenators are: \type {separator} (string), \type
{lastseparator} (string) and \type {textonly} (a boolean).

These filters are in fact \LUA\ functions which means that if needed more of them
can be added. Indeed this happens in some of the \XML\ related \MKIV\ modules,
for instance in the \MATHML\ processor.

\stopsection

\startsection[title={example}]

The number of commands is rather large and if you want to avoid them this is
often possible.
Take for instance:

\starttyping
\xmlall{#1}{/a/b[position()>3]}
\stoptyping

Alternatively you can use:

\starttyping
\xmlfilter{#1}{/a/b[position()>3]/all()}
\stoptyping

and actually this is also faster as internally it avoids a function call. Of
course in practice this is hardly measurable.

In previous examples we've already seen quite some expressions, and it might be
good to point out that the syntax is modelled after \XSLT\ but is not quite the
same. The reason is that we started with a rather minimal system and already have
styles in use that depend on compatibility.

\starttyping
namespace:// axis node(set) [expr 1]..[expr n] / ... / filter
\stoptyping

When we are inside a \CONTEXT\ run, the namespace is \type {tex}. However, if you
do not want to print back to \TEX\ you need to be more explicit. Say that we
typeset exams and have a (not that logical) structure like:

\starttyping
... one two three true 1 false 0 true 2
\stoptyping

Say that we typeset the questions with:

\starttyping
\startxmlsetups question
    \blank
    score: \xmlfunction{#1}{totalscore}
    \blank
    \xmlfirst{#1}{text}
    \startitemize
        \xmlfilter{#1}{/answer/item/command(answer:item)}
    \stopitemize
    \endgraf
    \blank
\stopxmlsetups
\stoptyping

Each item in the answer results in a call to:

\starttyping
\startxmlsetups answer:item
    \startitem
        \xmlflush{#1}
        \endgraf
        \xmlfilter{#1}{../../alternative[position()=rootposition()]/
            condition/command(answer:condition)}
    \stopitem
\stopxmlsetups
\stoptyping

\starttyping
\startxmlsetups answer:condition
    \endgraf
    condition: \xmlflush{#1}
    \endgraf
\stopxmlsetups
\stoptyping

Now, there are two rather special filters here. The first one involves
calculating the total score. As we look forward we use a function to deal with
this.

\starttyping
\startluacode
function xml.functions.totalscore(root)
    local score = 0
    for e in xml.collected(root,"/alternative") do
        -- the parentheses make sure that a missing score counts as zero
        score = score + (xml.filter(e,"xml:///score/number()") or 0)
    end
    tex.write(score)
end
\stopluacode
\stoptyping

Watch how we use the namespace to keep the results at the \LUA\ end. The second
special trick shown here is to limit a match using the current position of the
root (\type {#}) match.

As you can see, a path expression can be more than just filtering a few nodes. At
the end of this manual you will find a bunch of examples.

\stopsection

\startsection[title={tables}]

If you want to know how the internal \XML\ tables look you can print such a
table:

\starttyping
print(table.serialize(e))
\stoptyping

This produces for instance:

% s = xml.convert("some text")
% print(table.serialize(xml.filter(s,"demo")[1]))

\starttyping
t={
 ["at"]={
  ["label"]="whatever",
 },
 ["dt"]={ "some text" },
 ["ns"]="",
 ["rn"]="",
 ["tg"]="demo",
}
\stoptyping

The \type {rn} entry is the renamed namespace (when renaming is applied). If you
see tags like \type {@pi@} this means that we don't have an element, but (in this
case) a processing instruction.

\starttabulate[|l|p|]
\NC \type {@rt@} \NC the root element \NC \NR
\NC \type {@dd@} \NC document definition \NC \NR
\NC \type {@cm@} \NC comment, like \type {<!-- comment -->} \NC \NR
\NC \type {@cd@} \NC so called \type {CDATA} \NC \NR
\NC \type {@pi@} \NC processing instruction, like \type {<?name data?>} \NC \NR
\stoptabulate

There are many ways to deal with the content, but from the perspective of \TEX\
only a few matter.

\starttabulate[|l|p|]
\NC \type {xml.sprint(e)} \NC print the content to \TEX\ and apply setups if needed \NC \NR
\NC \type {xml.tprint(e)} \NC print the content to \TEX\ (serialize elements verbose) \NC \NR
\NC \type {xml.cprint(e)} \NC print the content to \TEX\ (used for special content) \NC \NR
\stoptabulate

Keep in mind that anything low level that you uncover is not part of the official
interface unless mentioned in this manual.

\stopsection

\stopchapter

\stopcomponent