% language=us

\startluacode
    job.files.context(dir.glob("exported-*.tex"),"--directives=structures.export.lessstate")
\stopluacode

\startcomponent hybrid-export

\environment hybrid-environment

\startchapter[title={Exporting XML}]

\startsection [title={Introduction}]

Every now and then users on the mailing list ask if \CONTEXT\ can produce \HTML\
instead of, for instance, \PDF, and the answer has always been unsatisfying. In
this chapter I will present the \MKIV\ way of doing this.

\stopsection

\startsection [title={The clumsy way}]

My favourite answer to the question about how to produce \HTML\ (or, more
generally, \XML, as it can be transformed) has always been: \quotation {I'd just
typeset it!}. Take:

\starttyping
\def\MyChapterCommand#1#2{<h1>#2</h1>}
\setuphead[chapter][command=\MyChapterCommand]
\stoptyping

Here \type {\chapter{Hello World}} will produce:

\starttyping
<h1>Hello World</h1>
\stoptyping

Now imagine that you hook such commands into all relevant environments and that
you use a style with no header and footer lines. You use a large page (A2) and a
small monospaced font (4pt) so that page breaks will not interfere too much. If
you want columns, fine, just hook in some code that typesets the final columns as
tables. In the end you will have an ugly-looking \PDF\ file, but by feeding it
into \type {pdftotext} you will get a nicely formatted \HTML\ file.
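
As a rough sketch of such a setup (an illustration only, not a recommendation),
the ingredients mentioned above could be combined as follows. The commands are
regular \CONTEXT\ ones, but the specific choices made here (the \type {tt}
bodyfont at 4pt and the way header, footer and page numbers are disabled) are
assumptions for the sake of the example:

\starttyping
% hypothetical style for the clumsy route; all values are illustrative
\setuppapersize[A2]             % large page, so fewer page breaks
\setupbodyfont[tt,4pt]          % small monospaced font
\setupheader[state=stop]        % no header line
\setupfooter[state=stop]        % no footer line
\setuppagenumbering[location=]  % no page numbers

\def\MyChapterCommand#1#2{<h1>#2</h1>}
\setuphead[chapter][command=\MyChapterCommand]

\starttext
    \chapter{Hello World}
\stoptext
\stoptyping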

For some languages encoding issues would of course show up, and there can be all
kinds of interference, so eventually the amount of code needed to deal with this
would pile up. This is why we don't follow this route.

An alternative is to use \type {tex4ht} which does an impressive job for \LATEX,
and supports \CONTEXT\ to some extent as well. As far as I know it overloads some
code deep down in the kernel which is something \quote {not done} in the
\CONTEXT\ universe if only because we cannot keep control over side effects. It
also complicates maintenance of both systems.

In \MKIV\ however, we do have the ability to export the document to a structured
\XML\ file so let's have a look at that.

\stopsection

\startsection [title={Structure}]

The ability to export to some more verbose format depends on the availability of
structural information. As we already tag elements for the sake of tagged \PDF,
it was tempting to see how well we could use those tags for exporting to \XML. In
principle it is possible to use Acrobat Professional to export the content using
tags but you can imagine that we get better quality if we stay within the scope
of the producing machinery.

\starttyping
\setupbackend[export=yes]
\stoptyping

This is all you need unless you want to fine tune the resulting \XML\ file. If
you are familiar with tagged \PDF\ support in \CONTEXT, you will recognize the
result. When you process the following file:

\typefile{exported-001.tex}

You will get a file with the suffix \type {export} that looks as follows:
\footnote{We will omit the topmost lines in the following examples.}

\typefile{exported-001.export}

It's no big deal to postprocess such a file. In that case one can for instance
ignore the chapter number or combine the number and the title. Of course
rendering information is lost here. However, sometimes it makes sense to export
some more details. Take the following table:

\typefile[range=2]{exported-002.tex}

Here we need to preserve the span related information as well as cell specific
alignments, as for tables this is an essential part of the structure.

\typefile[range=7]{exported-002.export}

The tabulate mechanism is quite handy for regular text especially when the
content of cells has to be split over pages. As each line in a paragraph in a
tabulate becomes a cell, we need to reconstruct the paragraphs from the (split)
alignment cells.

\typefile[range=2]{exported-003.tex}

This becomes:

\typefile[range=7]{exported-003.export}

The \type {<break/>} elements are injected automatically between paragraphs. We
could tag each paragraph individually but that does not work that well when we
have for instance a quotation that spans multiple paragraphs (and maybe starts in
the middle of one). An empty element is not sensitive to this and is still a
signal that vertical spacing is supposed to be applied.
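
Because the export is plain \XML, postprocessing can also be done by feeding it
back into \CONTEXT. The following is a minimal sketch that typesets an exported
chapter while ignoring the exported chapter number. It assumes that a chapter
ends up as a \type {section} element with \type {sectionnumber}, \type
{sectiontitle} and \type {sectioncontent} children. These element names, the
\type {xml:demo:*} setup names and the buffer content are assumptions made for
this illustration, so check them against an actual export file:

\starttyping
% hypothetical input that mimics a fragment of an export file
\startbuffer[demo]
<document>
  <section detail="chapter">
    <sectionnumber>1</sectionnumber>
    <sectiontitle>Example</sectiontitle>
    <sectioncontent>Some text.</sectioncontent>
  </section>
</document>
\stopbuffer

% map the section element onto the setup xml:demo:section
\startxmlsetups xml:demo:setups
    \xmlsetsetup{#1}{section}{xml:demo:*}
\stopxmlsetups

\xmlregistersetup{xml:demo:setups}

% only title and content are used; the exported number is never flushed
\startxmlsetups xml:demo:section
    \startchapter[title={\xmltext{#1}{/sectiontitle}}]
        \xmlall{#1}{/sectioncontent}
    \stopchapter
\stopxmlsetups

\starttext
    \xmlprocessbuffer{main}{demo}{}
\stoptext
\stoptyping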

\stopsection

\startsection[title=The implementation]

We implement tagging using attributes. The advantage of this is that it does not
interfere with typesetting, but a disadvantage is that not all parent elements
are visible. When we encounter some content, we're in the innermost element so if
we want to do something special, we need to deduce the structure from the current
child. This is no big deal as we have that information available at each child
element in the tree.

The first implementation just flushed the \XML\ on the fly (i.e.\ when traversing
the node list) but when I figured out that collapsing was needed for special
cases like tabulated paragraphs this approach was no longer valid. So, after some
experiments I decided to build a complete structure tree in memory.\footnote {We
will see if this tree will be used for other purposes in the future.} This
permits us to handle situations like the following:

\typefile[range=2]{exported-005.tex}

Here we get:

\typefile[range=7]{exported-005.export}

The \type {symbol} and \type {packed} attributes are first seen at the \type
{itemcontent} level (the innermost element) so when we flush the \type
{itemgroup} element's attributes we need to look at the child elements (content)
that actually carry the attribute.\footnote {Only glyph nodes are investigated
for structure.}

I already mentioned collapsing. As paragraphs in a tabulate get split into cells,
we encounter a mixture that cannot be flushed sequentially. However, as each cell
is tagged uniquely we can append the lines within a cell. Also, as each paragraph
gets a unique number, we can add breaks before a new paragraph starts. Collapsing
and adding breakpoints is done at the end, and not per page, as paragraphs can
cross pages. Again, thanks to the fact that we have a tree, we can investigate
content and do this kind of manipulation.

Moving data like footnotes is somewhat special. When notes are put on the page
(contrary to for instance end notes) the so-called \quote {insert} mechanism is
used where their content is kept with the line where it is defined. As a result
we see them end up instream which is not that bad a coincidence. However, as in
\MKIV\ notes are built on top of (enumerated) descriptions, we need to
distinguish them somehow so that we can cross reference them in the export.

\typefile[range=2]{exported-006.tex}

Currently this will end up as follows:

\typefile[range=7]{exported-006.export}

Graphics are also tagged and the \type {image} element reflects the included
image.

\typefile[range=2]{exported-007.tex}

If the image sits on another path then that path shows up in an attribute and
when a page other than~1 is taken from the (pdf) image, it gets mentioned as
well.

\typefile[range=7]{exported-007.export}

Cross references are another relevant aspect of an export. In due time we will
export them all. It's not that complicated because all information is there,
but we need to hook some code into the right spot, and making examples for those
cases takes a while as well.

\typefile[range=2]{exported-009.tex}

We export references in the \CONTEXT\ specific way, so no
interpretation takes place.

\typefile[range=7]{exported-009.export}

\CONTEXT\ has an integrated referencing system that deals with internal as well
as external references, URLs, special interactive actions like controlling
widgets and navigation, etc. Therefore we export the raw reference specification
as well as additional attributes that provide some detail.

\typefile[range=2]{exported-013.tex}

Of course, when postprocessing the exported data, you need to take these variants
into account.

\typefile[range=7]{exported-013.export}

\stopsection

\startsection[title=Math]

Of course there are limitations. For instance \TEX ies doing math might wonder if
we can export formulas. To some extent the export works quite well.

\typefile[range=2]{exported-008.tex}

This results in the usual rather verbose presentation \MATHML:

\typefile[range=7]{exported-008.export}

More complex math (like matrices) will be dealt with in due time, because for
this Aditya and I have to take tagging into account when we revisit the relevant
code as part of the \MKIV\ cleanup and extensions. It's not that complex but it
makes no sense to come up with intermediate solutions.

Display verbatim is also supported. In this case we tag individual lines.

\typefile[range=2]{exported-010.tex}

The export is not that spectacular:

\typefile[range=7]{exported-010.export}

Marginal notes are a rather special case. We do tag them because they often
contain useful information.

\typefile[range=2]{exported-012.tex}

The output is currently as follows:

\typefile[range=7]{exported-012.export}

However, this might change in future versions.

\stopsection

\startsection[title=Formatting]

The output is formatted using indentation and newlines. The extra run time needed
for this (actually, quite some of the code is related to this) is compensated by
the fact that inspecting the result becomes more convenient. Each environment has
one of the properties \type {inline}, \type {mixed} and \type {display}. A
display environment gets newlines around it and an inline environment none at
all. The mixed variant does something in between. In the following example we tag
some user elements, but you can as well influence the built-in ones.

\typefile[range=2]{exported-004.tex}

This results in:

\typefile[range=7]{exported-004.export}

Keep in mind that elements have no influence on the typeset result apart from
introducing spaces when used this way (this is not different from other \TEX\
commands). In due time the formatting might improve a bit but at least we have
less chance of ending up with those megabyte long one||liners that some
applications produce.
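
The three properties can be tried out with a few user elements. This is a
minimal sketch, assuming that \type {\setelementnature} is the command that
assigns a property to a tag; the tag names \type {myinline}, \type {mymixed} and
\type {mydisplay} are hypothetical and only serve this illustration:

\starttyping
\setupbackend[export=yes]

% hypothetical user tags, one for each property
\setelementnature[myinline][inline]
\setelementnature[mymixed][mixed]
\setelementnature[mydisplay][display]

\starttext
    \startelement[mydisplay]
        A display element with an
        \startelement[myinline]inline\stopelement\
        and a
        \startelement[mymixed]mixed\stopelement\
        element inside it.
    \stopelement
\stoptext
\stoptyping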

\stopsection

\startsection[title=A word of advice]

In (for instance) \HTML\ class attributes are used to control rendering driven by
stylesheets. In \CONTEXT\ you can often define derived environments and their
names will show up in the detail attribute. So, if you want control at that level
in the export, you'd better use the structure related options built into
\CONTEXT, for instance:

\typefile[range=2]{exported-011.tex}

This gives two different sections:

\typefile[range=7]{exported-011.export}
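
A derived environment of this kind can be sketched as follows. The head names
\type {specialsection} and \type {othersection} are hypothetical; the point is
only that a name defined with \type {\definehead} is what one can expect to find
back in the \type {detail} attribute:

\starttyping
\setupbackend[export=yes]

% hypothetical derived heads; both behave like a regular section
\definehead[specialsection][section]
\definehead[othersection][section]

\starttext
    \startspecialsection[title=First]
        Some text.
    \stopspecialsection
    \startothersection[title=Second]
        Some more text.
    \stopothersection
\stoptext
\stoptyping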

\stopsection

\startsection[title=Conclusion]

It is an open question if such an export is useful. Personally I never needed a
feature like this and there are several reasons for this. First of all, most of
my work involves going from (often complex) \XML\ to \PDF\ and if you have \XML\
as input, you can also produce \HTML\ from it. For documents that relate to
\CONTEXT\ I don't need it either because manuals are somewhat special in the
sense that they often depend on showing something that ends up on paper (or its
screen counterpart) anyway. Losing the makeup also renders the content somewhat
obsolete. But this feature is still a nice proof of concept anyway.

\stopsection

\stopchapter

\stopcomponent
