% language=us
% hybrid-tags.tex, size: 13 Kb, last modification: 2023-12-21 09:43

\startcomponent hybrid-tags

\environment hybrid-environment

\startchapter[title={Tagged PDF}]

\startsection [title={Introduction}]

Occasionally users have asked me if \CONTEXT\ can produce tagged \PDF\ and the answer
to that has been: I'll implement it when I need it. However, users tell me that
publishers more and more demand tagged \PDF\ files, although one might wonder
what for, except maybe for accessibility. Another reason for not having spent too
much time on it before is that the specification was not that inviting.

At any rate, when I saw Ross Moore\footnote {He is often exploring the boundaries
of \PDF, \UNICODE\ and evolving techniques related to math publishing so you'd
best not miss his presentations when you are around.} presenting tagged math at
TUG 2010, I decided to look up the spec once more and see if I could get into the
mood to implement tagging. Before I started it was already clear that there were
a couple of boundary conditions:

\startitemize[packed]
\startitem Tagging should not put a burden on the user but users
           should be able to tag themselves. \stopitem
\startitem Tagging should not slow down a run too much; this is
           no big deal as one can postpone tagging till the last
           run. \stopitem
\startitem Tagging should in no way interfere with typesetting, so
           no funny nodes should be injected. \stopitem
\startitem Tagging should not make the code
           look worse, neither the document source, nor the low
           level \CONTEXT\ code. \stopitem
\stopitemize

And of course implementing it should not take more than a few days' work,
certainly not in an exceptionally hot summer.

You can \quote {google} for one of Ross's documents (like \type
{DML_002-2009-1_12.pdf}) to see how a document source looks at his end using a
special version of \PDFTEX. However, the version on my machine didn't support the
shown primitives, so I could not see what was happening under the hood.
Unfortunately it is quite hard to find a properly tagged document so we have only
the reference manual as a starting point. As the \PDFTEX\ approach didn't look that
pleasing anyway, I just started from scratch.

Tags can help Acrobat Reader when reading the text aloud. But you cannot
browse the structure in the no|-|cost version of Acrobat and as not all users
have the professional version of Acrobat, the fact that a document has structure
can go unnoticed. Add to that the fact that the overhead in terms of bytes is
quite large as many more objects are generated, and you will understand why this
feature is not enabled by default.

\stopsection

\startsection [title={Implementation}]

So, what does tagging boil down to? We can best look at how tagged information is
shown in Acrobat. \in {Figure} [fig:tagged-list] shows the content tree that has
been added (automatically) to a document while \in {figure} [fig:tagged-order]
shows a different view.

\placefigure
  [page]
  [fig:tagged-list]
  {A tag list in Acrobat.}
  {\externalfigure[tagged-001.png][maxheight=\textheight]}

\placefigure
  [here]
  [fig:tagged-order]
  {Acrobat showing the tag order.}
  {\externalfigure[tagged-004.png][maxwidth=\textwidth]}

In order to get that far, we have to do the following:

\startitemize[packed]
\startitem Carry information with (typeset) text. \stopitem
\startitem Analyse this information when shipping out pages. \stopitem
\startitem Add a structure tree to the page. \stopitem
\startitem Add relevant information to the document. \stopitem
\stopitemize

The first activity is rather independent of the other three and we can use that
information for other purposes as well, like identifying where we are in the
document. We carry the information around using attributes. The last three
activities took a bit of experimenting, mostly using the \quotation {Example of
Logical Structure} from the \PDF\ standard 32000-1:2008.

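The first two activities can be pictured as a walk over the node list of a page
in which attribute values are collected. The following is only a sketch and not
the actual \CONTEXT\ code: the attribute number \type {a_tag} and the function
name \type {userdata.collecttags} are made up for the example, and it assumes
nothing more than the standard \LUATEX\ functions \type {node.traverse}, \type
{node.id} and \type {node.has_attribute}. Given a page in box 255, a call like
\type {userdata.collecttags(tex.box[255].list)} would return a list of tagged
glyph runs.

\starttyping
\startluacode
userdata = userdata or { }

local a_tag   = 200 -- hypothetical attribute number
local glyph_c = node.id("glyph")
local hlist_c = node.id("hlist")
local vlist_c = node.id("vlist")

-- collect consecutive glyphs that carry the same attribute value
function userdata.collecttags(head,result)
    result = result or { }
    for n in node.traverse(head) do
        local id = n.id
        if id == glyph_c then
            local tag  = node.has_attribute(n,a_tag) -- nil when unset
            local last = result[#result]
            if tag and last and last.tag == tag then
                last.count = last.count + 1
            elseif tag then
                result[#result+1] = { tag = tag, count = 1 }
            end
        elseif id == hlist_c or id == vlist_c then
            userdata.collecttags(n.list,result) -- descend into nested boxes
        end
    end
    return result
end
\stopluacode
\stoptyping
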
This resulted in a tagging framework that uses explicit tags, meaning the user is
responsible for the tagging:

\starttyping
\setupstructure[state=start,method=none]

\starttext

\startelement[document]

    \startelement[chapter]
        \startelement[p] \input davis \stopelement \par
    \stopelement

    \startelement[chapter]
        \startelement[p] \input zapf \stopelement \par
        \startelement[whatever]
            \startelement[p] \input tufte \stopelement \par
            \startelement[p] \input knuth \stopelement \par
        \stopelement
    \stopelement

    \startelement[chapter]
        oeps
        \startelement[p] \input ward \stopelement \par
    \stopelement

\stopelement

\stoptext
\stoptyping

Since this is not much fun, we also provide an automated
variant. In the previous example we explicitly turned off automated
tagging by setting \type {method} to \type {none}. By default it has
the value \type {auto}.

\starttyping
\setupstructure[state=start] % default is method=auto

\definedescription[whatever]

\starttext

\startfrontmatter
    \startchapter[title=One]
        \startparagraph \input tufte \stopparagraph
        \startitemize
            \startitem first \stopitem
            \startitem second \stopitem
        \stopitemize
        \startparagraph \input ward \stopparagraph
        \startwhatever {Hermann Zapf} \input zapf \stopwhatever
    \stopchapter

\stopfrontmatter

\startbodymatter
    ..................
\stoptyping

If you use commands like \type {\chapter} you will not get the desired results.
Of course these can be supported but there is no real reason for it, as in \MKIV\
we advise using the \type {start}|-|\type {stop} variant.

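To make the difference concrete, the next fragment contrasts the two ways of
coding a chapter. It is only a sketch: the titles are invented, and the remark
about the missing end of the chapter is an assumption about why the first form
gives poorer results, not something that has been verified against the code.

\starttyping
\setupstructure[state=start]

\starttext

% old style: the end of the chapter is implicit, so (presumably)
% there is no clear place to close a chapter element

\chapter{One}

\input tufte

% start-stop style: the content of the chapter is delimited

\startchapter[title=Two]
    \startparagraph \input ward \stopparagraph
\stopchapter

\stoptext
\stoptyping
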
It will be clear that this kind of automated tagging brings with it a couple of
extra commands deep down in \CONTEXT\ and there (of course) we use symbolic names
for tags, so that one can overload the built|-|in mapping.

\starttyping
\setuptaglabeltext[en][document=text]
\stoptyping

As with other features inspired by viewer functionality, the implementation of
tagging is independent of the backend. For instance, we can tag a document and
access the tagging information at the \TEX\ end. The backend driver code maps
tags to relevant \PDF\ constructs. First of all, we just map the tags used at the
\CONTEXT\ end onto themselves. But, as validators expect certain names, we use
the \PDF\ rolemap feature to map them to (less interesting) names. The next list
shows the currently used internal names, with the \PDF\ ones between parentheses.

\blank \startalignment[flushleft,nothyphenated]
\startluacode
local done = false
for k, v in table.sortedpairs(structures.tags.properties) do
    if v.pdf then
        if done then
            context(", %s (%s)",k,v.pdf)
        else
            context("%s (%s)",k,v.pdf)
            done = true
        end
    end
end
context(".")
\stopluacode \par \stopalignment \blank

So, the internal ones show up in the tag trees as shown in the examples but
applications might use the rolemap which normally has less detail.

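In the \PDF\ file the role map is a dictionary in the structure tree root that
pairs each non|-|standard tag name with a standard one. As a sketch of the idea
(not the actual backend code), the next fragment builds such a dictionary from a
small table that has the same shape as \type {structures.tags.properties}. The
three entries and the name \type {userdata.rolemap} are chosen for illustration
only; the real mapping is the one listed above.

\starttyping
\startluacode
userdata = userdata or { }

-- illustrative subset with the shape of structures.tags.properties
local properties = {
    section   = { pdf = "Sect" },
    paragraph = { pdf = "P"    },
    sometag   = { },             -- no pdf field, so it is skipped
}

function userdata.rolemap(t)
    local entries = { }
    for k, v in table.sortedpairs(t) do
        if v.pdf then
            entries[#entries+1] = "/" .. k .. " /" .. v.pdf
        end
    end
    return "<< " .. table.concat(entries," ") .. " >>"
end

-- writes: << /paragraph /P /section /Sect >>
texio.write_nl("log",userdata.rolemap(properties))
\stopluacode
\stoptyping
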
Because we keep track of where we are, we can also use that information for
making decisions.

\starttyping
\doifinelementelse{structure:section}            {yes} {no}
\doifinelementelse{structure:chapter}            {yes} {no}
\doifinelementelse{division:*-structure:chapter} {yes} {no}
\doifinelementelse{division:*-structure:*}       {yes} {no}
\stoptyping

As shown, you can use \type {*} as a wildcard. The elements are separated by
\type {-}. If you don't know what tags are used, you can always enable the
tag|-|related tracker:

\starttyping
\enabletrackers[structure.tags]
\stoptyping

This tracker reports the identified element chains to the console
and log.

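Putting the test and the tracker together, a minimal test file could look as
follows. This is a sketch that only combines commands shown in this section: the
chapter title and the texts in the two branches are arbitrary, and the element
name \type {structure:chapter} is taken from the examples above rather than
verified against a particular \CONTEXT\ version.

\starttyping
\setupstructure[state=start]

\enabletrackers[structure.tags]

\starttext

\startchapter[title=One]
    \startparagraph
        \doifinelementelse{structure:chapter}
            {We are inside a chapter.}
            {We are not inside a chapter.}
    \stopparagraph
\stopchapter

\stoptext
\stoptyping

The element chains that the tracker reports for such a file show which names can
be used in the tests.
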
\stopsection

\startsection[title={Special care}]

Of course there are a few complications. First of all the tagging model sort of
contradicts the concept of a nicely typeset document where structure and outcome
are not always related. Most \TEX\ users are aware of the fact that \TEX\ does
not have space characters and does a great job on kerning and hyphenation. The
tagging machinery on the other hand uses a rather dumb model of strings separated
by spaces.\footnote {The search engine on the other hand is rather clever at
recognizing words.} But we can trick \TEX\ into providing the right information
to the backend so that words get nicely separated. The non|-|optimized function
that does this looks as follows:

\starttyping
function injectspaces(head)
    local p -- the previous node
    for n in node.traverse(head) do
        local id = n.id
        if id == node.id("glue") then
            if p and p.id == node.id("glyph") then
                -- a glyph followed by glue: insert a real space glyph
                local g = node.copy(p)      -- same font and attributes as p
                local s = node.copy(n.spec) -- private copy of the glue spec
                g.char, n.spec = 32, s
                -- link the space in between the glyph and the glue
                p.next, g.prev = g, p
                g.next, n.prev = n, g
                -- compensate the glue for the width of the space
                s.width = s.width - g.width
            end
        elseif id == node.id("hlist") or id == node.id("vlist") then
            injectspaces(n.list) -- handle nested boxes
        end
        p = n
    end
end
\stoptyping

Here we squeeze in a space (given that it is in the font which it normally is
when you use \CONTEXT) and make a compensation in the glue. Given that your page
sits in box 255, you can do this just before shipping the page out:

\starttyping
injectspaces(tex.box[255].list)
\stoptyping

Then there are the so|-|called suspects: things on the page that are not related
to structure at all. One is supposed to tag these specially so that the
built|-|in reading equipment is not confused. So far we could get around them
simply because they don't get tagged at all and therefore are not seen anyway.
This might well be enough of a precaution.

Of course we need to deal with mathematics. Fortunately the presentation \MATHML\
model is rather close to \TEX\ and so we can map onto that. After all we don't
need to care too much about back|-|mapping here. The currently present code is
rather experimental and might get extended or thrown out in favour of inline
\MATHML. \in {Figure} [fig:tagged-math] demonstrates that a first approach does
not even look that bad. In future versions we might deal with table|-|like math
constructs, like matrices.

\placefigure
  [here]
  [fig:tagged-math]
  {Experimental math tagging.}
  {\externalfigure[tagged-005.png][maxwidth=\textwidth]}

This is a typical case where more energy has to be spent on driving the voice of
Acrobat but I will do that when we find a good reason.

As mentioned, it will take a while before all relevant constructs in \CONTEXT\
support tagging, but support is already quite complete. Some screen dumps are
included as examples at the end.

\stopsection

\startsection[title={Conclusion}]

Surprisingly, implementing all this didn't take that much work. Of course
detailed automated structure support from the complete \CONTEXT\ kernel will take
some time to get completed, but that will be done on demand and when we run into
missing bits and pieces. It has not yet been decided to what extent alternate
representations and alternate texts will be supported. Experiments with the
reading|-|aloud machinery are not satisfying yet but maybe it just can't get any
better. It would be nice if we could get some tags announced without
overloading the content, that is: without using ugly hacks.

And of course, code like this is never really finished if only because \PDF\
evolves. Also, it is yet another nice test case and torture test for \LUATEX\ and
it helps us to find buglets and oversights.

\stopsection

\startsection [title={Some more examples}]

In \CONTEXT\ we have user definable verbatim environments. As with other user
definable environments we show the specific instance as a comment next to the
structure component. See \in {figure} [fig:tagged-verbatim]. Some examples of
tables are shown in \in {figure} [fig:tagged-tables]. Future versions will have a
bit more structure. Tables of contents (see \in {figure} [fig:tagged-contents])
and registers (see \in {figure} [fig:tagged-register]) are also tagged. (One
might wonder what the use of this is.) In \in {figure} [fig:tagged-floats] we see
some examples of floats. External images as well as \METAPOST\ graphics are
tagged as such. This figure also shows an example of a user environment, in this
case:

\starttyping
\definestartstop[notabene][style=\bf]
\stoptyping

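Such an environment is then used like any other start|-|stop pair. In the
following sketch the sentence is invented; with tagging enabled its content is
expected to show up as a user defined construct, as in \in {figure}
[fig:tagged-floats].

\starttyping
\setupstructure[state=start]

\definestartstop[notabene][style=\bf]

\starttext

\startnotabene
    Keep in mind that this paragraph is typeset in bold.
\stopnotabene

\stoptext
\stoptyping
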
In a similar fashion, footnotes (\in {figure} [fig:tagged-footnotes]) end up in
the structure tree, but in the typeset document they move around (normally
forward when there is no room).

\placefigure
  [here]
  [fig:tagged-verbatim]
  {Verbatim, including dedicated instances.}
  {\externalfigure[tagged-006.png][maxwidth=\textwidth]}

\placefigure
  [here]
  [fig:tagged-tables]
  {Natural tables as well as the tabulate mechanism are supported.}
  {\externalfigure[tagged-008.png][maxwidth=\textwidth]}

\placefigure
  [here]
  [fig:tagged-contents]
  {Tables of contents with specific entries tagged.}
  {\externalfigure[tagged-007.png][maxwidth=\textwidth]}

\placefigure
  [here]
  [fig:tagged-register]
  {A detailed view of registers is provided.}
  {\externalfigure[tagged-009.png][maxwidth=\textwidth]}

\placefigure
  [here]
  [fig:tagged-floats]
  {Float tags end up in the text stream. Watch the user defined construct.}
  {\externalfigure[tagged-011.png][maxwidth=\textwidth]}

\placefigure
  [here]
  [fig:tagged-footnotes]
  {Footnotes are shown at the place in the input (flow).}
  {\externalfigure[tagged-010.png][maxwidth=\textwidth]}

\stopsection

\stopchapter

\stopcomponent
