% language=us

\startcomponent hybrid-tags

\environment hybrid-environment

\startchapter[title={Tagged PDF}]

\startsection [title={Introduction}]

Occasionally users ask me whether \CONTEXT\ can produce tagged \PDF, and the
answer has always been: I'll implement it when I need it. However, users tell me
that publishers increasingly demand tagged \PDF\ files, although one might wonder
what for, except maybe for accessibility. Another reason for not having spent
too much time on it before is that the specification was not that inviting.

At any rate, when I saw Ross Moore\footnote {He is often exploring the boundaries
of \PDF, \UNICODE\ and evolving techniques related to math publishing so you'd
best not miss his presentations when you are around.} presenting tagged math at
TUG 2010, I decided to look up the spec once more and see if I could get into
the mood to implement tagging. Before I started it was already clear that there
were a couple of boundary conditions:

\startitemize[packed]
\startitem
    Tagging should not put a burden on the user but users should be able to
    add tags themselves.
\stopitem
\startitem
    Tagging should not slow down a run too much; this is no big deal as one
    can postpone tagging till the last run.
\stopitem
\startitem
    Tagging should in no way interfere with typesetting, so no funny nodes
    should be injected.
\stopitem
\startitem
    Tagging should not make the code look worse, neither the document source,
    nor the low level \CONTEXT\ code.
\stopitem
\stopitemize

And of course implementing it should not take more than a few days' work,
certainly not in an exceptionally hot summer.

You can \quote {google} for one of Ross's documents (like \type
{DML_002-2009-1_12.pdf}) to see how a document source looks at his end using a
special version of \PDFTEX. However, the version on my machine didn't support
the shown primitives, so I could not see what was happening under the hood.
Unfortunately it is quite hard to find a properly tagged document so we have
only the reference manual as starting point. As the \PDFTEX\ approach didn't look
that pleasing anyway, I just started from scratch.

Tags can help Acrobat Reader when reading out the text aloud. But you cannot
browse the structure in the no|-|cost version of Acrobat and as not all users
have the professional version of Acrobat, the fact that a document has structure
can go unnoticed. Add to that the fact that the overhead in terms of bytes is
quite large as many more objects are generated, and you will understand why this
feature is not enabled by default.

\stopsection

\startsection [title={Implementation}]

So, what does tagging boil down to? We can best look at how tagged information is
shown in Acrobat. \in {Figure} [fig:tagged-list] shows the content tree that has
been added (automatically) to a document while \in {figure} [fig:tagged-order]
shows a different view.

\placefigure
  [page]
  [fig:tagged-list]
  {A tag list in Acrobat.}
  {\externalfigure[tagged-001.png][maxheight=\textheight]}

\placefigure
  [here]
  [fig:tagged-order]
  {Acrobat showing the tag order.}
  {\externalfigure[tagged-004.png][maxwidth=\textwidth]}

In order to get that far, we have to do the following:

\startitemize[packed]
\startitem Carry information with (typeset) text. \stopitem
\startitem Analyse this information when shipping out pages. \stopitem
\startitem Add a structure tree to the page. \stopitem
\startitem Add relevant information to the document. \stopitem
\stopitemize

That first activity is rather independent of the other three and we can use that
information for other purposes as well, like identifying where we are in the
document. We carry the information around using attributes. The last three
activities took a bit of experimenting mostly using the \quotation {Example of
Logical Structure} from the \PDF\ standard 32000-1:2008.
This resulted in a tagging framework that uses explicit tags, meaning the user
is responsible for the tagging:

\starttyping
\setupstructure[state=start,method=none]

\starttext

\startelement[document]

    \startelement[chapter]
        \startelement[p] \input davis \stopelement \par
    \stopelement

    \startelement[chapter]
        \startelement[p] \input zapf \stopelement \par
        \startelement[whatever]
            \startelement[p] \input tufte \stopelement \par
            \startelement[p] \input knuth \stopelement \par
        \stopelement
    \stopelement

    \startelement[chapter]
        oeps
        \startelement[p] \input ward \stopelement \par
    \stopelement

\stopelement

\stoptext
\stoptyping

Since this is not much fun, we also provide an automated variant. In the previous
example we explicitly turned off automated tagging by setting \type {method} to
\type {none}. By default it has the value \type {auto}.

\starttyping
\setupstructure[state=start] % default is method=auto

\definedescription[whatever]

\starttext

\startfrontmatter
    \startchapter[title=One]
        \startparagraph \input tufte \stopparagraph
        \startitemize
            \startitem first \stopitem
            \startitem second \stopitem
        \stopitemize
        \startparagraph \input ward \stopparagraph
        \startwhatever {Hermann Zapf} \input zapf \stopwhatever
    \stopchapter
\stopfrontmatter

\startbodymatter
..................
\stoptyping

If you use commands like \type {\chapter} you will not get the desired results.
Of course these can be supported but there is no real reason for it, as in \MKIV\
we advise using the \type {start}|-|\type {stop} variant.

It will be clear that this kind of automated tagging brings with it a couple of
extra commands deep down in \CONTEXT\ and there (of course) we use symbolic names
for tags, so that one can overload the built|-|in mapping.

\starttyping
\setuptaglabeltext[en][document=text]
\stoptyping

As with other features inspired by viewer functionality, the implementation of
tagging is independent of the backend.
For instance, we can tag a document and access the tagging information at the
\TEX\ end. The backend driver code maps tags to relevant \PDF\ constructs. First
of all, we just map the tags used at the \CONTEXT\ end onto themselves. But, as
validators expect certain names, we use the \PDF\ rolemap feature to map them to
(less interesting) names. The next list shows the currently used internal names,
with the \PDF\ ones between parentheses.

\blank

\startalignment[flushleft,nothyphenated]
\startluacode
local done = false
for k, v in table.sortedpairs(structures.tags.properties) do
    if v.pdf then
        if done then
            context(", %s (%s)",k,v.pdf)
        else
            context("%s (%s)",k,v.pdf)
            done = true
        end
    end
end
context(".")
\stopluacode
\par \stopalignment

\blank

So, the internal ones show up in the tag trees as shown in the examples but
applications might use the rolemap which normally has less detail.

Because we keep track of where we are, we can also use that information for
making decisions.

\starttyping
\doifinelementelse{structure:section}            {yes} {no}
\doifinelementelse{structure:chapter}            {yes} {no}
\doifinelementelse{division:*-structure:chapter} {yes} {no}
\doifinelementelse{division:*-structure:*}       {yes} {no}
\stoptyping

As shown, you can use \type {*} as a wildcard. The elements are separated by
\type {-}. If you don't know what tags are used, you can always enable the tag
related tracker:

\starttyping
\enabletrackers[structure.tags]
\stoptyping

This tracker reports the identified element chains to the console and log.

\stopsection

\startsection[title={Special care}]

Of course there are a few complications. First of all the tagging model sort of
contradicts the concept of a nicely typeset document where structure and outcome
are not always related. Most \TEX\ users are aware of the fact that \TEX\ does
not have space characters and does a great job on kerning and hyphenation. The
tagging machinery on the other hand uses a rather dumb model of strings
separated by spaces.
\footnote {The search engine on the other hand is rather clever at recognizing
words.} But we can trick \TEX\ into providing the right information to the
backend so that words get nicely separated. The non|-|optimized function that
does this looks as follows:

\starttyping
function injectspaces(head)
    local p -- the previous node
    for n in node.traverse(head) do
        local id = n.id
        if id == node.id("glue") then
            -- only act on glue that directly follows a glyph
            if p and p.id == node.id("glyph") then
                local g = node.copy(p)      -- same font as the previous glyph
                local s = node.copy(n.spec) -- private copy of the glue spec
                g.char, n.spec = 32, s
                -- link the space glyph in between p and n
                p.next, g.prev = g, p
                g.next, n.prev = n, g
                -- compensate for the width of the injected space
                s.width = s.width - g.width
            end
        elseif id == node.id("hlist") or id == node.id("vlist") then
            -- nested boxes are handled recursively
            injectspaces(n.list)
        end
        p = n
    end
end
\stoptyping

Here we squeeze in a space (given that it is in the font which it normally is
when you use \CONTEXT) and make a compensation in the glue. Given that your page
sits in box 255, you can do this just before shipping the page out:

\starttyping
injectspaces(tex.box[255].list)
\stoptyping

Then there are the so|-|called suspects: things on the page that are not related
to structure at all. One is supposed to tag these specially so that the
built|-|in reading equipment is not confused. So far we could get around them
simply because they don't get tagged at all and therefore are not seen anyway.
This might well be enough of a precaution.

Of course we need to deal with mathematics. Fortunately the presentation
\MATHML\ model is rather close to \TEX\ and so we can map onto that. After all we
don't need to care too much about back|-|mapping here. The currently present
code is rather experimental and might get extended or thrown out in favour of
inline \MATHML. \in {Figure} [fig:tagged-math] demonstrates that a first approach
does not even look that bad. In future versions we might deal with table|-|like
math constructs, like matrices.
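
If you want to see what the experimental math tagging produces for your own
formulas, a small test file is enough. The following is only a sketch: it assumes
the automated tagging setup shown earlier, and the formula is an arbitrary
illustration. Which elements end up in the tag tree may well depend on the
\CONTEXT\ version that you use.

\starttyping
\setupstructure[state=start] % assumption: automated tagging (method=auto)

\starttext
    \startformula
        a^2 + b^2 = c^2 % any formula will do, this one is just an illustration
    \stopformula
\stoptext
\stoptyping

Opening the resulting file in a viewer that can show the tag tree should then
reveal how the formula has been mapped.
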
\placefigure
  [here]
  [fig:tagged-math]
  {Experimental math tagging.}
  {\externalfigure[tagged-005.png][maxwidth=\textwidth]}

This is a typical case where more energy has to be spent on driving the voice of
Acrobat but I will do that when we find a good reason.

As mentioned, it will take a while before all relevant constructs in \CONTEXT\
support tagging, but support is already quite complete. Some screen dumps are
included as examples at the end.

\stopsection

\startsection[title={Conclusion}]

Surprisingly, implementing all this didn't take that much work. Of course
detailed automated structure support from the complete \CONTEXT\ kernel will take
some time to get completed, but that will be done on demand and when we run into
missing bits and pieces. It's still not decided to what extent alternate
representations and alternate texts will be supported. Experiments with the
reading|-|aloud machinery are not satisfactory yet but maybe it just can't get
any better. It would be nice if we could have some tags announced without
overloading the content, that is: without using ugly hacks.

And of course, code like this is never really finished if only because \PDF\
evolves. Also, it is yet another nice test case and torture test for \LUATEX\ and
it helps us to find buglets and oversights.

\stopsection

\startsection [title=Some more examples]

In \CONTEXT\ we have user definable verbatim environments. As with other user
definable environments we show the specific instance as a comment next to the
structure component. See \in {figure} [fig:tagged-verbatim]. Some examples of
tables are shown in \in {figure} [fig:tagged-tables]. Future versions will have a
bit more structure. Tables of contents (see \in {figure} [fig:tagged-contents])
and registers (see \in {figure} [fig:tagged-register]) are also tagged. (One
might wonder what the use of this is.) In \in {figure} [fig:tagged-floats] we see
some examples of floats. External images as well as \METAPOST\ graphics are
tagged as such.
This example also shows a user environment, in this case:

\starttyping
\definestartstop[notabene][style=\bf]
\stoptyping

In a similar fashion, footnotes (\in {figure} [fig:tagged-footnotes]) end up in
the structure tree, but in the typeset document they move around (normally
forward when there is no room).

\placefigure
  [here]
  [fig:tagged-verbatim]
  {Verbatim, including dedicated instances.}
  {\externalfigure[tagged-006.png][maxwidth=\textwidth]}

\placefigure
  [here]
  [fig:tagged-tables]
  {Natural tables as well as the tabulate mechanism are supported.}
  {\externalfigure[tagged-008.png][maxwidth=\textwidth]}

\placefigure
  [here]
  [fig:tagged-contents]
  {Tables of contents with specific entries tagged.}
  {\externalfigure[tagged-007.png][maxwidth=\textwidth]}

\placefigure
  [here]
  [fig:tagged-register]
  {A detailed view of registers is provided.}
  {\externalfigure[tagged-009.png][maxwidth=\textwidth]}

\placefigure
  [here]
  [fig:tagged-floats]
  {Float tags end up in the text stream. Watch the user defined construct.}
  {\externalfigure[tagged-011.png][maxwidth=\textwidth]}

\placefigure
  [here]
  [fig:tagged-footnotes]
  {Footnotes are shown at their place in the input (flow).}
  {\externalfigure[tagged-010.png][maxwidth=\textwidth]}

\stopsection

\stopchapter

\stopcomponent