tagging.tex /size: 19 Kb last modification: 2025-02-21 11:03
% language=us runpath=texruns:manuals/tagging

% todo: use concrete

\usemodule
  [abbreviations-logos,scite,math-verbatim]

% \showframe

\setupbackend[format=pdf/ua-2]
\setuptagging[state=start]
% \nopdfcompression

\setupbodyfont
  [pagella,12pt]

\setuplayout
  [header=0pt,
   width=middle]

\setupheadertexts
  []

\setupfootertexts
  [chapter]
  [pagenumber]

\setupwhitespace
  [big]

\setuphead
  [chapter]
  [style=\bfc,
   interaction=all]

\setuphead
  [section]
  [style=\bfb]

\setuphead
  [subsection]
  [style=\bfa]

\setuphead
  [subsubsection]
  [style=\bf,
   after=]

\setuplist
  [interaction=all]

\setupdocument
  [before=\directsetup{document:titlepage}]

% The tag shape will be improved.

\startuseMPgraphic{titlepage}
    numeric n ;

    path p ; p :=
            (1,0)
        --- (5,0)
        ... (6,3)
        ... (4,4)
        --- (4,6)
        ... (3,7)
        ... (2,6)
        --- (2,4)
        ... (0,3)
        ... cycle
    ;

    for i=1 upto 21 :
        for j=1 upto 30 :
            draw image (
                fill
                    p
                    withcolor .7yellow ;
                fill
                    fullcircle shifted (3,6)
                    withcolor white ;
                n := -1 randomized 2 ;
                draw textext (
                    if n > 0.5 :
                        "\ttbf no\hskip.2em tag"
                    elseif n > 0 :
                        "\ttbf retag"
                    elseif n > -0.5 :
                        "\ttbf untag"
                    else :
                        "\ttbf tag"
                    fi
                ) ysized 3/2 shifted (3,3/2)
                    withcolor white ;
            ) rotated (-15 randomized 30)
                shifted (10i,10j) ;
        endfor ;
    endfor ;

    setbounds currentpicture to boundingbox currentpicture enlarged 3 ;

    addbackground withcolor .8blue ;

    currentpicture := currentpicture xysized(PaperWidth,PaperHeight) ;

    picture q[] ;

    q[1] := image (
        draw image (
            for i=1 upto 15 :
                for j=1 upto 14 :
                    fill
                        (fullsquare xyscaled (2,3))
                        shifted (3*i,4*j)
                    ;
                endfor ;
            endfor ;
        ) withcolor .7red ;
        draw image (
            for i=1 upto 15 :
                for j=1 upto 14 :
                    draw
                        textext("\ttbf MCID") rotated 90 xsized .8
                        shifted (3*i,4*j)
                    ;
                endfor ;
            endfor ;
        ) withcolor white ;
    )
      xsized .9PaperWidth
    ;

    q[2] := image (
        fill
            p
            withcolor .7green ;
        fill
            fullcircle shifted (3,6)
            withcolor white ;
        n := -1 randomized 2 ;
        draw textext ("\ttbf PDF")
            ysized 3/2 shifted (3,3/2)
            withcolor white ;
    )
      rotated -5
      xsized .25PaperWidth
    ;

    q[3] := image (
        fill
            p
            withcolor .7green ;
        fill
            fullcircle shifted (3,6)
            withcolor white ;
        n := -1 randomized 2 ;
        draw textext ("\ttbf tagged")
            ysized 3/2 shifted (3,3/2)
            withcolor white ;
    )
      rotated 5
      xsized .25PaperWidth
    ;

    q[1] := q[1]
        shifted -center topboundary q[1]
        shifted center topboundary currentpicture
        shifted (0,-PaperHeight/20)
    ;

    q[2] := q[2]
        shifted -center topboundary q[2]
        shifted center bottomboundary q[1]
        shifted (6.5PaperWidth/20,1.5PaperHeight/20)
    ;

    q[3] := q[3]
        shifted -center topboundary q[3]
        shifted center bottomboundary q[1]
        shifted (PaperWidth/20,1PaperHeight/20)
    ;

    draw q[1] withtransparency (1,.70) ;
    draw q[2] withtransparency (1,.85) ;
    draw q[3] withtransparency (1,.85) ;

\stopuseMPgraphic

\startsetups document:titlepage
    \startTEXpage
        \useMPgraphic{titlepage}
    \stopTEXpage
\stopsetups

\setuptyping[option=TEX]

\startdocument[title=foo]

\startchapter[title=Why do we tag]

Around 2010 tagged \PDF\ showed up in \CONTEXT. Apart from demonstrating that it
could be done it served little purpose, because only full Acrobat could show a
structure tree and in the decade and more that followed no other viewer did
anything with it. However, for some users it was a necessity.

In 2024 we picked up on tagging because regulations (especially in higher
education) led to demands for tagged \PDF\ from the perspective of accessibility.
We will not go into details here but just mention that we want to make sure
that users can meet these demands.

As of now (2024) we have low expectations when it comes to tagging. The
ongoing discussions about how to tag, how to interpret the specification, what to
validate, and what to expect from applications are likely to go on for a while,
so the best we can do is keep an eye on it and adapt when needed. If we have
opinions, these will be exposed in other documents (and articles).

We can also notice that the standard is less of a standard as things change,
partly as a side effect of clarification (which tells us something) but also
because it looks like some applications have problems with it. Working on this is
disappointing and dissatisfying, but often we have a good laugh about this mess,
so we try to keep up (adapt) anyway.

\startlines
Hans Hagen
Mikael Sundqvist
\stoplines

\stopchapter

\startchapter[title=Tagging text]

As mentioned in the introduction, we need to satisfy validators that are imposed
on those working in education (often via web interfaces with little information
on what actually gets checked, it's business after all). It is not that hard to
fool them and make documents compliant, so that is what we can do anyway. It is
also possible to let these tools do some auto tagging but our experiments showed
that this is a disaster. So, we end up with a mix of relatively rich tagging that
we feel good about. When we're a decade down the road we expect that, with a
little help from large language models, a decent verbose tagging is better than a
crappy suboptimal one.

One reason for tagging is that it could permit extraction but there are better
solutions to that: if there is something shown in a table or graphic, why not add
the dataset? We currently add \MATHML\ and \BIBTEX\ blobs but more can become
possible in the future (this also depends on user demand).

Another application is reflow but when that is needed, why not go \HTML\ or
distribute different output? When accessibility is the target one has to wait
until it is clearer how that is actually supposed to work. Often the
recommendations are to use Arial, little color, simple sectioning etc., so that
gives little reason to use \PDF\ at all.

All that said, we assume that \PDF\ level 2 is used, if only because it looks
like validators aim for that. Also, if you find pre level 2 documents produced
elsewhere, often tagging is so bad or weird that one might as well ignore it.

Tagging in a document is enabled with:

\starttyping
\setupbackend[format=pdf/ua-2]
\setuptagging[state=start]
\stoptyping

The first command ensures that the right data ends up in the \PDF\ file, and the
second one enables tagging. As long as you're working on a document you can
comment out these commands, which saves you some runtime and gives much smaller
files.
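
A minimal sketch of an alternative to commenting the two commands in and out,
assuming you are willing to wrap them in a mode (the mode name \type {tagged} is
just a hypothetical choice, it is not predefined):

\starttyping
% only active when the (hypothetical) mode 'tagged' is enabled
\startmode[tagged]
  \setupbackend[format=pdf/ua-2]
  \setuptagging[state=start]
\stopmode
\stoptyping

A draft run with \type {context myfile.tex} then skips tagging, while \type
{context --mode=tagged myfile.tex} should give the tagged variant (here \type
{myfile.tex} is a placeholder name).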

We don't want to cripple proper \CONTEXT\ structure support by the limitations
introduced in \PDF\ version 2, but we do offer users some control, as long as it
does not backfire. Due to the fluid situation (around 2024) we delegate some
choices to the user. By default we use a robust mapping (read: not sensitive to
limitations in nesting \PDF\ specific tags cf.\ checkers) but you can say this:

\starttyping
\enabledirectives [backend.usetags=crap]
\stoptyping

and map to an alternative set. With

\starttyping
\enabledirectives [backend.usetags=mkiv]
\stoptyping

you get the mapping used in \MKIV\ but that one fails level 2 validation. The
\quote {crap} file has some notes on how to define things. The somewhat strange
section title mapping is due to the fact that nested sections are not really
supported in a way that permits the title and content to be properly tagged.

\stopchapter

\startchapter[title=Tagging math]

Tagging math at level 2 is still experimental but works as follows. Instead of
tagging the atoms and structures, as we do in level 1, we generate a \MATHML\
attachment and put a so called actual text on the math structure node. This text
can be spoken by reading machinery. The \MATHML\ is not that rich but we can enable
more detail when needed. However, given the way (presentational) \MATHML\ evolved
we are somewhat pessimistic. Instead of adding a few more elements that would
help to provide structure, some features are dropped. Also, support in browsers
comes and goes, either native or depending on \JAVASCRIPT.

Because there is much freedom in how mathematical symbols and constructs are
used, you might need to help math tagging a bit. The process is driven by group
sets that refer to domains. An example of a domain is chemistry. For now we just
mention that this feature is there and as time flies by we can expect more
granular usage.

\starttyping
\definemathgroupset
  [mydomain]
  [every] % a list of dictionaries

\setmathgroupset
  [mydomain]
\stoptyping

For now you can ignore these commands because we default to \type {every}.

{Todo: list all possible dictionaries.}

You can control the tagger by specifying what symbols and characters actually
mean, for instance:

\starttyping
\registermathfunction[𝑓]
\registermathfunction[𝑔]

% \registermathsymbol[default][en][𝐮][the vector]
% \registermathsymbol[default][en][𝐯][the vector]
% \registermathsymbol[default][en][𝖠][the matrix]

\registermathsymbol[default][en][lowercasebold]           [the vector] % [of]
\registermathsymbol[default][en][uppercasesansserifnormal][the matrix]
\stoptyping

From the language tag being used here you can deduce that this can be done per
language.
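
As a sketch of what a per language setup could look like, assuming the same
category names as above and the regular \CONTEXT\ language tag \type {sv} for
Swedish (the Swedish wording is only an illustration and is not taken from the
distributed translations):

\starttyping
% hypothetical Swedish counterparts of the English registrations above
\registermathsymbol[default][sv][lowercasebold]           [vektorn]
\registermathsymbol[default][sv][uppercasesansserifnormal][matrisen]
\stoptyping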

You can trace math translations with:

\starttyping
\setupnote[mathnote][location=page]
\enabletrackers[math.textblobs]
\stoptyping

which is what we used when developing these features. In a math book of a few
hundred pages one easily gets thousands of notes.

In \type {examples-mathmeanings} you can find a lot of examples. In due time we
expect to offer more translations. The English and Swedish are for now the
benchmark. \footnote {As a proof of concept, at Bacho\TeX\ 2024, the Ukrainian
translations were provided by Team Odessa, but they need some tuning.} Likely
other languages will be served by Tomáš Hala as a result of courses on typesetting.
Feel free to contact all those involved in this.

\stopchapter

% time stamp next sections: Rendezvous Point - Presence, mid 2024 (a whow video too)

\startchapter[title=Structure]

Although today all goes to \PDF, that is not what \TEX\ macro packages started
with. Basically they just use the \TEX\ engine to render something in the
tradition of printing but using a target format that can be converted to
something that a printer understands. Nowadays that just happens to be \PDF.

So, although the target is \PDF, that doesn't mean that \PDF\ drives (or should
drive) the process. If we want what is called tagged \PDF\ where tagging
represents structure, one could argue that this is then a follow up on whatever
structure users used in the process. In \CONTEXT\ we start from the \TEX\ input
end, not from some tagging related \PDF\ wish list which could handicap us. So
called tagged \PDF\ is not the objective, it is just a possible byproduct.

Keep in mind that tagging related to structure serves a few purposes: reflow,
conversion, and accessibility. We're not at all interested in reflow of \PDF,
because for that one should just produce \HTML. We're also not interested
in conversion because, again, one could just use a different workflow, maybe one
that starts from \XML\ and can target different media. When it comes to
accessibility this mixed bag contains options like generating different versions,
each tuned to a specific target audience. Typesetting is about generating some
visual representation and just like people have different food preferences, one
can imagine different representations: there is no reason to only produce \PDF.
And even if there are ways to help something rendered for printing, or reading on
screen, for instance by providing audio, there is no need to do that for very
complex documents. Given the often poor quality of simple \TEX\ documents one can
even wonder if that tool should be used at all then. It's not like \TEX\ is the
only system that can do math nowadays.

When we look at structure, this is how \CONTEXT\ sees a section:

\starttyping[option=TEX]
\startchapter[title={This is a chapter.}]
  \startsection[title={This is a section.}]
    Some text here.
  \stopsection
\stopchapter
\stoptyping

In \XML\ that could be something like this, with the number being optional as it
can be generated:

\starttyping[option=XML]
<section detail="chapter">
  <sectioncaption>
    <sectionnumber>1</sectionnumber>
    <sectiontitle>This is a chapter.</sectiontitle>
  </sectioncaption>
  <sectioncontent>
    <section detail="section">
      <sectioncaption>
        <sectionnumber>1.2</sectionnumber>
        <sectiontitle>This is a section.</sectiontitle>
      </sectioncaption>
      <sectioncontent>
        Some text here.
      </sectioncontent>
    </section>
  </sectioncontent>
</section>
\stoptyping

In a \PDF\ there can also be additional rendered material, like headers and
footers, and maybe the section title is rendered in a special way, but we ignore
that for now.

When it comes to the content blob, we have to look at the \TEX\ end. User input
normally will give this:

\starttyping[option=XML]
<sectioncontent>
  A first paragraph.

  A second paragraph.
</sectioncontent>
\stoptyping

An empty line starts a paragraph but it can also be explicitly forced (think
\type {\par}).

\starttyping[option=XML]
<sectioncontent>
  A first paragraph.
  <break/>
  A second paragraph.
</sectioncontent>
\stoptyping

But one can also explicitly encode paragraphs and then get:

\starttyping[option=XML]
<sectioncontent>
  <paragraph>A first paragraph.</paragraph>
  <paragraph>A second paragraph.</paragraph>
</sectioncontent>
\stoptyping

which in \TEX\ speak is:

\starttyping[option=TEX]
\startchapter[title={This is a chapter.}]
  \startsection[title={This is a section.}]
    \startparagraph A first paragraph.  \stopparagraph
    \startparagraph A second paragraph. \stopparagraph
  \stopsection
\stopchapter
\stoptyping

But it will be clear that not all users want to do that, which means that we end
up with the \type {<break/>} variant. Now you can ask, why not infer this extra
level of structure, and the answer is: it's not how \TEX\ works. The content can
be anything and the fact that there is no real clear solution is actually
reflected in how \PDF\ tagging maps onto pseudo \HTML\ elements: not all can
nest, so for instance a paragraph cannot contain a paragraph. That means that we
cannot reliably add that level of structure automatically as it limits the
degrees of freedom that users have. As mentioned: tagging in \PDF\ is not the
starting point, just a possible byproduct.

Let's look at another structure element:

\starttyping[option=TEX]
\startitemize
  \startitem A first item. \stopitem
  \startitem
    A second item.
    \startitemize
      \startitem Again a first item. \stopitem
      \startitem And a last one. \stopitem
    \stopitemize
  \stopitem
  \startitem A third item. \stopitem
\stopitemize
\stoptyping

If we start from input, we can use this kind of \XML:

\starttyping[option=XML]
<itemize>
  <item>A first item.</item>
  <item>
    A second item.
    <itemize>
      <item>Again a first item.</item>
      <item>And a last one.</item>
    </itemize>
  </item>
  <item>A third item.</item>
</itemize>
\stoptyping

But once we're done, we actually have something typeset, so we end up with more
detail:

\starttyping[option=XML]
<itemgroup>
  <item>
    <itemtag>1.</itemtag>
    <itemcontent>
      <itemhead/>
      <itembody>A first item.</itembody>
    </itemcontent>
  </item>
  <item>
    <itemtag>2.</itemtag>
    <itemcontent>
      <itemhead/>
      <itembody>
        A second item.
        <itemgroup>
          <item>
            <itemtag>a.</itemtag>
            <itemcontent>
              <itemhead/>
              <itembody>Again a first item.</itembody>
            </itemcontent>
          </item>
          <item>
            <itemtag>b.</itemtag>
            <itemcontent>
              <itemhead/>
              <itembody>And a last one.</itembody>
            </itemcontent>
          </item>
        </itemgroup>
      </itembody>
    </itemcontent>
  </item>
  <item>
    <itemtag>3.</itemtag>
    <itemcontent>
      <itemhead/>
      <itembody>
        A third item.
      </itembody>
    </itemcontent>
  </item>
</itemgroup>
\stoptyping

This represents what gets rendered but one can leave out the tag and let whatever
interprets this deal with that.

So, we have input that can be either explicit (given numbers and tags) or
implicit (the system generates them) but output can also be explicit or implicit.
And when output carries structure the question is: do we want to preserve
abstraction or do we want the rendered results? It only makes sense to invest in
this when it pays off, also because the resulting \PDF\ file gets bloated a lot.

\stopchapter

\startchapter[title=PDF]

The specification (say the second edition of 2020) has a section about tagged
\PDF\ in the perspective of reflow, conversion and accessibility. As we mentioned
already the lack of tools using any of this didn't help much in clarifying all
this. At some point it became possible to verify a \PDF\ file and some
rudimentary converters popped up but it looks like everyone had to interpret the
rules laid out in the specification. Mid 2024 we also had numerous errata
and|/|or clarifications of the specification (not only tagging) but (at least for
us) it is not clear what the criteria for changes (more restrictions) were.

For instance, the ISO 32000-2 specification explicitly mentions the usefulness of
the \type {H} mapping for classes of documents. But in ISO 14289-2:2024 we can
read \quotation {The \type {H} structure type requires processors to track
section depth, which adds an unnecessary burden on processors and can cause
ambiguity.} This is quite a baffling remark given the complexity of \PDF\ and web
technologies in general.

It is this, and other vague (and changing) descriptions, take for instance \type
{Part} and \type {Sect}, that made us decide to draw a line. We tried some in our
opinion reasonable variants but could never satisfy the validators completely.
Keep in mind that we started from tagging everything, not just the easy bits and
pieces and then wrapping whatever is left in artifacts or a wildcard paragraph.
When we looked at tagging with 2024 glasses on we were willing to adapt but in the
end it makes little sense. We can add a few mappings but in general it is too
conflicting with our approach to structure, one that goes back decades. So, as
long as an (audio) reader can do a reasonable job, we're okay. There are more
interesting challenges on our plates anyway.

\stopchapter

\startchapter[title=Tracing]

There are a few trackers that relate to tagging but these are more for ourselves
so we just mention them:

\starttyping
\enabletrackers[structures.tags]
\enabletrackers[structures.tags.info]
\enabletrackers[structures.tags.math]
\enabletrackers[structures.tags.blobs]
\enabletrackers[structures.tags.internals]
\enabletrackers[structures.tags.suspects]
\stoptyping

The first one is probably the most useful as it shows how \CONTEXT\ sees the
structure of your document.
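
As a minimal sketch (the title and content are just placeholders) one can try the
first tracker on a small test file that combines only commands already shown in
this manual:

\starttyping
\setupbackend[format=pdf/ua-2]
\setuptagging[state=start]

\enabletrackers[structures.tags]

\startdocument[title=test]
  \startchapter[title={A chapter}]
    \startparagraph Some text. \stopparagraph
  \stopchapter
\stopdocument
\stoptyping

How the structure is reported is not described here, so it is best to check what
your \CONTEXT\ version actually produces.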

\stopchapter

\stopdocument
