% language=us
% hybrid-characters.tex, last modification: 2023-12-21 09:43

\startcomponent hybrid-characters

\environment hybrid-environment

\startchapter[title={Characters with special meanings}]

\startsection[title={Introduction}]

When \TEX\ was designed, \UNICODE\ was not yet available and characters were
encoded in a seven or eight bit encoding, like \ASCII\ or \EBCDIC. Also, the
layout of keyboards depended on the vendor. A lot has happened since then:
\UNICODE\ has increasingly become the standard (with \UTF\ as a widely used way
of encoding it efficiently).

Also at that time, fonts on computers were limited to 256 characters at most.
This resulted in \TEX\ macro packages dealing with some form of input encoding on
the one hand and a font encoding on the other. As a side effect of character
nodes storing a reference to a glyph in a font, hyphenation was related to font
encodings. All this was quite okay for documents written in English but when
\TEX\ became popular in more countries more input as well as font encodings were
used.

Of course, with \LUATEX\ being a \UNICODE\ engine this has changed, and even more
because wide fonts (either \TYPEONE\ or \OPENTYPE) are supported. However, as
\TEX\ is already widely used, we cannot simply change the way characters are
treated, certainly not special ones. Let's go back in time and see how plain
\TEX\ set some standards, see how \CONTEXT\ does it currently, and look ahead at
how future versions will deal with it.

\stopsection

\startsection[title={Catcodes}]

Traditional \TEX\ is an eight bit engine while \LUATEX\ extends this to \UTF\
input and internally works with large numbers.

In addition to its natural number (at most 0xFF for traditional \TEX\ and up to
0x10FFFF for \LUATEX), each character can have a so-called category code, or
catcode. This code determines how \TEX\ will treat the character when it is seen
in the input. The category code is stored with the character so when we change
such a code, already read characters retain theirs. Once typeset, a character can
have turned into a glyph and its catcode properties are lost.
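
That already read characters retain their catcode can be seen in a small
experiment. The following is only a sketch: it assumes the plain \TEX\ catcode
assignments (the underscore being a subscript character) and the macro name
\type {\stored} is made up for this illustration.

\starttyping
\def\stored{$x_1$}    % the underscore is read here, with catcode 8
\catcode`\_ = 12      % an underscore read from now on is an other character
\stored               % still a subscript: the stored token kept catcode 8
\stoptyping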

There are 16 possible catcodes that have the following meaning:

\starttabulate[|l|l|p|]
\NC 0 \NC escape \NC This starts a control sequence. The scanner
reads the whole sequence and stores a reference to it in an
efficient way. For instance the character sequence \type {\relax}
starts with a backslash that has category code zero and \TEX\
reads on until it meets a non-letter. In macro definitions a
reference to the so-called hash table is stored. \NC \NR
\NC 1 \NC begin group \NC This marks the beginning of a group. A group
can be used to indicate a scope, the content of a token list, box
or macro body, etc. \NC \NR
\NC 2 \NC end group \NC This marks the end of a group. \NC \NR
\NC 3 \NC math shift \NC Math starts and ends with characters
tagged like this. Two in a row indicate display math. \NC \NR
\NC 4 \NC alignment tab \NC Characters with this property indicate
a next entry in an alignment. \NC \NR
\NC 5 \NC end line \NC This one is somewhat special. As line
endings are operating system dependent, they are normalized to
character 13 and by default that one has this category code. \NC
\NR
\NC 6 \NC parameter \NC Macro parameters start with a character
with this category code. Such characters are also used in
alignment specifications. In nested definitions, multiple of them
in a row are used. \NC \NR
\NC 7 \NC superscript \NC Tagged like this, a character signals
that the next token (or group) is to be superscripted. Two such
characters in a row will make the parser treat the following
character or lowercase hexadecimal number as a specification for
a replacement character. \NC \NR
\NC 8 \NC subscript \NC Tagged like this, a character signals that
the next token (or group) is to be subscripted. \NC \NR
\NC 9 \NC ignored \NC When a character has this category code it
is simply ignored. \NC \NR
\NC 10 \NC space \NC This one is also special. Any character tagged
as such is converted to the \ASCII\ space character with code 32.
\NC \NR
\NC 11 \NC letter \NC Normally these are the characters that make up
sequences with a meaning, like words. Letters are special in the sense that
macro names of more than one character can only be made of letters. The
hyphenation machinery will normally only deal with letters. \NC \NR
\NC 12 \NC other \NC Examples of other characters are punctuation and
special symbols. \NC \NR
\NC 13 \NC active \NC This makes a character into a macro. Of course
it needs to get a meaning in order not to trigger an error. \NC \NR
\NC 14 \NC comment \NC All characters on the same line after comment
characters are ignored. \NC \NR
\NC 15 \NC invalid \NC An error message is issued when an invalid
character is seen. This catcode is probably not assigned very
often. \NC \NR
\stoptabulate

So, there is a lot to tell about these codes. We will not discuss the input
parser here, but it is good to know that the following happens.

\startitemize[packed]
\startitem
    The engine reads lines, and normalizes carriage return
    and linefeed sequences.
\stopitem
\startitem
    Each line gets a character with number \type {\endlinechar} appended.
    Normally this is a character with code 13. In \LUATEX\ a value of $-1$ will
    disable this automatism.
\stopitem
\startitem
    Normally spaces (characters with the space property) at the end of a line are
    discarded.
\stopitem
\startitem
    Sequences like \type {^^A} are converted to characters with numbers depending
    on the position in the \ASCII\ vector: \type {^^@} is zero, \type {^^A} is one,
    etc.
\stopitem
\startitem
    Sequences like \type {^^1f} are converted to characters with a number similar
    to the (lowercase) hexadecimal part.
\stopitem
\stopitemize
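
The two circumflex notations from the list above can be checked by asking \TEX\
for character numbers. This is only a sketch and it assumes a regime in which the
circumflex has catcode~7, as in plain \TEX:

\starttyping
\number`\^^@   % 0:  @ is character 64 and 64 - 64 = 0
\number`\^^A   % 1:  A is character 65 and 65 - 64 = 1
\number`\^^M   % 13: the default value of \endlinechar
\number`\^^1f  % 31: two lowercase hexadecimal digits
\stoptyping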

Hopefully this is enough background information to get through the following
sections so let's stick to a simple example:

\starttyping
\def\test#1{$x_{#1}$}
\stoptyping

Here there are two control sequences, starting with a backslash with category
code zero. Then comes a category~6 character that indicates a parameter that is
referenced later on. The outer curly braces encapsulate the definition and the
inner two braces mark the argument to a subscript, which itself is indicated by
an underscore with category code~8. The start and end of math mode are indicated
with a dollar sign that is tagged as math shift (category code~3). The character
\type {x} is just a letter.

Given the above description, how do we deal with catcodes and newlines at the
\LUA\ end? Catcodes are easy: we can print back to \TEX\ using a specific catcode
regime (later we will see a few of those regimes). As character~13 is used as
default at the \TEX\ end, we should also use it at the \LUA\ end, i.e.\ we should
use \type {\r} as line terminator (\type {\endlinechar}). On the other hand, we
have to use \type {\n} (character 10, \type {\newlinechar}) for printing to the
terminal, log file, or \TEX\ output handles, although in \CONTEXT\ all that
happens via \LUA\ anyway, so we don't bother too much about it here.

There is a pitfall. As \TEX\ reads lines, it depends on the file system to
provide them: it fetches lines or whatever represents the same on block devices.
In \LUATEX\ the implementation is similar: if you plug in a reader callback, it
has to provide a function that returns a line. Passing two lines does not work
out as expected as \TEX\ discards anything following the line separator (cr, lf
or crlf) and then appends a normalized endline character (in our case
character~13). At least, this is what \TEX\ does naturally. So, in callbacks you
can best feed line by line without any of those characters.

When you print something from \LUA\ to \TEX\ the situation is slightly different:

\startbuffer
\startluacode
tex.print("line 1\r line 2")
tex.print("line 3\n line 4")
\stopluacode
\stopbuffer

\typebuffer

This is what we get:

\startpacked\getbuffer\stoppacked

The explicit \type {\endlinechar} (\type {\r}) terminates the line and the rest
gets discarded. However, a \type {\n} by default has category code~12 (other) and
is turned into a space and successive spaces are (normally) ignored, which is why
we get the third and fourth line separated by a space.
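
One way to stay clear of these differences is to hand lines to \TEX\ one by one.
The following is a minimal sketch and not code taken from \CONTEXT: the variable
\type {text} and its content are made up, and the pattern used skips empty lines.

\starttyping
\startluacode
local text = "line 1\r\nline 2\nline 3"
-- split on carriage returns and linefeeds, print each piece as its own line
for line in string.gmatch(text, "[^\r\n]+") do
    tex.print(line)
end
\stopluacode
\stoptyping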

Things get really hairy when we do the following:

\startbuffer
\startluacode
tex.print("\\bgroup")
tex.print("\\obeylines")
tex.print("line 1\r line 2")
tex.print("line 3\n line 4")
tex.print("\\egroup")
\stopluacode
\stopbuffer

\typebuffer

Now we get this (the \type {tex.print} function appends an endline character
itself):

\startpacked\getbuffer\stoppacked

By making the endline character active and equivalent to \type {\par}, \TEX\
nicely scans on and we get the second line as well. Now, if you're still with us,
you're ready for the next section.

\stopsection

\startsection[title={Plain \TEX}]

In the \TEX\ engine, some characters already have a special meaning. This is
needed because otherwise we cannot use the macro language to set up the format.
This is hard|-|coded so the next code is not really used.

\starttyping
\catcode `\^^@ =  9  % ascii null is ignored
\catcode `\^^M =  5  % ascii return is end-line
\catcode `\\   =  0  % backslash is TeX escape character
\catcode `\%   = 14  % percent sign is comment character
\catcode `\    = 10  % ascii space is blank space
\catcode `\^^? = 15  % ascii delete is invalid
\stoptyping

There is no real reason for setting up the null and delete character but maybe in
those days the input could contain them. The regular upper- and lowercase
characters are initialized to be letters with catcode~11. All other characters
get category code~12 (other).

The plain \TEX\ format starts with setting up some characters that get a special
meaning.

\starttyping
\catcode `\{   =  1 % left brace is begin-group character
\catcode `\}   =  2 % right brace is end-group character
\catcode `\$   =  3 % dollar sign is math shift
\catcode `\&   =  4 % ampersand is alignment tab
\catcode `\#   =  6 % hash mark is macro parameter character
\catcode `\^   =  7 \catcode`\^^K=7 % circumflex and uparrow
                                    % are for superscripts
\catcode `\_   =  8 \catcode`\^^A=8 % underline and downarrow
                                    % are for subscripts
\catcode `\^^I = 10 % ascii tab is a blank space
\catcode `\~   = 13 % tilde is active
\stoptyping

The fact that this happens in the format file indicates that it is not by design
that for instance curly braces are used for grouping, or the hash for indicating
arguments. Even math could have been set up differently. Nevertheless, all macro
packages have adopted these conventions so they could as well have been
hard|-|coded presets.

Keep in mind that nothing prevents us from defining more characters this way, so
we could make square brackets into group characters as well. I wonder how many
people have used the two additional special characters that can be used for
super- and subscripts. The comment indicates that they are meant for a special
keyboard.
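
As an illustration of the remark about square brackets, the following sketch
gives them the same category codes as the curly braces. This is not something
that plain \TEX\ or \CONTEXT\ does, and in \CONTEXT\ it would get in the way of
optional arguments between brackets.

\starttyping
\catcode`\[ = 1   % left bracket also begins a group
\catcode`\] = 2   % right bracket also ends a group
[\bf bold] and normal again
\stoptyping

Because \TEX\ matches group delimiters by category code and not by character,
\type {{\bf bold]} is balanced as well once these assignments are in effect.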

One way to make sure that a macro will not be overloaded is to use characters in
its name that are letters when defining the macro but make sure that they are
others when the user inputs text.

\starttyping
\catcode `@ = 11
\stoptyping

Again, the fact that plain \TEX\ uses the commercial at sign has set a standard.
After all, at that time this symbol was not as popular as it is nowadays.
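
A minimal sketch of this protection scheme, in plain \TEX\ terms. The macro names
\type {\my@helper} and \type {\myhelper} are made up for the illustration.

\starttyping
\catcode`\@ = 11            % the at sign is a letter while we define macros
\def\my@helper{hidden}      % hypothetical internal macro
\def\myhelper{\my@helper}   % hypothetical public macro that uses it
\catcode`\@ = 12            % back to other for user input
\myhelper                   % gives: hidden
\my@helper                  % now read as \my followed by the characters @helper
\stoptyping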

Further on in the format some more catcode magic happens. For instance this:

\starttyping
\catcode `\^^L = 13 \outer\def^^L{\par} % ascii form-feed is "\outer\par"
\stoptyping

So, in your input a formfeed is equivalent to an empty line which makes sense,
although later we will see that in \CONTEXT\ we do it differently. As the tilde
was already active it also gets defined:

\starttyping
\def~{\penalty10000\ } % tie
\stoptyping

Again, this convention is adopted and therefore a sort of standard. Nowadays we
have special \UNICODE\ characters for this, but as they don't have a
visualization, editing is somewhat cumbersome.

The change in catcode of the newline character \type {^^M} is done locally, for
instance in \type {\obeylines}. Keep in mind that this is the character that
\TEX\ appends to the end of an input line. The space is made active when spaces
are to be obeyed.

A few very special cases are the following.

\starttyping
\mathcode `\^^Z = "8000 % \ne
\mathcode `\    = "8000 % \space
\mathcode `\'   = "8000 % ^\prime
\mathcode `\_   = "8000 % \_
\stoptyping

This flags those characters as being special in math mode. Consider something
like this:

\starttyping
\def\test#1{$#1$} \test{x_2} \test{x''}
\stoptyping

The catcodes that are set when passing the argument to \type {\test} are frozen
when they end up in the body of the macro. This means that when \type {'} is
other it will be other when the math list is built. However, in math mode, plain
\TEX\ wants to turn that character into a prime and even into a double one when
there are two in a row. The special value \type {"8000} tells the math machinery
that when it has an active meaning, that one will be triggered. And indeed, the
plain format defines these active characters, but in a special way, sort of:

\starttyping
{ \catcode`\' = 13 \gdef'{....} }
\stoptyping

So, when active it has a meaning, and it happens to be only treated as active
when in math mode.
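
The same mechanism can be tried with another character. The next lines are a
sketch in plain \TEX\ terms: the choice of the asterisk and of \type {\times} as
its meaning is made up here and is not what plain \TEX\ or \CONTEXT\ does.

\starttyping
\mathcode`\* = "8000              % in math mode treat * as if it were active
{\catcode`\* = 13 \gdef*{\times}} % the active meaning, defined inside a group
2*3 and $2*3$                     % an asterisk in text, a times sign in math
\stoptyping

Outside the group the asterisk keeps its normal catcode, so in text it is still
an ordinary character.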

Quite a few other math codes are set as well, like:

\starttyping
\mathcode`\^^@ = "2201 % \cdot
\mathcode`\^^A = "3223 % \downarrow
\mathcode`\^^B = "010B % \alpha
\mathcode`\^^C = "010C % \beta
\stoptyping

In Appendix~C of The \TeX book Don Knuth explains the rationale behind this
choice: he had a keyboard that had these shortcuts. As a consequence, one of the
math font encodings also has that layout. It must have been a pretty classified
keyboard as I could not find a picture on the internet. One can probably assemble
such a keyboard from one of those keyboards that come with no imprint. Anyhow, Don
explicitly says \quotation {Of course, designers of \TEX\ macro packages that are
intended to be widely used should stick to the standard \ASCII\ characters.} so
that is what we do in the next sections.

\stopsection

\startsection[title={How about \CONTEXT}]

In \CONTEXT\ we've always used several catcode regimes and switching between them
was a massive operation. Think of a different regime when defining macros,
inputting text, typesetting verbatim, processing \XML, etc. When \LUATEX\
introduced catcode tables, the existing mechanisms were rewritten to take
advantage of this. This is the standard table for input as of December 2010.

\starttyping
\startcatcodetable \ctxcatcodes
  \catcode \tabasciicode        \spacecatcode
  \catcode \endoflineasciicode  \endoflinecatcode
  \catcode \formfeedasciicode   \endoflinecatcode
  \catcode \spaceasciicode      \spacecatcode
  \catcode \endoffileasciicode  \ignorecatcode
  \catcode \circumflexasciicode \superscriptcatcode
  \catcode \underscoreasciicode \subscriptcatcode
  \catcode \ampersandasciicode  \alignmentcatcode
  \catcode \backslashasciicode  \escapecatcode
  \catcode \leftbraceasciicode  \begingroupcatcode
  \catcode \rightbraceasciicode \endgroupcatcode
  \catcode \dollarasciicode     \mathshiftcatcode
  \catcode \hashasciicode       \parametercatcode
  \catcode \commentasciicode    \commentcatcode
  \catcode \tildeasciicode      \activecatcode
  \catcode \barasciicode        \activecatcode
\stopcatcodetable
\stoptyping

Because the meaning of active characters can differ per table there is a related
mechanism for switching those meanings. A careful reader might notice that the
formfeed character is just a newline. If present at all, it often sits on its own
line, so effectively it then behaves as in plain \TEX: triggering a new
paragraph. Otherwise it becomes just a space in the running text.

In addition to the active tilde we also have an active bar. This is actually one
of the oldest features: we use bars for signaling special breakpoints, something
that is really needed in Dutch (education), where we have many compound words.
Just to show a few applications:

\starttyping
firstpart||secondpart  this|(|orthat)  one|+|two|+|three
\stoptyping

In \MKIV\ we have another way of dealing with this. There you can enable a
special parser that deals with it at another level, the node list.

\starttyping
\setbreakpoints[compound]
\stoptyping

When \TEX ies discuss catcodes some can get quite upset, probably because they
have spent some time fighting their side effects. Personally I like the concept.
They can be a pain to deal with but also can be fun. For instance, support of
\XML\ in \CONTEXT\ \MKII\ was made possible by using active \type {<} and
\type {&}.

When dealing with all kinds of input the fact that characters have special
meanings can get in the way. One can argue that once a few have a special
meaning, it does not matter that some others have one too. Most complaints from
users concern \type {$}, \type {&} and \type {_}. When for symmetry we add
\type {^} it is clear that these characters relate to math.

Getting away from the \type {$} can only happen when users are willing to use for
instance \type {\m{x}} instead of \type {$x$}. The \type {&} is an easy one
because in \CONTEXT\ we have always discouraged its use in tables and math
alignments. Using (short) commands is a bit more keying but also provides more
control. That leaves the \type {_} and \type {^} and there is a nice solution for
this: the special math tagging discussed in the previous section.

For quite a while \CONTEXT\ has provided two commands that make it possible to
use \type {&}, \type {_} and \type {^} as characters with only a special meaning
inside math mode. The command

\starttyping
\nonknuthmode
\stoptyping

turns on this feature. The counterpart of this command is

\starttyping
\donknuthmode
\stoptyping
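
A minimal sketch of how the two commands might be used, assuming that they behave
as described above (the sample text is made up):

\starttyping
\nonknuthmode
A name like some_file^2 is typeset as it is typed, while $x_1^2$ still
has a subscript and a superscript.
\donknuthmode
\stoptyping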

The following command goes one step further:

\starttyping
\asciimode
\stoptyping

This leaves only the backslash and curly braces with a special meaning.

\starttyping
\startcatcodetable \txtcatcodes
  \catcode \tabasciicode       \spacecatcode
  \catcode \endoflineasciicode \endoflinecatcode
  \catcode \formfeedasciicode  \endoflinecatcode
  \catcode \spaceasciicode     \spacecatcode
  \catcode \endoffileasciicode \ignorecatcode
  \catcode \backslashasciicode \escapecatcode
  \catcode \leftbraceasciicode \begingroupcatcode
  \catcode \rightbraceasciicode\endgroupcatcode
\stopcatcodetable
\stoptyping

So, even the percent character being a comment starter is no longer there. At
this time it's still being discussed where we draw the line. For instance, using
the following setup puts \TEX\ out of action, and we happily use it deep
down in \CONTEXT\ to deal with verbatim.

\starttyping
\startcatcodetable \vrbcatcodes
  \catcode \tabasciicode       \othercatcode
  \catcode \endoflineasciicode \othercatcode
  \catcode \formfeedasciicode  \othercatcode
  \catcode \spaceasciicode     \othercatcode
  \catcode \endoffileasciicode \othercatcode
\stopcatcodetable
\stoptyping

\stopsection

\startsection[title={Where are we heading?}]

When defining macros, in \CONTEXT\ we not only use the \type {@} to provide some
protection against overloading, but also the \type {?} and \type {!}. There is of
course some freedom in how to use them but there are a few rules, like:

\starttyping
\c!width         % interface neutral key
\v!yes           % interface neutral value
\s!default       % system constant
\e!start         % interface specific command name snippet
\!!depth         % depth as keyword to primitive
\!!stringa       % scratch macro
\??ab            % namespace
\@@abwidth       % namespace-key combination
\stoptyping

There are some more but this demonstrates the principle. When defining macros
that use these, you need to push and pop the current catcode regime:

\starttyping
\pushcatcodes
\catcodetable \prtcatcodes
....
\popcatcodes
\stoptyping

or, more conveniently:

\starttyping
\unprotect
....
\protect
\stoptyping
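
As a sketch of what this buys us: the two macro names below are made up, and the
example only relies on the exclamation mark being a letter between
\type {\unprotect} and \type {\protect}.

\starttyping
\unprotect
\def\my!!text{some text}     % hypothetical scratch macro: ! is a letter here
\def\MyCommand{\my!!text}    % hypothetical user command built on top of it
\protect
\MyCommand                   % typesets: some text
\stoptyping

After \type {\protect} the name \type {\my!!text} can no longer be typed in, but
the reference stored in \type {\MyCommand} keeps working because it was tokenized
while the exclamation mark was a letter.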

Recently we introduced named parameters in \CONTEXT\ and files that are coded
that way are tagged as \MKVI. Because we nowadays are less concerned about
performance, some of the commands that define the user interface have been
rewritten. At the cost of a bit more runtime we move towards a somewhat cleaner
inheritance model that uses less memory. As a side effect module writers can
define the interface to functionality with a few commands; think of defining
instances with inheritance, setting up instances, accessing parameters etc. It
sounds more impressive than it is in practice but the reason for mentioning it
here is that this opportunity is also used to provide module writers an
additional protected character: \type {_}.

\starttyping
\def\do_this_or_that#variable#index%
  {$#variable_{#index}$}

\def\thisorthat#variable#index%
  {(\do_this_or_that{#variable}{#index})}
\stoptyping

Of course in the user macros we don't use the \type {_} if only because we want
that character to show up as it is meant.

\starttyping
\startcatcodetable \prtcatcodes
  \catcode \tabasciicode        \spacecatcode
  \catcode \endoflineasciicode  \endoflinecatcode
  \catcode \formfeedasciicode   \endoflinecatcode
  \catcode \spaceasciicode      \spacecatcode
  \catcode \endoffileasciicode  \ignorecatcode
  \catcode \circumflexasciicode \superscriptcatcode
  \catcode \underscoreasciicode \lettercatcode
  \catcode \ampersandasciicode  \alignmentcatcode
  \catcode \backslashasciicode  \escapecatcode
  \catcode \leftbraceasciicode  \begingroupcatcode
  \catcode \rightbraceasciicode \endgroupcatcode
  \catcode \dollarasciicode     \mathshiftcatcode
  \catcode \hashasciicode       \parametercatcode
  \catcode \commentasciicode    \commentcatcode
  \catcode `\@                  \lettercatcode
  \catcode `\!                  \lettercatcode
  \catcode `\?                  \lettercatcode
  \catcode \tildeasciicode      \activecatcode
  \catcode \barasciicode        \activecatcode
\stopcatcodetable
\stoptyping

This table is currently used when defining core macros and modules. A rather
special case is the circumflex. It still has a superscript related catcode, and
this is only because the circumflex has an additional special meaning.

Instead of the symbolic names in the previous blob of code we could have
indicated character numbers as follows:

\starttyping
\catcode `\^^I \spacecatcode
\stoptyping

However, if at some point we decide to treat the circumflex similarly to the
underscore, i.e.\ give it a letter catcode, then we should not use this double
circumflex method. In fact, the code base does not do that any longer, so we can
decide on that at any moment. If for some reason the double circumflex method is
needed, for instance when defining macros like \type {\obeylines}, one can do
this:

\starttyping
\bgroup
  \permitcircumflexescape
  \catcode \endoflineasciicode \activecatcode
  \gdef\obeylines%
    {\catcode\endoflineasciicode\activecatcode%
     \def^^M{\par}}
\egroup
\stoptyping

However, in the case of a newline one can also do this:

\starttyping
\bgroup
  \catcode \endoflineasciicode \activecatcode
  \gdef\obeylines%
    {\catcode\endoflineasciicode\activecatcode%
     \def
       {\par}}
\egroup
\stoptyping

Or just:

\starttyping
\def\obeylines{\defineactivecharacter 13 {\par}}
\stoptyping

In \CONTEXT\ we have the following variant, which is faster
than the previous one.

\starttyping
\def\obeylines
  {\catcode\endoflineasciicode\activecatcode
   \expandafter\def\activeendoflinecode{\obeyedline}}
\stoptyping

So no circumflexes are used at all. Also, we only need to change the
meaning of \type {\obeyedline} to give this macro another effect.

All this means that, now that we are upgrading catcode tables, we also consider
making \type {\nonknuthmode} the default, i.e.\ move the initialization to the
catcode vectors. Interesting is that we could have done that long ago, as the
mentioned \type {"8000} trickery has proven to be quite robust. In fact, in math
mode we're still pretty much in knuth mode anyway.

There is one pitfall. Take this:

\starttyping
\def\test{$\something_2$} % \something_
\def\test{$\something_x$} % \something_x
\stoptyping

When we are in unprotected mode, the underscore is part of the macro name, and
will not trigger a subscript. The solution is simple:

\starttyping
\def\test{$\something _2$}
\def\test{$\something _x$}
\stoptyping

In the rather large \CONTEXT\ code base there were only a few spots where we had
to add a space. When moving on to \MKIV\ we have the freedom to introduce such
changes, although we don't want to break compatibility too much and only for the
good. We expect this all to settle down in 2011. No matter what we decide upon,
some characters will always have a special meaning. So in fact we always stay in
some sort of donknuthmode, which is what \TEX\ is all about.

\stopsection

\stopchapter

\stopcomponent

% ligatures
631