% language=us runpath=texruns:manuals/ontarget

\startcomponent ontarget-active

\environment ontarget-style

\usemodule[system-tokens]

\startchapter[title={Active characters}]

Each character in \TEX\ has a so-called category code. Most are of category
\quote {letter} or \quote {other character} but some have a special meaning, like
\quote {superscript} or \quote {subscript} or \quote {math shift}. Of course the
backslash is special too and it has the \quote {escape} category.

A single character can also be a command in which case it has category \quote
{active}. In \CONTEXT\ the \type {|} is an example of that. It grabs an argument
delimited by yet another such (active) bar and handles that argument as a
compound character.
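
As a small illustration of how the active bar is used in running text (a sketch:
the sample words are made up, and the exact rendering depends on the compound
setup in \CONTEXT):

\starttyping
input|-|output      % the argument - asks for a compound hyphen
class|/|character   % the argument / asks for a compound slash
\stoptyping

In both cases the first bar is the active character that picks up everything up
to the second bar.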

From the perspective of \CONTEXT\ we have a couple of challenges with respect to
active characters.

\startitemize
\startitem
    We want to limit the number of special symbols so we only really have to deal with the
    active bar and tilde. Both have a history starting with \MKII.
\stopitem
\startitem
    There are cases where we don't want them to be active, most noticeably in
    math and verbatim. This means that we have to make sure that they are not
    active there but that in nested exceptions, for instance when we flush a
    page halfway verbatim, they are made active again.
\stopitem
\startitem
    In text we always had catcode regimes to deal with this (which is actually
    why in \LUATEX\ efficient catcode tables were one of the first native
    features to implement). This involves some namespace management.
\stopitem
\startitem
    In math we have to fall back on a different meaning which adds another
    (meaning) axis alongside catcode regimes: in math we use the same catcode
    regime as in text so we have a mode dependent meaning on top of the catcode
    regime specific one.
\stopitem
\startitem
    In math we have this special active class|/|character definition value \type
    {"8000} that makes characters active in math only. We use(d) that for permitting
    regular hat and underscore characters in text mode but let them act as
    superscript and subscript triggers in math mode.
\stopitem
\startitem
    Active characters travel in a special way through the system: they are
    actually stored as macro calls in token lists and macro bodies. This normally
    goes unnoticed (and is not that different from other catcodes being frozen in
    macros).
\stopitem
\stopitemize
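
The traditional \type {"8000} approach mentioned above can be sketched as
follows. This is a minimal illustration in plain \TEX\ terms, not the actual
\CONTEXT\ setup; it assumes that \type {\active} and \type {\sp} (an alias for
the superscript character) are available, as they are in plain \TEX:

\starttyping
\catcode `\^=12          % in text a hat is just an (other) character
\mathcode`\^="8000       % but in math it behaves as if it were active
\begingroup
  \catcode`\^=\active    % temporarily active so that we can define it
  \gdef^{\sp}            % the meaning used in math: trigger a superscript
\endgroup
\stoptyping

With such a setup a hat should typeset as a character in text, while in a
formula it starts a superscript.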

So far we could always comfortably implement whatever we wanted but sometimes the
code was not that pretty. Because part of the \LUAMETATEX\ project is to make
code cleaner, I started wondering if we could come up with a better mechanism for
dealing with active characters especially in math. Among the other reasons were:
less tracing clutter, a bit more natural approach, and fewer intercepts for
special cases. Of course we have to be compatible. Some first experiments were
promising but as usual it took a while to identify all the cases we have to deal
with. At moments I wondered if I should go forward but as I stepwise adapted the
\CONTEXT\ code to the experiment there was no way back. I did however reject
experiments that put active characters in the catcode table namespaces.

In \LUATEX\ (and its predecessors) internally active characters are stored as a
reference to a control sequence, although a \type {\show} or trace will report
the character as \quote {name}. For example:

\startbuffer
\catcode `!=\activecatcode
\def !{whatever} % we also have \letcharcode
\def\foo{x!x}
\stopbuffer

\typebuffer

is stored as (cs, cmd, chr):

\start
\getbuffer
\luatokentable\foo
\stop

However, when we want some more hybrid approach, a text versus math mix, we need
to postpone resolving into a control sequence. Examples are macro bodies and
token registers. When we flag a character (with \type {\amcode}) as being of a
different catcode than active in math mode, we get the following:

\startbuffer
\amcode`! \othercatcode
\catcode `!=\activecatcode
\def !{whatever}
\def\foo{x!x}
\stopbuffer
\typebuffer

\start
\getbuffer
\luatokentable\foo
\stop

The difference is that here we get the active character in the body of the macro.
Interestingly, this is not something that the parser is prepared for, so the main
loop now has to catch active characters. This is no big deal but also not
something to neglect. The same is true for serialization of tokens.

Another situation where we need to be clever is when we try to enter math mode.
In math mode we want the (in text) active character as math character and a
convenient test is checking the mode. However, when we see \type {$} we are not
yet in math mode, and when \TEX\ looks for a potential next \type {$} and grabs
an active character instead, that character should not resolve into a reference
to a command. The reason for that is that when \TEX\ pushes back the token
(because it doesn't see a \type {$}) we need it to be an active character and
not a control sequence. If it were a control sequence we would see it as such in
math mode which is not what we intended. It is one of these cases where \TEX\
doesn't roundtrip. Similar cases
occur when \TEX\ looks ahead for (what makes a) number and doesn't see one which
then results in a push back. Actually, there are many look ahead and push back
moments in the source.

\startbuffer
text: \def\foo{x|!|x}

\meaningasis\foo

\luatokentable\foo

$x\foo x$ \foo
\stopbuffer

\typebuffer \start\getbuffer\stop

\startbuffer
math: $\gdef\oof{x|!|x}$

\meaningasis\oof

\luatokentable\oof

$x\oof x$ \oof
\stopbuffer

\typebuffer \start\getbuffer\stop

\startbuffer
toks: \scratchtoks{x|!|x}

\detokenize\expandafter{\the\scratchtoks}

\luatokentable\scratchtoks

$x\the\scratchtoks x$ \the\scratchtoks
\stopbuffer

\typebuffer \start\getbuffer\stop

A good test case for \CONTEXT\ is:

\startbuffer
\def\foo{x|!|x||x}

 x|!|x||x + \foo
$x|!|x||x + \foo$
\stopbuffer

\typebuffer

Here we expect bars in math mode but the compound mechanism applied in text mode:

\startlines\getbuffer\stoplines

So the bottom line is this:

\startitemize
\startitem
    Active characters should behave as expected, which means that they get
    replaced by references to commands.
\stopitem
\startitem
    When the \type {\amcode} is set, this signals the engine to delay that
    replacement and retain the active character.
\stopitem
\startitem
    When the moment is there the engine either expands it as command (text mode)
    or injects the alternative meaning based on the catcode. There we support
    letters, other characters, super- and subscripts and alignment codes. The
    rest we simply ignore (for now).
\stopitem
\stopitemize

Of course you can abuse this mechanism and also retain the character's active
property in text mode by simply setting the \type {\amcode}. We'll see how that
works out. Actually this mechanism was provided in the first place to get around
the \type {"8000} limitations! So here is another cheat:

\starttyping
\catcode `^ \othercatcode       % so a ^ is just that
\amcode  `^ \superscriptcatcode % but a ^ in math signals a superscript
\stoptyping

So, the \type {a} in \type {\amcode} stands for both \quote {active} and \quote
{alternative}. As mentioned, because we distinguish between math and text mode we
no longer need to adapt the meaning of active commands: think of using \type
{\mathtext} in a formula where we leave math mode and then need to use the text
meaning of the bar, just as outside the formula.
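
A hypothetical formula that exercises this, assuming the usual active bar setup
(the content is made up for illustration):

\starttyping
$ |x| + \mathtext{input|-|output} $
\stoptyping

Here the bars around \type {x} are seen in math mode and should come out as
plain bars, while the bars inside \type {\mathtext} are seen in text mode and
should trigger the compound mechanism, just as they would outside the formula.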

In the end, because we only have a few active characters and no user ever
demanded name spaces, that mechanism was declared obsolete. There is no need to
keep code around that is not really used any more.

% Although this mechanism works okay, there is a pitfall. When you define a macro, and
% \type {\amcode} is set, the active character is stored as such. That means that doing
% something like this is likely to fail:
%
% \starttyping
% \def\whatever{\let~\space}
% \stoptyping
%
% when the tilde is active as well as has a \type {\amcode} set. However,
%
% \starttyping
% \def\whatever{\letcharcode\tildeasciicode\space}
% \stoptyping
%
% will work just fine.

Internally an active character is stored in the hash that also stores regular
control sequences. The character becomes an \UTF\ string prefixed by the \UTF\
value of \type {0xFFFF} which doesn't exist in \UNICODE. The \type {\csactive}
primitive is a variant on \type {\csstring} that returns this prefixed name. Its companion
\type {\expandactive} (a variant on \type {\expand}) can be used to inject the
related control sequence. If \type {\csactive} is not followed by an active
character it expands to just the prefix, as does \type {\Uchar"FFFF} but a bit of
abstraction makes sense.
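
A small sketch of how these two primitives could be tried out, assuming the
tilde is active as usual in \CONTEXT. The macro name \type {\foo} is only used
for illustration and no output is shown here because the prefix cannot be
displayed in a meaningful way:

\starttyping
\edef\foo{\csactive~}     \meaningasis\foo % the prefix followed by the tilde
\edef\foo{\expandactive~} \meaningasis\foo % the control sequence behind the tilde
\stoptyping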

% control sequence: xxxx
% 271731  13  126  active char
% control sequence: xxxx
% 271732  135    0  protected call  ~
% control sequence: xxxx
% 271734   12  65535  other char      ￿ (U+0FFFF)
% 408124  135      0  protected call  ~

\stopchapter

\stopcomponent