% language=us

\startcomponent about-calls

\environment about-environment

\startchapter[title={Calling Lua}]

\startsection[title=Introduction]

One evening, on Skype, Luigi and I were pondering about the somewhat
disappointing impact of jit in \LUAJITTEX\ and one of the reasons we could come
up with is that when you invoke \LUA\ from inside \TEX\ each \type {\directlua}
gets an extensive treatment. Take the following:

\starttyping
\def\SomeValue#1%
  {\directlua{tex.print(math.sin(#1)/math.cos(2*#1))}}
\stoptyping

Each time \type {\SomeValue} is expanded, the \TEX\ parser will do the following:

\startitemize[packed]
\startitem
    It sees \type {\directlua} and will jump to the related scanner.
\stopitem
\startitem
    There it will see a \type +{+ and enter a special mode in which it starts
    collecting tokens.
\stopitem
\startitem
    In the process, it will expand control sequences that are expandable.
\stopitem
\startitem
    The scanning ends when a matching \type +}+ is seen.
\stopitem
\startitem
    The collected tokens are converted into a regular (C) string.
\stopitem
\startitem
    This string is passed to the \type {lua_load} function that compiles it into
    bytecode.
\stopitem
\startitem
    The bytecode is executed and characters that are printed to \TEX\ are
    injected into the input buffer.
\stopitem
\stopitemize

In the process, some state information is set and reset and errors are dealt
with. Although it looks like a lot of actions, this all happens very fast, so
fast actually that for regular usage you don't need to bother about it.

There are however applications where you might want to see a performance boost,
for instance when you're crunching numbers that end up in tables or graphics
while processing the document. Again, this is not that typical for jobs, but with
the availability of \LUA\ more of that kind of usage will show up. And, as we now
also have \LUAJITTEX\ its jitting capabilities could be an advantage.

Back to the example: there are two calls to functions there and apart from the
fact that they need to be resolved in the \type {math} table, they also are
executed C functions. As \LUAJIT\ optimizes known functions like this, there can
be a potential speed gain but as \type {\directlua} is parsed and loaded each
time, the jit machinery will not do that, unless the same code gets exercised
lots of time. In fact, the jit related overhead would be a waste in this one time
usage.

In the next sections we will show two variants that follow a different approach
and as a consequence can speed up a bit. But, be warned: the impact is not as
large as you might expect, and as the code might look less intuitive, the good
old \type {\directlua} command is still the advised method.

Before we move on it's important to realize that a \type {\directlua} call is
in fact a function call. Say that we have this:

\starttyping
\def\SomeValue{1.23}
\stoptyping

This becomes:

\starttyping
\directlua{tex.print(math.sin(1.23)/math.cos(2*1.23))}
\stoptyping

Which in \LUA\ is wrapped up as:

\starttyping
function()
    tex.print(math.sin(1.23)/math.cos(2*1.23))
end
\stoptyping

that gets executed. So, the code is always wrapped in a function. Being a
function it is also a closure and therefore local variables are local to this
function and are invisible at the outer level.

\stopsection

\startsection[title=Indirect \LUA]

The first variant is tagged as indirect \LUA. With indirect we mean that instead
of directly parsing, compiling and executing the code, it is done in steps. This
method is not as generic a the one discussed in the next section, but for cases
where relatively constant calls are used it is fine. Consider the next call:

\starttyping
\def\NextValue
  {\indirectlua{myfunctions.nextvalue()}}
\stoptyping

This macro does not pass values and always looks the same. Of course there can be
much more code, for instance the following is equally valid:

\starttyping
\def\MoreValues {\indirectlua{
    for i=1,100 do
        myfunctions.nextvalue(i)
    end
}}
\stoptyping

Again, there is no variable information passed from \TEX. Even the next variant
is relative constant:

\starttyping
\def\SomeValues#1{\indirectlua{
    for i=1,#1 do
        myfunctions.nextvalue(i)
    end
}}
\stoptyping

especially when this macro is called many times with the same value. So how does
\type {\indirectlua} work? Well, it's behaviour is in fact undefined! It does,
like \type {\directlua}, parse the argument and makes the string, but instead of
calling \LUA\ directly, it will pass the string to a \LUA\ function \type
{lua_call}.

\starttyping
lua.call = function(s) load(s)() end
\stoptyping

The previous definition is quite okay and in fact makes \type {\indirectlua}
behave like \type {\directlua}. This definition makes

% \ctxlua{lua.savedcall = lua.call lua.call = function(s) load(s)() end}
% \testfeatureonce{10000}{\directlua  {math.sin(1.23)}}
% \testfeatureonce{10000}{\indirectlua{math.sin(1.23)}}
% \ctxlua{lua.call = lua.savedcall}

\starttyping
\directlua  {tex.print(math.sin(1.23))}
\indirectlua{tex.print(math.sin(1.23))}
\stoptyping

equivalent calls but the second one is slightly slower, which is to be expected
due to the wrapping and indirect loading. But look at this:

\starttyping
local indirectcalls = { }

function lua.call(code)
    local fun = indirectcalls[code]
    if not fun then
        fun = load(code)
        if type(fun) ~= "function" then
            fun = function() end
        end
        indirectcalls[code] = fun
    end
    fun()
end
\stoptyping

This time the code needs about one third of the runtime. How much we gain depends
on the size of the code and its complexity, but on the average its's much faster.
Of course, during a \TEX\ job only a small part of the time is spent on this, so
the overall impact is much smaller, but it makes runtime number crunching more
feasible.

If we bring jit into the picture, the situation becomes somewhat more diffuse.
When we use \LUAJITTEX\ the whole job processed faster, also this part, but
because loading and interpreting is more optimized the impact might be less. If
you enable jit, in most cases a run is slower than normal. But as soon as you
have millions of calls to e.g.\ type {math.sin} it might make a difference.

This variant of calling \LUA\ is quite intuitive and also permits us to implement
specific solutions because the \type {lua.call} function can be defined as you
with. Of course macro package writers can decide to use this feature too, so you
need to beware of unpleasant side effects if you redefine this function.

% \testfeatureonce{100000}{\directlua  {math.sin(1.23)}}
% \testfeatureonce{100000}{\indirectlua{math.sin(1.23)}}

\stopsection

\startsection[title=Calling \LUA]

In the process we did some tests with indirect calls in \CONTEXT\ core code and
indeed some gain in speed could be noticed. However, many calls get variable
input and therefore don't qualify. Also, as a mixture of \type {\directlua} and
\type {\indirectlua} calls in the source can be confusing it only makes sense to
use this feature in real time|-|critical cases, because even in moderately
complex documents there are not that many calls anyway.

The next method uses a slightly different approach. Here we stay at the \TEX\
end, parse some basic type arguments, push them on the \LUA\ stack, and call a
predefined function. The amount of parsing \TEX\ code is not less, but especially
when we pass numbers stored in registers, no tokenization (serialization of a
number value into the input stream) and stringification (converting the tokens
back to a \LUA\ number) takes place.

\starttyping
\indirectluacall 123
    {some string}
    \scratchcounter
    {another string}
    true
    \dimexpr 10pt\relax
\relax
\stoptyping

Actually, an extension like this had been on the agenda for a while, but never
really got much priority. The first number is a reference to a function to be
called.

\starttyping
lua.calls = lua.calls or { }
lua.calls[123] = function(s1,n1,s2,b,n2)
    -- do something with
    --
    -- string  s1
    -- number  n1
    -- string  s2
    -- boolean b
    -- number  n2
end
\stoptyping

The first number to \type {indirectluacall} is mandate. It can best also be a
number that has a function associated in the \type {lua.calls} table. Following
that number and before the also mandate \type {\relax}, there can be any number
of arguments: strings, numbers and booleans.

Anything surrounded by \type {{}} becomes a string. The keywords \type {true} and
\type {false} become boolean values. Spaces are skipped and everything else is
assumed to be a number. This means that if you omit the final \type {\relax}, you
get a error message mentioning a \quote {missing number}. The normal number
parser applies, so when a dimension register is passed, it is turned into a
number. The example shows that wrapping a more verbose dimension into a \type
{\dimexpr} also works.

Performance wise, each string goes from list of tokens to temporary C string to
\LUA\ string, so that adds some overhead. A number is more efficient, especially
when you pass it using a register. The booleans are simple sequences of character
tokens so they are relatively efficient too. Because \LUA\ functions accept an
arbitrary number of arguments, you can provide as many as you like, or even less
than the function expects: it is all driven by the final \type {\relax}.

An important characteristic of this kind of call is that there is no \type {load}
involved, which means that the functions in \type {lua.calls} can be subjected to
jitting.

\stopsection

\startsection[title=Name spaces]

As with \type {\indirectlua} there is a potential clash when users mess with the
\type {lua.calls} table without taking the macro package usage into account. It not
that complex to define a variant that provides namespaces:

\starttyping
\newcount\indirectmain \indirectmain=1
\newcount\indirectuser \indirectuser=2

\indirectluacall \indirectmain
    {function 1}
    {some string}
\relax

\indirectluacall \indirectuser
    {function 1}
    {some string}
\relax
\stoptyping

A matching implementation is this:

\starttyping
lua.calls = lua.calls or { }

local main = { }

lua.calls[1] = function(name,...)
    main[name](...)
end

main["function 1"] = function(a,b,c)
    -- do something with a,b,c
end

local user = { }

lua.calls[2] = function(name,...)
    user[name](...)
end

user["function 1"] = function(a,b,c)
    -- do something with a,b,c
end
\stoptyping

Of course this is also ok:

\starttyping
\indirectluacall \indirectmain 1
    {some string}
\relax

\indirectluacall \indirectuser 1
    {some string}
\relax
\stoptyping

with:

\starttyping
main[1] = function(a,b,c)
    -- do something with a,b,c
end

user[1] = function(a,b,c)
    -- do something with a,b,c
end
\stoptyping

Normally a macro package, if it wants to expose this mechanism, will provide a
more abstract interface that hides the implementation details. In that case the
user is not supposed to touch \type {lua.calls} but this is not much different
from the limitations in redefining primitives, so users can learn to live with
this.

\stopsection

\startsection[title=Practice]

There are some limitations. For instance in \CONTEXT\ we often pass tables and
this is not implemented. Providing a special interface for that is possible but
does not really help. Often the data passed that way is far from constant, so it
can as well be parsed by \LUA\ itself, which is quite efficient. We did some
experiments with the more simple calls and the outcome is somewhat disputable. If
we replace some of the \quote {critital} calls we can gain some 3\% on a run of
for instance the \type {fonts-mkiv.pdf} manual and a bit more on the command
reference \type {cont-en.pdf}. The first manual uses lots of position tracking
(an unfortunate side effect of using a specific feature that triggers continuous
tracking) and low level font switches and many of these can benefit from the
indirect call variant. The command reference manual uses \XML\ processing and
that involves many calls to the \XML\ mapper and also does quite some string
manipulations so again there is something to gain there.

The following numbers are just an indication, as only a subset of \type
{\directlua} calls has been replaced. The 166 page font manual processes in about
9~seconds which is not bad given its complexity. The timings are on a Dell
Precision M6700 with Core i7 3840QM, 16 GB memory, a fast SSD and 64 bit Windows
8. The binaries were cross compiled mingw 32 bit by Luigi. \footnote {While
testing with several function definitions we noticed that \type {math.random} in
our binaries made jit twice as slow as normal, while for instance \type
{math.sin} was 100 times faster. As the font manual uses the random function for
rendering random punk examples it might have some negative impact. Our experience
is that binaries compiled with the ms compiler are somewhat faster but as long as
the engines that we test are compiled similarly the numbers can be compared.}

% old: 8.870 8.907 9.089        / jit: 6.948 6.966 7.009         / jiton: 7.449 7.586 7.609
% new: 8.710 8.764 8.682 | 8.64 / jit: 6.935 6.969 6.967  | 6.82 / jiton: 7.412 7.223 7.481
%
% 3% on total, 6% on lua

\starttabulate[|lT|cT|cT|cT|]
\HL
\NC          \NC \LUATEX \NC \LUAJITTEX \NC \LUAJITTEX\ + jit \NC \NR
\HL
\NC direct   \NC 8.90     \NC 6.95      \NC 7.50              \NC \NR
\NC indirect \NC 8.65     \NC 6.80      \NC 7.30              \NC \NR
\HL
\stoptabulate

So, we can gain some 3\% on such a document and given that we spend probably half
the time in \LUA, this means that these new features can make \LUA\ run more than
5\% faster which is not that bad for a couple of lines of extra code. For regular
documents we can forget about jit which confirms earlier experiments. The
commands reference has these timings:

\starttabulate[|lT|cT|cT|cT|]
\HL
\NC          \NC \LUATEX \NC \LUAJITTEX \NC \NR
\HL
\NC direct   \NC 2.55    \NC 1.90       \NC \NR
\NC indirect \NC 2.40    \NC 1.80       \NC \NR
\HL
\stoptabulate

Here the differences are larger which is due to the fact that we can indirect
most of the calls used in this processing. The document is rather simple but as
mentioned is encoded in \XML\ and the \TEX||\XML\ interface qualifies for this
kind of speedups.

As Luigi is still trying to figure out why jitting doesn't work out so well, we
also did some tests with (in itself useless) calculations. After all we need
proof. The first test was a loop with 100.000 step doing a regular \type
{\directlua}:

\starttyping
\directlua {
    local t = { }
    for i=1,10000
        do t[i] = math.sin(i/10000)
    end
}
\stoptyping

The second test is a bit optimized. When we use jit this kind of optimizations
happens automatically for known (!) functions so there is not much won.

\starttyping
\directlua {
    local sin = math.sin
    local t = { }
    for i=1,10000
        do t[i] = sin(i/10000)
    end
}
\stoptyping

We also tested this with \type {\indirectlua} and therefore defined some
functions to test the call variant:

\starttyping
lua.calls[1] = function()
    -- overhead
end

lua.calls[2] = function()
    local t = { }
    for i=1,10000 do
        t[i] = math.sin(i/10000) -- naive
    end
end

lua.calls[3] = function()
    local sin = math.sin
    local t = { }
    for i=1,10000 do
        t[i] = sin(i/10000) -- normal
    end
end
\stoptyping

These are called with:

\starttyping
\indirectluacall0\relax
\indirectluacall1\relax
\indirectluacall2\relax
\stoptyping

The overhead variant demonstrated that there was hardly any: less than 0.1 second.

\starttabulate[|lT|lT|cT|cT|cT|]
\HL
\NC                 \NC        \NC \LUATEX \NC \LUAJITTEX \NC \LUAJITTEX\ + jit \NC \NR
\HL
\NC directlua       \NC normal \NC 167     \NC 64         \NC 46                \NC \NR
\NC                 \NC local  \NC 122     \NC 57         \NC 46                \NC \NR
\NC indirectlua     \NC normal \NC 166     \NC 63         \NC 45                \NC \NR
\NC                 \NC local  \NC 121     \NC 56         \NC 45                \NC \NR
\NC indirectluacall \NC normal \NC 165     \NC 66         \NC 48                \NC \NR
\NC                 \NC local  \NC 120     \NC 60         \NC 47                \NC \NR
\HL
\stoptabulate

The results are somewhat disappoint but not that unexpected. We do see a speedup
with \LUAJITTEX\ and in this case even jitting makes sense. However in a regular
typesetting run jitting will never catch up with the costs it carries for the
overall process. The indirect call is somewhat faster than the direct call.
Possible reasons are that hashing at the \LUA\ end also costs time and the
100.000 calls from \TEX\ to \LUA\ is not that big a burden. The indirect call is
therefore also not much faster because it has some additional parsing overhead at
the \TEX\ end. That one only speeds up when we pass arguments and even then not
always the same amount. It is therefore mostly a convenience feature.

We left one aspect out and that is garbage collection. It might be that in large
runs less loading has a positive impact on collecting garbage. We also need to
keep in mind that careful application can have some real impact. Take the
following example of \CONTEXT\ code:

\startntyping
\dorecurse {1000} {

    \startsection[title=section #1]

        \startitemize[n,columns]
            \startitem test \stopitem
            \startitem test \stopitem
            \startitem test \stopitem
            \startitem test \stopitem
        \stopitemize

        \starttabulate[|l|p|]
            \NC test \NC test \NC \NR
            \NC test \NC test \NC \NR
            \NC test \NC test \NC \NR
        \stoptabulate

        test {\setfontfeature{smallcaps} abc} test
        test {\setfontfeature{smallcaps} abc} test
        test {\setfontfeature{smallcaps} abc} test
        test {\setfontfeature{smallcaps} abc} test
        test {\setfontfeature{smallcaps} abc} test
        test {\setfontfeature{smallcaps} abc} test

        \framed[align={lohi,middle}]{test}

        \startembeddedxtable
            \startxrow \startxcell x \stopxcell \startxcell x \stopxcell \stopxrow
            \startxrow \startxcell x \stopxcell \startxcell x \stopxcell \stopxrow
            \startxrow \startxcell x \stopxcell \startxcell x \stopxcell \stopxrow
            \startxrow \startxcell x \stopxcell \startxcell x \stopxcell \stopxrow
            \startxrow \startxcell x \stopxcell \startxcell x \stopxcell \stopxrow
        \stopembeddedxtable

    \stopsection

    \page

}
\stopntyping

These macros happen to use mechanism that are candidates for indirectness.
However, it doesn't happen often you you process thousands of pages with mostly
tables and smallcaps (although tabular digits are a rather valid font feature in
tables). For instance, in web services squeezing out a few tens of seconds might
make sense if there is a large queue of documents.

\starttabulate[|lT|cT|cT|cT|]
\HL
\NC          \NC \LUATEX \NC \LUAJITTEX \NC \LUAJITTEX\ + jit \NC \NR
\HL
\NC direct   \NC 19.1    \NC 15.9       \NC 15.8              \NC \NR
\NC indirect \NC 18.0    \NC 15.2       \NC 15.0              \NC \NR
\HL
\stoptabulate

Surprisingly, even jitting helps a bit here. Maybe it relates the the number of
pages and the amount of calls but we didn't investigate this. By default jitting
is off anyway. The impact of indirectness is more than in previous examples.

For this test a file was loaded that redefines some core \CONTEXT\ code. This
also has some overhead which means that numbers for the indirect case will be
somewhat better if we decide to use these mechanisms in the core code. It is
tempting to do that but it involves some work and it's always the question if a
week of experimenting and coding will ever be compensated by less. After all, in
this last test, a speed of 50 pages per second is not that bad a performance.

When looking at these numbers, keep in mind that it is still not clear if we end
up using this functionality, and when \CONTEXT\ will use it, it might be in a way
that gives better or worse timings than mentioned above. For instance, storing \LUA\
code in the format is possible, but these implementations force us to serialize
the \type {lua.calls} mechanism and initialize them after format loading. For that
reason alone, a more native solution is better.

\stopsection

\startsection[title=Exploration]

In the early days of \LUATEX\ Taco and I discussed an approach similar do
registers which means that there is some \type {\...def} command available. The
biggest challenge there is to come up with a decent way to define the arguments.
On the one hand, using a hash syntax is natural to \TEX, but using names is more
natural to \LUA. So, when we picked up that thread, solutions like this came up
in a Skype session with Taco:

\starttyping
\luadef\myfunction#1#2{ tex.print(arg[1]+arg[2]) }
\stoptyping

The \LUA\ snippet becomes a function with this body:

\starttyping
local arg = { #1, #2 } -- can be preallocated and reused
-- the body as defined at the tex end
tex.print(arg[1]+arg[2])
\stoptyping

Where \type {arg} is set each time. As we wrapped it in a function we can
also put the arguments on the stack and use:

\starttyping
\luadef\myfunction#1#2{ tex.print((select(1,...))+(select(2,...)) }
\stoptyping

Given that we can make select work this way (either or not by additional
wrapping). Anyway, both these solutions are ugly and so we need to look further.
Also, the \type {arg} variant mandates building a table. So, a natural next
iteration is:

\starttyping
\luadef\myfunction a b { tex.print(a+b) }
\stoptyping

Here it becomes already more natural:

\starttyping
local a = #1
local b = #2
-- the body as defined at the tex end
tex.print(a+b)
\stoptyping

But, as we don't want to reload the body we need to push \type {#1} into the
closure. This is a more static definition equivalent:

\starttyping
local a = select(1,...)
local b = select(2,...)
tex.print(a+b)
\stoptyping

Keep in mind that we are not talking of some template that gets filled in and
loaded, but about precompiled functions! So, a \type {#1} is not really put there
but somehow pushed into the closure (we know the stack offsets).

Yet another issue is more direct alias. Say that we define a function at the
\LUA\ end and want to access it using this kind of interface.

\starttyping
function foo(a,b)
    tex.print(a+b)
end
\stoptyping

Given that we have something:

\starttyping
\luadef \myfunctiona a b { tex.print(a+b) }
\stoptyping

We can consider:

\starttyping
\luaref \myfunctionb 2 {foo}
\stoptyping

The explicit number is debatable as it can be interesting to permit
an arbitrary number of arguments here.

\starttyping
\myfunctiona{1}{2}
\myfunctionb{1}{2}
\stoptyping

So, if we go for:

\starttyping
\luaref \myfunctionb {foo}
\stoptyping

we can use \type {\relax} as terminator:

\starttyping
\myfunctiona{1}{2}
\myfunctionb{1}{2}\relax
\stoptyping

In fact, the call method discussed in a previous section can be used here as well
as it permits less arguments as well as mixed types. Think of this:

\starttyping
\luadef \myfunctiona a b c { tex.print(a or 0 + b or 0 + c or 0) }
\luaref \myfunctionb {foo}
\stoptyping

with

\starttyping
function foo(a,b,c)
    tex.print(a or 0 + b or 0 + c or 0)
end
\stoptyping

This could be all be valid:

\starttyping
\myfunctiona{1}{2}{3]\relax
\myfunctiona{1}\relax
\myfunctionb{1}{2}\relax
\stoptyping

or (as in practice we want numbers):

\starttyping
\myfunctiona 1 \scratchcounter 3\relax
\myfunctiona 1 \relax
\myfunctionb 1 2 \relax
\stoptyping

We basicaly get optional arguments for free, as long as we deal with it properly
at the \LUA\ end. The only condition with the \type {\luadef} case is that there
can be no more than the given number of arguments, because that's how the function
body gets initialized set up. In practice this is quite okay.

% After this exploration we can move on to the final implementation and see what we
% ended up with.

\stopsection

% \startsection[title=The final implementation]
%     {\em todo}
% \stopsection

\startsection[title=The follow up]

We don't know what eventually will happen with \LUATEX. We might even (at least
in \CONTEXT) stick to the current approach because there not much to gain in
terms of speed, convenience and (most of all) beauty.

{\em Note:} In \LUATEX\ 0.79 onward \type {\indirectlua} has been implemented as
\type {\luafunction} and the \type {lua.calls} table is available as \type
{lua.get_functions_table()}. A decent token parser has been discussed at the
\CONTEXT\ 2013 conference and will show up in due time. In addition, so called
\type {latelua} nodes support function assignments and \type {user} nodes support
a field for \LUA\ values. Additional information can be associated with any nodes
using the properties subsystem.

\stopsection

\stopchapter

\stopcomponent