% language=us runpath=texruns:manuals/workflows

\startcomponent workflows-parallel

\environment workflows-style

\startchapter[title={Parallel processing}]

\startsection[title={Introduction}]


This is just a small intermezzo. Mid April 2020 Mojca asked on the mailing list how
to best compile 5000 files, based on a template. The answer depends on the workflow
and circumstances but one can easily come up with some factors that play a role.

\startitemize
    \startitem
        How complex is the document? How many pages are generated, how many fonts
        get used? Do we need multiple runs per document? Are images involved and
        if so, what format are they in? When processing relatively small files we
        normally need seconds, not minutes.
    \stopitem
    \startitem
        What machine is used? How powerful is the \CPU, how many cores are
        available and how much memory do we have? Is the filesystem on a local
        \SSD\ or on a remote file system? How well does file caching work? Again,
        we're talking seconds here.
    \stopitem
    \startitem
        What engine is used? Assuming that \MKIV\ is used, we can choose between
        \LUATEX\ and \LUAMETATEX. The former has faster backend code, the latter a
        faster frontend. Which one is more efficient depends on the document. The
        latter has some advantages that we will not mention here.
    \stopitem
\stopitemize

The tests mentioned below are run with a simple \LUA\ script that manages the
parallel runs. More about that later. As sample document we use this:

\starttyping
\setupbodyfont[dejavu]

\starttext
    \dorecurse{\getdocumentargument{noffiles}}{\input tufte\par}
\stoptext
\stoptyping

We start with 100 runs of 10 inclusions. We permit 8 runs in parallel. A \LUATEX\
run of 100 takes 32 seconds, a \LUAJITTEX\ run uses 26 seconds, and \LUAMETATEX\
does it in 25 seconds. \footnote {I used a mingw cross compiled 64 bit binary;
the GCC9 version seems somewhat slower than the previous compiler version.} An
interesting observation is memory consumption: \LUAJITTEX, which has a different
virtual machine and a limited memory model, peaks at 0.8GB for the eight parallel
runs. The \LUAMETATEX\ engine has the same demands. However, \LUATEX\ needs
1.2GB. Bumping to 20 inclusions increased the runtime a few seconds for each
engine.

The differences can be explained by a faster startup time of \LUAMETATEX; for
instance we don't use a compressed format (dump), but there are some other
optimizations too, and even when they're close to unmeasurable, they might add
up. The \LUAJITTEX\ engine speeds up \LUA\ interpretation which is reflected in
runtime because \CONTEXT\ spends half its time in \LUA.

As a next test I decided to run the test file 5000 times: Mojca's scenario.
Including 10 sample files (per run) for those 5000 files took 1320 seconds. When
we cache the included file we gain some 5~percent.

Does it matter how many jobs we run in parallel? The 2013 laptop I used for
testing has four real cores that hyperthread to eight cores. \footnote {The
machine has an Intel i7-3840QM \CPU, 16GB of memory and a 512 GB Samsung Pro
\SSD.} For 1000 files (10 inclusions each) we need 320 seconds when we use four
cores. With six cores we need 270 seconds, which is much better. With eight cores
we go down to 260 seconds, and with ten runs, which is two more than there are
cores, we get about the same runtime. \footnote {On a more modern system, let
alone a desktop computer, I expect these numbers to be much lower.} A \TEX\
program is a single core process and it makes no sense to start more parallel
runs than the \CPU\ has cores.

\starttyping
\setupbodyfont[dejavu]

\starttext
    \dorecurse{\getdocumentargument{noffiles}}{\samplefile{tufte}\par}
\stoptext
\stoptyping

Again, caching the input file as above saves a little bit: 10 seconds, so we get
250 seconds. When you run these tests on the machine that you normally work on,
waiting for that many jobs to finish is no fun, so what if we (as I then normally
do) watch some music video? With a full screen high resolution video shown in the
foreground the runtime didn't change: still 250 seconds for 1000 jobs with eight
parallel runs. On the other hand, a test with Firefox, which is quite demanding,
running a video in the background, made the runtime go up by 30 seconds to 280.
So, when another program is doing some networking, decompression, all kinds of
unknown tracking using \JAVASCRIPT, etc.\ and therefore makes its own demands on
cores and memory, you might want to limit the number of parallel runs. These
tests are probably not that meaningful but a good distraction when in lockdown.
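
The effect of limiting the number of parallel runs is easy to try out. What
follows is only a sketch in Python and not the \LUA\ script that was used for the
tests: the function \type {run_all} and its \type {runners} parameter are names
made up for this illustration, and falling back to the number of logical cores
is an assumption.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, runners=None):
    """Run every command, at most `runners` at a time; return exit codes in order."""
    if runners is None:
        # Assumption: one run per logical core is a sensible upper bound.
        runners = os.cpu_count() or 1

    def run_one(command):
        # Each job is its own single core process; the thread only waits for it.
        finished = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return finished.returncode

    with ThreadPoolExecutor(max_workers=runners) as pool:
        # map keeps the results in the order of the given commands
        return list(pool.map(run_one, commands))
```

Each command is a list of words, for instance \type {context} followed by a
filename. Passing a lower value for \type {runners} leaves room for whatever
else is running on the machine.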

I'm still not sure if I should come up with a script for managing these parallel
runs. But one thing I have added to the \type {context} runner is the (for now
undocumented) option

\starttyping
--wipebusy
\stoptyping

which, after a run, removes the file

\starttyping
context-is-busy.tmp
\stoptyping

This permits a management script to check if a run is done. Before starting a run
(in a separate process) the script can write that file and by just checking if it
is still there, the management script can decide when a next run can be started.
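
Such a management loop could look as follows. This is a minimal sketch in
Python, not an existing script: the names \type {mark_busy}, \type {is_busy} and
\type {manage} are hypothetical, as is the \type {start} callback that is
supposed to launch the run in a separate process. Because the busy file has a
fixed name, the sketch assumes that every job runs in its own directory.

```python
import os
import time

BUSY = "context-is-busy.tmp"  # the file that --wipebusy removes after a run


def mark_busy(path):
    """Create the busy file in `path` before a run is started there."""
    with open(os.path.join(path, BUSY), "w") as handle:
        handle.write("busy\n")


def is_busy(path):
    """A run in `path` counts as unfinished while the busy file is still there."""
    return os.path.exists(os.path.join(path, BUSY))


def manage(jobs, start, runners=8, delay=0.1):
    """Start each (path, name) job via `start`, keeping at most `runners` busy.

    Assumption: every job has its own path, so busy files never collide.
    """
    pending = list(jobs)
    active = []
    while pending or active:
        # Forget the paths whose busy file has been wiped.
        active = [path for path in active if is_busy(path)]
        while pending and len(active) < runners:
            path, name = pending.pop(0)
            mark_busy(path)
            start(path, name)  # hypothetical: launches the run and returns at once
            active.append(path)
        if active:
            time.sleep(delay)
```

A \type {start} function would typically spawn \type {context --wipebusy} with
the given filename in the given directory and return without waiting for it.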

\stopsection

\startsection[title={Solution}]

Mid 2023 the test suite had some 1900 files and whenever the engine or \CONTEXT\
is adapted that collection of files is processed. Although one can test for
regression the main reason for doing this is to check if any of these documents
fails, which can happen due to some typo in an upgrade. We also want to see if
there is a change in performance. However, a single run takes some 1350 seconds
(on my current development laptop) and when two runs are needed we end up at
2500 seconds. Of course a modern machine would likely bring that down to some 700
seconds but there is only so much one can spend on hardware.

Because \TEX\ is a single core process, the question is \quotation{Can we use
more than one core?} and the answer is \quotation {Of course we can}. Running the
test suite is managed by a (\LUA) script and one of its tasks is to traverse the
tree of files and register success or failure so that at the end we get an
overview. Parallelizing the process can be done in two ways. We can divide the
whole batch into several smaller batches that are processed at the same time.
Alternatively we can have one process that runs files in different threads (sort
of).

That second option is what we prefer. Its runtimes (in seconds) for a given
number of parallel runs are as follows, measured on a four core mobile Xeon 1505
(with hyperthreading enabled):

\starttabulate[|r|r|]
\NC  6 \NC 800 \NC \NR
\BC  8 \BC 700 \NC \NR
\NC 12 \NC 650 \NC \NR
\NC 16 \NC 600 \NC \NR
\stoptabulate

As a comparison, running multiple batches in parallel gives (in seconds):

\starttabulate[|r|r|]
\NC 4 \NC 550 \NC \NR
\NC 6 \NC 540 \NC \NR
\stoptabulate

In the end I settled for the 8 runs in parallel, just because the machine ran
less hot (and thereby less noisy) and one can still do plenty of other things during
that run. When one pipes the output to the terminal the graphic processor is also
kept busy (some 50 percent for the built|-|in graphics).

So how does that benefit users? A variant of this feature is now available in the
\CONTEXT\ runner. Of course you could always do this:

\starttyping
context test1.tex test2.tex test3.tex test4.tex
\stoptyping

Here the files are processed in sequence. But you can now get better
performance:

\starttyping
context --parallel test1.tex test2.tex test3.tex test4.tex
\stoptyping

In case every test file needs different command line arguments you can do this
instead:

\starttyping
context --parallellist test.cmd
\stoptyping

On \MSWINDOWS\ that file can look like this:

\starttyping
context test1 --result=test-1
context test2 --result=test-2
context test3 --result=test-3
context test4 --result=test-4
\stoptyping

Here the runner will just take the lines that start with \type {context} which
means that you can still run that command file directly.
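
The filtering described above can be sketched as follows. This is Python and
only an illustration of the idea, not the code of the runner: the function name
\type {context_commands} is hypothetical, and accepting leading whitespace
before \type {context} is an assumption.

```python
def context_commands(text):
    """Return the lines of a command file that start with the word `context`.

    Each command comes back as a list of words; all other lines (comments,
    shell directives, empty lines) are skipped.
    """
    commands = []
    for line in text.splitlines():
        words = line.split()
        # Assumption: leading whitespace before `context` is tolerated.
        if words and words[0] == "context":
            commands.append(words)
    return commands


sample = """\
@echo off
rem process the test files
context test1 --result=test-1
context test2 --result=test-2
"""

for command in context_commands(sample):
    print(command)
```

The first two lines of the sample are there for the command interpreter and are
ignored by the filter, so the same file serves both purposes.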

By default the output is suppressed unless you pass \type {--terminal} as in:

\starttyping
context --parallel --terminal test1.tex test2.tex test3.tex test4.tex
\stoptyping

The only drawback in this parallel approach is that when one run stalls its
output (for instance because some huge graphic is rendered) the others will wait
a bit, just because we cycle through the runs. It also depends on how console
buffering is set up. In practice this will not have much impact.

If you use the same input file with different output, you can use this setup:

\starttyping
context test-1 --mode=m-1 --forceinput=test
context test-2 --mode=m-2 --forceinput=test
context test-3 --mode=m-3 --forceinput=test
context test-4 --mode=m-4 --forceinput=test
context test-5 --mode=m-5 --forceinput=test
context test-6 --mode=m-6 --forceinput=test
\stoptyping

Here the so called jobname will be the given filename \type {test-*} while the
input will come from \type {test.tex}. Using the same input files with a result
specification doesn't work here because the multi|-|pass data file(s) and
intermediate results will clash. Running in a dedicated subpath could work out but
has its own complications (creation, cleanup, moving results) and would therefore
be more fragile.
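
When there are many modes, a list like the one above is better generated than
typed. A small sketch in Python, where the function name \type
{forceinput_lines} is made up and the naming scheme (jobname and mode sharing
one suffix) is simply copied from the example:

```python
def forceinput_lines(source, suffixes):
    """One `context` line per suffix, all reading the same input file.

    The jobname becomes <source>-<suffix> and the mode m-<suffix>, which is
    the (assumed) naming scheme of the example list.
    """
    return [
        f"context {source}-{suffix} --mode=m-{suffix} --forceinput={source}"
        for suffix in suffixes
    ]


for line in forceinput_lines("test", range(1, 7)):
    print(line)
```

The printed lines can be saved in a command file and passed to the runner.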

An example of running files in a tree, like those in the test suite is:

\starttyping
context --parallel --pattern=**/whatever*.tex
\stoptyping

The files will be processed in the directory where they are found. When there is
an error the filename will be reported at the end of the run.

\stopsection

\startsection[title={Alternative}]

This trickery was introduced for speeding up processing the test suite. It is
however also handy for other cases, for example in an export. There we can make
it part of a workflow where many files are involved. Below is a short summary of
what we have so far.

We have several shared command line options. With \type {--terminal} you get
output on the console, which of course slows matters down. With \type {--runners}
you specify how many jobs run in parallel; we default to eight.

The \type{--parallel} argument specifies a specification file, which has to
return a \LUA\ table. This can be a top level specification or it can have a sub
table \typ {parallel}. The parallel specification has the following fields:
optional \type {runners}, \type {terminal}, \type {sequence} and mandatory \type
{commands}. Here is an example of a file:

\starttyping
return {
    author   = "Mikael S and Hans H",
    comment  = "export control file for the 𝛽1 course",
    version  = "2025.07.15",
    parallel = {
        runners  = 8,
     -- terminal = true,
        sequence = true,
        commands = {
            {
                "context --exportcontent file-1",
                "context --exportcontent file-2",
                "context --exportcontent file-3",
            },
            {
                "mtxrun --script epub --images --flat --path=foo file-1",
                "mtxrun --script epub --images --flat --path=foo file-2",
                "mtxrun --script epub --images --flat --path=foo file-3",
            }
        },
    },
    -- entries that are used by other helper programs or that drive the
    -- export (like paths)
}
\stoptyping

The \type {commands} table can be flat (one level) or have sub tables as here.
When \type {sequence} is set to \type {true} the sub tables will be processed
after each other. This is needed when we have dependencies. In our example we
can only process images when we have exported the content.
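
The difference between a flat and a sequenced \type {commands} table can be
illustrated with a few lines of Python. This is a sketch of the idea and not the
implementation in the runner: \type {batches_of} and \type {process} are
hypothetical names, \type {run} stands for whatever executes a single command,
and treating a flat table as one batch is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def batches_of(commands, sequence):
    """Split a flat or nested command list into batches that run one by one."""
    if not any(isinstance(entry, (list, tuple)) for entry in commands):
        # Assumption: a flat list is always a single batch.
        return [list(commands)]
    groups = [
        list(entry) if isinstance(entry, (list, tuple)) else [entry]
        for entry in commands
    ]
    if sequence:
        return groups
    # Without sequencing all sub lists are merged into one batch.
    return [[command for group in groups for command in group]]


def process(commands, run, runners=8, sequence=False):
    """Run the commands in parallel, finishing each batch before the next."""
    results = []
    with ThreadPoolExecutor(max_workers=runners) as pool:
        for batch in batches_of(commands, sequence):
            # Consuming the map iterator waits until the whole batch is done.
            results.extend(pool.map(run, batch))
    return results
```

With \type {sequence} enabled the second sub table only starts when every
command of the first one has finished, which is what the dependency between
content and images asks for.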

Alternatively one can specify a \type {--parallellist} or \type
{--parallelsequence} that takes the given files and looks for \type {context} or
\type {mtxrun} commands to process. A sequence will process in batches.

\stopsection

\stopchapter

\stopcomponent
