% language=us runpath=texruns:manuals/workflows \startcomponent workflows-parallel \environment workflows-style \startchapter[title={Parallel processing}] \startsection[title={Introduction}] \stopsection This is just a small intermezzo. Mid April 2020 Mojca asked on the mailing list how to best compile 5000 files, based on a template. The answer depends on the workflow and circumstances but one can easily come up with some factors that play a role. \startitemize \startitem How complex is the document? How many pages are generated, how many fonts get used? Do we need multiple runs per document? Are images involved and if so, what format are they in? When processing relative small files we normally need seconds, not minutes. \stopitem \startitem What machine is used? How powerful is the \CPU, how many cores are available and how much memory do we have? Is the filesystem on a local \SSD\ or on a remote file system? How well does file caching work? Again, we're talking seconds here. \stopitem \startitem What engine is used? Assuming that \MKIV\ is used, we can choose for \LUATEX\ or \LUAMETATEX. The former has faster backend code, the later a faster frontend. What is more efficient depends on the document. The later has some advantages that we will not mention here. \stopitem \stopitemize The tests mentioned below are run with a simple \LUA\ script that manages the parallel runs. More about that later. As sample document we use this: \starttyping \setupbodyfont[dejavu] \starttext \dorecurse{\getdocumentargument{noffiles}}{\input tufte\par} \stoptext \stoptyping We start with 100 runs of 10 inclusions. We permit 8 runs in parallel. A \LUATEX\ run of 100 takes 32 seconds, a \LUAJITTEX\ run uses 26 seconds, and \LUAMETATEX\ does it in 25 seconds. \footnote {I used a mingw cross compiled 64 bit binary; the GCC9 version seems somewhat slower than the previous compiler version.} An interesting observation is memory consumption: \LUAJITTEX, which has a different virtual machine and a limited memory model, peaks at 0.8GB for the eight parallel runs. The \LUAMETATEX\ engine has the same demands. However, \LUATEX\ needs 1.2GB. Bumping to 20 inclusions increased the runtime a few seconds for each engine. The differences can be explained by a faster startup time of \LUAMETATEX; for instance we don't use a compressed format (dump), but there are some other optimizations too, and even when they're close to unmeasurable, they might add up. The \LUAJITTEX\ engine speeds up \LUA\ interpretation which is reflected in runtime because \CONTEXT\ spends half its time in \LUA. As a next test I decided to run the test file 5000 times: Mojca's scenario. Including 10 sample files (per run) for those 5000 files took 1320 seconds. When we cache the included file we gain some 5~percent. Does it matter how many jobs we run in parallel? The 2013 laptop I used for testing has four real cores that hyperthread to eight cores. \footnote {The machine has an Intel i7-3840QM \CPU, 16GB of memory and a 512 GB Samsung Pro \SSD.} On 1000 jobs we need 320 seconds for 1000 files (10 inclusions) when we use four cores. With six cores we need 270 seconds, which is much better. With eight cores we go down to 260 seconds and ten cores, which is two more than there are, we get about the same runtime. \footnote {On a more modern system, let alone a desktop computer, I expect these numbers to be much lower.} A \TEX\ program is a single core process and it makes no sense to use more cores than the \CPU\ provides. \starttyping \setupbodyfont[dejavu] \starttext \dorecurse{\getdocumentargument{noffiles}}{\samplefile{tufte}\par} \stoptext \stoptyping Again, caching the input file as above saves a little bit: 10 seconds, so we get 250 seconds. When you run these tests on the machine that you normally work on, waiting for that many jobs to finish is no fun, so what if we (as I then normally do) watch some music video? With a full screen high resolution video shown in the foreground the runtime didn't change: still 250 seconds for 1000 jobs with eight parallel runs. On the other hand, a test with Firefox, which is quite demanding, running a video in the background, made the runtime going up by 30 seconds to 280. So, when doing some networking, decompression, all kinds of unknown tracking using \JAVASCRIPT, etc.\ and therefore its own demands on cores and memory you might want to limit the number of parallel runs. These tests are probably not that meaningful but a good distraction when in lock down. I'm still not sure if I should come up with a script for managing these parallel runs. But one thing I have added to the \type {context} runner is the (for now undocumented) option \starttyping --wipebusy \stoptyping which, after a run removes the file \starttyping context-is-busy.tmp \stoptyping This permits a management script to check if a run is done. Before starting a run (in a separate process) the script can write that file and by just checking if it is still there, the management script can decide when a next run can be started. \stopsection \startsection[title={Solution}] Mid 2023 the test suite had some 1900 files and whenever the engine or \CONTEXT\ is adapted that collection of files is processed. Although one can test for regression the main reason for doing this is to check if any of these documents fails, which can happen due to some typo in an upgrade. We also want to see if there is a change in performance. However, for a single run it takes some 1350 seconds (on my current development laptop), when two runs are needed we end up as 2500 seconds. Of course a modern machine would likely bring that down to some 700 seconds but there is only so much one can spend on hardware. Because \TEX\ is a single core process, the question is \quotation{Can we use more than one core?} and the answer is \quotation {Of course we can}. Running the test suite is managed by a (\LUA) script and one of its tasks is to traverse the tree of files and register success or failure so that at the end we get an overview. Parallelizing the process can be done in two ways: we divide the whole batch in several smaller batches where each batch is processed at the same time. Alternatively we can have one process that runs files in different threads (sort of). That second option is what we prefer and the runtime results of that one are as follows. We run on a four core mobile Xeon 1505 (with hyperthreading enabled): \starttabulate[|r|r|] \NC 6 \NC 800 \NC \NR \BC 8 \BC 700 \NC \NR \NC 12 \NC 650 \NC \NR \NC 16 \NC 600 \NC \NR \stoptabulate As a comparison, running multiple batches in parallel gives (in seconds): \starttabulate[|r|r|] \NC 4 \NC 550 \NC \NR \NC 6 \NC 540 \NC \NR \stoptabulate In the end I settled for the 8 runs in parallel, just because the machine ran less hot (and thereby less noisy) and one can still do plenty other things during that run. When one pipes the output to the terminal the graphic processor is also kept busy (some 50 percent for the built|-|in graphics). So how does that benefit users? A variant of this feature is now available in the \CONTEXT\ runner. Of course you could always do this: \starttyping context test1.tex test2.tex test3.tex test4.tex \stoptyping Here the files are processed in sequence. But you can now get a better performance: \starttyping context --parallel test1.tex test2.tex test3.tex test4.tex \stoptyping In case every test file needs different command line arguments you can do this instead: \starttyping context --parallellist test.cmd \stoptyping Where (on \MSWINDOWS) that file can look like: \starttyping context test1 --result=test-1 context test2 --result=test-2 context test3 --result=test-3 context test4 --result=test-4 \stoptyping Here the runner will just take the lines that start with \type {context} which means that you can still run that command file directly. By default the output is suppressed unless you pass \type {--terminal} as in: \starttyping context --parallel --terminal test1.tex test2.tex test3.tex test4.tex \stoptyping The only drawback in this parallel approach is that when one run stalls its output (for instance because some huge graphic is rendered) the others will wait a bit, just because we cycle through the runs. It also depends how console buffering is set up. In practice this will not have much impact. If you use the same input file with different output, you can use this setup: \starttyping context test-1 --mode=m-1 --forceinput=test context test-2 --mode=m-2 --forceinput=test context test-3 --mode=m-3 --forceinput=test context test-5 --mode=m-5 --forceinput=test context test-5 --mode=m-5 --forceinput=test context test-6 --mode=m-6 --forceinput=test \stoptyping Here the so called jobname will be the given filename \type {test-*} while thee input will come from \type {test.tex}. Using the same input files with a result specification doesn't work here because the multi|-|pass data file(s) and intermediate result will clash. Running on dedicated subpath could work out but has its own complications (creation, cleanup, moving results) and would therefore be more fragile. An example of running files in a tree, like those in the test suite is: \starttyping context --parallel --pattern=**/whatever*.tex \stoptyping The files will be processed the directory where the file is found. When there is an error the filename will be reported at the end of the run. \stopsection \stopchapter \stopcomponent