% language=us runpath=texruns:manuals/workflows

\startcomponent workflows-parallel

\environment workflows-style

\startchapter[title={Parallel processing}]

\startsection[title={Introduction}]

This is just a small intermezzo. In mid April 2020 Mojca asked on the mailing
list how best to compile 5000 files based on a template. The answer depends on
the workflow and circumstances, but one can easily come up with some factors
that play a role.

\startitemize
    \startitem
        How complex is the document? How many pages are generated, how many fonts
        get used? Do we need multiple runs per document? Are images involved and
        if so, what format are they in? When processing relatively small files we
        normally need seconds, not minutes.
    \stopitem
    \startitem
        What machine is used? How powerful is the \CPU, how many cores are
        available and how much memory do we have? Are the files on a local
        \SSD\ or on a remote file system? How well does file caching work? Again,
        we're talking seconds here.
    \stopitem
    \startitem
        What engine is used? Assuming that \MKIV\ is used, we can choose between
        \LUATEX\ and \LUAMETATEX. The former has faster backend code, the latter a
        faster frontend. Which is more efficient depends on the document. The
        latter has some advantages that we will not mention here.
    \stopitem
\stopitemize

The tests mentioned below are run with a simple \LUA\ script that manages the
parallel runs. More about that later. As a sample document we use this:

\starttyping
\setupbodyfont[dejavu]

\starttext
    \dorecurse{\getdocumentargument{noffiles}}{\input tufte\par}
\stoptext
\stoptyping

We start with 100 runs of 10 inclusions. We permit 8 runs in parallel. With
\LUATEX\ these 100 runs take 32 seconds, \LUAJITTEX\ uses 26 seconds, and
\LUAMETATEX\ does it in 25 seconds. \footnote {I used a mingw cross compiled 64
bit binary; the GCC9 version seems somewhat slower than the previous compiler
version.} An interesting observation is memory consumption: \LUAJITTEX, which
has a different virtual machine and a limited memory model, peaks at 0.8GB for
the eight parallel runs. The \LUAMETATEX\ engine has the same demands. However,
\LUATEX\ needs 1.2GB. Bumping to 20 inclusions increased the runtime by a few
seconds for each engine.

The differences can be explained by the faster startup time of \LUAMETATEX; for
instance we don't use a compressed format (dump), but there are some other
optimizations too, and even when they're close to unmeasurable, they might add
up. The \LUAJITTEX\ engine speeds up \LUA\ interpretation, which is reflected in
the runtime because \CONTEXT\ spends half its time in \LUA.

As a next test I decided to run the test file 5000 times: Mojca's scenario.
Including 10 sample files (per run) for those 5000 files took 1320 seconds. When
we cache the included file we gain some 5~percent.

Does it matter how many jobs we run in parallel? The 2013 laptop I used for
testing has four real cores that hyperthread to eight cores. \footnote {The
machine has an Intel i7-3840QM \CPU, 16GB of memory and a 512 GB Samsung Pro
\SSD.} For 1000 files (10 inclusions each) we need 320 seconds when we use four
cores. With six cores we need 270 seconds, which is much better. With eight
cores we go down to 260 seconds, and with ten runs, which is two more than there
are cores, we get about the same runtime. \footnote {On a more modern system,
let alone a desktop computer, I expect these numbers to be much lower.} A \TEX\
program is a single core process, so it makes no sense to run more jobs in
parallel than the \CPU\ provides cores.
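
The gain from adding parallel runs can be made explicit with a bit of
arithmetic. The following Python fragment is only an illustration that uses the
three timings quoted above; the name \type {relative_gain} is made up for this
example.

```python
# Wall clock times quoted in the text: parallel runs -> seconds for 1000 jobs.
measured = {4: 320, 6: 270, 8: 260}


def relative_gain(timings, baseline):
    """Return how many times faster each setting is than the baseline."""
    return {runs: timings[baseline] / seconds for runs, seconds in timings.items()}


gains = relative_gain(measured, baseline=4)
for runs, gain in gains.items():
    print(f"{runs} parallel runs: {gain:.2f} times as fast as 4 runs")
```

Going from four to eight parallel runs doubles the number of runs but the
runtime improves by less than a quarter, which fits a machine with four real
cores.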

\starttyping
\setupbodyfont[dejavu]

\starttext
    \dorecurse{\getdocumentargument{noffiles}}{\samplefile{tufte}\par}
\stoptext
\stoptyping

Again, caching the input file as above saves a little bit: 10 seconds, so we get
250 seconds. When you run these tests on the machine that you normally work on,
waiting for that many jobs to finish is no fun, so what if we (as I then normally
do) watch some music video? With a full screen high resolution video shown in the
foreground the runtime didn't change: still 250 seconds for 1000 jobs with eight
parallel runs. On the other hand, a test with Firefox, which is quite demanding,
running a video in the background made the runtime go up by 30 seconds to 280.
So, when other programs do some networking, decompression, all kinds of unknown
tracking using \JAVASCRIPT, etc.\ and therefore make their own demands on cores
and memory, you might want to limit the number of parallel runs. These tests are
probably not that meaningful but they are a good distraction when in lockdown.

I'm still not sure if I should come up with a script for managing these parallel
runs. But one thing I have added to the \type {context} runner is the (for now
undocumented) option

\starttyping
--wipebusy
\stoptyping

which, after a run, removes the file

\starttyping
context-is-busy.tmp
\stoptyping

This permits a management script to check if a run is done. Before starting a run
(in a separate process) the script can write that file and by just checking if it
is still there, the management script can decide when a next run can be started.
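
The handshake described above can be sketched as follows. This is only an
illustration in Python and not the script that was used for the timings. The
names \type {run_all} and \type {default_command}, the polling interval and the
use of one directory per job are assumptions; only \type {--wipebusy} and the
name of the busy file come from the text. The sketch also assumes that the
runner wipes the file in the directory where the run was started.

```python
import subprocess
import time
from pathlib import Path

BUSY = "context-is-busy.tmp"  # file name as given in the text


def default_command(jobfile):
    # Assumption: "context" is on the path and --wipebusy removes the
    # busy file in the directory where the run was started.
    return ["context", "--wipebusy", str(jobfile)]


def run_all(jobs, slots, make_command=default_command, poll=0.05):
    """Process (directory, jobfile) pairs with at most `slots` active runs.

    Every job needs its own directory, otherwise the busy files clash.
    Returns the directories in the order in which their runs finished.
    """
    pending = list(jobs)
    active = []    # (directory, process) pairs of runs in progress
    finished = []
    while pending or active:
        for entry in list(active):
            directory, process = entry
            wiped = not (Path(directory) / BUSY).exists()
            # A crashed run never wipes the file, so look at the process too.
            if wiped or process.poll() is not None:
                process.wait()
                active.remove(entry)
                finished.append(directory)
        while pending and len(active) < slots:
            directory, jobfile = pending.pop(0)
            # Write the busy file before the run starts.
            (Path(directory) / BUSY).write_text("busy\n")
            process = subprocess.Popen(make_command(jobfile), cwd=directory)
            active.append((directory, process))
        if active:
            time.sleep(poll)
    return finished
```

A call like \type {run_all(jobs, slots=8)} would then keep eight runs going
until the list of jobs is exhausted. Checking the process state next to the
busy file is an addition of this sketch, meant to avoid waiting forever for a
run that failed before it could wipe the file.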

\stopsection

\startsection[title={Solution}]

Mid 2023 the test suite had some 1900 files and whenever the engine or \CONTEXT\
is adapted that collection of files is processed. Although one can test for
regression, the main reason for doing this is to check if any of these documents
fails, which can happen due to some typo in an upgrade. We also want to see if
there is a change in performance. However, a single run takes some 1350 seconds
(on my current development laptop), and when two runs are needed we end up at
2500 seconds. Of course a modern machine would likely bring that down to some 700
seconds but there is only so much one can spend on hardware.

Because \TEX\ is a single core process, the question is \quotation {Can we use
more than one core?} and the answer is \quotation {Of course we can}. Running the
test suite is managed by a (\LUA) script and one of its tasks is to traverse the
tree of files and register success or failure so that at the end we get an
overview. Parallelizing the process can be done in two ways. We can divide the
whole batch into several smaller batches that are processed at the same time.
Alternatively we can have one process that runs files in different threads (sort
of).
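
The two approaches can be compared in a small Python sketch. It is not the
\LUA\ script that manages the test suite: all function names are made up, the
command that processes a file is passed in as \type {make_command}, and worker
threads that each wait for one child process stand in for the \quotation
{threads (sort of)} mentioned above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_batches(files, count):
    """First approach: divide the list into `count` fixed batches."""
    return [files[i::count] for i in range(count)]


def run_file(filename, make_command):
    # One engine run is one single core process; register success or failure.
    result = subprocess.run(make_command(filename), capture_output=True)
    return filename, result.returncode == 0


def run_batches(files, count, make_command):
    """Each batch is processed in sequence, the batches run side by side."""
    def run_batch(batch):
        return [run_file(name, make_command) for name in batch]
    with ThreadPoolExecutor(max_workers=count) as pool:
        results = list(pool.map(run_batch, split_batches(files, count)))
    return dict(pair for batch in results for pair in batch)


def run_shared(files, slots, make_command):
    """Second approach: one manager, a free slot takes the next file."""
    with ThreadPoolExecutor(max_workers=slots) as pool:
        results = list(pool.map(lambda name: run_file(name, make_command), files))
    return dict(results)
```

Both functions return a table that maps each file to success or failure, which
is what is needed for the overview at the end. In principle the approaches
differ when files vary in runtime: a fixed batch that happens to contain the
slow files determines the total, while with a shared list a free slot simply
picks up the next file.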

That second option is what we prefer. Its runtimes in seconds, for a given number
of parallel runs, are as follows. We run on a four core mobile Xeon 1505 (with
hyperthreading enabled):

\starttabulate[|r|r|]
\NC  6 \NC 800 \NC \NR
\BC  8 \BC 700 \NC \NR
\NC 12 \NC 650 \NC \NR
\NC 16 \NC 600 \NC \NR
\stoptabulate

As a comparison, running multiple batches in parallel gives (number of batches
and runtime in seconds):

\starttabulate[|r|r|]
\NC 4 \NC 550 \NC \NR
\NC 6 \NC 540 \NC \NR
\stoptabulate

In the end I settled for 8 runs in parallel, just because the machine ran less
hot (and thereby less noisy) and one can still do plenty of other things during
that run. When one pipes the output to the terminal the graphics processor is
also kept busy (some 50 percent for the built|-|in graphics).

So how does that benefit users? A variant of this feature is now available in the
\CONTEXT\ runner. Of course you could always do this:

\starttyping
context test1.tex test2.tex test3.tex test4.tex
\stoptyping

Here the files are processed in sequence. But you can now get better
performance:

\starttyping
context --parallel test1.tex test2.tex test3.tex test4.tex
\stoptyping

In case every test file needs different command line arguments you can do this
instead:

\starttyping
context --parallellist test.cmd
\stoptyping

On \MSWINDOWS\ that file can look like this:

\starttyping
context test1 --result=test-1
context test2 --result=test-2
context test3 --result=test-3
context test4 --result=test-4
\stoptyping

Here the runner will just take the lines that start with \type {context} which
means that you can still run that command file directly.
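
The selection of lines can be illustrated with a few lines of Python. This is
not the code of the runner: the function name \type {parallel_commands} is made
up, and the sketch assumes that \quotation {starts with} means that the first
word on the line is exactly \type {context}.

```python
def parallel_commands(text):
    """Return the argument lists of the lines that start with "context"."""
    commands = []
    for line in text.splitlines():
        words = line.split()
        # Assumption: only lines whose first word is exactly "context"
        # count; other lines (echo, rem, blank lines) are skipped.
        if words and words[0] == "context":
            commands.append(words[1:])
    return commands
```

Because other lines are skipped, a command file can keep its usual preamble and
comments. Splitting on whitespace is a simplification: arguments with quoted
spaces would need more care.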

By default the output is suppressed unless you pass \type {--terminal} as in:

\starttyping
context --parallel --terminal test1.tex test2.tex test3.tex test4.tex
\stoptyping

The only drawback of this parallel approach is that when one run stalls its
output (for instance because some huge graphic is rendered) the others will wait
a bit, just because we cycle through the runs. It also depends on how console
buffering is set up. In practice this will not have much impact.

If you use the same input file with different output, you can use this setup:

\starttyping
context test-1 --mode=m-1 --forceinput=test
context test-2 --mode=m-2 --forceinput=test
context test-3 --mode=m-3 --forceinput=test
context test-4 --mode=m-4 --forceinput=test
context test-5 --mode=m-5 --forceinput=test
context test-6 --mode=m-6 --forceinput=test
\stoptyping

Here the so called jobname will be the given filename \type {test-*} while the
input will come from \type {test.tex}. Using the same input file with a result
specification doesn't work here because the multi|-|pass data file(s) and
intermediate results will clash. Running in a dedicated subpath could work out but
has its own complications (creation, cleanup, moving results) and would therefore
be more fragile.
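
A list like this is easily generated. The following Python sketch is just an
illustration: the function name \type {forceinput_lines} and the numbering of
the jobnames are assumptions, while \type {--mode} and \type {--forceinput} are
used as in the example above.

```python
def forceinput_lines(source, modes):
    """Return one command line per mode, all reading the same input file."""
    lines = []
    for number, mode in enumerate(modes, start=1):
        # Every line gets its own jobname so that the multi-pass data
        # files and intermediate results of the runs do not clash.
        jobname = f"{source}-{number}"
        lines.append(f"context {jobname} --mode={mode} --forceinput={source}")
    return lines
```

Calling \type {forceinput_lines("test", ["m-1", "m-2"])} gives the first two
lines of the example. Generating the lines guarantees that each jobname occurs
only once.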

An example of running files in a tree, like those in the test suite, is:

\starttyping
context --parallel --pattern=**/whatever*.tex
\stoptyping

Each file will be processed in the directory where it is found. When there is
an error the filename will be reported at the end of the run.
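
What such a pattern amounts to can be shown with a small Python sketch. It uses
the glob rules of Python's \type {pathlib}, which are assumed here to be close
enough to those of the runner, and the function name \type {jobs_for_pattern}
is made up.

```python
from pathlib import Path


def jobs_for_pattern(root, pattern):
    """Return sorted (directory, filename) pairs for files matching pattern."""
    jobs = []
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            # Keep the directory apart from the name, because each file
            # is to be processed in the directory where it was found.
            jobs.append((str(path.parent), path.name))
    return jobs
```

With \type {jobs_for_pattern(".", "**/whatever*.tex")} one gets the matching
files in the current directory and below, each paired with the directory in
which its run should take place.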

\stopsection

\stopchapter

\stopcomponent