% language=us runpath=texruns:manuals/ontarget

\startcomponent ontarget-registers

\environment ontarget-style

\startchapter[title={Gaining performance}]

In the meantime (2022) the \LUAMETATEX\ engine has touched many aspects of the
original \TEX\ implementation. This has resulted in less memory consumption
than for instance \LUATEX\ when we talk tokens, more efficient macro handling,
additional storage options and numerous new features and optimizations. Of
course one can disagree about all of this, but what matters to us is that it
facilitates \CONTEXT\ well. That macro package went from \MKII\ to \MKIV\ to
\MKXL\ (aka \LMTX). Although the macros evolved over the years, the basic ideas
haven't changed: it is a keyword driven macro package that is set up in a way
that makes it possible to move forward. In spite of what one might think, the
fundamentals didn't change much. It looks like we made the right decisions at
the start, which means that we can change low level implementations to match
the engine without users noticing much. Of course in the area of fonts, input
encoding and languages things have changed, simply because the environment in
which we operate changes.

A fundamental difference between \PDFTEX\ and \LUAMETATEX\ is that the latter
is in many aspects 32 and even 64 bit all over the place. That comes with a
huge performance hit but also with possibilities (that I won't discuss here
now)! On a simple document nothing can beat \PDFTEX, even with the
optimizations that we can apply when using the modern engines. However, on
more complex documents the reality is that \LUAMETATEX\ can outperform
\PDFTEX, and documents (read: user demands) have become more complex indeed.

So, how does that work in practice? One can add some features to an engine but
then the macro package has to be adapted. Due to the way \CONTEXT\ is organized
it was not that hard to keep it in sync with new features, although not all are
applied to their full extent yet. Some new features improved performance,
others made the machinery (or its usage) a bit slower. The first versions of
\LUAMETATEX\ were some 25\percent\ slower than \LUATEX, simply because the
backend is written in \LUA. But, at the end of 2022, we can safely say that
\LUAMETATEX\ can be 50\percent\ faster than its ancestor. This is due to a mix
of the already mentioned optimizations and new features, for instance a more
powerful macro parser. The backend has become more complex too, but also
benefits from a few more helpers. Because we spend a lot of time in \LUA\ the
interfaces to \TEX\ have been extended and improved too. Of course we depend on
the \LUA\ interpreter being kept in optimum state by its authors. It must be
said that quite a few of the interfaces might look obscure, but these are not
really meant for the average user anyway. Also, as soon as one messes with
tokens and nodes at that level one definitely needs to know what one is doing!

The more stable the engine becomes, the less there is to improve. Occasionally
it was possible to squeeze out a few more milliseconds on a run, but it depends
a lot on what one does. And \TEX\ is already quite fast anyway. Of course 0.005
seconds on a 5 second run is not much, but a hundred times such an improvement
is noticeable, especially when there are multiple runs or when one processes a
batch of 10.000 documents (each needing two runs). One interesting aspect of
\TEX\ is that it can surprise you every now and then.

At the end of 2022 I decided to play a bit more with a feature that has been
around for a while:

\starttyping
\integerdef  \fooA 123
\dimensiondef\fooB 123pt
\stoptyping

These primitives create a counter and a dimen where the value is stored in the
hash table. The original reason was that I didn't want to spoil registers. But
although these are basically constants there is more to it now.

\starttyping
\countdef\fooC 27
\dimendef\fooD 56
\stoptyping

These primitives create a command that stores the register number (here 27 and
56) with the name. In this case a \quote {variable} is accessed in two steps:
the \type {\fooC} macro expands to a register accessor with value 27. Next that
accessor will kick in and fetch (or set) the value in slot 27 of the memory
range bound to the (in total 65K) counters. All these registers sit at the
lower end of \TEX's memory, which is definitely not next to the meaning of
\type {\fooC}. So we have two memory accesses to get to the number. Contrary to
that, once we are at \type {\fooA} we are also at the value.

Although memory access can be fast when the relevant slots are cached, in
practice it can give delays, especially in a program like \TEX\ where most data
is spread all over the place. And imagine other processes competing for access
too. It is for that reason that I decided to replace the more or less \quote
{constant} property of \type {\fooA} by one that also supports assignments as
well as arithmetic commands like \type {\advance}. This was not that hard due
to the way the \LUAMETATEX\ source is organized. After that, using these pseudo
constants proved to be more efficient than registers, but of course I then had
to adapt the source. Interestingly that should have been easy, because one only
needs to change the definitions of for instance \type {\newcount}, but in
practice that doesn't work because it will|/|can break generic packages like
TikZ. So, in the end a new allocator was added and just over 1000 lines in some
120 files (with some overlap) had to be adapted to this. In addition some
precautions had to be taken for access from \LUA\ because the quantities were
no longer registers. But it was rewarding in the sense that the test suite now
ran some 5\percent\ faster and processing the \LUAMETATEX\ manual went from 8.7
seconds on my laptop down to around 8.5, which is not bad.

Now why do we bother so much about performance? If I really want a faster run,
using a decent desktop is of more help. But even then there can be reasons.
When Mikael and I were discussing math engine developments, at some point we
noticed that a run took twice as much time as a result of (supposedly idle)
background tasks. Now keep in mind that \TEX\ uses a single core, so with
plenty of cores it should not be that bad. However, when the video chat program
takes half of the CPU power, or when a mathematical manipulation program idles
in the background taking 80 percent of a modern machine, or when a popular
editor keeps all kinds of plug|-|ins busy for no reason, or when a supposedly
closed browser consumes gigabytes of memory and keeps dozens of supposedly idle
threads busy, it becomes clear that we should not let \TEX\ put a large burden
on memory access (and cache). It can get even worse when one runs on virtual
machines, where the host suggests that you get 16 cores so that you can run a
dozen \TEX\ jobs in parallel, but simple measurements show that these shared
cores report a much higher ideal performance than the one you actually measure.
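
To make the difference between the two kinds of definitions a bit more
concrete, here is a minimal sketch; the \type {\MyStep} and \type {\MyGap}
names are just made up for this example:

\starttyping
% the value is stored with the name itself (one lookup):

\integerdef  \MyStepA 123
\dimensiondef\MyGapA  12pt

% the name maps onto a register slot, the value lives in register
% memory (two lookups):

\countdef\MyStepB 27
\dimendef\MyGapB  56

% nowadays the pseudo constants also accept assignments and the
% arithmetic primitives:

\MyStepA = 200
\advance\MyStepA by 10
\advance\MyStepB by 10

[\the\MyStepA] [\the\MyStepB] [\the\MyGapA] [\the\MyGapB]
\stoptyping

Of course in \CONTEXT\ one normally leaves such definitions to the allocators,
so at the user level nothing changes.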

So, coming back to those shared resources: the less demanding a \CONTEXT\ run
becomes, the better. We're not so much after the 0.2 seconds saved on an 8
second run, but more after the 3 seconds saved on that same run when shared
resources turned it into a 15 second run. And this is what observations with
respect to the performance of the test suite seem to indicate. In the end it's
mostly about comfort: when you process a document of 300 pages, 10 seconds is
quite okay for a few changes, because one can relate time to output, but 20
seconds \unknown\ And when processing a few page document the waiting time of a
second is often less than what one needs to move the mouse over to the viewer.
Also, when a user starts \TEX\ on the console and afterwards opens a browser
from there, that second is even less noticeable.

Now let's go back to improvements. A related addition was \type {\advanceby}
that doesn't check for the \type {by} keyword. When there is no such keyword we
can avoid pushing back the non|-|matching next token, which is also noticeable.
Here about 680 changes were needed. Changes like these only make a difference
in performance for some very demanding mechanisms in \CONTEXT. Again one cannot
overload an existing primitive because generic packages can fail (as the test
suite proved). There were also a few places where a dirty trick had to be
changed because we cannot alias these constants. A small sketch of the
difference is given at the end of this chapter.

We can give similar stories about other improvements, but this one sort of
stands out because it is so noticeable. Also, other changes involve more
drastic low level adaptations of \CONTEXT, so these happen over a longer period
of time. Of course all of this has to happen in ways that don't impact users.
An example of a performance primitive is \typ {\advancebyplusone}, which is
actually implemented but still disabled because the gain is in the hundredths
of a second range and I need to (again) adapt the source in order to benefit.

The mentioned register variants are implemented for count (integer), dimen
(dimension), skip (gluespec) and muskip (mugluespec). Token registers are more
complex as they have reference counters as well as more manipulator primitives.
The same is true for boxes (although it is tempting to come up with some faster
access mechanism) and attributes, which also have more diverse accessors. Also,
token lists and boxes involve way more than a simple assignment or access, so
any gain will drown in other actions. That said, it really makes sense now to
drop the maximum of 64K registers to some more reasonable 8K (or even less for
mu skips). That will save a couple of megabytes, which sounds like little but
still puts less burden on the system.
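
To end with, here is the promised minimal sketch of the \type {\advanceby}
difference; the \type {\MyDemoCount} name is made up for this example, and in
practice the gain only adds up in mechanisms that do such advancing many
thousands of times per run:

\starttyping
\newcount\MyDemoCount

% the traditional primitive optionally scans for a "by" keyword, so
% when that keyword is absent the scanned token has to be pushed back:

\advance \MyDemoCount by 1
\advance \MyDemoCount    1

% the variant below never scans for "by" and thereby avoids that
% push back:

\advanceby \MyDemoCount 1
\stoptyping

\stopchapter

\stopcomponent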