lowlevel-buffers.tex /size: 20 Kb    last modification: 2024-01-16 10:21
% language=us runpath=texruns:manuals/lowlevel

\environment lowlevel-style

\startdocument
  [title=buffers,
   color=middlegreen]

\startsectionlevel[title=Preamble]

Buffers are not that low level but it makes sense to discuss them in this
perspective because it relates to tokenization, internal representation and
manipulation.

{\em In due time we can describe some more commands and details here. This
is a start. Feel free to tell me what needs to be explained.}

\stopsectionlevel

\startsectionlevel[title=Encoding]

Normally processing a document starts with reading from file. In the past we were
talking single bytes that were then mapped onto a specific input encoding that
itself matched the encoding of a font. When you enter an \quote {a} its (normally
\ASCII) number 97 becomes the index into a font. That same number is also used in
the hyphenator which is why font encoding and hyphenation are strongly related.
If in an eight bit \TEX\ engine you need a precomposed \quote {ä} you have to use
an encoding that has that character in some slot with again matching fonts and
patterns. The actually used font can have the {\em shapes} in different slots and
remapping is then done in the backend code using encoding and mapping files. When
\OPENTYPE\ fonts are used the relationship between characters (input) and glyphs
(rendering) also depends on the application of font features.

In eight bit environments all this brings a bit of a resource management
nightmare along with complex installation of new fonts. It also puts strain on
the macro package, especially when you want to mix different input encodings onto
different font encodings and thereby pattern encodings in the same document. You
can compare this with code pages in operating systems, but imagine them
potentially being mixed in one document, which can happen when you mix multiple
languages where the accumulated number of different characters exceeds 256. You
end up switching between encodings. One way to deal with it is making special
characters active and letting their meaning differ per situation. That is for
instance how in \MKII\ we handled \UTF8\ and thereby got around distributing
multiple pattern files per language, as we only needed to encode them in \UTF\
and then remap them to the required encoding when loading patterns. A mental
exercise is wondering how to support \CJK\ scripts in an eight bit \MKII,
something that actually can be done with some effort.

The good news is that when we moved from \MKII\ to \MKIV\ we went exclusively
\UTF8\ because that is what the \LUATEX\ engine expects. Up to four bytes are read
in and translated into one \UNICODE\ character. The internal representation is a
32 bit integer (four bytes) instead of a single byte. That also means that in the
transition we got rid of quite some encoding related low level font and pattern
handling. We still support input encodings (called regimes in \CONTEXT) but I'm
pretty sure that nowadays no one uses input other than \UTF8. While \CONTEXT\ is
normally quite upward compatible this is one area where there were fundamental
changes.

There is still some interpretation going on when reading from file: for instance,
we need to normalize the \UNICODE\ input, and we feed the engine separate lines
on demand. Apart from that, some characters like the backslash, dollar sign and
curly braces have special meaning so for accessing them as characters we have to
use commands that inject those characters. That didn't change when we went from
\MKII\ to \MKIV. In practice it's never really a problem unless you find yourself
in one of the following situations:

\startitemize
\startitem
    {\em Example code has to be typeset as|-|is, so braces etc.\ are just that.}
    This means that we have to change the way characters are interpreted.
    Typesetting code is needed when you want to document \TEX\ and macros which
    is why mechanisms for that have to be present right from the start.
\stopitem
\startitem
    {\em Content is collected and used later.} A separation of content and usage
    later on often helps making a source look cleaner. Examples are \quotation
    {wrapping a table in a buffer} and \quotation {including that buffer when a
    table is placed} using the placement macros.
\stopitem
\startitem
    {\em Embedded \METAPOST\ and \LUA\ code.} These languages come with a
    different interpretation of some characters and especially \METAPOST\ code is
    often stored first and used (processed) later.
\stopitem
\startitem
    {\em The content comes from a different source.} Examples are \XML\ files
    where angle brackets are special but for instance braces aren't. The data is
    interpreted as a stream or as a structured tree.
\stopitem
\startitem
    {\em The content is generated.} It can for instance come from \LUA, where
    bytes (representing \UTF) are just text and no special characters are to be
    intercepted. Or it can come from a database (using a library).
\stopitem
\stopitemize
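
As mentioned above, characters with special meaning can be injected with
commands. A small sketch using a few of the \type {\letter...} helpers that
\CONTEXT\ provides for this:

\starttyping
% typesets the characters \foo{bar} instead of calling a macro:
\letterbackslash foo\letteropenbrace bar\letterclosebrace
\stoptyping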

For these reasons \CONTEXT\ always had ways to store data that make this
possible. The details of how that is done might have changed over versions, been
optimized, extended with additional interfaces and features, but given where we
come from most has been there from the start.

\stopsectionlevel

\startsectionlevel[title=Performance]

When \TEX\ came around, the bottlenecks in running \TEX\ were the processor,
memory and disks and depending on the way one used it the speed of the console or
terminal; so, basically the whole system. One could sit there and wait for the
page counters (\typ {[1] [2] ..}) to show up. It was possible to run \TEX\ on a
personal computer but it was somewhat resource hungry: one needed a decent disk
(a 10 MB hard disk was huge, which with today's phone camera snapshots sounds
crazy). One could use memory extenders to get around the 640K limitation (keep in
mind that the programs and operating systems also took space). This all meant
that one could not afford to store too many tokens in memory but even using files
for all kinds of (multi|-|pass) trickery was demanding.

When processors became faster and memory plenty the disk became the bottleneck,
but that changed when \SSD's showed up. Combined with already present file
caching that had some impact. We are now in a situation where \CPU\ cores don't
get that much faster (at least not twice as fast per iteration) and with \TEX\
being a single core byte cruncher we're more or less in a situation where
performance has to come from efficient programming. That means that, given enough
memory, in some cases storing in tokens wins over storing in files, but it is not
a general rule. In practice there is not much difference, so one can, even more
than yesterday, choose the most convenient method. Just assume that the \CONTEXT\
code, combined with \LUAMETATEX, will give you what you need with a reasonable
performance. When in doubt, test with simple test files and if that works out
well compared to the real code, try to figure out where \quote {mistakes} are
made. Inefficient \LUA\ and \TEX\ code has way more impact than storing a few
more tokens or using some files.

\stopsectionlevel

\startsectionlevel[title=Files]

Nearly always files are read once per run. The content (mixed with commands) is
scanned and macros are expanded and|/|or text is typeset as we go. Internally the
\LUAMETATEX\ engine is in \quotation {scanning from file}, \quotation {scanning
from token lists}, or \quotation {scanning from \LUA\ output} mode. The first
mode is (in principle) the slowest because \UTF\ sequences are converted to
tokens (numbers) but there is no way around it. The second method is fast because
we already have these numbers, but we need to take into account where the linked
list of tokens comes from. If it is converted at runtime from for instance file
input or macro expansion we need to add the involved overhead. But scanning a
stored macro body is pretty efficient, especially when the macro is part of the
loaded macro package (format file). The third method is comparable with reading
from file but here we need to add the overhead involved with storing the \LUA\
output in data structures suitable for \TEX's input mechanism, which can
involve memory allocation outside the reserved pool of tokens. On modern systems
that is not really a problem. It is good to keep in mind that when \TEX\ was
written much attention was paid to optimization and in \LUAMETATEX\ we even went
a bit further, also because we know what kind of input, processing and output
we're dealing with.

When reading from file or \LUA\ output we interpret bytes turned into \UNICODE\
numbers and that is when catcode regimes kick in: characters are interpreted
according to their catcode properties: escape character (backslash), curly braces
(grouping and arguments), dollars (math), etc. When reading from token lists
these catcodes have already been taken care of and we're basically interpreting
meanings instead of characters. By changing the catcode regime we can for
instance typeset content verbatim from files and \LUA\ strings but when reading
from token lists we're sort of frozen. There are tricks to reinterpret the token
list but that comes with overhead and limitations.
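
One such trick is the \ETEX\ primitive \type {\scantokens}, which writes its
argument back to a pseudo file and rereads it under the current catcode regime; a
sketch:

\starttyping
\def\MyData{a & b}               % tokenized with the current catcodes
\catcode`\&=12                   % make & an ordinary character
\scantokens\expandafter{\MyData} % rescan the stored tokens with the new regime
\stoptyping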

\stopsectionlevel

\startsectionlevel[title=Macros]

A macro can be seen as a named token with a meaning attached. In \LUAMETATEX\
macros can take up to 15 arguments (six more than regular \TEX) that can be
separated by so called delimiters. A token has a command property (operator) and
a value (operand). Because a \UNICODE\ character doesn't need all four bytes of
an integer and because in the engine numbers, dimensions and pointers are limited
in size we can store all of these efficiently with the command code. Here the
body of \type {\foo} is a list of three tokens:

\starttyping
\def\foo{abc} \foo \foo \foo
\stoptyping

When the engine fetches a token from a list it will interpret the command and
when it fetches from file it will create tokens on the fly and then interpret
those. When a file or list is exhausted the engine pops the stack and continues
at the previous level. Because macros are already tokenized they are more
efficient than file input. For more about macros you can consult the low level
document about them.

The more you use a macro, the more it pays off compared to a file. However, don't
overestimate this, because in the end the typesetting and expanding of all kinds
of other involved macros might reduce the file overhead to noise.
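
As an illustration of delimiters: here the first argument of the (made up)
\type {\MyMacro} is delimited by square brackets while the second is a regular
braced argument:

\starttyping
\def\MyMacro[#1]#2{#1: #2}

\MyMacro[label]{value} % gives: label: value
\stoptyping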

\stopsectionlevel

\startsectionlevel[title=Token lists]

A token list is like a macro but is part of the variable (register) system. It
is just a list (so no arguments) and you can append and prepend to that list.

\starttyping
\toks123={abc}    \the\toks123
\scratchtoks{abc} \the\scratchtoks
\stoptyping

Here \type {\scratchtoks} is defined with \type {\newtoks} which creates an
efficient reference to a list so that, contrary to the first line, no register
number has to be scanned. There are low level manuals about tokens and registers
that you can read if you want to know more about this. As with macros the list in
this example is three tokens long. Contrary to macros there is no macro overhead
as there is no need to check for arguments. \footnote {In \LUAMETATEX\ a macro
without arguments is also quite efficient.}

Because they use more or less the same storage method, macros and token list
registers perform the same. The power of registers comes from some additional
manipulators in \LUATEX\ (and \LUAMETATEX) and the fact that one can control
expansion with \type {\the}, although that latter advantage is compensated by
extensions to the macro language (like \type {\protected} macro definitions).
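
A sketch of such manipulators: the \LUATEX\ primitives \type {\toksapp} and
\type {\tokspre} append and prepend to a register without the usual \type {\the}
expansion detour:

\starttyping
\newtoks\MyToks

\MyToks{b}
\toksapp\MyToks{c} % append
\tokspre\MyToks{a} % prepend

\the\MyToks % gives: abc
\stoptyping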

\stopsectionlevel

\startsectionlevel[title=Buffers]

Buffers are something specific to \CONTEXT\ and they have always been part of
this system. A buffer is defined as follows:

\startbuffer
\startbuffer[one]
line 1
line 2
\stopbuffer
\stopbuffer

\typebuffer

Among the operations on buffers the next two are used most often:

\starttyping
\typebuffer[one]
\getbuffer[one]
\stoptyping

Scanning a buffer at the \TEX\ end takes a little effort because when we start
reading the catcodes are ignored and for instance backslashes and curly braces
are retained. Hardly any interpretation takes place. The same is true for
spacing, so multiple spaces are not collapsed and newlines stay. The tokenized
content of a buffer is converted back to a string and that content is then read
in as a pseudo file when we need it. So, basically buffers are files! In \MKII\
they actually were files (in the \type {\jobname} name space with suffix \type
{tmp}), but in \MKIV\ they are stored in and managed by \LUA. That also means
that you can set them very efficiently at the \LUA\ end:

\starttyping
\startluacode
buffers.assign("one",[[
line 1
line 2
]])
\stopluacode
\stoptyping

Always keep in mind that buffers eventually are read as files: character by
character, and at that time the content gets (as with other files) tokenized. A
buffer name is optional. You can nest buffers, with and without names.
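
The getters accept a comma separated list of names, so stored snippets can be
recombined in any order; assuming buffers \type {one} and \type {two} exist:

\starttyping
\getbuffer [one,two] % flush one, then two
\typebuffer[two,one] % verbatim, in reverse order
\stoptyping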

Because \CONTEXT\ is very much about re-use of content and selective processing
we have an (already old) subsystem for defining named blocks of text (using \type
{\begin...} and \type {\end...} tagging). These blocks are stored just like
buffers but selective flushing is part of the concept. Think of coding an
educational document with explanations, questions, answers and then typesetting
only the explanations, or the explanations along with some questions. Other
components can be typeset later so one can for instance make a special book(let)
with answers that either does or does not repeat the questions. Here we need
features like synchronization of numbers, which is why we cannot really use
buffers. An alternative is to use \XML\ and filter from that.

The \typ {\definebuffer} command defines a new buffer environment. When you set
buffers in \LUA\ you don't need to define a buffer because likely you don't need
the \type {\start} and \type {\stop} commands. Instead of \typ {\getbuffer} you
can also use \typ {\getdefinedbuffer} with defined buffers. In that case the
\type {before} and \type {after} keys of that specific instance are used.

The \typ {\getinlinebuffer} command, which like the getters takes a list of
buffer names, ignores leading and trailing spaces. When multiple buffers are
flushed this way, spacing between buffers is retained.
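
A small example; because leading and trailing spaces are ignored, the
parentheses end up snug against the content:

\starttyping
\startbuffer[word]
  hello
\stopbuffer

(\getinlinebuffer[word]) % gives: (hello)
\stoptyping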

The most important aspect of buffers is that the content is {\em not} interpreted
and tokenized: the bytes stay as they are.

\startbuffer
\definebuffer[MyBuffer]

\startMyBuffer
\bold{this is
a buffer}
\stopMyBuffer

\typeMyBuffer \getMyBuffer
\stopbuffer

\typebuffer

These commands result in:

\getbuffer

There are not that many parameters that can be set: \type {before}, \type {after}
and \type {strip} (when set to \type {no} leading and trailing spacing will be
kept). The \type {\stop...} command, in our example \typ {\stopMyBuffer}, can be
defined independently to do something after the buffer has been read and stored,
but by default nothing is done.

You can test if a buffer exists with \typ {\doifelsebuffer} (expandable) and \typ
{\doifelsebufferempty} (unexpandable). A buffer is kept in memory unless it gets
wiped clean with \typ {\resetbuffer}.
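
A sketch of these commands in action (braced name arguments for the test,
brackets for the reset):

\starttyping
\doifelsebuffer{one}
  {the buffer exists}
  {no such buffer}

\resetbuffer[one] % wipe it clean
\stoptyping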

\starttyping
\savebuffer      [MyBuffer][temp]     % gets name: jobname-temp.tmp
\savebufferinfile[MyBuffer][temp.log] % gets name: temp.log
\stoptyping

You can also stepwise fill such a buffer:

\starttyping
\definesavebuffer[slide]

\startslide
    \starttext
\stopslide
\startslide
    slide 1
\stopslide
text 1 \par
\startslide
    slide 2
\stopslide
text 2 \par
\startslide
    \stoptext
\stopslide
\stoptyping

After this you will have a file \type {\jobname-slide.tex} that has the two lines
wrapped as text. You can set up a \quote {save buffer} to use a different
filename (with the \type {file} key), a different prefix using \type {prefix} and
you can set up a \type {directory}. A different name is set with the \type {list}
key.
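
As a sketch of such a setup, assuming the keys are passed in the optional second
argument of the definition (the filename here is made up):

\starttyping
\definesavebuffer[slide][file=myslides.tex,directory=export]
\stoptyping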

You can assign content to a buffer with a somewhat clumsy interface where we use
the delimiter \type {\endbuffer}. The only restriction is that this delimiter
cannot be part of the content:

\starttyping
\setbuffer[name]here comes some text\endbuffer
\stoptyping

For more details and obscure commands that are used in other commands
you can peek into the source.

% These are somewhat obscure:
%
% \getbufferdata{...}
% \grabbufferdatadirect % name start stop
% \grabbufferdata % was: \dostartbuffer
% \thebuffernumber
% \thedefinedbuffer

Using buffers in the \CLD\ interface is tricky because of the catcode magic that
is involved but there are setters and getters:

\starttabulate[|T|T|]
\BC function               \BC arguments \NC \NR
\ML
\NC buffers.assign         \NC name, content [,catcodes] \NC \NR
%NC buffers.raw            \NC \NC \NR
\NC buffers.erase          \NC name \NC \NR
\NC buffers.prepend        \NC name, content \NC \NR
\NC buffers.append         \NC name, content \NC \NR
\NC buffers.exists         \NC name \NC \NR
\NC buffers.empty          \NC name \NC \NR
\NC buffers.getcontent     \NC name \NC \NR
\NC buffers.getlines       \NC name \NC \NR
%NC buffers.collectcontent \NC \NC \NR
%NC buffers.loadcontent    \NC \NC \NR
%NC buffers.get            \NC \NC \NR
%NC buffers.getmkiv        \NC \NC \NR
%NC buffers.gettexbuffer   \NC \NC \NR
%NC buffers.run            \NC \NC \NR
\stoptabulate

There are a few more helpers that are used in other (low level) commands. Their
functionality might adapt to their usage there. The \typ {context.startbuffer}
and \typ {context.stopbuffer} commands are defined somewhat differently from
regular \CLD\ commands.
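
A small sketch that combines some of the functions from the table at the \LUA\
end:

\starttyping
\startluacode
buffers.assign("demo", "line a\n")
buffers.append("demo", "line b\n")
if buffers.exists("demo") then
    context.type(buffers.getcontent("demo")) -- typeset verbatim
end
\stopluacode
\stoptyping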

\stopsectionlevel

\startsectionlevel[title=Setups]

A setup is basically a macro but it is stored and accessed in a namespace
separate from ordinary macros. One important characteristic is that inside setups
newlines are ignored.

\startbuffer
\startsetups MySetupA
    This is line 1
    and this is line 2
\stopsetups

\setup{MySetupA}
\stopbuffer

\typebuffer {\bf \getbuffer}

A simple way to still get a space at a line ending is to add a comment character
preceded by a space. Instead you can also use \type {\space}:

\startbuffer
\startsetups [MySetupB]
    This is line 1 %
    and this is line 2\space
    while here we have line 3
\stopsetups

\setup[MySetupB]
\stopbuffer

\typebuffer {\bf \getbuffer}

You can use square brackets instead of space delimited names in definitions and
also in calling up a (list of) setup(s). The \type {\directsetup} command takes a
single setup name and is therefore more efficient.
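
So these are equivalent ways of expanding a setup (the plural \type {\setups}
takes a list of names):

\starttyping
\setup      [MySetupA]          % bracketed name(s)
\setups     [MySetupA,MySetupB] % list of names
\directsetup{MySetupA}          % single name, least overhead
\stoptyping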

Setups are basically simple macros although there is some magic involved that
comes from their usage in for instance \XML\ where we pass an argument. That
means we can do the following:

\startbuffer
\startsetups MySetupC
    before#1after
\stopsetups

\setupwithargument{MySetupC}{ {\em and} }
\stopbuffer

\typebuffer {\bf \getbuffer}

Because a setup is a macro, the body is a linked list of tokens where each token
takes 8 bytes of memory, so \type {MySetupC} has 12 tokens that take 96 bytes of
memory (plus some overhead related to macro management).

\stopsectionlevel

\startsectionlevel[title=\XML]

Discussing \XML\ is outside the scope of this document but it is worth mentioning
that once an \XML\ tree is read in, the content is stored in strings and can be
filtered into \TEX, where it is interpreted as if coming from files (in this case
\LUA\ strings). If needed the content can be interpreted as \TEX\ input.

\stopsectionlevel

\startsectionlevel[title=\LUA]

As mentioned already, output from \LUA\ is stored and when a \LUA\ call finishes
it ends up on the so called input stack. Every time the engine needs a token it
will fetch from the input stack; the top of the stack can represent a file,
token list or \LUA\ output. Interpreting bytes from files or \LUA\ strings
results in tokens. As a side note: \LUA\ output can also be already tokenized,
because we can actually write tokens and nodes from \LUA, but that's more an
implementation detail that makes the \LUA\ input stack entries a bit more
complex. It is normally not something users will do when they use \LUA\ in their
documents.
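
The usual way this surfaces for users is the \type {context} command, whose
string output ends up on the input stack and is then interpreted as \TEX\ input
(note the doubled backslashes in the \LUA\ string):

\starttyping
\startluacode
context("This becomes \\TEX\\ input: {\\em typeset} later.")
\stopluacode
\stoptyping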

\stopsectionlevel

\startsectionlevel[title=Protection]

When you define macros there is the danger of overloading ones defined by the
system. It is best to use CamelCase names so that you stay away from clashes. You
can enable some checking:

\starttyping
\enabledirectives[overloadmode=warning]
\stoptyping

or when you want to quit on a clash:

\starttyping
\enabledirectives[overloadmode=error]
\stoptyping

When this mode is enabled you can get around the check with:

\starttyping
\pushoverloadmode
  ...
\popoverloadmode
\stoptyping

But delay that till you're sure that redefining is okay.

\stopsectionlevel

% efficiency

\stopdocument