% language=us runpath=texruns:manuals/luametatex

% lua.newtable

\environment luametatex-style

\startcomponent luametatex-pdf

\startchapter[reference=pdf,title={The \PDF\ related libraries}]

\startsection[title={The \type {pdfe} library}][library=pdfe]

\startsubsection[title={Introduction}]

\topicindex{\PDF+objects}

\topicindex{\PDF+analyze}
\topicindex{\PDF+\type{pdfe}}

The \type {pdfe} library replaces the \type {epdf} library and provides an
interface to \PDF\ files. It uses the same code as is used for \PDF\ image
inclusion. The \type {pplib} library by Paweł Jackowski replaces the \type
{poppler} (derived from \type {xpdf}) library.

A \PDF\ file is basically a tree of objects and one descends into the tree via
dictionaries (key/value) and arrays (index/value). A few topmost dictionaries
that start at the root can be accessed more directly.

Although everything in \PDF\ is basically an object we only wrap a few in so
called userdata \LUA\ objects.

\starttabulate[|l|l|]
\DB type          \BC mapping \NC \NR
\TB
\BC \PDF          \BC \LUA \NC \NR
\NC null          \NC nil \NC \NR
\NC boolean       \NC boolean \NC \NR
\NC integer       \NC integer \NC \NR
\NC float         \NC number \NC \NR
\NC name          \NC string \NC \NR
\NC string        \NC string \NC \NR
\NC array         \NC array userdatum \NC \NR
\NC dictionary    \NC dictionary userdatum \NC \NR
\NC stream        \NC stream userdatum (with related dictionary) \NC \NR
\NC reference     \NC reference userdatum \NC \NR
\LL
\stoptabulate

The regular getters return these \LUA\ data types but one can also get more
detailed information.

\stopsubsection

\startsubsection[title={\type {open}, \type {openfile}, \type {new}, \type {getstatus}, \type {close}, \type {unencrypt}}]

\libindex {open}
\libindex {openfile}
\libindex {new}
\libindex {getstatus}
\libindex {close}
\libindex {unencrypt}

A document is loaded from a file (by name or handle) or string:

\starttyping
<pdfe document> = pdfe.open(filename)
<pdfe document> = pdfe.openfile(filehandle)
<pdfe document> = pdfe.new(somestring,somelength)
\stoptyping

Such a document is closed with:

\starttyping
pdfe.close(<pdfe document>)
\stoptyping

You can check if a document opened well with:

\starttyping
pdfe.getstatus(<pdfe document>)
\stoptyping

The returned codes are:

\starttabulate[|c|l|]
\DB value      \BC explanation \NC \NR
\TB
\NC \type {-2} \NC the document is (still) protected \NC \NR
\NC \type {-1} \NC the document failed to open \NC \NR
\NC \type  {0} \NC the document is not encrypted \NC \NR
\NC \type  {1} \NC the document has been unencrypted \NC \NR
\LL
\stoptabulate

An encrypted document can be unencrypted by the next command where instead of
either password you can give \type {nil}:

\starttyping
pdfe.unencrypt(<pdfe document>,userpassword,ownerpassword)
\stoptyping
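
The calls above can be combined into a small helper. The following is only a
sketch: the name \type {opendocument} is made up for this example, it assumes
that \type {pdfe.open} may return \type {nil} when nothing could be loaded, and
it treats the status values from the table as the only possible outcomes.

\starttyping
-- hypothetical helper, not part of the pdfe library
local function opendocument(filename,userpassword,ownerpassword)
    local document = pdfe.open(filename)
    if not document then
        return nil, "not loaded"
    end
    local status = pdfe.getstatus(document)
    if status == -2 then
        -- still protected: try the passwords, either one may be nil
        pdfe.unencrypt(document,userpassword,ownerpassword)
        status = pdfe.getstatus(document)
    end
    if status == 0 or status == 1 then
        return document, status
    end
    if status == -2 then
        -- opened but not usable, so we release it again
        pdfe.close(document)
    end
    return nil, status
end
\stoptyping

A caller only has to test the first return value and is expected to call \type
{pdfe.close} itself when it is done with the document.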

\stopsubsection

\startsubsection[title={\type {getsize}, \type {getversion}, \type {getnofobjects}, \type {getnofpages}, \type {getmemoryusage}}]

\libindex {getsize}
\libindex {getversion}
\libindex {getnofobjects}
\libindex {getnofpages}
\libindex {getmemoryusage}

A successfully opened document can provide some information:

\starttyping
bytes = getsize(<pdfe document>)
major, minor = getversion(<pdfe document>)
n = getnofobjects(<pdfe document>)
n = getnofpages(<pdfe document>)
bytes, waste = getmemoryusage(<pdfe document>)
\stoptyping
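
As an illustration, these getters can be collected in one table. This is a
sketch only: \type {describe} is a made up name and we assume that the
functions live in the \type {pdfe} namespace like \type {pdfe.open} does.

\starttyping
-- hypothetical helper, not part of the pdfe library
local function describe(document)
    local major, minor = pdfe.getversion(document)
    return {
        bytes      = pdfe.getsize(document),
        version    = tostring(major) .. "." .. tostring(minor),
        nofobjects = pdfe.getnofobjects(document),
        nofpages   = pdfe.getnofpages(document),
    }
end
\stoptyping

The result is a plain \LUA\ table, so it can be kept around after the document
has been closed.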

\stopsubsection

\startsubsection[title={\type {get[catalog|trailer|info]}}]

\libindex {getcatalog}
\libindex {gettrailer}
\libindex {getinfo}

For accessing the document structure you start with the so called catalog, a
dictionary:

\starttyping
<pdfe dictionary> = pdfe.getcatalog(<pdfe document>)
\stoptyping

The other two root dictionaries are accessed with:

\starttyping
<pdfe dictionary> = pdfe.gettrailer(<pdfe document>)
<pdfe dictionary> = pdfe.getinfo(<pdfe document>)
\stoptyping
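
A minimal sketch of how the three root dictionaries might be inspected follows.
The name \type {roots} is made up, the keys \type {Type}, \type {Title}, \type
{Producer} and \type {Size} are the usual \PDF\ names and need not be present
in every file, and the typed getters are the ones described in one of the next
sections.

\starttyping
-- hypothetical helper, not part of the pdfe library
local function roots(document)
    local catalog = pdfe.getcatalog(document)
    local trailer = pdfe.gettrailer(document)
    local info    = pdfe.getinfo(document)
    return {
        -- a missing dictionary or key simply leaves the field unset
        catalogtype = catalog and pdfe.getname   (catalog,"Type"),
        title       = info    and pdfe.getstring (info,"Title"),
        producer    = info    and pdfe.getstring (info,"Producer"),
        nofobjects  = trailer and pdfe.getinteger(trailer,"Size"),
    }
end
\stoptyping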

\stopsubsection

\startsubsection[title={\type {getpage}, \type {getbox}}]

\libindex {getpage}
\libindex {getbox}

A specific page can conveniently be reached with the next command, which
returns a dictionary.

\starttyping
<pdfe dictionary> = pdfe.getpage(<pdfe document>,pagenumber)
\stoptyping

Another convenience command gives you the (bounding) box of a (normally page)
dictionary, which can be inherited from the document itself. An example of a
valid box name is \type {MediaBox}.

\starttyping
box = pdfe.getbox(<pdfe dictionary>,boxname)
\stoptyping
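
Both commands can be combined in a loop over all pages. In this sketch \type
{pageboxes} is a made up name, page numbers are assumed to start at one, and
whatever \type {getbox} returns is stored as it comes because its exact shape
is not described here.

\starttyping
-- hypothetical helper, not part of the pdfe library
local function pageboxes(document,boxname)
    local boxes = { }
    for pagenumber=1,pdfe.getnofpages(document) do
        local page = pdfe.getpage(document,pagenumber)
        -- false marks a page that could not be reached
        boxes[pagenumber] = page and pdfe.getbox(page,boxname or "MediaBox") or false
    end
    return boxes
end
\stoptyping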

\stopsubsection

\startsubsection[title={\type {get[string|integer|number|boolean|name]}}]

\libindex {getstring}
\libindex {getinteger}
\libindex {getnumber}
\libindex {getboolean}
\libindex {getname}

Common values in dictionaries and arrays are strings, integers, floats, booleans
and names (which are also strings) and these are also normal \LUA\ objects:

\starttyping
s = getstring (<pdfe array|dictionary>,index|key)
i = getinteger(<pdfe array|dictionary>,index|key)
n = getnumber (<pdfe array|dictionary>,index|key)
b = getboolean(<pdfe array|dictionary>,index|key)
n = getname   (<pdfe array|dictionary>,index|key)
\stoptyping

The \type {getstring} function has two extra variants:

\starttyping
s, h = getstring (<pdfe array|dictionary>,index|key,false)
s    = getstring (<pdfe array|dictionary>,index|key,true)
\stoptyping

The first call returns the original string plus a boolean indicating if the
string is hex encoded. The second call returns the unencoded string.
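
To see the difference between the two variants one could fetch both for the
same key. The helper name \type {stringdetails} is made up and the sketch
assumes that the function is reached as \type {pdfe.getstring}.

\starttyping
-- hypothetical helper, not part of the pdfe library
local function stringdetails(dictionary,key)
    -- third argument false: original string plus the hex flag
    local original, ishex = pdfe.getstring(dictionary,key,false)
    -- third argument true: the unencoded string
    local unencoded = pdfe.getstring(dictionary,key,true)
    return {
        original  = original,
        hex       = ishex,
        unencoded = unencoded,
    }
end
\stoptyping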

\stopsubsection

\startsubsection[title={\type {get[dictionary|array|stream]}}]

\libindex {getdictionary} \libindex {getfromdictionary}
\libindex {getarray}      \libindex {getfromarray}
\libindex {getstream}     \libindex {getfromstream}

Normally you will use an index in an array and key in a dictionary but dictionaries
also accept an index. The size of an array or dictionary is available with the
usual \type {#} operator.

\starttyping
<pdfe dictionary>   = getdictionary(<pdfe array|dictionary>,index|key)
<pdfe array>        = getarray     (<pdfe array|dictionary>,index|key)
<pdfe stream>,
<pdfe dictionary>   = getstream    (<pdfe array|dictionary>,index|key)
\stoptyping

These commands return dictionaries, arrays and streams, which are dictionaries
with a blob of data attached.

Before we come to an alternative access mode, we mention that the objects provide
access in a different way too, for instance this is valid:

\starttyping
print(pdfe.open("foo.pdf").Catalog.Type)
\stoptyping

At the topmost level there are \type {Catalog}, \type {Info}, \type {Trailer}
and \type {Pages}, so this is also okay:

\starttyping
print(pdfe.open("foo.pdf").Pages[1])
\stoptyping
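
The field access shown above also works in a loop. The next sketch collects
the \type {Type} entry of each page. It assumes that \type {Pages} supports the
\type {#} operator and that a page dictionary can be indexed by key in the same
way as the catalog, and \type {pagetypes} is a made up name.

\starttyping
-- hypothetical helper, not part of the pdfe library
local function pagetypes(filename)
    local document = pdfe.open(filename)
    if not document then
        return nil
    end
    local pages = document.Pages
    local types = { }
    for i=1,#pages do
        types[i] = pages[i].Type
    end
    pdfe.close(document)
    return types
end
\stoptyping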

\stopsubsection

\startsubsection[title={\type {[open|close|readfrom|readwhole]stream}}]

\libindex {openstream}
\libindex {closestream}
\libindex {readfromstream}
\libindex {readwholestream}

Streams are sort of special. When your index or key hits a stream you get back a
stream object and dictionary object. The dictionary you can access in the usual
way and for the stream there are the following methods:

\starttyping
okay   = openstream(<pdfe stream>,[decode])
         closestream(<pdfe stream>)
str, n = readfromstream(<pdfe stream>)
str, n = readwholestream(<pdfe stream>,[decode])
\stoptyping

You either read in chunks, or you ask for the whole. When reading in chunks, you
need to open and close the stream yourself. The \type {n} value indicates the
length read. The \type {decode} parameter controls if the stream data gets
uncompressed.

As with dictionaries, you can access fields in a stream dictionary in the usual
\LUA\ way too. You get the content when you \quote {call} the stream. You can
pass a boolean that indicates if the stream has to be decompressed.
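
Reading in chunks could look as follows. This is a sketch under assumptions:
the name \type {readchunks} is made up, and we guess that \type
{readfromstream} signals the end of the data by returning nothing or a zero
length, which is not stated above.

\starttyping
-- hypothetical helper, not part of the pdfe library
local function readchunks(stream,decode)
    local chunks = { }
    local total  = 0
    if pdfe.openstream(stream,decode) then
        while true do
            local str, n = pdfe.readfromstream(stream)
            -- assumed end condition: nothing came back
            if not str or not n or n == 0 then
                break
            end
            chunks[#chunks+1] = str
            total = total + n
        end
        pdfe.closestream(stream)
    end
    return table.concat(chunks), total
end
\stoptyping

When the data is needed in one piece anyway, a single \type {readwholestream}
call is simpler and does not need the explicit open and close.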

% pdfe.objectcodes      = objectcodes
% pdfe.stringcodes      = stringcodes
% pdfe.encryptioncodes  = encryptioncodes

\stopsubsection

\startsubsection[title={\type {getfrom[dictionary|array]}}]

\libindex {getfromdictionary}
\libindex {getfromarray}

In addition to the interface described before, there is also a bit lower level
interface available.

\starttyping
key, type, value, detail = getfromdictionary(<pdfe dictionary>,index)
type, value, detail = getfromarray(<pdfe array>,index)
\stoptyping

\starttabulate[|c|l|l|l|]
\DB type       \BC meaning    \BC value            \BC detail \NC \NR
\TB
\NC \type {0}  \NC none       \NC nil              \NC \NC \NR
\NC \type {1}  \NC null       \NC nil              \NC \NC \NR
\NC \type {2}  \NC boolean    \NC boolean          \NC \NC \NR
\NC \type {3}  \NC integer    \NC integer          \NC \NC \NR
\NC \type {4}  \NC number     \NC float            \NC \NC \NR
\NC \type {5}  \NC name       \NC string           \NC \NC \NR
\NC \type {6}  \NC string     \NC string           \NC hex \NC \NR
\NC \type {7}  \NC array      \NC arrayobject      \NC size \NC \NR
\NC \type {8}  \NC dictionary \NC dictionaryobject \NC size \NC \NR
\NC \type {9}  \NC stream     \NC streamobject     \NC dictionary size \NC \NR
\NC \type {10} \NC reference  \NC integer          \NC \NC \NR
\LL
\stoptabulate

A \type {hex} string is (in the \PDF\ file) surrounded by \type {<>} while plain
strings are bounded by \type {()}.
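
The type codes from the table can be mapped onto names when walking over a
dictionary by index. In this sketch \type {typenames} and \type
{listdictionary} are made up names, indices are assumed to start at one, and
the function is assumed to live in the \type {pdfe} namespace.

\starttyping
-- the codes follow the table above, index 0 is set explicitly
local typenames = {
    [0] = "none",
    "null", "boolean", "integer", "number", "name",
    "string", "array", "dictionary", "stream", "reference",
}

-- hypothetical helper, not part of the pdfe library
local function listdictionary(dictionary)
    local entries = { }
    for index=1,#dictionary do
        local key, kind, value, detail = pdfe.getfromdictionary(dictionary,index)
        entries[index] = {
            key    = key,
            type   = typenames[kind] or kind,
            value  = value,
            detail = detail,
        }
    end
    return entries
end
\stoptyping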

\stopsubsection

\startsubsection[title={\type {[dictionary|array]totable}}]

\libindex {dictionarytotable}
\libindex {arraytotable}

All entries in a dictionary or array can be fetched with the following commands
where the return values are a hashed or indexed table.

\starttyping
hash = dictionarytotable(<pdfe dictionary>)
list = arraytotable(<pdfe array>)
\stoptyping

You can get a list of pages with:

\starttyping
{ { <pdfe dictionary>, size, objnum }, ... } = pagestotable(<pdfe document>)
\stoptyping
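
A sketch that combines both kinds of tables: for each page we keep the object
number, the size and a sorted list of the keys in the page dictionary. The name
\type {dumppages} is made up and only the keys of the hashed table are used
because the shape of its values is not described here.

\starttyping
-- hypothetical helper, not part of the pdfe library
local function dumppages(document)
    local result = { }
    for i, entry in ipairs(pdfe.pagestotable(document)) do
        local dictionary, size, objnum = entry[1], entry[2], entry[3]
        local keys = { }
        for key in pairs(pdfe.dictionarytotable(dictionary)) do
            keys[#keys+1] = key
        end
        table.sort(keys)
        result[i] = {
            objnum = objnum,
            size   = size,
            keys   = keys,
        }
    end
    return result
end
\stoptyping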

\stopsubsection

\startsubsection[title={\type {getfromreference}}]

\libindex {getfromreference}

Because you can have unresolved references, a reference object can be resolved
with:

\starttyping
type, <pdfe dictionary|array|stream>, detail = getfromreference(<pdfe reference>)
\stoptyping

So, as second value you get back a new \type {pdfe} userdata object that you can
query.
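
A small wrapper can hide the type check. This sketch assumes that the returned
type uses the same codes as the table in the section about \type
{getfromdictionary}, so 7, 8 and 9 for array, dictionary and stream, and \type
{resolve} is a made up name.

\starttyping
-- hypothetical helper, not part of the pdfe library
local function resolve(reference)
    local kind, object, detail = pdfe.getfromreference(reference)
    if kind == 7 or kind == 8 or kind == 9 then
        -- array, dictionary or stream: something we can query further
        return object, kind, detail
    end
    return nil, kind
end
\stoptyping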

\stopsubsection

\stopsection

\startsection[title={Memory streams}][library=pdfe]

\topicindex{\PDF+memory streams}

\libindex {new}

The \type {pdfe.new} function takes three arguments:

\starttabulate
\DB value           \BC explanation      \NC \NR
\TB
\NC \type {stream}  \NC this is a (in low level \LUA\ speak) light userdata
                        object, i.e.\ a pointer to a sequence of bytes \NC \NR
\NC \type {length}  \NC this is the length of the stream in bytes (the stream can
                        have embedded zeros) \NC \NR
\NC \type {name}    \NC optional, this is a unique identifier that is used for
                        hashing the stream \NC \NR
\LL
\stoptabulate

The third argument is optional. When it is not given the function will return a
\type {pdfe} document object as with a regular file, otherwise it will return a
filename that can be used elsewhere (e.g.\ in the image library) to reference the
stream as pseudo file.

Instead of a light userdata stream (which is actually fragile but handy when you
come from a library) you can also pass a \LUA\ string, in which case the given
length is (at most) the string length.

The function returns a \type {pdfe} object and a string. The string can be used in
the \type {img} library instead of a filename. You need to prevent garbage
collection of the object when you use it as image (for instance by storing it
somewhere).
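
The \LUA\ string variant can be tried without any external library by reading
a file into memory first. This is a sketch: \type {loadfrommemory} is a made up
name, only the two argument form is used, and whatever \type {pdfe.new} returns
is passed on unchanged.

\starttyping
-- hypothetical helper, not part of the pdfe library
local function loadfrommemory(filename)
    local handle = io.open(filename,"rb")
    if not handle then
        return nil
    end
    local data = handle:read("a") -- the whole file, zero bytes included
    handle:close()
    return pdfe.new(data,#data)
end
\stoptyping

The caller should keep the returned object in a variable for as long as it is
needed, in line with the remark about garbage collection above.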

Both the memory stream and its use in the image library are experimental and can
change. In case you wonder where this can be used: when you use the swiglib
library for \type {graphicmagick}, it can return such a userdata object. This
permits conversion in memory and passing the result directly to the backend. This
might save some runtime in one|-|pass workflows. This feature is currently not
meant for production and we might come up with a better implementation.

\stopsection

\startsection[title={The \type {pdfscanner} library}][library=pdfscanner]

This library is not available in \LUAMETATEX.

\stopsection

% \startsection[title={The \type {pdfscanner} library}][library=pdfscanner]
%
% \topicindex{\PDF+scanner}
%
% \libindex {scan}
%
% The \type {pdfscanner} library allows interpretation of \PDF\ content streams and
% \type {/ToUnicode} (cmap) streams. You can get those streams from the \type
% {pdfe} library, as explained in an earlier section. There is only a single
% top|-|level function in this library:
%
% \startfunctioncall
% pdfscanner.scan (<pdfe stream>, <table> operatortable, <table> info)
% pdfscanner.scan (<pdfe array>, <table> operatortable, <table> info)
% pdfscanner.scan (<string>, <table> operatortable, <table> info)
% \stopfunctioncall
%
% The first argument should be a \LUA\ string or a stream or array object coming
% from the \type {pdfe} library. The second argument, \type {operatortable}, should
% be a \LUA\ table where the keys are \PDF\ operator name strings and the values
% are \LUA\ functions (defined by you) that are used to process those operators.
% The functions are called whenever the scanner finds one of these \PDF\ operators
% in the content stream(s). The functions are called with two arguments: the \type
% {scanner} object itself, and the \type {info} table that was passed as the third
% argument to \type {pdfscanner.scan}.
%
% Internally, \type {pdfscanner.scan} loops over the \PDF\ operators in the
% stream(s), collecting operands on an internal stack until it finds a \PDF\
% operator. If that \PDF\ operator's name exists in \type {operatortable}, then the
% associated function is executed. After the function has run (or when there is no
% function to execute) the internal operand stack is cleared in preparation for the
% next operator, and processing continues.
%
% The \type {scanner} argument to the processing functions is needed because it
% offers various methods to get the actual operands from the internal operand
% stack.
%
% A simple example of processing a \PDF's document stream could look like this:
%
% \starttyping
% local operatortable = { }
%
% operatortable.Do = function(scanner,info)
%     local resources = info.resources
%     if resources then
%         local val     = scanner:pop()
%         local name    = val[2]
%         local xobject = resources.XObject
%         print(info.space .. "Uses XObject " .. name)
%         local resources = xobject.Resources
%         if resources then
%             local newinfo =  {
%                 space     = info.space .. "  ",
%                 resources = resources,
%             }
%             pdfscanner.scan(entry, operatortable, newinfo)
%         end
%     end
% end
%
% local function Analyze(filename)
%     local doc = pdfe.open(filename)
%     if doc then
%         local pages = doc.Pages
%         for i=1,#pages do
%             local page = pages[i]
%             local info = {
%               space     = "  " ,
%               resources = page.Resources,
%             }
%             print("Page " .. i)
%          -- pdfscanner.scan(page.Contents,operatortable,info)
%             pdfscanner.scan(page.Contents(),operatortable,info)
%         end
%     end
% end
%
% Analyze("foo.pdf")
% \stoptyping
%
% This example iterates over all the actual content in the \PDF, and prints out the
% found \type {XObject} names. While the code demonstrates quite some of the \type
% {pdfe} functions, let's focus on the \type {pdfscanner} specific code
% instead.
%
% From the bottom up, the following line runs the scanner with the \PDF\ page's
% top|-|level content given in the first argument.
%
% The third argument, \type {info}, contains two entries: \type {space} is used to
% indent the printed output, and \type {resources} is needed so that embedded \type
% {XForms} can find their own content.
%
% The second argument, \type {operatortable} defines a processing function for a
% single \PDF\ operator, \type {Do}.
%
% The function \type {Do} prints the name of the current \type {XObject}, and then
% starts a new scanner for that object's content stream, under the condition that
% the \type {XObject} is in fact a \type {/Form}. That nested scanner is called
% with new \type {info} argument with an updated \type {space} value so that the
% indentation of the output nicely nests, and with a new \type {resources} field
% to help the next iteration down to properly process any other, embedded \type
% {XObject}s.
%
% Of course, this is not a very useful example in practice, but for the purpose of
% demonstrating \type {pdfscanner}, it is just long enough. It makes use of only
% one \type {scanner} method: \type {scanner:pop()}. That function pops the top
% operand of the internal stack, and returns a \LUA\ table where the object at index
% one is a string representing the type of the operand, and object two is its
% value.
%
% The list of possible operand types and associated \LUA\ value types is:
%
% \starttabulate[|l|l|]
% \DB types           \BC type      \NC \NR
% \TB
% \NC \type{integer}  \NC <number>  \NC \NR
% \NC \type{real}     \NC <number>  \NC \NR
% \NC \type{boolean}  \NC <boolean> \NC \NR
% \NC \type{name}     \NC <string>  \NC \NR
% \NC \type{operator} \NC <string>  \NC \NR
% \NC \type{string}   \NC <string>  \NC \NR
% \NC \type{array}    \NC <table>   \NC \NR
% \NC \type{dict}     \NC <table>   \NC \NR
% \LL
% \stoptabulate
%
% In case of \type {integer} or \type {real}, the value is always a \LUA\ (floating
% point) number. In case of \type {name}, the leading slash is always stripped.
%
% In case of \type {string}, please bear in mind that \PDF\ actually supports
% different types of strings (with different encodings) in different parts of the
% \PDF\ document, so you may need to reencode some of the results; \type {pdfscanner}
% always outputs the byte stream without reencoding anything. \type {pdfscanner}
% does not differentiate between literal strings and hexadecimal strings (the
% hexadecimal values are decoded), and it treats the stream data for inline images
% as a string that is the single operand for \type {EI}.
%
% In case of \type {array}, the table content is a list of \type {pop} return
% values and in case of \type {dict}, the table keys are \PDF\ name strings and the
% values are \type {pop} return values.
%
% \libindex{pop}
% \libindex{popnumber}
% \libindex{popname}
% \libindex{popstring}
% \libindex{poparray}
% \libindex{popdictionary}
% \libindex{popboolean}
% \libindex{done}
%
% There are a few more methods defined that you can ask \type {scanner}:
%
% \starttabulate[|l|p|]
% \DB method               \BC explanation \NC \NR
% \TB
% \NC \type{pop}           \NC see above \NC \NR
% \NC \type{popnumber}     \NC return only the value of a \type {real} or \type {integer} \NC \NR
% \NC \type{popname}       \NC return only the value of a \type {name} \NC \NR
% \NC \type{popstring}     \NC return only the value of a \type {string} \NC \NR
% \NC \type{poparray}      \NC return only the value of a \type {array} \NC \NR
% \NC \type{popdictionary} \NC return only the value of a \type {dict} \NC \NR
% \NC \type{popboolean}    \NC return only the value of a \type {boolean} \NC \NR
% \NC \type{done}          \NC abort further processing of this \type {scan()} call \NC \NR
% \LL
% \stoptabulate
%
% The \type {pop*} are convenience functions, and come in handy when you know the
% type of the operands beforehand (which you usually do, in \PDF). For example, the
% \type {Do} function could have used \type {local name = scanner:popname()}
% instead, because the single operand to the \type {Do} operator is always a \PDF\
% name object.
%
% The \type {done} function allows you to abort processing of a stream once you
% have learned everything you want to learn. This comes in handy while parsing
% \type {/ToUnicode}, because there usually is trailing garbage that you are not
% interested in. Without \type {done}, processing only ends at the end of the
% stream, possibly wasting \CPU\ cycles.
%
% {\em We keep the older names \type {popNumber}, \type {popName}, \type
% {popString}, \type {popArray}, \type {popDict} and \type {popBool} around.}
%
% \stopsection

\stopchapter

\stopcomponent
575