\environment luametatexstyle

\startcomponent luametatexpdf

\startchapter[reference=pdf,title={The \PDF\ related libraries}]

\startsection[title={The \type {pdfe} library}][library=pdfe]

\startsubsection[title={Introduction}]

\topicindex{\PDF+objects}

\topicindex{\PDF+analyze}
\topicindex{\PDF+\type{pdfe}}

The \type {pdfe} library replaces the \type {epdf} library and provides an
interface to \PDF\ files. It uses the same code as is used for \PDF\ image
inclusion. The \type {pplib} library by Paweł Jackowski replaces the \type
{poppler} (derived from \type {xpdf}) library.

A \PDF\ file is basically a tree of objects and one descends into the tree via
dictionaries (key/value) and arrays (index/value). A few topmost dictionaries
that start at the root can be accessed more directly.

Although everything in \PDF\ is basically an object we only wrap a few in so
called userdata \LUA\ objects.

\starttabulate[|l|l|]
\DB type       \BC mapping \NC \NR
\TB
\BC \PDF       \BC \LUA \NC \NR
\NC null       \NC nil \NC \NR
\NC boolean    \NC boolean \NC \NR
\NC integer    \NC integer \NC \NR
\NC float      \NC number \NC \NR
\NC name       \NC string \NC \NR
\NC string     \NC string \NC \NR
\NC array      \NC array userdatum \NC \NR
\NC dictionary \NC dictionary userdatum \NC \NR
\NC stream     \NC stream userdatum (with related dictionary) \NC \NR
\NC reference  \NC reference userdatum \NC \NR
\LL
\stoptabulate

The regular getters return these \LUA\ data types but one can also get more
detailed information.

\stopsubsection

\startsubsection[title={\type {open}, \type {openfile}, \type {new}, \type {getstatus}, \type {close}, \type {unencrypt}}]

\libindex {open}
\libindex {openfile}
\libindex {new}
\libindex {getstatus}
\libindex {close}
\libindex {unencrypt}

A document is loaded from a file (by name or handle) or string:

\starttyping
<pdfe document> = pdfe.open(filename)
<pdfe document> = pdfe.openfile(filehandle)
<pdfe document> = pdfe.new(somestring,somelength)
\stoptyping

Such a document is closed with:

\starttyping
pdfe.close(<pdfe document>)
\stoptyping

You can check whether a document opened well with:

\starttyping
pdfe.getstatus(<pdfe document>)
\stoptyping

The returned codes are:

\starttabulate[|c|l|]
\DB value       \BC explanation \NC \NR
\TB
\NC \type {-2}  \NC the document is (still) protected \NC \NR
\NC \type {-1}  \NC the document failed to open \NC \NR
\NC \type {0}   \NC the document is not encrypted \NC \NR
\NC \type {1}   \NC the document has been unencrypted \NC \NR
\LL
\stoptabulate

An encrypted document can be unencrypted by the next command where instead of
either password you can give \type {nil}:

\starttyping
pdfe.unencrypt(<pdfe document>,userpassword,ownerpassword)
\stoptyping
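
As a minimal sketch (not a definitive recipe) the calls above can be combined
into a helper that opens a file and tries the given passwords when needed. The
name \type {opendocument} and the password \type {secret} are made up for this
example. We assume that a failed \type {pdfe.open} returns nothing and that
negative status values (\type {-2} for a still protected document) signal a
problem.

\starttyping
-- hypothetical helper, not part of the pdfe library
local function opendocument(filename,userpassword,ownerpassword)
    local document = pdfe.open(filename)
    if not document then
        return nil, "open failed" -- assumption: no object on failure
    end
    local status = pdfe.getstatus(document)
    if status == -2 then
        -- still protected: either password may be nil
        pdfe.unencrypt(document,userpassword,ownerpassword)
        status = pdfe.getstatus(document)
    end
    if status < 0 then
        -- failed or still protected, so we give up
        pdfe.close(document)
        return nil, status
    end
    return document, status
end

local document, status = opendocument("foo.pdf",nil,"secret")
if document then
    print("status",status)
    pdfe.close(document)
end
\stoptyping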

\stopsubsection

\startsubsection[title={\type {getsize}, \type {getversion}, \type {getnofobjects}, \type {getnofpages}}]

\libindex {getsize}
\libindex {getversion}
\libindex {getnofobjects}
\libindex {getnofpages}

A successfully opened document can provide some information:

\starttyping
bytes        = getsize(<pdfe document>)
major, minor = getversion(<pdfe document>)
n            = getnofobjects(<pdfe document>)
n            = getnofpages(<pdfe document>)
bytes, waste = getmemoryusage(<pdfe document>)
\stoptyping
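
The next lines are only a sketch of how these getters could be used to print a
short report. We assume that the functions live in the \type {pdfe} namespace,
like the other functions in this chapter, and \type {foo.pdf} is just a
placeholder name.

\starttyping
local document = pdfe.open("foo.pdf")
if document then
    local major, minor = pdfe.getversion(document)
    print("version", major .. "." .. minor)
    print("bytes",   pdfe.getsize(document))
    print("objects", pdfe.getnofobjects(document))
    print("pages",   pdfe.getnofpages(document))
    pdfe.close(document)
end
\stoptyping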

\stopsubsection

\startsubsection[title={\type {get[catalog|trailer|info]}}]

\libindex {getcatalog}
\libindex {gettrailer}
\libindex {getinfo}

For accessing the document structure you start with the so called catalog, a
dictionary:

\starttyping
<pdfe dictionary> = pdfe.getcatalog(<pdfe document>)
\stoptyping

The other two root dictionaries are accessed with:

\starttyping
<pdfe dictionary> = pdfe.gettrailer(<pdfe document>)
<pdfe dictionary> = pdfe.getinfo(<pdfe document>)
\stoptyping

\stopsubsection

\startsubsection[title={\type {getpage}, \type {getbox}}]

\libindex {getpage}
\libindex {getbox}

A specific page can conveniently be reached with the next command, which
returns a dictionary.

\starttyping
<pdfe dictionary> = pdfe.getpage(<pdfe document>,pagenumber)
\stoptyping

Another convenience command gives you the (bounding) box of a (normally page)
dictionary. The box can be inherited from the document itself. An example of a
valid box name is \type {MediaBox}.

\starttyping
box = pdfe.getbox(<pdfe dictionary>,boxname)
\stoptyping
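
As an illustration, the following sketch walks over all pages and reports the
media box of each. The text above does not describe what \type {getbox}
returns, so we assume an indexed table with four numbers, and \type {foo.pdf}
is again a placeholder.

\starttyping
local document = pdfe.open("foo.pdf")
if document then
    for pagenumber = 1, pdfe.getnofpages(document) do
        local page = pdfe.getpage(document,pagenumber)
        -- assumption: a table with four coordinates, or nil when absent
        local box = page and pdfe.getbox(page,"MediaBox")
        if box then
            print(pagenumber, box[1], box[2], box[3], box[4])
        end
    end
    pdfe.close(document)
end
\stoptyping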

\stopsubsection

\startsubsection[title={\type {get[string|integer|number|boolean|name]}}]

\libindex {getstring}
\libindex {getinteger}
\libindex {getnumber}
\libindex {getboolean}
\libindex {getname}

Common values in dictionaries and arrays are strings, integers, floats, booleans
and names (which are also strings) and these are also normal \LUA\ objects:

\starttyping
s = getstring (<pdfe array|dictionary>,index|key)
i = getinteger(<pdfe array|dictionary>,index|key)
n = getnumber (<pdfe array|dictionary>,index|key)
b = getboolean(<pdfe array|dictionary>,index|key)
n = getname   (<pdfe array|dictionary>,index|key)
\stoptyping

The \type {getstring} function has two extra variants:

\starttyping
s, h = getstring (<pdfe array|dictionary>,index|key,false)
s    = getstring (<pdfe array|dictionary>,index|key,true)
\stoptyping

The first call returns the original string plus a boolean indicating if the
string is hex encoded. The second call returns the decoded string.
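
A small example might help here. It is a sketch that fetches the title from
the info dictionary in both ways. The \type {Title} key is a common but
optional entry, so we assume that \type {nil} comes back when it is missing.

\starttyping
local document = pdfe.open("foo.pdf")
if document then
    local info = pdfe.getinfo(document)
    if info then
        -- the decoded string
        local title = pdfe.getstring(info,"Title",true)
        -- the original string plus a boolean that flags hex encoding
        local raw, hex = pdfe.getstring(info,"Title",false)
        print(title, raw, hex)
    end
    -- names come back as strings too
    print(pdfe.getname(pdfe.getcatalog(document),"Type"))
    pdfe.close(document)
end
\stoptyping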

\stopsubsection

\startsubsection[title={\type {get[dictionary|array|stream]}}]

\libindex {getdictionary} \libindex {getfromdictionary}
\libindex {getarray}      \libindex {getfromarray}
\libindex {getstream}     \libindex {getfromstream}

Normally you will use an index in an array and a key in a dictionary but
dictionaries also accept an index. The size of an array or dictionary is
available with the usual \type {#} operator.

\starttyping
<pdfe dictionary>  = getdictionary(<pdfe array|dictionary>,index|key)
<pdfe array>       = getarray     (<pdfe array|dictionary>,index|key)
<pdfe stream>,
<pdfe dictionary>  = getstream    (<pdfe array|dictionary>,index|key)
\stoptyping

These commands return dictionaries, arrays and streams. A stream is a dictionary
with a blob of data attached.
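
To make this more concrete, here is a sketch that descends from the catalog
into the page tree. The keys \type {Pages} and \type {Kids} come from the \PDF\
format, not from this library, and we assume that array indices start at one.

\starttyping
local document = pdfe.open("foo.pdf")
if document then
    local catalog = pdfe.getcatalog(document)
    local pages   = pdfe.getdictionary(catalog,"Pages")
    local kids    = pages and pdfe.getarray(pages,"Kids")
    if kids then
        print("keys in the page tree root", #pages)
        print("kids in the page tree root", #kids)
        -- assumption: the first entry has index one
        local first = pdfe.getdictionary(kids,1)
        if first then
            print("type of first kid", pdfe.getname(first,"Type"))
        end
    end
    pdfe.close(document)
end
\stoptyping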

Before we come to an alternative access mode, we mention that the objects provide
access in a different way too, for instance this is valid:

\starttyping
print(pdfe.open("foo.pdf").Catalog.Type)
\stoptyping

At the topmost level there are \type {Catalog}, \type {Info}, \type {Trailer}
and \type {Pages}, so this is also okay:

\starttyping
print(pdfe.open("foo.pdf").Pages[1])
\stoptyping

\stopsubsection

\startsubsection[title={\type {[open|close|readfrom|readwhole]stream}}]

\libindex {openstream}
\libindex {closestream}
\libindex {readfromstream}
\libindex {readwholestream}

Streams are sort of special. When your index or key hits a stream you get back a
stream object and dictionary object. The dictionary you can access in the usual
way and for the stream there are the following methods:

\starttyping
okay   = openstream(<pdfe stream>,[decode])
         closestream(<pdfe stream>)
str, n = readfromstream(<pdfe stream>)
str, n = readwholestream(<pdfe stream>,[decode])
\stoptyping

You either read in chunks, or you ask for the whole. When reading in chunks, you
need to open and close the stream yourself. The \type {n} value indicates the
length read. The \type {decode} parameter controls if the stream data gets
uncompressed.

As with dictionaries, you can access fields in a stream dictionary in the usual
\LUA\ way too. You get the content when you \quote {call} the stream. You can
pass a boolean that indicates if the stream has to be decompressed.
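
Both ways of reading can be sketched as follows. This is an illustration under
a few assumptions: the \type {Contents} entry of the first page is a single
stream (it can also be an array), \type {readfromstream} returns nothing once
the stream is exhausted, and a stream can be opened after it has been read as
a whole.

\starttyping
local document = pdfe.open("foo.pdf")
if document then
    local page = pdfe.getpage(document,1)
    local stream, dictionary = pdfe.getstream(page,"Contents")
    if stream then
        -- variant one: everything at once, decoded
        local whole, length = pdfe.readwholestream(stream,true)
        print("whole", length)
        -- variant two: in chunks, we open and close ourselves
        if pdfe.openstream(stream,true) then
            local total = 0
            while true do
                local chunk, n = pdfe.readfromstream(stream)
                if not chunk or not n or n == 0 then
                    break -- assumption: this signals the end
                end
                total = total + n
            end
            pdfe.closestream(stream)
            print("chunks", total)
        end
    end
    pdfe.close(document)
end
\stoptyping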

\stopsubsection

\startsubsection[title={\type {getfrom[dictionary|array]}}]

\libindex {getfromdictionary}
\libindex {getfromarray}

In addition to the interface described before, there is also a bit lower level
interface available.

\starttyping
key, type, value, detail = getfromdictionary(<pdfe dictionary>,index)
type, value, detail = getfromarray(<pdfe array>,index)
\stoptyping

\starttabulate[|c|l|l|l|]
\DB type       \BC meaning    \BC value            \BC detail \NC \NR
\NC \type {0}  \NC none       \NC nil              \NC \NC \NR
\NC \type {1}  \NC null       \NC nil              \NC \NC \NR
\NC \type {2}  \NC boolean    \NC boolean          \NC \NC \NR
\NC \type {3}  \NC integer    \NC integer          \NC \NC \NR
\NC \type {4}  \NC number     \NC float            \NC \NC \NR
\NC \type {5}  \NC name       \NC string           \NC \NC \NR
\NC \type {6}  \NC string     \NC string           \NC hex \NC \NR
\NC \type {7}  \NC array      \NC arrayobject      \NC size \NC \NR
\NC \type {8}  \NC dictionary \NC dictionaryobject \NC size \NC \NR
\NC \type {9}  \NC stream     \NC streamobject     \NC dictionary size \NC \NR
\NC \type {10} \NC reference  \NC integer          \NC \NC \NR
\LL
\stoptabulate

A \type {hex} string is (in the \PDF\ file) surrounded by \type {<>} while plain
strings are bounded by \type {()}.
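
The low level interface is handy when you want to inspect a dictionary without
knowing its keys. The next sketch lists the entries of the catalog. The \type
{typenames} table is made up for this example and simply mirrors the table
above. We assume that indices start at one.

\starttyping
-- hypothetical lookup table that mirrors the type codes
local typenames = {
    [0] = "none", "null", "boolean", "integer", "number",
    "name", "string", "array", "dictionary", "stream", "reference",
}

local document = pdfe.open("foo.pdf")
if document then
    local catalog = pdfe.getcatalog(document)
    for index = 1, #catalog do
        local key, kind, value, detail =
            pdfe.getfromdictionary(catalog,index)
        print(key, typenames[kind] or kind, value, detail)
    end
    pdfe.close(document)
end
\stoptyping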

\stopsubsection

\startsubsection[title={\type {[dictionary|array]totable}}]

\libindex {dictionarytotable}
\libindex {arraytotable}

All entries in a dictionary or array can be fetched with the following commands
where the return values are a hashed or indexed table.

\starttyping
hash = dictionarytotable(<pdfe dictionary>)
list = arraytotable(<pdfe array>)
\stoptyping

You can get a list of pages with:

\starttyping
{ { <pdfe dictionary>, size, objnum }, ... } = pagestotable(<pdfe document>)
\stoptyping
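
A possible use is shown in the next sketch. It only relies on the shape of the
page list shown above. What the values in the hash returned by \type
{dictionarytotable} look like is not specified here, so we just print them.

\starttyping
local document = pdfe.open("foo.pdf")
if document then
    -- each entry is { <pdfe dictionary>, size, objnum }
    for index, entry in ipairs(pdfe.pagestotable(document)) do
        local page, size, objnum = entry[1], entry[2], entry[3]
        print("page", index, "object", objnum, "entries", size)
    end
    local info = pdfe.getinfo(document)
    if info then
        for key, value in pairs(pdfe.dictionarytotable(info)) do
            print(key, value)
        end
    end
    pdfe.close(document)
end
\stoptyping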

\stopsubsection

\startsubsection[title={\type {getfromreference}}]

\libindex {getfromreference}

Because you can have unresolved references, a reference object can be resolved
with:

\starttyping
type, <pdfe dictionary|array|stream>, detail = getfromreference(<pdfe reference>)
\stoptyping

So, as second value you get back a new \type {pdfe} userdata object that you can
query.

\stopsubsection

\stopsection

\startsection[title={Memory streams}][library=pdfe]

\topicindex{\PDF+memory streams}

\libindex {new}

The \type {pdfe.new} function takes three arguments:

\starttabulate
\DB value          \BC explanation \NC \NR
\TB
\NC \type {stream} \NC this is a (in low level \LUA\ speak) light userdata
                       object, i.e.\ a pointer to a sequence of bytes \NC \NR
\NC \type {length} \NC this is the length of the stream in bytes (the stream can
                       have embedded zeros) \NC \NR
\NC \type {name}   \NC optional, this is a unique identifier that is used for
                       hashing the stream \NC \NR
\LL
\stoptabulate

The third argument is optional. When it is not given the function will return a
\type {pdfe} document object as with a regular file, otherwise it will return a
filename that can be used elsewhere (e.g.\ in the image library) to reference the
stream as pseudo file.

Instead of a light userdata stream (which is actually fragile but handy when you
come from a library) you can also pass a \LUA\ string, in which case the given
length is (at most) the string length.
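
The string variant can be tried with plain \LUA, as in the following sketch. It
loads a file into a string and hands that string to \type {pdfe.new} without
the optional third argument. Keeping \type {data} in scope while the document
is in use is a cautious assumption on our side, not a documented requirement.

\starttyping
local handle = io.open("foo.pdf","rb")
if handle then
    -- the whole file as one string (format "a" needs Lua 5.3 or later)
    local data = handle:read("a")
    handle:close()
    local document = pdfe.new(data,#data)
    if document then
        print("pages", pdfe.getnofpages(document))
        pdfe.close(document)
    end
end
\stoptyping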

The function returns a \type {pdfe} object and a string. The string can be used in
the \type {img} library instead of a filename. You need to prevent garbage
collection of the object when you use it as image (for instance by storing it
somewhere).

Both the memory stream and its use in the image library are experimental and can
change. In case you wonder where this can be used: when you use the swiglib
library for \type {graphicmagick}, it can return such a userdata object. This
permits conversion in memory and passing the result directly to the backend. This
might save some runtime in one-pass workflows. This feature is currently not
meant for production and we might come up with a better implementation.

\stopsection

\startsection[title={The \type {pdfscanner} library}][library=pdfscanner]

This library is not available in \LUAMETATEX.

\stopsection

\stopchapter

\stopcomponent