% language=us runpath=texruns:manuals/workflows

% Source file: workflows-hashed.tex, last modification: 2021-10-28 13:50

% Musical timestamp: Welcome 2 America by Prince, with a pretty good lineup,
% August 2021.

\environment workflows-style

\startcomponent workflows-hashed

\startchapter[title=Hashed files]

In a (basically free content) project we had to deal with tens of thousands of
files. Most are in \XML\ format, but there are also thousands of \PNG, \JPG\ and
\SVG\ images. In a large project like this, which covers a large part of Dutch
school math, images can be shared. All the content is available to schools as
\HTML\ but can also be turned into printable form. Because schools want to have
stable content over specified periods, one has to make regular snapshots of this
corpus. Also, distributing a few gigabytes of data is not much fun.

So, in order to bring the amount of data down, a dedicated mechanism for handling
files has been introduced. After playing with a \SQLITE\ database we finally
settled on just \LUA, simply because it was faster and because it makes the
solution independent of an external database.

The process comes down to creating a file database once in a while, loading a
relatively small hash mapping at runtime and accessing files from a large
data file on demand. Optionally files can be compressed, which makes sense for
the textual files.
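
The idea can be illustrated with a few lines of plain \LUA. This is only a
sketch and not the code used by \CONTEXT: the names \type {builddatabase} and
\type {fetch} are made up for this example, the data lives in a string instead
of in a file, compression is left out, and the content itself serves as key
where a real implementation would more likely use a checksum.

\starttyping
-- Illustrative sketch only: hypothetical names, not the ConTeXt code.

-- files: table mapping a name to its content (a string)
local function builddatabase(files)
    local names  = { } -- name -> blob index (the small mapping)
    local blobs  = { } -- blob index -> offset and length in the data
    local seen   = { } -- content -> blob index (assumption: no checksum)
    local chunks = { } -- unique contents in storage order
    local offset = 0
    for name, content in pairs(files) do
        local index = seen[content]
        if not index then
            -- first time we see this content: store it once
            index = #blobs + 1
            blobs[index]  = { offset = offset, length = #content }
            chunks[index] = content
            seen[content] = index
            offset = offset + #content
        end
        names[name] = index
    end
    return { names = names, blobs = blobs }, table.concat(chunks)
end

-- fetch one file on demand from the (large) data by name
local function fetch(database, data, name)
    local index = database.names[name]
    if not index then
        return nil
    end
    local blob = database.blobs[index]
    return string.sub(data, blob.offset + 1, blob.offset + blob.length)
end

-- made-up sample data: two names carry the same content
local database, data = builddatabase {
    ["a/images/casino.jpg"] = "JPEGDATA",
    ["b/images/casino.jpg"] = "JPEGDATA",
    ["a/text.xml"]          = "<x/>",
}

print(#database.blobs) -- number of unique blobs
print(fetch(database, data, "b/images/casino.jpg"))
\stoptyping

In this sample three names end up as two blobs, because both images have the
same content and are therefore stored only once.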

A database is created with one of the \CONTEXT\ extras, for instance:

\starttyping
context --extra=hashed --database=m4 --pattern=m4all/**.xml --compress
context --extra=hashed --database=m4 --pattern=m4all/**.svg --compress
context --extra=hashed --database=m4 --pattern=m4all/**.jpg
context --extra=hashed --database=m4 --pattern=m4all/**.png
\stoptyping

The database uses two files: a small \type {m4.lua} file (some 11 megabytes) and
a large \type {m4.dat} file (about 820 megabytes, down from 1850 megabytes of
originals). Instead of passing each pattern on the command line you can use a
specification, say \type {m4all.lua}:

\starttyping
return {
    { pattern  = "m4all/**.xml$", compress = true  },
    { pattern  = "m4all/**.svg$", compress = true  },
    { pattern  = "m4all/**.jpg$", compress = false },
    { pattern  = "m4all/**.png$", compress = false },
}
\stoptyping

\starttyping
context --extra=hashed --database=m4 --patterns=m4all.lua
\stoptyping

You should see something like this on the console:

\starttyping
hashed > database 'hasheddata', 1627 paths, 46141 names,
    36935 unique blobs, 29674 compressed blobs
\stoptyping

So here we share some nine thousand files, all of them images: 46141 names map
onto 36935 unique blobs. In case you wonder why we keep the duplicates: they
have unique names (copies) so that when a section is updated there is no
interference with other sections. The tree structure is mostly six levels deep
(sometimes there is an additional level).

% \startluacode
%     if not resolvers.finders.helpers.validhashed("hasheddata") then
%         resolvers.finders.helpers.createhashed {
%             database = "hasheddata",
%             pattern  = "m4all/**.jpg$",
%             compress = false,
%         }
%         resolvers.finders.helpers.createhashed {
%             database = "hasheddata",
%             pattern  = "m4all/**.png$",
%             compress = false,
%         }
%         resolvers.finders.helpers.createhashed {
%             database = "hasheddata",
%             pattern  = "m4all/**.xml$",
%             compress = true,
%         }
%     end
% \stopluacode

% \startluacode
%     if not resolvers.finders.helpers.validhashed("hasheddata") then
%         resolvers.finders.helpers.createhashed {
%             database = "hasheddata",
%             patterns = {
%                 { pattern  = "m4all/**.jpg$", compress = false },
%                 { pattern  = "m4all/**.png$", compress = false },
%                 { pattern  = "m4all/**.svg$", compress = true  },
%                 { pattern  = "m4all/**.xml$", compress = true  },
%             },
%         }
%     end
% \stopluacode

Accessing files is the same as with files on the system, but one has to register
a database first:

\starttyping
\registerhashedfiles[m4]
\stoptyping

A fully qualified specifier looks like this (not too different from other
specifiers):

\starttyping
\externalfigure
  [hashed:///m4all/books/chapters/h3/h3-if1/images/casino.jpg]
\externalfigure
  [hashed:///m4all/books/chapters/ha/ha-c4/images/ha-c44-ex2-s1.png]
\stoptyping

but nicer would be:

\starttyping
\externalfigure
  [m4all/books/chapters/h3/h3-if1/images/casino.jpg]
\externalfigure
  [m4all/books/chapters/ha/ha-c4/images/ha-c44-ex2-s1.png]
\stoptyping

This is possible when we also specify:

\starttyping
\registerfilescheme[hashed]
\stoptyping

This makes the given scheme-based resolver kick in first, while the normal
file lookup is used as a last resort.
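
Conceptually the lookup order is as follows. The function \type {resolve}, both
tables and the file names are hypothetical and only serve to illustrate the
order; they are not part of the \CONTEXT\ resolver interface.

\starttyping
-- Sketch of the lookup order only, not the actual resolver code.
local function resolve(name, hashed, system)
    if hashed[name] then
        return "hashed:///" .. name -- found in the registered database
    elseif system[name] then
        return name                 -- last resort: the normal file lookup
    end
    return nil                      -- not found at all
end

-- made-up sample data
local hashed = { ["m4all/images/casino.jpg"] = true }
local system = { ["local/logo.png"] = true }

print(resolve("m4all/images/casino.jpg", hashed, system))
print(resolve("local/logo.png", hashed, system))
print(resolve("missing.png", hashed, system))
\stoptyping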

This mechanism is written on top of the infrastructure that has been part of
\CONTEXT\ \MKIV\ right from the start, but this particular feature is only
available in \LMTX\ (backporting is likely a waste of time).

Just for the record: this mechanism is kept simple, so the database has no
update and replace features. One can just generate a new one. You can test for a
valid database and act upon the outcome:

\starttyping
\doifelsevalidhashedfiles {m4} {
    \writestatus{hashed}{using hashed data}
    \registerhashedfiles[m4]
    \registerfilescheme[hashed]
} {
    \writestatus{hashed}{no hashed data}
}
\stoptyping

Future versions might introduce filename normalization (lowercase, cleanup) so
consider this as a first step. First we need to test it for a while.
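
What such a normalization could look like is sketched below. Because this
feature does not exist yet, everything here is an assumption: the function
\type {normalize} is hypothetical and the rules (lowercase, forward slashes, no
repeated slashes) are just one possible reading of \quote {lowercase, cleanup}.

\starttyping
-- Hypothetical sketch: one possible normalization of a file name.
local function normalize(name)
    name = string.lower(name)           -- lowercase
    name = string.gsub(name, "\\", "/") -- backslashes become slashes
    name = string.gsub(name, "/+", "/") -- collapse repeated slashes
    return name
end

-- made-up sample name
print(normalize("M4All\\Books//Images/Casino.JPG"))
\stoptyping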

\stopchapter

\stopcomponent
