publications-database.tex /size: 22 Kb    last modification: 2020-07-01 14:35
1\environment publications-style
2
3\startcomponent publications-database
4
5\startchapter[title=The database]
6
7The bibliography subsystem uses a database (or a set of databases) to construct a
8list of citations to be used in a scholarly work. However, it will be shown later
9that the database system can be used (and abused) to many ends having little or
10nothing at all to do with citations and bibliographies. Nevertheless, at first we
11shall remain focused on the use of bibliography databases.
12
13The data to be used must have a source and a structure. In the next sections we
14describe the possible input.
15
16\startsection[title=\BibTeX]
17
18The \BIBTEX\ format is rather popular in the \TEX\ community and even with its
19shortcomings it will stay around for a while. Many publication websites can
20export and many tools are available to work with this database format. It is
21rather simple and looks a bit like \index [LUA table] {\LUA\ table}\LUA\ tables.
22Indeed, it is said that the \BIBTEX\ format was one of the inspirations for the
23constructor syntax in \LUA\ \cite [alternative=num,
24righttext={\btxcomma Chapter\nbsp 12.}] [default::Ierusalimschy2006].
25
Unfortunately the content can be (and usually is) polluted with
non|-|standardized \TEX\ commands, which complicates pre- or post|-|processing
outside \TEX. In that sense a \BIBTEX\ database is often not coded neutrally.
Some limitations, like the use of commands to encode accented characters, are
rooted in the \ASCII\ world and can be bypassed by using \index [UTF]
{\UTF}\UTF\ instead (as handled somewhat in \LATEX\ through extensions such as
\Tindex {bibtex8}).
32
33The normal way to deal with a bibliography is to refer to entries using a unique
34\Index {tag} or key. When a text containing a list of entries is typeset, this
35reference can be used for linking purposes. The list can be processed and sorted
36using the \Tindex {bibtex} program that converts the database into something more
37\TEX\ friendly (a \Tindex {.bbl} file).
38
39In \CONTEXT\ we no longer use the (external) \goto {\Tindex {bibtex} program}
40[url(https://www.ctan.org/pkg/bibtex)] at all: we simply parse the database files
41in \LUA\ and deal with the necessary manipulations directly in \CONTEXT. One or
42more such databases can be used and combined with additional entries defined
43within the document. We can have several such datasets active at the same time.
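
As a minimal sketch of such a setup, several database files might be loaded into
one dataset while a second dataset is kept apart. The dataset names and file
names are hypothetical; the two commands are the ones used later in this chapter
to load \Tindex {tugboat.bib}, and we assume that successive loads into the same
dataset accumulate.

\startTEX
% hypothetical dataset and file names, for illustration only
\definebtxdataset[books]
\usebtxdataset[books][books.bib]
\usebtxdataset[books][more-books.bib]

\definebtxdataset[articles]
\usebtxdataset[articles][articles.bib]
\stopTEX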
44
45\startaside
46\emphasis {On the name \Tindex {btx}:} many of the \CONTEXT\ commands that will be
47used in the following contain the label \TEXcode {btx} in their name. This
48identifier was retained despite the fact that \CONTEXT\ \MKIV\ is now completely
49independent of \BIBTEX; it reflects the role still played by \BIBTEX\ data as a
50preferred source format and serves as a handy, unique identifier, both internally
51in the programming as well as for the user. This three|-|letter label is
52systematically used in commands that otherwise attempt to avoid cryptic|-|styled
53names.
54\stopaside
55
56A \BIBTEX\ file entry looks like this:
57
58\startBTX
59@Article {sometag,
60    author  = "An Author and Another One",
61    title   = "A hopefully meaningful title",
62    journal = maps,
63    volume  = "25",
64    number  = "2",
65    pages   = "5--9",
66    month   = mar,
67    year    = "2013",
68    ISSN    = "1234-5678",
69}
70\stopBTX
71
72Entries are of the form: \index {category}\BTXcode {@category{...}}
73
Anything outside of a valid \BTXcode {@category{...}} construction is ignored and
is taken to be a comment. Comments are not allowed within an entry, but one can,
for example, prefix a field name so that the field is ignored.
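
A sketch of this trick, using the example entry from above: the prefix \type
{OPT} is an arbitrary choice made for this illustration. It is assumed that the
resulting field name is not one that a rendering looks for, so that the field
is carried along but never typeset.

\startBTX
hypothetical variant of the entry above, the number field is disabled
@Article {sometag,
    author    = "An Author and Another One",
    title     = "A hopefully meaningful title",
    journal   = maps,
    volume    = "25",
    OPTnumber = "2",
    year      = "2013",
}
\stopBTX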
77
78There is a special entry type named \index {@comment}\BTXcode {@comment{...}}.
79The main use of such an entry type is to comment a large part of the bibliography
80easily, since anything outside an entry is already a comment, and commenting out
81one entry may be achieved by just removing its initial~\BTXcode {@}.  The \index
82{@comment}\BTXcode {@comment{...}} entry is perhaps of some use, although this is
83not very elegant! As one can input multiple bibliography data files, as will be
84seen below, it is much better practice to split datafiles for optional loading.
85
Many \BIBTEX\ data management tools such as \Tindex {jabref} (see below) will
ignore and then throw away all such hand|-|crafted comments and data entries
turned into comments. So one must beware!
89
The field names are all cast to lowercase, so capitalization is irrelevant.
Spacing is not important and should be used advantageously for readability. The
leading \Index {tag} (\BTXcode {sometag} in the example above) cannot contain
spaces and \emphasis {must} be followed by a comma.
94
95The entry \Index {tag} (\BTXcode {@category{sometag,...}}) is not to be confused
96with the optional field \BTXcode {key=sortkey,} that may also be present.
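
To make the distinction concrete, here is a small sketch in which both occur.
The tag \BTXcode {Knuth:TeXbook} is what one would use to cite the entry,
whereas \BTXcode {key} is an ordinary field like any other. Both the tag and the
value given to \BTXcode {key} are made up for this illustration.

\startBTX
illustrative entry, tag and key value are hypothetical
@Book {Knuth:TeXbook,
    key    = "knuth1984texbook",
    author = "Donald E. Knuth",
    title  = "The TeXbook",
    year   = "1984",
}
\stopBTX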
97
Normally a value is given between quotes (or curly brackets) but single words are
also valid (as there is no real benefit in not using quotes or curly brackets, we
advise always using them, contrary to our example above). The order of the
fields in an entry is inconsequential and there can be many more fields than
those shown above. Instead of string values one can also use predefined
shortcuts. The title, for example, might quite often contain \TEX\ macros, and
some fields, like \BTXcode {pages}, have funny characters such as the endash
(typically entered as \BTXcode {--}), so we have a mixture of data and
typesetting directives. Furthermore, if you are covering non||English references,
you often need characters that are not in the \ASCII\ subset. Note that \CONTEXT\
is quite happy with \UTF, but if your database file uses old|-|fashioned \TEX\
accent combinations then these will be internally converted automatically to
\UTF.
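
The following sketch puts the three ways of giving a value side by side and
mixes an old|-|fashioned accent command with a directly entered \UTF\ character.
The entry itself is invented for this illustration; according to the description
above, both author names should end up as plain \UTF\ internally.

\startBTX
invented entry, shown only to compare the value syntaxes
@Article {anothertag,
    author = {Kurt G\"odel and Erwin Schrödinger},
    title  = "A hopefully meaningful title",
    volume = {25},
    number = 2,
    pages  = "5--9",
    year   = "2013",
}
\stopBTX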
110
111Commands (macros) found in a database file are converted to an indirect call,
112which is quite robust. The use of commands in the database file will be described
113in \in {section} [sec:Commands].
114
115The \Tindex {author} (and \Tindex {editor}) fields are parsed separating multiple
116authors identified by the conjunction \quote {and}. Each name is assumed to be in
117the form:
118
119\definetyping
120  [NameSyntax]
121  [margin=1em]
122
123\startNameSyntax
124Firstname(s) Lastname
125\stopNameSyntax
126
127\seeindex {vons} {particule}
128
where \type {Lastname} is a single word but may include an optional (nobility)
\Index {particule} (lower|-|case word(s) such as \quotation {von}, \quotation
{de}, \quotation {de la}, etc.) \emphasis {unless} specifically in the two- or
three|-|token form:
133
134\index {suffix}
135
\startNameSyntax
Lastname(s), Firstname(s)
Lastname(s), Suffix(es), Firstname(s)
\stopNameSyntax
140
141separated explicitly using comma(s) thus allowing multi|-|word \type {Lastnames}.
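
As an illustration, the following field combines the three forms. Following the
rules above, the first name should be split into first names \quote {Ludwig},
particule \quote {van} and last name \quote {Beethoven}; the second has the
two|-|word last name \quote {Lloyd Webber}; the third carries the suffix \quote
{Jr.}. The entry is only a sketch and the split is what we expect from the
description, not logged output.

\startBTX
sketch of an author field using the three name forms
@Misc {names,
    author = "Ludwig van Beethoven and
              Lloyd Webber, Andrew and
              King, Jr., Martin Luther",
}
\stopBTX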
142
143\startaside
144An \BTXcode {author} field is sometimes abused in traditional \BIBTEX\ usage to
145hold not a name but rather an entity. Other fields, such as \BTXcode
146{organization} or \BTXcode {collaboration}, for example, should be used in such
147cases.
148\stopaside
149
150\BIBTEX\ also (obscurely) supports the syntax:
151
152\seeindex {juniors}{suffix}
153\index {suffix}
154
155\startNameSyntax
156Firstname(s) \{Lastname(s), Suffix(es)\}
157\stopNameSyntax
158
We may (or may not) support this in the future, so don't use this!
160
161We extend \BIBTEX\ by optionally parsing each name in terms of four or five
162tokens:
163
164\index {particule} \index {suffix} \index {initial}
165
166\startNameSyntax
167Particule(s), Lastname(s), Suffix(es), Firstname(s)
168Particule(s), Lastname(s), Suffix(es), Firstname(s), Initial(s)
169\stopNameSyntax
170
171in order to allow a free form for the particules, irrespective of capitalization,
172thus avoiding the need to resort to any sort of \TEX\ trickery \cite [num]
173[default::Patashnik1988,Markey2009]. In fact, an optional sixth token is parsed
174whose meaning is presently reserved for future directives describing how the name
175is to be interpreted:
176
177\index {particule} \index {suffix} \index {initial}
178
179\startNameSyntax
180Particule(s), Lastname(s), Suffix(es), Firstname(s), Initial(s), directives
181\stopNameSyntax
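
A sketch of the extended forms, using invented names: the first author is given
with four tokens, the second with five. Note that the second particule is
capitalized, which the positional syntax is meant to allow. The directives token
is left out, as its meaning is not yet defined.

\startBTX
invented names, illustrating the four and five token forms
@Misc {morenames,
    author = "van der, Example, Jr., Alice Beatrice and
              De La, Sample, III, Charles, C.",
}
\stopBTX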
182
183\BIBTEX\ additionally accepts the special token \Tindex {others} to be used
184(sparingly) to indicate an incomplete author list. Note that most style
185specifications will handle the truncation of long author lists in a systematic
186fashion. The \index [others] {\tt and others}\BTXcode {and others} construction
187finds its use when the complete author list is not well known or ill|-|defined.
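
For instance, reusing the placeholder names of the first example:

\startBTX
placeholder names, the list of authors is deliberately left open
@Article {opentag,
    author = "An Author and Another One and others",
    title  = "A hopefully meaningful title",
    year   = "2013",
}
\stopBTX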
188
189Sometimes, or even often, the database might contain variants of an author's
190name that we would like to identify as a single, unique author. Indeed, certain
191bibliographic styles (as will be seen later) as well as an index of authors, for
192example, will depend on this identification. A command \Cindex {btxremapauthor}
193allows establishing this identity:
194
195\startbuffer
196\btxremapauthor [Donald Knuth] [Donald E. Knuth]
197\btxremapauthor [Don Knuth]    [Donald E. Knuth]
198\stopbuffer
199\getbuffer
200
201\cindex {btxremapauthor}
202\typebuffer [option=TEX]
203
204Fields other than \Tindex {author} and \Tindex {editor}, for example \Tindex
205{artist} or \Tindex {director} if one desires, can be declared to be of type
206\quote {author} and thus interpreted as names, but this is a subject for
207specialists.
208
209The \BTXcode {keywords} field can also be split into tokens separated by
210semicolons (keyword; keyword; \unknown). This can be useful, as will be seen
211later, in the creation of keyword indexes, for example.
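
A field such as the following (with made|-|up keywords) would thus be expected
to yield three separate keywords:

\startBTX
made up keywords, separated by semicolons
@Article {keywordtag,
    author   = "An Author",
    title    = "A hopefully meaningful title",
    keywords = "typesetting; bibliographies; databases",
}
\stopBTX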
212
Other string values such as \BTXcode {title} are kept literally (except for an
internal automatic conversion to \UTF\ of certain \TEX\ strings such as accent
combinations, endash, quotations, etc.). Note that the bibliography rendering
style (see below) might specify a capitalization of the title (using the
\CONTEXT\ commands \TEXcode {\Word} or \TEXcode {\Words}, for example).
Capitalized names and acronyms are respected, removing the need for the \BIBTEX\
practice of \quote {protecting} such words or letters with surrounding curly
brackets (which here are simply stripped off). (Furthermore, since \CONTEXT\ uses
\UTF, it does not suffer from all of the complicated \Index {sorting} issues that
plague \BIBTEX|/|\LATEX.) As some styles might not specify the capitalization of
words in the title whereas other styles might, it is recommended that strings be
written in lower case except where upper case is explicitly required so as to be
compatible with all such capitalization styles.
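
In practice this recommendation amounts to writing a title as in the following
sketch (an invented entry): only the first word, the proper name and the acronym
carry capitals, and no protective curly brackets are used.

\startBTX
invented entry, the title is in lower case except where capitals are required
@Article {titletag,
    author = "An Author",
    title  = "An overview of bibliography handling in ConTeXt using UTF input",
    year   = "2013",
}
\stopBTX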
226
227\startaside
Some bibliographic database sources can be quite sloppy and return strings
(titles and even authors) in all capitals, for example. We have made the design
choice \emphasis {not} to follow the \BIBTEX\ practice/feature of explicitly
formatting all string values, as we did not want to require the protection
through enclosing curly brackets that would have been a necessary consequence.
Thus, some cleaning of these database files might be needed. Furthermore, we
attempt to use all the power of \CONTEXT\ and \LUA, thus making unnecessary much
(most?) of the \TEX|-|like encoding of the data. We encourage users to clean up
their \Tindex {.bib} database files as much as possible so that they contain only
the necessary data, with a minimum of explicit formatting directives.
238\stopaside
239
240String values, as described above, can be enclosed indifferently between matching
241curly brackets: \BTXcode {{}} or pairs of quotation marks: \BTXcode {""}.
242Multiple string values can be \index {string concatenation}concatenated using the
243operator \BTXcode {\#}, as will be illustrated in \in {table}
244[tab:mkiv-publications.bib].
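
As a small sketch of concatenation, an abbreviation and a literal string can be
joined into one value. The abbreviation name \BTXcode {pub-AW} and the entry are
chosen freely for this illustration.

\startBTX
freely chosen abbreviation name, joined with a literal string
@String{pub-AW = "Addison-Wesley"}

@Book {concattag,
    author    = "An Author",
    title     = "A hopefully meaningful title",
    publisher = pub-AW # ", Reading, Massachusetts",
}
\stopBTX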
245
Everything outside of a valid entry is ignored and treated as a \Index {comment}.
Syntactic errors (such as a missing comma or some unbalanced quotes or
parentheses) are also skipped over, i.e. ignored. This is an attempt to continue
on to valid data but may lead to unexpected results. It is therefore the user's
responsibility to ensure the correctness of the data files. Whereas some checks
and warnings are issued, the system is purposefully not too verbose.
252
Data is handled on a \quote {first come, first served} basis: duplicate \index
{duplicate+fields}\emphasis {fields} in an entry are ignored \startfootnote Note
that some \BIBTEX\ practice allows for the concatenation of duplicate name \index
{duplicate+fields}fields (i.e. \BTXcode {author} and \BTXcode {editor}) through
\BTXcode {and}, but (silently) ignores duplicate other fields. We choose to have
a consistent behavior and disallow duplicate field occurrences. \stopfootnote
though duplicate \index {duplicate+entries}\emphasis {entries} (having the same
\index {duplicate+tags}tag) are retained, but the subsequent identical \Index
{tag}s will be modified by adding a suffix $-n$ for the $n$\high {th} duplicate.
The presence of duplicate \index {duplicate+fields}fields or \index
{duplicate+tags}tags will be flagged as such with warnings in the log file.
Duplicate \index {duplicate+entries}entries using different \Index {tag}s will
not be treated as duplicates.
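
The following invented fragment shows both situations. Within the first entry
only the first \BTXcode {title} should be kept. The second entry reuses the tag
of the first; following the rule above it should be retained under a suffixed
tag (presumably \BTXcode {sometag-1}, although we have not verified the exact
form here), and both cases should show up as warnings in the log file.

\startBTX
invented fragment with a duplicate field and a duplicate tag
@Article {sometag,
    title = "This first title is kept",
    title = "This duplicate field is ignored",
}

@Article {sometag,
    title = "This entry reuses the tag of the previous one",
}
\stopBTX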
266
267A special provision has been made to declare author \Index {synonyms}, that is
268names that might occur with a variation of spellings or aliases. This shall be
269discussed later.
270
271We have attempted to remain compatible with the \BIBTEX\ format, and any new
272bibliography extensions that we introduce here were designed in a way to remain
273compatible with \BIBTEX, being simply ignored rather than potentially generating
274a \BIBTEX\ error.
275
The \BIBTEX\ files are loaded in memory as \LUA\ tables but can be converted to
\XML\ so that we can access them in a more flexible way, but that is another
subject for specialists.
279
280\stopsection
281
282\startsection [reference=sec:Commands,title=Commands in entries]
283
284One unfortunate aspect commonly found in \BIBTEX\ files is that they may contain
285\TEX\ commands. Even worse is that there is no standard on what these commands
286can be and what they mean, at least not formally, as \BIBTEX\ is a program
287intended to be used with many variants of \TEX\ style: plain, \LATEX, and others.
288This means that we need to define our use of these typesetting commands. (In
289particular, one might need to redefine those that are too \LATEX|-|centric.)
290However, in most cases, they are just abbreviations or font switches and these
291are often well known. Therefore, \CONTEXT\ will try to resolve them before
292reporting an issue. The log file will announce the commands that have been seen
293in the loaded databases. For instance, loading \Tindex {tugboat.bib} (distributed
294with \TEXLIVE) gives a long list of commands of which we show a small set of the
295five most frequently encountered ones here:
296
297\startbuffer
298\definebtxdataset[tugboat]
299\usebtxdataset[tugboat][tugboat.bib]
300\stopbuffer
301
302\getbuffer
303
304\starttyping
305publications > tugboat  tt     134 known
306publications > tugboat  Dash   136 unknown
307publications > tugboat  acro   137 known
308publications > tugboat  LaTeX  209 known
309publications > tugboat  TeX    856 known
310\stoptyping
311
312Some are flagged as known and others as unknown. You can define unknown commands,
313or overload existing definitions in the standard way (\emphasis {e.g.} \TEXcode
314{\def\Dash{}}), the \CONTEXT\ way (\TEXcode {\define\Dash{}}) or,
315alternatively, in the following way:
316
317\cindex {definebtxcommand}
318
319\startTEX
320\definebtxcommand\TUB {TUGboat}
321\definebtxcommand\MP  {METAPOST}
322\definebtxcommand\sltt{\tt}
323\definebtxcommand\<#1>{\type{#1}}
324\stopTEX
325
326\definebtxcommand\MP  {METAPOST} % to be used silently below
327
328Custom commands created using \Cindex {definebtxcommand} have the advantage of
329using a separate name space thus allowing \Index {isolation} from other \CONTEXT\
330commands. (The \Index {isolation} of \Cindex {btxcommand} allows the \Tindex
331{.bib} files to safely contain \TEX\ and \LATEX\ idiosyncrasies that might
332conflict with proper \CONTEXT\ syntax.) Unknown commands do not stall processing,
333but their names are then typeset in a mono|-|spaced font so they probably stand
334out for proofreading. You can access the commands using \index
335{btxcommand}\TEXcode {\btxcommand{...}} (or \Cindex {btxcmd}), as in:
336
337\startbuffer
338commands like \btxcommand{MySpecialCommand} are handled in an indirect way
339\stopbuffer
340
341\cindex {btxcommand}
342
343\typeTEXbuffer
344
345As this is an undefined command we get: \quotation {\inlinebuffer}.
346
Often, these embedded \TEX\ commands are present in \Tindex {.bib} files in order
to trick \BIBTEX\ into certain behavior. Since this will generally not be
necessary here, we strongly encourage users to clean up such unnecessary
extras. The idea is to keep the data clean, using styles and parameter settings
instead to handle rendering issues. Indeed, we don't see it as a challenge nor as
a duty to support all kinds of messy definitions. Of course, we try to be
somewhat tolerant, but you will be sure to get better results if you use nicely
set up, consistent databases.
355
356Finally, the \BIBTEX\ entry \tindex {@string}\BTXcode {@String{}} is preprocessed
357as expected.
358
359\tindex {@string}
360
361\startTEX
362@String{j-TUGboat = "TUGboat"}
363\stopTEX
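
Such an abbreviation is then used as an unquoted value, just like the predefined
shortcuts \BTXcode {maps} and \BTXcode {mar} in the first example. A minimal
sketch, with an otherwise invented entry:

\startBTX
invented entry that refers to the abbreviation defined above
@Article {stringtag,
    author  = "An Author",
    title   = "A hopefully meaningful title",
    journal = j-TUGboat,
    year    = "2013",
}
\stopBTX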
364
365\startaside
366Notice that \Tindex {tugboat.bib} also contains: \tindex {@preamble}
367\startBTX
368@Preamble{"\input tugboat.def"}
369@Preamble{"\input path.sty"}
370\stopBTX
371These are silently ignored as many such commands are most likely not to be
372compatible with \CONTEXT. Indeed, the examples shown here are not!
373\stopaside
374
375\stopsection
376
377\startsection[title=\MKII\ definitions]
378
In the old \MKII\ setup we have two kinds of entries: the ones that come from the
\BIBTEX\ run and additional user|-|supplied ones. We no longer rely on \BIBTEX\
output but we do still support the user|-|supplied definitions. These were in
fact prepared in a way that suits the processing of the \BIBTEX\ generated
entries. The next variant reflects the \CONTEXT\ recoding of the old \BIBTEX\
output. For this reason, some users refer to this as \Tindex {.bbl} format.
385
386\cindex {startpublication}
387\cindex {stoppublication}
388
389\startTEX
390\startpublication[k=Hagen:Second,t=article,a={Hans Hagen},y=2013,s=HH01]
391    \artauthor[] {Hans}[H.]{}{Hagen}
392    \arttitle {Who knows more?}
393    \journal  {MyJournal}
394    \pubyear  {2013}
395    \month    {8}
396    \volume   {1}
397    \issue    {3}
398    \issn     {1234-5678}
399    \pages    {123--126}
400\stoppublication
401\stopTEX
402
403The split \TEXcode {\artauthor} fields will be collapsed into a single \TEXcode
404{author} field as we handle the splitting later when it gets parsed in \LUA. The
405\TEXcode {\artauthor} syntax is only kept around for backward compatibility with
406the previous use of \BIBTEX.
407
408In the new setup we support these variants:
409
410\cindex {startpublication}
411\cindex {stoppublication}
412
413\startTEX
414\startpublication[k=Hagen:Third,t=article]
415    \author{Hans Hagen}
416    \title {Who knows who?}
417    ...
418\stoppublication
419\stopTEX
420
421as well as
422
423\cindex {startpublication}
424\cindex {stoppublication}
425
426\startTEX
427\startpublication[tag=Hagen:Third,category=article]
428    \author{Hans Hagen}
429    \title {Who knows who?}
430    ...
431\stoppublication
432\stopTEX
433
434and
435
436\cindex {startpublication}
437\cindex {stoppublication}
438
439\startTEX
440\startpublication
441    \tag     {Hagen:Third}
442    \category{article}
443    \author  {Hans Hagen}
444    \title   {Who knows who?}
445    ...
446\stoppublication
447\stopTEX
448
The use of this format will be illustrated later as a means to export the
database, which may be of great use in converting collections of \MKII\
bibliography files.
451
452\showsetup[startpublication]
453
454\stopsection
455
456\startsection[title=\LUA\ tables]
457
458Because internally the entries are \index [LUA table] {\LUA\ table}\LUA\ tables,
459we also support the loading of \LUA\ based definitions:
460
461\startLUA
462return {
463    ["Hagen:First"] = {
464        author   = "Hans Hagen",
465        category = "article",
466        issn     = "1234-5678",
467        issue    = "3",
468        journal  = "MyJournal",
469        month    = "8",
470        pages    = "123--126",
471        tag      = "Hagen:First",
472        title    = "Who knows nothing?",
473        volume   = "1",
474        year     = "2013",
475    },
476}
477\stopLUA
478
479Notice that the \Index {tag} is redundantly specified; it is \quote {pushed} into
480the table so that one can access it without having to know the \Index {tag} of the
481original table.
482
483\stopsection
484
485\startsection[title=\XML]
486
487The following \index [XML] {\XML}\XML\ input is rather close in structure, and is
488also accepted as input.
489
490\startXML
<?xml version="1.0" standalone="yes" ?>
492<bibtex>
493    <entry tag="Hagen:First" category="article">
494        <field name="author">Hans Hagen</field>
495        <field name="category">article</field>
496        <field name="issn">1234-5678</field>
497        <field name="issue">3</field>
498        <field name="journal">MyJournal</field>
499        <field name="month">8</field>
500        <field name="pages">123--126</field>
501        <field name="tag">Hagen:First</field>
502        <field name="title">Who knows nothing?</field>
503        <field name="volume">1</field>
504        <field name="year">2013</field>
505    </entry>
506</bibtex>
507\stopXML
508
509We shall focus on the use of \BIBTEX\ \Tindex {.bib} files as the input data
510format of reference. Keep in mind, however, that the \index [LUA table] {\LUA\
511table}\LUA\ table format and the \index [XML] {\XML}\XML\ format might prove to
512be more flexible for future expansion of functionality.
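
As a sketch of how the different input formats might be combined, and assuming
that the file suffix is what selects the appropriate loader, one could write the
following. The dataset name and the file names are hypothetical.

\startTEX
% hypothetical file names; the suffix is assumed to select the loader
\definebtxdataset[mixed]
\usebtxdataset[mixed][mypublications.bib]
\usebtxdataset[mixed][mypublications.lua]
\usebtxdataset[mixed][mypublications.xml]
\stopTEX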
513
514\stopsection
515
516\startsection[title=Other formats]
517
518Various other bibliographic data file formats are in common use, such as:
519
520\starttabulate [|Tl|p|]
521\NC savedrecs.txt     \NC Institute of Scientific Information (ISI) tagged format
522                          (e.g. Thomson Reuters™ Web of Science™), \NC \NR
523\NC filename.enw      \NC Thomson Reuters™ Endnote™ export format
524                          (there is also an Endnote \type {.xml} export), \NC \NR
525\NC filename.ris      \NC Research Information Systems, Incorporated, now
526                          Thomson Reuters™ Reference Manager™, and \NC \NR
527\NC pubmed_result.txt \NC The National Library of Medicine® (NLM®)
528                          MEDLINE®|/|PubMed® data format \NC \NR
529\stoptabulate
530
531just to name a few (amongst many more). Filters can be easily written in \LUA\ to
532read these and other bibliography data formats, although no such filters are
533provided. This is because the user has a choice of a certain number of
534bibliography database management programs that can easily convert from these to
535the \BIBTEX\ format. (Notable, open source examples are \index {jabref} \goto
536{jabref} [url(http://jabref.sourceforge.net)] and \index {zotero} \goto {zotero}
537[url(http://www.zotero.org)].) Indeed, it is not the vocation of the present
538\CONTEXT\ bibliography subsystem to fully manage the bibliography data sources,
539only to be able to use such data in the production of documents.
540
541\startaside
542\emphasis {A note on database management programs:} these are very valuable tools
543for the manipulation of bibliography database information, which is why the
544\BIBTEX\ format has so much importance for us here. However, one must be aware
545that these programs are not standards and many of them may introduce invalid
546extensions that might not even be handled correctly by \BIBTEX\ itself.
547\stopaside
548
549\stopsection
550
551\stopchapter
552
553\stopcomponent
554