\environment publications-style

\startcomponent publications-database

\startchapter[title=The database]

The bibliography subsystem uses a database (or a set of databases) to construct a
list of citations to be used in a scholarly work. However, it will be shown later
that the database system can be used (and abused) to many ends having little or
nothing at all to do with citations and bibliographies. Nevertheless, at first we
shall remain focused on the use of bibliography databases.

The data to be used must have a source and a structure. In the next sections we
describe the possible input.

\startsection[title=\BibTeX]

The \BIBTEX\ format is rather popular in the \TEX\ community and even with its
shortcomings it will stay around for a while. Many publication websites can
export and many tools are available to work with this database format. It is
rather simple and looks a bit like \index [LUA table] {\LUA\ table}\LUA\ tables.
Indeed, it is said that the \BIBTEX\ format was one of the inspirations for the
constructor syntax in \LUA\ \cite [alternative=num,
righttext={\btxcomma Chapter\nbsp 12.}] [default::Ierusalimschy2006].

Unfortunately the content can be (and usually is) polluted with
non|-|standardized \TEX\ commands which complicates pre- or post|-|processing
outside \TEX. In that sense a \BIBTEX\ database is often not coded neutrally.
Some limitations, like the use of commands to encode accented characters root in
the \ASCII\ world and can be bypassed by using \index [UTF] {\UTF}\UTF\ instead
(as handled somewhat in \LATEX\ through extensions such as \Tindex {bibtex8}).

The normal way to deal with a bibliography is to refer to entries using a unique
\Index {tag} or key. When a text containing a list of entries is typeset, this
reference can be used for linking purposes. The list can be processed and sorted
using the \Tindex {bibtex} program that converts the database into something more
\TEX\ friendly (a \Tindex {.bbl} file).

In \CONTEXT\ we no longer use the (external) \goto {\Tindex {bibtex} program}
[url(https://www.ctan.org/pkg/bibtex)] at all: we simply parse the database files
in \LUA\ and deal with the necessary manipulations directly in \CONTEXT. One or
more such databases can be used and combined with additional entries defined
within the document. We can have several such datasets active at the same time.

\startaside
\emphasis {On the name \Tindex {btx}:} many of the \CONTEXT\ commands that will be
used in the following contain the label \TEXcode {btx} in their name. This
identifier was retained despite the fact that \CONTEXT\ \MKIV\ is now completely
independent of \BIBTEX; it reflects the role still played by \BIBTEX\ data as a
preferred source format and serves as a handy, unique identifier, both internally
in the programming as well as for the user. This three|-|letter label is
systematically used in commands that otherwise attempt to avoid cryptic|-|styled
names.
\stopaside

A \BIBTEX\ file entry looks like this:

\startBTX
@Article {sometag,
    author  = "An Author and Another One",
    title   = "A hopefully meaningful title",
    journal = maps,
    volume  = "25",
    number  = "2",
    pages   = "5--9",
    month   = mar,
    year    = "2013",
    ISSN    = "1234-5678",
}
\stopBTX

Entries are of the form: \index {category}\BTXcode {@category{...}}

Anything outside of a valid \BTXcode {@category{...}} construction is ignored and
is taken to be a comment. Within an entry, there are to be no comments but one
can prefix field names, for example, to have them ignored.

There is a special entry type named \index {@comment}\BTXcode {@comment{...}}.
The main use of such an entry type is to comment a large part of the bibliography
easily, since anything outside an entry is already a comment, and commenting out
one entry may be achieved by just removing its initial~\BTXcode {@}. — The \index
{@comment}\BTXcode {@comment{...}} entry is perhaps of some use, although this is
not very elegant! As one can input multiple bibliography data files, as will be
seen below, it is much better practice to split datafiles for optional loading.

Many \BIBTEX\ data management tools such as \Tindex {jabref} (see below) will
ignore and then throw|-|away all such handily|-|crafted comments and data entries
turned into comments. So one must beware!

The field names are all cast to lowercase so capitalization is irrelevant;
Spacing is not important and should be used advantageously for readability. The
leading \Index {tag} (\BTXcode {sometag} in the example above) cannot contain
spaces and \emphasis {must} be followed by a comma.

The entry \Index {tag} (\BTXcode {@category{sometag,...}}) is not to be confused
with the optional field \BTXcode {key=sortkey,} that may also be present.

Normally a value is given between quotes (or curly brackets) but single words are
also valid (as there is no real benefit in not using quotes or curly brackets, we
advise to always use them, contrary to our example above). The order of the
fields in an entry is inconsequential and there can be many more fields than
those shown above. Instead of string values one can also use predefined
shortcuts. The title for example might quite often contain \TEX\ macros, and some
fields, like \BTXcode {pages} have funny characters such as the endash (typically
entered as \BTXcode {--}) so we have a mixture of data and typesetting
directives. Furthermore, if you are covering non||English references, you often
need characters that are not in the \ASCII\ subset. Note that \CONTEXT\ is quite
happy with \UTF, but if your database file uses old|-|fashioned \TEX\ accent
combinations then these will be internally converted automatically to \UTF.

Commands (macros) found in a database file are converted to an indirect call,
which is quite robust. The use of commands in the database file will be described
in \in {section} [sec:Commands].

The \Tindex {author} (and \Tindex {editor}) fields are parsed separating multiple
authors identified by the conjunction \quote {and}. Each name is assumed to be in
the form:

\definetyping
  [NameSyntax]
  [margin=1em]

\startNameSyntax
Firstname(s) Lastname
\stopNameSyntax

\seeindex {vons} {particule}

where \type {Lastname} is a single word but may include an optional (nobility)
\Index {particule}: lower|-|case word(s) such as \quotation {von}, \quotation
{de}, \quotation {de la}, etc.) \emphasis {unless} specifically in the two- or
three|-|token form:

\index {suffix}

\startNameSyntax
Lastname(s), Firstname(s)
Lastnames(s), Suffix(es), Firstname(s)
\stopNameSyntax

separated explicitly using comma(s) thus allowing multi|-|word \type {Lastnames}.

\startaside
An \BTXcode {author} field is sometimes abused in traditional \BIBTEX\ usage to
hold not a name but rather an entity. Other fields, such as \BTXcode
{organization} or \BTXcode {collaboration}, for example, should be used in such
cases.
\stopaside

\BIBTEX\ also (obscurely) supports the syntax:

\seeindex {juniors}{suffix}
\index {suffix}

\startNameSyntax
Firstname(s) \{Lastname(s), Suffix(es)\}
\stopNameSyntax

we may (or may not) support this in the future, so don't use this!

We extend \BIBTEX\ by optionally parsing each name in terms of four or five
tokens:

\index {particule} \index {suffix} \index {initial}

\startNameSyntax
Particule(s), Lastname(s), Suffix(es), Firstname(s)
Particule(s), Lastname(s), Suffix(es), Firstname(s), Initial(s)
\stopNameSyntax

in order to allow a free form for the particules, irrespective of capitalization,
thus avoiding the need to resort to any sort of \TEX\ trickery \cite [num]
[default::Patashnik1988,Markey2009]. In fact, an optional sixth token is parsed
whose meaning is presently reserved for future directives describing how the name
is to be interpreted:

\index {particule} \index {suffix} \index {initial}

\startNameSyntax
Particule(s), Lastname(s), Suffix(es), Firstname(s), Initial(s), directives
\stopNameSyntax

\BIBTEX\ additionally accepts the special token \Tindex {others} to be used
(sparingly) to indicate an incomplete author list. Note that most style
specifications will handle the truncation of long author lists in a systematic
fashion. The \index [others] {\tt and others}\BTXcode {and others} construction
finds its use when the complete author list is not well known or ill|-|defined.

Sometimes, or even often, the database might contain variants of an author's
name that we would like to identify as a single, unique author. Indeed, certain
bibliographic styles (as will be seen later) as well as an index of authors, for
example, will depend on this identification. A command \Cindex {btxremapauthor}
allows establishing this identity:

\startbuffer
\btxremapauthor [Donald Knuth] [Donald E. Knuth]
\btxremapauthor [Don Knuth]    [Donald E. Knuth]
\stopbuffer
\getbuffer

\cindex {btxremapauthor}
\typebuffer [option=TEX]

Fields other than \Tindex {author} and \Tindex {editor}, for example \Tindex
{artist} or \Tindex {director} if one desires, can be declared to be of type
\quote {author} and thus interpreted as names, but this is a subject for
specialists.

The \BTXcode {keywords} field can also be split into tokens separated by
semicolons (keyword; keyword; \unknown). This can be useful, as will be seen
later, in the creation of keyword indexes, for example.

Other string values such as \BTXcode {title} are kept literally (except for an
internal automatic conversion to \UTF\ of certain \TEX\ strings such as accent
combinations, endash, quotations, etc.). Note that the bibliography rendering
style (see below) might specify a capitalization of the title (using the
\CONTEXT\ commands \TEXcode {\Word} or \TEXcode {\Words}, for example).
Capitalized Names and acronyms are respected removing a need for the \BIBTEX\
practice of \quote {protecting} such words or letters with surrounding curly
brackets (which here are simply stripped off). (Furthermore, since \CONTEXT\ uses
\UTF, it does not suffer from all of the complicated \Index {sorting} issues that
plague \BIBTEX|/|\LATEX.) As some styles might not specify the capitalization of
words in the title whereas other styles might, it is recommended that strings be
written in lower case except where upper case is explicitly required so as to be
compatible with all such capitalization styles.

\startaside
Some bibliographic database sources can be quite sloppy and return strings
(titles and even authors) in all capitals, for example. We have made the design
choice \emphasis {not} to follow the \BIBTEX\ practice/feature of explicitly
formatting all string values, as we did not want to require the protection
through enclosing curly brackets that would have been a necessary consequence.
Thus, some cleaning of these database files might be needed. Furthermore, we
attempt to use all the power of \CONTEXT\ and \LUA, thus making unnecessary much
(most?) of the \TEX-like encoding of the data. We encourage users to clean|-|up
their \Tindex {.bib} database files as much as possible so that they contain only
the necessary data, with a minimum of explicit formatting directives.
\stopaside

String values, as described above, can be enclosed indifferently between matching
curly brackets: \BTXcode {{}} or pairs of quotation marks: \BTXcode {""}.
Multiple string values can be \index {string concatenation}concatenated using the
operator \BTXcode {\#}, as will be illustrated in \in {table}
[tab:mkiv-publications.bib].

Everything outside of a valid entry is ignored and treated as a \Index {comment}.
Syntactic errors (such as a missing comma or some unbalanced quotes or
parenthesis) are also skipped over, i.e. ignored. This is to attempt to continue
on to valid data but may lead to unexpected results. It is therefore the user's
responsibility to insure the correctness of the data files. Whereas some checks
and warnings are issued, the system is purposefully not too verbose.

Data is handled on a \quote {first come, first served} basis: duplicate \index
{duplicate+fields}\emphasis {fields} in an entry are ignored \startfootnote Note
that some \BIBTEX\ practice allows for the concatenation of duplicate name \index
{duplicate+fields}fields (i.e. \BTXcode {author} and \BTXcode {editor}) through
\BTXcode {and}, but (silently) ignores duplicate other fields. We choose to have
a consistant behavior and disallow duplicate field occurrences. \stopfootnote
though duplicate \index {duplicate+entries}\emphasis {entries} (having the same
\index {duplicate+tags}tag) are retained, but the subsequent identical \Index
{tag}s will be modified by adding a suffix $-n$ for the $n$\high {th} duplicate.
The presence of duplicate \index {duplicate+fields}fields or \index
{duplicate+tags}tags will be flagged as such with warnings in the log file.
Duplicate \index {duplicate+entries}entries using different \Index {tag}s will
not be treated as duplicates.

A special provision has been made to declare author \Index {synonyms}, that is
names that might occur with a variation of spellings or aliases. This shall be
discussed later.

We have attempted to remain compatible with the \BIBTEX\ format, and any new
bibliography extensions that we introduce here were designed in a way to remain
compatible with \BIBTEX, being simply ignored rather than potentially generating
a \BIBTEX\ error.

The \BIBTEX\ files are loaded in memory as \LUA\ table but can be converted to
\XML\ so that we can access them in a more flexible way, but that is another
subject for specialists.

\stopsection

\startsection [reference=sec:Commands,title=Commands in entries]

One unfortunate aspect commonly found in \BIBTEX\ files is that they may contain
\TEX\ commands. Even worse is that there is no standard on what these commands
can be and what they mean, at least not formally, as \BIBTEX\ is a program
intended to be used with many variants of \TEX\ style: plain, \LATEX, and others.
This means that we need to define our use of these typesetting commands. (In
particular, one might need to redefine those that are too \LATEX|-|centric.)
However, in most cases, they are just abbreviations or font switches and these
are often well known. Therefore, \CONTEXT\ will try to resolve them before
reporting an issue. The log file will announce the commands that have been seen
in the loaded databases. For instance, loading \Tindex {tugboat.bib} (distributed
with \TEXLIVE) gives a long list of commands of which we show a small set of the
five most frequently encountered ones here:

\startbuffer
\definebtxdataset[tugboat]
\usebtxdataset[tugboat][tugboat.bib]
\stopbuffer

\getbuffer

\starttyping
publications > tugboat  tt     134 known
publications > tugboat  Dash   136 unknown
publications > tugboat  acro   137 known
publications > tugboat  LaTeX  209 known
publications > tugboat  TeX    856 known
\stoptyping

Some are flagged as known and others as unknown. You can define unknown commands,
or overload existing definitions in the standard way (\emphasis {e.g.} \TEXcode
{\def\Dash{—}}), the \CONTEXT\ way (\TEXcode {\define\Dash{—}}) or,
alternatively, in the following way:

\cindex {definebtxcommand}

\startTEX
\definebtxcommand\TUB {TUGboat}
\definebtxcommand\MP  {METAPOST}
\definebtxcommand\sltt{\tt}
\definebtxcommand\<#1>{\type{#1}}
\stopTEX

\definebtxcommand\MP  {METAPOST} % to be used silently below

Custom commands created using \Cindex {definebtxcommand} have the advantage of
using a separate name space thus allowing \Index {isolation} from other \CONTEXT\
commands. (The \Index {isolation} of \Cindex {btxcommand} allows the \Tindex
{.bib} files to safely contain \TEX\ and \LATEX\ idiosyncrasies that might
conflict with proper \CONTEXT\ syntax.) Unknown commands do not stall processing,
but their names are then typeset in a mono|-|spaced font so they probably stand
out for proofreading. You can access the commands using \index
{btxcommand}\TEXcode {\btxcommand{...}} (or \Cindex {btxcmd}), as in:

\startbuffer
commands like \btxcommand{MySpecialCommand} are handled in an indirect way
\stopbuffer

\cindex {btxcommand}

\typeTEXbuffer

As this is an undefined command we get: \quotation {\inlinebuffer}.

Often, these embedded \TEX\ commands are present in \Tindex {.bib} files in order
to trick \BIBTEX\ into certain behavior. Since this will generally not be
necessary here, we strongly encourage users to clean|-|up such unnecessary
extras. Indeed, the idea is to keep the data clean, using styles and parameter
settings instead to handle rendering issues. Indeed, we don't see it as challenge
nor as a duty to support all kinds of messy definitions. Of course, we try to be
somewhat tolerant, but you will be sure to get better results if you use nicely
setup, consistent databases.

Finally, the \BIBTEX\ entry \tindex {@string}\BTXcode {@String{}} is preprocessed
as expected.

\tindex {@string}

\startTEX
@String{j-TUGboat = "TUGboat"}
\stopTEX

\startaside
Notice that \Tindex {tugboat.bib} also contains: \tindex {@preamble}
\startBTX
@Preamble{"\input tugboat.def"}
@Preamble{"\input path.sty"}
\stopBTX
These are silently ignored as many such commands are most likely not to be
compatible with \CONTEXT. Indeed, the examples shown here are not!
\stopaside

\stopsection

\startsection[title=\MKII\ definitions]

In the old \MKII\ setup we have two kinds of entries: the ones that come from the
\BIBTEX\ run and additional user|-|supplied ones. We no longer rely on \BIBTEX\
output but we do still support the user supplied definitions. These were in fact
prepared in a way that suits the processing of the \BIBTEX\ generated entries;
The next variant reflects the \CONTEXT\ recoding of the old \BIBTEX\ output. For
this reason, some users refer to this as \Tindex {.bbl} format.

\cindex {startpublication}
\cindex {stoppublication}

\startTEX
\startpublication[k=Hagen:Second,t=article,a={Hans Hagen},y=2013,s=HH01]
    \artauthor[] {Hans}[H.]{}{Hagen}
    \arttitle {Who knows more?}
    \journal  {MyJournal}
    \pubyear  {2013}
    \month    {8}
    \volume   {1}
    \issue    {3}
    \issn     {1234-5678}
    \pages    {123--126}
\stoppublication
\stopTEX

The split \TEXcode {\artauthor} fields will be collapsed into a single \TEXcode
{author} field as we handle the splitting later when it gets parsed in \LUA. The
\TEXcode {\artauthor} syntax is only kept around for backward compatibility with
the previous use of \BIBTEX.

In the new setup we support these variants:

\cindex {startpublication}
\cindex {stoppublication}

\startTEX
\startpublication[k=Hagen:Third,t=article]
    \author{Hans Hagen}
    \title {Who knows who?}
    ...
\stoppublication
\stopTEX

as well as

\cindex {startpublication}
\cindex {stoppublication}

\startTEX
\startpublication[tag=Hagen:Third,category=article]
    \author{Hans Hagen}
    \title {Who knows who?}
    ...
\stoppublication
\stopTEX

and

\cindex {startpublication}
\cindex {stoppublication}

\startTEX
\startpublication
    \tag     {Hagen:Third}
    \category{article}
    \author  {Hans Hagen}
    \title   {Who knows who?}
    ...
\stoppublication
\stopTEX

The use of this format will be illustrated later a means to export the database
which may be of great use in converting collections of \MKII\ bibliography files.

\showsetup[startpublication]

\stopsection

\startsection[title=\LUA\ tables]

Because internally the entries are \index [LUA table] {\LUA\ table}\LUA\ tables,
we also support the loading of \LUA\ based definitions:

\startLUA
return {
    ["Hagen:First"] = {
        author   = "Hans Hagen",
        category = "article",
        issn     = "1234-5678",
        issue    = "3",
        journal  = "MyJournal",
        month    = "8",
        pages    = "123--126",
        tag      = "Hagen:First",
        title    = "Who knows nothing?",
        volume   = "1",
        year     = "2013",
    },
}
\stopLUA

Notice that the \Index {tag} is redundantly specified; it is \quote {pushed} into
the table so that one can access it without having to know the \Index {tag} of the
original table.

\stopsection

\startsection[title=\XML]

The following \index [XML] {\XML}\XML\ input is rather close in structure, and is
also accepted as input.

\startXML
<?xml version="2.0" standalone="yes" ?>
<bibtex>
    <entry tag="Hagen:First" category="article">
        <field name="author">Hans Hagen</field>
        <field name="category">article</field>
        <field name="issn">1234-5678</field>
        <field name="issue">3</field>
        <field name="journal">MyJournal</field>
        <field name="month">8</field>
        <field name="pages">123--126</field>
        <field name="tag">Hagen:First</field>
        <field name="title">Who knows nothing?</field>
        <field name="volume">1</field>
        <field name="year">2013</field>
    </entry>
</bibtex>
\stopXML

We shall focus on the use of \BIBTEX\ \Tindex {.bib} files as the input data
format of reference. Keep in mind, however, that the \index [LUA table] {\LUA\
table}\LUA\ table format and the \index [XML] {\XML}\XML\ format might prove to
be more flexible for future expansion of functionality.

\stopsection

\startsection[title=Other formats]

Various other bibliographic data file formats are in common use, such as:

\starttabulate [|Tl|p|]
\NC savedrecs.txt     \NC Institute of Scientific Information (ISI) tagged format
                          (e.g. Thomson Reuters™ Web of Science™), \NC \NR
\NC filename.enw      \NC Thomson Reuters™ Endnote™ export format
                          (there is also an Endnote \type {.xml} export), \NC \NR
\NC filename.ris      \NC Research Information Systems, Incorporated, now
                          Thomson Reuters™ Reference Manager™, and \NC \NR
\NC pubmed_result.txt \NC The National Library of Medicine® (NLM®)
                          MEDLINE®|/|PubMed® data format \NC \NR
\stoptabulate

just to name a few (amongst many more). Filters can be easily written in \LUA\ to
read these and other bibliography data formats, although no such filters are
provided. This is because the user has a choice of a certain number of
bibliography database management programs that can easily convert from these to
the \BIBTEX\ format. (Notable, open source examples are \index {jabref} \goto
{jabref} [url(http://jabref.sourceforge.net)] and \index {zotero} \goto {zotero}
[url(http://www.zotero.org)].) Indeed, it is not the vocation of the present
\CONTEXT\ bibliography subsystem to fully manage the bibliography data sources,
only to be able to use such data in the production of documents.

\startaside
\emphasis {A note on database management programs:} these are very valuable tools
for the manipulation of bibliography database information, which is why the
\BIBTEX\ format has so much importance for us here. However, one must be aware
that these programs are not standards and many of them may introduce invalid
extensions that might not even be handled correctly by \BIBTEX\ itself.
\stopaside

\stopsection

\stopchapter

\stopcomponent