\environment publications-style

\startcomponent publications-database

\startchapter[title=The database]

The bibliography subsystem uses a database (or a set of databases) to construct
a list of citations to be used in a scholarly work. However, it will be shown
later that the database system can be used (and abused) to many ends having
little or nothing at all to do with citations and bibliographies. Nevertheless,
at first we shall remain focused on the use of bibliography databases.

The data to be used must have a source and a structure. In the next sections we
describe the possible input.

\startsection[title=\BibTeX]

The \BIBTEX\ format is rather popular in the \TEX\ community and even with its
shortcomings it will stay around for a while. Many publication websites can
export to it and many tools are available to work with this database format. It
is rather simple and looks a bit like \index [LUA table] {\LUA\ table}\LUA\ tables.
Indeed, it is said that the \BIBTEX\ format was one of the inspirations for the
constructor syntax in \LUA\ \cite [alternative=num, righttext={\btxcomma Chapter\nbsp 12.}] [default::Ierusalimschy2006].
Unfortunately the content can be (and usually is) polluted with
non|-|standardized \TEX\ commands, which complicates pre- or post|-|processing
outside \TEX. In that sense a \BIBTEX\ database is often not coded neutrally.
Some limitations, like the use of commands to encode accented characters, are
rooted in the \ASCII\ world and can be bypassed by using \index [UTF] {\UTF}\UTF\ instead
(as handled to some extent in \LATEX\ through extensions such as \Tindex {bibtex8}).

The normal way to deal with a bibliography is to refer to entries using a unique
\Index {tag} or key. When a text containing a list of entries is typeset, this
reference can be used for linking purposes. The list can be processed and sorted
using the \Tindex {bibtex} program that converts the database into something
more \TEX\ friendly (a \Tindex {.bbl} file).

In \CONTEXT\ we no longer use the (external)
\goto {\Tindex {bibtex} program} [url(https://www.ctan.org/pkg/bibtex)]
at all: we simply parse the database files in \LUA\ and deal with the necessary
manipulations directly in \CONTEXT. One or more such databases can be used and
combined with additional entries defined within the document. We can have
several such datasets active at the same time.

\startaside
\emphasis {On the name \Tindex {btx}:} many of the \CONTEXT\ commands that will
be used in the following contain the label \TEXcode {btx} in their name. This
identifier was retained despite the fact that \CONTEXT\ \MKIV\ is now completely
independent of \BIBTEX; it reflects the role still played by \BIBTEX\ data as a
preferred source format and serves as a handy, unique identifier, both
internally in the programming as well as for the user. This three|-|letter label
is systematically used in commands that otherwise attempt to avoid
cryptic|-|styled names.
\stopaside

A \BIBTEX\ file entry looks like this:

\startBTX
@Article {sometag,
    author  = "An Author and Another One",
    title   = "A hopefully meaningful title",
    journal = maps,
    volume  = "25",
    number  = "2",
    pages   = "5--9",
    month   = mar,
    year    = "2013",
    ISSN    = "1234-5678",
}
\stopBTX

Entries are of the form: \index {category}\BTXcode {@category{...}}

Anything outside of a valid \BTXcode {@category{...}} construction is ignored
and is taken to be a comment. Comments are not allowed within an entry, but a
field can be ignored by, for example, adding a prefix to its name.

There is a special entry type named \index {@comment}\BTXcode {@comment{...}}.
The main use of such an entry type is to comment a large part of the
bibliography easily, since anything outside an entry is already a comment, and
commenting out one entry may be achieved by just removing its
initial~\BTXcode {@}. The \index {@comment}\BTXcode {@comment{...}} entry is
thus perhaps of some use, although this is not very elegant!

As one can input multiple bibliography data files, as will be seen below, it is
much better practice to split data files for optional loading. Many \BIBTEX\ data
management tools such as \Tindex {jabref} (see below) will ignore and then throw
away all such hand|-|crafted comments and data entries turned into comments. So
one must beware!

The field names are all cast to lowercase so capitalization is irrelevant.
Spacing is not important and should be used advantageously for readability. The
leading \Index {tag} (\BTXcode {sometag} in the example above) cannot contain
spaces and \emphasis {must} be followed by a comma. The entry \Index {tag}
(\BTXcode {@category{sometag,...}}) is not to be confused with the optional
field \BTXcode {key=sortkey,} that may also be present.

Normally a value is given between quotes (or curly brackets) but single words
are also valid (as there is no real benefit in not using quotes or curly
brackets, we advise always using them, contrary to our example above). The order
of the fields in an entry is inconsequential and there can be many more fields
than those shown above. Instead of string values one can also use predefined
shortcuts. The title, for example, might quite often contain \TEX\ macros, and
some fields, like \BTXcode {pages}, have funny characters such as the endash
(typically entered as \BTXcode {--}), so we have a mixture of data and
typesetting directives. Furthermore, if you are covering non||English
references, you often need characters that are not in the \ASCII\ subset. Note
that \CONTEXT\ is quite happy with \UTF, but if your database file uses
old|-|fashioned \TEX\ accent combinations then these will be internally
converted automatically to \UTF.

Commands (macros) found in a database file are converted to an indirect call,
which is quite robust. The use of commands in the database file will be
described in \in {section} [sec:Commands].

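
The conventions described so far can be summarized in a small, purely
hypothetical fragment. The tags, names and values are invented for illustration,
and the prefix \BTXcode {x} is an arbitrary choice that is merely assumed not to
clash with any field name used by a rendering style:

\startBTX
Text outside of an entry, such as this line, is treated as a comment.

@Book{active-entry,
    Author = {Kurt G\"odel},
    EDITOR = {Kurt Gödel},
    title  = {A title containing a range written as 1931--1951},
    xnote  = {a field set aside by prefixing its name},
    year   = {1951},
}

Book{disabled-entry,
    author = {An Author},
    title  = {An entry disabled by removing its initial at-sign},
    year   = {2012},
}
\stopBTX

If the behavior is as described above, only \BTXcode {active-entry} is loaded,
its \BTXcode {Author} and \BTXcode {EDITOR} fields are stored under lower|-|case
names and should end up holding the same \UTF\ string, and the \BTXcode {xnote}
field is expected simply to go unused.
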
The \Tindex {author} (and \Tindex {editor}) fields are parsed separating
multiple authors identified by the conjunction \quote {and}. Each name is
assumed to be in the form:

\definetyping [NameSyntax] [margin=1em]

\startNameSyntax
Firstname(s) Lastname
\stopNameSyntax

\seeindex {vons} {particule}
where \type {Lastname} is a single word but may include an optional (nobility)
\Index {particule} (lower|-|case word(s) such as \quotation {von},
\quotation {de}, \quotation {de la}, etc.) \emphasis {unless} specifically in
the two- or three|-|token form:

\index {suffix}
\startNameSyntax
Lastname(s), Firstname(s)
Lastname(s), Suffix(es), Firstname(s)
\stopNameSyntax

separated explicitly using comma(s) thus allowing multi|-|word \type {Lastnames}.

\startaside
An \BTXcode {author} field is sometimes abused in traditional \BIBTEX\ usage to
hold not a name but rather an entity. Other fields, such as
\BTXcode {organization} or \BTXcode {collaboration}, for example, should be used
in such cases.
\stopaside

\BIBTEX\ also (obscurely) supports the syntax:

\seeindex {juniors}{suffix}
\index {suffix}
\startNameSyntax
Firstname(s) \{Lastname(s), Suffix(es)\}
\stopNameSyntax

we may (or may not) support this in the future, so don't use this!

We extend \BIBTEX\ by optionally parsing each name in terms of four or five
tokens:

\index {particule}
\index {suffix}
\index {initial}
\startNameSyntax
Particule(s), Lastname(s), Suffix(es), Firstname(s)
Particule(s), Lastname(s), Suffix(es), Firstname(s), Initial(s)
\stopNameSyntax

in order to allow a free form for the particules, irrespective of
capitalization, thus avoiding the need to resort to any sort of \TEX\ trickery
\cite [num] [default::Patashnik1988,Markey2009].

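
As an illustration only, the following hypothetical entry combines the name
forms listed above in a single \BTXcode {author} field. The names and the tag
are invented (or borrowed merely as familiar spellings), and the way each name
is split is what the rules above lead one to expect rather than verified
output:

\startBTX
@Book{names-example,
    author = {Ludwig van Beethoven
              and de la Fontaine, Jean
              and King, Jr., Martin Luther
              and Van, Example, Jr., Anna},
    title  = {An illustration of the name forms},
    year   = {2013},
}
\stopBTX

The first name uses the plain form with a lower|-|case particule, the second
the two|-|token form (so that the part before the comma is taken as the
lastname), the third the three|-|token form with a suffix, and the fourth the
extended four|-|token form in which a capitalized particule is given
explicitly.
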
In fact, an optional sixth token is parsed whose meaning is presently reserved
for future directives describing how the name is to be interpreted:

\index {particule}
\index {suffix}
\index {initial}
\startNameSyntax
Particule(s), Lastname(s), Suffix(es), Firstname(s), Initial(s), directives
\stopNameSyntax

\BIBTEX\ additionally accepts the special token \Tindex {others} to be used
(sparingly) to indicate an incomplete author list. Note that most style
specifications will handle the truncation of long author lists in a systematic
fashion. The \index [others] {\tt and others}\BTXcode {and others} construction
finds its use when the complete author list is not well known or is
ill|-|defined.

Sometimes, or even often, the database might contain variants of an author's
name that we would like to identify as a single, unique author. Indeed, certain
bibliographic styles (as will be seen later) as well as an index of authors, for
example, will depend on this identification. A command \Cindex {btxremapauthor}
allows establishing this identity:

\startbuffer
\btxremapauthor [Donald Knuth] [Donald E. Knuth]
\btxremapauthor [Don Knuth]    [Donald E. Knuth]
\stopbuffer

\getbuffer

\cindex {btxremapauthor}
\typebuffer [option=TEX]

Fields other than \Tindex {author} and \Tindex {editor}, for example
\Tindex {artist} or \Tindex {director} if one desires, can be declared to be of
type \quote {author} and thus interpreted as names, but this is a subject for
specialists.

The \BTXcode {keywords} field can also be split into tokens separated by
semicolons (keyword; keyword; \unknown). This can be useful, as will be seen
later, in the creation of keyword indexes, for example.

Other string values such as \BTXcode {title} are kept literally (except for an
internal automatic conversion to \UTF\ of certain \TEX\ strings such as accent
combinations, endash, quotations, etc.).

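
A minimal sketch of an entry using both the \BTXcode {and others} construction
and a semicolon|-|separated \BTXcode {keywords} field might look as follows
(the tag, the names and the keywords are all invented for illustration):

\startBTX
@Article{keywords-example,
    author   = {First Author and Second Author and others},
    title    = {An entry with keywords and an incomplete author list},
    journal  = {A Journal},
    keywords = {typesetting; bibliographies; databases},
    year     = {2013},
}
\stopBTX

Following the description above, the \BTXcode {keywords} value would be split
into three tokens, while the final \BTXcode {others} marks the author list as
incomplete rather than naming a third author.
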
Note that the bibliography rendering style (see below) might specify a
capitalization of the title (using the \CONTEXT\ commands \TEXcode {\Word} or
\TEXcode {\Words}, for example). Capitalized names and acronyms are respected,
removing the need for the \BIBTEX\ practice of \quote {protecting} such words or
letters with surrounding curly brackets (which here are simply stripped off).
(Furthermore, since \CONTEXT\ uses \UTF, it does not suffer from all of the
complicated \Index {sorting} issues that plague \BIBTEX|/|\LATEX.) As some
styles might not specify the capitalization of words in the title whereas other
styles might, it is recommended that strings be written in lower case except
where upper case is explicitly required so as to be compatible with all such
capitalization styles.

\startaside
Some bibliographic database sources can be quite sloppy and return strings
(titles and even authors) in all capitals, for example. We have made the design
choice \emphasis {not} to follow the \BIBTEX\ practice/feature of explicitly
formatting all string values, as we did not want to require the protection
through enclosing curly brackets that would have been a necessary consequence.
Thus, some cleaning of these database files might be needed.

Furthermore, we attempt to use all the power of \CONTEXT\ and \LUA, thus making
unnecessary much (most?) of the \TEX-like encoding of the data. We encourage
users to clean up their \Tindex {.bib} database files as much as possible so
that they contain only the necessary data, with a minimum of explicit formatting
directives.
\stopaside

String values, as described above, can be enclosed indifferently between
matching curly brackets: \BTXcode {{}} or pairs of quotation marks:
\BTXcode {""}. Multiple string values can be
\index {string concatenation}concatenated using the operator \BTXcode {\#}, as
will be illustrated in \in {table} [tab:mkiv-publications.bib].

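
A short, hypothetical fragment may help to fix ideas about concatenation and
about the recommended lower|-|case style. The shortcut name \BTXcode {tub}, the
tag and the field values are invented for illustration; only the
\BTXcode {@String} declaration and the concatenation operator are standard
\BIBTEX:

\startBTX
@String{tub = "TUGboat"}

@Article{concatenation-example,
    author  = {An Author},
    title   = {A title in lower case except for TeX and Lua},
    journal = tub # " (special issue)",
    year    = {2013},
}
\stopBTX

Here the \BTXcode {journal} value is built from the predefined shortcut followed
by a quoted string, and the title carries no protective curly brackets around
the words that must keep their capitals.
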
Everything outside of a valid entry is ignored and treated as a
\Index {comment}. Syntactic errors (such as a missing comma or some unbalanced
quotes or parentheses) are also skipped over, i.e. ignored. This is to attempt
to continue on to valid data but may lead to unexpected results. It is therefore
the user's responsibility to ensure the correctness of the data files. Whereas
some checks and warnings are issued, the system is purposefully not too verbose.

Data is handled on a \quote {first come, first served} basis: duplicate
\index {duplicate+fields}\emphasis {fields} in an entry are ignored
\startfootnote
Note that some \BIBTEX\ practice allows for the concatenation of duplicate name
\index {duplicate+fields}fields (i.e. \BTXcode {author} and \BTXcode {editor})
through \BTXcode {and}, but (silently) ignores duplicate other fields. We choose
to have a consistent behavior and disallow duplicate field occurrences.
\stopfootnote
whereas duplicate \index {duplicate+entries}\emphasis {entries} (having the same
\index {duplicate+tags}tag) are retained, with the subsequent identical
\Index {tag}s modified by adding a suffix $-n$ for the $n$\high {th} duplicate.
The presence of duplicate \index {duplicate+fields}fields or
\index {duplicate+tags}tags will be flagged as such with warnings in the log
file. Duplicate \index {duplicate+entries}entries using different \Index {tag}s
will not be treated as duplicates.

A special provision has been made to declare author \Index {synonyms}, that is
names that might occur with a variation of spellings or aliases. This shall be
discussed later.

We have attempted to remain compatible with the \BIBTEX\ format, and any new
bibliography extensions that we introduce here were designed in a way to remain
compatible with \BIBTEX, being simply ignored rather than potentially generating
a \BIBTEX\ error.

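
The treatment of duplicates described above can be pictured with the following
hypothetical fragment (tags and values are invented for illustration):

\startBTX
@Book{same-tag,
    author = {An Author},
    title  = {The first title is kept},
    title  = {A duplicate field that is ignored},
    year   = {2013},
}

@Book{same-tag,
    author = {Another Author},
    title  = {A second entry using the same tag},
    year   = {2014},
}
\stopBTX

According to the rules given above, the second \BTXcode {title} field of the
first entry would be dropped, the second entry would be retained under a
modified tag (presumably \BTXcode {same-tag-1}, if the suffix is formed exactly
as described), and both situations would be reported in the log file.
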
The \BIBTEX\ files are loaded in memory as \LUA\ tables but can be converted to
\XML\ so that we can access them in a more flexible way, but that is another
subject for specialists.

\stopsection

\startsection [reference=sec:Commands,title=Commands in entries]

One unfortunate aspect commonly found in \BIBTEX\ files is that they may contain
\TEX\ commands. Even worse is that there is no standard on what these commands
can be and what they mean, at least not formally, as \BIBTEX\ is a program
intended to be used with many variants of \TEX\ style: plain, \LATEX, and
others. This means that we need to define our use of these typesetting commands.
(In particular, one might need to redefine those that are too
\LATEX|-|centric.) However, in most cases, they are just abbreviations or font
switches and these are often well known. Therefore, \CONTEXT\ will try to
resolve them before reporting an issue. The log file will announce the commands
that have been seen in the loaded databases. For instance, loading
\Tindex {tugboat.bib} (distributed with \TEXLIVE) gives a long list of commands
of which we show a small set of the five most frequently encountered ones here:

\startbuffer
\definebtxdataset[tugboat]
\usebtxdataset[tugboat][tugboat.bib]
\stopbuffer

\getbuffer

\starttyping
publications > tugboat  tt     134  known
publications > tugboat  Dash   136  unknown
publications > tugboat  acro   137  known
publications > tugboat  LaTeX  209  known
publications > tugboat  TeX    856  known
\stoptyping

Some are flagged as known and others as unknown.
You can define unknown commands, or overload existing definitions in the
standard way (\emphasis {e.g.} \TEXcode {\def\Dash{—}}), the \CONTEXT\ way
(\TEXcode {\define\Dash{—}}) or, alternatively, in the following way:

\cindex {definebtxcommand}
\startTEX
\definebtxcommand\TUB {TUGboat}
\definebtxcommand\MP  {METAPOST}
\definebtxcommand\sltt{\tt}
\definebtxcommand\<#1>{\type{#1}}
\stopTEX

\definebtxcommand\MP {METAPOST} % to be used silently below

Custom commands created using \Cindex {definebtxcommand} have the advantage of
using a separate name space thus allowing \Index {isolation} from other
\CONTEXT\ commands. (The \Index {isolation} of \Cindex {btxcommand} allows the
\Tindex {.bib} files to safely contain \TEX\ and \LATEX\ idiosyncrasies that
might conflict with proper \CONTEXT\ syntax.)

Unknown commands do not stall processing, but their names are then typeset in a
mono|-|spaced font so that they stand out for proofreading. You can access the
commands using \index {btxcommand}\TEXcode {\btxcommand{...}} (or
\Cindex {btxcmd}), as in:

\startbuffer
commands like \btxcommand{MySpecialCommand} are handled in an indirect way
\stopbuffer

\cindex {btxcommand}
\typeTEXbuffer

As this is an undefined command we get: \quotation {\inlinebuffer}.

Often, these embedded \TEX\ commands are present in \Tindex {.bib} files in
order to trick \BIBTEX\ into certain behavior. Since this will generally not be
necessary here, we strongly encourage users to clean up such unnecessary extras.
The idea is to keep the data clean, using styles and parameter settings instead
to handle rendering issues. We see it neither as a challenge nor as a duty to
support all kinds of messy definitions. Of course, we try to be somewhat
tolerant, but you will be sure to get better results if you use nicely set up,
consistent databases.

Finally, the \BIBTEX\ entry \tindex {@string}\BTXcode {@String{}} is
preprocessed as expected.
\tindex {@string}
\startTEX
@String{j-TUGboat = "TUGboat"}
\stopTEX

\startaside
Notice that \Tindex {tugboat.bib} also contains:

\tindex {@preamble}
\startBTX
@Preamble{"\input tugboat.def"}
@Preamble{"\input path.sty"}
\stopBTX

These are silently ignored as many such commands are most likely not to be
compatible with \CONTEXT. Indeed, the examples shown here are not!
\stopaside

\stopsection

\startsection[title=\MKII\ definitions]

In the old \MKII\ setup we have two kinds of entries: the ones that come from
the \BIBTEX\ run and additional user|-|supplied ones. We no longer rely on
\BIBTEX\ output but we do still support the user supplied definitions. These
were in fact prepared in a way that suits the processing of the \BIBTEX\ generated
entries. The next variant reflects the \CONTEXT\ recoding of the old
\BIBTEX\ output. For this reason, some users refer to this as \Tindex {.bbl}
format.

\cindex {startpublication}
\cindex {stoppublication}
\startTEX
\startpublication[k=Hagen:Second,t=article,a={Hans Hagen},y=2013,s=HH01]
    \artauthor[] {Hans}[H.]{}{Hagen}
    \arttitle    {Who knows more?}
    \journal     {MyJournal}
    \pubyear     {2013}
    \month       {8}
    \volume      {1}
    \issue       {3}
    \issn        {1234-5678}
    \pages       {123--126}
\stoppublication
\stopTEX

The split \TEXcode {\artauthor} fields will be collapsed into a single
\TEXcode {author} field as we handle the splitting later when it gets parsed in
\LUA. The \TEXcode {\artauthor} syntax is only kept around for backward
compatibility with the previous use of \BIBTEX.

In the new setup we support these variants:

\cindex {startpublication}
\cindex {stoppublication}
\startTEX
\startpublication[k=Hagen:Third,t=article]
    \author{Hans Hagen}
    \title {Who knows who?}
    ...
\stoppublication
\stopTEX

as well as

\cindex {startpublication}
\cindex {stoppublication}
\startTEX
\startpublication[tag=Hagen:Third,category=article]
    \author{Hans Hagen}
    \title {Who knows who?}
    ...
\stoppublication
\stopTEX

and

\cindex {startpublication}
\cindex {stoppublication}
\startTEX
\startpublication
    \tag     {Hagen:Third}
    \category{article}
    \author  {Hans Hagen}
    \title   {Who knows who?}
    ...
\stoppublication
\stopTEX

The use of this format will be illustrated later as a means to export the
database, which may be of great use in converting collections of
\MKII\ bibliography files.

\showsetup[startpublication]

\stopsection

\startsection[title=\LUA\ tables]

Because internally the entries are \index [LUA table] {\LUA\ table}\LUA\ tables,
we also support the loading of \LUA\ based definitions:

\startLUA
return {
    ["Hagen:First"] = {
        author   = "Hans Hagen",
        category = "article",
        issn     = "1234-5678",
        issue    = "3",
        journal  = "MyJournal",
        month    = "8",
        pages    = "123--126",
        tag      = "Hagen:First",
        title    = "Who knows nothing?",
        volume   = "1",
        year     = "2013",
    },
}
\stopLUA

Notice that the \Index {tag} is redundantly specified; it is \quote {pushed}
into the table so that one can access it without having to know the
\Index {tag} of the original table.

\stopsection

\startsection[title=\XML]

The following \index [XML] {\XML}\XML\ input is rather close in structure, and
is also accepted as input.

\startXML
<bibtex>
    <entry tag="Hagen:First" category="article">
        <field name="author">Hans Hagen</field>
        <field name="category">article</field>
        <field name="issn">1234-5678</field>
        <field name="issue">3</field>
        <field name="journal">MyJournal</field>
        <field name="month">8</field>
        <field name="pages">123--126</field>
        <field name="tag">Hagen:First</field>
        <field name="title">Who knows nothing?</field>
        <field name="volume">1</field>
        <field name="year">2013</field>
    </entry>
</bibtex>
\stopXML

We shall focus on the use of \BIBTEX\ \Tindex {.bib} files as the input data
format of reference. Keep in mind, however, that the
\index [LUA table] {\LUA\ table}\LUA\ table format and the
\index [XML] {\XML}\XML\ format might prove to be more flexible for future
expansion of functionality.

\stopsection

\startsection[title=Other formats]

Various other bibliographic data file formats are in common use, such as:

\starttabulate [|Tl|p|]
\NC savedrecs.txt     \NC Institute of Scientific Information (ISI) tagged format (e.g. Thomson Reuters™ Web of Science™), \NC \NR
\NC filename.enw      \NC Thomson Reuters™ Endnote™ export format (there is also an Endnote \type {.xml} export), \NC \NR
\NC filename.ris      \NC Research Information Systems, Incorporated, now Thomson Reuters™ Reference Manager™, and \NC \NR
\NC pubmed_result.txt \NC The National Library of Medicine® (NLM®) MEDLINE®|/|PubMed® data format \NC \NR
\stoptabulate

just to name a few (amongst many more). Filters can be easily written in
\LUA\ to read these and other bibliography data formats, although no such
filters are provided. This is because the user has a choice of a certain number
of bibliography database management programs that can easily convert from these
to the \BIBTEX\ format. (Notable open|-|source examples are
\index {jabref} \goto {jabref} [url(http://jabref.sourceforge.net)] and
\index {zotero} \goto {zotero} [url(http://www.zotero.org)].) Indeed, it is not
the vocation of the present \CONTEXT\ bibliography subsystem to fully manage the
bibliography data sources, only to be able to use such data in the production of
documents.

\startaside
\emphasis {A note on database management programs:} these are very valuable
tools for the manipulation of bibliography database information, which is why
the \BIBTEX\ format has so much importance for us here. However, one must be
aware that these programs are not standards and many of them may introduce
invalid extensions that might not even be handled correctly by \BIBTEX\ itself.
\stopaside

\stopsection

\stopchapter

\stopcomponent