Thursday, September 10, 2015

Hoyle Bibliography: technology update

While it has been ages since I last posted on this blog, I have been monumentally busy with Hoyle. Last November, I announced that I would be starting an online descriptive bibliography of Hoyle. This post highlights the progress I have made, both with content and with the supporting technology.

Underlying Approach

The bibliography nears a major milestone. I have completed descriptions of all but a handful of the 18th century editions of Hoyle, the task I had originally contemplated. Inevitably, my scope has expanded, and I’m well into the 19th century. It is difficult to find a graceful stopping point. In addition to the content, the bibliography is a significant and apparently unique effort in the digital humanities. So far I have created 170 bibliographical descriptions, storing each in a file for validation, processing, and display. The programming effort has been substantial and is ongoing, but continues to pay for itself many times over. This blog essay discusses the technology I have developed in the course of compiling the Hoyle bibliography.

My primary goal was to create bibliographical descriptions of the books that could be presented in multiple ways—initially as a web site and then in a word processing document leading to print publication. I expect that others will be able to extract data programmatically from my descriptions if desired—perhaps a library wishes to update its catalogue or a collector wishes to build a checklist. This goal, one data source with multiple presentations, dictated storing the descriptions in a highly structured way.

A second goal was to avoid errors and inconsistencies in bibliographical descriptions and their presentation. As to the descriptions, collation formulas and pagination statements should total to the same number of leaves. Deletions and signing errors should refer to leaves actually in the collation formula. Signature references and page references should point to the same page. I have seen each of these errors in printed bibliographies—mistakes are inevitable. Formatting is equally error prone. Fredson Bowers’ Principles of Bibliographical Description is the standard for descriptive bibliography, including the collation formula and pagination statement. Bowers requires dexterous use of brackets, italics, commas, semicolons, superscripts and subscripts. Proofreading is hardly...foolproof. It seemed as though there should be better solutions.

The desire to avoid errors led to the same design decision suggested earlier: highly structured data. Following other digital humanities projects, particularly TEI (about which more below), I chose XML as an underlying technology. A brief excerpt from one of my book descriptions will show how structured XML data can reduce error. Consider Whist.3 (my description is online here), which has one of the simpler collation formulas:

12°: A–D¹² E⁴ [$½ (-A2,B2) signed; missigning B4 as B5]; 52 leaves, pp. [8] [1] 2–96

The data used to produce the collation formula is:

        <collation>
            <format>12</format>
            <collationFormula>
                <gatherings>
                    <gatheringRange signed="true">
                        <sigStart>A</sigStart>
                        <sigEnd>D</sigEnd>
                        <leaves>12</leaves>
                    </gatheringRange>
                    <gatheringRange signed="true">
                        <sigStart>E</sigStart>
                        <leaves>4</leaves>
                    </gatheringRange>
                </gatherings>
                <signatureLeaves>$½</signatureLeaves>
                <anomSignatures>
                    <anomSignature>
                        <anomType>-</anomType>
                        <sigRef>A2</sigRef>
                    </anomSignature>
                    <anomSignature>
                        <anomType>-</anomType>
                        <sigRef>B2</sigRef>
                    </anomSignature>
                </anomSignatures>
                <signingErrors>
                    <signingError>
                        <sigRef>B4</sigRef>
                        <badSig>B5</badSig>
                    </signingError>
                </signingErrors>
            </collationFormula>
            <totalLeaves>52</totalLeaves>
            <pagination>
                <pageRanges>
                    <pageRange numbered="false" range="true">
                        <start>8</start>
                    </pageRange>
                    <pageRange numbered="false">
                        <start>1</start>
                    </pageRange>
                    <pageRange numbered="true">
                        <start>2</start>
                        <end>96</end>
                    </pageRange>
                </pageRanges>
            </pagination>
        </collation>

XML is a hierarchical structure: elements have values (the book's format is 12, a duodecimo) and attributes have values (page 1 is unnumbered, pages 2–96 are numbered). Everything is text and therefore readable by humans, particularly when indented in an outline form that reveals the structure. In the example above, the collation consists of format, collation formula, total leaves, and pagination. The collation formula consists of gatherings, signature leaves (indicating normal signing), and anomalous signatures. Each gathering range within the gatherings has a starting signature (sigStart), an optional ending signature, and a number of leaves. A gathering range may be signed or unsigned. The pagination section is similar. More complicated books will use other optional elements.
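
Reading these values programmatically takes only a few lines. Here is a minimal sketch using Python’s standard ElementTree library, assuming the collation excerpt above is saved as whist3.xml (a file name chosen purely for illustration):

    import xml.etree.ElementTree as ET

    # Parse the <collation> excerpt (the file name is illustrative only).
    collation = ET.parse("whist3.xml").getroot()

    # Element values: the format and the total leaf count.
    print(collation.findtext("format"))       # 12
    print(collation.findtext("totalLeaves"))  # 52

    # Attribute values: which page ranges carry printed numbers.
    for page_range in collation.iter("pageRange"):
        print(page_range.findtext("start"), page_range.get("numbered"))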

How does this encoding help avoid error? First, the data is validated against an XML schema I created. The schema is a formal description of the rules for describing a book. The schema requires elements such as collation, collation formula, signature leaves, etc. The element anomalous signatures is optional, as are elements for signing errors, duplicated signatures, doubled alphabets, insertions, deletions, and free-form notes. Failure to include a required element or inclusion of an unexpected element will generate an error.

Moreover, the XML schema restricts the values allowed for each element. For example, format is limited to a small set of values such as 8 for octavo, 12 for duodecimo, etc. Entering 13 into the format field will generate an error. The schema is rather complex, but does an admirable job of preventing errors.
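
Validation can also be scripted outside the editor. A minimal sketch, assuming the lxml library (not necessarily what I use) and illustrative file names for the schema and a description:

    from lxml import etree

    # File names are illustrative; the schema encodes the rules described above.
    schema = etree.XMLSchema(etree.parse("hoyle.xsd"))
    description = etree.parse("whist3.xml")

    if not schema.validate(description):
        # Report violations such as a missing required element or a
        # <format> value outside the allowed set (8, 12, and so on).
        for error in schema.error_log:
            print(error.line, error.message)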

One might expect that all of the tags, required structure, and rules for allowed values would add substantial effort when inputting data. Indeed, the above snippet of XML for Whist.3 is much more verbose than the collation formula. Surprisingly, there is much less data entry. Much less. Modern tools will read the required structure contained in the XML schema, insert most of the tags, and suggest allowed values for the data. Most of the typing is done for you. And as we shall see below, you don’t have to worry about brackets, italics, superscripts and the like—that is handled elsewhere.

Once the data is structured and individual elements are known to have valid values, it is possible to check them for internal consistency. For example, I have written a program to read the collation formula, count the number of leaves it implies, and compare it with the element total leaves, flagging any discrepancy as an error. Similarly, the pagination statement implies a total number of pages that is expected to be twice the number of leaves. In the example above, there are four gatherings of 12 leaves and one of 4, totaling 52 leaves and 104 pages. Check.
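
The check itself is only a few lines once the data is structured. The sketch below is not my production code, but it shows the idea, using the element names from the Whist.3 excerpt, a simplified printer’s alphabet, and my reading of the pagination attributes; the real program must also cope with the optional elements mentioned above.

    import xml.etree.ElementTree as ET

    # The 23-letter printer's alphabet omits J, U, and W; a simplification here.
    ALPHABET = "ABCDEFGHIKLMNOPQRSTVXYZ"

    def check_counts(collation):
        """Compare the leaves implied by the collation formula with the
        stated total, and the implied pages with twice that total."""
        implied_leaves = 0
        for rng in collation.iter("gatheringRange"):
            start = rng.findtext("sigStart")
            end = rng.findtext("sigEnd") or start   # single gathering if no sigEnd
            gatherings = ALPHABET.index(end) - ALPHABET.index(start) + 1
            implied_leaves += gatherings * int(rng.findtext("leaves"))

        stated_leaves = int(collation.findtext("totalLeaves"))
        if implied_leaves != stated_leaves:
            print("Leaf count mismatch: formula implies {}, stated {}".format(
                implied_leaves, stated_leaves))

        implied_pages = 0
        for pr in collation.iter("pageRange"):
            if pr.get("range") == "true" and pr.get("numbered") == "false":
                # e.g. [8]: <start> holds the count of unnumbered pages
                # (my reading of the attributes in the excerpt above).
                implied_pages += int(pr.findtext("start"))
            else:
                start = int(pr.findtext("start"))
                end = int(pr.findtext("end") or start)
                implied_pages += end - start + 1
        if implied_pages != 2 * stated_leaves:
            print("Page count mismatch: pagination implies {}, expected {}".format(
                implied_pages, 2 * stated_leaves))

    # check_counts(ET.parse("whist3.xml").getroot())
    # For Whist.3: 4 x 12 + 1 x 4 = 52 leaves; 8 + 1 + 95 = 104 pages. Check.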

Much more validation is possible. For example, I give references in terms of both signature and page, such as A5v–E4r (2–95) for a range or E4v (96) for a page. Once we are certain that the collation formula and pagination statements are consistent, we know the page number for each leaf. I was able to write a program that verifies that leaf A5v is page 2, E4r is page 95, and E4v is page 96. It is no exaggeration to say that the program has detected hundreds of errors. Perhaps thousands. By entering both the signature and page reference, I have to make two errors that are consistent with one another before mistakes of reference appear in the bibliography.
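
The leaf-to-page mapping can be sketched the same way, by expanding the collation formula into an ordered list of leaves. The sketch below hard-codes one assumption that happens to hold for Whist.3, namely that the printed numbering begins immediately after the first bracketed unnumbered range; the real check has to be more general.

    import re
    import xml.etree.ElementTree as ET

    ALPHABET = "ABCDEFGHIKLMNOPQRSTVXYZ"   # as in the sketch above

    def leaf_sequence(collation):
        """Expand the collation formula into an ordered list of leaves:
        ['A1', 'A2', ..., 'E4'] for Whist.3."""
        leaves = []
        for rng in collation.iter("gatheringRange"):
            start = rng.findtext("sigStart")
            end = rng.findtext("sigEnd") or start
            per_gathering = int(rng.findtext("leaves"))
            for sig in ALPHABET[ALPHABET.index(start):ALPHABET.index(end) + 1]:
                leaves.extend(sig + str(n) for n in range(1, per_gathering + 1))
        return leaves

    def check_reference(collation, sig_ref, stated_page):
        """Flag a signature/page reference such as E4r (95) if the two parts
        do not point to the same page."""
        leaves = leaf_sequence(collation)
        m = re.fullmatch(r"([A-Z]+\d+)([rv])", sig_ref)
        index = leaves.index(m.group(1)) + 1                    # 1-based leaf number
        position = 2 * index - (1 if m.group(2) == "r" else 0)  # page position in book
        # Pages before the printed numbering: the first [8] in the pagination.
        preliminaries = int(collation.find("pagination/pageRanges/pageRange")
                            .findtext("start"))
        if position - preliminaries != stated_page:
            print("Reference error: {} is page {}, not {}".format(
                sig_ref, position - preliminaries, stated_page))

    # collation = ET.parse("whist3.xml").getroot()
    # check_reference(collation, "A5v", 2)   -> silent
    # check_reference(collation, "E4r", 96)  -> flagged: E4r is page 95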

I only wish there were a similar way to validate quasi-facsimile transcription!

XML works with another language called XSLT (Extensible Stylesheet Language Transformations) to render XML in other formats such as text or HTML. It is an XSLT stylesheet that transforms the collation as expressed in XML into Bowers format. All the “knowledge” of Bowers' rules is in one program. As a result, when entering the collation for a book, I do not have to type brackets, italics, or superscripts—a major time saving for data entry.
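
For readers who do not know XSLT, here is a rough Python analogue of one small piece of the transformation, shown only to make the point that the formatting knowledge lives in a single place outside the data. It is not my stylesheet, and it emits plain text, leaving superscripts and brackets to the real output:

    def bowers_gatherings(collation):
        """Render the gatherings portion of a Bowers-style formula,
        e.g. 'A-D12 E4' (plain text; the real output adds superscripts)."""
        parts = []
        for rng in collation.iter("gatheringRange"):
            start = rng.findtext("sigStart")
            end = rng.findtext("sigEnd")
            sig = (start + "\u2013" + end) if end else start   # en dash for a range
            parts.append(sig + rng.findtext("leaves"))
        return " ".join(parts)

    # bowers_gatherings(collation) -> 'A–D12 E4'

Because the data never contains formatting, data entry and presentation can never drift out of step.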

An amusing example demonstrates the strength of the approach. At Rare Book School, I learned to describe signing errors by saying “missigning B4 as B5”. Bowers prefers “misprinting B4 as B5” and has no objection to quoting the erroneous signature, writing “misprinting B4 as ‘B5’” (see Bowers, p. 270). Whichever form is preferred, I can change the output for all 170 book descriptions by making a minor change to one XSLT stylesheet, rather than editing each description individually. Neat!

A third goal was to automate the production of indices for the bibliography. The top-level index classifies works as (a) separate works; (b) publishers’ collections of works published separately; and (c) collected editions. It is produced programmatically. Other programs produce other indices:
  • An index of short titles and short imprints
  • A chronological list of all editions and issues
  • A list of games and subjects treated in Hoyle with a chronological list of books for each game or subject
  • An index by publisher or printer
  • A list of institutions holding copies of Hoyle (see here, for example, for libraries in the British Isles) and the books held at a given library (for example, the Bodleian, which has the largest collection of Hoyles in the world)
  • Lists of Hoyles in each of the standard gaming bibliographies, such as Horr, Jessel, and Rather and Goldwater.
Each time I add a new bibliographical description, I can regenerate all of the indices and indeed the entire website by running one program.
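
The regeneration itself is unremarkable: a program walks the description files, pulls out what each index needs, and writes a page. A sketch, with an invented directory name and invented element names for the date and short title (my actual markup differs):

    import glob
    import xml.etree.ElementTree as ET

    # Directory and element names here are illustrative only.
    editions = []
    for path in sorted(glob.glob("descriptions/*.xml")):
        root = ET.parse(path).getroot()
        editions.append((root.findtext(".//date"), root.findtext(".//shortTitle"), path))

    # Write a simple chronological index as an HTML list.
    with open("chronological.html", "w", encoding="utf-8") as out:
        out.write("<ul>\n")
        for date, title, path in sorted(editions):
            out.write('  <li>{}: <a href="{}">{}</a></li>\n'.format(date, path, title))
        out.write("</ul>\n")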

The final goal is perhaps the most ambitious—to develop a platform that other bibliographers can use. I have no intention of turning the technology into a commercial product, and have built it with laser-like focus on my needs rather than as a general solution to bibliographical description.  I would expect, however, that hobbyist programmers familiar with the technologies I used should have little difficulty extending it to their needs. I would be eager to hear from anyone who is interested.

Afterword: A Note on Technology

I initially explored the Text Encoding Initiative (TEI) as a way to encode book descriptions. I found that, as the name suggests, the standards were focused on encoding text and other contents, rather than encoding characteristics of the physical book. A TEI Work Group on physical bibliography made a good start at encoding a collation statement, but their work did not proceed to completion and did not become part of a TEI release.  I used theirs as a starting point for my own work. 

I am using early and well-supported versions of the XML family of standards: XML 1.0, XML Schema 1.0, and XSLT 1.0 (including XPath 1.0). While there are some attractions to using later versions, they are not always supported by browsers, and I wanted the web version to work with Firefox, Chrome, Internet Explorer, and Safari, not all of which support more recent versions of XSLT and XPath.

I use oXygen XML Editor 17 as an XML development environment. It is an awesome tool and I fear that I am only using a fraction of its capabilities.

Where I need to ensure consistency of the book descriptions beyond what XML Schema provides, I write programs in Python 3.4. Python programs also generate the various indices I described earlier. Python is a general-purpose programming language that excels at handling text and has excellent libraries for reading and writing XML files.

2 comments:

  1. Hi David,

    If you still wanted to use the TEI then you should know that, unlike other standards, it is possible to extend it to cope with things it hasn't dealt with yet. This would mean you'd be able to add your elements to TEI files and still have them validate. You could then also use this as a proposal to the TEI Technical Council for inclusion in the Guidelines. As you've seen there is a general desire to extend things to cope with physical bibliography better, but the people in that workgroup seem to have got busy with other things and disbanded. I'd be happy to work with you on a TEI customization which extends it to cover collation formulas... maybe we could get some interest from the TEI-MS mailing list (since I note your post to TEI-L in 2014 didn't get lots of attention...but that may have just been a busy time of year).

    -James Cummings

  2. James, did you get my email reply? Best, David
