Metadata
The addition of metadata to the contents of ncse was challenging. Not only were there around 100,000 pages of content to which we had to add information, but this content was structured in a complex hierarchy, creating tiers of metadata at edition, title, volume, department and article level. This page gives an account of how we addressed this challenge, as well as full details of the resulting ncse metadata schema.
ncse Metadata Schema
The existing ncse metadata schema is as follows:
Field | Description | Method | Applies to |
---|---|---|---|
Uniform periodical title | This describes the label given to the whole of a title, regardless of changes over the run. | hand | issue > department > item |
Actual periodical title | Gives the title as it appears on the masthead. Equivalent to Dublin Core ‘Title.’ | hand | issue > department > item |
Date | Gives the date of publication (as far as known). Equivalent to Dublin Core ‘Date.’ | hand | issue > department > item |
Source | Labels content as part of ncse. Equivalent to Dublin Core ‘Source.’ | hand | issue > department > item |
Volume number | Gives the volume and series number of the issue | hand | issue > department > item |
Issue number | Gives the number of the issue | hand | issue > department > item |
Edition | Labels the issue as being a particular edition (town, country, 1-9) | hand | issue > department > item |
Number of editions | Gives the number of editions of an issue. | hand | issue > department > item |
Image description | Gives a number of keywords that describe what the image contains. Drawn from DMVI schema. | hand | item |
Price | Gives the price of an issue if known. | hand | issue > department > item |
Bibliographic location | Labels an article as appearing in the wrapper or in the issue itself. | hand | department > item |
Size | Gives the paper size of an issue. Equivalent to Dublin Core ‘Size.’ | hand | issue > department > item |
Editor | Labels issues with the name of their editor. | hand | not used |
Publisher | Labels issues with the name of their publisher. | hand | not used |
Department genre | Labels departments with a genre descriptor. | hand / text mining | not used |
Department title | Labels departments with a title. | hand | not used |
Page number | Provides a page label that corresponds to the number printed on the page. | hand / script | page |
Persons | Identifies all the people mentioned in the edition | named entity extraction | item |
Institutions | Identifies all the institutions mentioned in the edition | named entity extraction | item |
Places | Identifies all the places mentioned in the edition | named entity extraction | item |
Genre | Marks up items that correspond to a list of predefined genres | text mining | item |
Events | Marks up items that correspond to a list of predefined events | text mining | item |
Subject | Labels items with subject keywords. Drawn from USAS. | USAS semantic tagger | item |
When making the decision whether or not to include multiple editions (for information on this decision click here), we also explored a variety of strategies to automate as much of the metadata entry as possible. There were three moments when metadata was added to content, and at each moment a combination of human and automated input was employed. Where a field was not used, this was usually because experiments proved unsuccessful or we did not have the time to do the necessary research. Full details of these methods are below; however, one of the strategies we implemented in order to save labour is relevant here. In order to reduce the amount of manual data entry we designed the facility for metadata to be inherited from issue to department and then to item level. In the schema above, you can see the ways in which we took advantage of this so as not to have to fill the same field for each item within a department or issue.
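The inheritance described above can be illustrated with a minimal sketch. This is not the ncse or Olive implementation: the `Node` class, the `resolve` method and all field values are hypothetical, chosen only to show how a value entered once at issue level can be read from any department or item beneath it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One level of the hierarchy: an issue, a department, or an item (hypothetical)."""
    level: str
    metadata: dict = field(default_factory=dict)
    parent: Optional["Node"] = None

    def resolve(self, name: str) -> Optional[str]:
        """Return the nearest value for `name`, looking at this node and then its parents."""
        node = self
        while node is not None:
            if name in node.metadata:
                return node.metadata[name]
            node = node.parent
        return None


# Illustrative values only; they are not taken from the edition.
issue = Node("issue", {"Date": "1850-01-05", "Price": "6d"})
department = Node("department", {"Bibliographic location": "issue"}, parent=issue)
item = Node("item", {"Image description": "portrait"}, parent=department)

print(item.resolve("Date"))                    # entered once, at issue level
print(item.resolve("Bibliographic location"))  # entered at department level
print(item.resolve("Editor"))                  # never entered: None
```

Looking values up through the parent chain, rather than copying them down, means a correction made at issue level is seen by every item below it. Whether ncse copied values or resolved them at query time is not stated here, so treat that choice as an assumption of the sketch.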
History of the ncse metadata schema
In the early stages of the project we undertook a survey of the material to try to understand what possible data categories we could identify in serials and how they related to each other. The result was a very complex diagram that can be downloaded here. Even given our initial estimates of the scale of ncse we recognized that this was an unreasonable amount of data and set about creating a more manageable schema.
Our early designs broke the diagram down into a number of areas: bibliographic metadata; structural metadata; generic metadata; advanced metadata; and concept mapping:
Bibliographic metadata
This applied both to the serials and the digital resource. Fields included ‘title’ (of article), ‘creator’ (of article or digital resource), ‘date’ (of article or digital resource), ‘publisher’ (name and place of publisher of article), ‘printer’ (name and place of printer of article), ‘editor’ (of article or digital resource), ‘pagination’ (span of the article), ‘price’ (of the issue), ‘creator’ (of digital resource), ‘format’ (of digital resource), ‘origin’ (of digital resource, i.e. ncse). In creating these fields we referred to the Dublin Core schema in an attempt to ensure compatibility.
Structural metadata
The structural metadata fields were designed to indicate where a particular item occurred in the edition and what its relationships to other constituent parts were. Fields included ‘given title’ (of whole run), ‘actual title’ (i.e. that printed on the masthead of a single issue), ‘series number’, ‘volume number’, ‘edition’ (intended to label which edition a particular component occurred within), ‘prelims / numbers’ (designed to distinguish between items that appeared on front matter and those within issues themselves), ‘department’, ‘item’.
Formal metadata
This category was principally designed to accommodate images. Initially we were simply concerned to specify a field that would mark up images, expecting that, when combined with bibliographic and advanced metadata, it would give a fuller description.
Generic metadata
This was a single field that would label each item with a genre such as advertisements, obituaries, correspondence, leading articles, news etc. We kept this field and used it to explore text mining techniques for metadata entry.
Advanced Metadata
Advanced metadata referred to those categories that described the content of articles. Initially we conceived these fields as being a form of index, populated by the content of the periodicals. The categories were ‘people’, ‘places’, ‘events’, ‘objects’, ‘publications’, and ‘institutions.’ Although we attempted some experiments to see how long these indexes would take to create by hand, our decision to include multiple editions and so edit an edition of c. 100,000 pages rendered this impractical. We pursued the advanced metadata categories through other means, however, as you can read in named entity extraction, text mining and semantic tagging.
Concept Mapping
Concept mapping was intended to map thematic concepts across different types of content in the edition. For a description of our work in this area click here. For an account of how we actually implemented thematic metadata go to named entity extraction, text mining and semantic tagging.
This metadata system was fairly loosely conceived as we designed it alongside experiments in segmentation. Without a clear idea of how we were going to produce digital copy, the form this would take, and the way different components were linked to each other, we could not design a metadata schema or the means for implementing it.
As we began to progress with the preparation of content and the segmentation, we also began to refine the metadata schema. Over the course of Autumn 2006 we developed it into a form that more closely resembles the one given above. To download this earlier schema, click here; to see a visual representation of it click here. As you may note, we had already separated the advanced metadata categories out, and had begun to think carefully about the values that would appear in each field. This was especially problematic for subject and image. At this stage we were not sure how we could classify the content for each category and were exploring various existing schemes, as well as our own concept maps, in order to decide on a strategy that was suitable for the project’s requirements. Accounts of how we developed subject metadata can be read here and image metadata here.
Adding metadata during segmentation
Once we established the methodology for the production of ncse, we began to analyze the points at which we could implement our metadata schema. Having a more developed workflow allowed us to begin to conceptualize which elements of the structure would be encoded into the xml, and what relationships needed to be labelled with metadata. The first place where we could begin to add metadata was when working with the segmented pdfs to amend the segmentation applied by Olive Software. At this stage we were mainly working to correct the content: making sure the right pages were bound into the right issues; that items were correctly distinguished from each other; and that departments were correctly labelled. The Olive plugin for Adobe Acrobat permits the addition of metadata but, rather than add all the metadata at this stage, we simply used the plugin to make any changes to the page numbers that were allocated to the pages of each pdf document. For more information about page numbers, click here.
The Olive Administrator Application is a web application that allows you both to organize the content of data repositories and to add metadata to parts of it. While the pdfs were being segmented, output by Olive in Israel, and the resulting data then being uploaded onto the server at King’s, we went over the content in order to resolve any outstanding values that needed to be finalized. These lists of values were then loaded into the Olive Administrator Application, allowing our editorial assistants to insert metadata into the xml through a relatively easy-to-use interface and reducing the amount of metadata entered without a controlled vocabulary. We conceived of the metadata task as two sweeps: one for bibliographic metadata and one for image metadata. We began adding metadata at the end of November 2007, and it took a team of six part-time editorial assistants until March 2008 to complete. For more information about the generation of the vocabulary for the addition of image metadata, click here.
The fields completed at this stage were ‘Uniform periodical title’, ‘Actual periodical title’, ‘Date’, ‘Source’, ‘Volume number’, ‘Issue number’, ‘Edition’, ‘Number of editions’, ‘Image description’, ‘Price’, ‘Bibliographic location’, ‘Size.’ As you can see from the schema above and from Viewpoint, this is the bulk of the metadata in ncse, and all the metadata entered by hand. These fields attach important bibliographic metadata to all items within the edition that labels them as to their relationship with each other, allowing complex searches across the edition and the production of bibliographic citations in search results.
Creating advanced metadata through named entity extraction, text mining and semantic tagging
The fields that remained were those that started out as advanced metadata and concept mapping. For an account of concept mapping click here. Of the advanced metadata categories (‘people’, ‘places’, ‘events’, ‘objects’, ‘publications’, and ‘institutions’) we had selected ‘people’ (as ‘persons’), ‘places’, ‘institutions’, ‘genre’, ‘events’, and ‘subject’ to pursue. We used GATE to extract a list of proper nouns, which we sorted using a combination of sources including the indices from John North’s Waterloo Directory of English Periodicals and Newspapers, 1800-1900. On the basis of this work we were confident of producing indexes of persons, places and institutions. We attempted to use text mining techniques to see if we could find a way of marking up ‘events’ and ‘genre’, but were unable to obtain results that could be applied across the edition. For ‘subjects’ we used UCREL’s (University Centre for Corpus Research on Language) USAS (UCREL Semantic Analysis System) tagger to provide semantic tags for individual articles, which we then refined to present to users. For a fuller account of this research click here.
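The sorting step mentioned above, in which extracted proper nouns were checked against reference sources, might be sketched along the following lines. This is an assumption-laden illustration rather than the project's actual method: it does not call GATE, the `sort_proper_nouns` function is hypothetical, and the index entries are invented for the example rather than drawn from the Waterloo Directory or the edition.

```python
def sort_proper_nouns(candidates, indexes):
    """Assign each candidate to the first category whose index lists it.

    `indexes` maps a category name to a set of lower-cased reference names.
    Candidates found in no index are returned separately for manual review.
    """
    sorted_nouns = {category: [] for category in indexes}
    unmatched = []
    for noun in candidates:
        key = noun.casefold()
        for category, names in indexes.items():
            if key in names:
                sorted_nouns[category].append(noun)
                break
        else:
            # No index contained this noun.
            unmatched.append(noun)
    return sorted_nouns, unmatched


# Invented reference lists and candidates, for illustration only.
indexes = {
    "persons": {"harriet martineau"},
    "places": {"manchester"},
    "institutions": {"house of commons"},
}
candidates = ["Harriet Martineau", "Manchester", "House of Commons", "Leader"]

sorted_nouns, unmatched = sort_proper_nouns(candidates, indexes)
print(sorted_nouns)
print(unmatched)
```

Exact matching on a case-folded string is the simplest possible rule. Real periodical text would need to handle variant spellings, abbreviations and names that belong to more than one category, which this sketch leaves to manual review.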