ncse is one of the first editorial projects to edit a corpus of periodicals and newspapers of this size so closely. Editing serials is complex due to their mutability over the course of their run and the hierarchical nature of the archive that results. In order to conceptualize such a diversity of form, we came up with a systematic vocabulary that we have used throughout the resource. Below is a glossary that provides definitions of the terms that we employ.

 

Concept map

The original project brief specified concept maps as a way of tracing thematic information across the edition. We constructed a concept map for the edition in a series of seminars throughout 2005 with a view to it providing the mechanism for grouping content by concept. To read more about this process click here. As the map had been constructed from the texts up, we did not feel confident about its ability to capture all of the content in ncse. This, combined with the fact that the map required human intervention to link concepts and texts, meant that we were not able to apply it to the edition. Instead, we used a semantic tagger to produce a keyword subject index.

 

Issue

By issue we usually refer to the portions of a serial that are issued regularly under the same masthead with a date and a number. Although issues have beginnings and endings (they are usually constrained to a predetermined amount of paper) they often have content that looks back over previous issues or forward to the next one. This continuity between issues is important for producing a recognizable identity, and is also established through the repetition of the basic structure of the issue and certain visual features. When we analyzed the potential content of ncse, issues were relatively easy to identify. However, supplements, front matter, wrappers, and other types of content also had to be accommodated. Olive Software conceives of serials as containing a series of documents, with no discrimination between issues and supplements. This meant that, when creating content from single page images, we had to treat this extra material like the regular issues, eventually discriminating between them at a structural level.

Items and Departments

An item is the smallest editorial unit in ncse. It corresponds in most cases to a single article, but also includes any other component on the page. A department is a section of an issue that groups certain articles together. Departments are similar to what might be called 'sections' today: for instance, the Leader's 'Portfolio' department included fiction, poetry and essays; its 'Literature' department contained reviews of recent publications.

 

Named entity extraction

This is a process carried out to obtain a list of all the proper nouns within the textual transcript of ncse. We used GATE to identify and then discriminate between personal names, places and institutions within our corpus. These lists were also evaluated against other extant authority lists, and were then used to populate the ‘Names’, ‘Institutions’ and ‘Places’ metadata fields in the keywords section of the resource.
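The evaluation of extracted names against an authority list can be illustrated with a minimal sketch. It is not the project's actual procedure and does not use GATE: the function names, the normalisation rule (lower-casing and collapsing whitespace) and the sample names are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def normalise(name: str) -> str:
    """Lower-case a name and collapse runs of whitespace (an assumed rule)."""
    return " ".join(name.lower().split())


def check_against_authority(
    extracted: Iterable[str], authority: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split extracted names into those found in the authority list and the rest."""
    known = {normalise(entry) for entry in authority}
    matched: List[str] = []
    unmatched: List[str] = []
    for name in extracted:
        if normalise(name) in known:
            matched.append(name)
        else:
            unmatched.append(name)
    return matched, unmatched


# Hypothetical sample data, not taken from the ncse corpus.
matched, unmatched = check_against_authority(
    ["Leeds", "london", "Nowhere  Town"],
    ["Leeds", "London"],
)
```

In this sketch the unmatched names would be the ones left for a human editor to review before they reach a metadata field.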

Number

This term was a common one in the nineteenth century for referring to a single issue. Although we have tried to use nineteenth-century terminology where relevant, we have tried to avoid this term in order to prevent confusion. This was particularly the case in the metadata, where ‘number’ referred to the number of that particular issue. We used ‘issue number’ instead.

Run

A run is a series of issues. However, this is complicated by the existence of content other than regular dated and numbered issues. As we recognize that serial texts constitute more than these regular issues, when we refer to a run we generally include all such supplementary material within it. The term is also complicated by the way the identity of serials changes over the course of their publication. It is often difficult to decide when a particular title becomes something else. Equally, because serials might start up again at any moment, no serial publication ever really comes to a certain end. These factors make the concept of a run a provisional one, and that is often how we use it in ncse: either to refer to the totality of content grouped together as a particular title, or as a specific period in its life.

Segmentation

Segmentation refers to the process of discriminating between items on the page. This allows individual items to be associated with a set of coordinates on the page image. The team in London worked closely with Olive Software to explore the extent to which this could be done using software. Although we found it was possible to obtain some good results, the preparation and evaluation of the process were too time-consuming, particularly when applied to material that changes over time like nineteenth-century serials. Ultimately, Olive processed the PDFs using segmentation policies prepared by us. We then went over the segmented PDFs and made sure that all items were correctly delineated and that any items that marked the beginning of a headline were identified.
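The idea of associating each item with coordinates on the page image can be sketched as a simple data structure. This is an illustration only, not the format used by Olive Software or by ncse: the class names, the pixel-based rectangle with a top-left origin, and the overlap check are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Region:
    """Rectangle on a page image in pixels, origin at top-left (assumed)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        return self.left <= x < self.right and self.top <= y < self.bottom

    def overlaps(self, other: "Region") -> bool:
        return (self.left < other.right and other.left < self.right
                and self.top < other.bottom and other.top < self.bottom)


@dataclass(frozen=True)
class Item:
    """A segmented item tied to one region of one page (hypothetical model)."""
    item_id: str
    page: int
    region: Region


def item_at(items: List[Item], page: int, x: int, y: int) -> Optional[Item]:
    """Return the first item whose region covers the point, if any."""
    for item in items:
        if item.page == page and item.region.contains(x, y):
            return item
    return None


def overlapping_pairs(items: List[Item]) -> List[Tuple[str, str]]:
    """List pairs of items on the same page whose regions overlap."""
    pairs = []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if a.page == b.page and a.region.overlaps(b.region):
                pairs.append((a.item_id, b.item_id))
    return pairs
```

A check such as `overlapping_pairs` is one way a reviewer might flag items that were not cleanly delineated, though the manual review described above involved editorial judgement that a sketch like this does not capture.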

 

Semantic tagger

Once we had decided to include the multiple editions in ncse, it became apparent that we did not have the resources to enter metadata at item level. Our concept map required human interpretation of the text and the decision to expand the edition made this methodology impractical. Instead, we ran the USAS (UCREL Semantic Analysis System) tagger developed by UCREL (University Centre for Computer Corpus Research on Language). This provided sets of semantic tags for every article within ncse. After evaluation, these were used to provide subject metadata values in the keywords section of the resource. To read more about this, click here.

Serials, periodicals and newspapers

A serial is any publication that is published in parts. It thus includes parts that result in a book as well as periodicals and newspapers. However, it is important to recognize that there are different genres of serials. A periodical is usually open-ended, published until it becomes impractical to continue doing so, whereas a book issued in parts usually has a pre-defined end-point, whether because it has been written in advance or has to fill a set number of instalments. Equally, a newspaper, whether issued daily or weekly, is different from a periodical: whereas newspapers are focused around a very delimited notion of the present and are designed to be superseded once that moment has passed, a periodical – despite also being predicated on the notion of the moment – tends to provide apparatus that is oriented to its continuing relevance in the future. Although both newspapers and periodicals are date-stamped, feature regular departments and foster links between present and past numbers, periodicals offer themselves as having relevance beyond the moment of reading. Of course, in practice many readers discarded periodicals after reading them just as they would a newspaper; equally, as many newspapers were considered documents of record they were often bound and preserved for future reference. The important point is that both terms exist as points of reference in print culture, and individual titles took aspects of each in order to configure themselves to the perceived demands of readers.

ncse includes one weekly that unmistakeably resembles a newspaper, the Northern Star, a six-columned broadsheet, with numerous multiple editions. Other weeklies in the edition, such as the Leader and the Tomahawk, are weekly periodicals that remain imbricated in the discourse of news (and political news at that), the main orientation of the newspaper press. Both titles are registered as newspapers, although they are folio in format, and contain other kinds of specialised matter, with the Leader including a political front and an arts back, and the Tomahawk being an illustrated satiric paper. In nineteenth-century parlance, they are class papers. As in these examples, some weeklies are safely categorised as newspapers, while others are hybrids, including elements of newspapers and more specialised class papers.

Supplement

A supplement is anything that is issued in addition to the regular dated and numbered issues. For more on supplements click here.

Text mining

Text mining refers to strategies that seek to identify features about a text by its linguistic constituents. We explored text mining methods for our advanced metadata (‘Names’, ‘Places’, ‘Institutions’, ‘Subjects’, ‘Genre’ and ‘Events’), and obtained good results for most of these categories. For ‘Names’, ‘Places’, and ‘Institutions’ we used named entity extraction; for ‘Subjects’ we used the USAS semantic tagger. We were not so successful with ‘Genre’ and ‘Events.’ Events are difficult as they are often discussed in texts in a variety of different ways and, the more abstract they are, the more difficult they are to recover. For ‘Genre’, we attempted to see whether certain genres could be identified by characteristic word use. To use these statistical methods requires the production of substantial amounts of training data: work that will continue for us after the project launch in May 2008. To read more about this, click here.

Title

‘Title’ refers to the newspaper or periodical as a whole. It is used to encapsulate the corporate identity of a publication over its run, and so includes any supplementary matter that may be part of it. We use this generic title irrespective of any title changes that the serial undergoes; what we call ‘actual periodical titles’ in the metadata are linked to specific, historically based issues.

Volume

A volume is the bound form of a serial. Ostensibly book-like, in hard covers, often with front matter, indices, and a continuous pagination sequence running from beginning to end, a volume is geared to permanence, unlike a single issue. Although it is easy to think of volumes as simply collections of individual issues, it is important to remember the changes these issues must undergo – not to mention any additional material that is added – for them to become volumes. The volume is therefore a different form of the serial, and reading issues in volume form is not the same as encountering them individually. For editorial projects such as ncse this poses significant problems. We wanted to record all the different levels of content within the edition, but were limited in that the majority of our material was in volume form. Not only did this over-privilege this particular manifestation of the serial, but the volumes that we had were not bound in a consistent fashion. For our editorial policy regarding this, click here.