Finding Aid for The Equity

This site was completed thanks to the George Garth Graham Undergraduate Digital History Research Fellowship.

Items you can find:

Jeff Blackadar
I welcome your feedback and can be reached at jeffblackadar (at) gmail (dot) com

Abstract

Bibliothèque et Archives Nationales du Québec has digitally scanned and converted to text a large collection of newspapers. The collection includes The Equity, published in Shawville, Quebec since 1883, and is a resource of tremendous potential value to historians. Unfortunately, the text files are difficult to search reliably because of the many errors introduced by optical character recognition (OCR). In addition, as a corpus of weekly newspapers spanning 1883-2010, The Equity contains a large amount of content for researchers to analyze.

This digital history project applied natural language processing in a program written in R and stored the results in a database, creating a new and useful index of this corpus of digitized content despite OCR-related errors. The program extracted the names of the person, location and organization entities that appeared in each edition. Each entity was cataloged in a database and linked to the editions of the newspaper in which it appeared. The database was then published on a public website so that other researchers can use it.
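
As a rough illustration of the cataloging step, the sketch below stores entities and the editions they appear in. It is not the project's code: the project used R, while this uses Python with SQLite, and the table names, the functions open_index and record_edition, the sample date and the personal name are all hypothetical. The entity extraction itself is assumed to have already happened, so the input is simply a list of (name, type) pairs.

```python
import sqlite3

# Hypothetical schema: one row per edition, one row per distinct entity,
# and a link table recording which entity appeared in which edition.
SCHEMA = """
CREATE TABLE IF NOT EXISTS editions (
    id INTEGER PRIMARY KEY,
    published TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS entities (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    kind TEXT NOT NULL,          -- 'person', 'location' or 'organization'
    UNIQUE (name, kind)
);
CREATE TABLE IF NOT EXISTS entity_editions (
    entity_id INTEGER NOT NULL REFERENCES entities(id),
    edition_id INTEGER NOT NULL REFERENCES editions(id),
    PRIMARY KEY (entity_id, edition_id)
);
"""


def open_index(path=":memory:"):
    """Open (or create) the index database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_edition(conn, published, found):
    """Store one edition and the (name, kind) pairs extracted from its text."""
    conn.execute(
        "INSERT OR IGNORE INTO editions (published) VALUES (?)", (published,)
    )
    edition_id = conn.execute(
        "SELECT id FROM editions WHERE published = ?", (published,)
    ).fetchone()[0]
    # set() drops repeats within one edition; OR IGNORE drops repeats across editions.
    for name, kind in set(found):
        conn.execute(
            "INSERT OR IGNORE INTO entities (name, kind) VALUES (?, ?)",
            (name, kind),
        )
        entity_id = conn.execute(
            "SELECT id FROM entities WHERE name = ? AND kind = ?", (name, kind)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO entity_editions (entity_id, edition_id) "
            "VALUES (?, ?)",
            (entity_id, edition_id),
        )
    conn.commit()
    return edition_id


if __name__ == "__main__":
    # Made-up sample input standing in for the output of the NLP step.
    index = open_index()
    record_edition(
        index,
        "1900-01-04",
        [("Shawville", "location"), ("John Smith", "person")],
    )
    print(index.execute("SELECT name, kind FROM entities ORDER BY name").fetchall())
```

Keeping entities and editions in separate tables joined by a link table means each entity is stored once, however many editions it appears in.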

The resulting index, or finding aid, gives researchers a way to access The Equity beyond full-text searching. People, locations and organizations appearing in The Equity are listed on the website, and each one links to a page listing all of the issues in which that entity appeared, as well as other entities that may be related to it. Entities misspelled because of OCR errors are listed alongside correctly spelled entities so that researchers can take them into account.
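
The two lookups behind an entity page can be sketched as queries. This is an assumption-laden illustration in Python with SQLite, not the website's actual code: the single mentions table, the functions make_demo_index, editions_for and related_entities, and every sample row are hypothetical. In particular, treating entities as "related" when they appear in the same edition is only one plausible reading of how related entities might be found.

```python
import sqlite3


def make_demo_index():
    """Build a tiny in-memory index from made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mentions ("
        " name TEXT NOT NULL, kind TEXT NOT NULL, edition TEXT NOT NULL,"
        " PRIMARY KEY (name, kind, edition))"
    )
    conn.executemany(
        "INSERT INTO mentions (name, kind, edition) VALUES (?, ?, ?)",
        [
            ("Shawville", "location", "1900-01-04"),
            ("John Smith", "person", "1900-01-04"),
            ("Shawville", "location", "1900-01-11"),
            ("John Smith", "person", "1900-01-11"),
            ("Bristol", "location", "1900-01-11"),
            # An OCR-style misspelling stays in the index as its own entry.
            ("Shawvillc", "location", "1900-01-18"),
        ],
    )
    conn.commit()
    return conn


def editions_for(conn, name):
    """List every edition in which the named entity appears, oldest first."""
    rows = conn.execute(
        "SELECT DISTINCT edition FROM mentions WHERE name = ? ORDER BY edition",
        (name,),
    )
    return [row[0] for row in rows]


def related_entities(conn, name, limit=10):
    """Rank other entities by how many editions they share with the named one."""
    rows = conn.execute(
        """
        SELECT other.name, other.kind, COUNT(DISTINCT other.edition) AS shared
        FROM mentions AS target
        JOIN mentions AS other
          ON other.edition = target.edition AND other.name <> target.name
        WHERE target.name = ?
        GROUP BY other.name, other.kind
        ORDER BY shared DESC, other.name
        LIMIT ?
        """,
        (name, limit),
    )
    return rows.fetchall()


if __name__ == "__main__":
    demo = make_demo_index()
    print(editions_for(demo, "Shawville"))
    print(related_entities(demo, "Shawville"))
```

Because the misspelled variant is kept as a separate entry, a researcher who looks it up still reaches the editions where the OCR text was damaged.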

Rendering the text files of each scanned newspaper into entities and indexing them in a database lets researchers interact with the content of the newspaper by entity name and type, rather than only as a set of large text files.

Github repository for this project.
Blog entries about this project.